Proceedings of the 2008 International Conference on Cognitive Systems, University of Karlsruhe, Karlsruhe, Germany, April 2-4, 2008

Physiologically-Inspired Model for the Visual Tuning Properties of Mirror Neurons

Falk Fleischer, Antonino Casile, Martin A. Giese

This work was supported by the DFG, HFSP, EC FP6 project COBOL and the Hermann Lilly Schilling Stiftung. All authors are with the Hertie Institute for Clinical Brain Research, Tübingen, Germany, [email protected], {antonino.casile,martin.giese}@uni-tuebingen.de. M. Giese is also with the School of Psychology, University of Bangor, UK.

Abstract— Mirror neurons are a class of neurons that have been found in the premotor cortex of monkeys and that are active both during motor planning and during the visual observation of actions. These neurons have recently received a vast amount of interest in cognitive neuroscience and robotics, and they have been discussed as a potential basis for imitation learning and the understanding of actions. However, their visual tuning properties are only poorly understood. Most existing models assume that the tuning properties of mirror neurons are based on a reconstruction of the three-dimensional structure of action and object, a computationally difficult problem. In line with a broad body of work on object recognition, we present a model that explains visual properties of mirror neurons without this requirement. The proposed model is based on a small number of physiologically well-established principles. In addition, it postulates novel neural mechanisms for the integration of information about object and effector movement, which can be tested in electrophysiological experiments.

I. INTRODUCTION

Mirror neurons are a class of neurons that were first described in the premotor cortex of monkeys. These neurons respond both when the animal prepares motor actions and when it perceives motor actions executed by other monkeys or humans [1]. Recently, mirror neurons have received a vast amount of interest in cognitive neuroscience, and also in robotics, since they have been discussed as a physiological basis of the imitation learning of actions and potentially also of action understanding [2], [3]. Beyond the fact that they are active during motor planning, mirror neurons have a number of interesting visual tuning properties. They are selective for subtle differences between actions, such as power vs. precision grip. At the same time, they are highly invariant against the position of the action in the visual field, and partially even against the view of the action. However, their response depends critically on the correct spatial arrangement of the effector and the goal object, and usually they respond only to functionally effective actions. In addition, mirror neurons typically fail to respond to mimicked actions without a goal object [4], [5].

Many existing models of the mirror neuron system assume a reconstruction of the three-dimensional structure of goal object and effector motion by the visual system. Recognition within such three-dimensional representations has been explained by dynamic predictive motor models that simulate the action in synchrony with the visual stimulus [6], [7], [8], [9], [10], [11], [3]. However, the reconstruction of three-dimensional structure, in particular from monocular image sequences, is a difficult computational problem. A large body of results on the recognition of static shapes suggests that the visual system might not reconstruct the full 3D structure of recognized objects. Instead, it seems to base recognition on an integration of information extracted from two-dimensional views of objects [12], [13], [14], [15]. This raises the question whether the visual tuning properties of mirror neurons can be explained within a similar framework, without an accurate reconstruction of the three-dimensional scene geometry.

The aims of this paper are twofold: First, we try to develop a model for the visual tuning properties of mirror neurons that is physiologically plausible and that, at a later stage, can be compared to electrophysiological data. This makes it necessary that the model operates on real video sequences, which can also be presented to monkeys in electrophysiological experiments. Second, we try to devise a model that explains visual tuning properties of mirror neurons without the need for a 3D reconstruction of effector and object geometry, in order to test the computational feasibility of the recognition of goal-directed actions within a view-based framework.

In the following, we first present the model and its components. We then show some example simulations that reproduce typical properties of mirror neurons. Finally, we discuss implications and further extensions of this work.

II. MODEL

The developed model is based on principles that have previously been applied successfully to the modeling of object recognition [16], [17], [18], [19] and movement recognition [20]. An overview of the model architecture is shown in Figure 1. In the following, the individual components and principles are described in more detail. The architecture is based on three main components: (1) a hierarchical neural model for the recognition of goal objects and effector (hand) shapes from video frames, where the middle levels of this hierarchical model are optimized by feature learning; (2) a simple recurrent neural circuit that realizes temporal sequence selectivity for the recognition of effector movements; and (3) a physiologically plausible mechanism that combines the spatial information about the goal object with the posture, position and orientation of the effector. The highest level of this mechanism is formed by the model 'mirror neurons'.
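The interplay of the three components can be summarized in a compact sketch. All class names and the toy component behaviour below are illustrative assumptions, not the authors' implementation; only the multiplicative combination at the mirror-neuron level follows the text.

```python
# Minimal sketch of the three-component architecture; stand-in behaviour only.

class ShapeHierarchy:
    """(1) Hierarchical form pathway: maps a frame to object/hand view activities."""
    def analyze(self, frame):
        return frame["object"], frame["hand"]

class SequenceNetwork:
    """(2) Recurrent circuit: a leaky accumulator standing in for sequence selectivity."""
    def __init__(self):
        self.s = 0.0
    def step(self, hand_views):
        self.s += 0.1 * (max(hand_views) - self.s)  # smoothed motion-pattern activity
        return self.s

class IntegrationStage:
    """(3) Affordance mechanism: here it simply gates on object presence."""
    def step(self, obj_views, hand_views):
        return 1.0 if max(obj_views) > 0.5 else 0.0

class MirrorNeuronModel:
    def __init__(self):
        self.shape = ShapeHierarchy()
        self.seq = SequenceNetwork()
        self.integ = IntegrationStage()
    def respond(self, frames):
        out = []
        for f in frames:
            obj, hand = self.shape.analyze(f)
            s = self.seq.step(hand)            # action-type evidence
            a = self.integ.step(obj, hand)     # affordance evidence
            out.append(a * s)                  # multiplicative mirror-neuron combination
        return out
```

With this gating, a 'mimicked' sequence without a goal object yields zero model mirror-neuron output, regardless of the hand movement.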
A. Hierarchical recognition network for objects and effector shapes

The recognition of the shapes of the goal object and the effector (hand) is based on a hierarchical neural recognition model. Very similar models have been proposed to account for a variety of experimental results in object recognition [17] and motion recognition [20]. Each video frame is analyzed by a hierarchy of neural feature detectors. The complexity of the extracted features increases along the hierarchy. At the same time, the size of the receptive fields and the invariance of the detectors against position and scale changes also increase along the hierarchy. The individual computational steps are outlined below; we refer to [19] and [21] for further details. An overview of the model neurons is given in Table I.

1) Local orientation filters (areas V1/V2): Local orientations are extracted from the video frames using Gabor filters with 12 orientations and 3 different scales that are selective for different spatial positions. Signifying by G_k(u, v) a normalized Gabor filter with zero mean and sum of squares 1, the response x_k of a Gabor filter G_k to a patch of pixels P(u, v) from the input image is given by:

x_k = ⟨P, G_k⟩ / √(β + ⟨P, P⟩)    (1)

In this expression the scalar product is defined as ⟨f, g⟩ = ∑_{u,v} f(u, v) g*(u, v), and the positive constant β avoids division by zero. Consistent with [21], we assume a mutual inhibition between filters with the same spatial position and scale, but with different orientations: Let x_min and x_max be the minimum and the maximum of the responses over all orientations. Then a local threshold is defined by the expression η = x_min + b(x_max − x_min), where b is a positive constant that defines the strength of the inhibition. The effective output of the orientation filters is given by x̃_k = [[x_k − η]_+ − T]_+, where [x]_+ = max(x, 0) and T > 0 is a global threshold.
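A minimal NumPy sketch of this first stage, combining the contrast-normalized response of Eq. (1) with the orientation inhibition; the filter size and the parameter values (β, b, T) are illustrative assumptions:

```python
import numpy as np

def gabor_kernel(size, theta, wavelength, sigma):
    """Odd-phase Gabor patch, normalized to zero mean and unit sum of squares."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.sin(2 * np.pi * xr / wavelength)
    g -= g.mean()                      # zero mean
    return g / np.sqrt((g**2).sum())   # sum of squares = 1

def s1_responses(patch, kernels, beta=1.0, b=0.5, T=0.01):
    """Contrast-normalized filter responses (Eq. 1) with mutual orientation inhibition."""
    x = np.array([np.sum(patch * G) for G in kernels])
    x = x / np.sqrt(beta + np.sum(patch * patch))      # Eq. (1)
    eta = x.min() + b * (x.max() - x.min())            # local inhibition threshold
    return np.maximum(np.maximum(x - eta, 0) - T, 0)   # [[x_k - eta]_+ - T]_+
```

By construction, the orientation with the weakest response is always silenced by the inhibition, which sparsifies the orientation code.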
Responses of these orientation filters from the first layer of the model, with selectivity for the same orientation and scale within a limited spatial neighborhood, are pooled by a maximum operation in order to achieve partial position invariance [22], [17]. Model neurons that realize this pooling are similar to complex cells in primary visual cortex, which are characterized by a limited degree of position and scale invariance.

2) Detectors for intermediate form features (area V4): The output signals from the position-invariant orientation detectors in the previous layer, within a limited spatial region, are used to construct form features with increased complexity. The selectivity of the detectors for such intermediate-level features is established by learning (see below). Such intermediate features correspond, for example, to fragments of hands or objects (Figure 1).

Fig. 1. Overview of the model.

The tuning properties of these detectors are given by radial basis functions of the form

y_m = exp(−‖Y − U_m‖_F² / (2σ² N_m))    (2)

where the matrix Y signifies the responses from a patch (of the position grid) of the previous layer, and the matrix U_m is a template pattern that is established by learning. The integer N_m signifies the number of non-zero elements of the matrix U_m. Detectors for each particular feature, defined by U_m, are replicated for different spatial positions. As for the orientation detectors, the responses of detectors with the same feature selectivity but different position selectivity within a spatial neighborhood are pooled using a maximum operation. This defines a hierarchy layer with model neurons that detect optimized mid-level features with partial invariance against position changes.
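The radial-basis-function tuning of Eq. (2) and the subsequent maximum pooling can be sketched directly; the pooling window size and σ are illustrative assumptions:

```python
import numpy as np

def rbf_feature_response(Y, U, sigma=0.5):
    """Mid-level feature detector (Eq. 2): Gaussian RBF tuned to template U.
    N (number of non-zero template entries) normalizes templates of
    different effective size."""
    N = np.count_nonzero(U)
    d2 = np.sum((Y - U) ** 2)          # squared Frobenius distance
    return np.exp(-d2 / (2 * sigma**2 * N))

def max_pool(resp_map, pool=2):
    """Pool detector responses over a spatial neighborhood with a maximum
    operation to gain partial position invariance."""
    h, w = resp_map.shape
    h2, w2 = h // pool, w // pool
    blocks = resp_map[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
    return blocks.max(axis=(1, 3))
```

The response is maximal (equal to 1) when the input patch matches the template exactly, and falls off smoothly with distance.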
3) Detectors for complete object forms and hand shapes (area IT): The next hierarchy layer implements exactly the same computational functions as the layers described in the last section. In this case, the receptive field sizes are large enough to encompass whole objects and effector configurations. The resulting feature detectors respond selectively to views of objects and hands, being sensitive to size and orientation. This selectivity provides information that is critical for determining whether a grip is functional or dysfunctional for a particular object. The object and hand detectors were again modeled by radial basis functions, which were optimized by training Support Vector Machines [23] that classify one pattern (object or hand view) against all others. This step is not physiologically plausible and will be replaced later by physiologically inspired learning rules. Again, detectors with the same selectivity were realized for different spatial positions, and their responses were pooled with a maximum operation within a spatial neighborhood to achieve position invariance. However, in contrast to many object recognition models, the neurons on the highest level of our recognition hierarchy still have a coarse tuning for the positions of the object and the effector. This is consistent with neurophysiological data [13]. In addition, this property is necessary for extracting the relative positions of effector and object, which is crucial for distinguishing functional and dysfunctional actions.

TABLE I
PARAMETERS OF THE HIERARCHICAL MODEL FOR OBJECT AND EFFECTOR RECOGNITION

layer | # filter types | receptive field size (deg) | total # of neurons
1     | 36             | 0.63–1.09                  | 2332800
2     | 36             | 0.74–1.20                  | 259200
3     | 226            | 2.12–2.57                  | 1627200
4     | 226            | 2.29–2.75                  | 101700
5     | 17             | > 4.0                      | 7650
6     | 17             | > 4.0                      | 850

B. Learning of mid-level features

The templates U_m on the middle levels of the hierarchy are established by learning. As training data set, we used images that contain the relevant object or effector view. The first step of the learning process is a selection of the most strongly activated features, passed on from the previous hierarchy level, for each spatial position of the given training image. The result is a single dominant feature per position. This selection procedure could be implemented neurally by a winner-takes-all inhibition between the neurons encoding different features at the same position. New templates defining novel intermediate features are created using the following iterative procedure, which makes the creation of novel templates dependent on the performance of the existing ones for the given training image:

1) Create an initial 2D template feature U_k, which is derived from the response of the previous hierarchy level for a randomly chosen region and a random training example.
2) Get a new input patch X from a random example image.
3) Compute the responses y_m of all existing mid-level feature detectors according to Equation (2).
4) Only if the maximum of these responses is below a given threshold (T_L > 0.95), add a new template to the already existing features.
5) Repeat from step 2) until a sufficient number of features has been learned.

This algorithm implements a form of competitive online feature learning, which could potentially also be implemented by an appropriate online learning rule that recruits additional neurons for features that are not yet sufficiently well represented.

C. Temporal sequence selectivity (area STS)

To recognize effector movements it is not sufficient to detect individual keyframes; the action should be recognized only if these keyframes arise in the correct temporal order. To implement sequence-selective recognition we exploited a recurrent neural mechanism that has been proposed before in the context of biologically inspired models for movement recognition [20]. We assume that the outputs of the neural detectors for individual effector shapes for a specific action l, signified by z_k^l(t), provide input to snapshot neurons that encode the temporal order of the individual effector shapes. Selectivity for temporal order is achieved by introducing asymmetric lateral connections between these neurons. The dynamics of the resulting network is given by the equation

τ_r ṙ_k^l(t) = −r_k^l(t) + ∑_m w(k − m) [r_m^l(t)]_+ + z_k^l(t) − h_r    (3)

where h_r is a parameter that determines the resting level, and the parameter τ_r determines the time constant of the dynamics. The function w is an asymmetric interaction kernel that, in principle, can be learned efficiently by time-dependent Hebbian learning [24]. The responses of all snapshot neurons that encode the same action pattern are integrated by motion pattern neurons, which smooth the activity over time. Their response depends on the maximum of the activities r_k^l(t) of the corresponding snapshot neurons:

τ_s ṡ^l(t) = −s^l(t) + max_k [r_k^l(t)]_+ − h_s    (4)

The motion pattern neurons become active during specific movement sequences, e.g. for grasping with a precision or power grip. Such neurons have been found in the superior temporal sulcus of the monkey [25].

D. Mirror neurons: Integration of information from object and effector (areas AIP, PF, F5)

The highest levels of the model integrate the following signals about object and effector: (1) the type of the dynamic effector action, which is signalled by the motion pattern neurons, and (2) the spatial relationship between the moving effector and the goal object. The necessary information about the positions of the object and the effector is extracted from the highest level of the form hierarchy, which is not completely position-invariant and thus encodes these positions coarsely within a retinal frame of reference. In addition, the recognized effector view predicts a range of object positions that are suitable for an effective grip.
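The sequence-selective dynamics of Eqs. (3) and (4) can be simulated with a simple Euler scheme. The kernel shape, time constants and thresholds below are illustrative assumptions, not fitted values; the asymmetric kernel excites neurons ahead in the sequence and inhibits those behind, which is what makes the network direction-selective:

```python
import numpy as np

def asymmetric_kernel(d, a=1.0, sig=1.0, shift=1.5):
    """Asymmetric lateral interaction w(k - m): excitation ahead of the
    currently active snapshot, broad inhibition elsewhere."""
    return a * np.exp(-(d - shift) ** 2 / (2 * sig**2)) - 0.5 * a

def simulate(z_seq, tau_r=40.0, tau_s=80.0, h_r=0.1, h_s=0.05, dt=1.0):
    """Euler integration of the snapshot-neuron field (Eq. 3) and the
    motion-pattern neuron that pools it (Eq. 4).
    z_seq: array of shape (timesteps, K) with the snapshot-detector inputs."""
    K = z_seq.shape[1]
    d = np.subtract.outer(np.arange(K), np.arange(K))  # d[k, m] = k - m
    W = asymmetric_kernel(d)
    r = np.zeros(K)
    s, s_trace = 0.0, []
    for z in z_seq:
        r += dt / tau_r * (-r + W @ np.maximum(r, 0) + z - h_r)   # Eq. (3)
        s += dt / tau_s * (-s + np.maximum(r, 0).max() - h_s)     # Eq. (4)
        s_trace.append(s)
    return np.array(s_trace)
```

Playing the same snapshot sequence in reversed order turns the forward excitation into inhibition, so the motion-pattern response stays substantially lower, which is the sequence selectivity the model relies on.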
This permits deriving whether the effector action will likely be successful or not, depending on the object position. We postulate a simple physiological mechanism for the integration of these different pieces of information that is centrally based on a relative position map, which is constructed by pooling the output signals of the neurons encoding effector and object views. By pooling the signals of all neurons that represent views of objects at similar spatial positions, one can derive a population vector that provides a coarse estimate of the object position [26], [27]. More precisely, by pooling the activity of all object view neurons that represent objects close to the position (u, v) in the retinal frame of reference, one can derive a field of population activity a_O(u, v) that has a peak at the position (u_O, v_O) of the object. In the same way, one can derive an activity field a_E(u, v) that has a peak at the position (u_E, v_E) of the effector by pooling the signals of all effector view neurons that have been trained with similar effector positions. From these two activity fields a relative position map is derived, which encodes the position of the object relative to the effector. For this purpose one defines an activity map that integrates the pooled signals defined before in a multiplicative manner:

a_RP(u, v) = ∫ a_O(u′, v′) a_E(u′ − u, v′ − v) du′ dv′    (5)

This neural map simply sums the products of the activities of the two neural populations, dependent on the relative positions between effector and goal object that are encoded by the neurons selective for object and effector views. The last equation can be spatially discretized and implemented by summation (pooling) and multiplication of the signals of appropriately chosen neurons. Due to the multiplication, all neurons of the relative position map are inactive if either no object or no effector is present in the visual stimulus.

The recognized view of the effector also provides information about the range of positions in which an object has to be located to permit effective grasping with grip type l. Specifically, the grip will be dysfunctional if the object is, for example, positioned next to the hand, as opposed to between the thumb and the index finger for a precision grip. It is easy to learn the spatial region within the relative position map that corresponds to functional grips. This region is indicated by an orange line in the relative position map in Figure 1. One can define a 'receptive field' function g_l(u, v) that corresponds to this region, with high values inside the region and values close to zero outside. We postulate the existence of affordance neurons whose response is constructed by computing the output of the relative position map weighted by this receptive field function in the form:

a^l = ∫ a_RP(u, v) g_l(u, v) du dv    (6)

The output of the affordance neurons is only positive if object and effector are present and positioned correctly relative to each other. Neurons that are tuned to the relationship between objects and grips have been found in the parietal cortex of monkeys, e.g. in area AIP [28], and imaging studies suggest that such areas might also be activated by purely visual stimulation [29].

The highest level of the model is given by mirror neurons that multiply the output of the motion pattern neurons with that of the corresponding affordance neurons: m^l(t) = a^l(t) · s^l(t). Due to the multiplicative interaction, the mirror neurons respond only when the appropriate action is present and when the goal object is appropriately positioned relative to the effector.

III. RESULTS

A. Testing procedure

The model was tested with real video sequences showing a human actor grasping an object with power and precision grips. The videos (640x480 pixels, 30 frames/sec, RGB color mode) were recorded in front of a standardized black background. The object was a simple ball (diameter 8 cm). The hand of the actor started from a resting position on a table at a distance of about 30 cm from the object. The recorded video sequences had a length between 34 and 54 frames. From the original video a subregion of 360x180 pixels was extracted that contains the whole effector movement and the goal object. In addition, images with a size of 120x120 pixels that contain only the hand or the goal object were extracted for the training of the recognition hierarchy. The background was subtracted using a threshold operation, and the images were converted to grayscale for further processing.

B. Discrimination between power and precision grip

After training, the model was tested with video sequences of a precision and a power grip. The responses of two mirror neurons at the highest hierarchy level of the model are shown in Figure 2. The left panel shows the responses of a model mirror neuron that had been trained with a power grip and was tested with video sequences of a power and a precision grip. Since the initial phases of both grip types are very similar, the neuron is initially activated by both grips. However, after some time the preshaping of the hand leads to different hand configurations for the two grips, resulting in a decay of the response of the neuron for the precision grip, while the response continues to increase strongly for the power grip. The right panel shows the equivalent behavior for a mirror neuron that had been trained with a precision grip. Initially the neuron responds to both grip types, but after a while the response for a power-grip stimulus breaks down, while the response for a precision-grip stimulus continues to increase.
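The integration mechanism behind these responses, Eqs. (5) and (6), can be illustrated by a direct (inefficient but transparent) discretization. The grid size and the receptive field g_l below are toy assumptions:

```python
import numpy as np

def relative_position_map(a_obj, a_eff):
    """Discrete version of Eq. (5): correlates the object and effector
    population-activity fields over all relative displacements (du, dv).
    The output index is offset so that its centre corresponds to zero
    displacement."""
    H, W = a_obj.shape
    rp = np.zeros((2 * H - 1, 2 * W - 1))
    for du in range(-H + 1, H):
        for dv in range(-W + 1, W):
            acc = 0.0
            for u in range(H):
                for v in range(W):
                    uu, vv = u - du, v - dv      # effector field at (u' - u, v' - v)
                    if 0 <= uu < H and 0 <= vv < W:
                        acc += a_obj[u, v] * a_eff[uu, vv]
            rp[du + H - 1, dv + W - 1] = acc
    return rp

def affordance_response(rp, g):
    """Eq. (6): readout of the relative-position map through a
    'receptive field' g_l marking functional relative positions."""
    return float(np.sum(rp * g))
```

Because the map is a product of the two activity fields, removing the object (a mimicked action) silences the map and hence the affordance readout, reproducing the gating behaviour described in Sec. II-D.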
The simulated mirror neurons in the model are thus selective for the grip type, even in the presence of objects.

Fig. 2. Responses of two mirror neurons that are selective for a power grip (left) and a precision grip (right).

Fig. 3. Response of a mirror neuron during a) a normal movement towards the object, b) a normal movement not towards the object ('mimicked'), c) a movement in reversed temporal order.

Two additional properties that are typical for real mirror neurons are illustrated in Figure 3. The solid line again indicates the behavior of a mirror neuron that has been trained with a power grip, in a normal situation where the movement is presented with the correct temporal order of the frames in the presence of a goal object. In this situation, the mirror neuron shows a strong response. If, however, the temporal order of the effector movement is reversed (red line), or if the goal object is not present at the correct position (green line), i.e. a mimicked action, the activity breaks down. The model neuron is thus selective for the correct temporal sequence of the action and requires the presence of a goal object.

IV. CONCLUSIONS AND FUTURE WORK

A. Conclusions

We have presented a neurophysiologically plausible model for the visual tuning properties of mirror neurons. The proposed architecture provides only a first step towards a more detailed modeling of physiological data. However, the model is based on a number of simple neural mechanisms that, in principle, can be validated in electrophysiological experiments. In spite of this simplicity, the model works on real video sequences. In its present elementary form, the model qualitatively reproduces a number of key properties of mirror neurons: (1) tuning for the subtle differences between grips; (2) failure to respond to mimicked actions without a goal object; and (3) tuning to the temporal order of the presented action. All these properties were reproduced without an explicit reconstruction of the 3D geometry of the effector or the goal object. It thus seems that at least some of the visual tuning properties of mirror neurons can be reproduced without a precise metric three-dimensional internal model.

Due to the embedded mechanism for sequence selectivity, the proposed model is predictive and can, in principle, account for psychophysical results that show a facilitation of the recognition of future effector configurations from previously observed partial actions [30], [31]. Differing from several other models, which assume prediction in a high-dimensional space of motor patterns [3], our model assumes the existence of prediction also in the domain of visual patterns.

The current version does not include a memory mechanism that would allow coding of the presence and location of occluded objects [32], [33]. However, it has been shown that neural networks similar to the one we used to realize sequence selectivity can also model memory activity [34]. Including such mechanisms, e.g. in the object recognition pathway, would make it possible to model the persistent firing of mirror neurons during occlusions of the goal object [35].

In general, the question arises how predictive visual representations interact with representations of motor patterns, which almost certainly reflect the 3D structure of the planned actions. At the level of mirror neurons, however, to our knowledge no conclusive data exist that would allow deciding whether such neurons represent the 3D structure of motor actions, the 2D structure of learned pattern sequences, or a more abstract, potentially even non-metric representation of actions. Recordings in our own lab show that the responses of a large fraction of mirror neurons in area F5 are view-dependent. This seems to contradict an invariant effector-centered representation as the fundamental coordinate frame of the operation of mirror neurons. Future detailed electrophysiological studies, in close interaction with quantitative theoretical approaches, will help to clarify how different frames of reference are represented at the level of mirror neurons.

One might ask whether our model is suitable for robotics applications. Most existing robot systems for the imitation learning of movements by visual observation use traditional computer-vision front ends [36]. However, the fact that models strongly inspired by neuroscience have achieved high performance levels in object detection [19] and action recognition [37] makes their application in the context of robotics a feasible alternative. From the viewpoint of computational neuroscience, such applications are an ideal testbed for verifying the computational power of different neural implementations of computational operations. To link the existing architecture to a real robotic system, it needs to be augmented by a learned transformation from 2D into 3D joint coordinates. In computer vision a variety of algorithms have been proposed that solve this problem (see [38] for a review). The major bottleneck for online applications is, however, the high computation time, in particular on the earlier levels of the recognition hierarchy. This might be overcome by implementing parts of the model on a graphics processing unit [39].

B. Future Work

Future work will focus on refining the individual components of the model and fitting it in detail to available electrophysiological, behavioral and imaging data. In addition, specific electrophysiological experiments will be devised that directly test individual postulated neural mechanisms at the level of mirror neurons in premotor cortex (area F5).

V. ACKNOWLEDGMENTS

We thank L. Omlor for help with the video recordings. This work was supported by the DFG (SFB 550), the EC FP6 project COBOL, the Volkswagenstiftung and the Hermann Lilly Schilling Stiftung.

REFERENCES

[1] G. di Pellegrino, L. Fadiga, L. Fogassi, V. Gallese, and G. Rizzolatti, "Understanding motor events: a neurophysiological study," Exp. Brain Res., vol. 91, pp. 176–180, 1992.
[2] G. Rizzolatti and L. Craighero, "The mirror-neuron system," Annu. Rev. Neurosci., vol. 27, pp. 169–192, 2004.
[3] E. Oztop, M. Kawato, and M. A. Arbib, "Mirror neurons and imitation: A computationally guided review," Neural Networks, vol. 19, pp. 254–271, 2006.
[4] V. Gallese, L. Fadiga, L. Fogassi, and G. Rizzolatti, "Action recognition in the premotor cortex," Brain, vol. 119, pp. 593–609, 1996.
[5] G. Rizzolatti, L. Fadiga, V. Gallese, and L. Fogassi, "Premotor cortex and the recognition of motor actions," Cognitive Brain Research, vol. 3, pp. 131–141, 1996.
[6] A. H. Fagg and M. A. Arbib, "Modeling parietal–premotor interactions in primate control of grasping," Neural Networks, vol. 11, pp. 1277–1303, 1998.
[7] E. Oztop and M. A. Arbib, "Schema design and implementation of the grasp-related mirror neuron system," Biological Cybernetics, vol. 87, pp. 116–140, 2002.
[8] M. Haruno, D. M. Wolpert, and M. Kawato, "MOSAIC model for sensorimotor learning and control," Neural Comput., vol. 13, pp. 2201–2220, 2001.
[9] G. Metta, G. Sandini, L. Natale, L. Craighero, and L. Fadiga, "Understanding mirror neurons: A bio-robotic approach," Interaction Studies, vol. 7, pp. 197–232, 2006.
[10] Y. Demiris and G. Simmons, "Perceiving the unusual: temporal properties of hierarchical motor representations for action perception," Neural Netw., vol. 19, no. 3, pp. 272–284, 2006.
[11] W. Erlhagen, A. Mukovskiy, E. Bicho, G. Panin, C. Kiss, A. Knoll, H. T. van Schie, and H. Bekkering, "Goal-directed imitation for robots: A bio-inspired approach to action understanding and skill learning," Robotics and Autonomous Systems, vol. 54, pp. 353–360, 2006.
[12] T. Poggio and S. Edelman, "A network that learns to recognize 3D objects," Nature, vol. 343, pp. 263–266, 1990.
[13] N. Logothetis, J. Pauls, and T. Poggio, "Shape representation in the inferior temporal cortex of monkeys," Current Biology, vol. 5, pp. 552–563, 1995.
[14] S. Edelman, Representation and Recognition in Vision. MIT Press, 1999.
[15] M. J. Tarr and H. H. Bülthoff, Eds., Object Recognition in Man, Monkey, and Machine. Cambridge, MA, USA: MIT Press, 1998.
[16] D. Perrett and M. Oram, "Neurophysiology of shape processing," Image and Vision Computing, vol. 11, no. 6, pp. 317–333, 1993.
[17] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in visual cortex," Nat. Neurosci., vol. 2, pp. 1019–1025, 1999.
[18] B. W. Mel and J. Fiser, "Minimizing binding errors using learned conjunctive features," Neural Comput., vol. 12, pp. 731–762, 2000.
[19] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, pp. 411–426, 2007.
[20] M. A. Giese and T. Poggio, "Neural mechanisms for the recognition of biological movements," Nat. Rev. Neurosci., vol. 4, pp. 179–192, 2003.
[21] J. Mutch and D. G. Lowe, "Multiclass object recognition with sparse, localized features," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition 2006, vol. 1, 2006, pp. 11–18.
[22] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, pp. 193–202, 1980.
[23] V. Vapnik, Statistical Learning Theory. Wiley, 1998.
[24] J. Jastorff, Z. Kourtzi, and M. A. Giese, "Learning to discriminate complex movements: Biological versus artificial trajectories," J. Vis., vol. 6, pp. 791–804, 2006.
[25] D. Perrett, M. Harries, R. Bevan, S. Thomas, P. Benson, A. Mistlin, A. Chitty, J. Hietanen, and J. Ortega, "Frameworks of analysis for the neural representation of animate objects and actions," Journal of Experimental Biology, vol. 146, pp. 87–113, 1989.
[26] A. Georgopoulos, J. Kalaska, R. Caminiti, and J. Massey, "On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex," J. Neurosci., vol. 2, pp. 1527–1537, 1982.
[27] A. Pouget, P. Dayan, and R. Zemel, "Information processing with population codes," Nat. Rev. Neurosci., vol. 1, pp. 125–132, 2000.
[28] A. Murata, V. Gallese, G. Luppino, M. Kaseda, and H. Sakata, "Selectivity for the shape, size, and orientation of objects for grasping in neurons of monkey parietal area AIP," Journal of Neurophysiology, vol. 83, pp. 2580–2601, 2000.
[29] C. S. Konen and S. Kastner, "Two hierarchically organized neural systems for object information in human visual cortex," Nature Neuroscience, vol. 11, pp. 224–231, 2008.
[30] K. Verfaillie and A. Daems, "Predicting point-light actions in real-time," Visual Cognition, vol. 9, pp. 217–232, 2002.
[31] M. Graf, B. Reitzner, C. Corves, A. Casile, M. Giese, and W. Prinz, "Predicting point-light actions in real-time," NeuroImage, vol. 36, Suppl. 2, pp. T22–T32, 2007.
[32] M. Umiltà, E. Kohler, V. Gallese, L. Fogassi, L. Fadiga, C. Keysers, and G. Rizzolatti, "I know what you are doing: a neurophysiological study," Neuron, vol. 31, pp. 155–165, 2001.
[33] C. Baker, C. Keysers, T. Jellema, B. Wicker, and D. Perrett, "Neuronal representation of disappearing and hidden objects in temporal cortex of the macaque," Exp. Brain Research, vol. 140, pp. 375–381, 2001.
[34] A. Compte, N. Brunel, P. S. Goldman-Rakic, and X. J. Wang, "Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model," Cerebral Cortex, vol. 10, no. 9, pp. 910–923, 2000.
[35] J. Bonaiuto, E. Rosta, and M. Arbib, "Extending the mirror neuron system model, I: Audible actions and invisible grasps," Biological Cybernetics, vol. 96, pp. 9–38, 2007.
[36] Y. Demiris and A. Billard, "Special issue on robot learning by observation, demonstration, and imitation," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 37, pp. 254–255, 2007.
[37] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proc. IEEE Int. Conf. on Computer Vision (ICCV) 2007, 2007, pp. 1–8.
[38] V. Lepetit and P. Fua, "Monocular model-based 3D tracking of rigid objects: A survey," Foundations and Trends in Computer Graphics and Vision, vol. 1, pp. 1–89, 2005.
[39] M. Rumpf and R. Strzodka, "Graphics processor units: New prospects for parallel computing," in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito, Eds. Springer, 2005, vol. 51, pp. 89–134.