The encoding and decoding of complex visual stimuli: a neural model to optimize and read out a temporal population code

Andre Luiz Luvizotto

TESI DOCTORAL UPF / 2012

Director de la tesi: Prof. Dr. Paul Verschure, Department of Information and Communication Technologies

This work is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported. You are free to Share – to copy, distribute and transmit the work – under the following conditions:

• Attribution – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
• Noncommercial – You may not use this work for commercial purposes.
• No Derivative Works – You may not alter, transform, or build upon this work.

With the understanding that:

• Waiver – Any of the above conditions can be waived if you get permission from the copyright holder.
• Public Domain – Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
• Other Rights – In no way are any of the following rights affected by the license: your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; the author's moral rights; rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
• Notice – For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.

The PhD committee was appointed by the rector of the Universitat Pompeu Fabra on .............................................., 2012.

Chairman:
Secretary:
Member:

The doctoral defense was held on ......................................................., 2012, at the Universitat Pompeu Fabra and scored as ...................................................

PRESIDENT MEMBERS SECRETARY

To my parents...
Acknowledgements

This dissertation has been developed in the Department of Technology of the Universitat Pompeu Fabra, in the research lab SPECS. My first and sincere thanks go to Prof. Paul Verschure, who from the beginning constantly encouraged and supported me with his supervision in the development of my research ideas and goals. Thanks also to all SPECS members for many useful insights, discussions and feedback. My deepest gratitude to my great friend Cesar Rennó-Costa for the extensive support, discussions and valuable overall collaboration that contributed crucially to this thesis. I would like to thank Zenon Mathews and Martí Sanchez-Fibla for their invaluable support over these years, and my great friend Prof. Jonatas Manzolli for linking me with Paul and SPECS. I'm deeply thankful to my parents, Ismael and Angela, and my brothers George and Mauly for all the love and support throughout this journey. They have always been there for me. And also my "youngest parents" Tila and Giancarlo: you have both contributed so much with your love and care! My deepest and heartfelt thanks to my beloved Sassi. I do not have enough words to express how important you have been to me. Your support, company, love, lecker food and patience made everything easier and much more beautiful! Thanks, and I love you. Finally, I thank my heroes John, Paul, George, Ringo, Hendrix, Slash, Vaughan, EVH, Mr. Vernon and many others for providing me with the rock'n'roll stamina to write this dissertation!

Abstract

The mammalian visual system has a remarkable capacity to process a large amount of information within milliseconds, under widely varying conditions, into invariant representations. Recently, a model of the primary visual system exploited the unique feature of dense local excitatory connectivity of the neo-cortex to match these criteria.
The model rapidly generates invariant representations by integrating the activity of spatially distributed model neurons into a so-called Temporal Population Code (TPC). In this thesis, we first investigate an issue that has persisted since the introduction of the TPC: how to extend the concept with a biologically compatible readout stage. We propose a novel neural readout circuit based on the wavelet transform that decodes the TPC over different frequency bands. We show that, in comparison with the purely linear readouts used previously, the proposed system provides a robust, fast and highly compact representation of the visual input. We then generalize this optimized encoding-decoding paradigm to a number of real-world robotic tasks to investigate its robustness. Our results show that complex stimuli such as human faces, hand gestures and environmental cues can be reliably encoded by the TPC, which provides a powerful biologically plausible framework for real-time object recognition. In addition, our results suggest that the representation of sensory input can be built into a spatio-temporal code that is interpreted and parsed into a series of wavelet-like components by higher visual areas.

Resumen

El sistema visual dels mamífers té una remarcable capacitat per processar informació en intervals de temps de mil·lisegons sota condicions molt variables i adquirir representacions invariants d'aquesta informació. Recentment, un model del còrtex primari visual explota les característiques d'alta connectivitat excitatòria local del neocòrtex per modelar aquestes capacitats. El model integra ràpidament l'activitat repartida espacialment de les neurones i genera codificacions invariants que s'anomenen Temporal Population Codes (TPC). Aquí investiguem una qüestió que ha persistit des de la introducció del TPC: estudiar un procés biològicament possible capaç de fer la lectura d'aquestes codificacions.
Nosaltres proposem un nou circuit neuronal de lectura basat en la Wavelet Transform que decodifica el senyal TPC en diferents intervals de freqüència. Mostrem que, comparat amb lectures purament lineals utilitzades prèviament, el sistema proposat proporciona una representació robusta, ràpida i compacta de l'entrada visual. També presentem una generalització d'aquest paradigma de codificació-decodificació optimitzat que apliquem a diferents tasques de visió per computador i a la visió dins del context de la robòtica. Els resultats del nostre estudi suggereixen que la representació d'escenes visuals complexes, com cares humanes, gestos amb les mans i senyals del medi ambient, podrien ser codificades pel TPC, el qual es pot considerar un poderós marc biològic per al reconeixement d'objectes en temps real. A més a més, els nostres resultats suggereixen que la representació de l'entrada sensorial pot ser integrada en un codi espai-temporal interpretat i analitzat en una sèrie de components Wavelet per àrees visuals superiors.

Publications

Included in the thesis

Peer-reviewed

• Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A wavelet based neural model to optimize and read out a temporal population code. Frontiers in Computational Neuroscience, 6(21):14, 2012c

• Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A framework for mobile robot navigation using a temporal population code. In Springer Lecture Notes in Computer Science - Living Machines, page 12, 2012b

• Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, et al. Experimental and Functional Android Assistant: I. A Novel Architecture for a Controlled Human-Robot Interaction Environment. In IROS 2012 (Submitted), 2012a

• Andre Luvizotto, César Rennó-Costa, Ugo Pattacini, and Paul Verschure. The encoding of complex visual stimuli by a canonical model of the primary visual cortex: temporal population coding for face recognition on the iCub robot.
In IEEE International Conference on Robotics and Biomimetics, page 6, Thailand, 2011

Other publications as co-author

• César Rennó-Costa, André Luvizotto, Alberto Betella, Marti Sanchez Fibla, and Paul F. M. J. Verschure. Internal drive regulation of sensorimotor reflexes in the control of a catering assistant autonomous robot. In Lecture Notes in Artificial Intelligence: Living Machines, 2012

• Maxime Petit, Stéphane Lallée, Jean-David Boucher, Grégoire Pointeau, Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini, Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron, Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris, and Peter Ford Dominey. The Coordinating Role of Language in Real-Time Multi-Modal Learning of Cooperative Tasks. IEEE Transactions on Autonomous Mental Development (TAMD), 2012

• César Rennó-Costa, André L Luvizotto, Encarni Marcos, Armin Duff, Martí Sánchez-Fibla, and Paul F M J Verschure. Integrating Neuroscience-based Models Towards an Autonomous Biomimetic Synthetic. In 2011 IEEE International Conference on RObotics and BIOmimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011. IEEE

• Armin Duff, César Rennó-Costa, Encarni Marcos, Andre Luvizotto, Andrea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and Paul Verschure. From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7. doi: 10.1007/978-3-642-05181-4. URL http://www.springerlink.com/content/v348576tk12u628h

• Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto, Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl, Ulysses Bernardet, and Paul Verschure. Disembodied and Collaborative Musical Interaction in the Multimodal Brain Orchestra. In Proceedings of the international conference on New Interfaces for Musical Expression, 2010.
URL http://www.citeulike.org/user/slegroux/article/8492764

Contents

Abstract
Resumen
Publications
List of Figures
List of Tables

1 Introduction
1.1 Early visual areas
1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain
1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information
1.4 How can the key components encapsulated in a temporal code be decoded by different cortical areas involved in the visual process?
1.5 Can TPC be used in the recognition of multi-modal sensory input such as human faces and gestures to provide form perception primitives?

2 A wavelet based neural model to optimize and read out a temporal population code
2.1 Introduction
2.2 Material and methods
2.3 Results
2.4 Discussion
2.5 Acknowledgments

3 Temporal Population Code for Face Recognition on the iCub Robot
3.1 Introduction
3.2 Methods
3.3 Results
3.4 Discussion

4 Using a temporal population code to recognize gestures on the humanoid robot iCub
4.1 Introduction
4.2 Material and methods
4.3 Results
4.4 Discussion

5 A framework for mobile robot navigation using a temporal population code
5.1 Introduction
5.2 Methods
5.3 Results
5.4 Discussion

6 Conclusion

A Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction
A.1 Introduction
A.2 Methods
A.3 Results
A.4 Conclusions

Bibliography

List of Figures

1.1 Parallel pathways in the early visual system. Redrawn based on (Nassi and Callaway, 2009).

1.2 Schematics of the receptive field response in simple and complex cells: V1 cells are orientation selective, i.e. the cell's activity increases or decreases when the orientation of edges and bars matches the preferred orientation φ of the cell.

1.3 Overview of the Temporal Population Code architecture. In the model proposed by Wyss et al. (2003b), the input stimulus passes through an edge-detection stage that approximates the receptive field characteristics of LGN cells. In the next stage, the LGN output is continuously projected to a network of laterally connected integrate-and-fire neurons with properties found in the primary visual cortex. Due to the recurrent connections, the visual stimulus becomes encoded in the network's activity trace.

2.1 The TPC encoding model. In a first step, the input image is projected to the LGN stage, where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC.

2.2 The TPC encoding paradigm. The stimulus, here represented by a star, is projected topographically onto a map of interconnected cortical neurons. When a neuron spikes, its action potential is distributed over a neighborhood of a given radius. The lateral transmission delay of these connections is 1 ms/unit. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC representation is defined by the spatial average of the population activity over a certain time window. The invariances that the TPC encoding renders are defined by the local excitatory connections.

2.3 Computational properties of the two types of neurons used in the simulations: regular spiking (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz).

2.4 a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate the network activity in time, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit, where only the frequency range corresponding to the resolution level Dc3 is kept in the spectrum.

2.5 The stimulus classes used in the experiments, after the edge enhancement of the LGN stage.

2.6 The stimulus set. a) Image-based prototypes (no jitter applied to the vertices) and the globally most different exemplars, with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histogram of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain.

2.7 Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain.

2.8 Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters for both types of neurons, RS and BS. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multiresolution scheme. The network time window is 128 ms.

2.9 Speed of encoding. Number of bits encoded by the network's activity trace as a function of time. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. The RS-Wav and BS-Wav curves correspond to the bits encoded by the wavelet coefficients using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons, respectively. For a time window of 128 ms, the Dc3 level has 16 coefficients and the Dc5 level has only 4 coefficients. The dots in the figure represent the moments in time where the coefficients are generated.

2.10 Single-sided amplitude spectrum of the wavelet prototype for each stimulus class used in the simulations. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons, respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion).

2.11 Prototype-based classification hit matrices. For each class in the training set, we average the wavelet coefficients to form class prototypes. In the classification process, the Euclidean distances between the classification set and the prototypes are calculated. A stimulus is assigned to the class whose prototype has the smallest Euclidean distance.

2.12 Distribution of errors in the wavelet-based prototype classification in relation to the Euclidean distances within the prototyped classes.

3.1 Schematic of the lateral propagation using the Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the frequency domain. The matrix of spikes generated at time t is multiplied by the filters of lateral propagation Fd over the next time steps t + 1 and t + 2. The output is given by the inverse DFT of the result.

3.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.

3.3 Model overview. The faces are detected and cropped from the input provided by the iCub's camera image.
The cropped faces are resized to a fixed resolution of 128x128 pixels and convolved with the orientation-selective filters. The output of each orientation layer is processed by separate neural populations, as explained above. The spike activity is summed over a specific time window, rendering the Temporal Population Code, or TPC.

3.4 Cropped faces from the Yale face database B, used as a benchmark to compare TPC with other methods of face recognition available in the literature (Georghiades et al., 2001; Lee et al., 2005).

3.5 Speed of encoding. Average correct classification ratio given by the network's activity trace as a function of time, for different values of the input threshold Ti.

3.6 Response clustering. The entries of the hit matrix represent the number of times a stimulus class is assigned to a response class, over the optimal time window of 21 ms and Ti of 0.6.

3.7 Face data set. Training set (green), classification set (blue) and misclassified faces (red).

3.8 Classification ratio using a spatial subsampling strategy where the network activity is read over multiple subregions.

3.9 Classification ratio using the cropped faces from the Yale database.

4.1 The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture which defeats that of the opponent: rock breaks scissors, scissors cut paper, and paper covers rock. Unlike with a truly random selection method, such as flipping a coin or rolling a die, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent.

4.2 Visual model diagram. In the first step, the input image is color filtered in the hue, saturation and value (HSV) space and segmented from the background. The segmented image is then projected to the LGN stage, where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet readout system.

4.3 Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto the network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window.

4.4 Computational properties, after frequency adaptation, of the integrate-and-fire neuron used in the simulations. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz).

4.5 Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with a frequency range equal to the difference in frequency between the current and previous approximation levels.

4.6 The stimulus classes used in the experiments. Here 12 exemplars for each class are shown.

4.7 Classification hit matrix for the baseline. In the classification process, the correlations between the classification set (25% of the stimuli) and the training set (75%) are calculated. A stimulus is assigned to the class in the training set with which it has the highest mean correlation.

4.8 Classification hit matrix. In the classification process, the Euclidean distances between the classification set and the prototypes are calculated. A stimulus is assigned to the class whose prototype has the smallest Euclidean distance.

5.1 Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by the formation of place cells, which show high firing rates whenever the robot is in a specific location in the environment.

5.2 Visual model overview. The salient regions are detected and cropped from the input image provided by the camera. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference-of-Gaussians (DoG) operator that approximates the properties of the receptive field of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and their spiking activity is summed over a specific time window, rendering the Temporal Population Code.

5.3 Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m² square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors, using a histogram of 7 bins.

5.4 Pairwise correlation of the TPC histograms in the environment. We calculate the correlation among all possible combinations of two positions in the environment. We average the correlation values into distance intervals, according to the sampling distances used in the image acquisition. This result suggests that we can produce a linear curve of distance based on the correlation values calculated using the TPC histograms.

5.5 Place cells acquired by the combination of TPC-based visual representations and the E%-max WTA model of the hippocampus.

A.1 The iCub ready to play with the objects that are on the Reactable. The table provides an appealing atmosphere for social interactions between humans and the iCub.

A.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.

A.3 Reactable system overview. See text for further explanation.

A.4 PMP architecture: at the lower level, the pmpServer module computes the trajectory, targets and obstacles in the space, feeding the Cartesian Interface with respect to the trajectory that has to be followed. The user relies on a pmpClient in order to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level, the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives.

A.5 A cartoon of the experimental scenario. The distances for the left (L), middle (M) and right (R) positions are given relative to the robot.

A.6 Box plots of GEx and GEy for the grasp experiments. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y.

A.7 Box plots for the place experiment for the different conditions analyzed: source and target. Both conditions have an impact on the error.

A.8 Average time needed for the shared plan tasks. The learning box represents the time needed by the human to create the task. The execution box represents the time needed to perform the learned action.

List of Tables

2.1 Parameters used for the simulations.
3.1 Parameters used for the simulations. See text for further explanation.
3.2 Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method.
4.1 Parameters used for the simulations.
5.1 Parameters used for the simulations.
A.1 Excerpt of available OPC properties.
Chapter 1

Introduction

In our daily life we are constantly exposed to situations where recognizing different objects, faces, colors or even estimating the speed of a moving object plays a key role in the way we behave. To recognize different scenes under widely varying conditions, the visual system must solve the hard problem of building invariant representations of the available sensory information. Invariant representation of visual input is a key element for a number of tasks in different species. For example, it has been shown in foraging experiments with rats that the animal relies on visual landmarks to explore the environment and to produce an internal representation of the world. This strategy allows not only rats, but a number of other animals, to successfully navigate over different territories while searching for food (Tamara and Timberlake, 2011; Zoccolan et al., 2009). Furthermore, the invariant representation of visual content must be performed very efficiently, capturing all the rich details necessary to distinguish among different situations in a fast and non-redundant way, so that prototypes of a class can emerge and be efficiently stored in memory and/or serve on-going action. Most work on invariance in the visual cortex is related to the primary visual cortex. Despite intense research in the fields of vision and neuroscience, it remains unclear which cortical mechanisms are responsible for building invariant representations (Pinto et al., 2008). This question has been of great interest to the field of computational neuroscience and also to roboticists building artificial systems acting in real-world situations (Riesenhuber and Poggio, 2000). This is because biological systems far outperform advanced artificial systems in solving real-world visual tasks.
In classical models of visual perception, invariant representations emerge in the form of activity patterns at the highest level of a hierarchical multilayer network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and Poggio, 1999; Serre et al., 2007). These models are extensions of the Hubel and Wiesel simple-to-complex cell hierarchy. In this approach, invariances are achieved at the cost of increasing the number of connections between the different layers of the hierarchy. However, these models seem to rest on a fundamental assumption that is not consistent with cortical anatomy. Another inconsistency of traditional artificial neural networks, such as the perceptron and multi-layer perceptrons (Rosenblatt, 1958; Rumelhart et al., 1986), is their time independence. These models were designed to deal with purely static spatial patterns of input; when time is considered at all, it is treated as just another dimension, which makes those models extremely hard-wired. From a biological perspective, it is clear that time is not treated as another spatial dimension at the input level. Thus a central question in this discussion is how time can be incorporated into the representation of a spatial input. In this dissertation we address this issue by proposing a plausible circuitry for the cortical processing of sensory input. We develop a model of the early visual system that aims to capture the key ingredients necessary to build a robust, invariant and compact representation of the visual world. The proposed model is fully grounded in the physiological and anatomical properties of the mammalian visual system and gives a functional interpretation to the dense intra-cortical connectivity observed in the cortex. In this chapter, we start with a brief review of the brain areas involved in the representation and categorization of objects, defining in detail the questions addressed and the solutions proposed.
1.1 Early visual areas

Retina

The sensory input organ for the visual system is the eye, where the first steps of seeing begin in the retina (Nassi and Callaway, 2009). The light that hits the eye passes through the cornea and the lens before arriving at the retina. The retina contains a dense array of photoreceptors, called rods and cones, that encode the intensity of light as a function of 2D position, wavelength and time. The rods are in charge of scotopic, or night, vision and saturate in daylight. They are very sensitive to light and can respond to a single photon. The cones are much less sensitive to light and are therefore responsible for daylight vision. Cones come in three subtypes with distinct spectral absorption functions: short (S), medium (M) and long (L), which form the basis of chromatic vision. A single cone by itself is color blind, in the sense that its activation depends on both the wavelength and the intensity of the stimulus. The signals from the different classes of photoreceptors must be compared in order to extract the incoming information about color (Conway et al., 2010). The existence of cone-opponent retinal ganglion cells (RGCs) that perform such comparisons is well established in primates. The axons of the RGCs form the optic nerve, connecting the eye to the brain. Three types of RGC have been particularly well characterized. Morphologically defined ON and OFF midget RGCs are the origin of the parvocellular pathway; they provide an L versus M color-opponent signal to the parvocellular layers of the lateral geniculate nucleus (LGN), often associated with red-green color perception. Parasol RGCs constitute the magnocellular pathway and convey a broadband, achromatic signal to the magnocellular layers of the LGN. Cells in this pathway have large receptive fields and are highly sensitive to contrast and spatial frequency.
Finally, small and large bistratified RGCs compose the koniocellular pathway and convey an S versus L+M, or blue-yellow, color-opponent signal to the koniocellular layers of the LGN (Nassi and Callaway, 2009). In both the retina and the LGN, cells have center-surround receptive fields (Lee et al., 2000), with ON-center/OFF-surround responses to light intensity, or vice versa. An ON-center cell increases its activity when light hits the center of its receptive field and decreases it when light hits the surround; OFF-center cells behave in the opposite way. Most researchers argue that the retina's principal function is to convey the visual image through the optic nerve to the brain. However, recent results suggest that already at this early stage of the visual pathway the chromatic or achromatic characteristics of a stimulus are encoded in a complex temporal response of the RGCs (Gollisch and Meister, 2008, 2010).

Lateral geniculate nucleus

The output of these specialized channels is projected to the LGN of the thalamus. The primate LGN has a laminar organization, divided into six layers that receive information from the RGCs. The most ventral layers (layers 1-2) receive input from the magnocellular pathway, while the remaining layers receive input from the parvocellular pathway. Wiesel and Hubel (1966) found that monkey neurons in the magnocellular layers of the LGN were largely color-blind, with large, achromatic and highly contrast-sensitive receptive fields, whereas neurons in the parvocellular layers had smaller, color-opponent and poorly contrast-sensitive receptive fields. Further work revealed that the koniocellular pathway connects in segregated stripes between the parvocellular and magnocellular layers of the LGN (reviewed by Callaway (2005) and Shapley and Hawken (2011)) (Fig. 1.1a).
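The opponent comparisons carried by these retino-geniculate channels can be made concrete with a toy computation on LMS cone activations. This is only an illustrative sketch: the activation values, the weights and the function name are assumptions of this example, not physiological measurements.

```python
import numpy as np

# Toy LMS cone responses for two patches: reddish vs bluish light.
# Values are illustrative activations in [0, 1], not calibrated data.
lms_red = np.array([0.9, 0.4, 0.1])    # L, M, S
lms_blue = np.array([0.2, 0.3, 0.8])

def opponent_channels(lms):
    """Compute the two chromatic opponent signals described in the text:
    red-green (L vs M, midget/parvocellular) and blue-yellow
    (S vs L+M, bistratified/koniocellular)."""
    L, M, S = lms
    red_green = L - M              # parvocellular-like signal
    blue_yellow = S - (L + M) / 2  # koniocellular-like signal
    return red_green, blue_yellow

rg_r, by_r = opponent_channels(lms_red)
rg_b, by_b = opponent_channels(lms_blue)
# A reddish patch drives the red-green channel positive; a bluish patch
# drives the blue-yellow channel positive.
```

A single cone activation on its own cannot distinguish a dim long-wavelength light from a bright short-wavelength one; only the comparison across cone classes, as sketched here, recovers chromatic information.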
Figure 1.1: Parallel pathways in the early visual system. (a) The retino-geniculate channels: midget ganglion cells (red-green colour opponency) to the parvocellular layers, parasol ganglion cells (luminance) to the magnocellular layers, and small bistratified ganglion cells (blue-yellow colour opponency) to the koniocellular layers. (b) The laminar structure of V1 (layers 2/3, 4A, 4B, 4Cα, 4Cβ, 5 and 6). Redrawn based on Nassi and Callaway (2009).

In the primary visual cortex, the output of the parvo-, magno- and koniocellular cells of the LGN remains segregated. Parvo cells terminate in layer 4Cβ, whereas magno cells innervate layer 4Cα. Konio cells provide the input to layers 2/3. These distinct anatomical projections persuaded early investigators that the parvo and magno channels remain functionally isolated in V1, which is still a source of controversy (Nassi and Callaway, 2009) (Fig. 1.1a). The receptive fields of LGN cells exhibit an ON-OFF response to light intensity. Modeling studies have used filters based on a difference of Gaussians (DoG) to approximate the receptive-field characteristics of LGN cells (Einevoll and Plesser, 2011; Rodieck and Stone, 1965). From an image-processing perspective, DoG filters can perform edge enhancement over different spatial frequencies.

Primary visual cortex

In the visual cortex, different properties of objects (their distances, shapes, colors and directions of motion) are segregated into separate cortical areas (Zeki and Shipp, 1988). The earliest stage is the primary visual cortex (V1, striate cortex or area 17), which is characterized by a columnar organization in which neurons with similar response properties are grouped vertically (Mountcastle, 1957). The primary visual cortex shows the stereotypical layered structure observed in the neo-cortex. It is structured into six layers, numbered 1 to 6 from the surface to the depth (Fig. 1.1b).
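The DoG approximation of LGN receptive fields mentioned above can be sketched in a few lines. The kernel size and the center/surround widths below are illustrative choices, not values fitted to LGN data:

```python
import numpy as np

def dog_kernel(size=9, sigma_c=1.0, sigma_s=2.0):
    """Difference-of-Gaussians kernel: a narrow excitatory center minus a
    broad inhibitory surround (an ON-center cell; swap the sign for OFF)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = lambda s: np.exp(-(xx**2 + yy**2) / (2 * s**2)) / (2 * np.pi * s**2)
    return g(sigma_c) - g(sigma_s)

def filter_image(img, k):
    """Valid-mode 2D filtering, written out explicitly for clarity
    (the kernel is symmetric, so convolution equals correlation)."""
    n = k.shape[0]
    h, w = img.shape[0] - n + 1, img.shape[1] - n + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+n, j:j+n] * k)
    return out

# A step edge: the DoG responds at the luminance boundary and is silent
# in the uniform regions -- edge enhancement.
img = np.zeros((20, 20)); img[:, 10:] = 1.0
resp = filter_image(img, dog_kernel())
```

Because the excitatory and inhibitory lobes nearly cancel over uniform regions (the kernel sums to approximately zero), only luminance transitions survive, and varying the two widths shifts the spatial-frequency band that is enhanced.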
The projection from the LGN to V1 follows a non-random relationship between an object's position in visual space and its position in the cortex, in retinal coordinates. This topological arrangement, in which nearby regions of the retina project to nearby cortical regions, is called retinotopic (Cowey and Rolls, 1974). In V1 there are two types of excitatory cells, known as pyramidal and stellate. Both of these cell types receive direct feedforward input in layer 4C from cells of the magnocellular pathway of the thalamus. Pyramidal cells have large apical dendrites that pass above layer 4B and into layer 2/3. The apical dendrites of pyramidal cells receive connections from the parvocellular projection to layer 4Cβ in V1 and also project into layer 2/3. Indeed, most of the feed-forward input into layers 2/3 originates from layers 4C and 4B; only a few inputs come from the koniocellular layers of the LGN. Neurons in V1 exhibit selectivity for the orientation of a visual stimulus (Hubel and Wiesel, 1962; Ringach et al., 2002; Nauhaus et al., 2009). They are distributed in orientation maps, where some neurons lie in regions of fairly homogeneous orientation preference while others lie in regions with a variety of preferences, called pinwheel centers (Nauhaus et al., 2008). V1 neurons are basically divided into two classes according to the properties of their receptive fields: simple cells and complex cells (Hubel and Wiesel, 1962). Similarly to retinal ganglion and geniculate cells, simple cells show distinct excitatory and inhibitory subdivisions; they respond best to elongated bars or edges at a preferred orientation. Complex cells are more concerned with the orientation of a stimulus than with its exact position in the receptive field (Fig. 1.2). The excitatory circuitry in the primary visual cortex is characterized by dense local connectivity (Callaway, 1998; Briggs and Callaway, 2001).
It is estimated that 95% of the synapses in V1 are local or feedback connections from other visual areas (Sporns and Zwi, 2004). Within the same area, projections can link neurons over distances of several millimeters, spatially distributed into clusters of the same stimulus preference (Stettler et al., 2002). It has been proposed that V1 selectivity to stimuli of similar orientation preference is mediated by long-range horizontal connections intrinsic to V1 (Stettler et al., 2002). This rule has established the notion of a like-to-like pattern of connectivity (Gilbert and Wiesel, 1989). However, some recent experiments have found conflicting results: pyramidal neurons lying close to populations of diffuse orientation selectivity (pinwheel centers) reportedly connect laterally to orientation columns in a cross-oriented or non-selective way (Yousef et al., 2001). There are also massive feedback projections from higher visual areas, for example V2.

Figure 1.2: Schematics of the receptive-field responses of simple and complex cells. V1 cells are orientation selective, i.e. a cell's activity increases or decreases as the orientation of edges and bars matches the cell's preferred orientation φ.

Compared with the feedforward and lateral connections, the feedback projections to V1 have received less attention from the scientific community (Sincich and Horton, 2005). Feedback connections from higher cortical areas provide more diffuse and divergent input to V1 (Salin et al., 1989). At first, anatomical studies suggested that V2 and V1 cells are preferentially connected when they share similar preferred orientations (Gilbert and Wiesel, 1989). In contradiction, recordings in area V1 of the macaque monkey show that the inactivation of V2 has no effect on the center/surround interactions of V1 neurons.
The only effect observed upon inactivation of V2 feedback is a decrease in the response to a single oriented bar in about 10% of V1 neurons (Hupé et al., 2001). Indeed, recent modeling studies confirm that the tuning behavior to oriented stimuli that is highly prevalent in V1 emerges only if the network operates in a regime of strong local recurrence (Shushruth et al., 2012). Studies in monkeys have investigated the possible influences of horizontal connections and of feedback from higher cortical areas on contour integration. The results suggest that V1 intrinsic horizontal connections provide a more likely substrate for contour integration than feedback projections from V2 to V1. However, the exact mechanism is still unclear. In recent years much has been learned about V1; however, the functional role of its lateral connectivity (horizontal or feedback) remains unknown, especially in terms of stimulus encoding and its relationship with the dense local connectivity observed in this area.

1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain

The extrastriate cortex comprises all the visual areas beyond the striate cortex, or V1. It includes areas V2, V3, V4, IT (inferior temporal cortex) and MT (medial temporal cortex). The earlier visual areas, up to V4, appear to have clear retinotopic maps with cells tuned to different features such as orientation, color or spatial frequency. The consecutive stages reveal receptive fields with increasingly complex selectivity. This is particularly evident in IT, which is believed to play a major role in object recognition and categorization (Logothetis and Sheinberg, 1996). Early lesion studies showed that complete removal of both temporal lobes in monkeys resulted in a collection of strange symptoms, most remarkably the inability to recognize objects visually.
More specifically, lesions of the inferior temporal lobe produced severe and permanent deficits in learning and remembering to recognize stimuli, resulting in visual agnosias. For example, after bilateral or right-hemisphere damage to IT cortex, humans reported difficulties in recognizing faces (prosopagnosia), colors (achromatopsia) and other more specific objects. Area IT receives visual information from V1 through a serial pathway called the ventral visual pathway (V1-V2-V4-IT). IT projects to various brain areas outside the visual cortex, including the prefrontal cortex, the perirhinal cortex (areas 35 and 36), the amygdala, and the striatum of the basal ganglia (Tanaka, 1996). The neurons in IT have several properties that help to explain the crucial role this area plays in pattern recognition. IT cells respond only to visual stimuli. Their receptive fields always include the center of gaze and tend to be larger than in V1, leading to stimulus generalization within the receptive field. IT neurons usually respond more to complex than to simple shapes, but are also responsive to color. A small percentage of IT neurons are selective for facial images. Recent fMRI studies in macaque monkeys found a specific area in the temporal lobe that is activated much more by faces than by non-face objects (Tsao et al., 2003); single-unit recordings subsequently confirmed this (Tsao et al., 2006). Middle face patch neurons detect and differentiate faces using a strategy that is both part-based (constellations of face parts) and holistic (the presence of a whole, upright face) (Freiwald et al., 2009). These findings point towards the concept of grandmother cells, or gnostic units, i.e. hypothetical neurons that respond only to a highly complex, specific and meaningful stimulus, such as the image of one's grandmother (Gross, 2002). However, the idea of localist representations in the brain has been controversial.
On one side of the spectrum, those who advocate in favor of localist representations claim that these theories of perception and cognition are more consistent with neuroscience (reviewed by Bowers (2009)). On the other side, parallel distributed processing theories of cognition claim that knowledge is coded in a distributed manner in mind and brain (reviewed by O'Reilly (1998)). First, it would be far too risky for the nervous system to rely too heavily on selectivity: well-placed damage to a small number of cells could prevent one from ever recognizing one's grandmother again. In addition, it is now well established that higher information-processing areas return information to the lower ones, so that information travels in both directions between the modules, and not just upward. In area IT, neurons are usually invariant to changes in contrast, stimulus size, color and position on the retina (Tanaka, 1996).

1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information

In classical models that mimic natural vision systems, invariant representations emerge in the form of activity patterns at the highest level of the network, by virtue of spatial averaging across the feature detectors of the preceding layers. These models are based on the so-called Neocognitron (Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010), a hierarchical multilayer network of detectors with varying feature and spatial tuning. In this approach, recognition is based on a large dictionary of features stored in memory. Filters selective to specific features, or combinations of features, are distributed over the different network layers.
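The simple-to-complex alternation underlying such hierarchies can be caricatured as template matching followed by spatial max pooling. This is a generic sketch of the scheme, not the Neocognitron itself; the template, sizes and pooling window are arbitrary choices of this example:

```python
import numpy as np

def s_layer(img, template):
    """'Simple'-cell stage: correlate a feature template over the image
    (valid mode), yielding a position-sensitive map of feature evidence."""
    n = template.shape[0]
    h, w = img.shape[0] - n + 1, img.shape[1] - n + 1
    return np.array([[np.sum(img[i:i+n, j:j+n] * template)
                      for j in range(w)] for i in range(h)])

def c_layer(smap, pool=4):
    """'Complex'-cell stage: max-pool over space, trading positional
    precision for translation tolerance (the Hubel-Wiesel caricature)."""
    h, w = smap.shape[0] // pool, smap.shape[1] // pool
    return np.array([[smap[i*pool:(i+1)*pool, j*pool:(j+1)*pool].max()
                      for j in range(w)] for i in range(h)])

vert_edge = np.array([[-1., 1.], [-1., 1.]])  # a vertical-edge template
img = np.zeros((9, 9)); img[2:7, 2] = 1.0     # a vertical bar
shifted = np.roll(img, 1, axis=1)             # the same bar, shifted 1 px

r1 = c_layer(s_layer(img, vert_edge), pool=4)
r2 = c_layer(s_layer(shifted, vert_edge), pool=4)
# The pooled responses agree: invariance bought by discarding position.
```

The pooled responses match only while the shift stays inside one pooling window; broader invariance demands larger pools, more feature channels and more layers, which is precisely the connection cost discussed in the text.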
The underlying coding mechanism used in hierarchical models is rate coding, i.e. the mean firing rate of a neuron is directly related to its selectivity for the applied stimulus. In this way, hierarchical models are based on the assumption that a cortical cell is tuned to a local and specific feature using rate coding. A neuron in a rate-coding network has a binary response to a certain stimulus, encoding one bit of information. Thus, in this approach, invariances to, for instance, position, scale and orientation are achieved at the high cost of increasing the number of neurons and of connections between layers to cover all possible kinds of inputs the network can receive. Taking into account the dimensionality of the human visual space, the total number of visual features in the world would require far more cells than are available in the cortex. Even under the assumption that neurons do use rate coding, more information could be provided by the network using a population of neurons. In particular, a group of cells, each responsive to a different feature, can cooperatively code for a complete subspace of sensory inputs. In the motor cortex, for instance, the direction of movement was found to be uniquely predicted by the action of a population of motor cortical neurons (Georgopoulos et al., 1986). These models have been criticized because the recorded neural substrate probably reflects only a fraction of the patterns present in the brain, and also because this scheme requires a minimum degree of heterogeneity in the neuronal substrate, which might not be anatomically true. An alternative for such a coding system is to include time as an extra coding dimension. It has been shown that the precise spike timing of single neurons can successfully encode information (reviewed by Tiesinga et al. (2008)). This encoding scheme has been called temporal coding.
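Before turning to temporal codes, the rate-based population scheme of Georgopoulos et al. (1986) can be illustrated with a standard population-vector sketch. The cosine tuning curves and all parameters below are textbook idealizations, not data from that study:

```python
import numpy as np

rng = np.random.default_rng(0)

# N cells with preferred directions spread around the circle; each fires
# according to a cosine tuning curve (a standard idealization).
n_cells = 64
preferred = np.linspace(0, 2 * np.pi, n_cells, endpoint=False)

def rates(direction, base=10.0, gain=8.0):
    """Cosine-tuned mean firing rates for a movement direction (radians)."""
    return base + gain * np.cos(direction - preferred)

def population_vector(r):
    """Georgopoulos-style decoder: rate-weighted sum of unit vectors
    along each cell's preferred direction."""
    x = np.sum(r * np.cos(preferred))
    y = np.sum(r * np.sin(preferred))
    return np.arctan2(y, x)

true_dir = 1.2
r = rates(true_dir) + rng.normal(0, 1.0, n_cells)  # noisy single trial
decoded = population_vector(r)
# The decoded angle recovers the true direction despite single-cell noise.
```

No single cell identifies the direction (each responds broadly), yet the weighted vector sum over the population recovers it, which is the sense in which a group of broadly tuned cells codes for a whole stimulus subspace.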
The difference between a rate and a temporal code can become unclear if the spiking rates are computed over very tiny time bins; the conceptual idea, however, is clear. It is intuitive that, for a single neuron, the potential information content of precise and reliable spike times is many times larger than that contained in the firing rate, which is averaged across a typical interval of a hundred milliseconds (Reinagel and Reid, 2000). Recently, a model of the primary visual cortex used the concept of a temporal code within a population of neurons to show that the temporal dynamics of a recurrently coupled neural network can encode the spatial features of a 2D stimulus (Wyss et al., 2003b,a). This novel idea introduced a new type of coding in which spatial information is translated into a temporal population code (TPC). The model gives a functional interpretation to the dense excitatory local connectivity found in V1 and suggests an encoding paradigm invariant to a number of visual transformations. In this approach, the input stimulus is initially processed by an LGN-like circuit and topographically projected onto a V1 network of neurons organized in a two-dimensional Cartesian space with dense local and symmetrical connectivity. The output of the network is a compressed representation of the stimulus, captured in the temporal evolution of the population's spike activity (Fig. 1.3).

Figure 1.3: Overview of the temporal population code architecture: the input image is processed by retina and LGN stages, retinotopically projected onto a network of modeled V1 neurons with lateral spreading of spikes, and read out over time as a TPC.

In the model proposed by Wyss et al. (2003b), the input stimulus passes through an edge-detection stage that approximates the receptive-field characteristics of LGN cells.
In the next stage, the LGN output is continuously projected onto a network of laterally connected integrate-and-fire neurons with properties found in the primary visual cortex. Due to the recurrent connections, the visual stimulus becomes encoded in the network's activity trace. An intrinsic characteristic of TPC is its independence of the precise position of the stimulus within the network: the dynamics of a neuronal population do not depend on which neurons are involved and are invariant to their topological arrangement. Therefore, the space-to-spatiotemporal transformation of TPC provides a high-capacity encoding, invariant to position and rotation. The TPC coding scheme captures the relationships among the core geometrical components of the visual input. Up to a certain degree, distortions in form are smoothed by the lateral interactions, which makes the encoding robust to small geometrical deformations such as those found in handwritten characters (Wyss et al., 2003a). From a theoretical perspective, the TPC architecture is generic with respect to the input stimulus, because the network comprises both the coding substrate and the mechanisms needed to encapsulate local rules of computation. In this sense, there is no need to re-wire the network for different inputs. The wiring-independence of TPC makes it far more general than purely hierarchical models: the network can receive general types of input without incorporating filters selective to the incoming features. From a biological perspective, a number of physiological studies support the encoding scheme of TPC. Work with salamanders reported that, already in the retina, certain ganglion cells encode the spatial structure of a briefly presented image in the relative timing of their first spikes (Gollisch and Meister, 2008).
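The space-to-time transformation underlying TPC can be caricatured in a few lines: a grid of leaky integrate-and-fire units with local excitatory coupling is driven briefly by an image, and the code is the time course of the population spike count, which discards where activity occurred. This is a heavily simplified sketch with arbitrary parameters (and periodic boundaries, which make the translation invariance exact), not the model of Wyss et al. (2003b):

```python
import numpy as np

def tpc_encode(img, steps=40, tau=0.8, w=0.35, thresh=1.0):
    """Toy temporal population code: a leaky integrate-and-fire grid with
    nearest-neighbour excitatory coupling (periodic boundaries). The code
    is the population spike count over time -- it keeps *when* activity
    happened, not *where*, hence the position invariance."""
    v = np.zeros_like(img, dtype=float)
    spikes = np.zeros_like(img, dtype=float)
    trace = []
    for t in range(steps):
        lateral = w * sum(np.roll(spikes, s, axis=a)
                          for s in (-1, 1) for a in (0, 1))
        drive = img if t < 5 else 0.0          # brief stimulus presentation
        v = tau * v + drive + lateral
        spikes = (v >= thresh).astype(float)
        v[spikes > 0] = 0.0                    # reset after spiking
        trace.append(spikes.sum())
    return np.array(trace)

bar = np.zeros((16, 16)); bar[4:12, 7] = 0.6
blob = np.zeros((16, 16)); blob[6:10, 6:10] = 0.6

t_bar = tpc_encode(bar)
t_shift = tpc_encode(np.roll(bar, (3, 5), axis=(0, 1)))
t_blob = tpc_encode(blob)
```

Shifting the stimulus leaves the trace untouched, while a different shape produces a different trace: the lateral spreading converts the stimulus geometry into a temporal signature of the whole population.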
In an experiment in which eight stimuli were used (so that perfect encoding would correspond to 3 bits), the spike latency of a ganglion cell transmitted up to 2 bits of information on a single trial. Indeed, by simply plotting the recorded differential spike latencies as a gray-scale code, Gollisch and Meister (2008) could obtain a rather faithful neural representation of a raw natural image presented to the retina of the animal. A recent study investigated how populations of neurons in area V1 represent time-varying sensory input through a spatio-temporal code (Benucci et al., 2009). The results showed that the population activity is attracted towards the orientations of consecutive stimuli. Therefore, using a simple linear decoder based on the weighted sum of past stimulus orientations, higher cognitive areas can predict with high accuracy the population responses to changes in orientation. A study with monkeys found that in the prefrontal cortex information about stimulus categories is encoded by a temporal population code (Barak et al., 2010). Monkeys were trained to distinguish among different trials, separated by a time delay, of vibro-tactile stimuli applied to their fingertips. Surprisingly, the population state consistently reflected the vibration frequency during the stimulus. Moreover, substantial variations of the population state were observed between the stimulus and the end of the delay. These findings challenge the standard view that information in working memory is exclusively encoded by the spiking activity of dedicated neuronal populations. Population codes have also been associated with oscillatory discrimination of sensory input in a number of species.
In the auditory cortex of birds, the temporal responses of neuronal populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008), and similar principles operate in the early stages of sensory processing in electric fish (Marsat and Maler, 2010). These animals use discharges of electrical signals to communicate courtship (big chirps, around 300 Hz and up to 900 Hz) or aggressive encounters (small chirps, around 100 Hz), which are encoded by different populations of neurons: big chirps are accurately described by a population of pyramidal neurons using a linear representation of their temporal features, whereas small chirps are encoded by the synchronous bursting of populations of pyramidal neurons. In the insect olfactory system, the neurons of the antennal lobe display stimulus-induced temporal modulations of their firing rate in a temporal code (Carlsson et al., 2005; Knusel et al., 2007).

1.4 How can the key components encapsulated in a temporal code be decoded by different cortical areas involved in the visual process?

In recent years, a growing number of results have supported the idea that the brain uses temporal signals, in the form of oscillations, to link ongoing processes across different brain areas (Buzsáki, 2006). The hippocampus and prefrontal cortex (PFC) are structures that make use of these channels of information to communicate with other brain circuits. These areas are hubs of communication that orchestrate the signals originating in many cortical and subcortical areas, promoting cognitive functions such as working memory, memory acquisition and consolidation, and decision making (Benchenane et al., 2011). In a recent study, Battaglia and McNaughton (2011) showed that during memory maintenance PFC oscillations are in phase with hippocampal oscillations; when there are no working-memory demands, however, coherence breaks down and no consistent phase relationship is observed.
Both the hippocampus and the PFC receive converging input from higher sensory/associative areas. Gamma oscillations have been mostly studied and understood in the early and intermediate visual cortex. In primates, these fast oscillations play a key role in the attentional process underlying stimulus selection. Attention causes both firing-rate increases and synchrony changes through a top-down mechanism mediated by the PFC (reviewed by Noudoost et al. (2010) and Tiesinga and Sejnowski (2009a)). For example, the reception of information from higher visual areas such as V4 can be tuned to a V1 column whose receptive field contains a relevant stimulus, resulting in a mechanism for steering visual attention (Womelsdorf et al., 2007). The TPC signal intrinsically carries the visual information through a set of encapsulated oscillations over different frequency ranges. The temporal code is thus in tune with the notion that sensory information travels through the brain in different frequency bands. In this context, if TPC plays a role in encoding global stimulus features in a compact temporal representation, it is relevant to understand what its key coding features are and how the signal can be decoded into different components by higher visual areas through an efficient readout system. In past years, different solutions for reading out the TPC have been proposed. A TPC readout based on the so-called Liquid State Machine, or LSM (Doetsch, 2000), was developed by Knüsel et al. (2004). LSM networks are examples of reservoir computing, where the dense local circuits of the cerebral cortex are implemented as a large set of nearly randomly defined filters. The readout of Knüsel et al. (2004), however, proved inefficient compared to traditional linear decoders based purely on correlations.
Moreover, additional layers of hundreds of integrate-and-fire neurons are required for the readout, and the LSM strategy did not account for the different frequencies of oscillation present in the TPC. Consequently, the readout system has remained an open issue. In this dissertation we follow a different strategy to solve this problem: we propose a biologically plausible readout circuit based on wavelet transforms (Luvizotto et al., 2012c). The underlying hypothesis is that a population of readout neurons tuned to different spectral bands can extract the different content encoded in the temporal signal. Our hypothesis is in line with recent findings suggesting that frequency bands of coherent oscillations constitute spectral fingerprints of the canonical computational mechanisms involved in cognitive processes (Siegel et al., 2012). In the model, we use a synchronous inhibition system oscillating in the gamma frequency range to simulate the effect of a Haar wavelet transform. Over windows of low inhibitory conductance, the TPC signal is integrated and differentiated, producing a band-pass filter response. The frequency response of the output is controlled by varying the period of the synchronous inhibition. In the proposed system, the readout can thus be tuned to specific frequency ranges, for instance the gamma band. In this scenario, interference added by oscillations in other frequency bands can be filtered out, enhancing the quality of the information that is captured. In this sense, the wavelet readout could account for the demultiplexing of different neuronal responses, for example of the motion and orientation signals that are encoded in V1 (Onat et al., 2011). In addition, the mammalian visual system has a remarkable capacity for processing a large amount of visual information within dozens of milliseconds (Thorpe et al., 1996; Fabre-Thorpe et al., 1998). The readout system must therefore be fast, compact and non-redundant.
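The integrate-and-differentiate operation gated by the rhythmic inhibition corresponds to the analysis step of a Haar transform, which can be sketched directly on a toy population trace. The trace, the scaling and the number of levels are illustrative choices of this example; the actual neuronal circuit is the subject of chapter 2:

```python
import numpy as np

def haar_level(x):
    """One level of a (scaled) Haar transform: pairwise averages
    (an 'integrate' path) and pairwise differences (a 'differentiate',
    band-pass path) over consecutive windows of the signal."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / 2.0, (even - odd) / 2.0

def haar_decompose(x, levels=3):
    """Iterate on the low-pass branch, keeping the detail coefficients
    of each level (finest first): a multi-resolution, few-coefficient
    summary of the trace."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, d = haar_level(approx)
        details.append(d)
    return approx, details

# A toy 32-sample 'population trace': a slow envelope plus a fast ripple.
t = np.arange(32)
trace = np.exp(-t / 12.0) + 0.2 * (-1.0) ** t

approx, details = haar_decompose(trace, levels=3)
```

The first detail level isolates the alternating high-frequency ripple, while the 4-sample approximation keeps the slow envelope: a 32-sample trace is summarized by a handful of coefficients, which is the compactness the readout exploits.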
These are well-known advantages of wavelet transforms (Mallat, 1998). The proposed circuit reads out the signal using only a few coefficients. In particular, the simplicity and orthogonality of the Haar wavelet used in the readout circuit lead to a very economical neuronal substrate: the implementation requires only four neurons. In chapter 2, we present the details of the proposed neuronal wavelet readout circuit, together with the methods and results addressing its strengths and limitations.

1.5 Can TPC be used in the recognition of multi-modal sensory input such as human faces and gestures to provide form perception primitives?

In this thesis, the original TPC model was revisited. We fully investigate, under realistic and natural conditions, the capabilities that TPC had previously shown in simulations (Wyss et al., 2003b,a), virtual environments (Wyss and Verschure, 2004b) and a controlled robot arena (Wyss et al., 2006). If the TPC encoding scheme provides a good representation of the visual input, it is relevant to investigate its limits of robustness and reliability in complex real-world environments. Another relevant issue for real-time applications is the computational cost of processing the massive number of recurrent connections in the network: can robustness in the encoding be achieved at a reasonable processing speed? We know from a theoretical point of view that TPC is built around a generic encoding mechanism. Can TPC be used as a unified and canonical model of cortical computation that can be extended to other stimulus modalities to generate perception primitives? Using different and more realistic types of modeled neurons and an overall new architecture configuration, the original model was completely redesigned and optimized. To address these questions, we first exposed the new TPC network to one of the most complex stimuli we are often
exposed to: human faces (Zhao et al., 2003; Martelli et al., 2005; Sirovich and Meytlis, 2009). In chapter 3, we present the details of the face recognition system [1] based on the TPC model that was implemented in the humanoid robot iCub (Luvizotto et al., 2011). To allow for real-time computation we resorted to a standard method of signal processing to develop a novel way of calculating the lateral interactions in the network. The lateral connections were interpreted as discrete finite impulse response (FIR) filters and the spreading of activity was calculated using 2D convolutions. In addition, all convolutions were calculated as multiplications in the frequency domain. This simple signal processing technique yields much faster processing times and makes much larger networks possible in comparison to pre-assigned maps of connectivity stored in memory. The model was benchmarked using images from the iCub's camera and from a standard database, the Yale Face Database B (Georghiades et al., 2001; Lee et al., 2005). This way we could compare the performance of TPC with standard methods of face recognition available in the literature. This work was awarded as one of the top-five papers presented at the 2011 edition of the IEEE International Conference on Robotics and Biomimetics (ROBIO). Another contribution of this work is a complete TPC library, developed and made publicly available [2], that can be freely used by the robotics community. With the face recognition results, we could address the questions regarding the robustness of TPC in real-time and real-world scenarios. To finally address the generality of TPC, we designed two more test scenarios involving different tasks and a different robot platform. In chapter 4, we extended the experiments performed with faces on the iCub to gesture recognition.
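The frequency-domain shortcut described above follows from the convolution theorem. A minimal NumPy sketch (not the thesis' C++ implementation; the circular boundary handling and the kernel shape are illustrative assumptions) of computing a 2D convolution as a pointwise multiplication of spectra:

```python
import numpy as np

def conv2d_fft(activity, kernel):
    """Circular 2D convolution computed as a multiplication in the
    frequency domain: conv(a, k) = IFFT(FFT(a) * FFT(k))."""
    H, W = activity.shape
    # Zero-pad the kernel to the map size so both spectra have equal shape.
    K = np.zeros((H, W))
    kh, kw = kernel.shape
    K[:kh, :kw] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(activity) * np.fft.fft2(K)))
```

For an N x N map this costs O(N^2 log N) per step instead of looping over every stored lateral connection, which is the gain the chapter refers to.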
The experiments were motivated by the game Rock-Paper-Scissors. In this human-robot interaction scenario, the robot has to recognize the gesture thrown by the human to reveal who is the winner. The results show that, with the same network, the robot can reliably recognize the gestures used in the game. Furthermore, the wavelet circuit was fully integrated into the C++ library, showing a remarkable gain in recognition rate versus compression.

[1] iCub never forgets a face: a video of the face recognition system on iCub. http://www.robotcompanions.eu/blog/2012/01/2820/
[2] efaa.sourceforge.net

To further stress the generality of TPC, in chapter 5 the proposed model is implemented in a mobile robot and used for navigation. Based on the camera input, the model represents distance relationships of the robot over different positions in a real-world arena. The TPC representation feeds a hippocampus model that accounts for the formation of place fields (Luvizotto et al., 2012b). These results finally demonstrate that the TPC encoding is fully generic: the same network implementation can feed a working memory with different stimulus classes without any need of rewiring, fast and invariant to most transformations found in real-world environments.

Chapter 2
A wavelet based neural model to optimize and read out a temporal population code

It has been proposed that the dense excitatory local connectivity of the neo-cortex plays a specific role in the transformation of spatial stimulus information into a temporal representation, or temporal population code (TPC). TPC provides for a rapid, robust and high-capacity encoding of salient stimulus features with respect to position, rotation and distortion. The TPC hypothesis gives a functional interpretation to a core feature of the cortical anatomy: its dense local and sparse long-range connectivity. Thus far, the question of how the TPC encoding can be decoded in downstream areas has not been addressed.
Here, we present a neural circuit that decodes the spectral properties of the TPC using a biologically plausible implementation of a Haar transform. We perform a systematic investigation of our model in a recognition task using a standardized stimulus set. We consider alternative implementations using either regular spiking or bursting neurons and a range of spectral bands. Our results show that our wavelet readout circuit provides for the robust decoding of the TPC and further compresses the code without losing speed or quality of decoding. We show that in the TPC signal the relevant stimulus information is present in the frequencies around 100 Hz. Our results show that the TPC is constructed around a small number of coding components that can be well decoded by wavelet coefficients in a neuronal implementation. The solution to the TPC decoding problem proposed here suggests that cortical processing streams might well consist of sequential operations where spatio-temporal transformations at lower levels form a compact stimulus encoding using TPC that is subsequently decoded back to a spatial representation using wavelet transforms. In addition, the results presented here show that different properties of the stimulus might be transmitted to further processing stages using different frequency components that are captured by appropriately tuned wavelet based decoders.

2.1 Introduction

The encoding of sensory stimuli requires robust compression of salient features (Hung et al., 2005). This compression must support representations of the stimulus that are invariant to a range of transformations caused, in the case of vision, by varying viewing angles, different scene configurations and deformations. Invariances and compression of information can be achieved by moving across different representation domains, i.e. from spatial to temporal representations.
In earlier work we proposed an encoding paradigm that makes use of this strategy, called the temporal population code (TPC) (Wyss et al., 2003a). In this approach the input stimulus is topographically projected onto a network of neurons organized in a two-dimensional Cartesian space with dense local connectivity. The output of the network is a compressed representation of the stimulus captured in the temporal evolution of the population spike activity. The space-to-time transformation of TPC provides for a high-capacity encoding, invariant to position and image deformations, that has been successfully applied to real-world tasks such as hand-written character recognition (Wyss et al., 2003c), spatial navigation (Wyss and Verschure, 2004b) and face recognition in a humanoid robot (Luvizotto et al., 2011). TPC shows that the dense excitatory local connectivity found in the primary sensory areas of the mammalian neo-cortex can play a specific role in the rapid and robust transformation and compression of spatial stimulus information that can be transmitted over a small number of projections to subsequent areas. This wiring scheme is consistent with the anatomy of the neo-cortex, where about 95% of all connections found in a cortical volume also originate in it (Sporns and Zwi, 2004). In classical models of visual perception, invariant representations emerge in the form of activity patterns at the highest level of a hierarchical multi-layer network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and Poggio, 1999; Serre et al., 2007). In this approach, invariances are achieved at the cost of increasing the number of connections between the different layers of the hierarchy. These models are thus based on a fundamental assumption that is not consistent with cortical anatomy.
In comparison to these hierarchical models of object recognition, the TPC architecture has the significant advantage of being both compact and wiring independent, thus providing for the multiplexing of information. Recently, direct support for the TPC as a substrate for stimulus encoding has been found in a number of physiological studies. For instance, the physiology of the mammalian visual system shows dynamics consistent with the TPC in orientation discrimination (Samonds and Bonds, 2004; Benucci et al., 2009; MacEvoy et al., 2009) and spatial selectivity regulation (Benucci et al., 2007), in particular showing stimulus-specific modulation of the phase relationships among active neurons. In the bird auditory system, the temporal responses of neuron populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008). Similarly, in the inferior temporal and prefrontal cortices, information about stimulus categories is encoded by the temporal response of populations of neurons (Meyers et al., 2008; Barak et al., 2010). Signatures of the TPC have also been found in the insect olfactory system, where the glomeruli and the projection neurons of the antennal lobe display stimulus-induced temporal modulations of their firing rate at a scale of hundreds of milliseconds (Carlsson et al., 2005; Knüsel et al., 2007). If the TPC plays a role in stimulus encoding it is relevant to understand what its key coding features are and how these features can be subsequently decoded in areas downstream from the encoder. The readout by the decoder must be fast and compact, extracting the key characteristics of the original input stimulus in a compressed way. These key features must be captured in a non-redundant fashion so that prototypes of a class can emerge and be efficiently stored in memory and/or serve on-going action.
In a hierarchical model of sensory processing based on the notion of TPC, the encoded temporal information provided by primary areas is mapped back onto the spatial domain, allowing higher order structures to further process the stimulus. Hence, a TPC decoder is required to generate a spatially structured response from the temporal population code of the encoder. Taking these requirements into account, our question is how a cortical circuit can retrieve the features encapsulated in the TPC. In past years, different strategies for decoding temporal information have been suggested. A recent proposal is the so-called Liquid State Machine, or LSM (Doetsch, 2000), which is an example of a larger class of models also called reservoir computing (Lukoševičius and Jaeger, 2009). In this approach the dense local circuits of the cerebral cortex are seen as implementing a large set of practically randomly defined filters. When applied to reading out the TPC, we have reported a lower performance as compared to using Euclidean distance, as a result of the LSM's noise sensitivity (Knüsel et al., 2004). In addition to being less effective than a linear decoder, LSM is computationally expensive, requiring an additional layer of hundreds of integrate-and-fire neurons, while its performance strongly depends on the specific parameter settings, which compromises generality. Given that TPC is consistent with current physiology, we want to know whether an alternative approach can be defined that is more tuned to the specific properties of the TPC, i.e. its temporal structure. A readout mechanism for temporal codes such as TPC could also be based on an analysis of the temporal signal over different frequency bands and resolutions. A population of readout neurons tuned to different spectral bands could possibly implement such a readout stage.
In this scheme, the temporal information of TPC is mapped back into a spatial representation by cells responsive to different frequency bands and thus to the spectral properties of their inputs. A suitable framework for modeling such a readout stage is the wavelet decomposition: a spectral analysis technique that divides the frequency spectrum into a desired number of bands using variable-sized regions (Mallat, 1998). Higher processing stages in the neo-cortex could make use of such a scheme in order to capture information compressed and multiplexed in different frequency bands by preceding areas. The wavelet transform is a biologically plausible candidate and has already been extensively used for modeling cortical circuits in different areas (Stevens, 2004; Chi et al., 2005). The classic description of image processing in V1 is based on a two-dimensional Gabor wavelet transform (Daugman, 1980). Recently, two alternative wavelet-based models approximating the receptive field properties of V1 neurons in the discrete domain have been proposed, which show additional desirable features such as orthogonality (Saul, 2008; Willmore et al., 2008). A one-dimensional wavelet transform can be interpreted as a strategy for reading out the different spectral components of the TPC that is equivalent to the wavelet-based models of V1 receptive fields (Jones and Palmer, 1987; Ringach, 2002), thus providing a general encoding-decoding model that can be generalized to the whole neo-cortex given its relatively uniform anatomical organization. Furthermore, from both representation and implementation perspectives, orthogonal wavelets are a compact way of decomposing a signal where the frequency spectrum is divided in a dyadic manner: at each resolution level of the filtering process a new frequency band emerges, represented by half of the wavelet coefficients present in the previous resolution level.
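This dyadic halving can be made concrete with a minimal Haar decomposition (a sketch; the 1/sqrt(2) factor is the standard orthonormal normalization, assumed here):

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: pairwise sums give the low-frequency
    approximation, pairwise differences the high-frequency detail;
    each output has half the length of the input."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # scale function phi (low-pass)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # wavelet function psi (high-pass)
    return a, d

def haar_decompose(x, levels):
    """Dyadic multiresolution decomposition: at each level a new detail
    band emerges with half the coefficients of the previous level."""
    details = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_step(a)
        details.append(d)
    return a, details
```

Because the Haar basis is orthonormal, the signal energy is exactly preserved across the (ever smaller) coefficient sets, which is what makes the representation compact without loss.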
This meets one of the fundamental requirements of an efficient readout system: compactness. Here, we combine the encoding mechanism of the TPC with a decoder that is based on a one-dimensional, orthogonal and discrete wavelet transform implemented by a biologically plausible circuit. We show that the information provided by the TPC generated at an earlier neuronal processing level can be decoded in a compressed way by this wavelet readout circuit. Furthermore, we show that these wavelet transforms can be performed by a plausible neuronal mechanism that implements the so-called Haar wavelet (Haar, 1911; Papageorgiou et al., 1998; Viola and Jones, 2001). The simplicity and orthogonality of the Haar wavelet makes this readout process fast and compact, in an implementation that requires only four neurons. To investigate the validity of our hypothesis we first define a baseline for benchmarking the network's performance in a classification task. The benchmarking is done using a stimulus set of images based on artificially generated geometric shapes. To test the readout performance we evaluate how the information extracted across different sets of wavelet coefficients, covering orthogonal regions of the frequency spectrum, influences classification performance. The simulations are performed using two types of implementations: regular spiking and bursting neurons. We consider these two types of models in order to address the effects of spiking dynamics on the encoding and decoding performance of the model. These two types of spiking behavior have also been observed in V1 pyramidal neurons (Hanganu et al., 2006; Iurilli et al., 2011). We also investigate the speed of encoding-decoding of the proposed wavelet based circuit in comparison to the method used in previous studies of TPC, which is based purely on linear classifiers.
In the last experiments, we explore the generality that the wavelet coefficients hold in forming prototypical representations of an object class that can be stored in a working memory for fast object recognition tasks. In particular, we are concerned with the question of how high-level information generated by sensory processing streams can be flexibly stored and retrieved in long-term and working memory systems (Verschure et al., 2003; Duff et al., 2011). One option for the memory storage problem would be a labelled line code where specific axon/synapse complexes are dedicated to specific stimuli and their components (Chandrashekar et al., 2006; Nieder and Merten, 2007). This approach, however, faces capacity limitations both in the amount of information stored and in the physical location where it can be processed. Alternatively, a purely temporal code such as TPC would in this respect be independent of the spatial organization of the physical substrate and allow the multiplexing of high-level information. We show that this latter scenario is feasible and can be realized with simple biologically plausible neuronal components. Our results suggest that sensory processing hierarchies might well comprise sequences of spatio-temporal transformations that encode combinations of local stimulus features into perceptual classes using sequences of TPC encoding and wavelet decoding back to a spatial domain.

2.2 Material and methods

The model is divided into two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 2.1) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray-scale image that covers the visual field.
The approximation of the receptive field characteristics is done by convolving the input image with a difference-of-Gaussians (DoG) operator followed by a positive half-wave rectification. The positively rectified DoG operator resembles the properties of ON center-surround LGN cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and performs an edge enhancement of the input image. In the simulations we use a kernel ratio of 4:1, with a size of 10x10 pixels and variance σ = 1.5 (for the smaller Gaussian).

Figure 2.1: The TPC encoding model. In a first step, the input image is projected to the LGN stage where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC.

The LGN signal is projected onto the V1 spiking model, whose coding concept is illustrated in Fig. 2.2. The network is an array of NxN model neurons connected to a circular neighborhood with synapses of equal strength and instantaneous excitatory conductance. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons. The stimulus is continuously presented to the network and the spatially integrated spreading activity of the V1 units, as a sum of their action potentials, results in the so-called TPC signal.
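The LGN stage described above can be sketched as follows (an illustrative NumPy version; reading the 4:1 kernel ratio as the surround Gaussian being four times wider than the center one is an assumption):

```python
import numpy as np

def dog_kernel(size=10, sigma_c=1.5, ratio=4.0):
    """Difference-of-Gaussians kernel: a narrow center Gaussian minus a
    surround Gaussian `ratio` times wider (assumed reading of the 4:1 ratio)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    g = lambda s: np.exp(-r2 / (2.0 * s**2)) / (2.0 * np.pi * s**2)
    return g(sigma_c) - g(ratio * sigma_c)

def lgn_response(image, kernel):
    """Valid-region DoG filtering followed by positive half-wave
    rectification, mimicking ON center-surround responses."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel[::-1, ::-1])
    return np.maximum(out, 0.0)   # keep only positive (ON) responses
```

The rectified output emphasizes local intensity transitions, i.e. the edge enhancement that feeds the Gabor stage.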
In the network, each neuron is approximated using the spiking model proposed by Izhikevich (Izhikevich, 2003). These model neurons are biologically plausible and as computationally efficient as integrate-and-fire models (Izhikevich, 2004). Relying on only four parameters, our network can reproduce both regular spiking (RS) and bursting (BS) behavior using a system of ordinary differential equations of the form:

v' = 0.04v^2 + 5v + 140 - u + I    (2.1)
u' = a(bv - u)    (2.2)

with the auxiliary after-spike resetting:

if v ≥ 30 mV, then v ← c and u ← u + d    (2.3)

Figure 2.2: The TPC encoding paradigm. The stimulus, here represented by a star, is projected topographically onto a map of interconnected cortical neurons. When a neuron spikes, its action potential is distributed over a neighborhood of a given radius. The lateral transmission delay of these connections is 1 ms/unit. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC representation is defined by the spatial average of the population activity over a certain time window. The invariances that the TPC encoding renders are defined by the local excitatory connections.

Here, v and u are dimensionless variables, a, b, c and d are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, and ' = d/dt, where t is time. The parameter a describes the time scale of the recovery variable u. The parameter b describes the sensitivity of the recovery variable u to the sub-threshold fluctuations of the membrane potential v.
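Equations 2.1-2.3 can be integrated in a few lines (a single-neuron sketch with constant drive I = gi = 20 and 1 ms Euler steps, as in the simulations; the RS and BS parameter values follow table 2.1):

```python
def izhikevich(I, T=128, a=0.02, b=0.2, c=-65.0, d=8.0, v0=-70.0, u0=-16.0):
    """Euler integration (1 ms steps) of the Izhikevich model, eqs. 2.1-2.3.
    RS: c = -65, d = 8; BS: c = -55, d = 4. Returns the spike times (ms)."""
    v, u = v0, u0
    spikes = []
    for t in range(T):
        if v >= 30.0:                 # after-spike resetting (eq. 2.3)
            v, u = c, u + d
            spikes.append(t)
        v += 0.04 * v**2 + 5.0 * v + 140.0 - u + I   # eq. 2.1
        u += a * (b * v - u)                          # eq. 2.2
    return spikes
```

Calling `izhikevich(20.0)` yields the regular spiking train; `izhikevich(20.0, c=-55.0, d=4.0)` yields the bursting pattern of Fig. 2.3.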
The parameter c accounts for the after-spike reset value of v caused by fast high-threshold K+ conductances, and d for the after-spike reset of the recovery variable u caused by slow high-threshold Na+ and K+ conductances. The mathematical analysis of the model can be found in (Izhikevich, 2006). The excitatory input I in eq. 2.1 consists of two components: first, a constant driving excitatory input gi and, second, the synaptic conductances given by the lateral interaction of the units, gc(t). So

I(t) = gi + gc(t)    (2.4)

For the simulations, we used the parameters suggested in (Izhikevich, 2004) to reproduce RS and BS spiking behavior (Fig. 2.3). All the parameters used in the simulations are summarized in table 2.1. The network architecture is composed of 24 populations of orientation selective neurons where a bank of Gabor filters is used to reproduce the characteristics of V1 receptive fields (Fig. 2.1). The filters are divided into layers of eight orientations Θ ∈ {0, π/8, 2π/8, ..., π} and three scales denoted by δ. The distance of the central frequency between the scales is 1/2 octave, with a maximum frequency Fmax = 1/10 cycles/pixel. The convolution with G(δ,Θ) is computed at each time step and the output is truncated according to a threshold Ti ∈ [0, 1], where the values above Ti are set to a constant driving excitatory input gi. Each unit can be characterized by its orientation selectivity angle Θ, its scale δ and a two-dimensional vector x ∈ R^2 specifying the location of its receptive field center within the input plane, so a column is denoted by u(x, Θ, δ). The lateral connectivity between V1 units is exclusively excitatory, with strength w. A unit ua(x, Θ, δ) connects with ub if all of the following conditions are met:

1. they are in the same population: Θa = Θb and δa = δb
2. they have a different center position: xa ≠ xb
3.
they are within a region of a certain radius: ‖xb − xa‖ < r

According to recent physiological studies, intrinsic V1 intra-cortical connections cover distances that represent regions of the visual space up to eight times the size of single receptive fields of V1 neurons (Stettler et al., 2002). In our model we set the connectivity radius r to 7 units. The lateral synapses are of equal strength w and the transmission delays τ are proportional to ‖xb − xa‖, with 1 ms/cell. The TPC is generated by summing the network activity in a time window of 128 ms. Finally, the output TPC vectors from the different layers of orientation and scale are read out by the proposed wavelet circuit, forming the decoded TPC vector used for the statistical analysis. In discrete time, all equations are integrated with Euler's method using a temporal resolution of 1 ms.

Neuronal wavelet circuit

The proposed neuronal wavelet circuit is based on the discrete multiresolution decomposition, where each resolution level reflects a different spectral range, and uses the Haar wavelet as basis (Mallat, 1998). Approximation coefficients (AC) are the high-scale, low-frequency components of the signal spectrum, obtained by convolving the signal with the scale function φ. Detail coefficients (DC) are the low-scale, high-frequency components, given by the wavelet function ψ. Each component has a time resolution matched to the wavelet scale, which works as a filter.
The Haar wavelet ψ at time t is defined as:

ψ(t) = 1 if 0 ≤ t < 1/2; −1 if 1/2 < t ≤ 1; 0 otherwise    (2.5)

and its associated scale function φ as:

φ(t) = 1 if 0 ≤ t < 1; 0 otherwise    (2.6)

Variable | Description                                 | Value
N        | network dimension                           | 80x80 neurons
a        | scale of recovery                           | 0.02
b        | sensitivity of recovery                     | 0.2
c_rs     | after-spike reset value of v for RS neurons | -65
c_bs     | after-spike reset value of v for BS neurons | -55
d_rs     | after-spike reset value of u for RS neurons | 8
d_bs     | after-spike reset value of u for BS neurons | 4
v        | membrane potential rest                     | -70
u        | membrane recovery rest                      | -16
gi       | excitatory input conductance                | 20
Ti       | minimum V1 input threshold                  | 0.4
r        | lateral connectivity radius                 | 7 units
w        | synapse strength                            | 0.4

Table 2.1: Parameters used for the simulations.

In a biologically plausible implementation, the wavelet decomposition can be performed based on the activity of two short-term buffer cells B1 and B2 inhibited by an asymmetric delayed connection from cell A (Fig. 2.4a). The buffer cells integrate rapid changes over a certain amount of time, analogous to the scale function φ from eq. 2.6. In our model, the buffer cells are modeled as discrete low-pass finite impulse response (FIR) filters; they are equivalent to the scale function φ in the discrete domain. Buffer cells have been reported recently in other cortical areas such as the prefrontal cortex (Koene and Hasselmo, 2005; Sidiropoulou et al., 2009).

Figure 2.3: Computational properties of the two types of neurons used in the simulations: regular (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz).
In our model, the buffer cells receive an inhibitory input from A that defines the integration envelope. The inhibition extends over a time period t, with a phase shift of π/2 between A and B2. Therefore, B1 and B2 have their integration profiles shifted in time by t. When the inhibition is synchronized in time with the integration profile of the buffer cells, the period 2t determines the low-frequency cutoff, or the resolution level l associated with the Haar scale function φ, eq. 2.6 (low-pass filter). If both B1 and B2 are excitatory, the projection to cell W gives rise to the approximation level Al. On the other hand, if one buffer cell is inhibitory, as in the example of Fig. 2.4a, the detail level Dl+1 is obtained by cell W, as performed by the Haar wavelet function itself (eq. 2.5). In our model, the inhibition is modeled as discrete high-pass FIR filters. The combination of low-pass and high-pass filters in cascade produces band-pass filters.

Figure 2.4: a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate the network activity in time, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit where only the frequency range corresponding to the resolution level Dc3 is kept in the spectrum.
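The circuit's operation can be sketched as two windowed integrations over adjacent intervals followed by their sum or difference (a minimal sketch assuming rectangular integration windows; `window` plays the role of the inhibition period t):

```python
import numpy as np

def wavelet_readout(tpc, window):
    """Readout of a TPC trace by the two-buffer-cell circuit: B1 and B2
    integrate adjacent windows of length `window` (B2 shifted in phase by
    one window); cell W combines them additively when both projections are
    excitatory (approximation Al) or subtractively when one is inhibitory
    (detail Dl+1)."""
    n = (len(tpc) // (2 * window)) * 2 * window
    x = np.asarray(tpc[:n], dtype=float).reshape(-1, 2 * window)
    b1 = x[:, :window].sum(axis=1)   # first buffer cell's integration
    b2 = x[:, window:].sum(axis=1)   # second buffer, shifted by `window`
    approx = b1 + b2                 # both projections excitatory -> Al
    detail = b1 - b2                 # one projection inhibitory  -> Dl+1
    return approx, detail
```

Halving `window` doubles the band's center frequency, which is how varying the inhibition time tunes the readout to, for instance, the gamma range.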
Therefore, the readout can be optimized for specific ranges of the frequency spectrum (Fig. 2.4b).

Stimulus set

The stimulus set is based on abstract geometric forms, as used previously (Wyss et al., 2003c). On a circular path with a diameter of 40 pixels, 5 uniformly distributed vertices can connect to each other with equal probability, defining the shape of a stimulus class (Fig. 2.5). The different objects forming a class are generated by jittering the position of the vertices and the default line thickness of 4 pixels. We defined a total of 10 classes for the experiments, with 50 exemplars per class. For the experiments we subdivide the data set into three subsets of increasing complexity by varying the amount of jitter in the vertices' position and in the thickness of the connecting line segments. The values of the jitter are given by uniformly distributed random factors with zero mean and standard deviation equal to 0.03, 0.04 and 0.05 for the vertices' position and 0.018, 0.021 and 0.025 for the thickness of each subset, respectively. We used subset 1, with 50 stimuli per class, as training and classification set. In the case where subset 1 is used for classification, 50% of the stimuli are randomly assigned as training set and the other part is used for classification. Subsets 2 and 3 are only used as classification sets; in this case subset 1 is entirely used for training. Therefore, training stimuli are not used for classification and vice versa. For estimating the degree of similarity among the images we used the normalized Euclidean distance between the pixels. The normalization is done as follows: a given stimulus has distance equal to zero if it is equal to its class prototype, and one if it is the globally most distant exemplar over all subsets. The image prototype is defined by the five vertices that define the geometry of a class with no jitter applied.
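The stimulus generation described above can be sketched as follows (illustrative assumptions: the connection probability is taken as 0.5 and the jitter factors are applied multiplicatively; the text states only "equal probability" and zero-mean uniform factors):

```python
import numpy as np

def make_class(rng, n_vertices=5, diameter=40.0):
    """A stimulus class: n_vertices uniformly spaced on a circular path of
    the given diameter; each vertex pair is connected with equal probability
    (p = 0.5 here, an assumption)."""
    angles = 2.0 * np.pi * np.arange(n_vertices) / n_vertices
    verts = (diameter / 2.0) * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    edges = [(i, j) for i in range(n_vertices) for j in range(i + 1, n_vertices)
             if rng.random() < 0.5]
    return verts, edges

def make_exemplar(rng, verts, pos_std=0.03, thick_std=0.018, thickness=4.0):
    """An exemplar: jitter positions and line thickness with zero-mean
    uniform factors of the stated standard deviation (a uniform [-a, a]
    variable has standard deviation a / sqrt(3))."""
    a_pos = pos_std * np.sqrt(3.0)
    a_thk = thick_std * np.sqrt(3.0)
    jittered = verts * (1.0 + rng.uniform(-a_pos, a_pos, size=verts.shape))
    return jittered, thickness * (1.0 + rng.uniform(-a_thk, a_thk))
```

Rasterizing the jittered line segments at the sampled thickness then yields one exemplar image; increasing the jitter parameters produces subsets 2 and 3.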
Figure 2.5: The stimulus classes used in the experiments after the edge enhancement of the LGN stage.

Cluster algorithm

For the classification we used the following algorithm. The network's responses to stimuli from C stimulus classes S1, S2, ..., SC are assigned to C response classes R1, R2, ..., RC of the training set, yielding a C x C hit matrix N(Sα, Rβ), whose entries denote the number of times that a stimulus from class Sα elicits a response in class Rβ. Initially, the matrix N(Sα, Rβ) is set to zero. For each response r ∈ Sα, we calculate the Euclidean distance of r to the responses r' ≠ r elicited by stimuli of class Sγ:

ρ(r, Sγ) = ⟨ρ(r, r')⟩ over all r' elicited by Sγ    (2.7)

where ⟨·⟩ denotes the average of the temporal Euclidean distances ρ(r, r') between r and r'. The response r is classified into the response class Rβ for which ρ(r, Sβ) is minimal, and N(Sα, Rβ) is incremented by one. The overall classification ratio, in percent, is calculated by summing the diagonal of N(Sα, Rβ) and dividing by the total number of elements in the classification set R. We chose the same metric that was used in previous studies to establish a direct comparison between the results over different scenarios.

2.3 Results

We start by analyzing the properties of the proposed stimulus set, detailed in section 2.2. Then, we run network simulations in order to establish a baseline classification ratio in a stimulus detection task. In the following step we use the wavelet circuit to read out the TPC signal over different frequency bands and compare the classification results to the previously established baseline. In order to address the effect of spiking modality on the decoding mechanism we run separate simulations with two different kinds of neurons: regular and burst spiking (eqs. 2.1 and 2.2).
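The cluster algorithm can be sketched directly from eq. 2.7 (a minimal version; since training and classification stimuli are disjoint, the exclusion r' ≠ r is implicit):

```python
import numpy as np

def classify(train, train_labels, test, test_labels, n_classes):
    """Nearest-class classifier of section 2.2: each test response is
    assigned to the class whose training responses lie at the smallest mean
    Euclidean distance (eq. 2.7); hits accumulate in the C x C matrix N."""
    hit = np.zeros((n_classes, n_classes), dtype=int)
    for r, s_alpha in zip(test, test_labels):
        rho = [np.mean([np.linalg.norm(r - rp)
                        for rp, g in zip(train, train_labels) if g == gamma])
               for gamma in range(n_classes)]
        r_beta = int(np.argmin(rho))       # class with minimal rho(r, S_beta)
        hit[s_alpha, r_beta] += 1
    ratio = 100.0 * np.trace(hit) / len(test)   # diagonal / set size, in %
    return hit, ratio
```

Off-diagonal entries of the hit matrix show which classes are confused with one another, which is lost in the single overall ratio.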
In the subsequent experiment, we investigate how the speed of encoding is affected by reading out the TPC using a neuronal implementation of the wavelet circuit. Subsequently, we show how the dense wavelet representation provided by our decoding circuit can be used to create class prototypes that can flexibly define the content of a memory system. Finally, we perform an analysis of the similarity relationships between the TPC encoding and the representation of the stimuli in the wavelet and in the spatial domain respectively. The model parameters used for all simulations are specified in section 2.2 and in table 2.1. Stimulus set similarity properties We use an algorithmic approach to parametrically define our stimulus classes. Every class is defined around a prototype (Fig. 2.6a, upper row). We measure how similar the exemplars from the classification sets are to the respective image class prototype (see methods, section 2.2). The median Euclidean distances of stimulus sets 1 to 3 are 0.59, 0.64 and 0.70 respectively. This increasing median translates into an increasing difficulty in the classification of the stimulus sets. Figure 2.6: The stimulus set. a) Image-based prototypes (no jitter applied to the vertices) and the globally most different exemplars, with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histograms of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain, for subsets 1 to 3.
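The normalized similarity measure used above can be sketched as follows; `d_max` stands for the distance of the globally most distant exemplar over all subsets, and the function name is illustrative.

```python
import numpy as np

def normalized_distances(exemplars, prototype, d_max=None):
    """Euclidean pixel distance of each exemplar to its class prototype,
    scaled so that the prototype maps to 0 and the globally most distant
    exemplar (d_max, taken over all subsets) maps to 1."""
    d = np.array([np.linalg.norm(np.asarray(e) - np.asarray(prototype))
                  for e in exemplars])
    return d / (d.max() if d_max is None else d_max)
```

The medians of these distances for the three subsets (0.59, 0.64 and 0.70) quantify the increasing classification difficulty.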
Baseline and network performance As a reference for further experiments we first establish a baseline classification ratio. The baseline is defined by applying the clustering algorithm described previously (section 2.2) directly in the spatial domain of the stimulus set. In this scenario, the classification is performed over the pixel intensities of the centered and edge-enhanced images (Fig. 2.7). As expected, the classification performance decreases with the increasing geometric variability of the three subsets used in the experiments. For subsets one, two and three the classification ratio reaches 91%, 88% and 82% respectively. Figure 2.7: Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain. Wavelet circuit readout We now want to assess the classification performance and compression capacity of the wavelet circuit we have proposed (section 2.2). We consider a range of frequency resolutions in a dyadic manner using the wavelet resolution levels Ac5, corresponding to 0 to 15.5 Hz, Dc5 from 15.5 to 31 Hz, Dc4 from 31 to 62 Hz, Dc3 from 62 to 125 Hz, Dc2 from 125 to 250 Hz and finally Dc1 from 250 to 500 Hz. In the simulations, we increase the inhibition and integration time of the cells A, B1 and B2 (Fig. 2.4) in order to explore the classification performance of the network in the stimulus classification task. The results using RS neurons show that the classification performance has a peak in the frequency range from 62 Hz to 125 Hz, equivalent to the so-called Dc3 level in the wavelet domain, where 91%, 83% and 74% of the TPC encoded stimuli are correctly classified for subsets one, two and three respectively (Fig. 2.8).
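The dyadic band structure described above can be reproduced with a plain Haar decomposition. The following sketch is not the neuronal circuit itself; it only shows how a 128-sample TPC trace sampled at 1 kHz splits into the detail levels Dc1..Dc5 (64, 32, 16, 8 and 4 coefficients) plus the 4-coefficient approximation Ac5.

```python
import numpy as np

def haar_dwt(signal, levels=5):
    """One-dimensional Haar wavelet decomposition. Detail level Dck spans
    roughly fs/2**(k+1) .. fs/2**k Hz; with fs = 1 kHz, Dc3 covers the
    62.5-125 Hz band and Dc5 the 15.5-31 Hz band quoted in the text."""
    a = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        details.append((even - odd) / np.sqrt(2.0))   # detail (high-pass)
        a = (even + odd) / np.sqrt(2.0)               # approximation (low-pass)
    return a, details                                 # Ac5 and [Dc1, ..., Dc5]

x = np.sin(2 * np.pi * 100 * np.arange(128) / 1000.0)  # 100 Hz tone, fs = 1 kHz
ac5, dcs = haar_dwt(x)
lengths = [len(d) for d in dcs]                        # [64, 32, 16, 8, 4]
```

With the 24 network layers of the experiments, keeping only the four Dc5 coefficients per layer leaves 24 x 4 = 96 values against the 80 x 80 = 6400 input pixels, which is the compression of about 66 times discussed in the text.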
In comparison, using BS neurons the classification performance has a peak in the frequency range from 15.5 Hz to 31 Hz, equivalent to the so-called Dc5 level in the wavelet domain, where 92%, 82% and 74% of the responses are correctly classified for subsets one, two and three respectively (Fig. 2.8). Reading out the TPC without the wavelet circuit, i.e. using the integrated spiking activity over time without the wavelet representation, we achieve a classification ratio for subsets one, two and three of 88%, 79% and 75% for RS neurons and 87%, 80% and 74% for BS neurons. Thus, the wavelet circuit adds a marginal improvement to the readout of the TPC signals as compared to the control condition for the RS neurons, in particular for the easier stimulus set, while the BS version of the model does not show a marked increase in classification performance. Figure 2.8: Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters for both types of neurons, RS and BS. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multi-resolution scheme (4 to 64 coefficients per level). The network time window is 128 ms. However, while maintaining nearly the same classification performance, the dyadic property of the discrete wavelet transform compresses the length of the temporal signal by a factor of 8 and 32 using the Dc3 and Dc5 levels for RS and BS neurons respectively. Thus the information encoded over the 128 ms of stimulus presentation is captured, in compressed form, by only a few wavelet coefficients. In comparison with the benchmark results (Fig.
2.3), the wavelet circuit readout provides a slightly lower classification ratio. However, if we look at the numbers for the BS network, the TPC wavelet readout can generate a reliable representation of the input image with 24x4 coefficients. Compared with the 80x80 pixels of the original input image used for the benchmark, this is an extreme compression: the compression factor is about 66 times, without a significant loss in classification performance. Therefore the wavelet coefficients provide a compact representation of the stimuli in a specific region of the frequency spectrum. Classification speed A key feature of TPC is the encoding speed. It has been shown in previous TPC studies that the speed of encoding is compatible with the speed of processing observed in the mammalian visual system (Thorpe et al., 1996; Wyss et al., 2003c; Töllner et al., 2011). Here we investigate how fast the information transmitted by the TPC is captured by the wavelet coefficients. We use the mutual information measure (Victor and Purpura, 1999) to quantify the amount of transmitted information for a varying length of the signal interval of all the 24 network layers used to calculate the Euclidean distance (Fig. 2.9). The mutual information calculation is performed using the wavelet coefficients generated by the readout circuit. We also compare the speed of encoding of the TPC signal in the temporal domain against the readout version based on the wavelet coefficients. For RS neurons, the wavelet coefficients that lead to maximum classification performance are localized in the frequency interval from 62 Hz to 125 Hz, equivalent to the resolution level Dc3. For BS neurons, the frequency interval is in a lower range, from 15.5 Hz to 31 Hz, equivalent to the resolution level Dc5. In these frequency ranges the maximum classification performance is achieved, as shown in the previous section (Fig. 2.8).
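Bit counts of this kind are commonly obtained with a confusion-matrix estimator of stimulus-response mutual information. The sketch below takes the hit matrix N(Sα, Rβ) of the cluster algorithm as input; whether this exactly matches the estimator of the cited study is an assumption.

```python
import numpy as np

def mutual_information_bits(hit):
    """Stimulus-response mutual information (in bits) from a hit matrix
    N[s, r]: I = sum over (s, r) of p(s,r) * log2(p(s,r) / (p(s) p(r)))."""
    hit = np.asarray(hit, dtype=float)
    p = hit / hit.sum()
    ps = p.sum(axis=1, keepdims=True)     # stimulus marginal p(s)
    pr = p.sum(axis=0, keepdims=True)     # response marginal p(r)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / (ps * pr)), 0.0)
    return float(terms.sum())
```

For ten equiprobable classes, a perfectly diagonal hit matrix yields log2(10), about 3.3 bits, which is the ceiling of the curves in Fig. 2.9.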
Figure 2.9: Speed of encoding. Number of bits encoded by the network's activity trace as a function of time, for subsets 1 to 3. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. The RS-Wav and BS-Wav curves correspond to the bits encoded by the wavelet coefficients using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons respectively. For a time window of 128 ms the Dc3 level has 16 coefficients and the Dc5 level has only 4 coefficients. The dots in the figure represent the moments in time where the coefficients are generated. We observe that the number of bits encoded over the time window of 128 ms is nearly the same when comparing the non-filtered TPC signals (RS-TPC and BS-TPC, Fig. 2.9) and the signals captured by the wavelet readout (RS-Wav and BS-Wav, Fig. 2.9). However, in the case of the BS neurons the speed of encoding is slower when the signal is decoded by the wavelet circuit. This effect is due to the longer time constant the buffer cells B1 and B2 need to integrate and differentiate the signal at this resolution level and thus to compute the wavelet coefficients: the buffer cells need 32 ms to compute the first wavelet coefficient (Fig. 2.9). In the case of the RS neurons more than 90% of the total information was captured by the second wavelet coefficient, i.e. 16 ms after stimulus onset. Thus the effect of the neuronal wavelet circuit on the speed of encoding depends both on the spiking behavior of the encoders and on the frequency range at which the signal is read out. Prototype-based classification In the last step, we investigate whether the wavelet representation can be generalized to the generation of prototypes from the stimulus classes.
The aim of the experiment is to create prototypes learned from the training set that can be stored in memory and retrieved in a future classification task. To construct such representations we build class prototypes based on the wavelet coefficients of the N stimuli making up the training set. For each of the ten stimulus classes we calculate the median over the wavelet coefficients of the 50 response exemplars that define a training class (subset one). Hence, for each class we have a vector of wavelet coefficients that defines the class-specific prototype, i.e. a representation of the class in the wavelet domain. Based on the classification experiments we use the coefficients from the Dc3 level for RS neurons and from the Dc5 level for BS neurons. The Fourier transform of the prototypes reveals the frequency components that comprise the prototypes for the two different neuron models we consider (Fig. 2.10). We present the classification set to the network and calculate the Euclidean distance between the output responses of the wavelet network and the ten previously created prototypes. For the classification a simple criterion is adopted: the smallest Euclidean distance defines the class to which the stimulus is assigned. The results for the RS neurons show that 86%, 82% and 75% of the responses are correctly classified (diagonal entries) by the wavelet prototypes for subsets one, two and three respectively. For the BS neurons we observe that 91%, 81% and 72% of the responses are correctly classified for subsets one, two and three respectively (Fig. 2.11). The classification ratios are consistent with the results previously presented in section 2.3 using the cluster algorithm (see methods, section 2.2).
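The prototype construction and the nearest-prototype criterion described above can be sketched as follows (illustrative function names):

```python
import numpy as np

def build_prototypes(train_coeffs, train_labels, n_classes):
    """Class prototype = element-wise median of the wavelet-coefficient
    vectors of the training exemplars belonging to that class."""
    train_coeffs = np.asarray(train_coeffs, dtype=float)
    train_labels = np.asarray(train_labels)
    return np.stack([np.median(train_coeffs[train_labels == c], axis=0)
                     for c in range(n_classes)])

def classify_by_prototype(coeffs, prototypes):
    """Assign each coefficient vector to the prototype with the smallest
    Euclidean distance."""
    coeffs = np.asarray(coeffs, dtype=float)
    d = np.linalg.norm(coeffs[:, None, :] - prototypes[None, :, :], axis=2)
    return d.argmin(axis=1)
```

Compared with the cluster algorithm, each test stimulus is now compared against C prototype vectors instead of against every training response, which is the reduction in computation noted below.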
However, the number of calculations in the classification stage is drastically reduced because each class is represented by a single prototype vector of wavelet coefficients instead of a collection of vectors. This result suggests that with a simple algorithm the wavelet representation can be integrated into a compact description of a complex spatially organized stimulus. Therefore, the information provided by densely coupled cortical neurons can be learned and efficiently stored in memory independently of the details of their spiking behavior. In the second part of the experiment, we present a geometric and spatio-temporal analysis of the underlying neural code. We perform a correlation analysis of the stimuli misclassified using the wavelet prototypes in both the wavelet and the spatial domain. We want to understand whether the geometric deformations applied in the spatial domain are directly translated into the temporal representation captured by the wavelet coefficients. This analysis is performed using the exemplars that were misclassified by the wavelet prototype approach. First, we calculate the normalized Euclidean distances in the spatial domain between each stimulus of the misclassified set and the prototype of each class. Second, we apply the same distance measure using the wavelet representation of the misclassified stimuli and the prototypes. Finally, we correlate the distance values (Fig. 2.12).
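The final correlation step can be sketched as follows. The thesis reports "Rho" values; since the exact estimator is not specified here, both a Pearson and a simple rank-based (Spearman-style, no tie handling) variant are shown.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between the spatial-domain and wavelet-domain
    distances of the misclassified exemplars to the class prototypes."""
    return np.corrcoef(x, y)[0, 1]

def spearman_rho(x, y):
    """Rank correlation (no tie handling): correlate the ranks of the
    distance values instead of the raw values."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return np.corrcoef(rank(x), rank(y))[0, 1]
```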
Figure 2.10: Single-sided amplitude spectra of the wavelet prototypes of the ten stimulus classes used in the simulations, for RS and BS neurons. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion).
Figure 2.11: Prototype-based classification hit matrices for RS and BS neurons and subsets 1 to 3. For each class in the training set we take the median of the wavelet coefficients to form a class prototype. In the classification process, the Euclidean distance between each element of the classification set and the prototypes is calculated, and a stimulus is assigned to the class whose prototype has the smallest Euclidean distance to it. The results show a positive correlation over the Euclidean distances in the two domains, suggesting that the amount of geometric deformation in the spatial domain is directly translated to the wavelet representation of the temporal code. The correlation is higher for RS neurons, with a value of 0.62 against 0.50 for BS neurons (p<0.001). The positive correlation between both domains validates the wavelet prototypes, and therefore the overall TPC transformation structure, as an equivalent representation of the stimulus classes that conserves the relevant spatial information.
2.4 Discussion We have shown previously that in a model of the sensory cortex the representation of a static stimulus can be generated using the temporal dynamics of a neuronal population, or Temporal Population Code. Figure 2.12: Distribution of the errors in the wavelet-based prototype classification in relation to the Euclidean distances within the prototyped classes, for RS neurons (Rho = 0.62, p < 0.001) and BS neurons (Rho = 0.50, p < 0.001). Here we have shown that this temporal code has a specific signature in the phase relationships among the active neurons of the underlying substrate. This signal can efficiently transmit a complete and dense body of information that can be decoded in further areas through a subset of wavelet coefficients. The TPC is a relevant hypothesis on the encoding of sensory events given its consistency with cortical anatomy and recent physiology. Since its introduction, however, a persistent problem has been to extend this concept to a readout stage that would be neurobiologically compatible. A priori it was not clear whether a wavelet transform would be suitable, because it implies a specific structure in the TPC representation itself. We have shown that decoding of the TPC can be based on wavelet transforms. Using a systematic spectral investigation of the TPC signal we observed that the Haar basis is a feasible choice, providing robust and reliable decoding. The results achieved with the Haar wavelet are consistent with previous studies where Haar filters are used to preprocess neuronal data prior to a linear discriminant analysis (Laubach, 2004).
Compared with the original algorithmic TPC readout, the neuronal wavelet circuit proposed here showed the same encoding performance for regular spiking neurons. However, for bursting neurons the wavelet readout had a slower speed of encoding than the original TPC. Therefore the details of the readout mechanism, such as its integration time constant and its wavelet resolution level, depend on the spiking dynamics. The specific geometric characteristics of each stimulus class could be captured in a very compact way using the wavelet filters. We showed that a visual stimulus can be represented and further classified using a strongly compressed signal based on the wavelet coefficients. Reading out the network based on regular spiking neurons using wavelet coefficients yielded a compression factor of 16.6 relative to the original image size. In the case of the bursting neurons the compression ratio was even higher, reaching 66 times. Therefore, the spatial-to-temporal transformation of the TPC model combined with the efficient wavelet readout circuit can provide a robust and compact representation of sensory inputs in the cortex for different spiking modalities. We have performed a detailed analysis of the misclassified stimuli in order to better understand the interesting feature of similarity-conserving misclassifications we observe. We found a positive correlation between the geometric distortions in the spatial domain and those in the temporal domain, represented by the wavelet coefficients. These findings suggest that the deformations in the spatial domain were directly translated into the wavelet domain and were therefore responsible for the misclassifications observed. This result reinforces the direct relationship between the geometric and spatio-temporal portions of the underlying neural code and its decoding.
Our results also suggest that specific axon/synapse complexes dedicated to specific features are not needed to successfully encode visual stimuli. We have shown that the efficient structure of an orthogonal basis like the Haar wavelet can be implemented by a neuronal circuit (see Fig. 2.4) with low computational cost and low latency, and thus in a real-time system. The wavelet filters as implemented with buffer neurons can be changed online depending on what kind of information needs to be retrieved and on its specific frequency range. This property allows multiplexing high-level information from the visual input and flexible storage and retrieval of information by a memory system. Indeed, recent experiments have shown that distinct activity patterns overlaid in primary visual cortex individually signal motion direction, speed, and orientation of object contours within the same network at the same time (Onat et al., 2011). We speculate that in a non-static scenario other stimulus characteristics, such as motion speed and direction, could be extracted from other frequency bands using the same wavelet circuit. The optimal resolution level for the wavelet readout was determined through a systematic investigation based on the correlation between compression and performance (Fig. 2.8). In the case of regular spiking neurons we observed a maximum classification performance at the Dc3 resolution level, i.e. in the frequency range from 62 Hz up to 125 Hz, while for the bursting neurons it falls in the range of 15.5 Hz to 31 Hz. The wavelet circuit itself does not define the choice of the wavelet resolution. However, we could show that within the TPC framework the proposed readout circuit can capture different properties of visual stimuli that travel through the sensory processing stream in different frequency ranges.
In our case the sensitivity of the wavelet circuit will depend on the feed-forward receptive fields combined with the phase relationships imposed by the inhibitory units. The mechanisms used by higher cognitive areas to manipulate the frequency ranges and the kind of information carried in these temporal information channels are currently not clear and are the subject of follow-up studies. In comparison to the LSM model previously applied to read out the TPC, the wavelet circuit is computationally inexpensive and requires only four neurons to be implemented. Although the optimal readout performance also depends on specific parameters to set the readout frequency range, the generality of the model is not affected, as it is in the case of the LSM. Our results suggest that this optimal frequency range is determined by the spiking behavior of the neurons in the network. From a technical perspective, one issue related to orthogonal wavelets is the aliasing effect (Chen and Wang, 2001), which could insert redundant spectral content into the TPC signals, leading to reduced classification performance. This can be addressed by increasing the number of vanishing moments of the wavelet basis, which smooths the effects of aliasing and increases the orthogonality between the spectral sub-bands. However, the filtering process would then become more sophisticated and would require more than two buffer cells. In contrast, the Haar-based readout circuit is computationally efficient. To evaluate the effects of different spiking behaviors on the proposed readout circuit, we used a different and more physiologically constrained neuron model than in previous TPC studies. In comparison, the overall dynamics of the previous neuron model are significantly different from those of the model used here.
For instance, the model used in previous studies (Wyss et al., 2003c) has a strong onset response, with about 50% more spikes in the first 20 ms after stimulus onset than the model used in the current study, which includes mechanisms for spike adaptation. In addition, we observed significant differences in the sub-threshold fluctuations and the membrane potential envelope between these two neuron models. However, the overall results reported here are compatible with those previously reported (Wyss et al., 2003a,c). Based on this, we conclude that TPC and the proposed readout mechanism are robust with respect to the details of the spiking dynamics and the overall biophysical properties of the membrane potential envelope. We are not aware of any other encoding-decoding model of cortical dynamics that shows a similar generality. The performance measure we use, essentially classification based on the Euclidean distance, is a well-established standard in the literature (Victor and Purpura, 1999). We looked both at classification performance and at the information encoded. In order to develop the specific point of this paper, the decoding of the TPC using wavelets, we adhere to this standard. Hence, the results should be seen as relative to those established and published for the TPC, contributing to the unresolved issue of how a biologically plausible decoding can take place. We demonstrate that a wavelet-like transform can fulfill the requirements for an efficient readout mechanism, thus generating a specific hypothesis on the role of the sparse inter-areal connectivity found in the neo-cortex. We have shown that the dense local connectivity of the neo-cortex can transform the spatial organization of its inputs into a compact Temporal Population Code (TPC). By virtue of its multiplexing capability this code can be transmitted to downstream decoding areas using the sparse long-range connections of the cortex.
We have shown that in these downstream areas the TPC can be decoded and further compressed using a wavelet-based readout system. Our results show that the TPC information is organized in a specific subset of frequency space, creating virtual communication channels that can serve working memory systems. Thus, TPC does not only provide a functional hypothesis on the specifics of cortical anatomy; it also provides a computational substrate for a functional neuronal architecture that is organized in time rather than in space (Buzsáki, 2006). 2.5 Acknowledgments This work was supported by EU FP7 projects EFAA (FP7-ICT-270490) and GOAL-LEADERS (FP7-ICT-97732). Chapter 3 Temporal Population Code for Face Recognition on the iCub Robot The connectivity of the cerebral cortex is characterized by dense local and sparse long-range connectivity. It has been proposed that this connection topology provides a rapid and robust transformation of spatial stimulus information into a temporal population code (TPC). TPC is a canonical model of cortical computation whose topological requirements are independent of the properties of the input stimuli and can therefore be generalized to the processing requirements of all cortical areas. Here we propose a real-time implementation of TPC for classifying faces, complex natural stimuli that mammals are constantly confronted with. The model consists of a primary visual cortex (V1) network of laterally connected integrate-and-fire neurons implemented on the humanoid robot platform iCub. The experiments were performed using human faces presented to the robot under different angles and positions of light incidence. We show that the TPC-based model can recognize faces with a correct ratio of 97% without any face-specific strategy.
Additionally, the speed of encoding is coherent with that of the mammalian visual system, suggesting that the representation of natural static visual stimuli is generated from the combined temporal dynamics of multiple neuron populations. Our results show that, without any input-dependent wiring, TPC can be efficiently used for encoding local features in a highly complex task such as face recognition. 3.1 Introduction The mammalian brain has a remarkable ability to recognize objects under widely varying conditions. To perform this task, the visual system must solve the problem of building invariant representations of the objects that are the sources of the available sensory information. Since the breakthrough work on V1 by Hubel and Wiesel (Hubel and Wiesel, 1962) there has been an increasing number of models of the visual cortex trying to reproduce the performance of natural vision systems. A number of these models have been developed to reproduce characteristics of the visual system such as invariance to shifts in position, rotation, and scaling. Most of these models are based on the so-called Neocognitron (Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010), a hierarchical multilayer network of detectors with varying feature and spatial tuning. In this classical framework invariant representations emerge in the form of activity patterns at the highest level of the network by virtue of the spatial averaging across the feature detectors at the preceding layers. The recognition process is based on a large dictionary of features stored in memory, where model neurons in the different layers act as filters tuned to specific features or feature combinations. In this approach invariances to, for instance, position, scale and orientation are achieved at the high cost of increasing the number of connections between the layers of the network.
In recent years a novel model of object recognition has been proposed, based on the specific connectivity template found in the visual cortex combined with the temporal dynamics of neural populations. The so-called Temporal Population Code, or TPC, emphasizes the property of densely coupled networks to rapidly encode the geometric organization of their inputs into the temporal structure of their population response (Wyss et al., 2003a,c). In the TPC architecture the spatial information provided by an input stimulus is transformed into a temporal representation. The encoding is defined by the spatially averaged spikes of a population of integrate-and-fire neurons over a certain time window. In this paper we make use of the TPC scheme to encode one of the most complex stimuli mammals are challenged by: faces (Zhao et al., 2003; Martelli et al., 2005; Sirovich and Meytlis, 2009). In the brain, the recognition of known faces takes place through a complex set of local and distributed processes that interact dynamically (Barbeau et al., 2008). Recent results suggest the existence of specialized areas populated by neurons selective to specific faces. For instance, in the macaque brain the middle face patch has been identified, where neurons detect and differentiate faces using a strategy based on the distinct constellations of face parts (Freiwald et al., 2009). Nevertheless, these higher cortical areas where curves and spots are assembled into coherent objects must be fed by a canonical computational principle. Our model is based on the hypothesis that the TPC is the canonical feature encoder of the visual cortex that feeds the higher areas involved in the recognition process. The model is implemented and tested in a real-time context. The robot platform used for the experiments is the humanoid robot iCub (Metta et al., 2008). In the face recognition task, we demonstrate that the TPC model can recognize faces under different natural conditions.
We also show that the speed of encoding is compatible with the human visual system. Finally, we present a comparison between the TPC face recognition and other models available in the literature, using a standard face database. In the next section we give a full description of the model and the parameters used for the simulations. 3.2 Methods The model we propose here consists of a retinotopic map of spiking neurons with properties found in the primary visual cortex V1 (Wyss et al., 2003c,a). The network consists of N×N neurons interconnected within a circular neighborhood with synapses of equal strength and instantaneous excitatory conductances. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons in the map. The stimuli are continuously presented to the network and the spreading spatial activity of the V1 units is integrated over a time window as a sum of their action potentials. The outcome is the so-called TPC, as shown in the previous chapter (Fig. 2.1). The TPC network In the network, each modeled neuron is approximated by a leaky integrate-and-fire unit. The time course of its membrane voltage is given by:

Cm dV/dt = −(Ie(t) + Ik(t) + Il(t)). (3.1)

Cm is the membrane capacitance and I represents the transmembrane currents: excitatory (Ie), spike-triggered potassium (Ik) and leak (Il). These currents are computed by multiplying the conductance g by the driving force, as follows:

I(t) = g(t)(V(t) − V^r) (3.2)

where V^r is the reversal potential of the conductance. The unit's activity at time t is given by:

A(t) = a H(V(t) − θ) (3.3)

where a ∈ [0, 1] is the spike amplitude of the unit, and V(t) is determined by the total input composed of the external stimulus and the internal connections. H denotes the Heaviside function and θ is the firing threshold.
After a spike is generated, the unit's potential is reset to Vr. The time course of the potassium conductance gk is given by:

τK dgk/dt = −(gk(t) − gkpeak Â(t))   (3.4)

where Â(t) = H(V(t) − θ). The excitatory input is the sum of two components: a constant driving excitatory input gi and the synaptic conductances given by the lateral interaction of the units gc(t), so that

ge(t) = gi + gc(t)   (3.5)

In the visual cortex it has been observed that many cells produce strong activity for a specific optimal input pattern. De Valois et al. (1982) showed that the receptive field of simple cells is sensitive to specific position, orientation and spatial frequency. This behavior can be formalized as a multidimensional tuning response that can be approximated by a Gaussian-like template-matching operation. In our model each unit is characterized by an orientation selectivity angle φ ∈ {0, π/4, 2π/4, 3π/4} and a two-dimensional vector x ∈ R² specifying its receptive field's center location within the input plane. So each unit in the network is defined as u(x, φ), which we envision as approximating the functional notion of a cortical column.

The receptive field of each unit is approximated using the second derivative of an isotropic Gaussian filter Gσ. The input image is convolved with Gσ to reproduce the V1 receptive field's characteristics for orientation selectivity. The input stimulus is a retinal gray-scale image that fills the whole visual field. The convolution with ∂²x(φ) Gσ is computed by twice applying a sampled first derivative of a Gaussian (σ = 1.5, size 11x11 pixels) over the four orientations φ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold Ti ∈ [0, 1], where the values above Ti are set to Ex. Each orientation-selective output is projected onto a specific network layer.
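As a concrete illustration, the unit dynamics of Eqs. 3.1-3.4 can be integrated with Euler's method at 1 ms resolution. This is a minimal sketch using the parameter values of Table 3.1 converted to SI units; the function and variable names are ours, and the reset target (the leak potential) and the potassium time constant (taken as 1 ms from the table) are assumptions rather than details confirmed by the original implementation.

```python
def simulate_unit(g_e, n_steps=50, dt=1e-3):
    """Euler integration of one leaky integrate-and-fire unit (Eqs. 3.1-3.4).

    g_e: constant excitatory conductance (S). Returns the spike train A(t)
    as a list of 0/1 values, one per 1 ms time step.
    """
    Cm = 0.2e-9                           # membrane capacitance (F)
    Ve, Vk, Vl = 60e-3, -90e-3, -70e-3    # reversal potentials (V)
    theta = -55e-3                        # firing threshold (V)
    g_l, gk_peak = 20e-9, 200e-9          # leak / peak potassium conductance (S)
    tau_k = 1e-3                          # potassium time constant (s), assumed
    V, g_k = Vl, 0.0
    spikes = []
    for _ in range(n_steps):
        # Transmembrane currents, Eq. 3.2: I = g (V - Vr)
        I = g_e * (V - Ve) + g_k * (V - Vk) + g_l * (V - Vl)
        V += -dt / Cm * I                 # Eq. 3.1
        fired = V >= theta                # A(t) = a H(V - theta), a = 1
        spikes.append(1 if fired else 0)
        # Spike-triggered potassium conductance, Eq. 3.4
        g_k += -dt / tau_k * (g_k - gk_peak * (1 if fired else 0))
        if fired:
            V = Vl                        # reset (assumed: back to Vl)
    return spikes
```

With the constant driving input gi = 5 nS from Table 3.1, the first threshold crossing of this sketch falls within the first few milliseconds after onset, in the same range as the 6 ms quoted in the text for the maximum input Ex.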
The lateral connectivity between V1 units is exclusively excitatory, and a unit ua(x, φ) connects to ub if the following conditions are met:

1. they belong to the same population: φa = φb;
2. they have different center positions: xa ≠ xb;
3. they lie within a region of a certain diameter: ||xb − xa|| < r.

The synaptic efficacies are of equal strength w and the transmission delays τa are proportional to ||xb − xa||, with 1 ms/cell. The connectivity diameter is equal to dmax. In the discrete-time case, all the equations are integrated with Euler's method using a temporal resolution of 1 ms; therefore in the simulations one time step is equivalent to 1 ms. All the simulations were done using the parameters of Table 3.1. When a unit u receives the maximum excitatory input Ex at onset time t = 0 ms, the first spike occurs at time 6 ms.

The TPC is generated by summing the network's activity over a time window of T ms. Finally, the output vectors from the different orientation layers are summed, forming the TPC vector. In this way the temporal representation of the TPC is naturally invariant to position. Theoretically, rotation invariance can be achieved by increasing the number of orientation-selective layers, reaching a fully invariant architecture as their number grows.

Variable   Description                                 Value
N          network dimension                           128x128 neurons
Cm         membrane capacitance                        0.2 nF
Ve^r       excitatory reversal potential               60 mV
Vk^r       spike-triggered potassium rev. potential    -90 mV
Vl^r       leak reversal potential                     -70 mV
θ          firing threshold                            -55 mV
τK         time constant                               1 ms/cell
gk^peak    peak potassium conductance                  200 nS
gl         leak conductance                            20 nS
a          spike amplitude                             1
gi         excitatory input conductance                5 nS
Ti         minimum V1 input threshold                  0.6
dmax       lateral connectivity diameter               7 units
w          synaptic strength                           0.1 nS

Table 3.1: Parameters used for the simulations.
See text for further explanation.

Optimizing the lateral spreading of activation in the network

The key part of the model is the lateral propagation of activity, denoted by gc(t) in equation (3.5). Since hundreds of spikes can occur at each time step, an efficient and fast real-time implementation is required. This is achieved by performing consecutive multiplications in the frequency domain. The lateral propagation of activity is performed after each spike at every time step, i.e. every 1 ms in our simulations.

The lateral spread can be seen as consecutive square regions of propagation fd, or filters of lateral propagation, with increasing scale over a short period of time (Fig. 3.1). The maximum scale of f is given by the lateral connection diameter dmax, and the magnitude of the spreading activation is given by the synaptic strength w. For the simulations dmax = 7, thus the fd are of scale 3x3, 5x5 and 7x7. They are generated and stored in the frequency domain (denoted by Fd) when the network is initialized. In the implementation, the Discrete Fourier Transform (DFT) operations were done using the FFTW3 library (Frigo and Johnson, 2005).

Figure 3.1: Schematic of the lateral propagation using the Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the frequency domain: the matrix of spikes generated at time t is multiplied by the filters of lateral propagation Fd over the subsequent time steps, and the output is given by the inverse DFT of the result.

At time t we calculate the DFT of the spike matrix, denoted by M. The matrix M is composed of zeros and ones (ones at the positions where spikes occurred). The propagation of activity is achieved by multiplying M by Fd for d = 3, 5 and 7 over the time steps t + 1, t + 2 and t + 3, respectively.
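The frequency-domain propagation described above can be sketched in NumPy as follows (the original implementation uses the FFTW3 library in compiled code). The helper names are ours, and the circular wrap-around of FFT-based convolution at the image borders is a simplification of the actual network.

```python
import numpy as np

def make_filters(N, w=0.1, sizes=(3, 5, 7)):
    """Square filters of lateral propagation f_d, of scale d x d and
    magnitude w, generated once and stored in the frequency domain (F_d)."""
    filters = []
    for d in sizes:
        f = np.zeros((N, N))
        f[:d, :d] = w
        # Centre the kernel so the spread is symmetric around each spike.
        f = np.roll(f, (-(d // 2), -(d // 2)), axis=(0, 1))
        filters.append(np.fft.fft2(f))
    return filters

def lateral_spread(spike_matrix, freq_filters):
    """g_c contributions for t+1, t+2, t+3 (Fig. 3.1): transform the 0/1
    spike matrix M once, multiply by each stored F_d, and invert."""
    M = np.fft.fft2(spike_matrix)
    return [np.real(np.fft.ifft2(M * Fd)) for Fd in freq_filters]
```

A single spike thus contributes w to each of the 3x3, 5x5 and 7x7 neighborhoods over the three subsequent time steps, while only one forward transform of the spike matrix is computed per step.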
Finally, the inverse DFT is applied. Using a 2.66 GHz Intel Core i7 processor under Linux, each time step takes approximately 9 ms to process.

Clustering algorithm used in the classifier

The network's responses to stimuli from C stimulus classes S1, S2, ..., SC are assigned to C response classes R1, R2, ..., RC of the training set in a supervised manner. The result is a C × C hit matrix N(Sα, Rβ), whose entries denote the number of times that a stimulus from class Sα elicits a response in class Rβ. Initially, the matrix N(Sα, Rβ) is set to zero. For each response r ∈ Sα, we calculate the mean Pearson correlation coefficient between r and the responses r′ ≠ r elicited by stimuli of class Sγ:

ρ(r, Sγ) = ⟨ρ(r, r′)⟩ over r′ elicited by Sγ   (3.6)

where ⟨·⟩ denotes the average of the correlations ρ(r, r′) between r and the responses r′. The response r is classified into the response class Rβ for which ρ(r, Sβ) is maximal, and N(Sα, Rβ) is incremented by one. The overall classification ratio, in percent, is calculated by summing the diagonal of N(Sα, Rβ) and dividing by the total number of elements in the classification set R. For the statistics reported here, the data was generated in real time and analyzed off-line.

TPC and DAC architecture

The face recognition system was included in a larger architecture called BASSIS, which we have recently used as a framework for human-robot interaction (Luvizotto et al., 2012a)¹. BASSIS is fully grounded in the Distributed Adaptive Control (DAC) architecture (Verschure and Althaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly coupled layers: reactive, adaptive and contextual. At each level of organization, increasingly more complex and memory-dependent mappings from sensory states to actions are generated, dependent on the internal state of the agent. The BASSIS architecture expands DAC with many new components. It incorporates the repertoire of capabilities available for the human to interact with the humanoid robot iCub. BASSIS is divided into four hierarchical layers: Contextual, Adaptive, Reactive and Soma. The different modules responsible for implementing the relevant competences of the humanoid are distributed over these layers (Fig. 3.2).

¹This work was done in collaboration with the partners of the European project EFAA.

Figure 3.2: BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.

The Soma and Reactive layers are the bottom layers of the BASSIS diagram. They are in charge of coding the perceptual inputs: the faceDetector acquires the position of the faces in the visual field, and the tpcNetwork encodes the cropped face obtained using the information given by the faceDetector. The Reactive layer also deals with the robot's motor primitives. The pmpActions package provides motor control mechanisms that allow the robot to perform complex manipulation tasks. It relies on a revised version of the so-called Passive Motion Paradigm (Mohan et al., 2009) and uses the Cartesian controller developed by Pattacini et al. (2010) to solve the inverse kinematics problem while fulfilling additional constraints such as joint limits. In the Adaptive layer, the classification of the encoded face is performed using the clustering algorithm detailed in Section 3.2. At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Objects Properties Collector (OPC).
The Supervisor module essentially manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data such as object coordinates in multiple reference frames (both in the image and robot domains). Furthermore, the OPC plays a central role as the database where a large variety of information is stored from the beginning of the experiments.

The iCub

The iCub is a humanoid robot about one meter tall, with dimensions similar to those of a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over the hands, arms, head and legs (Metta et al., 2008). The sensory inputs are provided by artificial skin, cameras and microphones. The robot is equipped with novel artificial skin covering the hands (fingertips and palms) and the forearms (Cannata et al., 2008). Stereo vision is provided by two cameras in a swivel mounting, and stereo sound capture by two microphones. Both cameras and microphones are located where eyes and ears would be placed in a human. The facial expressions, mouth and eyebrows, are projected from behind the face panel using lines of red LEDs. It also has the sense of proprioception (body configuration) and movement (using accelerometers and gyroscopes).

The different software modules in the architecture are interconnected using YARP (Metta et al., 2006). This framework supports distributed computation with a focus on robot control and efficiency. The communication between two modules using YARP happens between objects called "ports". The ports can exchange data over different network protocols, such as TCP and UDP.
Currently, the iCub is capable of reliably coordinating reach and grasp motions, producing facial expressions to express emotions, controlling forces by exploiting its force/torque sensors, and gazing at points in the visual field. The interaction can be performed using spoken language. With all these features, the iCub is an outstanding platform for research on social interaction between humans and humanoid robots.

Object Properties Collector (OPC)

The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and growing as the interaction with humans evolves over time. In this setup, the data provided by the face recognition model, and eventually information about other objects, feed the memory of the robot represented by the OPC. The OPC reflects what the system knows about the environment and what is available. The data collected range from object coordinates in multiple reference frames (both in the image and robot domains) to the human's emotional states or position in the interaction process.

Entities in the OPC are addressed with unique identifiers and managed dynamically, meaning that they can be added, modified and removed at run-time, safely and easily, from multiple sources located anywhere in the network, practically without limitations. Entities consist of all the possible elements that enter or leave the scene: objects, table cursors, matrices used for geometric projection, as well as the robot and the human themselves. All the properties related to the entities can be manipulated and used to populate the centralized database. A property in the OPC vocabulary is identified by a tag specified as a string (a sequence of characters). The values that can be assigned to a tag can be single strings or numbers, or even a nested list of strings and numbers.
Spoken Language Interaction and Supervisor

The spoken language interaction is managed by the Supervisor. It is implemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application Development (RAD) environment, in the form of a finite-state dialogue system where speech synthesis is done by Festival and recognition by Sphinx-II. These state modules provide functions such as conditional transitions to new states based on the words and sentences recognized, and thus conditional execution of code based on the current state and the spoken input from the user. The recognition and extraction of the useful information from the recognized sentence is based on a simple grammar that can be developed according to each application (Lallée et al., 2010).

The stimulus set used in the face recognition experiments

Real-time

The images used as input to the network were acquired using the robot's camera with a resolution of 640x480 pixels (Fig. 3.3). The faces were detected and cropped using a cascade of boosted classifiers based on Haar-like features (Li et al., 2004). After the detection process the images are rescaled to a resolution of 128x128 pixels covering the whole scene. We ran the experiments reported here with faces from 6 subjects, a sample size sufficient for most social robotics tasks such as games, interaction or human assistance.

Figure 3.3: Model overview. The faces are detected and cropped from the input provided by the iCub's camera image. The cropped faces are resized to a fixed resolution of 128x128 pixels and convolved with the orientation-selective filters. The output of each orientation layer is processed by separate neural populations, as explained above. The spike activity is summed over a specific time window, rendering the Temporal Population Code, or TPC.
The stimulus set consists of 6 classes of faces, one per subject, where each class has 7 variations of the same face under different poses. For the statistics, 2 faces are used as training set and 5 faces as classification set for each class. The training stimuli are not used for classification and vice versa.

Standard face database

In order to compare the proposed system to standard face recognition methods available in the literature, we use the cropped Yale Face Database B (Georghiades et al., 2001; Lee et al., 2005). This dataset contains 10 classes with different human subjects. Each class has variations of the same face under different angles and positions of light incidence, divided into four subsets of increasing difficulty (Fig. 3.4). In this quantitative study we use lt = 7 faces of subset 1 and lc = 12 faces of subset 2 as training set. Subsets 2, 3 and 4 are used as classification set. The training stimuli are not used for classification and vice versa, except when subset 2 is used for classification. All images are in their original resolution and aspect ratio.

Figure 3.4: Cropped faces from the Yale Face Database B, used as a benchmark to compare TPC with other face recognition methods available in the literature (Georghiades et al., 2001; Lee et al., 2005).

3.3 Results

In the first experiment we investigate the capabilities of our model in a face recognition task using the iCub. We also evaluate the optimal number of time steps required for the recognition process, a free parameter that determines how many iterations are needed for the encoding process to support reliable classification. It directly affects both the processing time, and therefore the overall system performance, and the length of the TPC vector, and thus the optimal performance/compression ratio of the data.
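The classification statistics that follow use the correlation-based clustering procedure described in the methods. A minimal sketch (the function names are ours, and TPC responses are assumed to be plain NumPy vectors):

```python
import numpy as np

def classify(r, train_set):
    """Assign response r to the class R_beta whose training responses have
    the highest mean Pearson correlation with r (Eq. 3.6).
    train_set maps each class label to a list of training TPC vectors."""
    def mean_rho(responses):
        return np.mean([np.corrcoef(r, r2)[0, 1] for r2 in responses])
    return max(train_set, key=lambda c: mean_rho(train_set[c]))

def hit_matrix(test_set, train_set):
    """C x C matrix N(S_alpha, R_beta): how often a stimulus of class
    alpha elicits a response classified into class beta."""
    labels = sorted(train_set)
    N = np.zeros((len(labels), len(labels)), dtype=int)
    for alpha, responses in test_set.items():
        for r in responses:
            beta = classify(r, train_set)
            N[labels.index(alpha), labels.index(beta)] += 1
    return N
```

The overall classification ratio is then the trace of N divided by the total number of test responses.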
We calculate the classification ratio from the network's activity trace for different time windows after stimulus onset, used to compute the correlation between TPCs of the training and test sets, ranging from 2 to 130 ms in steps of 1 ms. The simulations were performed with the parameters of Table 3.1. We do not use any light normalization mechanism in the model; for that reason we performed the simulations for different values of the input threshold Ti (see Methods).

The classification performance shows a maximum of 87% within 21 ms after stimulus onset, using Ti = 0.6 (Fig. 3.5). This speed of optimal stimulus encoding is compatible with the speed of processing observed in the mammalian visual system (Thorpe et al., 1996). A TPC vector over this time window represents a face with 22 points, covering about 1% of the total number of pixels in a 128x128 image. The misclassified faces show strong deviations in position or light conditions (Fig. 3.7 and Fig. 3.6).

Figure 3.5: Speed of encoding. Average correct classification ratio given by the network's activity trace as a function of time, for different values of the input threshold Ti.

In the second experiment we show that the recognition performance can be enhanced by changing the readout mechanism. In this experiment the network spike activity is read out over Sr adjacent square sub-regions of the same aspect ratio. The TPC vectors from the different sub-regions are concatenated to form the overall representation of the face. We used four different configurations to divide the sub-regions, using a time window of 21 ms (Fig. 3.8). This operation increases the size of the TPC vector that represents a face by a factor of Sr.

Figure 3.6: Response clustering. The entries of the hit matrix represent the number of times a stimulus class is assigned to a response class, over the optimal time window of 21 ms and Ti of 0.6.

The results show a maximum performance of 97%. They also suggest an optimal ratio of performance/compression with 16 sub-regions, i.e. 16 x 22 points per vector.

In the last part, we show the results obtained with the TPC model using the cropped faces from the Yale Face Database B. We followed the same procedure, reading out the network over Sr adjacent square sub-regions of the same aspect ratio and also varying the values of the input threshold Ti (Fig. 3.9). In Table 3.2 we summarize the TPC performance in comparison with other face recognition methods. The TPC results presented in the table were calculated using 64 sub-regions, totaling 64 x 22 = 1408 points per vector. This corresponds to approximately 5% of the number of pixels (192x168 = 32256) of the input image. The results show that both the performance and the compression ratio are fully compatible with state-of-the-art methods for face recognition.

Figure 3.7: Face data set. Training set (green), classification set (blue) and misclassified faces (red).

Figure 3.8: Classification ratio using a spatial subsampling strategy where the network activity is read out over multiple sub-regions.

Figure 3.9: Classification ratio using the cropped faces from the Yale database.

3.4 Discussion

In this chapter, we addressed the question of whether we could reliably recognize faces using a biologically plausible cortical neural network exploiting the Temporal Population Code.
Moreover, we investigated whether it could be performed in real time using the iCub robot. We have shown that in our cortical model the representation of a static visual stimulus is generated based on the temporal dynamics of neuronal populations, given by a specific signature in the phase relationships of the neuronal responses. In the specific benchmark evaluated here, the model showed good performance both in terms of the ratio of correct classification and in processing time.

Table 3.2: Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method.

                                       Classification Ratio (%) vs. Illumination
Method                                 Subset 1&2   Subset 3   Subset 4
Correlation                            100.0        76.7       23.7
Eigenfaces                             100.0        74.2       24.3
Eigenfaces w/o 1st 3                   100.0        80.8       33.6
Linear Subspace                        100.0        100.0      85.0
Cones-attached                         100.0        100.0      91.4
Harmonic images (no cast shadows)      100.0        100.0      96.4
9PL (simulated images)                 100.0        100.0      97.2
Harmonic images (with cast shadows)    100.0        100.0      97.3
TPC                                    100.0        100.0      98.0
Gradient Angle                         100.0        100.0      98.6
Cones-Cast                             100.0        100.0      100.0
9PL (real images)                      100.0        100.0      100.0

Importantly, we showed that the TPC model can recognize faces with a correct classification ratio of 97% without any face-specific mechanism in the network. Indeed, similar filter properties have been used for character recognition and spatial cognition tasks. The processing time needed for classifying an input of 128x128 pixels was about 200 ms. In comparison, generic face recognition methods such as correlation and eigenfaces have reduced recognition performance (Lee et al., 2005).

In the implementation reported here we used only one frequency resolution to model the receptive field characteristics of simple cells. This reduces the number of operations and therefore improves the recognition speed.
The optimal frequency response was based on previous studies of V1 (Bayerl and Neumann, 2004). We expect that the performance and the generality of the model can be further improved in a future implementation using multiple processors in parallel, especially since the features could be extracted by Gaussian filters optimally arranged over different scales, similar to the SIFT transform (Lowe, 2004). With such a scheme we could achieve scale invariance without relying on re-scaling the cropped face image. However, the ideal architecture for TPC is a neuromorphic implementation where the integrate-and-fire units can work fully in parallel, which calls for more specialized parallel hardware.

Our results confirm that TPC is a relevant hypothesis on the encoding of sensory events by the brain of both vertebrates and invertebrates, given its performance, speed of processing and consistency with cortical anatomy and physiology. Moreover, they reinforce the main characteristic of the TPC: its independence of the exact detailed properties of the input stimulus. The network does not need to be rewired to encode a very complex natural stimulus such as a human face, and thus qualifies as a canonical cortical input processor.

Chapter 4

Using a temporal population to recognize gestures on the humanoid robot iCub

4.1 Introduction

Gesture recognition is a core human ability that improves our capacity for interaction and communication. Through gestures we can reveal unspoken thoughts that are crucial for verbal communication (Mitra et al., 2012), or even interact with machines (Kurtenbach and Hulteen, 1990). Gestures also play a role in changing our knowledge during the learning phase of our childhood (Goldin-Meadow, 2009). Different studies have been dedicated to understanding and categorizing human gestures from a psychological point of view.
According to Cadoz (1984) and McNeill (1996), gestures can be used to communicate meaningful information (semiotic), to manipulate the physical world and create artifacts (ergotic), or to learn from the environment through tactile or haptic exploration (epistemic). Gonzalez Rothi et al. (1991) suggest the existence of different neural mechanisms for imitating either meaningful or meaningless actions. Meaningless and meaningful gestures can be imitated using a direct route that connects the visual analysis to the innervatory patterns, whereas meaningful gestures can only be imitated using the semantic route, which comprises different processing stages.

Independently of the kind of gesture, meaningful or not, higher cognitive areas require a sophisticated representation of the visual content. Furthermore, this representation must be invariant to a number of transformations caused by varying viewing angles, different scene configurations, light sources and deformations in form. In earlier work we proposed a model of the primary visual cortex that makes use of known cortical properties to perform a spatial-to-temporal transformation into a representation invariant to rotation, translation and scale. We showed that the dense excitatory local connectivity found in the primary visual cortex of the mammalian brain can play a specific role in the rapid and robust transformation of spatial stimulus information into a so-called Temporal Population Code, or TPC (Wyss et al., 2003a).

In this study, we propose to apply the high-capacity encoding provided by the TPC to a gesture recognition task. We want to evaluate whether the performance of the TPC, both in terms of recognition rate and computation speed, is feasible for real-time applications. To our knowledge, there are very few models of object recognition that take into account the anatomy of biological systems.
Most current systems try to reproduce the characteristics of human abilities in gesture recognition through purely algorithmic solutions (Maurer et al., 2005). The model proposed here is benchmarked using the TPC library developed for the iCub robot (Luvizotto et al., 2011). The task is motivated by the Rock-Paper-Scissors hand game, in which the human player competes with the robot. Here we show the initial results of the proposed system using images obtained from the robot's camera in a real game scenario. We present a detailed description of the model used in the experiments and the statistical analysis of the recognition performance using a standard clustering algorithm. The results are then compared to a predefined baseline classification rate. We finish the chapter with a discussion of the main findings of this ongoing study.

Figure 4.1: The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture which defeats that of the opponent: rock breaks scissors, scissors cut paper and paper covers rock. Unlike a truly random selection method, such as coin flipping or dice rolling, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent.

4.2 Material and methods

The model consists of a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 4.2) (Wyss et al., 2003c,a). In the first stage we segment the hand from the background using color segmentation. The input image is transformed into hue, saturation and value (HSV) space and then thresholded. The threshold values for the three parameters H, S and V can be set at runtime using a graphical interface.
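The segmentation stage can be sketched with the standard RGB-to-HSV conversion from the Python standard library. The threshold ranges below are purely illustrative placeholders, not the values tuned on the robot, and the function name is ours:

```python
import colorsys
import numpy as np

def segment_hand(rgb, h_range=(0.0, 0.1), s_range=(0.2, 0.9), v_range=(0.3, 1.0)):
    """Keep the pixels whose HSV values fall inside the given ranges.

    rgb: H x W x 3 array with channel values in [0, 255]. The ranges are
    illustrative stand-ins for the thresholds set via the GUI at runtime.
    Returns a boolean mask of candidate hand pixels.
    """
    rows, cols, _ = rgb.shape
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            h, s, v = colorsys.rgb_to_hsv(*(rgb[i, j] / 255.0))
            mask[i, j] = (h_range[0] <= h <= h_range[1]
                          and s_range[0] <= s <= s_range[1]
                          and v_range[0] <= v <= v_range[1])
    return mask
```

In a real-time setting this per-pixel loop would of course be replaced by a vectorized or compiled conversion; the sketch only illustrates the thresholding logic.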
Figure 4.2: Visual model diagram. In the first step, the input image is color filtered in the hue, saturation and value (HSV) space and segmented from the background. The segmented image is then projected to the LGN stage, where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet readout system.

The segmented hand is then projected to the V1 spiking model, whose coding concept is illustrated in Fig. 4.3. The network is an array of N x N model neurons connected to a circular neighborhood, with synapses of equal strength and instantaneous excitatory conductance. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons. The stimulus is continuously presented to the network, and the spatially integrated spreading activity of the V1 units, as a sum of their action potentials, results in the so-called TPC signal.

Figure 4.3: Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto the network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window.

In the network, each neuron is approximated using the simple spiking model proposed by Izhikevich (2003). These modeled neurons are biologically plausible and as computationally efficient as integrate-and-fire models. Relying on only four parameters, the model can reproduce the spiking behavior of known types of cortical neurons (Izhikevich, 2004), using a system of ordinary differential equations of the form:

v′ = 0.04v² + 5v + 140 − u + I   (4.1)

u′ = a(bv − u)   (4.2)

with the auxiliary after-spike resetting:

if v ≥ 30 mV, then v ← c and u ← u + d   (4.3)

Here, v and u are dimensionless variables, a, b, c and d are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, and ′ = d/dt, where t is the time. The parameter a describes the time scale of the recovery variable u. The parameter b describes the sensitivity of the recovery variable u to the sub-threshold fluctuations of the membrane potential v. The parameter c accounts for the after-spike reset value of v caused by the fast high-threshold K+ conductances, and d for the after-spike reset of the recovery variable u caused by the slow high-threshold Na+ and K+ conductances. The mathematical analysis of the model can be found in Izhikevich (2006).

The excitatory input I in (4.1) consists of two components: first, a constant driving excitatory input gi, and second, the synaptic conductances given by the lateral interaction of the units gc(t). So

I(t) = gi + gc(t)   (4.4)

For the simulations, we used the parameters suggested by Izhikevich (2004) to reproduce regular spiking (RS) behavior (Fig. 4.4).
All the parameters used in the simulations are summarized in Table 4.1.

Figure 4.4: Computational properties of the integrate-and-fire neuron used in the simulations, showing spike-frequency adaptation. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz).

In our model each unit is characterized by an orientation selectivity angle φ ∈ {0, π/6, 2π/6, ..., π} and a two-dimensional vector x ∈ R² specifying the center location of its receptive field within the input plane. Each unit in the network is thus defined by u(x, φ), which we envision as approximating the functional notion of a cortical column. The receptive field of each unit is approximated using the second derivative of an isotropic Gaussian filter G_σ. The input image is convolved with G_σ to reproduce the V1 receptive field's characteristics for orientation selectivity. The input stimulus is a retinal gray scale image that fills the whole visual field. The convolution with ∂²_x(φ) G_σ is computed by twice applying a sampled first derivative of a Gaussian (σ = 0.7, size 5×5 pixels) over the six orientations φ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold T_i ∈ [0, 1], where values above T_i are set to E_x. Each orientation-selective output is projected onto a specific network layer. The lateral connectivity between V1 units is exclusively excitatory with strength w. A unit u_a(x, φ) connects with u_b if all of the following conditions are met:

1. they belong to the same population: φ_a = φ_b;
2. they have different center positions: x_a ≠ x_b;
3. they lie within a region of a certain radius: ‖x_b − x_a‖ < r.

In our model we set the connectivity radius r to 11 units, as used in previous TPC studies (Luvizotto et al., 2011). The lateral synapses are of equal strength w and the transmission delays τ_a are proportional to ‖x_b − x_a‖, with 1 ms per cell.
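The connection conditions and the distance-proportional delays can be sketched as follows; the (x, y, φ) triple representation of a unit and the function names are illustrative simplifications.

```python
# Sketch of the lateral connectivity rule and the distance-proportional
# transmission delays (1 ms per cell). Units are represented as (x, y, phi)
# triples for illustration; the thesis uses r = 11 units.
import math

def connects(unit_a, unit_b, radius=11):
    """True iff a projects to b: same orientation population, different
    centre position, and within the connectivity radius."""
    (xa, ya, pa), (xb, yb, pb) = unit_a, unit_b
    same_layer = pa == pb                     # same orientation population
    distinct = (xa, ya) != (xb, yb)           # different centre position
    dist = math.hypot(xb - xa, yb - ya)
    return same_layer and distinct and dist < radius

def delay_ms(unit_a, unit_b):
    """Transmission delay proportional to Euclidean distance, 1 ms/cell."""
    (xa, ya, _), (xb, yb, _) = unit_a, unit_b
    return math.hypot(xb - xa, yb - ya)

a = (0, 0, 0.0)
print(connects(a, (3, 4, 0.0)), delay_ms(a, (3, 4, 0.0)))  # True 5.0
print(connects(a, (3, 4, 0.5)))                            # False: other layer
```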
The TPC is generated by summing the network's activity over a time window of 64 ms. Finally, the output vectors from the different orientation layers are summed, forming the TPC vector. In this way the temporal representation of the TPC is naturally invariant to position. Rotation invariance can, in principle, be approached by increasing the number of orientation-selective layers, reaching a fully invariant architecture in the limit.

Wavelet Readout

The wavelet transform is a spectral analysis technique that uses variable-sized regions as the kernel of the transformation. In contrast to the standard Short Time Fourier Transform (STFT), the wavelet transform does not use a time-frequency region but rather a time-scale region (Mallat, 1998). One major advantage is the ability to perform local analysis with optimal resolution in both the time and frequency domains, rendering a time-frequency representation (Fig. 4.5). In the model presented here, we used a neuronal implementation of a Haar wavelet transform. This circuit has been recently proposed in a TPC study (Luvizotto et al., 2012c) and is detailed in chapter 2. In this circuit, the TPC signal with N = 64 samples (sampling rate of 1 kHz) passes through a set of filters which slice the frequency spectrum into a desired number of bands, i.e., the approximation and detail levels. The approximations are the high-scale, low-frequency components of the signal spectrum; the details are the low-scale, high-frequency components. At each level l of the filtering process a new frequency band emerges, represented by N/2^l wavelet coefficients. A dense representation is achieved by selecting the coefficients that carry the greatest amount of information needed to reconstruct the signal, without unwanted redundancies and artifacts.
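The dyadic structure of this decomposition, with N/2^l coefficients at level l, can be illustrated with a minimal (unnormalised) Haar filter bank. This is a sketch of the principle only, not of the neuronal circuit of chapter 2.

```python
# Minimal sketch of the dyadic Haar decomposition of a 64-sample TPC
# signal: at each level the (unnormalised) approximation is the pairwise
# mean and the detail the pairwise half-difference, halving the count.

def haar_level(signal):
    """One Haar filtering step: returns (approximation, detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]   # low-pass, high-scale part
    detail = [(a - b) / 2 for a, b in pairs]   # high-pass, low-scale part
    return approx, detail

def haar_decompose(signal, levels=5):
    """Iterate to the requested level; keep the details of every level."""
    details, approx = [], list(signal)
    for _ in range(levels):
        approx, d = haar_level(approx)
        details.append(d)
    return approx, details

tpc = [float(i % 8) for i in range(64)]        # toy 64-sample signal
a5, ds = haar_decompose(tpc, levels=5)
print(len(a5), [len(d) for d in ds])           # 2 [32, 16, 8, 4, 2]
```

The coefficient counts 32, 16, 8, 4, 2 follow the N/2^l rule for N = 64, matching the band structure of Fig. 4.5.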
Stimulus set

The stimulus set used in the experiment consists of the three hand gesture classes used in the Rock-Paper-Scissors game (Fig. 4.6).

Figure 4.5: Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with a frequency range equal to the difference in frequency between the current and previous approximation levels.

For each class, 40 stimuli with natural variations of the hand's posture are used. There are 4 subjects in total. In the experiments, 75% of the images are used for training and the remaining 25% as the classification set. Given the compact representation of the TPC, the training set composed of 90 pictures can be generated offline and stored in memory (only 20 Kb) for online recognition.

Variable  Description                      Value
N         network dimension                96x72 neurons
a         scale of the recovery            0.02
b         sensitivity of the recovery      0.2
c_rs      after-spike reset value of v     -65
d_rs      after-spike reset value of u     8
v         membrane potential               -70
u         membrane recovery                -16
g_i       excitatory input conductance     20
T_i       minimum V1 input threshold       0.4
r         lateral connectivity radius      11 units
w         synapse strength                 0.4

Table 4.1: Parameters used for the simulations.

Figure 4.6: The stimulus classes (Paper, Rock, Scissors) used in the experiments. Here 12 exemplars of each class are shown.

Cluster algorithm

For the classification we used the following algorithm.
The network's responses to stimuli from C stimulus classes S_1, S_2, ..., S_C are assigned to C response classes R_1, R_2, ..., R_C of the training set, yielding a C × C hit matrix N(S_α, R_β), whose entries denote the number of times that a stimulus from class S_α elicits a response in class R_β. Initially, the matrix N(S_α, R_β) is set to zero. For each response r ∈ S_α, we calculate the mean Euclidean distance of r to the responses r′ ≠ r elicited by stimuli of class S_γ:

ρ(r, S_γ) = ⟨ρ(r, r′)⟩_{r′ elicited by S_γ}    (4.5)

where ⟨·⟩ denotes the average over the temporal Euclidean distances ρ(r, r′) between r and r′. The response r is classified into the response class R_β for which ρ(r, S_β) is minimal, and N(S_α, R_β) is incremented by one. The overall classification ratio, in percent, is calculated by summing the diagonal of N(S_α, R_β) and dividing by the total number of elements in the classification set R. We chose the same metric used in previous studies to establish a direct comparison between the results over different scenarios.

4.3 Results

In the first part of the experiments, we established a classification baseline against which to compare and evaluate the classification performance provided by the TPC. We define the baseline by applying the cluster algorithm (see section 4.2 for details) to the pixel intensity values of the gray scale images used in the experiments. In this situation, an image of 96x72 pixels is represented by a vector of size 6912. Using 75% of the images as the training set, the results show a correct classification ratio of 70% (Fig. 4.7). In the next step, we ran the simulations using the TPC to compare the classification, speed and compression ratio of the model against the baseline. In our simulations we used the wavelet readout circuit at approximation level 5, or Dc5, to capture the intrinsic geometric properties of the stimulus.
At this level, the signal ranges from 15 Hz to 31 Hz and is represented by only 2 wavelet coefficients. As we have 6 signals provided by the different orientation layers, the total size of the readout signal is 12. The simulations were performed with the parameters of Table 4.1, using a TPC library developed in C++. The performance shows a maximum of 93% correct classification.

Figure 4.7: Classification hit matrix for the baseline. In the classification process, the correlation between the classification set (25% of the stimuli) and the training set (75%) is calculated. A stimulus is assigned to the class with the smallest mean correlation to the respective class in the training set.

The misclassifications happen mostly with class one, i.e. the Rock gesture (Fig. 4.8). This gesture intersects to a great extent with the other two and is therefore more prone to misclassification. This classification ratio is compatible with previous TPC work, for instance on face and handwritten character recognition (Wyss et al., 2003c; Luvizotto et al., 2011). On a 2.67 GHz Core i7 processor, the processing time per hand is about 0.3 s. In comparison to the baseline, the TPC representation shows a gain in correct classification of 23% while reducing the size of the representation by a factor of 576, i.e. from 6912 pixels at a resolution of 96x72 to 12 Dc5 wavelet coefficients.

Figure 4.8: Classification hit matrix. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.
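The cluster algorithm of section 4.2 (Eq. 4.5) amounts to a nearest-mean-distance classifier over the training responses. A minimal sketch on toy two-dimensional vectors; the class names, data and function names are illustrative, not thesis data.

```python
# Sketch of the cluster algorithm: each test response is assigned to the
# class whose training responses lie at minimal mean Euclidean distance
# (Eq. 4.5), filling the hit matrix N(S_alpha, R_beta). Toy data only.
import math

def mean_dist(r, prototypes):
    return sum(math.dist(r, p) for p in prototypes) / len(prototypes)

def classify(test_set, train_by_class):
    """test_set: list of (true_class, vector); train_by_class: class -> vectors."""
    classes = sorted(train_by_class)
    hits = {(a, b): 0 for a in classes for b in classes}
    for true_cls, r in test_set:
        pred = min(classes, key=lambda c: mean_dist(r, train_by_class[c]))
        hits[(true_cls, pred)] += 1
    correct = sum(hits[(c, c)] for c in classes)   # diagonal of the hit matrix
    return hits, 100.0 * correct / len(test_set)

train = {"rock": [(0.0, 0.0), (0.1, 0.1)], "paper": [(1.0, 1.0), (0.9, 1.1)]}
test = [("rock", (0.05, 0.0)), ("paper", (1.0, 0.9)), ("rock", (0.2, 0.1))]
hits, ratio = classify(test, train)
print(ratio)   # 100.0 on this separable toy set
```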
4.4 Discussion

We have shown that in a cortical model the representation of a static visual stimulus can be generated using the temporal dynamics of a neuronal population. This temporal encoding has a specific signature in the phase relationships between the active neurons of the underlying encoding substrate. The TPC is a relevant hypothesis on the encoding of visual information: it is consistent with known cortical characteristics, both anatomically and physiologically. In addition, the final representation given by the wavelet circuit compresses the signal, removing redundant and non-orthogonal components and greatly improving the trade-off between performance and representation size.

In conclusion, simple hand gestures can be reliably classified using the TPC approach. With the highly optimized C++ library we developed for the experiments, the computational time needed for the TPC computation is entirely feasible for the recognition of the gestures. Hence, the TPC has proved robust and fast enough as a gesture recognition mechanism for the iCub without affecting or compromising game play with the robot.

Chapter 5
A framework for mobile robot navigation using a temporal population code

Recently, we have proposed that the dense local and sparse long-range connectivity of the visual cortex accounts for the rapid and robust transformation of visual stimulus information into a temporal population code, or TPC. In this paper, we combine the canonical cortical computational principle of the TPC model with two other systems: an attention system and a hippocampus model. We evaluate whether the TPC encoding strategy can be efficiently used to generate a spatial representation of the environment. We benchmark our architecture using stimulus input from a real-world environment.
We show that the mean correlation between the TPC representations at two different positions in the environment has a direct relationship with the distance between these locations. Furthermore, we show that this representation can lead to the formation of place cells. Our results suggest that the TPC can be efficiently used in a high-complexity task such as robot navigation.

5.1 Introduction

Biological systems have an extraordinary capacity for performing simple and complex tasks in a fast and reliable way. These capabilities are largely due to the robust processing of sensory information in an invariant manner. In the mammalian brain, the representation of dynamic scenes in the visual world is processed invariantly to a number of deformations such as perspective, lighting conditions, rotation and scale. This robustness leads to successful strategies for optimal foraging. For instance, a foraging animal has to explore the environment, search for food, escape from possible predators and produce an internal representation of the world that allows it to navigate successfully through it. Performing navigation with mobile robots using only visual input from a camera is a complex task (Bonin-Font et al., 2008). Recently, a number of models have been proposed to solve it (Jun et al., 2009; Koch et al., 2010). Unlike the visual system, most of these methods rely on brute-force feature matching, which is computationally expensive and stimulus dependent. In contrast, we have proposed a model of the visual cortex that can generate reliable invariant representations of visual information. The temporal population code, or TPC, is based on the unique anatomical characteristics of the neo-cortex: dense local connectivity and sparse long-range connectivity (Binzegger et al., 2004).
Physiological studies have estimated that only a few percent of the synapses that make up cortical circuits originate outside the local volume (Schubert et al., 2007; Liu et al., 2007; Nauhaus et al., 2009). The TPC proposal has demonstrated the ability of densely coupled networks to rapidly encode the geometric organization of the population response, for instance as induced by external stimuli, into a robust and high-capacity code (Wyss et al., 2003a,c). It has been shown that the TPC provides an efficient encoding and can generalize to navigation tasks in virtual environments. In a previous study, the TPC model applied to a simulated robot in a virtual arena could account for the formation of place cells (Wyss et al., 2006). It has also been shown that the TPC generalizes to realistic tasks such as handwritten character classification (Wyss et al., 2003a). The key parameters that control this encoding, and the invariances it can capture, are the topology and the transmission delays of the laterally coupled neurons that constitute an area. In this respect the TPC emphasizes that the neo-cortex exploits specific network topologies, as opposed to the randomly connected networks found in examples of so-called reservoir computing (Lukoševičius and Jaeger, 2009). In this paper, we evaluate whether the TPC can be generalized to a navigation task in a natural environment. In particular, we assess whether it can be used to extract spatial information from the environment. To achieve this generalization we combine the TPC, as a model of the ventral visual system, with a feedforward attention system and a hippocampus-like model of episodic memory. The attention system uses early simple visual features to determine which salient regions in the scene the TPC framework will encode. The spatial representation is then generated by a model of the hippocampus that develops place cells.
Our model has been developed to mimic the foraging capabilities of a rat. Rodents are capable of optimally exploring the environment and its resources, using distal visual cues for orientation. The images used in the experiments were acquired in a real-world environment by the camera of a mobile robot called the Synthetic Forager (SF) (Rennó-Costa et al., 2011). Our results show that the combination of the TPC with a saliency-based attention mechanism can be used to develop robust place cells. These results directly support the hypothesis that the computational principle proposed by the TPC leads to a spatio-temporal coding of visual features that can reliably account for position information in a real-world environment. Unlike previous TPC experiments with navigation, here we stress the capabilities of position representation in a real-world environment.

In the next section we describe the three systems we use in the experiments and detail how the environment was set up and how the images were generated. In the following section we present the main results obtained in the experiments, and we finish by discussing the core findings of the study.

5.2 Methods

Architecture overview

The architecture presented here comprises three areas (Fig. 5.1): an attention system, an early visual system and a model of the hippocampus. The attention system exchanges information with the visual system in a bottom-up fashion. It receives bottom-up input of simple visual features and provides saliency maps (Itti et al., 1998). The saliency maps determine the subregions that are rich in features such as colors, intensity and orientations that pop out from the visual field and will therefore be encoded in the visual system. The visual system also provides input to the hippocampal model in a bottom-up way.
The subregions of interest determined by the attention system are translated into temporal representations that feed the hippocampal input stage. The temporal representation of the visual input is generated using the so-called Temporal Population Code, or TPC (Wyss et al., 2003a,c; Luvizotto et al., 2011). The hippocampus model translates the visual representation of the environment into a spatial representation based on the activity of place cells (de Almeida et al., 2009a; Rennó-Costa et al., 2010). These granule cells respond only to one or a few positions in the environment. In the following sections we present the details and modeling strategies used for each area.

Figure 5.1: Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by the formation of place cells that show high firing rates whenever the robot is in a specific location in the environment.

Attention System

In our context, attention is essentially the process of focusing sensory resources on a specific point in space. We propose a deterministic attention system that is driven by the local points of maximum saliency in the input image. To extract these subregions of interest, the attention system uses the saliency-based model proposed and described in detail in (Itti et al., 1998).
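The greedy selection of subregion centers used by the attention system, 21 local saliency maxima accepted with a minimum mutual distance of 20 pixels, might be sketched as follows. The flat candidate list and function name are illustrative simplifications of the search over the saliency map.

```python
# Sketch of the attention system's greedy peak selection: take local maxima
# of the saliency map in descending order of salience, rejecting any
# candidate closer than min_dist (20 px in this chapter) to an already
# accepted peak. Candidates are (salience, x, y) tuples for brevity.
import math

def select_peaks(candidates, k=21, min_dist=20.0):
    chosen = []
    for sal, x, y in sorted(candidates, reverse=True):
        if all(math.hypot(x - cx, y - cy) >= min_dist for _, cx, cy in chosen):
            chosen.append((sal, x, y))
        if len(chosen) == k:
            break
    return chosen

cands = [(0.9, 10, 10), (0.8, 15, 12), (0.7, 100, 100), (0.6, 50, 50)]
print(select_peaks(cands, k=3))  # the 0.8 peak is suppressed: too close
```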
The system makes use of image features extracted by the early primate visual system, combining multi-scale image features into a topographical saliency map (bottom-up). The output saliency map has the same aspect ratio as the input image. To select the center points of the subregions, we calculate the points of local maxima p_1, p_2, ..., p_21 of the saliency map. The minimum accepted distance between peaks in the local maxima search is 20 pixels. This competitive process among the salient points can be interpreted as a top-down push-pull mechanism (Mathews et al., 2011). A subregion of size 41 × 41 is determined by a neighborhood of 20 pixels surrounding a local maximum p. Once a region is determined, the points inside this area are not used in the search for the next p_{k+1}. This constraint avoids processing redundant neighboring regions with overlaps larger than 50% in area. For the experiments, the first 21 local maxima are used by the attention system. These 21 points are the centers of the subregions of interest, which are cropped from the input image and sent to the TPC network. We used a Matlab implementation¹ of the saliency maps (Itti et al., 1998). In the simulations we use the default parameters, except for the map width, which is set to the same size as the input image: 967 pixels.

Visual System

The early visual system consists of two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 5.2) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray scale image that covers the visual field. The approximation of the receptive field's characteristics is performed by convolving the input image with a difference-of-Gaussians (DoG) operator followed by a positive half-wave rectification.
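A sketch of this LGN front end, assuming a 7×7 kernel with σ = 1 for the narrow Gaussian and a 4:1 sigma ratio as used in the simulations below; normalising each Gaussian to unit sum is an illustrative choice, and the function names are hypothetical.

```python
# Sketch of the LGN stage: a 7x7 difference-of-Gaussians kernel (4:1 sigma
# ratio, sigma = 1 for the narrow centre Gaussian) and the positive
# half-wave rectification applied to the filter output.
import math

def gaussian_kernel(size, sigma):
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    s = sum(map(sum, k))
    return [[v / s for v in row] for row in k]   # normalise to unit sum

def dog_kernel(size=7, sigma=1.0, ratio=4.0):
    g1 = gaussian_kernel(size, sigma)            # narrow centre Gaussian
    g2 = gaussian_kernel(size, sigma * ratio)    # broad surround Gaussian
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]

def rectify(x):
    """Positive half-wave rectification, applied after the DoG."""
    return max(0.0, x)

k = dog_kernel()
print(k[3][3] > 0, k[0][0] < 0)   # excitatory centre, inhibitory surround
```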
The positively rectified DoG operator resembles the properties of ON center-surround LGN cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and performs an edge enhancement on the input image. In the simulations we use a kernel ratio of 4:1, with a size of 7×7 pixels and variance σ = 1 (for the smaller Gaussian).

¹ The scripts can be found at: http://www.klab.caltech.edu/harel/share/gbvs.php

Figure 5.2: Visual model overview. The salient regions are detected and cropped from the image provided by the camera. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference-of-Gaussians (DoG) operator that approximates the properties of the receptive field of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and its spiking activity is summed over a specific time window, rendering the Temporal Population Code.

The LGN signal is projected to the V1 spiking model, which uses the same network based on Izhikevich neurons presented in chapters 2 and 4. For the simulations, we used the parameters suggested in (Izhikevich, 2004) to reproduce regular spiking (RS) behavior. All the parameters used in the simulations are summarized in Table 5.1. The lateral connectivity between V1 units is exclusively excitatory with strength w. A unit u_a(x) connects with u_b if they have different center positions, x_a ≠ x_b, and if they are within a region of a certain radius, ‖x_b − x_a‖ < r. The temporal population code is generated by summing the network activity in a time window of 128 ms. Thus a TPC is a vector with the spiking count of the network for each time step. In discrete time, all the equations are integrated with Euler's method using a temporal resolution of 1 ms.
Variable  Description                                    Value
N         network dimension                              41x41 neurons
a         scale of the recovery                          0.02
b         sensitivity of the recovery                    0.2
c_rs      after-spike reset value of v for RS neurons    -65
c_bs      after-spike reset value of v for BS neurons    -55
v         membrane potential                             -70
u         membrane recovery                              -16
g_i       excitatory input conductance                   20
T_i       minimum V1 input threshold                     0.4
r         lateral connectivity radius                    7 units
w         synapse strength                               0.3

Table 5.1: Parameters used for the simulations.

Hippocampus model

The hippocampus model is an adaptation of a recent study on rate remapping in the dentate gyrus (de Almeida et al., 2009a,b; Rennó-Costa et al., 2010). In our implementation, the granule cells receive excitatory input from the grid cells of the entorhinal cortex. The place cells that are active for a given position in the environment are then determined by an E%-max winner-take-all (WTA) rule based on the percentage of maximal suprathreshold excitation. The excitatory input received by the i-th place cell from the grid cells is given by:

I_i(r) = Σ_{j=1}^{n} W_ij G_j(r)    (5.1)

where W_ij is the synaptic weight of each input. In our implementation, the weights are either random values in the range [0, 1] (the maximum bin size among normalized histograms) or 0 in case of no connection. We use the TPC histograms as the information provided by the grid cells G_j(r). The activity of the i-th place cell is given by:

F_i(r) = I_i(r) H(I_i(r) − (1 − k) max_j I_j(r))    (5.2)

where H is the Heaviside step function. The constant k varies in the range of 5% to 15% (de Almeida et al., 2009b). It is referred to as E%-max and determines which cells will fire. Specifically, the rule states that a cell fires if its feedforward excitation is within E% of that of the cell receiving maximal excitation.
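A minimal sketch of Eqs. 5.1–5.2 with toy weights and grid inputs (the real model draws random weights and uses 100 cells). H is taken as the Heaviside step, so a cell fires at rate I_i whenever I_i is within the fraction k of the maximal excitation.

```python
# Sketch of the E%-max winner-take-all rule (Eqs. 5.1-5.2): a place cell
# fires, at a rate equal to its excitation, iff its feedforward excitation
# I_i is within E% (= k) of the maximally excited cell. Toy values only.

def excitation(weights, grid):
    """Eq. 5.1: I_i(r) = sum_j W_ij * G_j(r)."""
    return [sum(wij * gj for wij, gj in zip(w_row, grid)) for w_row in weights]

def e_percent_max(I, k=0.10):
    """Eq. 5.2 with H the Heaviside step: fire iff I_i >= (1 - k) * max(I)."""
    threshold = (1.0 - k) * max(I)
    return [Ii if Ii >= threshold else 0.0 for Ii in I]

W = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # toy weight matrix, 3 cells
G = [0.8, 0.7]                              # toy grid-cell (histogram) input
I = excitation(W, G)                        # feedforward excitation per cell
print(e_percent_max(I, k=0.10))             # only cells within 10% of the max fire
```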
Experimental environment

For the experiments we produced a dataset of pictures from an indoor square area of 3 m × 3 m. The area was divided into a 5x5 grid of 0.6 m × 0.6 m square bins, comprising 25 sampling positions (Fig. 5.3a). The distance to objects and walls ranged from 30 cm to 5 m. To emulate the sensory input we approximated the rat's view at each position with 360-degree panoramic pictures. Rats use distal visual cues for orientation (Hebb, 1932), and such information is widely available to the rat given the spaced positioning of its eyes, resulting in a wide field of view and low binocular vision (Block, 1969). The pictures were built using the software Hugin (D'Angelo, 2010) to process 8 pictures with equally spaced orientations (45-degree steps) for each position. The images were acquired using the SF robot (Rennó-Costa et al., 2011). Each indoor image was segmented into 21 subregions of 41 × 41 pixels by the attention system as described earlier. The maximum overlap among the subregions is 50% (Fig. 5.3b).

Image representation using TPC

For every image, 21 TPC vectors are generated, i.e. one for each subregion selected by the attention system. In total, we acquired 25 panoramic images of the environment, leading to a matrix of 525 TPC vectors (25x21). We cluster the TPC matrix into 7 classes using the k-means clustering algorithm. A panoramic image of the environment is therefore represented by the cluster distribution of its TPC vectors, using a histogram of 7 bins. The histograms are used as the representation of each position in the environment (Fig. 5.3b) and are the inputs to the hippocampus model.
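The position representation, 21 TPC vectors quantised into 7 clusters and summarised as a cluster-label histogram, can be sketched with a minimal Lloyd's k-means on toy two-dimensional vectors (the thesis clusters the full matrix of 525 64-sample TPC vectors; function names are illustrative).

```python
# Sketch of the position representation: the 21 TPC vectors of a panorama
# are assigned to k = 7 clusters (minimal Lloyd's k-means, toy 2-D vectors)
# and the panorama is summarised by the 7-bin cluster-label histogram.
import math
import random

def kmeans(vectors, k=7, iters=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda c: math.dist(v, centres[c]))].append(v)
        centres = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres

def histogram(vectors, centres):
    """Cluster-label histogram summarising one panorama."""
    h = [0] * len(centres)
    for v in vectors:
        h[min(range(len(centres)), key=lambda c: math.dist(v, centres[c]))] += 1
    return h

tpcs = [(float(x), float(y)) for x in range(3) for y in range(7)]  # 21 toy vectors
centres = kmeans(tpcs, k=7)
print(histogram(tpcs, centres))   # 7 bins summing to 21
```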
Figure 5.3: Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m × 0.6 m square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors using a histogram of 7 bins.

5.3 Results

In the first experiment, we explored whether the TPC response preserves bounded invariance in the observation of natural scenes, i.e. whether the representation is tolerant to certain degrees of visual deformation without losing specificity. More specifically, we investigated whether the correlation between the TPC histograms of visual observations at two positions can be associated with their distance. We calculate the mean pairwise correlation among the TPC histograms of the 25 sampling positions used for image acquisition in the environment. We sort the correlation values into distance intervals of 0.6 m, the same intervals used for sampling the environment. Our results show that the TPC transformation can successfully account for position information in a real-world scenario. The mean correlation between two positions in the environment decays monotonically with increasing distance (Fig. 5.4). As expected, the error in the correlation measure increases with distance.

Figure 5.4: Pairwise correlation of the TPC histograms in the environment.
We calculate the correlation among all possible combinations of two positions in the environment and average the correlation values into distance intervals, according to the sampling distances used in the image acquisition.

This result suggests that a linear estimate of distance can be produced from the correlation values calculated using the TPC histograms. In the second experiment, we use the TPC histograms as input for the hippocampus model to acquire place cells. Neural representations of position, as observed in place cells, present a quasi-Gaussian transition between inactive and active locations, which makes bounded invariance a key property for a smooth reconstruction of position from visual input as observed in nature. Previous work has shown that the conjunction of TPC responses to synthetic figures in a virtual environment preserves bounded invariance and that, with the use of reinforcement learning, place cells could emerge from these responses (Wyss and Verschure, 2004a). We perform the simulations using 100 cells with random weights. The results show that of the 100 cells used, 12 have significant place-related activity (Fig. 5.5). These cells show high activity in specific areas of the environment, the so-called place fields. Thus the TPC representation of natural scenes can successfully lead to the formation of place fields.

Figure 5.5: Place cells acquired by the combination of TPC-based visual representations and the E%-max WTA model of the hippocampus.

5.4 Discussion

We addressed the question of whether we could combine the properties of the TPC with an attention system and a hippocampus model to reliably provide position information in a navigation scenario.
We have shown that in our cortical model, the TPC, the representation of a static visual stimulus is reliably generated from the temporal dynamics of neuronal populations. The TPC is a relevant hypothesis on the encoding of sensory events given its consistency with both cortical anatomy and contemporary physiology. The temporal population code conveys a complete yet dense description of the stimulus. The encoding of visual information is performed over a subset of regions of high saliency that pop out from the visual field. Combining the TPC encoding strategy with a hippocampus model, we have shown that this approach provides a robust representation of position in a natural environment. Panoramic images have been proposed in other studies of robot navigation (Zeil et al., 2003). Similarly, the TPC strategy proposed here also makes use of pixel differences between panoramic images to provide a reliable cue to location in real-world scenes. In both cases, differences in the direction of illumination and environmental motion can often seriously degrade the position representation. In comparison, however, the representation provided by the TPC is extremely compact: a subregion of 41x41 pixels is transformed into a TPC vector of 64 points, and the whole visual field, an image of 96,000 pixels, is finally represented by a histogram of 7 bins. In the specific benchmark evaluated here, the model showed that distance information could be reliably recovered from the TPC. Importantly, we showed that the model can produce reliable position information based on the mean correlation of TPC-based histograms. Furthermore, this method does not use any mechanism specific to the stimuli used in the experiments; it is fully generic. The processing time needed for classifying an input of 41x41 pixels was about 50 ms using an offline Matlab implementation.
Therefore, we conclude that our model could be efficiently used in a real-time task. Hence, our model shows that the brain might use the smooth degradation of natural images represented by TPCs as an estimate of distance.

Acknowledgment

This work was supported by EU FP7 projects GOAL-LEADERS (FP7-ICT-97732) and EFAA (FP7-ICT-270490).

Chapter 6

Conclusion

In this dissertation, we have investigated the computational principles that the early visual system employs in order to provide an invariant representation of the visual world. Our starting point was the Temporal Population Code (TPC) model proposed by Wyss et al. (2003b). We first investigated a central open issue: how the intrinsic elements that form the temporal representation provided by the TPC could be accessed and read out by further cognitive areas involved in visual processing. Indeed, since the introduction of the TPC, a neurobiologically compatible readout stage had been a persistent problem. In this thesis, we solved this issue by proposing a novel and biologically consistent readout system based on the Haar wavelet transform. We showed that the efficient representation provided by the TPC signal can be decoded by further areas through a subset of wavelet coefficients. The implicit concept behind the proposed readout is that the brain uses different frequency bands to convey multiple types of information multiplexed into a temporal signal (Buzsáki, 2006). Indeed, recent results have shown that distinct network activity observed in primary visual cortex individually signals different properties of the input stimulus, such as motion direction and speed and the orientation of object contours, at the same time (Onat et al., 2011). Therefore, the decoder must be able to select the suitable frequency range in order to access the information needed in each situation.
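The band-selective readout idea can be illustrated with a plain Haar decomposition of a population signal. This is a minimal sketch of the principle, not the neuronal wavelet circuit itself; the signal, the number of levels and the choice of band are assumptions.

```python
# Sketch: split a TPC signal into Haar frequency bands and keep one band.
import numpy as np

def haar_dwt(signal, levels):
    """Return [detail_1 (finest), ..., detail_L, approx_L] of an
    orthonormal Haar decomposition; signal length must be 2**levels * k."""
    x = np.asarray(signal, dtype=float)
    bands = []
    for _ in range(levels):
        approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass half
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass half
        bands.append(detail)
        x = approx
    bands.append(x)
    return bands

# A 64-point signal (the TPC vector length mentioned in the text)
# reduced to a single coarse band of 8 coefficients:
tpc = np.sin(np.linspace(0, 4 * np.pi, 64))
bands = haar_dwt(tpc, levels=3)
coarse = bands[-1]        # the slow frequency band a decoder might select
```

Because the Haar basis is orthonormal, the bands partition the signal energy, so a decoder can pick the frequency range carrying the information relevant to the task while discarding the rest.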
We investigated this idea through a systematic spectral analysis of the TPC signal using a simple, yet powerful, Haar basis, which proved to be a feasible choice providing robust and reliable decoding. In comparison with the original TPC readout, the wavelet circuit proposed here showed the same performance in the object recognition experiments. Although for bursting neurons the wavelet readout is slower than the original TPC readout, it could extract the key geometric properties of synthetic shapes without compromising the speed of encoding. The fine details of the mechanism, such as its integration time constant and its wavelet resolution level, depend on the network dynamics. How higher areas in the visual pathways dynamically control the readout frequency range remains a subject for further investigation. This interaction between bottom-up sensory information and top-down modulation could be related to an attentional mechanism that would synchronize the inhibitory neurons in the neuronal wavelet circuit, similar to the interneuron gamma mechanism (Tiesinga and Sejnowski, 2009b). These findings make a strong point in terms of how the neocortex deals with sensory information. Our results show that stimuli can be represented and further classified using a strongly compressed temporal signal based on wavelet coefficients. This strategy allows multiplexing high-level information from the visual input and flexible storage and retrieval of information by a memory system. In the next part of this dissertation, we explored the TPC as a candidate for a canonical computational mechanism used by the neocortex to provide an invariant representation of visual information in different scenarios. In chapter 3, we specifically addressed this question using human faces as a test case. Indeed, we investigated whether the TPC model could be used to provide a real-time face recognition system for the iCub robot.
In the specific benchmark evaluated here using human faces, our results confirmed that the representation of a static visual stimulus can be generated based on the temporal activity of neuronal populations. One of the main issues solved was the reduction of the computational cost due to the recurrent connections in the large-scale spiking neural network. We optimized the TPC architecture with respect to the lateral connectivity among the modeled neurons using 2D convolutions performed in the frequency domain. Also, instead of using two different filters to simulate the characteristics of the LGN and V1 receptive fields, we redesigned the model, merging these two stages into a single step based on derivatives of Gaussians. With this strategy we could achieve optimal orientation-selective edge enhancement in a single filtering operation, which improves the performance considerably. Although the scenario was built around human faces, the TPC network used is totally generic: the model did not rely on any pre-built dictionary of features related to any specific stimulus set. The developed TPC library was integrated into the brain-based cognitive architecture DAC (Verschure et al., 2003). The tightly coupled layered structure of DAC consists of reactive, adaptive and contextual memory mappings. At each level, more complex and memory-dependent mappings from sensory states are built depending on the internal state of the agent. Here the TPC plays a major role in building a compact representation of sensory and form primitives. In this sense, the TPC/DAC integration can serve as a unified perception system for the humanoid robot iCub. Indeed, we further explored the capabilities and generality of this architecture in an additional scenario to recognize gestures. The experiments were motivated by the game Rock-Paper-Scissors, where a human plays with the robot through a set of three hand gestures that symbolize these three elements.
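The single-step filtering described above, an oriented derivative of a Gaussian applied through a 2D convolution in the frequency domain, can be sketched as follows. The kernel size, sigma and orientation are illustrative values, and the circular (FFT) convolution here is a simplification of the actual model.

```python
# Sketch: orientation-selective edge enhancement in one filtering step,
# with the convolution carried out in the frequency domain.
import numpy as np

def gaussian_derivative_kernel(size, sigma, theta):
    """First derivative of a 2D Gaussian along orientation theta."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    u = xx * np.cos(theta) + yy * np.sin(theta)   # coordinate along theta
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return -u / sigma**2 * g                      # d/du of the Gaussian

def fft_convolve2d(image, kernel):
    """Circular 2D convolution via the FFT (output has the image size)."""
    H, W = image.shape
    K = np.fft.fft2(kernel, s=(H, W))             # zero-pad the kernel
    return np.real(np.fft.ifft2(np.fft.fft2(image) * K))

# A 41x41 input patch (the subregion size used in the text) with a
# vertical edge; theta = 0 makes the filter respond to that edge.
img = np.zeros((41, 41))
img[:, 20:] = 1.0
edges = fft_convolve2d(img, gaussian_derivative_kernel(9, 1.5, 0.0))
```

The kernel sums to zero along its oriented axis, so uniform regions produce no response while the contrast step is strongly enhanced; doing the multiplication in the frequency domain replaces the per-pixel lateral-connectivity loop with two FFTs.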
The results showed that the robot can reliably recognize gestures under realistic conditions with a precision of 92%. The only assumption at the moment is that the hands are shown against a black background; this way we can use the color information to indicate the time of stimulus onset. This promising work has not been published yet, as we are now implementing the motor behavior of the robot associated with the game. However, preliminary tests indicate that the TPC is a highly feasible solution, both in terms of performance and processing time, that allows the iCub to play Rock-Paper-Scissors with humans. This scenario will provide a rich platform to study the interaction between humans and humanoid robots. The challenge of winning may have an important impact on the engagement of humans that play the game against the robot. Thus a reliable gesture recognition system is a crucial element in the assessment of interaction and engagement. Indeed, given the speed of encoding, the computational cost, the classification performance and the consistency with cortical anatomy and physiology, the TPC has proved to be a relevant hypothesis on the encoding of sensory events. Moreover, it reinforces the main aspect of the question we want to address: the generation of a robust representation independent of the exact detailed properties of the input stimulus. Subsequently, we moved on to exploit the generality of the TPC in a completely different scenario. If the TPC is a reliable way of encoding visual sensory input in a generic way, the representation should also account for bounded invariances and therefore for a smooth reconstruction of position from visual input, as observed in nature. In chapter 5, we proposed a framework that combines the TPC with an attention model and a hippocampus model to provide position estimation for navigation tasks.
The proposed framework was built to be used in mobile robots relying purely on single-camera input. In a previous study, the TPC model was applied to a simulated robot in a virtual arena, accounting for the formation of place fields (Wyss et al., 2006). Our results here extended these achievements to a real-world situation, where images of a realistic environment were used as input to feed a fairly realistic hippocampus model (Rennó-Costa et al., 2010). The images captured by the camera were processed to emulate the sensory input of a rat: at each position in the arena, 360-degree panoramic pictures were used. In a bottom-up fashion, the attention system exchanged information with the visual system, receiving simple visual features and providing salience maps of the panoramic images (Itti et al., 1998). The saliency maps provided a map of subregions rich in features such as color, intensity and orientation that could be compactly encoded in the visual system by the TPC. The subregions of interest determined by the attention system, translated into temporal representations, fed the hippocampal input stage with the TPC signal, leading to the final representation of position translated into the activity of place cells (de Almeida et al., 2009a). This strategy, based on distal cues from panoramic images, has been proposed in other studies of robot navigation (Zeil et al., 2003). However, the compression versus performance obtained with the proposed framework is remarkable: the visual field, an image of 96,000 pixels, could be represented by a histogram of 7 bins. With such a compact representation we could achieve a monotonically decreasing correlation curve over distances in different parts of the arena. The smooth reconstruction of distance relationships indeed fulfilled the requirements of the hippocampus input for the generation of place fields.
This framework suggests that the brain might use the smooth degradation of natural images represented by TPCs to estimate distances and generate a representation of position using place cells. In summary, we proposed that a spatial-to-temporal transformation based on a Temporal Population Code could account for the generation of a robust, compact and non-redundant invariant representation of visual sensory input. We have extensively shown the capabilities of the proposed strategy over different scenarios, in real-world tasks and on different platforms ranging from humanoid and mobile robots to purely computational simulations using artificial objects. The results presented in this doctoral thesis not only strongly support the idea that time is indispensable in the construction of invariant representations of sensory inputs but also suggest a clear hypothesis of how time can be used in parallel with spatial information. In recent neuroscience studies, there is increasing interest in the field of temporal population coding and in how time can be included in models of sensory encoding, as suggested by other authors (see (Buonomano and Maass, 2009) for a comprehensive review). In this sense, we are confident that this thesis will strongly contribute to this discussion, adding novel methods and results to the fields of vision research, computational neuroscience and robotics.

Appendix A

Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction

Robots will increasingly become more integrated in our society. This will require high levels of social competence. The understanding of the core aspects required for social interaction is therefore a crucial topic in the field of robotics. In this article we propose an architecture and an experimental paradigm which allow a humanoid robot to engage humans in well-defined and quantified social interactions.
The BASSIS architecture is a four-layer structure of increasingly more complex and memory-dependent mappings from sensory states to actions. At its core, BASSIS allows the iCub robot to sustain dyadic interactions with humans using the Reactable, a novel tangible tabletop musical instrument. In this setup, the iCub receives from the Reactable a complete set of data about the objects that are on the table. This combination avoids issues regarding object categorization and provides a rich environment where humans can play music or games with the robot. We benchmark the system performance in a collaborative object manipulation and planning task using spoken language. The results show that the architecture proposed here provides reliable and complex interaction with controlled and precise action execution. We show that the iCub can successfully interact with humans, grasping and placing objects on the table with a precision of 2 cm. The software implementation of the architecture is available online.

A.1 Introduction

We are at the start of a period of radical social change where robots will enter more and more into our daily lives (Bekey and Yuh, 2007; Bar-Cohen and Hanson, 2009). The interactions between humans and robots will become more omnipresent and robots will take on more roles in our houses, offices, hospitals or schools, requiring the creation of machines socially compatible with humans (Thrun, 2004; Dautenhahn, 2007). Some studies suggest that humans prefer humanoid robots to other artificial systems in social interactions (Breazeal et al., 2006). Nowadays, this kind of machine is being developed by many research groups around the world and by a number of large corporations and smaller specialist robotics companies (Bekey and Yuh, 2007).
In Europe, the FP6 RobotCub project made a major stride forward in providing a means for integrative humanoid research via the iCub platform (Metta et al., 2008). The iCub robot has rich motor and perceptual features that provide all the core capabilities to efficiently interact with humans. In the development of experimental setups for investigating social interaction with humanoid robots, we still face many unsolved technical issues in the treatment of sensory inputs that greatly interfere with the development of the processes of motor control. The categorization of objects is a good example of a hard task which is still not completely solved by current technologies (Dickinson et al., 2009; Pinto et al., 2011). Often the interaction scenario must be simplified to an unrealistic level in order to be technically feasible. Therefore, to better understand and realize the key elements of the interaction between humans and robots, we need to overcome these technical obstacles. In this paper, we propose a setup in which humans can interact with a robot in a very controlled environment. Together with this environment, we present the technical and methodological aspects of the Biomimetic Architecture for Situated Social Intelligence Systems, called BASSIS. This architecture is based on the DAC architecture (Verschure and Althaus, 2003) and incorporates and implements a set of capabilities which allow the android to engage humans in social interactions, such as games and cognitive tasks. BASSIS allows the humanoid robot iCub to interact with humans in a controlled environment using the Reactable1, an electronic music instrument with a tangible user interface. The Reactable, conceptualized as a musical instrument (Geiger et al., 2010), provides a complete platform where humans and robots can interact, playing music or games and manipulating physical objects that are on the table.
The different properties of the objects are displayed on the table-top in a visually alluring graphical interface (Fig. A.1). In the proposed setup, the iCub is connected to the Reactable and receives precise information about the different properties of the objects placed on the table directly from the Reactable software. In this way, issues regarding object properties such as identity and position are entirely avoided and the experiments can be totally focused on the interaction itself. Another important component of BASSIS is the spoken language module that allows the human and the robot to interact using natural language. All the parameters describing the interaction can be stored in log files to be further analyzed, providing a novel architecture for benchmarking experiments in human-robot interaction.

1 The Reactable is a trademark of Reactable Systems 2009-2012. http://www.reactable.com/

Here we provide a complete description of all the components of the proposed setup. We present in detail all the modules that compose the proposed architecture. Notably, we have made all the software publicly available2. This allows the whole robotics community to benefit from the proposed architecture, to test it on their own systems and to benchmark our results. In the results section, we present a detailed analysis of the overall system performance. We benchmark the performance of the robot in a collaborative object manipulation task. We provide statistics about the precision in grasping and the position representation provided by the BASSIS architecture. We also benchmark the setup over 4 different types of interactions between humans and the iCub. The interactions are performed using the spoken language system to achieve cooperative actions. Finally, in the last section, the highlights and drawbacks of the proposed system are discussed.
A.2 Methods

BASSIS architecture overview

BASSIS is an extended version of the Distributed Adaptive Control (DAC) architecture (Verschure and Althaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly coupled layers: reactive, adaptive and contextual. At each level of organization, increasingly more complex and memory-dependent mappings from sensory states to actions are generated, dependent on the internal state of the agent. The BASSIS architecture expands DAC with many new components. It incorporates the repertoire of capabilities available for the human to interact with the humanoid robot iCub. BASSIS is divided into 4 hierarchical layers: Contextual, Adaptive, Reactive and Soma. The different modules responsible for implementing the relevant competences of the humanoid are distributed over these layers (Fig. A.2).

2 Available at: http://efaa.sourceforge.net/

Figure A.1: The iCub ready to play with the objects that are on the Reactable. The table provides an appealing atmosphere for social interactions between humans and the iCub.

The Soma and the Reactive layers are the bottom layers of the BASSIS diagram. They are in charge of coding the robot's motor primitives through the attentionSelector and pmpActions packages. They also acquire useful sensory information from the tactile sensors mounted on the iCub's hands through the tactileInterface module. In the Adaptive layer, a real-time affine transformation is performed in order to map the coordinates received from the Reactable into the robot's frame of reference. The iCub needs to be given the Cartesian coordinates within its own frame to execute any motor action on an object. This is accomplished by the ObjectLocationTransformer of the Adaptive layer.
At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Object Properties Collector (OPC). The Supervisor module essentially manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data that range from object coordinates in multiple reference frames (both in the Reactable and robot domains) to the state of the human partner. Furthermore, a central role is played by the OPC module, which represents the database where a large variety of information is stored from the beginning of the experiments. Finally, the reactable2OPC module handles the communication interface required to exchange the information provided by the Reactable with the OPC. In the following sections, we elaborate on the description of each component of the BASSIS architecture.

The iCub

The iCub is a humanoid robot one meter tall, with dimensions similar to a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over the hands, arms, head and legs (Metta et al., 2008). The sensory inputs are provided by artificial skin, cameras and microphones. The robot is equipped with novel artificial skin covering the hands (fingertips and palms) and the forearms (Cannata et al., 2008). Stereo vision is provided by stereo cameras in a swivel mounting. Stereo sound capturing is done using two microphones. Both cameras and microphones are located where eyes and ears would be placed in a human. The facial expressions, mouth and eyebrows, are projected from behind the face panel using lines of red LEDs. It also has the sense of proprioception (body configuration) and movement (using accelerometers and gyroscopes).
Figure A.2: BASSIS is a multi-scale biomimetic architecture organized over the Soma (cartesianControl, gazeControl), Reactive (pmpActions, tactileInterface, attentionSelector), Adaptive (reactable2OPC, ObjectLocationTransformer) and Contextual (Supervisor, OPC) levels of control. It is based on the well-established DAC architecture. See text for further details.

The different software modules in the architecture are interconnected using YARP (Metta et al., 2006). This framework supports distributed computation focusing on robot control and efficiency. The communication between two modules using YARP happens through objects called "ports". The ports can exchange data over different network protocols such as TCP and UDP. Currently, the iCub is capable of reliably coordinating reach and grasp motions, producing facial expressions to express emotions, performing force control exploiting its force/torque sensors and gazing at points in the visual field. The interaction can be performed using spoken language. With all these features, the iCub is an outstanding platform for research in social interaction between humans and humanoid robots.

Reactable

Conceptualized as a new musical instrument, the Reactable is a tabletop tangible interface (Geiger et al., 2010). With the Reactable the user can manipulate digital information by interacting with real-world objects; it gives a physical shape to the digital information. The instrument is composed of a round table with a translucent top where objects and fingertips (cursors) are used to control its parameters. It has a back projector to display the properties of the objects through the translucent top and a camera used to track the different objects (Fig. A.3). Object recognition is performed using the reacTIVision tracking system (Kaltenbrunner and Bencina, 2007).
The reacTIVision system can track a large number of objects in real-time using fiducial markers (Bencina and Kaltenbrunner, 2005). Each object receives an identification number (ID) according to its fiducial marker. The software provides precise information about the (x, y) position of the objects, their rotation angle, speed and acceleration. The tracking system uses the TUIO protocol for exchanging data (Kaltenbrunner et al., 2005). This protocol has been specifically designed to meet the requirements of a general, versatile communication interface for tabletop tangible interfaces. The information from the reacTIVision software can be read by any software using TUIO.

Object Properties Collector (OPC)

The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and as the interaction with humans evolves over time. In this setup, the data provided by the Reactable feed the memory of the robot represented by the OPC.

Figure A.3: Reactable system overview: tagged objects and fingertip pointing (cursors) on the translucent tabletop are tracked by a camera running reacTIVision, whose output (x, y, angle, speed, ...) is sent over the TUIO protocol to a TUIO client application, while a projector displays feedback through the tabletop. See text for further explanation.

The OPC reflects what the system knows about the environment and what is available. The data collected range from object coordinates in multiple reference frames (both in the Reactable and robot domains) to the human's emotional states or position in the interaction process. Entities in the OPC are addressed with unique identifiers and managed dynamically, meaning that they can be added, modified and removed at run-time, safely and easily, from multiple sources located anywhere in the network, practically without limitations. Entities consist of all the possible elements that enter or leave the scene: objects, table cursors, matrices used for geometric projection, as well as the robot and the human themselves.
All the properties related to the entities can be manipulated and used to populate the centralized database. A property in the OPC vocabulary is identified by a tag specified with a string (a sequence of characters). The values that can be assigned to a tag can be either single strings or numbers, or even a nested list of strings and numbers. Table A.1 shows some examples of the tags used.

  Tag                Description        Type      Examples
  "entity"           type of entity     string    "object", "table"
  "name"             object name        string    "drums", "guitar"
  onTable            presence/absence   integer   0, 1
  rt_position_x      x-table            double    [xmin, xmax] m
  rt_position_y      y-table            double    [ymin, ymax] m
  robot_position_x   x-robot            double    [xmin, xmax] m
  robot_position_y   y-robot            double    [ymin, ymax] m
  rt_id              unique table ID    integer   ID 0

Table A.1: Excerpt of available OPC properties

Reactable connection with the OPC

The reactable2OPC module provides the required communication interface between the Reactable and the OPC. The implementation is done using a TUIO client module that receives the data from the reacTIVision software and re-sends it over a YARP port. The OPC is populated with the properties of the tagged objects that are on the table. Such properties include: the object ID, the (x, y) position of each object on the table within the Reactable coordinate frame, the angle of rotation and the speed. When an object is removed from the table, the module keeps a memory of its ID, and the system reassigns the same ID to the same object when it reappears.

Coordinate frame transformation based on homography

In order for the iCub to explore the world and interact with it, the robot not only needs to use its sensors but also to coordinate perception and action across several frames of reference.
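For illustration, an OPC-style entity store holding tagged properties such as those of Table A.1 might look like the following sketch. The storage scheme and function names are assumptions for illustration, not the actual OPC implementation.

```python
# Sketch: entities addressed by unique IDs, each holding tagged
# properties that can be added or modified at run-time.
opc = {}

def opc_add(entity_id, **properties):
    """Add a new entity or update the properties of an existing one."""
    opc.setdefault(entity_id, {}).update(properties)

# An object appears on the table and is registered with its tags:
opc_add(17, entity="object", name="drums", onTable=1,
        rt_position_x=0.12, rt_position_y=-0.30)

# The object is removed; its ID is remembered so it can be
# reassigned when the object reappears:
opc_add(17, onTable=0)
```

The point of the sketch is that a property update touches only the tags it names, so multiple modules can safely write different aspects of the same entity.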
The iCub has its own coordinate frame; it is therefore required to execute a transformation between the table and the robot-centered coordinate frames, and vice-versa, in order to match objects to actions. In the iCub's reference frame, the x axis points backward, the y axis points laterally and is chosen according to the right-hand rule, and the z axis is parallel to gravity but points upwards. In BASSIS, the objectLocationTransformer solves this mathematical problem using a coordinate frame transformation based on homography. This technique is commonly used in computer vision to map between two coordinate frames and for camera calibration (Zhang, 2000). The kernel of the transformation is given by a 3x3 matrix H that rotates and translates the points B received from the table into the points A in the frame of the robot. The transformation can be formalized as: A = H x B. To obtain the positions A and B in order to compute H, we perform an a priori calibration session. In this phase, the iCub points at 3 random positions on the table with its fingertips using the tactileInterface. Currently, the center of the palm is the known position where the iCub is pointing; we use the lengths of the iCub's fingers and the joint angles to calculate the fingertip position by computing the forward kinematics (Siciliano et al., 2011), and thereby estimate A. The points B are the corresponding coordinates in the Reactable frame obtained directly from the OPC. Once the system is calibrated, the transformation between the two domains can be performed in real-time with a simple multiplication by H, which is stored in the OPC as well.

The Passive Motion Paradigm (PMP) and the Cartesian Interface

The PmpActionsModule provides motor control mechanisms that allow the robot to perform complex manipulation tasks.
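The calibration step described above can be sketched as follows: from three pointed-at positions, the 3x3 matrix H mapping homogeneous table coordinates B to robot coordinates A is recovered and then applied in real time. The point coordinates below are hypothetical.

```python
# Sketch: recover the affine transform A = H @ B from three
# calibration correspondences in homogeneous coordinates.
import numpy as np

def calibrate(points_table, points_robot):
    """Solve A = H @ B for H given 3 non-collinear point pairs."""
    B = np.vstack([np.asarray(points_table, float).T, np.ones(3)])  # 3x3
    A = np.vstack([np.asarray(points_robot, float).T, np.ones(3)])  # 3x3
    return A @ np.linalg.inv(B)

def table_to_robot(H, p):
    """Real-time mapping: one matrix-vector multiplication."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2]

# Hypothetical calibration session (table frame -> robot frame):
H = calibrate([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
              [(-0.3, 0.1), (-0.3, 0.6), (0.2, 0.1)])
```

Three non-collinear points fully determine the affine part of the transform; once H is stored in the OPC, any table coordinate maps to the robot frame with a single multiplication, as the text describes.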
The PmpActionsModule relies on a revised version of the so-called Passive Motion Paradigm (PMP). PMP is a biomimetic synergy formation model (Mohan et al., 2009) conceived for solving redundancy in human movements. The neurobiological background is related to the Equilibrium Point Hypothesis, developed in the 1960s by Feldman and then by Bizzi. In the model (Mussa Ivaldi et al., 1988), goals and constraints are expressed as force fields acting simultaneously in different motor spaces. In this way the model can carry out kinematic inversion by means of a differential approach that relies on Jacobians but never computes their inverses explicitly. In the current implementation the model was modified by breaking it into two modules: 1) a module where virtual force fields are used in the workspace in order to perform reaching tasks while avoiding obstacles, as conceived by Khatib in the 1980s (Khatib, 1985); 2) a module that carries out kinematic inversion via a modified pseudoinverse/transposed Jacobian. Within the PMP framework it is possible to describe objects of the perceived world either as obstacles or as targets, and to consequently generate proper repulsive or attractive force fields, respectively. The PMP module can also include internal constraints, for example related to bimanual coordination. According to the composition of all the active fields, the robot's end-effector is eventually driven towards the selected target while bypassing the identified obstacles. The behavior and performance strictly depend on the mutual relationship among the tunable fields' parameters. This framework allows defining a trajectory in Cartesian space, which needs to be properly converted into joint space through a robust inverse kinematics solver.
Specifically, we resort to the Cartesian Controller (Pattacini et al., 2010) to solve the inverse kinematics problem while fulfilling additional constraints such as joint limits. The non-linear optimization algorithm used, IPOPT, allows solving the inverse kinematics problem robustly (Wächter and Biegler, 2005). The Cartesian Controller also provides a reliable controller that extends the Multi-Referential Dynamical Systems approach (Hersch and Billard, 2008), synchronizing two dynamical controllers, one in joint space and the other in task space. It ensures that the robot actually follows the trajectory. In order to reproduce human-like manipulation tasks, we built a library, namely pmpActions, modeling complex actions as combinations of simple action primitives that define high-level actions such as Reach, Grasp and Release. In fact, if we assume that, when an object is grasped, it becomes an appendix of the robot's body structure, complex actions can be performed as a sequence of simple primitives (Fig. A.4).

Figure A.4: PMP architecture: at the lower level the pmpServer module computes the trajectory, targets and obstacles in the space, feeding the Cartesian Interface with the trajectory that has to be followed. The user relies on a pmpClient in order to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives.
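The differential kinematic inversion underlying PMP, an attractive field driving the end-effector while the Jacobian is transposed rather than inverted, can be illustrated on a planar two-link arm. The link lengths, gain and target below are illustrative, and this sketch omits obstacles and the full force-field machinery.

```python
# Sketch: Jacobian-transpose reaching on a planar 2-link arm,
# i.e. kinematic inversion without ever inverting the Jacobian.
import numpy as np

L1, L2 = 0.3, 0.25        # illustrative link lengths (metres)

def fkine(q):
    """End-effector position for joint angles q = (q1, q2)."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def reach(q, target, gain=2.0, steps=2000):
    """Drive the arm toward the target along the attractive field."""
    for _ in range(steps):
        err = target - fkine(q)              # attractive "force field"
        q = q + gain * jacobian(q).T @ err   # transpose, not inverse
    return q

q_final = reach(np.array([0.3, 0.5]), np.array([0.2, 0.35]))
```

Because the update uses the transpose, each step simply descends the distance-to-target cost; no matrix inversion is required, which is the property the text attributes to the differential PMP approach.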
Spoken Language Interaction and Supervisor
The spoken language interaction is managed by the Supervisor. It is implemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application Development (RAD) environment in the form of a finite-state dialogue system, where speech synthesis is done by Festival and recognition by Sphinx-II. These state modules provide functions including conditional transition to new states based on the words and sentences recognized, and thus conditional execution of code based on the current state and the spoken input from the user. The recognition and the extraction of the useful information from the recognized sentence is based on a simple grammar that we have developed. For instance, an action order follows this structure:
1. $action1 = grasp | point ;
2. $action2 = put | drop ;
3. $object = drums | synthesizer | guitar | box ;
4. $where_rel = my | your ;
5. $location = here | $where_rel left | middle | $where_rel right ;
6. $actionOrder = $action1 [*sil%%|the%%] $object | $action2 [the%%] $object [on%% the%%] $location ;
$action1, $action2, $object, $where_rel and $location are variables which can each take several words. The final recognized sentence has the grammar of $actionOrder. The words between square brackets are optional, and the ones followed by a %% sign are recognized but not stored. The Supervisor allows the human to interact with the robot; it requires access to and control over several modules and is thus in charge of the YARP connections, establishing connections with the following components:
1. attentionSelector, to control where the iCub is looking, providing a non-verbal channel of communication by gazing at the targets while it is speaking (e.g. looking at the drums when it says the name "drums"),
2.
objectsPropertiesCollector, to obtain all the knowledge of the iCub, i.e. what it knows about the world, in order to answer the human (e.g. to know whether the synthesizer is on the table),
3. pmpActions, to move its hands, executing orders from the human to manipulate objects (e.g. grasping the guitar after the proper spoken command),
4. tactileInterface, to realize the calibration process between the ReacTable and the iCub, in which the human takes part (e.g. when the iCub asks him to touch the table).
Visualizations in real-time
Displaying the state of the world and of the iCub is important for debugging purposes, since it provides a useful tool to visualize in real time the elements available in the memory of the robot. All objects detected by the Reactable or stored as static locations are loaded from the OPC and displayed in a graphical user interface, the iCubGUI. The robot is also rendered in real time, which gives the user an accurate visual representation of the whole scenario in the coordinate frame of the robot. The objLocationTransformer module is responsible for sending the data directly to the iCubGUI. This classification allows us to customize the display for different entity types.
Experimental scenario
We performed two different experiments in order to benchmark the architecture's capability to sustain social interaction. We first defined three fixed positions on the table: left, middle and right (Fig. A.5). They are used as spatial references for placing the objects the robot has to interact with. In the initial experiments, we evaluate the precision of the grasping tasks performed by the iCub. The robot had to perform the following tasks: grasp an object, lift it from the table and put it back at a desired location. We use an object of dimensions 4x6x7 cm (Fig. A.1).
In the first experiment, we divided the tasks in two: the grasp task and the place task. In the first, the iCub grasps an object placed at one of the three fixed positions and then releases it at the exact point where it picked it up, thereby assessing the robot's capability of object manipulation as well as its precision. For each of these three positions, the robot had to perform 4 grasps. The robot executed a total of 36 grasping actions, equally divided between the right, middle and left positions. We removed 4 samples corresponding to unsuccessful grasps, 3 at the middle and 1 at the left position. At these positions the robot slightly hit the object after placing it on the table; as the data collection is performed after the arm has returned to the rest position, the values recorded would not reflect the precision of the grasping. To measure the precision, we define the grasping error (GE) as the difference between the target position and the final position where the object is placed. To further test the robot's ability to move an object from one point to another, we designed the place task. In this task, the robot had to grasp and move an object from one spatial point to another. We designed the experiment to cover all possible combinations of right, middle and left. The robot executed a total of 72 grasping actions, 12 repetitions of each of the 6 possible source-target combinations (where source and target are necessarily different). The place error (PE) is calculated in the same way as the grasping error, i.e., as the difference between target and final position. The system was tested on a cluster of 4 machines with quad-core Intel Core i7-820QM processors (1.73 GHz). In the last experiment, we test the iCub's learning capabilities as well as the spoken language interaction. For the scope of this experiment, we have used shared plans.
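The grasping and place errors defined above can be made concrete with a short sketch. The measurements below are invented for illustration; only the definition, error = target position minus final object position, follows the text.

```python
import numpy as np

# Hypothetical (target, final) object positions in cm, robot frame.
targets = np.array([[-35.0, 20.0], [-35.0, 0.0], [-35.0, -20.0]])
finals  = np.array([[-36.1, 21.4], [-36.3, 1.8], [-34.8, -18.5]])

GE = targets - finals            # per-trial error vectors (GEx, GEy)
mean_x, mean_y = GE.mean(axis=0)
print("mean GEx = %.2f cm, mean GEy = %.2f cm" % (mean_x, mean_y))
```

The place error PE is obtained the same way, with the place target substituted for the pick-up point.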
Also called cooperative plans, shared plans correspond to interlaced actions scheduled by two cooperating agents (here a human and the iCub) in order to achieve a common goal (Tomasello et al., 2005). A shared plan can be taught in real time by the human. After the user announces to the iCub that he wants to do a shared plan by telling it "Shared Plan", the Supervisor enters the sub-dialogue node of the cooperative plan. The user can describe the plan he wants to execute with the iCub and can execute it if the iCub knows it; otherwise the robot will ask the human to teach it. For the teaching part, the user has to describe the sequence of actions one by one, letting the iCub ask for confirmation each time. When all the actions have been told, the user ends the teaching by saying "finished" (Petit et al., 2012). The newly learnt shared plan is then saved in a file in the form:
• Shared plan name, e.g. "Swap"
• Steps, e.g. "{I put {{drums middle}}} {You put {{guitar right}}} {I put {{drums left}}}"
• Subjects, e.g. "I You"
• Arguments, e.g. "drums guitar"
The subjects and arguments are written so that they can be parsed and replaced in the shared plan according to the sentence the user utters when invoking it. Thus, one could learn for instance the swap plan using "I and You swap the drums and the guitar" but execute it with "You and I swap the drums and the guitar": in this case there is a role reversal, and the robot performs the actions originally planned to be executed by the human, and vice versa; the role reversal test is passed (Dominey and Warneken, 2009). This aspect is very important because this capability is the signature of a shared plan (Tomasello et al., 2005; Carpenter et al., 2005).
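The stored plan and its later re-binding can be sketched as follows. The dictionary is a loose Python rendering of the file format above, not the actual implementation, which parses the saved text form.

```python
# Loose rendering of the saved "Swap" plan from the text.
plan = {
    "name": "Swap",
    "steps": [("I", "put", "drums", "middle"),
              ("You", "put", "guitar", "right"),
              ("I", "put", "drums", "left")],
    "subjects": ("I", "You"),
    "arguments": ("drums", "guitar"),
}

def instantiate(plan, subjects, arguments):
    """Re-bind subjects and arguments, e.g. for role reversal or
    generalization to new objects; verbs and locations pass through."""
    sub = dict(zip(plan["subjects"] + plan["arguments"],
                   subjects + arguments))
    return [(sub[s], verb, sub[obj], loc)
            for s, verb, obj, loc in plan["steps"]]

# "You and I swap the synthesizer and the box", with roles reversed:
for step in instantiate(plan, ("You", "I"), ("synthesizer", "box")):
    print(step)
```

Swapping the order of the subjects reverses the roles; swapping in new arguments generalizes the plan to new objects, exactly the two manipulations tested below.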
Moreover, the arguments can also be replaced: with the same original shared plan, the iCub and the user can resolve the cooperative plan "You and I swap the synthesizer and the box", showing the capacity to generalize, in this example over objects, within the cooperative plan: the robot has learnt how to swap something with something, not just the drums and the guitar (Lallée et al., 2010). For the experiment, we designed 4 different shared plans. We repeated the learning and execution of all four shared plans with three non-naive humans. Here we present the 4 shared plans in both the teaching and the execution phases, exploring both role reversal and generalization of objects:
Teaching:
1. You and I swap drums guitar
• You put drums left and I put guitar right
2. You and I split drums guitar
• You put drums right and I put guitar left
3. You and I group guitar drums
• You put guitar middle and I put drums middle
4. You and I pass guitar left
• You put guitar middle and I put guitar left
Execution:
1. You and I swap drums guitar
2. I and You split drums guitar
3. You and I group synthesizer guitar
4. I and You pass guitar right
Figure A.5: A cartoon of the experimental scenario. The iCub faces the Reactable at x = -35 cm; the left (L), middle (M) and right (R) positions lie at y = 20 cm, y = 0 cm and y = -20 cm, given in the coordinate frame of the robot.
A.3 Results
In the first part of this section, we present an analysis of the error in the grasping tasks. We perform the analysis separately for the two axes x and y. Initially, we perform a Shapiro-Wilk test to check for normality in GEx and GEy. The results show a p-value of 0.68 for GEx and of 0.85 for GEy; hence the data are normally distributed. We further perform a t-test to evaluate whether the distributions have zero mean. The mean values are -1.20 cm and 1.57 cm for GEx and GEy respectively, suggesting that the robot has a tendency to place the objects to the front and right of the target.
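The test sequence used here (Shapiro-Wilk for normality, then a one-sample t-test against zero mean) can be reproduced on synthetic data; the draws below are centred on the reported means but are not the thesis data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic grasping errors (cm); 32 valid grasps, means as reported,
# the spread of 0.8 cm is an assumption.
GEx = rng.normal(-1.20, 0.8, 32)
GEy = rng.normal(1.57, 0.8, 32)

for name, ge in (("GEx", GEx), ("GEy", GEy)):
    _, p_norm = stats.shapiro(ge)           # H0: sample is normal
    _, p_zero = stats.ttest_1samp(ge, 0.0)  # H0: zero-mean error
    print(f"{name}: normality p = {p_norm:.2f}, zero-mean p = {p_zero:.2g}")
```

A Shapiro-Wilk p-value above 0.05 licenses the parametric t-test; for the non-normal PEy below, the analysis switches to non-parametric tests instead.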
Analyzing the three conditions individually, we see that the mean horizontal error GEx decreases from right to left (Fig. A.6). That is, the horizontal error is higher on the side corresponding to the arm used in the task (the right arm). In our interpretation, this reflects the geometrical arrangement of the arm in relation to the torso when the target position is on the side opposite to the arm used for grasping. On the left side, the error in x is reduced to its cosine component; the same deviation caused by possible imprecisions in the movement of the torso therefore represents a smaller error on the left side than in the middle or on the right side. We confirm this observation by performing a multiple regression analysis to investigate the impact of the different positions on those deviations. The only significant impact is in GEx when the action is performed on the left side (p-value = 2.55 × 10−7). No impact of position was observed for GEy. Hence, we conclude that the iCub is significantly more precise when grasping objects on the left side.
Figure A.6: Box plots of GEx and GEy for the grasp experiments. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y.
In the second part, we present an analysis of the error in the place experiment (PE). The Shapiro-Wilk test gives a p-value of 0.25 and of 0.0015 for PEx and PEy respectively. In this case, PEy is not normally distributed. For this reason, we use parametric tests for PEx and non-parametric tests for PEy. A linear regression is performed in order to see whether the distance of the
trajectory, defined as one for middle to right/left and two for left to right, has an impact on the error. The analysis gives a p-value of 0.80, suggesting that the distance has no impact on the error (Fig. A.7).
Figure A.7: Box plots for the place experiment for the different conditions analyzed: source and target. Both conditions have an impact on the error.
We go further and perform another linear regression using the source and the target as variables, followed by an ANOVA. The analysis shows an impact of both parameters (p-values of 0.020 and 0.001 for source and target respectively). The system is nevertheless still very precise, with errors ranging from -1 to 1 cm. For the analysis of PEy we use Kruskal-Wallis tests. The tests reveal no significant impact of the distance (p-value = 0.099), but a significant influence of the source (p-value = 4.8 × 10−10) and the target (p-value = 2.06 × 10−9). The precision along this axis is lower than along the x axis, but still very acceptable: the errors range from -1 to 3 cm when moving a 4x6 cm object over at least 20 cm (or 40 cm). For the second experiment, three subjects were asked to teach the robot 4 different shared plans and then execute them. In total, the robot learned and executed 12 shared plans. To evaluate the learning capabilities of the iCub as well as its performance in executing each plan, we measured the completion time for each shared plan. Learning time is measured from the beginning of the human command to the indication of "finished", which marks the completion of a learning sequence.
Accordingly, execution time is measured from the start of the human command to the execution of the last action, followed by the next request by the iCub. The iCub was able to learn a shared plan of two actions in approximately half a minute and to execute the learned plan in a minute (Fig. A.8). During the learning phase only two speech recognition failures occurred, while during the execution phase three occurred. These failures are mainly attributable to the Sphinx-II speech recognizer and were therefore not taken into account. During the execution of the 12 shared plans, with N=18 distinct actions by the iCub, no physical errors occurred. Finally, it is worth noticing that a shared plan execution takes roughly the same amount of time with a role reversal, with replaced arguments, or with both, stressing the robustness of the system.
A.4 Conclusions
In this paper, we proposed an interaction paradigm and architecture for human-robot interaction, called BASSIS, designed to provide a novel and controlled environment for research on social interaction with humanoid robots. The core of BASSIS is the integration of the iCub robot with the tangible interface Reactable. This combination provides a rich context for the execution of cooperative plans between the iCub and the human, with games or collaborative music generation.
Figure A.8: Average time needed for the shared plan tasks. The learning box represents the time needed by the human to create the task; the execution box represents the time needed by the interaction to perform the learned action.
We presented a detailed technical description of all the components that build the 4 layers of BASSIS. Moreover, we stressed the system's capabilities in two different kinds of experiments.
In our results we have assessed both the precision of the overall system and its usability in the context of acquiring and executing shared plans. We showed that the position of the objects provided by the combination of the Reactable, objectLocationTransformer, pmpActions and cartesianControl is reliably mapped into the robot's frame, allowing precise motor actions. In a series of grasping tasks, the setup showed a maximum error of 2 cm between the pre-defined targets and the final positions where the objects were placed. In the experiments, we observed an impact of the object position on the grasping precision. For simple grasping at the same position, i.e., taking an object off the table and putting it back on the same spot, the precision was significantly better on the left side of the table (the side opposite to the arm used). Assuming that the robot is perfectly bilaterally symmetric, this difference could be compensated by using both arms. The processes ran smoothly without affecting the dynamics of the interaction. Over 40 hours of tests and 360 executions of grasping, we encountered no issues with either hardware or software. In the next version of the architecture, two new modules that are already in a test phase will be integrated. The first is based on the integration of the HAMMER model (Hierarchical Attentive Multiple Models for Execution and Recognition of actions) (Sarabia et al., 2011) with the PmpActionsModule. With HAMMER the system will be able to understand human movements by internally simulating the robot's own actions and predicting future human states. The second module will add face recognition capabilities to the robot. It is based on a canonical model of cortical computation called the Temporal Population Code (Luvizotto et al., 2011).
In addition, we will add a multi-scale bottom-up and top-down attention system (Mathews et al., 2011). With these new features we expect to increase the level of social connection between the human and the robot during the interaction.
Bibliography
Each reference indicates the pages where it appears.
Yoseph Bar-Cohen and David Hanson. The Coming Robot Revolution: Expectations and Fears About Emerging Intelligent, Humanlike Machines. Springer, 2009. ISBN 0387853480. 110
Omri Barak, Misha Tsodyks, and Ranulfo Romo. Neuronal population coding of parametric working memory. The Journal of Neuroscience, 30(28):9424–30, July 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1875-10.2010. URL http://www.jneurosci.org/cgi/content/abstract/30/28/9424. 14, 23
Emmanuel J Barbeau, Margot J Taylor, Jean Regis, Patrick Marquis, Patrick Chauvel, and Catherine Liégeois-Chauvel. Spatio temporal dynamics of face recognition. Cerebral Cortex, 18(5):997–1009, May 2008. ISSN 1460-2199. doi: 10.1093/cercor/bhm140. URL http://cercor.oxfordjournals.org/cgi/content/abstract/18/5/997. 55
Francesco P Battaglia and Bruce L McNaughton. Polyrhythms of the brain. Neuron, 72(1):6–8, October 2011. ISSN 1097-4199. doi: 10.1016/j.neuron.2011.09.019. URL http://dx.doi.org/10.1016/j.neuron.2011.09.019. 16
Pierre Bayerl and Heiko Neumann. Disambiguating visual motion through contextual feedback modulation. Neural Computation, 16(10):2041–66, October 2004. ISSN 0899-7667. doi: 10.1162/0899766041732404. URL http://www.citeulike.org/user/toerst/article/4199040. 58, 72, 81
G. Bekey and J. Yuh. The Status of Robotics: Report on the WTEC international study: Part I. IEEE Robotics and Automation Magazine, 14(4):76–81, 2007. 110
Karim Benchenane, Paul H Tiesinga, and Francesco P Battaglia. Oscillations in the prefrontal cortex: a gateway to memory and attention. Current Opinion in Neurobiology, 21(3):475–85, June 2011. ISSN 1873-6882. doi: 10.1016/j.conb.2011.01.004.
URL http://dx.doi.org/10.1016/j.conb.2011.01.004. 16
Ross Bencina and Martin Kaltenbrunner. The Design and Evolution of Fiducials for the reacTIVision System. In 3rd International Conference on Generative Systems in the Electronic Arts, 2005. 116
Andrea Benucci, Robert A Frazor, and Matteo Carandini. Standing waves and traveling waves distinguish two circuits in visual cortex. Neuron, 55(1):103–117, 2007. ISSN 0896-6273. doi: 10.1016/j.neuron.2007.06.017. 23
Andrea Benucci, Dario L Ringach, and Matteo Carandini. Coding of stimulus sequences by population responses in visual cortex. Nature Neuroscience, 12(10):1317–24, October 2009. ISSN 1546-1726. doi: 10.1038/nn.2398. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2847499&tool=pmcentrez&rendertype=abstract. 14, 23
Cyrus P Billimoria, Benjamin J Kraus, Rajiv Narayan, Ross K Maddox, and Kamal Sen. Invariance and sensitivity to intensity in neural discrimination of natural sounds. The Journal of Neuroscience, 28(25):6304–6308, 2008. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.0961-08.2008. 15, 23
Tom Binzegger, Rodney J Douglas, and Kevan A C Martin. A quantitative map of the circuit of cat primary visual cortex. The Journal of Neuroscience, 24(39):8441–53, September 2004. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1400-04.2004. 90
M Block. A note on the refraction and image formation of the rat's eye. Vision Research, 9(6):705–711, June 1969. ISSN 0042-6989. doi: 10.1016/0042-6989(69)90127-8. URL http://dx.doi.org/10.1016/0042-6989(69)90127-8. 97
Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual Navigation for Mobile Robots: A Survey. J. Intell. Robotics Syst., 53(3):263–296, November 2008. ISSN 0921-0296. doi: 10.1007/s10846-008-9235-4. 90
Jeffrey S. Bowers. On the biological plausibility of grandmother cells: Implications for neural network theories in psychology and neuroscience.
Psychological Review, 116(1):220, 2009. 10
Cynthia Breazeal, Matt Berlin, Andrew Brooks, Jesse Gray, and Andrea L. Thomaz. Using perspective taking to learn from ambiguous demonstrations. Robotics and Autonomous Systems, 54(5):385–393, May 2006. ISSN 0921-8890. doi: 10.1016/j.robot.2006.02.004. 110
Farran Briggs and Edward M. Callaway. Layer-Specific Input to Distinct Cell Types in Layer 6 of Monkey Primary Visual Cortex. J. Neurosci., 21(10):3600–3608, May 2001. URL http://www.jneurosci.org/cgi/content/abstract/21/10/3600. 7
Dean V Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience, 10(2):113–25, February 2009. ISSN 1471-0048. doi: 10.1038/nrn2558. URL http://dx.doi.org/10.1038/nrn2558. 108
G Buzsáki. Rhythms of the Brain. Oxford University Press, New York, USA, 2006. ISBN 9780195301069. URL http://books.google.es/books?id=ldz58irprjYC. 15, 51, 103
C. Cadoz. Le geste canal de communication homme/machine: la communication instrumentale. TSI. Technique et science informatiques, 13(1):31–61, 1984. ISSN 0752-4072. URL http://cat.inist.fr/?aModele=afficheN&cpsidt=3595521. 75
Edward M. Callaway. Local circuits in primary visual cortex of the macaque monkey. Annual Review of Neuroscience, 21:47–74, November 1998. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.neuro.21.1.47. 7
Edward M Callaway. Structure and function of parallel pathways in the primate early visual system. The Journal of Physiology, 566(Pt 1):13–9, July 2005. ISSN 0022-3751. doi: 10.1113/jphysiol.2005.088047. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1464718&tool=pmcentrez&rendertype=abstract. 5
Giorgio Cannata, Marco Maggiali, Giorgio Metta, and Giulio Sandini. An embedded artificial skin for humanoid robots.
In 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 434–438. IEEE, August 2008. ISBN 978-1-4244-2143-5. doi: 10.1109/MFI.2008.4648033. 63, 114
Mikael A Carlsson, Philipp Knusel, Paul F M J Verschure, and Bill S Hansson. Spatio-temporal Ca2+ dynamics of moth olfactory projection neurones. European Journal of Neuroscience, 22(3):647–657, 2005. ISSN 0953-816X. doi: 10.1111/j.1460-9568.2005.04239.x. 15, 24
Malinda Carpenter, Michael Tomasello, and Tricia Striano. Role Reversal Imitation and Language in Typically Developing Infants and Children With Autism. Infancy, 8(3):253–278, November 2005. ISSN 1525-0008. doi: 10.1207/s15327078in0803_4. 126
Jayaram Chandrashekar, Mark A Hoon, Nicholas J P Ryba, and Charles S Zuker. The receptors and cells for mammalian taste. Nature, 444(7117):288–94, November 2006. ISSN 1476-4687. doi: 10.1038/nature05401. URL http://dx.doi.org/10.1038/nature05401. 27
Shi-Huang Chen and Jhing-Fa Wang. Extraction of pitch information in noisy speech using wavelet transform with aliasing compensation. In Acoustics, Speech, and Signal Processing, 2001. Proceedings, volume 1, pages 89–92, Salt Lake City, USA, 2001. doi: 10.1109/ICASSP.2001.940774. 50
Taishih Chi, Powen Ru, and Shihab A Shamma. Multiresolution spectrotemporal analysis of complex sounds. The Journal of the Acoustical Society of America, 118(2):887–906, 2005. doi: 10.1121/1.1945807. URL http://link.aip.org/link/?JAS/118/887/1. 25
S Chikkerur, T Serre, C Tan, and T Poggio. What and where: a Bayesian inference theory of attention. Vision Research, 50(22):2233–47, October 2010. ISSN 1878-5646. doi: 10.1016/j.visres.2010.05.013. URL http://www.ncbi.nlm.nih.gov/pubmed/20493206. 11, 54
Bevil R Conway, Soumya Chatterjee, Greg D Field, Gregory D Horwitz, Elizabeth N Johnson, Kowa Koida, and Katherine Mancuso. Advances in color science: from retina to behavior.
The Journal of Neuroscience, 30(45):14955–63, November 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.4348-10.2010. URL http://www.jneurosci.org/cgi/content/abstract/30/45/14955. 3
A. Cowey and E.T. Rolls. Human cortical magnification factor and its relation to visual acuity. Experimental Brain Research, 21(5), December 1974. ISSN 0014-4819. doi: 10.1007/BF00237163. URL http://www.springerlink.com/content/q7v12n66365h123h/. 6
Pablo D'Angelo. Hugin, 2010. URL http://hugin.sourceforge.net/. 97
J Daugman. Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20(10):847–856, 1980. ISSN 0042-6989. doi: 10.1016/0042-6989(80)90065-6. URL http://dx.doi.org/10.1016/0042-6989(80)90065-6. 25
Kerstin Dautenhahn. Socially intelligent robots: dimensions of human-robot interaction. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 362(1480):679–704, April 2007. ISSN 0962-8436. doi: 10.1098/rstb.2006.2004. 110
Licurgo de Almeida, Marco Idiart, and John E Lisman. The input-output transformation of the hippocampal granule cells: from grid cells to place fields. The Journal of Neuroscience, 29(23):7504–12, June 2009a. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.6048-08.2009. URL http://www.jneurosci.org/
Vision research, 22(5):545–59, January 1982. ISSN 0042-6989. URL http://www.ncbi.nlm.nih.gov/ pubmed/7112954. 57 Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, and Michael J. Tarr, editors. Object Categorization. Cambridge University Press, Cambridge, 2009. ISBN 9780511635465. doi: 10.1017/CBO9780511635465. 110 G S Doetsch. Patterns in the brain. Neuronal population coding in the somatosensory system. Physiology and Behavior, 69(1-2):187–201, 2000. ISSN 0031-9384 (Print). 16, 24 Peter Ford Dominey and Felix Warneken. The basis of shared intentions in human and robot cognition. New Ideas in Psychology, 29(3):260–274, December 2009. ISSN 0732118X. doi: 10.1016/j.newideapsych.2009.07. 006. 126 Armin Duff and Paul F.M.J. Verschure. Unifying perceptual and behavioral learning with a correlative subspace learning rule. computing, 73(10-12):1818–1830, June 2010. Neuro- ISSN 09252312. doi: 10.1016/j.neucom.2009.11.048. 61, 112 Armin Duff, César Rennó-Costa, Encarni Marcos, Andre Luvizotto, Andrea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and Paul Verschure. From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7. doi: 10.1007/978-3-642-05181-4. content/v348576tk12u628h. URL http://www.springerlink.com/ 139 bibliography Armin Duff, Marti Sanchez Fibla, and Paul F M J Verschure. A biologically based model for the integration of sensory-motor contingencies in rules and plans: A prefrontal cortex based extension of the Distributed Adaptive Control architecture. Brain research bulletin, 85(5):289–304, June 2011. ISSN 1873-2747. doi: 10.1016/j.brainresbull.2010.11.008. URL http://www.ncbi.nlm.nih.gov/pubmed/21138760. 27 Gaute T. Einevoll and Hans E. Plesser. Extended difference-of-Gaussians model incorporating cortical feedback for relay cells in the lateral geniculate nucleus of cat. Cognitive Neurodynamics, pages 1–18, November 2011. 
ISSN 1871-4080. doi: 10.1007/s11571-011-9183-8. URL http://www.springerlink.com/content/g64572342p61vk40/. 6, 27, 94
M Fabre-Thorpe, G Richard, and S J Thorpe. Rapid categorization of natural images by rhesus monkeys. Neuroreport, 9(2):303–8, January 1998. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/9507973. 17
Winrich A Freiwald, Doris Y Tsao, and Margaret S Livingstone. A face feature space in the macaque temporal lobe. Nature Neuroscience, 12(9):1187–96, September 2009. ISSN 1546-1726. doi: 10.1038/nn.2363. URL http://dx.doi.org/10.1038/nn.2363. 10, 55
M. Frigo and S.G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, February 2005. ISSN 0018-9219. doi: 10.1109/JPROC.2004.840301. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1386650. 60
K Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, January 1980a. ISSN 0340-1200. URL http://www.ncbi.nlm.nih.gov/pubmed/7370364. 2, 23
K Fukushima. Neocognitron for handwritten digit recognition. Neurocomputing, 51(1):161–180, 2003. ISSN 0925-2312. doi: 10.1016/S0925-2312(02)00614-8. URL http://linkinghub.elsevier.com/retrieve/pii/S0925231202006148. 11, 54
Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, April 1980b. ISSN 0340-1200. doi: 10.1007/BF00344251. URL http://www.springerlink.com/content/r6g5w3tt54528137/. 11, 54
G Geiger, N Alber, S Jordà, and M Alonso. The Reactable: A Collaborative Musical Instrument for Playing and Understanding Music. Her&Mus. Heritage & Museography, 2(4):36–43, 2010. 111, 116
A S Georghiades, P N Belhumeur, and D J Kriegman. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. Pattern Anal. Mach.
Intelligence, 23(6):643–660, 2001. xxviii, 19, 66, 67
A. Georgopoulos, A. Schwartz, and R. Kettner. Neuronal population coding of movement direction. Science, 233(4771):1416–1419, September 1986. ISSN 0036-8075. doi: 10.1126/science.3749885. URL http://www.sciencemag.org/content/233/4771/1416.abstract. 12
CD Gilbert and TN Wiesel. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9(7):2432–2442, July 1989. URL http://www.jneurosci.org/cgi/content/abstract/9/7/2432. 7, 8
Susan Goldin-Meadow. How gesture promotes learning throughout childhood. Child Development Perspectives, 3(2):106–111, August 2009. ISSN 1750-8606. doi: 10.1111/j.1750-8606.2009.00088.x. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2835356&tool=pmcentrez&rendertype=abstract. 75
Tim Gollisch and Markus Meister. Rapid neural coding in the retina with relative spike latencies. Science, 319(5866):1108–11, February 2008. ISSN 1095-9203. doi: 10.1126/science.1149639. URL http://www.sciencemag.org/content/319/5866/1108.abstract. 4, 14
Tim Gollisch and Markus Meister. Eye smarter than scientists believed: neural computations in circuits of the retina. Neuron, 65(2):150–64, January 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.12.009. URL http://www.ncbi.nlm.nih.gov/pubmed/20152123. 4
Leslie J. Gonzalez Rothi, Cynthia Ochipa, and Kenneth M. Heilman. A Cognitive Neuropsychological Model of Limb Praxis. Cognitive Neuropsychology, 8(6):443–458, November 1991. ISSN 0264-3294. doi: 10.1080/02643299108253382. URL http://dx.doi.org/10.1080/02643299108253382. 75
Charles G Gross. Genealogy of the "grandmother cell". The Neuroscientist, 8(5):512–8, October 2002. ISSN 1073-8584. URL http://www.ncbi.nlm.nih.gov/pubmed/12374433. 10
Alfred Haar. Zur Theorie der orthogonalen Funktionensysteme.
Mathematische Annalen, 71(1):38–53, March 1911. ISSN 0025-5831. doi: 10.1007/BF01456927. URL http://www.springerlink.com/content/q1n417x61q8r2827/. 26

Ileana L Hanganu, Yehezkel Ben-Ari, and Rustem Khazipov. Retinal waves trigger spindle bursts in the neonatal rat visual cortex. The Journal of Neuroscience, 26(25):6728–36, June 2006. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.0752-06.2006. URL http://www.jneurosci.org/cgi/content/abstract/26/25/6728. 26

D O Hebb. Studies of the organization of behavior. I. Behavior of the rat in a field orientation. Journal of Comparative Psychology, 25:333–353, 1932. 97, 107

Micha Hersch and Aude G. Billard. Reaching with multi-referential dynamical systems. Autonomous Robots, 25(1-2):71–83, January 2008. ISSN 0929-5593. doi: 10.1007/s10514-007-9070-7. 120

D H Hubel and T N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160:106–54, January 1962. ISSN 0022-3751. URL http://www.ncbi.nlm.nih.gov/pubmed/14449617. 7, 54

Chou P Hung, Gabriel Kreiman, Tomaso Poggio, and James J DiCarlo. Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863–6, November 2005. ISSN 1095-9203. doi: 10.1126/science.1117593. URL http://www.sciencemag.org/content/310/5749/863.abstract. 22

J M Hupé, A C James, P Girard, and J Bullier. Response modulations by static texture surround in area V1 of the macaque monkey do not depend on feedback connections from V2. Journal of Neurophysiology, 85(1):146–63, January 2001. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/11152715. 8

L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998. ISSN 0162-8828. doi: 10.1109/34.730558. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=730558.
92, 93, 94, 107

Giuliano Iurilli, Fabio Benfenati, and Paolo Medini. Loss of Visually Driven Synaptic Responses in Layer 4 Regular-Spiking Neurons of Rat Visual Cortex in Absence of Competing Inputs. Cerebral Cortex, pages bhr304–, November 2011. ISSN 1460-2199. doi: 10.1093/cercor/bhr304. URL http://cercor.oxfordjournals.org/cgi/content/abstract/bhr304v1. 26

E M Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6):1569–72, January 2003. ISSN 1045-9227. doi: 10.1109/TNN.2003.820440. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1257420. 28, 78

Eugene Izhikevich. Dynamical Systems in Neuroscience: The Geometry of Excitability and Bursting. The MIT Press, Cambridge, USA, 1 edition, 2006. ISBN 0262090430. URL http://www.citeulike.org/user/sdaehne/article/1396879. 30, 80

Eugene M Izhikevich. Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15(5):1063–70, September 2004. ISSN 1045-9227. doi: 10.1109/TNN.2004.832719. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1333071. 28, 30, 79, 80, 95

J P Jones and L A Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–58, December 1987. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/3437332. 25

Sewoong Jun, Youngouk Kim, and Jongbae Lee. Difference of wavelet SIFT based mobile robot navigation. In 2009 IEEE International Conference on Control and Automation, pages 2305–2310. IEEE, December 2009. ISBN 978-1-4244-4706-0. doi: 10.1109/ICCA.2009.5410318. 90

Martin Kaltenbrunner and Ross Bencina. reacTIVision: a computer-vision framework for table-based tangible interaction. In Proceedings of the 1st international conference on Tangible and embedded interaction - TEI '07, page 69, New York, New York, USA, February 2007. ACM Press. ISBN 9781595936196. doi: 10.1145/1226969.1226983.
116

Martin Kaltenbrunner, Till Bovermann, Ross Bencina, and Enrico Costanza. TUIO - A Protocol for Table Based Tangible User Interfaces. In Proceedings of the 6th International Workshop on Gesture in Human-Computer Interaction and Simulation (GW 2005), Vannes, France, 2005. 116

O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. In IEEE International Conference on Robotics and Automation, volume 2, pages 500–505. Institute of Electrical and Electronics Engineers, 1985. doi: 10.1109/ROBOT.1985.1087247. 120

Philipp Knüsel, Reto Wyss, Peter König, and Paul F M J Verschure. Decoding a temporal population code. Neural Computation, 16(10):2079–100, October 2004. ISSN 0899-7667. doi: 10.1162/0899766041732459. URL http://portal.acm.org/citation.cfm?id=1119706.1119711. 16, 24

Philipp Knüsel, Mikael A Carlsson, Bill S Hansson, Tim C Pearce, and Paul F M J Verschure. Time and space are complementary encoding dimensions in the moth antennal lobe. Network, 18(1):35–62, 2007. ISSN 0954-898X (Print). doi: 10.1080/09548980701242573. 15, 24

Olivier Koch, Matthew R Walter, Albert S Huang, and Seth Teller. Ground robot navigation using uncalibrated cameras. In 2010 IEEE International Conference on Robotics and Automation, pages 2423–2430. IEEE, May 2010. ISBN 978-1-4244-5038-1. doi: 10.1109/ROBOT.2010.5509325. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5509325. 90

Randal A Koene and Michael E Hasselmo. An integrate-and-fire model of prefrontal cortex neuronal activity during performance of goal-directed decision making. Cerebral Cortex, 15(12):1964–81, December 2005. ISSN 1047-3211. doi: 10.1093/cercor/bhi072. URL http://cercor.oxfordjournals.org/cgi/content/abstract/15/12/1964. 33

Gord Kurtenbach and Eric Hulteen. Gestures in Human-Computer Communication. In Brenda Laurel, editor, The Art and Science of Interface Design, pages 309–317. Addison-Wesley Publishing Co., Boston, 1 edition, 1990.
75

Stephane Lallée, Séverin Lemaignan, Alexander Lenz, Chris Melhuish, Lorenzo Natale, S Skachek, T Van Der Zant, Felix Warneken, and P F Dominey. Towards a Platform-Independent Cooperative Human-Robot Interaction System: I. Perception. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4444–4451, 2010. 65, 126

M Laubach. Wavelet-based processing of neuronal spike trains prior to discriminant analysis. Journal of Neuroscience Methods, 134(2):159–168, April 2004. ISSN 0165-0270. doi: 10.1016/j.jneumeth.2003.11.007. URL http://dx.doi.org/10.1016/j.jneumeth.2003.11.007. 48

Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto, Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl, Ulysses Bernardet, and Paul Verschure. Disembodied and Collaborative Musical Interaction in the Multimodal Brain Orchestra. In Proceedings of the international conference on New Interfaces for Musical Expression, 2010. URL http://www.citeulike.org/user/slegroux/article/8492764.

A B Lee, B Blais, H Z Shouval, and L N Cooper. Statistics of lateral geniculate nucleus (LGN) activity determine the segregation of ON/OFF subfields for simple cells in visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 97(23):12875–9, November 2000. ISSN 0027-8424. doi: 10.1073/pnas.97.23.12875. URL http://www.pnas.org/cgi/content/abstract/97/23/12875. 4

K C Lee, J Ho, and D Kriegman. Acquiring Linear Subspaces for Face Recognition under Variable Lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684–698, 2005. xxviii, xxxiii, 19, 66, 67, 72

Dong Zhang Li, S. Z. Li, and Daniel Gatica-Perez. Real-Time Face Detection Using Boosting in Hierarchical Feature Spaces, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2088. 65

Bao-hua Liu, Guangying K Wu, Robert Arbuckle, Huizhong W Tao, and Li I Zhang.
Defining cortical frequency tuning with recurrent excitatory circuitry. Nature Neuroscience, 10(12):1594–600, December 2007. ISSN 1097-6256. doi: 10.1038/nn2012. URL http://dx.doi.org/10.1038/nn2012. 90

N K Logothetis and D L Sheinberg. Visual object recognition. Annual Review of Neuroscience, 19:577–621, January 1996. ISSN 0147-006X. doi: 10.1146/annurev.ne.19.030196.003045. URL http://www.ncbi.nlm.nih.gov/pubmed/8833455. 9

David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94. URL http://portal.acm.org/citation.cfm?id=993451.996342. 73

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009. ISSN 1574-0137. doi: 10.1016/j.cosrev.2009.03.005. URL http://linkinghub.elsevier.com/retrieve/pii/S1574013709000173. 24, 91

Andre Luvizotto, César Rennó-Costa, Ugo Pattacini, and Paul Verschure. The encoding of complex visual stimuli by a canonical model of the primary visual cortex: temporal population coding for face recognition on the iCub robot. In IEEE International Conference on Robotics and Biomimetics, page 6, Thailand, 2011. 19, 23, 76, 81, 86, 92, 132

Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, et al. Experimental and Functional Android Assistant: I. A Novel Architecture for a Controlled Human-Robot Interaction Environment. In IROS 2012 (Submitted), 2012a. 61

Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A framework for mobile robot navigation using a temporal population code. In Springer Lecture Notes in Computer Science - Living Machines, page 12, 2012b. 20

Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A wavelet based neural model to optimize and read out a temporal population code. Frontiers in Computational Neuroscience, 6(21):14, 2012c.
17, 82

Sean P MacEvoy, Thomas R Tucker, and David Fitzpatrick. A precise form of divisive suppression supports population coding in the primary visual cortex. Nature Neuroscience, 12(5):637–45, May 2009. ISSN 1546-1726. doi: 10.1038/nn.2310. URL http://dx.doi.org/10.1038/nn.2310. 23

Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, Burlington, USA, 1998. 17, 25, 31, 82

Gary Marsat and Leonard Maler. Neural heterogeneity and efficient population codes for communication signals. Journal of Neurophysiology, 104(5):2543–55, November 2010. ISSN 1522-1598. doi: 10.1152/jn.00256.2010. URL http://jn.physiology.org/cgi/content/abstract/104/5/2543. 15

Marialuisa Martelli, Najib J. Majaj, and Denis G. Pelli. Are faces processed like words? A diagnostic test for recognition by parts. Journal of Vision, 5(1):58–70, 2005. 19, 55

Zenon Mathews, Sergi Bermúdez i Badia, and Paul F.M.J. Verschure. PASAR: An integrated model of prediction, anticipation, sensation, attention and response for artificial sensorimotor systems. Information Sciences, 186(1):1–19, March 2011. ISSN 0020-0255. doi: 10.1016/j.ins.2011.09.042. 94, 132

André Maurer, Micha Hersch, and Aude G Billard. Extended hopfield network for sequence learning: Application to gesture recognition. Proceedings of ICANN, 1, 2005. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.7770. 76

David McNeill. Hand and Mind: What Gestures Reveal about Thought. University Of Chicago Press, 1996. ISBN 0226561348. URL http://www.amazon.com/Hand-Mind-Gestures-Reveal-Thought/dp/0226561348. 75

Giorgio Metta, Paul Fitzpatrick, and Lorenzo Natale. YARP: Yet Another Robot Platform. International Journal of Advanced Robotic Systems, 3(1), 2006. URL http://eris.liralab.it/yarp/. 64, 115

Giorgio Metta, Giulio Sandini, David Vernon, Lorenzo Natale, and Francesco Nori. The iCub humanoid robot: an open platform for research in embodied cognition.
In Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems - PerMIS '08, page 50, New York, New York, USA, 2008. ACM Press. doi: 10.1145/1774674.1774683. 55, 63, 110, 114

Ethan M Meyers, David J Freedman, Gabriel Kreiman, Earl K Miller, and Tomaso Poggio. Dynamic population coding of category information in inferior temporal and prefrontal cortex. Journal of Neurophysiology, 100(3):1407–19, September 2008. ISSN 0022-3077. doi: 10.1152/jn.90248.2008. URL http://jn.physiology.org/cgi/content/abstract/100/3/1407. 23

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, and Louis Goldstein. Recognizing articulatory gestures from speech for robust speech recognition. The Journal of the Acoustical Society of America, 131(3):2270–87, March 2012. ISSN 1520-8524. doi: 10.1121/1.3682038. URL http://www.ncbi.nlm.nih.gov/pubmed/22423722. 75

V. Mohan, P. Morasso, G. Metta, and G. Sandini. A biomimetic, force-field based computational model for motion planning and bimanual coordination in humanoid robots. Autonomous Robots, 27(3):291–307, August 2009. ISSN 0929-5593. doi: 10.1007/s10514-009-9127-x. 63, 120

V B Mountcastle. Modality and topographic properties of single neurons of cat's somatic sensory cortex. Journal of Neurophysiology, 20(4):408–34, July 1957. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/13439410. 6

F.A. Mussa Ivaldi, P. Morasso, and R. Zaccaria. Kinematic networks distributed model for representing and regularizing motor redundancy. Biological Cybernetics, 60(1):1–16, November 1988. ISSN 0340-1200. 120

Jonathan J Nassi and Edward M Callaway. Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5):360–372, April 2009. ISSN 1471-003X. doi: 10.1038/nrn2619. URL http://dx.doi.org/10.1038/nrn2619. xxiv, 3, 4, 5, 6

Ian Nauhaus, Andrea Benucci, Matteo Carandini, and Dario L Ringach. Neuronal selectivity and local map structure in visual cortex.
Neuron, 57(5):673–9, March 2008. ISSN 1097-4199. doi: 10.1016/j.neuron.2008.01.020. URL http://www.cell.com/neuron/fulltext/S0896-6273(08)00105-0. 7

Ian Nauhaus, Laura Busse, Matteo Carandini, and Dario L Ringach. Stimulus contrast modulates functional connectivity in visual cortex. Nature Neuroscience, 12(1):70–6, January 2009. ISSN 1546-1726. doi: 10.1038/nn.2232. URL http://dx.doi.org/10.1038/nn.2232. 7, 90

Andreas Nieder and Katharina Merten. A labeled-line code for small and large numerosities in the monkey prefrontal cortex. The Journal of Neuroscience, 27(22):5986–93, May 2007. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1056-07.2007. URL http://www.jneurosci.org/cgi/content/abstract/27/22/5986. 27

Behrad Noudoost, Mindy H Chang, Nicholas A Steinmetz, and Tirin Moore. Top-down control of visual attention. Current Opinion in Neurobiology, 20(2):183–90, April 2010. ISSN 1873-6882. doi: 10.1016/j.conb.2010.02.003. URL http://dx.doi.org/10.1016/j.conb.2010.02.003. 16

Selim Onat, Nora Nortmann, Sascha Rekauzke, Peter König, and Dirk Jancke. Independent encoding of grating motion across stationary feature maps in primary visual cortex visualized with voltage-sensitive dye imaging. NeuroImage, 55(4):1763–70, April 2011. ISSN 1095-9572. doi: 10.1016/j.neuroimage.2011.01.004. URL http://dx.doi.org/10.1016/j.neuroimage.2011.01.004. 17, 49, 103

R C O'Reilly. Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11):455–62, November 1998. ISSN 1364-6613. URL http://www.ncbi.nlm.nih.gov/pubmed/21227277. 10

C.P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. Narosa Publishing House, Bombay, India, January 1998. ISBN 81-7319-221-9. doi: 10.1109/ICCV.1998.710772. URL http://www.computer.org/portal/web/csdl/doi/10.1109/ICCV.1998.710772. 26

U Pattacini, F Nori, L Natale, G Metta, and G Sandini.
An experimental evaluation of a novel minimum-jerk cartesian controller for humanoid robots. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1668–1674. IEEE, October 2010. ISBN 978-1-4244-6674-0. doi: 10.1109/IROS.2010.5650851. 63, 120

Maxime Petit, Stéphane Lallée, Jean-David Boucher, Grégoire Pointeau, Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini, Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron, Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris, and Peter Ford Dominey. The Coordinating Role of Language in Real-Time Multi-Modal Learning of Cooperative Tasks. IEEE Transactions on Autonomous Mental Development (TAMD), 2012. 125

N Pinto, Y Barhomi, D D Cox, and J J DiCarlo. Comparing state-of-the-art visual features on invariant object recognition tasks. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 463–470, 2011. doi: 10.1109/WACV.2011.5711540. 110

Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):e27, January 2008. ISSN 1553-7358. doi: 10.1371/journal.pcbi.0040027. URL http://dx.plos.org/10.1371/journal.pcbi.0040027. 2

Tomaso Poggio and Emilio Bizzi. Generalization in vision and motor control. Nature, 431(7010):768–74, October 2004. ISSN 1476-4687. doi: 10.1038/nature03014. URL http://dx.doi.org/10.1038/nature03014. 11, 54

P Reinagel and R C Reid. Temporal coding of visual information in the thalamus. The Journal of Neuroscience, 20(14):5392–400, July 2000. ISSN 0270-6474. URL http://www.ncbi.nlm.nih.gov/pubmed/10884324. 12

César Rennó-Costa, John E Lisman, and Paul F M J Verschure. The mechanism of rate remapping in the dentate gyrus. Neuron, 68(6):1051–8, December 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2010.11.024. URL http://www.cell.com/neuron/fulltext/S0896-6273(10)00940-2.
92, 96, 106

César Rennó-Costa, André L Luvizotto, Encarni Marcos, Armin Duff, Martí Sánchez-Fibla, and Paul F M J Verschure. Integrating Neuroscience-based Models Towards an Autonomous Biomimetic Synthetic. In 2011 IEEE International Conference on RObotics and BIOmimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011. IEEE. 91, 97

César Rennó-Costa, André Luvizotto, Alberto Betella, Marti Sanchez Fibla, and Paul F. M. J. Verschure. Internal drive regulation of sensorimotor reflexes in the control of a catering assistant autonomous robot. In Lecture Notes in Artificial Intelligence: Living Machines, 2012.

M Riesenhuber and T Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–25, November 1999. ISSN 1097-6256. doi: 10.1038/14819. URL http://www.ncbi.nlm.nih.gov/pubmed/10526343. 2, 23

M Riesenhuber and T Poggio. Models of object recognition. Nature Neuroscience, 3 Suppl:1199–204, November 2000. ISSN 1097-6256. doi: 10.1038/81479. URL http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1199.html#B47. 2

Dario L Ringach. Spatial Structure and Symmetry of Simple-Cell Receptive Fields in Macaque Primary Visual Cortex. Journal of Neurophysiology, 88(1):455–463, 2002. URL http://jn.physiology.org/cgi/content/abstract/88/1/455. 25

Dario L. Ringach, Robert M. Shapley, and Michael J. Hawken. Orientation Selectivity in Macaque V1: Diversity and Laminar Dependence. J. Neurosci., 22(13):5639–5651, July 2002. URL http://www.jneurosci.org/cgi/content/abstract/22/13/5639. 7

R W Rodieck and J Stone. Analysis of receptive fields of cat retinal ganglion cells. Journal of Neurophysiology, 28(5):833, 1965. ISSN 0022-3077. URL http://jn.physiology.org/cgi/reprint/28/5/833.pdf. 6, 27, 94

F Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, November 1958. ISSN 0033-295X. URL http://www.ncbi.nlm.nih.gov/pubmed/13602029.
2

David E. Rumelhart, James L. McClelland, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations (Volume 1). A Bradford Book, 1986. ISBN 0262181207. URL http://www.amazon.com/Parallel-Distributed-Processing-Explorations-Microstructure/dp/0262181207. 2

P A Salin, J Bullier, and H Kennedy. Convergence and divergence in the afferent projections to cat area 17. The Journal of Comparative Neurology, 283(4):486–512, May 1989. ISSN 0021-9967. doi: 10.1002/cne.902830405. URL http://www.ncbi.nlm.nih.gov/pubmed/2745751. 8

Jason M Samonds and A B Bonds. From another angle: Differences in cortical coding between fine and coarse discrimination of orientation. Journal of Neurophysiology, 91(3):1193–202, March 2004. ISSN 0022-3077. doi: 10.1152/jn.00829.2003. URL http://www.ncbi.nlm.nih.gov/pubmed/14614106. 23

Miguel Sarabia, Raquel Ros, and Yiannis Demiris. Towards an open-source social middleware for humanoid robots. In 11th IEEE-RAS International Conference on Humanoid Robots, pages 670–675, Bled, 2011. 132

Alan B Saul. Temporal receptive field estimation using wavelets. Journal of Neuroscience Methods, 168(2):450–464, 2008. ISSN 0165-0270 (Print). doi: 10.1016/j.jneumeth.2007.11.014. 25

Dirk Schubert, Rolf Kötter, and Jochen F Staiger. Mapping functional connectivity in barrel-related columns reveals layer- and cell type-specific microcircuits. Brain Structure & Function, 212(2):107–19, September 2007. ISSN 1863-2653. doi: 10.1007/s00429-007-0147-z. URL http://www.ncbi.nlm.nih.gov/pubmed/17717691. 90

Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust Object Recognition with Cortex-Like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:411–426, 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.56. 2, 23

Robert Shapley and Michael J Hawken.
Color in the cortex: single- and double-opponent cells. Vision Research, 51(7):701–17, April 2011. ISSN 1878-5646. doi: 10.1016/j.visres.2011.02.012. URL /pmc/articles/PMC3121536/?report=abstract. 5

S Shushruth, Pradeep Mangapathy, Jennifer M Ichida, Paul C Bressloff, Lars Schwabe, and Alessandra Angelucci. Strong recurrent networks compute the orientation tuning of surround modulation in the primate primary visual cortex. The Journal of Neuroscience, 32(1):308–21, January 2012. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.3789-11.2012. URL http://www.jneurosci.org/content/32/1/308.full. 8

Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo. Robotics: Modelling, Planning and Control (Advanced Textbooks in Control and Signal Processing). Springer, 2011. ISBN 1846286417. 119

Kyriaki Sidiropoulou, Fang-Min Lu, Melissa A Fowler, Rui Xiao, Christopher Phillips, Emin D Ozkan, Michael X Zhu, Francis J White, and Donald C Cooper. Dopamine modulates an mGluR5-mediated depolarization underlying prefrontal persistent activity. Nature Neuroscience, 12(2):190–199, January 2009. ISSN 1097-6256. doi: 10.1038/nn.2245. URL http://dx.doi.org/10.1038/nn.2245. 33

Markus Siegel, Tobias H Donner, and Andreas K Engel. Spectral fingerprints of large-scale neuronal interactions. Nature Reviews Neuroscience, 13(2):121–34, February 2012. ISSN 1471-0048. doi: 10.1038/nrn3137. URL http://dx.doi.org/10.1038/nrn3137. 17

Lawrence C Sincich and Jonathan C Horton. The circuitry of V1 and V2: integration of color, form, and motion. Annual Review of Neuroscience, 28:303–26, January 2005. ISSN 0147-006X. doi: 10.1146/annurev.neuro.28.061604.135731. URL http://www.ncbi.nlm.nih.gov/pubmed/16022598. 8

Lawrence Sirovich and Marsha Meytlis. Symmetry, probability, and recognition in face space. Proceedings of the National Academy of Sciences of the United States of America, 106(17):6895–9, April 2009. ISSN 1091-6490.
doi: 10.1073/pnas.0812680106. URL http://www.pnas.org/cgi/content/abstract/106/17/6895. 19, 55

Olaf Sporns and Jonathan D Zwi. The small world of the cerebral cortex. Neuroinformatics, 2(2):145–62, January 2004. ISSN 1539-2791. doi: 10.1385/NI:2:2:145. URL http://www.ncbi.nlm.nih.gov/pubmed/15319512. 7, 23

Dan D Stettler, Aniruddha Das, Jean Bennett, and Charles D Gilbert. Lateral Connectivity and Contextual Interactions in Macaque Primary Visual Cortex. Neuron, 36(4):739–750, November 2002. ISSN 0896-6273. doi: 10.1016/S0896-6273(02)01029-2. URL http://dx.doi.org/10.1016/S0896-6273(02)01029-2. 7, 31

Charles F Stevens. Preserving properties of object shape by computations in primary visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 101(43):15524–15529, 2004. doi: 10.1073/pnas.0406664101. URL http://www.pnas.org/content/101/43/15524.abstract. 25

Stephen Sutton, Ronald Cole, Jacques De Villiers, Johan Schalkwyk, Pieter Vermeulen, Mike Macon, Yonghong Yan, Ed Kaiser, Brian Rundle, Khaldoun Shobaki, Paul Hosom, Alex Kain, Johan Wouters, Dominic Massaro, and Michael Cohen. Universal Speech Tools: The CSLU Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 7:3221–3224, 1998. 65, 122

Carolina Tamara and William Timberlake. Route and landmark learning by rats searching for food. Behavioural Processes, 86(1):125–32, January 2011. ISSN 1872-8308. doi: 10.1016/j.beproc.2010.10.007. URL http://dx.doi.org/10.1016/j.beproc.2010.10.007. 1

K Tanaka. Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19:109–39, January 1996. ISSN 0147-006X. doi: 10.1146/annurev.ne.19.030196.000545. 10, 11

S Thorpe, D Fize, and C Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–2, June 1996.
ISSN 0028-0836. doi: 10.1038/381520a0. URL http://www.ncbi.nlm.nih.gov/pubmed/8632824. 17, 41, 68

Sebastian Thrun. Toward a framework for human-robot interaction. Human-Computer Interaction, 19, 2004. 110

Paul Tiesinga and Terrence J Sejnowski. Cortical enlightenment: are attentional gamma oscillations driven by ING or PING? Neuron, 63(6):727–32, September 2009a. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.09.009. URL http://dx.doi.org/10.1016/j.neuron.2009.09.009. 16

Paul Tiesinga and Terrence J Sejnowski. Cortical enlightenment: are attentional gamma oscillations driven by ING or PING? Neuron, 63(6):727–32, September 2009b. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.09.009. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2778762&tool=pmcentrez&rendertype=abstract. 104

Paul Tiesinga, Jean-Marc Fellous, and Terrence J Sejnowski. Regulation of spike timing in visual cortical circuits. Nature Reviews Neuroscience, 9(2):97–107, February 2008. ISSN 1471-0048. doi: 10.1038/nrn2315. URL http://dx.doi.org/10.1038/nrn2315. 12

Thomas Töllner, Michael Zehetleitner, Klaus Gramann, and Hermann J Müller. Stimulus saliency modulates pre-attentive processing speed in human visual cortex. PLoS ONE, 6(1):e16276, January 2011. ISSN 1932-6203. doi: 10.1371/journal.pone.0016276. URL http://dx.plos.org/10.1371/journal.pone.0016276. 41

Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: the origins of cultural cognition. The Behavioral and Brain Sciences, 28(5):675–91; discussion 691–735, October 2005. ISSN 0140-525X. doi: 10.1017/S0140525X05000129. 125, 126

Doris Y Tsao, Winrich A Freiwald, Tamara A Knutsen, Joseph B Mandeville, and Roger B H Tootell. Faces and objects in macaque cerebral cortex. Nature Neuroscience, 6(9):989–95, September 2003. ISSN 1097-6256. doi: 10.1038/nn1111. URL http://dx.doi.org/10.1038/nn1111.
10

Doris Y Tsao, Winrich A Freiwald, Roger B H Tootell, and Margaret S Livingstone. A cortical region consisting entirely of face-selective cells. Science (New York, N.Y.), 311(5761):670–4, February 2006. ISSN 1095-9203. doi: 10.1126/science.1119983. URL http://www.sciencemag.org/content/311/5761/670.abstract. 10

P F M J Verschure and P Althaus. A real-world rational agent: Unifying old and new AI. Cognitive Science, 27:561–590, 2003. 61, 111, 112

Paul F M J Verschure, Thomas Voegtlin, and Rodney J Douglas. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425(6958):620–4, October 2003. ISSN 1476-4687. doi: 10.1038/nature02024. URL http://dx.doi.org/10.1038/nature02024. 27, 105

J Victor and K Purpura. Estimation of information in neuronal responses. Trends in Neurosciences, 22(12):543, 1999. ISSN 0166-2236 (Print). 41, 51

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Comput. Soc, Kauai, HI, USA, 2001. ISBN 0-7695-1272-0. doi: 10.1109/CVPR.2001.990517. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=990517. 26

Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, April 2005. ISSN 0025-5610. doi: 10.1007/s10107-004-0559-y. 120

T N Wiesel and D H Hubel. Spatial and chromatic interactions in the lateral geniculate body of the rhesus monkey. Journal of Neurophysiology, 29(6):1115–56, November 1966. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/4961644. 4

Ben Willmore, Ryan J Prenger, Michael C-K Wu, and Jack L Gallant. The berkeley wavelet transform: a biologically inspired orthogonal wavelet transform. Neural Computation, 20(6):1537–1564, 2008. ISSN 0899-7667 (Print). doi: 10.1162/neco.2007.05-07-513.
25

Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science (New York, N.Y.), 316(5831):1609–12, June 2007. ISSN 1095-9203. doi: 10.1126/science.1139597. URL http://www.sciencemag.org/content/316/5831/1609.abstract. 16

Reto Wyss and Paul F. M. J. Verschure. Bounded Invariance and the Formation of Place Fields. In Advances in Neural Information Processing Systems 16. MIT Press, 2004a. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.4379. 99

Reto Wyss and Paul F M J Verschure. Bounded Invariance and the Formation of Place Fields. In Neural Information Processing Systems 16, Vancouver, B.C. Canada, 2004b. 18, 22

Reto Wyss, Peter König, and Paul F M J Verschure. Invariant representations of visual patterns in a temporal population code. Proceedings of the National Academy of Sciences of the United States of America, 100(1):324–9, January 2003a. ISSN 0027-8424. doi: 10.1073/pnas.0136977100. URL http://www.pnas.org/cgi/content/abstract/100/1/324. 12, 13, 18, 22, 27, 51, 55, 56, 76, 77, 90, 91, 92, 94

Reto Wyss, Paul F M J Verschure, and Peter König. Properties of a temporal population code. Reviews in the Neurosciences, 14(1-2):21–33, January 2003b. ISSN 0334-1763. URL http://www.ncbi.nlm.nih.gov/pubmed/12929915. xxiv, 12, 13, 18, 103

Reto Wyss, Paul F M J Verschure, and Peter König. Properties of a temporal population code. Reviews in the Neurosciences, 14(1-2):21–33, January 2003c. ISSN 0334-1763. URL http://www.ncbi.nlm.nih.gov/pubmed/12929915. 22, 27, 35, 41, 50, 51, 55, 56, 77, 86, 90, 92, 94

Reto Wyss, Peter König, and Paul F M J Verschure. A model of the ventral visual system based on temporal stability and local memory. PLoS Biology, 4(5):e120, May 2006. ISSN 1545-7885. doi: 10.1371/journal.pbio.0040120. URL http://dx.plos.org/10.1371/journal.pbio.0040120.
18, 91, 106

T Yousef, E Tóth, M Rausch, U T Eysel, and Z F Kisvárday. Topography of orientation centre connections in the primary visual cortex of the cat. Neuroreport, 12(8):1693–9, July 2001. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/11409741. 7

Jochen Zeil, Martin I. Hofmann, and Javaan S. Chahl. Catchment areas of panoramic snapshots in outdoor scenes. Journal of the Optical Society of America A, 20(3):450, March 2003. ISSN 1084-7529. doi: 10.1364/JOSAA.20.000450. URL http://josaa.osa.org/abstract.cfm?URI=josaa-20-3-450. 101, 107

S Zeki and S Shipp. The functional logic of cortical connections. Nature, 335(6188):311–7, September 1988. ISSN 0028-0836. doi: 10.1038/335311a0. URL http://dx.doi.org/10.1038/335311a0. 6

Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000. ISSN 0162-8828. doi: 10.1109/34.888718. 119

W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys (CSUR), 35(4), 2003. ISSN 0360-0300. URL http://portal.acm.org/citation.cfm?id=954342. 19, 55

Davide Zoccolan, Nadja Oertelt, James J DiCarlo, and David D Cox. A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences of the United States of America, 106(21):8748–53, May 2009. ISSN 1091-6490. doi: 10.1073/pnas.0811583106. URL http://www.pnas.org/cgi/content/abstract/106/21/8748. 1