The encoding and decoding of complex visual stimuli: a neural model
to optimize and read out a temporal
population code.
Andre Luiz Luvizotto
DOCTORAL THESIS UPF / 2012
Thesis supervisor:
Prof. Dr. Paul Verschure
Department of Information and Communication Technologies
This work is licensed under a Creative Commons
Attribution-NonCommercial-NoDerivs 3.0 Unported license.
You are free to Share – to copy, distribute and transmit the work – under
the following conditions:
• Attribution – You must attribute the work in the manner specified
by the author or licensor (but not in any way that suggests that they
endorse you or your use of the work).
• Noncommercial – You may not use this work for commercial purposes.
• No Derivative Works – You may not alter, transform, or build
upon this work.
With the understanding that:
Waiver – Any of the above conditions can be waived if you get permission
from the copyright holder.
Public Domain – Where the work or any of its elements is in the public
domain under applicable law, that status is in no way affected by
the license.
Other Rights – In no way are any of the following rights affected by the
license:
• Your fair dealing or fair use rights, or other applicable copyright
exceptions and limitations;
• The author’s moral rights;
• Rights other persons may have either in the work itself or in
how the work is used, such as publicity or privacy rights.
Notice – For any reuse or distribution, you must make clear to others
the license terms of this work. The best way to do this is with a link
to this web page.
The PhD committee was appointed by the rector of the Universitat Pompeu
Fabra on .............................................., 2012.
Chairman:
Secretary:
Member:
The doctoral defense was held on ......................................................., 2012,
at the Universitat Pompeu Fabra and scored as ...................................................
PRESIDENT
MEMBERS
SECRETARY
To my parents...
Acknowledgements
This dissertation was developed in the Department of Technology of the
Universitat Pompeu Fabra, in the research lab SPECS. My first and
sincere thanks go to Prof. Paul Verschure, who from the beginning constantly
encouraged and supported me in developing my research ideas and goals
with his supervision. Thanks also to all SPECS members
for many useful insights, discussions and feedback.
My deepest gratitude to my great friend Cesar Rennó-Costa for the extensive
support, discussions and valuable overall collaboration that contributed
crucially to this thesis. I would like to thank Zenon Mathews and
Martí Sanchez-Fibla for their invaluable support over these years, and my
great friend Prof. Jonatas Manzolli for linking me with Paul and SPECS.
I'm deeply thankful to my parents, Ismael and Angela, and my brothers
George and Mauly, for all the love and support throughout this journey;
they have always been there for me. And also to my "youngest parents",
Tila and Giancarlo: you have both contributed so much with your love
and care!
My deepest and heartfelt thanks to my beloved Sassi. I do not have enough
words to express how important you have been to me. Your support,
company, love, lecker food and patience made everything easier and much
more beautiful! Thanks, and I love you.
Finally, I thank my heroes John, Paul, George, Ringo, Hendrix, Slash,
Vaughan, EVH, Mr. Vernon and many others for providing me with the
rock'n'roll stamina to write this dissertation!
Abstract
The mammalian visual system has a remarkable capacity to process
a large amount of information within milliseconds, under widely varying
conditions, into invariant representations. Recently, a model of the primary
visual system exploited the dense local excitatory connectivity of the
neo-cortex, a unique feature of this tissue, to meet these criteria. The model
rapidly generates invariant representations by integrating the activity of
spatially distributed model neurons into a so-called Temporal Population Code
(TPC). In this thesis, we first investigate an issue that has persisted since
the introduction of TPC: extending the concept to a biologically compatible
readout stage. We propose a novel neural readout circuit based on the wavelet
transform that decodes the TPC over different frequency bands. We show
that, in comparison with the purely linear readouts used previously, the
proposed system provides a robust, fast and highly compact representation
of visual input. We then generalize this optimized encoding-decoding
paradigm to a number of real-world robotic tasks to investigate its
robustness. Our results show that complex stimuli such as human faces,
hand gestures and environmental cues can be reliably encoded by TPC,
which provides a powerful, biologically plausible framework for real-time
object recognition. In addition, our results suggest that the representation
of sensory input can be built into a spatio-temporal code that is interpreted
and parsed into series of wavelet-like components by higher visual areas.
Resumen
The mammalian visual system has a remarkable capacity to process
information on millisecond time scales under highly variable conditions
and to acquire invariant representations of this information. Recently, a
model of the primary visual cortex exploited the dense local excitatory
connectivity of the neocortex to model these capacities. The model rapidly
integrates the spatially distributed activity of its neurons and generates
invariant encodings called Temporal Population Codes (TPC). Here we
investigate a question that has persisted since the introduction of TPC:
the study of a biologically plausible process capable of reading out these
encodings. We propose a novel neural readout circuit based on the wavelet
transform that decodes the TPC signal over different frequency bands. We
show that, compared with the purely linear readouts used previously, the
proposed system provides a robust, fast and compact representation of the
visual input. We also present a generalization of this optimized
encoding-decoding paradigm, which we apply to different computer vision
tasks and to vision in the context of robotics. The results of our study
suggest that representations of complex visual scenes, such as human faces,
hand gestures and environmental cues, can be encoded by TPC, which can
be considered a powerful biological framework for real-time object
recognition. Moreover, our results suggest that the representation of
sensory input can be integrated into a spatio-temporal code, interpreted
and analyzed as a series of wavelet components by higher visual areas.
Publications
Included in the thesis
Peer-reviewed
• Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure.
A wavelet based neural model to optimize and read out a temporal
population code. Frontiers in Computational Neuroscience, 6(21):14, 2012c.
• Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure.
A framework for mobile robot navigation using a temporal population
code. In Springer Lecture Notes in Computer Science - Living Machines,
page 12, 2012b.
• Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, et al. Experimental
and Functional Android Assistant: I. A Novel Architecture for a
Controlled Human-Robot Interaction Environment. In IROS 2012
(submitted), 2012a.
• Andre Luvizotto, César Rennó-Costa, Ugo Pattacini, and Paul Verschure.
The encoding of complex visual stimuli by a canonical model of the
primary visual cortex: temporal population coding for face recognition
on the iCub robot. In IEEE International Conference on Robotics and
Biomimetics, page 6, Thailand, 2011.
Other publications as co-author
• César Rennó-Costa, André Luvizotto, Alberto Betella, Marti
Sanchez-Fibla, and Paul F. M. J. Verschure. Internal drive regulation of
sensorimotor reflexes in the control of a catering assistant autonomous
robot. In Lecture Notes in Artificial Intelligence: Living Machines, 2012.
• Maxime Petit, Stéphane Lallée, Jean-David Boucher, Grégoire Pointeau,
Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini,
Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron,
Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris,
and Peter Ford Dominey. The Coordinating Role of Language in
Real-Time Multi-Modal Learning of Cooperative Tasks. IEEE
Transactions on Autonomous Mental Development (TAMD), 2012.
• César Rennó-Costa, André L. Luvizotto, Encarni Marcos, Armin Duff,
Martí Sánchez-Fibla, and Paul F. M. J. Verschure. Integrating
Neuroscience-based Models Towards an Autonomous Biomimetic
Synthetic. In 2011 IEEE International Conference on Robotics and
Biomimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011. IEEE.
• Armin Duff, César Rennó-Costa, Encarni Marcos, Andre Luvizotto,
Andrea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and
Paul Verschure. From Motor Learning to Interaction Learning in
Robots, volume 264 of Studies in Computational Intelligence. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7.
doi: 10.1007/978-3-642-05181-4.
URL http://www.springerlink.com/content/v348576tk12u628h
• Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto,
Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl,
Ulysses Bernardet, and Paul Verschure. Disembodied and Collaborative
Musical Interaction in the Multimodal Brain Orchestra. In Proceedings
of the International Conference on New Interfaces for Musical
Expression, 2010.
URL http://www.citeulike.org/user/slegroux/article/8492764
Contents

Abstract
Resumen
Publications
List of Figures
List of Tables
1 Introduction
  1.1 Early visual areas
  1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain
  1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information
  1.4 How can the key components encapsulated in a temporal code be decoded by different cortical areas involved in the visual process?
  1.5 Can TPC be used in the recognition of multi-modal sensory input such as human faces and gestures to provide form perception primitives?
2 A wavelet based neural model to optimize and read out a temporal population code
  2.1 Introduction
  2.2 Material and methods
  2.3 Results
  2.4 Discussion
  2.5 Acknowledgments
3 Temporal Population Code for Face Recognition on the iCub Robot
  3.1 Introduction
  3.2 Methods
  3.3 Results
  3.4 Discussion
4 Using a temporal population code to recognize gestures on the humanoid robot iCub
  4.1 Introduction
  4.2 Material and methods
  4.3 Results
  4.4 Discussion
5 A framework for mobile robot navigation using a temporal population code
  5.1 Introduction
  5.2 Methods
  5.3 Results
  5.4 Discussion
6 Conclusion
A Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction
  A.1 Introduction
  A.2 Methods
  A.3 Results
  A.4 Conclusions
Bibliography
List of Figures

1.1 Parallel pathways in the early visual system. Redrawn based on (Nassi and Callaway, 2009).
1.2 Schematics of receptive-field responses in simple and complex cells: V1 cells are orientation selective, i.e., a cell's activity increases or decreases when the orientation of edges and bars matches the preferred orientation φ of the cell.
1.3 Overview of the Temporal Population Code architecture. In the model proposed by Wyss et al. (2003b), the input stimulus passes through an edge-detection stage that approximates the receptive-field characteristics of LGN cells. In the next stage, the LGN output is continuously projected to a network of laterally connected integrate-and-fire neurons with properties found in the primary visual cortex. Due to the recurrent connections, the visual stimulus becomes encoded in the network's activity trace.
2.1 The TPC encoding model. In a first step, the input image is projected to the LGN stage, where its edges are enhanced. The LGN output then passes through a set of Gabor filters that resemble the orientation-selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC.
2.2 The TPC encoding paradigm. The stimulus, here represented by a star, is projected topographically onto a map of interconnected cortical neurons. When a neuron spikes, its action potential is distributed over a neighborhood of a given radius. The lateral transmission delay of these connections is 1 ms/unit. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC representation is defined by the spatial average of the population activity over a certain time window. The invariances that the TPC encoding renders are defined by the local excitatory connections.
2.3 Computational properties of the two types of neurons used in the simulations: regular spiking (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz).
2.4 a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate the network activity in time, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit, where only the frequency range corresponding to resolution level Dc3 is kept in the spectrum.
2.5 The stimulus classes used in the experiments, after the edge enhancement of the LGN stage.
2.6 The stimulus set. a) Image-based prototypes (no jitter applied to the vertices) and the globally most different exemplars, with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histogram of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain.
2.7 Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain.
2.8 Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters for both types of neurons, RS and BS. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multi-resolution scheme. The network time window is 128 ms.
2.9 Speed of encoding. Number of bits encoded by the network's activity trace as a function of time. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. RS-Wav and BS-Wav correspond to the bits encoded by the wavelet coefficients, using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons respectively. For a time window of 128 ms, the Dc3 level has 16 coefficients and the Dc5 level only 4. The dots in the figure represent the moments in time at which the coefficients are generated.
2.10 Single-sided amplitude spectrum of the wavelet prototype for each stimulus class used in the simulations. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion).
2.11 Prototype-based classification hit matrices. For each class in the training set we average the wavelet coefficients to form class prototypes. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.
2.12 Distribution of errors in the wavelet-based prototype classification in relation to the Euclidean distances within the prototyped classes.
3.1 Schematic of the lateral propagation using the Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the frequency domain. The matrix of spikes generated at time t is multiplied by the filters of lateral propagation Fd over the next time steps t+1 and t+2. The output is given by the inverse DFT of the result.
3.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.
3.3 Model overview. Faces are detected and cropped from the input provided by the iCub's camera image. The cropped faces are resized to a fixed resolution of 128x128 pixels and convolved with the orientation-selective filters. The output of each orientation layer is processed by separate neural populations, as explained above. The spike activity is summed over a specific time window, rendering the Temporal Population Code, or TPC.
3.4 Cropped faces from the Yale face database B, used as a benchmark to compare TPC with other methods of face recognition available in the literature (Georghiades et al., 2001; Lee et al., 2005).
3.5 Speed of encoding. Average correct classification ratio given by the network's activity trace as a function of time, for different values of the input threshold Ti.
3.6 Response clustering. The entries of the hit matrix represent the number of times a stimulus class is assigned to a response class over the optimal time window of 21 ms and Ti of 0.6.
3.7 Face data set. Training set (green), classification set (blue) and misclassified faces (red).
3.8 Classification ratio using a spatial subsampling strategy where the network activity is read over multiple subregions.
3.9 Classification ratio using the cropped faces from the Yale database.
4.1 The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture that defeats that of the opponent: rock breaks scissors, scissors cut paper and paper covers rock. Unlike a truly random selection method, such as coin flipping or dice, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent.
4.2 Visual model diagram. In the first step, the input image is color filtered in the hue, saturation and value (HSV) space and segmented from the background. The segmented image is then projected to the LGN stage, where its edges are enhanced. The LGN output then passes through a set of Gabor filters that resemble the orientation-selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet readout system.
4.3 Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto the network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window.
4.4 Computational properties, after frequency adaptation, of the integrate-and-fire neuron used in the simulations. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz).
4.5 Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with frequency range equal to the difference in frequency between the current and previous approximation levels.
4.6 The stimulus classes used in the experiments. Here 12 exemplars of each class are shown.
4.7 Classification hit matrix for the baseline. In the classification process, the correlation between the classification set (25% of the stimuli) and the training set (75%) is calculated. A stimulus is assigned to the class with the smallest mean correlation to the respective class in the training set.
4.8 Classification hit matrix. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.
5.1 Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by the formation of place cells, which show high firing rates whenever the robot is in a specific location in the environment.
5.2 Visual model overview. Salient regions are detected and cropped from the input image provided by the camera. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference-of-Gaussians (DoG) operator that approximates the receptive-field properties of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and the spiking activity is summed over a specific time window, rendering the Temporal Population Code.
5.3 Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m² square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors using a histogram of 7 bins.
5.4 Pairwise correlation of the TPC histograms in the environment. We calculate the correlation among all possible combinations of two positions in the environment and average the correlation values into distance intervals, according to the sampling distances used in the image acquisition. This result suggests that we can produce a linear curve of distance based on the correlation values calculated from the TPC histograms.
5.5 Place cells acquired by the combination of TPC-based visual representations and the E%-max WTA model of the hippocampus.
A.1 The iCub ready to play with the objects on the Reactable. The table provides an appealing atmosphere for social interactions between humans and the iCub.
A.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.
A.3 Reactable system overview. See text for further explanation.
A.4 PMP architecture: at the lower level, the pmpServer module computes the trajectory, targets and obstacles in the space, feeding the Cartesian Interface with the trajectory that has to be followed. The user relies on a pmpClient to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level, the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives.
A.5 A cartoon of the experimental scenario. The distances for the left (L), middle (M) and right (R) positions are given relative to the robot.
A.6 Box plots of GEx and GEy for the grasp experiments. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y.
A.7 Box plots for the place experiment for the different conditions analyzed: source and target. Both conditions have an impact on the error.
A.8 Average time needed for the shared-plan tasks. The learning box represents the time needed for the human to create the task; the execution box represents the time needed to perform the learned action during the interaction.
List of Tables

2.1 Parameters used for the simulations.
3.1 Parameters used for the simulations. See text for further explanation.
3.2 Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method.
4.1 Parameters used for the simulations.
5.1 Parameters used for the simulations.
A.1 Excerpt of available OPC properties.
Chapter 1

Introduction
In our daily life we are constantly exposed to situations where recognizing
different objects, faces, colors or even estimating the speed of a moving
object plays a key role in the way we behave. To recognize different scenes
under widely varying conditions, the visual system must solve the hard
problem of building invariant representations of available sensory information. Invariant representation of visual input is a key element for a
number of tasks in different species. For example, it has been shown in
foraging experiments with rats that the animal relies on visual landmarks
to explore the environment and to produce an internal representation of
the world. This strategy allows not only rats, but a number of other animals, to successfully navigate over different territories while searching for
food (Tamara and Timberlake, 2011; Zoccolan et al., 2009).
Furthermore, the invariant representation of visual content must be computed
very efficiently, capturing all the rich details necessary to distinguish among
different situations in a fast and non-redundant way, so that prototypes of a
class can emerge and be efficiently stored in memory and/or serve on-going
action. Most work on invariance in the visual cortex has focused on the
primary visual cortex. Despite intense research in the fields of vision and
neuroscience, it remains unclear which cortical mechanisms underlie the
building of invariant representations (Pinto et al., 2008). This question has
been of great interest to the field of computational neuroscience and also to
roboticists building artificial systems acting in real-world situations
(Riesenhuber and Poggio, 2000), because biological systems far outperform
advanced artificial systems in solving real-world visual tasks.
In classical models of visual perception, invariant representations emerge in
the form of activity patterns at the highest level of a hierarchical multi-layer
network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and
Poggio, 1999; Serre et al., 2007). These models are extensions of the Hubel
and Wiesel simple-to-complex cell hierarchy. In this approach, invariances
are achieved at the cost of increasing the number of connections between the
different layers of the hierarchy. However, these models seem to rest on a
fundamental assumption that is not consistent with cortical anatomy.
Another inconsistency of traditional artificial neural networks, such as the
perceptron and multi-layer perceptrons (Rosenblatt, 1958; Rumelhart et al.,
1986), is their time independence. These models were designed to deal with
purely static spatial patterns of input; when time enters at all, it is treated
as just another dimension, which makes those models extremely hard-wired.
From the biological perspective, it is clear that time is not treated as another
spatial dimension at the input level. Thus a central question in this
discussion is how time can be incorporated in the representation of a
spatial input.
In this dissertation we address this issue by proposing a plausible circuit
for cortical processing of sensory input. We develop a model of the early
visual system that aims to capture the key ingredients necessary to build a
robust, invariant and compact representation of the visual world. The
proposed model is fully grounded in physiological and anatomical properties
of the mammalian visual system and gives a functional interpretation to the
dense intra-cortical connectivity observed in the cortex. In this chapter, we
start with a brief review of the brain areas involved in the representation
and categorization of objects, defining in detail the questions and proposed
solutions.
1.1 Early visual areas
Retina
The sensory input organ for the visual system is the eye, where the first
steps of seeing begin in the retina (Nassi and Callaway, 2009). The light
that hits the eye passes through the cornea and the lens before arriving at
the retina. The retina contains a dense array of photoreceptors, called rods
and cones, that encode the intensity of light as a function of 2D position,
wavelength and time. The rods are in charge of scotopic, or night, vision
and saturate in daylight. They are very sensitive to light and can respond
to a single photon. The cones are much less sensitive to light and are
therefore responsible for daylight vision. Cones come in three subtypes with
distinct spectral absorption functions: short (S), medium (M), and long (L),
which form the basis for chromatic vision. A single cone by itself is color
blind in the sense that its activation depends on both the wavelength and
the intensity of the stimulus. The signals from different classes of
photoreceptors must be compared in order to extract information about
color (Conway et al., 2010). The existence of cone-opponent retinal ganglion
cells (RGCs) that perform such comparisons is well established in primates.
The axons of RGCs form the optic nerve connecting the eye to the brain.
Three types of RGC have been particularly well characterized.
Morphologically defined ON and OFF midget RGCs are the origin of the
parvocellular pathway; they provide an L versus M color-opponent signal to
the parvocellular layers of the lateral geniculate nucleus (LGN), often
associated with red-green color opponency. Parasol RGCs constitute the
magnocellular pathway and convey a broadband, achromatic signal to the
magnocellular layers of the LGN. Cells in this pathway have large receptive
fields that are highly sensitive to contrast and spatial frequency. Finally,
small and large bistratified RGCs compose the koniocellular pathway and
convey an S versus L+M, or blue-yellow, color-opponent signal to the
koniocellular layers of the LGN (Nassi and Callaway, 2009).
In both areas, retina and LGN, cells have center-surround receptive fields
(Lee et al., 2000). These neurons show an ON-center/OFF-surround
response to light intensity, or vice versa: an ON-center cell increases its
activity when light hits the center of its receptive field and decreases it
when light hits the surround, and vice versa for OFF-center cells.
Most researchers argue that the retina's principal function is to convey the
visual image through the optic nerve to the brain. However, recent results
suggest that already at this early stage of the visual pathway the chromatic
or achromatic characteristics of a stimulus are encoded in a complex
temporal response of RGCs (Gollisch and Meister, 2008, 2010).
Lateral geniculate nucleus
The output of these specialized channels is projected to the LGN of the
thalamus. The primate LGN has a laminar organization, divided into six
layers that receive information from the RGCs. The most ventral layers
(layers 1-2) receive input from the magnocellular pathway, while the
remaining layers receive input from the parvocellular pathway. Wiesel and
Hubel (1966) found that monkey neurons in the magnocellular layers of the
LGN were largely color-blind, with achromatic, large and highly
contrast-sensitive receptive fields, whereas neurons in the parvocellular
layers had smaller, color-opponent and poorly contrast-sensitive receptive
fields. Further work revealed that the koniocellular pathway connects in
segregated stripes between the parvocellular and magnocellular layers of the
LGN (reviewed by Callaway (2005) and Shapley and Hawken (2011))
(Fig. 1.1a).
Figure 1.1: Parallel pathways in the early visual system. Redrawn based on
(Nassi and Callaway, 2009).
In the primary visual cortex, the output of the parvo-, magno- and
koniocellular cells of the LGN remains segregated. Parvo cells terminate in
layer 4Cβ, whereas magno cells innervate layer 4Cα. Konio cells provide the
input to layers 2/3. These distinct anatomical projections persuaded early
investigators that the parvo and magno channels remain functionally
isolated in V1, which is still a source of controversy (Nassi and Callaway,
2009) (Fig. 1.1a).
The receptive fields of LGN cells exhibit an ON-OFF response to light
intensity. Modeling studies have used filters based on a difference of
Gaussians (DoG) to approximate the receptive-field characteristics of LGN
cells (Einevoll and Plesser, 2011; Rodieck and Stone, 1965). From an
image-processing perspective, DoG filters perform edge enhancement over
different spatial frequencies.
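As an illustration, such a DoG edge enhancer can be sketched in a few lines of Python; the sigma values below are illustrative choices, not the parameters of the thesis model:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_filter(image, sigma_center=1.0, sigma_surround=3.0):
    """ON-center difference-of-Gaussians: a narrow excitatory center
    minus a broader inhibitory surround, a common approximation of
    LGN center-surround receptive fields."""
    img = np.asarray(image, dtype=float)
    return gaussian_filter(img, sigma_center) - gaussian_filter(img, sigma_surround)
```

Because the surround subtracts the local mean, uniform regions map to zero while luminance edges survive, which is exactly the edge-enhancement role attributed to the LGN stage above.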
Primary visual cortex
In the visual cortex, different properties of objects - their distances, shapes,
colors and directions of motion - are segregated into separate cortical areas
(Zeki and Shipp, 1988). The earliest stage is the primary visual cortex (V1,
striate cortex or area 17), which is characterized by a columnar organization
in which neurons with similar response properties are grouped vertically
(Mountcastle, 1957).
The primary visual cortex shows the stereotypical layered structure
observed in the neo-cortex. It is structured into six layers, numbered from
the surface to the depth from 1 to 6 (Fig. 1.1b). The projection from the
LGN to V1 follows a systematic relationship between a stimulus position in
visual space and its position in the cortex, in retinal coordinates. This
topological arrangement, in which nearby regions of the retina project to
nearby cortical regions, is called retinotopy (Cowey and Rolls, 1974).
In V1, there are two types of excitatory cells, known as pyramidal and
stellate. Both cell types receive direct feedforward input into layer 4C from
the magnocellular layers of the thalamus. Pyramidal cells have large apical
dendrites that pass above layer 4B and into layers 2/3. The apical dendrites
of pyramidal cells receive connections from the parvocellular pathway to
layer 4Cβ in V1 and also project into layers 2/3. Indeed, most of the
feed-forward input into layers 2/3 originates from layers 4C and 4B; only a
few inputs come from the koniocellular layers of the LGN.
Neurons in V1 exhibit selectivity for the orientation of a visual stimulus
(Hubel and Wiesel, 1962; Ringach et al., 2002; Nauhaus et al., 2009). They
are distributed in orientation maps, where some neurons lie in regions of
fairly homogeneous orientation preference, while others lie in regions with a
variety of preferences called pinwheel centers (Nauhaus et al., 2008). V1
neurons are divided into two classes according to the properties of their
receptive fields: simple cells and complex cells (Hubel and Wiesel, 1962).
Similarly to retinal ganglion and geniculate cells, simple cells show distinct
excitatory and inhibitory subdivisions. They respond best to elongated bars
or edges in a preferred orientation. Complex cells are more concerned with
the orientation of a stimulus than with its exact position in the receptive
field (Fig. 1.2).
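This orientation selectivity of simple cells is commonly modeled with Gabor filters, which is also how the orientation layers of the model in later chapters are built. A minimal sketch, with all parameter values as illustrative assumptions:

```python
import numpy as np

def gabor_kernel(size=21, theta=0.0, wavelength=8.0, sigma=4.0, gamma=0.5):
    """2D Gabor kernel with preferred orientation theta (radians): a
    sinusoidal carrier under a Gaussian envelope, the standard
    first-order model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotate into the filter frame
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

# A small bank covering four preferred orientations.
bank = [gabor_kernel(theta=t) for t in np.deg2rad([0, 45, 90, 135])]
```

Convolving an image with such a kernel responds strongly to edges aligned with theta and weakly to orthogonal ones, mirroring the tuning curve sketched in Fig. 1.2.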
The excitatory circuitry in the primary visual cortex is characterized by
dense local connectivity (Callaway, 1998; Briggs and Callaway, 2001). It is
estimated that 95% of the synapses in V1 are local or feedback connections
from other visual areas (Sporns and Zwi, 2004). Within the same area,
projections can link neurons over distances of several millimeters, spatially
distributed into clusters with the same stimulus preference (Stettler et al.,
2002). It has been proposed that V1 selectivity to stimuli with similar
orientation preference is mediated by long-range horizontal connections
intrinsic to V1 (Stettler et al., 2002). This rule has established the notion of
a like-to-like pattern of connectivity (Gilbert and Wiesel, 1989). However,
some recent experiments have found conflicting results: pyramidal neurons
lying close to populations of diffuse orientation selectivity (pinwheel
centers) reportedly connect laterally to orientation columns in a
cross-oriented or non-selective way (Yousef et al., 2001).
Figure 1.2: Schematics of receptive-field responses in simple and complex
cells: V1 cells are orientation selective, i.e., a cell's activity increases or
decreases when the orientation of edges and bars matches the preferred
orientation φ of the cell.

There are also massive feedback projections from higher visual areas, for
example V2. Compared with the feedforward and lateral connections,
the feedback projections to V1 have received less attention from the
scientific community (Sincich and Horton, 2005). Feedback connections
from higher cortical areas provide more diffuse and divergent input to V1
(Salin et al., 1989). At first, anatomical studies suggested that V2 and V1
cells are preferentially connected when they share similar preferred
orientations (Gilbert and Wiesel, 1989). In contradiction, recordings in area
V1 of the macaque monkey show that inactivation of V2 has no effect on
the center/surround interactions of V1 neurons; the only effect observed
when V2 feedback is inactivated is a decrease in the response to a single
oriented bar in about 10% of V1 neurons (Hupé et al., 2001). Indeed, recent
modeling studies confirm that tuning to oriented stimuli is highly prevalent
in V1 only if it operates in a regime of strong local recurrence (Shushruth
et al., 2012).
Studies in monkeys have investigated the possible influences of horizontal
connections and of feedback from higher cortical areas on contour
integration. The results suggest that V1 intrinsic horizontal connections
provide a more likely substrate for contour integration than feedback
projections from V2 to V1. However, the exact mechanism is still unclear.
In recent years much has been learned about V1; however, the functional
role of its lateral connectivity (horizontal or feedback) remains unknown,
especially in terms of stimulus encoding and its relationship with the dense
local connectivity observed in this area.
1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain
The extrastriate cortex comprises all the visual areas beyond the striate
cortex, or V1. It includes areas V2, V3, V4, IT (inferior temporal cortex)
and MT (medial temporal cortex). The earlier visual areas, up to V4,
appear to have clear retinotopic maps with cells tuned to different features
such as orientation, color or spatial frequency. Consecutive stages reveal
receptive fields with increasingly complex selectivity. This is particularly
evident in IT, which is thought to play a major role in object recognition
and categorization (Logothetis and Sheinberg, 1996).
Early lesion studies showed that a complete removal of both temporal lobes
in monkeys resulted in a collection of strange symptoms, most remarkably
the inability to recognize objects visually. More specifically, lesions in the
inferior temporal lobe produced severe and permanent deficits in learning
and remembering to recognize stimuli, resulting in visual agnosias. For
example, after bilateral or right-hemisphere damage to IT cortex, humans
reported difficulties in recognizing faces (prosopagnosia), colors
(achromatopsia) and other, more specific, objects.
Area IT receives visual information from V1 through a serial pathway
known as the ventral visual pathway (V1-V2-V4-IT). IT projects to various
brain areas outside the visual cortex, including the prefrontal cortex, the
perirhinal cortex (areas 35 and 36), the amygdala, and the striatum of the
basal ganglia (Tanaka, 1996).
Neurons in IT have properties that help to explain the crucial role this area
plays in pattern recognition. IT cells respond only to visual stimuli. Their
receptive fields always include the center of gaze and tend to be larger than
in V1, leading to stimulus generalization within the receptive field. IT
neurons usually respond more to complex than to simple shapes, but also to
color. A small percentage of IT neurons are selective for facial images.
Recent fMRI studies in macaque monkeys found a specific area in the
temporal lobe that is activated much more by faces than by non-face
objects (Tsao et al., 2003); single-unit recordings subsequently confirmed
this (Tsao et al., 2006). Middle face patch neurons detect and differentiate
faces using a strategy that is both part-based (constellations of face parts)
and holistic (presence of a whole, upright face) (Freiwald et al., 2009).
These findings point towards the concept of grandmother cells or gnostic
units, i.e., hypothetical neurons that respond only to a highly complex,
specific, and meaningful stimulus, such as the image of one's grandmother
(Gross, 2002). However, the idea of localist representations in the brain has
been controversial. On one side of the spectrum, those who advocate
localist representations claim that such theories of perception and cognition
are more consistent with neuroscience (reviewed by Bowers (2009)). On the
other side, parallel distributed processing theories of cognition claim that
knowledge is coded in a distributed manner in mind and brain (reviewed by
O'Reilly (1998)). First, it would be far too risky for the nervous system to
rely too heavily on selectivity: well placed damage to a small number of
cells could make one never recognize one's grandmother again. In addition,
it is now well established that higher information-processing areas return
information to the lower ones, so that information travels in both directions
between the modules, and not just upward.
In area IT, neurons are usually invariant to changes in contrast, stimulus
size, color and position within the retina (Tanaka, 1996).
1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information
In classical models that mimic natural vision systems, invariant
representations emerge in the form of activity patterns at the highest level
of the network by virtue of spatial averaging across the feature detectors of
the preceding layers. These models are based on the so-called Neocognitron
(Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010), a
hierarchical multilayer network of detectors with varying feature and
spatial tuning. In this approach, recognition is based on a large dictionary
of features stored in memory; filters selective to specific features or
combinations of features are distributed over the different network layers.
The underlying coding mechanism used in hierarchical models is rate
coding, i.e., the mean firing rate of a neuron is directly related to its
selectivity to the applied stimulus. In this way, hierarchical models are
based on the assumption that a cortical cell is tuned to a local and specific
feature using rate coding. A neuron in a rate-coding network has a binary
response to a certain stimulus, encoding one bit of information. Thus, in
this approach, invariances to, for instance, position, scale and orientation
are achieved at the high cost of increasing the number of neurons and
connections between layers to cover all possible kinds of input the network
can receive. Taking into account the dimensionality of human visual space,
the total number of visual features in the world would require far more cells
than are available in the cortex.
Even under the assumption that neurons do use rate coding, more
information could be provided by the network using a population of
neurons. In particular, a group of cells, each responsive to a different
feature, can cooperatively code for a complete subspace of sensory inputs.
In the motor cortex, for instance, the direction of movement was found to
be uniquely predicted by the activity of a population of motor cortical
neurons (Georgopoulos et al., 1986). These models have been criticized
because the recorded neural substrate probably reflects only a fraction of
the patterns present in the brain, and also because this scheme requires a
minimum degree of heterogeneity in the neuronal substrate, which might
not be anatomically true.
An alternative for such a coding system is to include time as an extra
coding dimension. It has been shown that the precise spike timing of single
neurons can successfully be used to encode information (reviewed by
Tiesinga et al. (2008)); this encoding scheme has been called temporal
coding. The difference between a rate and a temporal code can become
unclear if spiking rates are computed over very small time bins, but the
conceptual idea is clear: for a single neuron, the potential information
content of precise and reliable spike times is many times larger than that
contained in the firing rate, which is averaged across a typical interval of a
hundred milliseconds (Reinagel and Reid, 2000).
Recently, a model of the primary visual cortex used the concept of a
temporal code within a population of neurons to show that the temporal
dynamics of a recurrently coupled neural network can encode the spatial
features of a 2D stimulus (Wyss et al., 2003b,a). This introduced a new
type of coding in which spatial information is translated into a temporal
population code (TPC). The model gives a functional interpretation to the
dense excitatory local connectivity found in V1 and suggests an encoding
paradigm invariant to a number of visual transformations.
In this approach the input stimulus is initially processed by an LGN-like
circuit and topographically projected onto a V1 network of neurons
organized in a bi-dimensional Cartesian space with dense local and
symmetrical connectivity. The output of the network is a compressed
representation of the stimulus captured in the temporal evolution of the
population's spike activity (Fig. 1.3).
Figure 1.3: Overview of the Temporal Population Code architecture. In the
model proposed by Wyss et al. (2003b), the input stimulus passes through
an edge-detection stage that approximates the receptive-field characteristics
of LGN cells. In the next stage, the LGN output is continuously projected
to a network of laterally connected integrate-and-fire neurons with
properties found in the primary visual cortex. Due to the recurrent
connections, the visual stimulus becomes encoded in the network's activity
trace.
An intrinsic characteristic of TPC is its independence of the precise position
of the stimulus within the network: the dynamics of a neuronal population
do not depend on which neurons are involved and are invariant to their
topological arrangement. Therefore, the space-to-spatial-temporal
transformation of TPC provides a high-capacity encoding that is invariant
to position and rotation. The TPC coding scheme captures the relationship
among the core geometrical components of the visual input. Up to a certain
point, distortions in form are smoothed by the lateral interactions, which
makes the encoding robust to small geometrical deformations such as those
found in handwritten characters (Wyss et al., 2003a).
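To make the encoding scheme concrete, the following toy sketch captures its essential ingredients: leaky integrate-and-fire units on a 2D lattice driven topographically by the input, local excitation arriving with distance-proportional delay, and a readout that keeps only the spatially averaged activity per time step. All parameters are illustrative assumptions, not those of the Wyss et al. model:

```python
import numpy as np
from scipy.signal import convolve2d

def encode_tpc(image, steps=128, v_thresh=1.0, decay=0.9, radius=3, w_lat=0.05):
    """Toy TPC encoder: leaky integrate-and-fire units on a 2D lattice,
    with excitation spreading to neighbours at one time step per unit of
    distance. The code is the spatially averaged activity over time,
    independent of which particular units fired."""
    h, w = image.shape
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    dist = np.maximum(np.abs(yy), np.abs(xx))
    rings = [np.where(dist == r, w_lat, 0.0) for r in range(1, radius + 1)]
    v = np.zeros((h, w))                      # membrane potentials
    queue = np.zeros((radius + 1, h, w))      # lateral input, binned by delay
    tpc = np.zeros(steps)
    for t in range(steps):
        v = decay * v + image + queue[0]      # feed-forward + arriving lateral drive
        spikes = (v >= v_thresh).astype(float)
        v[spikes > 0] = 0.0                   # reset spiking units
        for r, ring in enumerate(rings, start=1):
            queue[r] += convolve2d(spikes, ring, mode="same")
        queue = np.roll(queue, -1, axis=0)    # advance time by one step
        queue[-1] = 0.0
        tpc[t] = spikes.mean()                # one sample of the population code
    return tpc

# Example: encode a toy 32x32 input (e.g. a scaled LGN edge map).
rng = np.random.default_rng(0)
trace = encode_tpc(0.5 * rng.random((32, 32)))
```

Because only the spatial average is kept, translating or rotating the input pattern leaves the readout largely unchanged, which is the intuition behind the position and rotation invariance described above.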
From a theoretical perspective, the TPC architecture is fully generic with
respect to the input stimulus, because the network comprises both the
coding substrate and the mechanisms needed to encapsulate local rules of
computation. In this sense, there is no need to re-wire the network for
different inputs. This wiring independence makes TPC considerably more
general than purely hierarchical models: the network can receive arbitrary
types of input without incorporating filters selective to the incoming
features.
From a biological perspective, a number of physiological studies support
the encoding scheme of TPC. Work with salamanders reported that,
already in the retina, certain ganglion cells encode the spatial structure of a
briefly presented image in the relative timing of their first spikes (Gollisch
and Meister, 2008). In an experiment where eight stimuli were used (perfect
encoding would represent 3 bits), the spike latency of a ganglion cell
transmitted up to 2 bits of information on a single trial. Indeed, by simply
plotting the recorded differential spike latencies as a gray-scale code,
Gollisch and Meister (2008) could obtain a rather faithful neural
representation of a raw natural image presented to the retina of the animal.
A recent study investigated how populations of neurons in area V1
represent time-varying sensory input through a spatio-temporal code
(Benucci et al., 2009). The results showed that the population activity is
attracted towards the orientations of consecutive stimuli. Therefore, using a
simple linear decoder based on a weighted sum of past stimulus
orientations, higher cognitive areas could predict the population responses
to changes in orientation with high accuracy.
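In the same spirit (though not Benucci et al.'s actual decoder), a linear prediction from the recent stimulus history can be sketched as a weighted circular average; the exponential kernel and frame counts are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
frames = 500
stim = rng.uniform(0, np.pi, frames)     # stimulus orientation per frame (rad)
weights = np.exp(-np.arange(8) / 2.0)    # assumed kernel: recent frames weigh more
weights /= weights.sum()
# Orientation is circular with period pi, so combine frames in the
# double-angle complex representation instead of averaging raw angles.
z = np.exp(2j * stim)
pred = np.angle(np.convolve(z, weights)[:frames]) / 2 % np.pi
print(pred[:5])  # predicted population orientation, frame by frame
```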
A study with monkeys found that in the prefrontal cortex the information about stimulus categories is encoded by a temporal population code (Barak et al., 2010). Monkeys were trained to distinguish between vibro-tactile stimuli applied to their fingertips in different trials separated by a time delay. Surprisingly, the population state consistently
reflected the vibrational frequency during the stimulus. Moreover, they observed substantial variations of the population state between the stimulus and the end of the delay. These findings challenge the standard view that information in working memory is exclusively encoded by the spiking activity of dedicated neuronal populations.
Population codes have been associated with the oscillatory discrimination of sensory input in a number of species. In the auditory cortex of birds, the temporal responses of neuron populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008). Similar codes appear in the early stages of sensory processing of electric fish (Marsat and Maler, 2010). These animals use discharges of electrical signals to communicate courtship (big chirps, around 300 Hz up to 900 Hz) or aggressive encounters (small chirps, around 100 Hz), which are encoded by different populations of neurons. Big chirps are accurately described by a population of pyramidal neurons using a linear representation of their temporal features, whereas small chirps are encoded by the synchronous bursting of populations of pyramidal neurons. In the insect olfactory system, the neurons of the antennal lobe display stimulus-induced temporal modulations of their firing rate in a temporal code (Carlsson et al., 2005; Knusel et al., 2007).
1.4
How can the key components encapsulated
in a temporal code be decoded by different
cortical areas involved in the visual process?
In recent years a growing number of results have supported the idea
that the brain uses temporal signals through oscillations to link ongoing
processes over different brain areas (Buzsáki, 2006).
The hippocampus and prefrontal cortex (PFC) are structures that make use of these channels of information to communicate with other brain circuits. These areas are hubs of communication that orchestrate the signals originating in many cortical and subcortical areas, promoting cognitive functions such as working memory, memory acquisition and consolidation, and
decision making (Benchenane et al., 2011). In a recent study, Battaglia
and McNaughton (2011) show that during memory maintenance PFC oscillations are in phase with hippocampal oscillations. However, when there are no working memory demands, coherence breaks down and no consistent phase relationship is observed.
Both hippocampus and PFC receive converging input from higher sensory/associative areas. Gamma oscillations have been mostly studied and
understood in the early and intermediate visual cortex. In primates, the
fast oscillations play a key role in the attention process underlying stimulus selection. Attention causes both firing-rate increases and synchrony changes through a top-down mechanism mediated by the PFC (reviewed by Noudoost et al. (2010) and Tiesinga and Sejnowski (2009a)). For example, the reception of information from higher visual areas such as V4 can be tuned to a V1 column whose receptive field contains a relevant stimulus, resulting in a mechanism that steers visual attention (Womelsdorf et al., 2007).
The TPC signal intrinsically carries the visual information through a set of encapsulated oscillations over different frequency ranges. The temporal code is in tune with the notion that sensory information travels through the brain at different wavelengths. In this context, if the TPC plays a role in encoding global stimulus features in a compact temporal representation, it is relevant to understand what its key coding features are and how the signal can be decoded into different components by higher visual areas through an efficient readout system.
In the past years, different solutions for reading out the TPC have been proposed. Recently, a TPC readout based on the so-called Liquid State Machine, or LSM (Doetsch, 2000), was developed by Knüsel et al. (2004). LSM networks are examples of reservoir computing networks where the dense local circuits of the cerebral cortex are implemented as a large set of nearly randomly defined filters. The results of Knüsel et al. (2004) showed this approach
to be inefficient compared to traditional linear decoders based purely on correlations. In addition, extra layers of hundreds of integrate-and-fire neurons are required for the readout. Furthermore, the LSM strategy did not account for the different frequencies of oscillation present in the TPC. Consequently, the readout system has remained an open issue.
In this dissertation, we followed a different strategy to solve this issue. We proposed a biologically plausible readout circuit based on wavelet transforms (Luvizotto et al., 2012c). The underlying hypothesis is that a population of readout neurons tuned to different spectral bands can extract the different content encoded in the temporal signal. Our hypothesis is in line with recent findings suggesting that frequency bands of coherent oscillations constitute spectral fingerprints of the canonical computational mechanisms involved in cognitive processes (Siegel et al., 2012).
In the model we use a synchronous inhibition system oscillating in the
gamma frequency range to simulate the effect of a Haar wavelet transform.
Over windows of low inhibitory conductance the TPC signal is integrated
and differentiated producing a band-pass filter response. The frequency
response of the output is controlled by varying the synchronous inhibition
time. In the proposed system, the readout can be tuned to specific frequency ranges, for instance the gamma band. In this scenario, interference added by oscillations in other frequency bands can be filtered out, enhancing the quality of the information that is captured. In this sense, the wavelet readout could account for the demultiplexing of different neuronal responses, for example the motion and orientation responses that are encoded in V1 (Onat et al., 2011).
In addition, the mammalian visual system has a remarkable capacity for processing a large amount of visual information within tens of milliseconds (Thorpe et al., 1996; Fabre-Thorpe et al., 1998). In this sense, the readout system must be fast, compact and therefore non-redundant. These are well-known advantages of wavelet transforms (Mallat, 1998). The proposed
circuit reads out the signal using only a few coefficients. In particular, the simplicity and orthogonality of the Haar wavelet used in the readout circuit lead to a very economical neuronal substrate: the implementation requires only four neurons.
In chapter 2, we present the details of the proposed neuronal wavelet readout circuit, together with the methods and results addressing its strengths and limitations.
1.5
Can TPC be used in the recognition of
multi-modal sensory input such as human
faces and gestures to provide form
perception primitives?
In this thesis, the original TPC model was revisited. We fully investigate, under realistic and natural conditions, the capabilities that TPC showed previously in simulations (Wyss et al., 2003b,a), virtual environments (Wyss and Verschure, 2004b) and a controlled robot arena (Wyss et al., 2006).
If the TPC encoding scheme provides a good representation of the visual input, it is relevant to investigate the boundaries of its robustness and reliability in complex real-world environments. A further relevant issue for real-time applications is the computational cost of processing the massive number of recurrent connections in the network. Can robustness in the encoding be achieved at a reasonable processing speed? We know from a theoretical point of view that TPC is built around a generic encoding mechanism. Can TPC be used as a unified and canonical model of cortical computation that can be extended to other stimulus modalities to generate perception primitives?
Using different and more realistic types of modeled neurons and a new, optimized architecture configuration, the original model was completely re-designed. To address these questions we first exposed the new TPC network to one of the most complex stimuli we are commonly
exposed to: human faces (Zhao et al., 2003; Martelli et al., 2005; Sirovich
and Meytlis, 2009).
In chapter 3, we present the details of the face recognition system1 based
on the TPC model that was implemented in the humanoid robot iCub
(Luvizotto et al., 2011). To allow for real-time computation we resorted to a standard method of signal processing to develop a novel way of calculating the lateral interactions in the network. The lateral connections were interpreted as discrete finite impulse response (FIR) filters and the spreading of activity was calculated using 2D convolutions. In addition, all convolutions were computed as multiplications in the frequency domain. This simple signal processing technique yields much faster processing times and makes much larger networks possible in comparison to pre-assigned maps of connectivity stored in memory.
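To illustrate the idea, the following is a minimal Python sketch of such an FFT-based spreading step (the thesis implementation is the C++ library mentioned below; NumPy/SciPy and all names here are illustrative assumptions). Each transmission delay corresponds to a ring of the connectivity disc, and each ring is convolved with the spike frame emitted that many milliseconds in the past:

```python
import numpy as np
from scipy.signal import fftconvolve

def ring_kernel(d, width=0.5):
    """Ring of lateral connections at Euclidean distance d from a neuron.
    With a conduction delay of 1 ms per unit of distance, spikes arriving
    now through this ring were emitted d ms ago."""
    r = int(np.ceil(d + width))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    dist = np.sqrt(x**2 + y**2)
    return ((dist > d - width) & (dist <= d + width)).astype(float)

def lateral_input(spike_history, w=0.4, radius=7):
    """Lateral conductance at time t: each delay ring is convolved (via FFT)
    with the spike frame from d ms in the past (spike_history[d])."""
    g_c = np.zeros_like(spike_history[1])
    for d in range(1, radius + 1):
        g_c += w * fftconvolve(spike_history[d], ring_kernel(d), mode="same")
    return g_c

# Toy usage: an 80x80 network with 5% of the units spiking at each step.
history = [(np.random.rand(80, 80) < 0.05).astype(float) for _ in range(8)]
g_c = lateral_input(history)
```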
The model was benchmarked using images from the iCub’s camera and
from a standard database: the Yale Face Database B (Georghiades et al.,
2001; Lee et al., 2005). This way we could compare the performance of TPC with standard methods of face recognition available in the literature. This work was awarded as one of the top-five papers presented at the 2011 edition of the IEEE International Conference on Robotics and Biomimetics (ROBIO). Another contribution of this work is a complete
TPC library developed and made publicly available2 that can be freely
used by the robotics community.
With the face recognition results, we could address questions regarding the robustness of TPC in real-time and real-world scenarios. To finally address the generality of TPC, we designed two more test scenarios involving
different tasks and a different robot platform.
In chapter 4, we extended the experiments performed with faces on the iCub to gesture recognition. The experiments were motivated by the game Rock-Paper-Scissors. In this human-robot interaction scenario, the robot has to recognize the gesture thrown by the human to reveal who the winner is. The results show that, with the same network, the robot can reliably recognize the gestures used in the game. Furthermore, the wavelet circuit was fully integrated into the C++ library, showing a remarkable gain in recognition rate versus compression.

1 iCub never forgets a face: a video of the face recognition system on iCub. http://www.robotcompanions.eu/blog/2012/01/2820/
2 efaa.sourceforge.net
To further stress the generality of TPC, in chapter 5, the proposed model
is implemented in a mobile robot and used for navigation. Based on the
camera input, the model represents distance relationships of the robot
over different positions in a real-world arena. The TPC representation
feeds a hippocampus model that accounts for the formation of place fields
(Luvizotto et al., 2012b).
These results finally demonstrated that the TPC encoding is fully generic: the same network implementation can feed the working memory with different stimulus classes without any need for rewiring, while remaining fast and invariant to most transformations found in real-world environments.
Chapter 2

A wavelet based neural model to optimize and read out a temporal population code
It has been proposed that the dense excitatory local connectivity of the
neo-cortex plays a specific role in the transformation of spatial stimulus
information into a temporal representation or a temporal population code
(TPC). TPC provides for a rapid, robust and high-capacity encoding of
salient stimulus features with respect to position, rotation and distortion.
The TPC hypothesis gives a functional interpretation to a core feature of
the cortical anatomy: its dense local and sparse long-range connectivity.
Thus far, the question of how the TPC encoding can be decoded in downstream areas has not been addressed. Here, we present a neural circuit that
decodes the spectral properties of the TPC using a biologically plausible
implementation of a Haar transform. We perform a systematic investigation of our model in a recognition task using a standardized stimulus
set. We consider alternative implementations using either regular spiking
or bursting neurons and a range of spectral bands. Our results show that
our wavelet readout circuit provides for the robust decoding of the TPC
and further compresses the code without losing speed or quality of decoding.
We show that in the TPC signal the relevant stimulus information is
present in the frequencies around 100 Hz. Our results show that the TPC
is constructed around a small number of coding components that can be
well decoded by wavelet coefficients in a neuronal implementation. The
solution to the TPC decoding problem proposed here suggests that cortical processing streams might well consist of sequential operations where
spatio-temporal transformations at lower levels form a compact stimulus
encoding using TPC that are subsequently decoded back to a spatial representation using wavelet transforms. In addition, the results presented
here show that different properties of the stimulus might be transmitted
to further processing stages using different frequency components that are
captured by appropriately tuned wavelet based decoders.
2.1
Introduction
The encoding of sensory stimuli requires robust compression of salient features (Hung et al., 2005). This compression must support representations
of the stimulus that are invariant to a range of transformations caused,
in case of vision, by varying viewing angles, different scene configurations and deformations. Invariances and compression of information can
be achieved by moving across different representation domains i.e. from
spatial to temporal representations.
In earlier work we proposed an encoding paradigm that makes use of this
strategy called the Temporal Population Code (TPC) (Wyss et al., 2003a).
In this approach the input stimulus is topographically projected onto a
network of neurons organized in a bi-dimensional Cartesian space with
dense local connectivity. The output of the network is a compressed representation of the stimulus captured in the temporal evolution of the population spike activity. The space to time transformation of TPC provides
for a high-capacity encoding, invariant to position and image deformations
that has been successfully applied to real world tasks such as hand-written
character recognition (Wyss et al., 2003c), spatial navigation (Wyss and
Verschure, 2004b) and face recognition in a humanoid robot (Luvizotto
et al., 2011). TPC shows that the dense excitatory local connectivity
found in the primary sensory areas of the mammalian neo-cortex can play
a specific role in the rapid and robust transformation and compression of
spatial stimulus information that can be transmitted over a small number
of projections to subsequent areas. This wiring scheme is consistent with
the anatomy of the neo-cortex where about 95% of all connections found
in a cortical volume also originate in it (Sporns and Zwi, 2004).
In classical models of visual perception invariant representations emerge in the form of activity patterns at the highest level of a hierarchical multi-layer network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and Poggio, 1999; Serre et al., 2007). In this approach, invariances are achieved at the cost of increasing the number of connections between the different layers of the hierarchy. However, these models seem to be based on a fundamental assumption that is not consistent with cortical anatomy. In comparison to these hierarchical models of object recognition, the TPC architecture has the significant advantage of being both compact and wire-independent, thus providing for the multiplexing of information.
Recently, direct support for the TPC as a substrate for stimulus encoding has been found in a number of physiological studies. For instance, the physiology of the mammalian visual system shows dynamics consistent with the TPC in orientation discrimination (Samonds and Bonds, 2004; Benucci et al., 2009; MacEvoy et al., 2009) and spatial selectivity regulation (Benucci et al., 2007), in particular showing stimulus-specific modulation of the phase relationships among active neurons. In the bird auditory system, the temporal responses of neuron populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008). Similarly, in the inferior temporal and prefrontal cortices information about stimulus categories is encoded by the temporal response of populations of neurons (Meyers et al., 2008; Barak et al., 2010). Signatures of the TPC have also been found in the insect olfactory system where the glomeruli and the projection neurons of the antennal lobe display stimulus-induced
temporal modulations of their firing rate at a scale of hundreds of milliseconds (Carlsson et al., 2005; Knusel et al., 2007).
If the TPC plays a role in stimulus encoding it is relevant to understand
what its key coding features are and how these features can be subsequently decoded in areas downstream from the encoder. The readout by
the decoder must be fast and compact, extracting the key characteristics of
the original input stimulus in a compressed way. These key features must
be captured in a non-redundant fashion so that prototypes of a class can
emerge and be efficiently stored in memory and/or serve on-going action.
In a hierarchical model of sensory processing based on the notion of TPC,
the encoded temporal information provided by primary areas is mapped
back onto the spatial domain allowing higher order structures to further
process the stimulus. Hence, a TPC decoder is required to generate a
spatially structured response from the temporal population code of the
encoder. Taking into account these requirements our question is how a
cortical circuit can retrieve the features encapsulated in the TPC.
In the past years, different strategies for decoding temporal information have been suggested. A recent proposal is the so-called Liquid State Machine, or LSM (Doetsch, 2000), which is an example of a larger class of models also called reservoir computing (Lukoševičius and Jaeger, 2009). In this approach the dense local circuits of the cerebral cortex are seen as implementing a large set of practically randomly defined filters. When applied to reading out the TPC, we have reported a lower performance as compared to using the Euclidean distance, as a result of the LSM's noise sensitivity (Knüsel et al., 2004). In addition to being less effective than a linear decoder, the LSM is computationally expensive, requiring an additional layer of hundreds of integrate-and-fire neurons, while its performance strongly depends on the specific parameter settings, which compromises generality.
Given that TPC is consistent with current physiology we want to know
whether an alternative approach can be defined that is more tuned to the
specific properties of the TPC, i.e. its temporal structure.
A readout mechanism for temporal codes, such as TPC, could also be based
on an analysis of the temporal signal over different frequency bands and
resolutions. A population of readout neurons tuned to different spectral
bands could possibly implement such a readout stage. In this
scheme, the temporal information of TPC is mapped back into a spatial
representation by cells responsive to different frequency bands and thus
the spectral properties of their inputs. A suitable framework for modeling
such a readout stage is the wavelet decomposition: a spectrum analysis
technique that divides the frequency spectrum into a desired number of bands using variable-sized regions (Mallat, 1998). Higher processing stages
in the neo-cortex could make use of such a scheme in order to capture
information compressed and multiplexed in different frequency bands by
preceding areas.
The wavelet transform is a biologically plausible candidate and has already been extensively used for modeling cortical circuits in different areas (Stevens, 2004; Chi et al., 2005). The classic description of image
processing in V1 is based on a two-dimensional Gabor wavelet transform
(Daugman, 1980). Recently, two alternative wavelet-based models approximating the receptive field properties of V1 neurons in the discrete domain
have been proposed, which show additional desirable features such as orthogonality (Saul, 2008; Willmore et al., 2008).
A one-dimensional wavelet transform can be interpreted as a strategy for
reading out the different spectral components of the TPC that is equivalent to the wavelet-based models of V1 receptive fields (Jones and Palmer,
1987; Ringach, 2002). This provides a general encoding-decoding model that can be extended to the whole of the neo-cortex given its relatively uniform anatomical organization. Furthermore, from both representation and implementation perspectives, orthogonal wavelets are a compact way of decomposing a signal where the frequency spectrum is divided in a dyadic manner: at each resolution level of the filtering process a new frequency band emerges, represented by half the number of wavelet coefficients present at the previous resolution level. This meets one of the fundamental requirements of an efficient readout system: compactness.
Here, we combine the encoding mechanism of the TPC with a decoder that is based on a one-dimensional, orthogonal and discrete wavelet transform implemented by a biologically plausible circuit. We show that the information provided by the TPC generated at an earlier neuronal processing level can be decoded in a compressed way by this wavelet read-out circuit. Furthermore, we show that these wavelet transforms can be performed by a plausible neuronal mechanism that implements the so-called Haar wavelet (Haar, 1911; Papageorgiou et al., 1998; Viola and Jones, 2001). The simplicity and orthogonality of the Haar wavelet make this readout process fast and compact, in an implementation that requires only four neurons.
To investigate the validity of our hypothesis we first define a baseline for
benchmarking the network’s performance in a classification task. The
benchmarking is done using a stimulus set of images based on artificially
generated geometric shapes. To test the readout performance we evaluate
how the information extracted across different sets of wavelet coefficients,
covering orthogonal regions of the frequency spectrum, influences classification performance. The simulations are performed using two types of
implementations: regular spiking and bursting neurons. We consider these
two types of models in order to address the effects of spiking dynamics on
the encoding and decoding performance of the model. These two types
of spiking behaviors have also been observed in V1 pyramidal neurons
(Hanganu et al., 2006; Iurilli et al., 2011).
We also investigate the speed of encoding-decoding of the proposed wavelet
based circuit in comparison to the method used in previous studies of
TPC that are based purely on linear classifiers. In the last experiments,
we explore the generality that the wavelet coefficients hold in forming
prototypical representations of an object class that can be stored in a
working memory in fast object recognition tasks. In particular, we are
concerned with the question of how high-level information generated by
sensory processing streams can be flexibly stored and retrieved in long-term and working memory systems (Verschure et al., 2003; Duff et al.,
2011).
One option for the memory storage problem would be a labelled line code
where specific axon/synapse complexes are dedicated to specific stimuli
and their components (Chandrashekar et al., 2006; Nieder and Merten,
2007). This approach, however, faces capacity limitations both in the
amount of information stored and the physical location where it can be
processed. Alternatively a purely temporal code, such as TPC, would be
in this respect independent of the spatial organization of the physical substrate and allow the multiplexing of high-level information. We show that
this latter scenario is feasible and can be realized with simple biologically
plausible neuronal components.
Our results suggest that sensory processing hierarchies might well comprise
sequences of spatio-temporal transformations that encode combinations of
local stimulus features into perceptual classes using sequences of TPCs
encoding and their wavelet decoding back to a spatial domain.
2.2
Material and methods
The model is divided into two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 2.1) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray-scale image that covers the visual field. The approximation of the receptive field characteristics is done by convolving the input image with a difference-of-Gaussians (DoG) operator followed by a positive half-wave rectification. The positively rectified DoG operator resembles the properties of on-center LGN center-surround cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and
performs an edge enhancement of the input image. In the simulations we use a kernel ratio of 4:1, with a size of 10×10 pixels and variance σ = 1.5 (for the smaller Gaussian).
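As an illustration, a minimal Python sketch of this LGN stage is given below (assuming SciPy; note that gaussian_filter uses an effectively untruncated kernel rather than the 10×10 kernel above, so this is only an approximation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lgn_response(image, sigma=1.5, ratio=4.0):
    """Approximate on-center LGN receptive fields: a difference of
    Gaussians followed by positive half-wave rectification, which
    enhances the edges of the input image."""
    center = gaussian_filter(image, sigma)            # smaller Gaussian
    surround = gaussian_filter(image, ratio * sigma)  # 4:1 kernel ratio
    return np.maximum(center - surround, 0.0)         # half-wave rectification

# Example: edge enhancement of a random gray-scale "image".
lgn_out = lgn_response(np.random.rand(128, 128))
```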
Figure 2.1: The TPC encoding model. In a first step, the input image is projected to the LGN stage where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC.
The LGN signal is projected onto the V1 spiking model, whose coding concept is illustrated in Fig. 2.2. The network is an array of N×N model neurons, each connected to a circular neighborhood with synapses of equal strength and instantaneous excitatory conductance. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons. The stimulus is continuously presented to the network and the spatially integrated spreading activity of the V1 units, as a sum of their action potentials, results in the so-called TPC signal.
In the network, each neuron is approximated using the spiking model proposed by Izhikevich (Izhikevich, 2003). These model neurons are biologically plausible and as computationally efficient as integrate-and-fire models
(Izhikevich, 2004). Relying only on four parameters, our network can reproduce both regular (RS) and bursting (BS) spiking behavior using a
system of ordinary differential equations of the form:
Figure 2.2: The TPC encoding paradigm. The stimulus, here represented by a
star, is projected topographically onto a map of interconnected cortical neurons.
When a neuron spikes, its action potential is distributed over a neighborhood of
a given radius. The lateral transmission delay of these connections is 1 ms/unit.
Because of these lateral intra-cortical interactions, the stimulus becomes encoded
in the network’s activity trace. The TPC representation is defined by the spatial
average of the population activity over a certain time window. The invariances
that the TPC encoding renders are defined by the local excitatory connections.
$$v' = 0.04v^2 + 5v + 140 - u + I \tag{2.1}$$

$$u' = a(bv - u) \tag{2.2}$$

with the auxiliary after-spike resetting:

$$\text{if } v \geq 30 \text{ mV, then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases} \tag{2.3}$$
Here, v and u are dimensionless variables and a, b, c and d are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, and ' denotes d/dt, where t is time. The parameter a describes the time scale of the recovery variable u. The parameter b describes the sensitivity of the recovery variable u to the sub-threshold fluctuations of the membrane potential v. The parameter c accounts for the after-spike reset value of v caused by the fast high-threshold K+ conductances, and d for the after-spike reset of the recovery variable u caused by the slow high-threshold Na+ and K+ conductances. The mathematical analysis of the model can be found in (Izhikevich, 2006).
The excitatory input I in eq. 2.1 consists of two components: first, a constant driving excitatory input g_i and, second, the synaptic conductances given by the lateral interaction of the units, g_c(t). So

$$I(t) = g_i + g_c(t) \tag{2.4}$$
For the simulations, we used the parameters suggested in (Izhikevich, 2004)
to reproduce RS and BS spiking behavior (Fig. 2.3). All the parameters
used in the simulations are summarized in table 2.1.
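For concreteness, a minimal Python sketch of the Euler integration of a single uncoupled unit with the parameters of table 2.1 is given below (the lateral conductance g_c(t) of eq. 2.4 is omitted here, so the sketch reproduces only the intrinsic RS/BS dynamics of Fig. 2.3):

```python
import numpy as np

A, B = 0.02, 0.2                                     # a, b from table 2.1
PARAMS = {"RS": (-65.0, 8.0), "BS": (-55.0, 4.0)}    # (c, d) per cell type

def simulate(kind="RS", g_i=20.0, T=128, dt=1.0):
    """Euler integration (1 ms resolution) of eqs. 2.1-2.3 for one unit;
    I(t) = g_i only, since the lateral term g_c(t) is left out."""
    c, d = PARAMS[kind]
    v, u = -70.0, -16.0                  # resting values from table 2.1
    spikes = []
    for t in range(int(T / dt)):
        v += dt * (0.04 * v**2 + 5.0 * v + 140.0 - u + g_i)  # eq. 2.1
        u += dt * A * (B * v - u)                            # eq. 2.2
        if v >= 30.0:                    # after-spike resetting, eq. 2.3
            v, u = c, u + d
            spikes.append(t)
    return spikes

print(simulate("RS"), simulate("BS"))
```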
The network architecture is composed of 24 populations of orientation-selective neurons, where a bank of Gabor filters is used to reproduce the characteristics of V1 receptive fields (Fig. 2.1). The filters are divided into layers of eight orientations Θ ∈ {0, π/8, 2π/8, ..., π} and three scales denoted by δ. The distance between the central frequencies of consecutive scales is 1/2 octave, with a maximum frequency F_max = 1/10 cycles/pixel. The convolution with G_{δ,Θ} is computed at each time step and the output is truncated according to a threshold T_i ∈ [0, 1], where the values above T_i are set to a constant driving excitatory input g_i. Each unit can be characterized by its orientation selectivity angle Θ, its scale δ and a bi-dimensional vector x ∈ R² specifying the location of its receptive field center within the input plane. So a column is denoted by u(x, Θ, δ).
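A minimal sketch of such a filter bank is given below (illustrative Python; the kernel size and Gaussian envelope width are assumptions, not values taken from the simulations):

```python
import numpy as np

def gabor_kernel(theta, f, size=21, sigma=4.0):
    """A 2D Gabor filter with orientation theta and spatial frequency f
    (cycles/pixel), a simple stand-in for a V1 receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * f * xr)

# 8 orientations x 3 scales = 24 filters; scales spaced by 1/2 octave
# downwards from F_max = 1/10 cycles/pixel.
thetas = [k * np.pi / 8 for k in range(8)]
freqs = [0.1 * 2.0 ** (-0.5 * s) for s in range(3)]
bank = [gabor_kernel(th, f) for th in thetas for f in freqs]
```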
The lateral connectivity between V1 units is exclusively excitatory with strength w. A unit u_a(x, Θ, δ) connects with u_b if all of the following conditions are met:

1. they are in the same population: Θ_a = Θ_b and δ_a = δ_b
2. they have different center positions: x_a ≠ x_b
3. u_b lies within a region of a certain radius: ‖x_b − x_a‖ < r

According to recent physiological studies, intrinsic V1 intra-cortical connections cover distances that represent regions of the visual space up to eight times the size of single receptive fields of V1 neurons (Stettler et al., 2002). In our model we set the connectivity radius r to 7 units. The lateral synapses are of equal strength w and the transmission delays τ are proportional to ‖x_b − x_a‖ with 1 ms/cell.
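These three conditions and the distance-dependent delay can be summarized in a small sketch (illustrative Python; units are given as (x, y, Θ, δ) tuples):

```python
import math

def connected(unit_a, unit_b, r=7):
    """Return the transmission delay in ms if unit_a connects to unit_b
    under the three conditions above, otherwise None (1 ms/unit)."""
    xa, ya, ta, da = unit_a
    xb, yb, tb, db = unit_b
    if (ta, da) != (tb, db):      # 1. same population (orientation, scale)
        return None
    if (xa, ya) == (xb, yb):      # 2. different center positions
        return None
    dist = math.hypot(xb - xa, yb - ya)
    if dist >= r:                 # 3. within the connectivity radius
        return None
    return dist                   # delay = distance * 1 ms/unit

delay = connected((10, 10, 0.0, 1), (13, 14, 0.0, 1))  # -> 5.0 ms
```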
The TPC is generated by summing the network activity in a time window
of 128 ms. Finally, the output TPC vectors from different layers of orientation and scales are read out by the proposed wavelet circuit forming the
decoded TPC vector used for the statistical analysis. In discrete time, all the equations are integrated with Euler's method using a temporal resolution of 1 ms.
Neuronal wavelet circuit
The proposed neuronal wavelet circuit is based on the discrete multiresolution decomposition where each resolution reflects a different spectral
range and uses the Haar wavelets as basis (Mallat, 1998). Approximation
Coefficients, or AC, are the high-scale, low-frequency components of the
signal spectrum obtained by convolving the signal with the scale function
φ. The Detail Coefficients, or DC, are the low-scale, high-frequency components giving by the wavelet function ψ. Each component has a time
resolution matched to the wavelet scale that works as a filter.
| Variable | Description | Value |
| N | network dimension | 80x80 neurons |
| a | scale of recovery | 0.02 |
| b | sensitivity of recovery | 0.2 |
| c_RS | after-spike reset value of v for RS neurons | -65 |
| c_BS | after-spike reset value of v for BS neurons | -55 |
| d_RS | after-spike reset value of u for RS neurons | 8 |
| d_BS | after-spike reset value of u for BS neurons | 4 |
| v | membrane potential rest | -70 |
| u | membrane recovery rest | -16 |
| g_i | excitatory input conductance | 20 |
| T_i | minimum V1 input threshold | 0.4 |
| r | lateral connectivity radius | 7 units |
| w | synapse strength | 0.4 |

Table 2.1: Parameters used for the simulations.

The Haar wavelet ψ at time t is defined as:
$$\psi(t) \equiv \begin{cases} 1 & 0 \le t < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} < t \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.5}$$

and its associated scale function φ as:

$$\varphi(t) \equiv \begin{cases} 1 & 0 \le t < 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.6}$$
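In discrete time these two functions reduce to pairwise sums and differences, as the following minimal sketch of the dyadic decomposition shows (illustrative Python, using the orthonormal 1/√2 normalization):

```python
import numpy as np

def haar_step(signal):
    """One Haar analysis step: pairwise averages (scale function, phi)
    and pairwise differences (wavelet function, psi)."""
    s = np.asarray(signal, dtype=float).reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2.0)
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2.0)
    return approx, detail

def haar_decompose(signal, levels=5):
    """Dyadic multiresolution decomposition of a length-2^k signal,
    returning [Ac_levels, Dc_levels, ..., Dc_1]."""
    details = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

tpc = np.random.rand(128)               # a 128 ms population activity trace
coeffs = haar_decompose(tpc, levels=5)  # Ac5, Dc5, Dc4, Dc3, Dc2, Dc1
dc3 = coeffs[3]                         # 16 coefficients, 62-125 Hz at 1 kHz
```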
In a biologically plausible implementation, the wavelet decomposition can
be performed based on the activity of two short term buffer cells B1 and
B2 inhibited by an asymmetric delayed connection from cell A (Fig. 2.4
a). The buffer cells integrate rapid changes over a certain amount of
time analogous to the scale function φ, from eq. 2.6. In our model, the
buffer cells are modeled as discrete low-pass Finite Impulse Response (FIR)
filters. They are equivalent to the scale function φ in the discrete domain.
Buffer cells have been reported recently in other cortical areas such as the
prefrontal cortex (Koene and Hasselmo, 2005; Sidiropoulou et al., 2009).
Figure 2.3: Computational properties of the two types of neurons used in the simulations: regular (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz).
In our model, the buffer cells must receive an inhibitory input from A that defines the integration envelope. The inhibition has to span a time period of t, with a phase shift of π/2 between A and B2. Therefore, B1 and B2 will have their integration profiles shifted in time by t. When the inhibition is synchronized in time with the integration profile of the buffer cells, the period 2t determines the low-frequency cutoff, or the resolution level l, associated with the Haar scale function φ, eq. 2.6 (low-pass filter). If both B1 and B2 are excitatory, the projection to cell W gives rise to the approximation level A_l.
On the other hand, if one buffer cell is inhibitory, as in the example of Fig. 2.4 a), the detail level D_{l+1} is obtained by cell W, as performed by the Haar wavelet function itself (eq. 2.5). In our model, the inhibition is modeled as discrete high-pass FIR filters. The combination of low-pass and high-pass filters in cascade produces band-pass filters. Therefore, the readout can be optimized to specific ranges of the frequency spectrum (Fig. 2.4 b).

Figure 2.4: a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate, in time, the network activity, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit where only the frequency range corresponding to the resolution level Dc3 is kept in the spectrum.
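A functional sketch of one resolution level of this circuit is given below (illustrative Python; the buffer-cell integration is idealized as a plain sum over each inhibition window of length t, and the coefficients are left unnormalized):

```python
import numpy as np

def readout_cell_w(tpc, t, detail=True):
    """Sketch of the A-B1-B2-W circuit. Cell A's synchronous inhibition
    gates integration into windows of length t; buffer cells B1 and B2
    integrate two adjacent windows (shifted in time by t). Cell W sums
    their outputs if both project excitatorily (approximation, phi) or
    subtracts them if one buffer cell is inhibitory (detail, psi)."""
    n = len(tpc) // (2 * t)
    frames = np.asarray(tpc[: n * 2 * t], dtype=float).reshape(n, 2, t)
    b1 = frames[:, 0, :].sum(axis=1)        # integration window 1
    b2 = frames[:, 1, :].sum(axis=1)        # integration window 2
    return b1 - b2 if detail else b1 + b2   # one W output per period 2t

tpc = np.random.rand(128)
dc3 = readout_cell_w(tpc, t=4)  # 16 detail coefficients, ~62-125 Hz at 1 kHz
```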
Stimulus set
The stimulus set is based on abstract geometric forms as used previously (Wyss et al., 2003c). On a circular path with a diameter of 40 pixels, 5 uniformly distributed vertices can connect to each other with equal probability, defining the shape of a stimulus class (Fig. 2.5). The different objects forming a class are generated by jittering the position of the vertices and the default line thickness of 4 pixels. We defined a total of 10 classes for the experiments, with 50 exemplars per class.

For the experiments we subdivide the data set into three subsets of increasing complexity by varying the amount of jitter in the vertices' position and in the thickness of the connected line segments. The values of the jitter are given by uniformly distributed random factors with zero mean and standard deviation equal to 0.03, 0.04 and 0.05 for the vertices' position and 0.018, 0.021 and 0.025 for the thickness of each subset, respectively. We used subset 1, with 50 stimuli per class, as training and classification set. In the case where subset 1 is used for classification, 50% of the stimuli are randomly assigned as the training set and the other part is used for classification. Subsets 2 and 3 are only used as classification sets; in this case subset 1 is entirely used for training. Therefore training stimuli are not used for classification and vice versa.
For estimating the degree of similarity among the images we used the
normalized Euclidean distance of the pixels. The normalization is done
as follows: a given stimulus has distance equal to zero if it is equal to its
class prototype and one if it is the globally most distant exemplar over all
subsets. The image prototype is defined by the five vertices that define
the geometry of a class with no jitter applied.
Figure 2.5: The stimulus classes used in the experiments after the edge enhancement of the LGN stage.
Cluster algorithm
For the classification we used the following algorithm. The network's responses to stimuli from C stimulus classes S_1, S_2, ..., S_C are assigned to C response classes R_1, R_2, ..., R_C of the training set, yielding a C × C hit matrix N(S_α, R_β), whose entries denote the number of times that a stimulus from class S_α elicits a response in class R_β. Initially, the matrix N(S_α, R_β) is set to zero. For each response r ∈ S_α, we calculate the Euclidean distance of r to the responses r' ≠ r elicited by stimuli of class S_γ:

$$\rho(r, S_\gamma) = \langle \| \rho(r, r') \| \rangle_{r' \text{ elicited by } S_\gamma} \tag{2.7}$$

where ⟨·⟩ denotes the average over the temporal Euclidean distances ρ(r, r') between r and r'. The response r is classified into the response class R_β for which ρ(r, S_β) is minimal, and N(S_α, R_β) is incremented by one. The overall classification ratio in percentage is calculated by summing the diagonal of N(S_α, R_β) and dividing by the total number of elements in the classification set R. We chose the same metric that was used in previous studies to establish a direct comparison between the results over different scenarios.
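A minimal sketch of this clustering procedure (illustrative Python; random vectors stand in for the decoded TPC responses):

```python
import numpy as np

def classify(train, train_y, test, test_y, n_classes=10):
    """Assign each test response to the class with minimal mean Euclidean
    distance (eq. 2.7) and accumulate the C x C hit matrix N."""
    hits = np.zeros((n_classes, n_classes), dtype=int)
    for r, alpha in zip(test, test_y):
        rho = [np.mean([np.linalg.norm(r - rp)
                        for rp, g in zip(train, train_y) if g == gamma])
               for gamma in range(n_classes)]
        beta = int(np.argmin(rho))     # response class R_beta
        hits[alpha, beta] += 1
    ratio = 100.0 * np.trace(hits) / hits.sum()
    return hits, ratio

# Toy usage with random stand-in response vectors.
train = [np.random.rand(16) for _ in range(50)]
train_y = [i % 10 for i in range(50)]
test = [np.random.rand(16) for _ in range(20)]
test_y = [i % 10 for i in range(20)]
hits, ratio = classify(train, train_y, test, test_y)
```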
2.3
Results
We start by analyzing the properties of the proposed stimulus set detailed in section 2.2. Then, we run network simulations in order to establish a baseline classification ratio in a stimulus detection task. In the following step we
use the wavelet circuit to read out the TPC signal over different frequency
bands and compare the classification results to the previously established
baseline. In order to address the effect of spiking modality on the decoding mechanism we run separate simulations with two different kinds of
neurons: Regular and Burst Spiking (eqs. 2.1 and 2.2). In the subsequent
experiment, we investigate how the speed of encoding is affected by reading out the TPC using a neuronal implementation of the wavelet circuit.
Subsequently, we show how the dense wavelet representation provided by
our decoding circuit can be used to create class prototypes that can be
used to flexibly define the content of a memory system. Finally, we perform an analysis of the similarity relationships between the TPC encoding
and the representation of the stimuli in the wavelet and in the spatial
domain respectively. The model parameters used for all simulations are
specified in section 2.2 and in table 2.1.
Stimulus set similarity properties
We use an algorithmic approach to parametrically define our stimulus classes. Every class is defined around a prototype (Fig. 2.6a, upper row). We measure how similar the exemplars from the classification sets are to the respective image class prototype (see methods, section 2.2). The median Euclidean distances of stimulus sets 1 to 3 are 0.59, 0.64 and 0.70, respectively. This increasing median translates into an increasing difficulty in the classification of the stimulus sets.
Figure 2.6: The stimulus set. a) Image-based prototypes (no jitter in the vertices applied) and the globally most different exemplars with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histogram of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain.
Baseline and network performance
As a reference for further experiments we first establish a baseline classification ratio. The baseline is defined by applying the clustering algorithm described previously (section 2.2) directly in the spatial domain of the stimulus set. In this scenario, the classification is performed over the pixel intensities of the centered and edge-enhanced images (Fig. 2.7). As is to be expected, the classification performance decreases with an increase in the geometric variability of the three subsets used in the experiments. For subsets one, two and three the classification ratio reaches 91%, 88% and 82%, respectively.

Figure 2.7: Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain.

Wavelet circuit readout

We now want to assess the classification performance and compression capacity of the wavelet circuit we have proposed (section 2.2). We consider
a range of frequency resolutions in a dyadic manner using the wavelet resolution levels Ac5 corresponding to 0 to 15.5 Hz, Dc5 from 15.5 to 31 Hz,
Dc4 from 31 to 62 Hz, Dc3 from 62 to 125 Hz, Dc2 from 125 to 250 Hz
and finally Dc1 from 250 to 500 Hz. In the simulations, we increase the
inhibition and integration times of the cells A, B1 and B2 (Fig. 2.4) in order to explore the network's performance in the stimulus classification task.
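The dyadic band boundaries and coefficient counts can be cross-checked with the pywt library (an assumption for illustration, not a tool used in the thesis), taking the 1 ms temporal resolution as a 1 kHz sampling rate:

```python
import numpy as np
import pywt

fs = 1000.0                # 1 ms resolution -> 1 kHz sampling rate
tpc = np.random.rand(128)  # a 128 ms TPC signal

# pywt.wavedec returns [Ac5, Dc5, Dc4, Dc3, Dc2, Dc1] for level=5.
coeffs = pywt.wavedec(tpc, "haar", level=5)
names = ["Ac5", "Dc5", "Dc4", "Dc3", "Dc2", "Dc1"]
levels = [5, 5, 4, 3, 2, 1]
for name, c, lvl in zip(names, coeffs, levels):
    lo = 0.0 if name == "Ac5" else fs / 2 ** (lvl + 1)
    hi = fs / 2 ** (lvl + 1) if name == "Ac5" else fs / 2 ** lvl
    print(f"{name}: {len(c)} coefficients, {lo:.1f}-{hi:.1f} Hz")
```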
The results using RS neurons show that the classification performance has a peak in the frequency range from 62 Hz to 125 Hz, equivalent to the so-called Dc3 level in the wavelet domain, where 91%, 83% and 74% of the TPC-encoded stimuli are correctly classified for subsets one, two and three, respectively (Fig. 2.8). In comparison, using BS neurons the classification performance has a peak in the frequency range from 15 Hz to 31 Hz, equivalent to the so-called Dc5 level in the wavelet domain, where 92%, 82% and 74% of the responses are correctly classified for subsets one, two and three, respectively (Fig. 2.8). Reading out the TPC without the wavelet circuit, i.e. using the integrated spiking activity over time without the wavelet representation, we achieve a classification ratio for subsets one, two and three of 88%, 79% and 75% for RS neurons and 87%, 80% and 74% for BS neurons. Thus, the wavelet circuit adds a marginal improvement to the readout of the TPC signals as compared to the control condition for the RS neurons, in particular for the easier stimulus set, while the BS version of the model does not show a marked increase in classification performance.

Figure 2.8: Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters for both types of neurons, RS and BS. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multi-resolution scheme. The network time window is 128 ms.
However, while maintaining nearly the same classification performance, the dyadic property of the discrete wavelet transform compresses the length of the temporal signal by factors of 8 and 32 using the Dc3 and Dc5 levels for RS and BS neurons, respectively. So the information encoded over the 128 ms of stimulus presentation is captured by only a few wavelet coefficients, in a compressed way.
In comparison with the benchmark results (Fig. 2.7), the wavelet circuit readout provides a slightly lower classification ratio. However, if we look at the numbers for the BS network, the TPC wavelet readout can generate a reliable representation of the input image with 24x4 coefficients. Compared to the 80x80 pixels of the original input image used for the benchmark, this representation is extremely compressed: the compression factor is about 66 without a significant loss in classification performance. Therefore the wavelet coefficients provide for a compact representation of the stimuli in a specific region of the frequency spectrum.
Classification speed
A key feature of TPC is the encoding speed. It has been shown in previous
TPC studies that the speed of encoding is compatible with the speed of
processing observed in the mammalian visual system (Thorpe et al., 1996;
Wyss et al., 2003c; Töllner et al., 2011). Here we investigate how fast
the information transmitted by the TPC is captured by the wavelet coefficients. We use the mutual information measure (Victor and Purpura,
1999) to quantify the amount of transmitted information for a varying
length of the signal interval of all the 24 network layers used to calculate
the Euclidean distance (Fig. 2.9). The mutual information calculation is
performed using the wavelet coefficients generated by the readout circuit.
We also compare the speed of encoding of the TPC signal in the temporal domain against the readout version based on the wavelet coefficients. For RS neurons, the wavelet coefficients that lead to maximum classification performance are localized in the frequency interval from 62 Hz to 125 Hz, equivalent to the resolution level Dc3. For BS neurons, the frequency interval is in a lower range, from 15 Hz to 31 Hz, equivalent to the resolution level Dc5. In these frequency ranges the maximum classification performance is achieved, as shown in the previous section (Fig. 2.8).
Figure 2.9: Speed of encoding. Number of bits encoded by the network's activity trace as a function of time. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. The RS-Wav and BS-Wav curves correspond to the bits encoded by the wavelet coefficients using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons, respectively. For a time window of 128 ms the Dc3 level has 16 coefficients and the Dc5 level has only 4. The dots in the figure represent the moments in time at which the coefficients are generated.

We observe that the number of bits encoded over the time window of 128 ms is nearly the same when comparing the non-filtered TPC signals
(RS-TPC and BS-TPC, Fig. 2.9) and the signals captured by the wavelet
readout (RS-Wav and BS-Wav, Fig. 2.9). However, in the case of the
BS neurons the speed of encoding is slower when the signal is decoded
by the wavelet circuit. This effect is due to the longer time constant of
the buffer cells B1 and B2 to integrate and differentiate the signal at this
resolution level and therefore to compute the wavelet coefficients. The
buffer cells need 32 ms to compute the first wavelet coefficient (Fig. 2.9). In the case of the RS neurons, more than 90% of the total information was captured within the second wavelet coefficient, i.e. 16 ms after stimulus onset. Thus the
effect of the neuronal wavelet circuit on the speed of encoding depends
both on the spiking behavior of the encoders and on the frequency range
at which the signal is read out.
Prototype-based classification
In the last step, we investigate whether the wavelet representation can be
generalized to the generation of prototypes from the stimulus classes. The
aim of the experiment is to create prototypes learned from the training
set that can be stored in memory and retrieved in a future classification
task. To construct such representations we build class prototypes based
on the wavelet coefficients of the N stimuli making up the training set. For
each of the ten stimulus classes we calculate the median over the wavelet
coefficients among the 50 response exemplars that define a training class
(subset one). Hence, for each class we have a vector of wavelet coefficients that defines the class-specific prototype, i.e. a representation of the class in the wavelet domain. Based on the classification experiments, we use for RS neurons the coefficients from the Dc3 level and for BS neurons those from the Dc5 level. The Fourier transform of the prototypes reveals the frequency components that comprise the prototypes for the two different neuron models we consider (Fig. 2.10).
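A minimal sketch of the prototype construction and of the minimum-distance criterion described next (illustrative Python):

```python
import numpy as np

def build_prototypes(coeffs, labels, n_classes=10):
    """Class prototypes: the element-wise median of the wavelet coefficient
    vectors (e.g. Dc3 for RS, Dc5 for BS) over each training class."""
    coeffs, labels = np.asarray(coeffs), np.asarray(labels)
    return np.stack([np.median(coeffs[labels == k], axis=0)
                     for k in range(n_classes)])

def classify_by_prototype(response, prototypes):
    """Assign a response to the class whose prototype is nearest in
    Euclidean distance."""
    return int(np.argmin(np.linalg.norm(prototypes - response, axis=1)))

# Toy usage: 50 training exemplars per class, 16 Dc3 coefficients each.
train = np.random.rand(500, 16)
train_y = np.repeat(np.arange(10), 50)
protos = build_prototypes(train, train_y)
label = classify_by_prototype(np.random.rand(16), protos)
```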
We present the classification set to the network and calculate the Euclidean
distance between the output responses of the wavelet network and the ten previously created prototypes. For the classification a simple criterion is adopted: the smallest Euclidean distance defines the class to which the stimulus is assigned.
The results for the RS neurons show that 86%, 82% and 75% of the responses are correctly classified (diagonal entries) by the wavelet prototypes for subsets one, two and three, respectively. For the BS neurons we observe that 91%, 81% and 72% of the responses are correctly classified for subsets one, two and three, respectively (Fig. 2.11). The classification ratios are consistent with the results previously presented in section 2.3 using the cluster algorithm (see methods, section 2.2). However, the number of calculations in the classification stage is drastically reduced because each class is represented by a prototype vector of wavelet coefficients instead of a collection of vectors. This result suggests that with a simple algorithm the wavelet representation can be integrated into a compact description of a complex spatially organized stimulus. Therefore, the information provided by densely coupled cortical neurons can be learned and efficiently stored in memory independently of the details of their spiking behavior.
In the second part of the experiment, we present a geometric and spatio-temporal analysis of the underlying neural code. We perform a correlation analysis of the stimuli misclassified using the wavelet prototypes in both the wavelet and the spatial domain. We want to understand whether the geometric deformations applied in the spatial domain are directly translated to the temporal representation captured by the wavelet coefficients. This analysis is performed using the exemplars that were misclassified using the wavelet prototype approach. First, we calculate the normalized Euclidean distances in the spatial domain between each stimulus of the misclassified set and its prototype for each class. Second, we apply the same distance measure using the wavelet representation of the misclassified stimuli and the prototypes. Finally, we compute the correlation between the distance values (Fig. 2.12).
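A minimal sketch of this correlation analysis (illustrative Python, assuming a rank correlation such as Spearman's rho; this excerpt does not state which correlation coefficient was used):

```python
import numpy as np
from scipy import stats

# Hypothetical per-stimulus distances for the misclassified exemplars:
# d_spatial[i] and d_wavelet[i] are the normalized Euclidean distances of
# misclassified stimulus i to its class prototype in the two domains.
rng = np.random.default_rng(0)
d_spatial = rng.random(100)
d_wavelet = 0.6 * d_spatial + 0.4 * rng.random(100)

rho, p = stats.spearmanr(d_spatial, d_wavelet)
print(f"Rho = {rho:.2f}, p = {p:.3g}")
```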
Figure 2.10: Single-sided amplitude spectrum of the wavelet prototype for each stimulus class used in the simulations. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion).
Figure 2.11: Prototype-based classification hit matrices. For each class in the training set we average the wavelet coefficients to form class prototypes. In the classification process, the Euclidean distances between the classification set and the prototypes are calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.
The results show a positive correlation over the Euclidean distances in the two domains, suggesting that the amount of geometric deformation in the spatial domain is directly translated to the wavelet representation of the temporal code. The correlation is higher for RS neurons, with a value of 0.62 against 0.50 for BS neurons (p<0.001). The positive correlation between both domains validates the wavelet prototypes, and therefore the overall TPC transformation structure, as an equivalent representation of the stimulus classes that conserves the relevant spatial information.
2.4
Discussion
We have shown previously that in a model of the sensory cortex, the
representation of a static stimulus can be generated using the temporal
dynamics of a neuronal population or Temporal Population Code. Here
Figure 2.12: Distribution of errors in the wavelet-based prototype classification in relation to the Euclidean distances within the prototyped classes (distance in the wavelet domain plotted against distance in the spatial domain; RS neurons: Rho = 0.62, p < 0.001; BS neurons: Rho = 0.50, p < 0.001).
we have shown that this temporal code has a specific signature in the
phase relationships among the active neurons of the underlying substrate.
This signal efficiently conveys a complete and dense representation of the stimulus that can be decoded in downstream areas through a subset of wavelet coefficients.
The TPC is a relevant hypothesis on the encoding of sensory events given
its consistency with cortical anatomy and recent physiology. Since its introduction, however, a persistent problem has been to extend this concept
to a readout stage that would be neurobiologically compatible. A priori
it was not clear whether a wavelet transform would be suitable because it
implies a specific structure in the TPC representation itself.
We have shown that decoding of the TPC can be based on wavelet transforms. Using a systematic spectral investigation of the TPC signal, we observed that the Haar basis is a feasible choice, providing robust and reliable decoding. The results achieved with the Haar
wavelet are consistent with previous studies where Haar filters are used to
preprocess neuronal data prior to a linear discriminant analysis (Laubach,
2004). In comparison with the original algorithmic TPC readout, the neuronal wavelet circuit proposed here showed the same encoding performance for regular spiking neurons. For bursting neurons, however, the wavelet readout had a slower speed of encoding than the original TPC. Therefore the details of the readout mechanism, such as its integration time constant and its wavelet resolution level, depend on the spiking dynamics.
The specific geometric characteristics of each stimulus class could be captured in a very compact way using the wavelet filters. We showed that a visual stimulus can be represented and further classified using a strongly compressed signal based on the wavelet coefficients. Reading out the network of regular spiking neurons using wavelet coefficients yielded a compression factor of 16.6 relative to the original image size. In the case of the bursting neurons the compression factor was even higher, reaching 66. Therefore, the spatial-to-temporal transformation of the TPC model combined with the efficient wavelet readout circuit can provide a robust and compact representation of sensory inputs in the cortex for different spiking modalities.
We have performed a detailed analysis of the misclassified stimuli in order to better understand the similarity-conserving misclassifications we observe. We found a positive correlation among the geometric distortions between the spatial and temporal domains, the latter represented by the wavelet coefficients. These findings suggest that deformations in the spatial domain were directly translated into the wavelet domain and were therefore responsible for the observed misclassifications. This result reinforces the direct relationship between the geometric and spatio-temporal aspects of the underlying neural code and its decoding.
Our results also suggest that specific axon/synapse complexes dedicated
to specific features are not needed to successfully encode visual stimuli.
We have shown that the efficient structure of an orthogonal basis like the Haar wavelet can be implemented by a neuronal circuit (see Fig. 2.4) with low computational cost and low latency, and thus in a real-time system. The wavelet filters, as implemented with buffer neurons, can be changed online depending on what kind of information needs to be retrieved and
its specific frequency range. This property allows multiplexing high-level
information from the visual input and flexible storage and retrieval of information by a memory system. Indeed, recent experiments have shown
that distinct activity patterns overlaid in primary visual cortex individually signal motion direction, speed, and orientation of object contours
within the same network at the same time (Onat et al., 2011). We speculate that in a non-static scenario other stimulus characteristics such as
motion speed and direction could be extracted from other frequency bands
using the same wavelet circuit.
The optimal resolution level for the wavelet readout was determined through a systematic investigation based on the correlation between compression and performance (Fig. 2.8). In the case of regular spiking neurons we observed maximum classification performance at the Dc3 resolution level, i.e. in the frequency range from 62 Hz up to 125 Hz, while for bursting neurons it falls in the range of 15.5 Hz to 31 Hz. The wavelet circuit itself
does not define the choice of the wavelet resolution. However, we could
show that in the TPC framework the proposed readout circuit can capture
different properties of visual stimuli that travel through the sensory processing stream at different frequency ranges. In our case the sensitivity of
the wavelet circuit will depend on the feed-forward receptive fields combined with the phase relationship imposed by the inhibitory units. The
mechanisms used by higher cognitive areas to manipulate the frequency
ranges and the kind of information carried in these temporal information
channels are currently not clear and are the subject of follow-up studies.
In comparison to the LSM model previously applied to read out the TPC,
the wavelet circuit is computationally inexpensive and requires only four
neurons to be implemented. Although the optimal readout performance also depends on specific parameters that set the readout frequency range, the generality of the model is not affected, as it is in the case of the LSM. Our
results suggest that this optimal frequency range is determined by the
spiking behavior of the neurons in the network.
From a technical perspective, one issue related to orthogonal wavelets is the aliasing effect (Chen and Wang, 2001), which could insert redundant spectral content into the TPC signals, leading to reduced classification performance. This can be addressed by increasing the number of vanishing moments of the wavelet basis, which smooths the effects of aliasing and increases the orthogonality between the spectral sub-bands. However, the filtering process would then become more sophisticated and would require more than two buffer cells. In contrast, the Haar-based readout circuit is computationally efficient.
To evaluate the effects of different spiking behaviors on the proposed readout circuit, we used a different and more physiologically constrained neuron model than in previous TPC studies. The overall dynamics of the previous neuron model are significantly different from the model used here. For instance, the model used in previous studies (Wyss et al., 2003c) has a strong onset response, with about 50% more spikes in the first 20 ms after stimulus onset than the model used in the current study, which includes mechanisms for spike adaptation. In addition, we observed significant differences in the sub-threshold fluctuations and the membrane potential envelope between these two neuron models. However, the overall results reported here are compatible with those previously
reported (Wyss et al., 2003a,c). Based on that, we conclude that TPC and
the proposed readout mechanism are robust with respect to the details of
the spiking dynamics and the overall biophysical properties of the membrane potential envelope. We are not aware of any other encoding-decoding model of cortical dynamics that shows a similar generality.
The performance measure we use, essentially classification based on Euclidean distance, is a well-established standard in the literature (Victor and Purpura, 1999). We looked both at classification performance and at the information encoded. In order to develop the specific point
of this paper, the decoding of the TPC using wavelets, we adhere to this
standard. Hence, the results should be seen as relative to those established
and published for the TPC, contributing to the unresolved issue of how
a biologically plausible decoding can take place. We demonstrate that a
Wavelet-like transform can fulfill the requirements for an efficient readout
mechanism, thus generating a specific hypothesis on the role of the sparse
inter-areal connectivity found in the neo-cortex.
We have shown that the dense local connectivity of the neo-cortex can transform the spatial organization of its inputs into a compact Temporal Population Code (TPC). By virtue of its multiplexing capability this
code can be transmitted to downstream decoding areas using the sparse
long-range connections of the cortex. We have shown that in these downstream areas the TPC can be decoded and further compressed using a
wavelet based readout system. Our results show that the TPC information is organized in a specific subset of frequency space creating virtual
communication channels that can serve working memory systems. Thus,
TPC not only provides a functional hypothesis on the specifics of cortical anatomy but also a computational substrate for a functional neuronal architecture that is organized in time rather than in space
(Buzsáki, 2006).
2.5 Acknowledgments
This work was supported by EU FP7 projects EFAA (FP7-ICT-270490)
and GOAL-LEADERS (FP7-ICT-97732).
Chapter 3

Temporal Population Code for Face Recognition on the iCub Robot
The connectivity of the cerebral cortex is characterized by dense local
and sparse long-range connectivity. It has been proposed that this connection topology provides a rapid and robust transformation of spatial
stimulus information into a temporal population code (TPC). TPC is a
canonical model of cortical computation whose topological requirements
are independent of the properties of the input stimuli and, therefore, can
be generalized to the processing requirements of all cortical areas. Here we
propose a real-time implementation of TPC for classifying faces, a complex natural stimulus that mammals are constantly confronted with. The model consists of a primary visual cortex V1 network of laterally connected integrate-and-fire neurons implemented on the humanoid robot platform iCub. The experiment was performed using human faces presented to the robot under different angles and positions of light incidence. We show that the TPC-based model can recognize faces with a correct classification ratio of 97% without any face-specific strategy. Additionally, the speed of encoding is coherent with the mammalian visual system, suggesting that the representation of natural static visual stimuli is generated based on the combined temporal dynamics of multiple neuron populations. Our results show that, without any input-dependent wiring, TPC can be efficiently used for encoding local features in a high-complexity task such as face recognition.
3.1 Introduction
The mammalian brain has a remarkable ability to recognize objects under widely varying conditions. To perform this task, the visual system must solve the problem of building invariant representations of the objects that are the sources of the available sensory information. Since the breakthrough work on V1 by Hubel and Wiesel (Hubel and Wiesel, 1962) there has been an increasing number of models of the visual cortex trying to reproduce the performance of natural vision systems.
A number of these models have been developed to reproduce characteristics of the visual system such as invariance to shifts in position, rotation, and scaling. Most of these models are based on the so-called Neocognitron (Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010), a hierarchical multilayer network of detectors with varying feature and spatial tuning.
In this classical framework invariant representations emerge in the form
of activity patterns at the highest level of the network by virtue of the
spatial averaging across the feature detectors at the preceding layers. The
recognition process is based on a large dictionary of features stored in
memory where model neurons in the different layers act as filters tuned to
specific features or feature combinations. In this approach invariances to,
for instance, position, scale and orientation, are achieved at the high cost
of increasing the number of connections between the layers of the network.
In recent years a novel model of object recognition has been proposed based on the specific connectivity template found in the visual cortex combined with the temporal dynamics of neural populations. The so-called Temporal Population Code, or TPC, emphasizes the property of
densely coupled networks to rapidly encode the geometric organization of
their inputs into the temporal structure of their population response (Wyss
et al., 2003a,c). In the TPC architecture the spatial information provided
by an input stimulus is transformed into a temporal representation. The
encoding is defined by the spatially averaged spikes of a population of
integrate-and-fire neurons over a certain time window.
In this paper we make use of the TPC scheme to encode one of the most
complex stimuli mammals are challenged by: faces (Zhao et al., 2003;
Martelli et al., 2005; Sirovich and Meytlis, 2009). In the brain, the recognition process of known faces takes place through a complex set of local
and distributed processes that interact dynamically (Barbeau et al., 2008).
Recent results suggest the existence of specialized areas populated by neurons selective to specific faces. For instance, in the macaque brain the
middle face patch has been identified where neurons detect and differentiate faces using a strategy based on the distinct constellations of face parts
(Freiwald et al., 2009). Nevertheless, these higher cortical areas where
curves and spots are assembled into coherent objects must be fed by a
canonical computational principle. Our model is based on the hypothesis
that the TPC is the canonical feature encoder of the visual cortex that
feeds higher areas involved in the recognition process.
The model is implemented and tested in a real-time context. The robot platform used for the experiments is the humanoid robot iCub (Metta et al., 2008). In the face recognition task, we demonstrate that the TPC model can recognize faces under different natural conditions. We also show that the speed of encoding is compatible with the human visual system. Finally, we present a comparison between TPC face recognition and other models available in the literature using a standard face database. In the next section we give a full description of the model and the parameters used for the simulations.
3.2 Methods
The model we propose here consists of a retinotopic map of spiking neurons with properties found in the primary visual cortex V1 (Wyss et al., 2003c,a). The network consists of N×N neurons interconnected within a circular neighborhood with synapses of equal strength and instantaneous excitatory conductances. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons in the map. The stimuli are continuously presented to the network and the spreading spatial activity of the V1 units is integrated over a time window, as a sum of their action potentials. The outcome is the so-called TPC, as shown in the previous chapter (Fig. 2.1).
The TPC network
In the network, each modeled neuron is approximated by a leaky integrateand-fire unit. The time course of its membrane voltage is given by:
Cm
dV
= −(Ie (t) + Ik (t) + Il (t)).
dt
(3.1)
$C_m$ is the membrane capacitance and $I$ represents the transmembrane currents: excitatory ($I_e$), spike-triggered potassium ($I_k$) and leak ($I_l$). These currents are computed by multiplying the conductance $g$ by the driving force as follows:

\[ I(t) = g(t)\big(V(t) - V^r\big) \tag{3.2} \]
where $V^r$ is the reversal potential of the conductance. The unit's activity at time $t$ is given by:

\[ A(t) = a\,\{H(V(t) - \theta)\} \tag{3.3} \]
where $a \in [0, 1]$ is the spike amplitude of the unit, and $V(t)$ is determined by the total input composed of the external stimulus and internal connections. $H$ denotes the Heaviside function and $\theta$ is the firing threshold. After a spike is generated, the unit's potential is reset to $V_r$. The time course of the potassium conductance $g_k$ is given by:

\[ \tau_K \frac{dg_k}{dt} = -\big(g_k(t) - g_k^{peak}\hat{A}(t)\big) \tag{3.4} \]

where $\hat{A}(t) = H(V(t) - \theta)$.
The excitatory input is composed of the sum of two components: a constant driving excitatory input $g_i$ and the synaptic conductances given by the lateral interaction of the units, $g_c(t)$:

\[ g_e(t) = g_i + g_c(t) \tag{3.5} \]
In the visual cortex it has been observed that many cells produce strong
activity for a specific optimal input pattern. De Valois et al. (1982) showed that the receptive field of simple cells is sensitive to specific
position, orientation and spatial frequency. This behavior can be formalized as a multidimensional tuning response that can be approximated by
a Gaussian-like template-matching operation.
In our model each unit is characterized by an orientation selectivity angle $\phi \in \{0, \frac{\pi}{4}, \frac{2\pi}{4}, \frac{3\pi}{4}\}$ and a bi-dimensional vector $x \in \mathbb{R}^2$ specifying its receptive field's center location within the input plane. So each unit in the network is defined by $u(x, \phi)$, which we envision as approximating the functional notion of a cortical column.
The receptive field of each unit is approximated using the second derivative of an isotropic Gaussian filter $G_\sigma$. The input image is convolved with $G_\sigma$ to
reproduce the V1 receptive field’s characteristics for orientation selectivity.
The input stimulus is a retinal gray scale image that fills the whole visual
field.
The convolution with $\partial^2_{x(\phi)} G_\sigma$ is computed by twice applying a sampled first derivative of a Gaussian ($\sigma = 1.5$, size 11x11 pixels) over the four orientations $\phi$ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold $T_i \in [0, 1]$, where the values above $T_i$ are set to $E_x$. Each orientation-selective output is projected onto a specific network layer.
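For illustration, the directional second derivative can be expanded as $\cos^2\phi\,G_{xx} + 2\cos\phi\sin\phi\,G_{xy} + \sin^2\phi\,G_{yy}$ and evaluated with standard separable filters. The sketch below is a minimal NumPy/SciPy approximation of this front end, not the thesis implementation; in particular the analytic expansion (in place of twice applying a sampled first derivative) and the normalization step are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_response(img, phi, sigma=1.5, Ti=0.6, Ex=1.0):
    """Second directional derivative of a Gaussian along angle phi,
    thresholded as in the model: values above Ti are set to Ex."""
    c, s = np.cos(phi), np.sin(phi)
    # axis 0 = rows (y), axis 1 = columns (x)
    gxx = gaussian_filter(img, sigma, order=(0, 2))
    gyy = gaussian_filter(img, sigma, order=(2, 0))
    gxy = gaussian_filter(img, sigma, order=(1, 1))
    resp = c * c * gxx + 2 * c * s * gxy + s * s * gyy
    resp = np.abs(resp) / (np.abs(resp).max() + 1e-12)  # assumed normalization
    return np.where(resp > Ti, Ex, 0.0)

# one network layer per orientation
layers = [oriented_response(np.random.rand(128, 128), phi)
          for phi in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```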
The lateral connectivity between V1 units is exclusively excitatory, and a unit $u_a(x, \phi)$ connects to $u_b$ if the following conditions are met:

1. they are in the same population: $\phi_a = \phi_b$
2. they have different center positions: $x_a \neq x_b$
3. they lie within a region of a certain diameter: $\|x_b - x_a\| < r$
The synaptic efficacies are of equal strength $w$ and the transmission delays $\tau_a$ are proportional to $\|x_b - x_a\|$ with 1 ms/cell. The connectivity diameter is equal to $d_{max}$. In the discrete-time case, all the equations are integrated with Euler's method using a temporal resolution of 1 ms, so one simulation time step is equivalent to 1 ms. All the simulations were done using the parameters of Table 3.1. When a unit $u$ receives the maximum excitatory input $E_x$ at onset time $t = 0$ ms, its first spike occurs at $t = 6$ ms.
The TPC is generated by summing the network's activity over a time window of $T$ ms. Finally, the output vectors from the different orientation layers are summed, forming the TPC vector. In this way the temporal representation of the TPC is naturally invariant to position. Theoretically, rotation invariance is achieved by increasing the number of orientation-selective layers, approaching a fully invariant architecture as the number of layers grows.
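To make the update loop of Eqs. 3.1–3.5 explicit, here is a minimal Euler-integration sketch (dt = 1 ms) of the population plus the activity sum that yields one TPC trace. Treating the lateral conductance as a precomputed input array, the parameter dictionary keys, and the reset to the leak reversal potential are simplifying assumptions of this sketch, not the thesis simulator.

```python
import numpy as np

def simulate_tpc(g_c, p, T=100):
    """Euler integration (dt = 1 ms) of the leaky integrate-and-fire units
    of Eqs. 3.1-3.5. g_c: (T, n_units) lateral conductance per time step.
    Returns one TPC trace: the summed population activity per time step."""
    n = g_c.shape[1]
    V = np.full(n, p['Vl_r'])            # start at the leak reversal (rest)
    gk = np.zeros(n)                     # spike-triggered K+ conductance
    tpc = np.zeros(T)
    for t in range(T):
        ge = p['gi'] + g_c[t]                              # Eq. 3.5
        Ie = ge * (V - p['Ve_r'])                          # Eq. 3.2 currents
        Ik = gk * (V - p['Vk_r'])
        Il = p['gl'] * (V - p['Vl_r'])
        V = V - (Ie + Ik + Il) / p['Cm']                   # Eq. 3.1, dt = 1 ms
        spiked = V >= p['theta']                           # Eq. 3.3 with a = 1
        gk += -(gk - p['gk_peak'] * spiked) / p['tauK']    # Eq. 3.4
        V[spiked] = p['Vl_r']            # reset (assumed V_r = leak reversal)
        tpc[t] = spiked.sum()            # population activity -> TPC sample
    return tpc
```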
Variable | Description | Value
N | network dimension | 128x128 neurons
Cm | membrane capacitance | 0.2 nF
Ve^r | excitatory rev. potential | 60 mV
Vk^r | spike-triggered potassium rev. potential | -90 mV
Vl^r | leak reversal potential | -70 mV
θ | firing threshold | -55 mV
τK | time constant | 1 ms/cell
gk^peak | peak potassium conductance | 200 nS
gl | leak conductance | 20 nS
a | spike amplitude | 1
gi | excitatory input conductance | 5 nS
Ti | minimum V1 input threshold | 0.6
dmax | lateral connectivity diameter | 7 units
w | synapsis strength | 0.1 nS

Table 3.1: Parameters used for the simulations. See text for further explanation.
Optimizing the lateral spreading of activation in the network
The key part of the model is the lateral propagation of activity, denoted by $g_c(t)$ in equation (3.5). Since hundreds of spikes can occur at each time step, an efficient and fast real-time implementation is required. This is achieved by performing consecutive multiplications in the frequency domain.

The lateral propagation of activity is performed after each spike at every time step, i.e. every 1 ms in our simulations. The lateral spread can be seen as consecutive square regions of propagation $f_d$, or filters of lateral propagation, with increasing scale over a short period of time (Fig. 3.1). The maximum scale of $f$ is given by the lateral connection diameter $d_{max}$. The magnitude of the spreading activation is given by the synapsis strength $w$. For the simulations $d_{max} = 7$, thus the $f_d$ are of scale 3x3, 5x5 and 7x7. They are generated and stored in the frequency domain (denoted by $F_d$) when the network is initialized. In the implementation the Discrete Fourier Transform (DFT) operations were done using the FFTW3 library (Frigo and Johnson, 2005).
Figure 3.1: Schematic of the lateral propagation using Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the
frequency domain. The matrix of spikes generated at time t is multiplied by the
filters of lateral propagation Fd over the next time steps t + 1 and t + 2. The
output is given by the inverse DFT of the result.
At time $t$ we calculate the DFT of the spike matrix, denoted by $M$. The matrix $M$ is composed of zeros and ones (ones where spikes occurred). The propagation of activity is achieved by multiplying $M$ by $F_d$ for $d = 3, 5$ and $7$ over the time steps $t + 1$, $t + 2$ and $t + 3$ respectively. Finally the inverse DFT is applied. Using a 2.66 GHz Intel Core i7 processor under Linux, each time step takes approximately 9 ms to process.
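The same filtering-in-the-frequency-domain idea can be sketched with NumPy's FFT. This is an illustrative reimplementation rather than the FFTW3-based thesis code; the kernel construction, the periodic (circular) boundary handling, and the use of the full square at each scale are assumptions.

```python
import numpy as np

def make_filters(d_max=7, w=0.1, shape=(128, 128)):
    """Precompute the DFTs of the square propagation kernels f_d
    (scales 3x3, 5x5, 7x7 for d_max = 7), zero-padded to the map size."""
    filters = []
    for d in range(3, d_max + 1, 2):
        f = np.zeros(shape)
        f[:d, :d] = w                                        # square of strength w
        f = np.roll(f, (-(d // 2), -(d // 2)), axis=(0, 1))  # center at (0, 0)
        f[0, 0] = 0.0                                        # no self-connection
        filters.append(np.fft.fft2(f))
    return filters

def lateral_spread(spikes, filters):
    """Multiply the spike-matrix DFT by each stored filter F_d and invert;
    element k of the result is the activity arriving k+1 steps later."""
    M = np.fft.fft2(spikes)
    return [np.real(np.fft.ifft2(M * F)) for F in filters]

# usage: spikes at time t produce delayed inputs at t+1, t+2, t+3
spikes = (np.random.rand(128, 128) < 0.01).astype(float)
gc_future = lateral_spread(spikes, make_filters())
```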
Clustering algorithm used in the classifier
The network's responses to stimuli from $C$ stimulus classes $S_1, S_2, \ldots, S_C$ are assigned to $C$ response classes $R_1, R_2, \ldots, R_C$ of the training set in a supervised manner. The result is a $C \times C$ hit matrix $N(S_\alpha, R_\beta)$, whose entries denote the number of times that a stimulus from class $S_\alpha$ elicits a response in class $R_\beta$. Initially, the matrix $N(S_\alpha, R_\beta)$ is set to zero. For each response $r \in S_\alpha$, we calculate the mean Pearson correlation coefficient between $r$ and the responses $r' \neq r$ elicited by stimuli of class $S_\gamma$:

\[ \rho(r, S_\gamma) = \langle \rho(r, r') \rangle_{r' \text{ elicited by } S_\gamma} \tag{3.6} \]

where $\langle \cdot \rangle$ denotes the average over the correlations $\rho(r, r')$ between $r$ and $r'$. The response $r$ is classified into the response class $R_\beta$ for which $\rho(r, S_\beta)$ is maximal, and $N(S_\alpha, R_\beta)$ is incremented by one. The overall classification ratio in percent is calculated by summing the diagonal of $N(S_\alpha, R_\beta)$ and dividing by the total number of elements in the classification set $R$. For the statistics reported here, the data was generated in real time and analyzed off-line.
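A direct NumPy transcription of this supervised clustering step might look as follows; function and variable names are illustrative, not taken from the thesis library.

```python
import numpy as np

def hit_matrix(train, train_labels, test, test_labels, n_classes):
    """Classify each test response into the class whose training responses
    have the highest mean Pearson correlation with it (Eq. 3.6)."""
    N = np.zeros((n_classes, n_classes), dtype=int)
    for r, alpha in zip(test, test_labels):
        scores = [np.mean([np.corrcoef(r, rp)[0, 1]
                           for rp in train[train_labels == gamma]])
                  for gamma in range(n_classes)]
        N[alpha, int(np.argmax(scores))] += 1
    return N

# overall classification ratio: diagonal hits over all classified stimuli
# ratio = 100.0 * np.trace(N) / N.sum()
```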
TPC and DAC architecture
The face recognition system was included in a larger architecture called BASSIS. We have recently used this architecture as a framework for human-robot interaction (Luvizotto et al., 2012a)¹. BASSIS is fully grounded in the Distributed Adaptive Control (DAC) architecture (Verschure and Althaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly coupled layers: reactive, adaptive and contextual. At each level of organization, increasingly more complex and memory-dependent mappings from sensory states to actions are generated, dependent on the internal state

¹ This work was done in collaboration with the partners from the European project EFAA.
of the agent. The BASSIS architecture expands DAC with many new
components. It incorporates the repertoire of capabilities available for the
human to interact with the humanoid robot iCub. BASSIS is divided into
4 hierarchical layers: Contextual, Adaptive, Reactive, Soma. The different modules responsible for implementing the relevant competences of the
humanoid are distributed over these layers (Fig. 3.2).
[Figure 3.2 diagram: Contextual layer (Supervisor, OPC); Adaptive layer (Face Recognition); Reactive layer (tpcNetwork, pmpActions, Face Detector); Soma layer (cartesianControl, gazeControl).]
Figure 3.2: BASSIS is a multi-scale biomimetic architecture organized at three
different levels of control: reactive, adaptive and contextual. It is based on the
well established DAC architecture. See text for further details.
The Soma and the Reactive layers are the bottom layers of the BASSIS
diagram. They are in charge of coding the perceptual inputs through the
faceDetector where the position of the faces in the visual field is acquired
and tpcNetwork that encodes the cropped face collected using the information given by the faceDetector. The Reactive layer also deals with the
robot motor primitives. The pmpActions package provides motor control
mechanisms that allow the robot to perform complex manipulation tasks.
It relies on a revised version of the so-called Passive Motion Paradigm (Mohan et al., 2009) that uses the cartesian controller developed by Pattacini
et al. (2010) to solve the inverse kinematics problem while ensuring that additional constraints such as joint limits are fulfilled.
In the Adaptive layer, the classification of the encoded face is performed
using the cluster algorithm detailed in section 3.2.
At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Objects Properties Collector (OPC). The Supervisor module essentially manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data such as object coordinates in multi-reference frames (both in image and robot domains). Furthermore, a central role is played by the OPC module, which represents the database where a large variety of information is stored from the beginning of the experiments.
The iCub
The iCub is a humanoid robot one meter tall, with dimensions similar to a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over the hands, arms, head and legs (Metta et al., 2008). The sensory inputs are provided by artificial skin, cameras and microphones. The robot is equipped
with novel artificial skin covering the hands (fingertips and the palm) and
the forearms (Cannata et al., 2008). The stereo vision is provided by
stereo cameras in a swivel mounting. The stereo sound capturing is done
using two microphones. Both cameras and microphones are located where
eyes and ears would be placed in a human. The facial expressions, mouth
and eyebrows, are projected from behind the face panel using lines of red
LEDs. It also has the sense of proprioception (body configuration) and
movement (using accelerometers and gyroscopes).
The different software modules in the architecture are interconnected using YARP (Metta et al., 2006). This framework supports distributed computation with a focus on robot control and efficiency. The communication between two modules using YARP happens through objects called "ports". The ports can exchange data over different network protocols such as TCP and UDP.
Currently, the iCub is capable of reliably coordinating reach and grasp motions, expressing emotions through facial expressions, controlling forces by exploiting its force/torque sensors, and gazing at points in the visual field. The interaction can be performed using spoken language. With all these features, the iCub is an outstanding platform for research on social interaction between humans and humanoid robots.
Object Properties Collector (OPC)
The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and growing as the interaction with humans evolves over time. In this setup, the data provided by the face recognition model, and eventually information about other objects, feed the memory of the robot represented by the OPC. The OPC reflects what the system knows about the environment and what is available. The data collected range from object coordinates in multi-reference frames (both in the image and robot domains) to the human's emotional state or position in the interaction process.
Entities in OPC are addressed with unique identifiers and managed dynamically, meaning that they can be added, modified, removed at run-time
safely and easily from multiple sources located anywhere in the network,
practically without limitations. Entities consist of all the possible elements
that enter or leave the scene: objects, table cursors, matrices used for geometric projection, as well as the robot and the human themselves. All the
properties related to the entities can be manipulated and used to populate
the centralized database. A property in the OPC vocabulary is identified
by a tag specified with a string (a sequence of characters). The values that
can be assigned to a tag can be either single strings or numbers, or even
a nested list of strings and numbers.
Spoken Language Interaction and Supervisor
The spoken language interaction is managed by the Supervisor. It is implemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application Development (RAD) environment in the form of a finite-state dialogue system where speech synthesis is done by Festival and recognition by Sphinx-II. These state modules provide functions such as conditional transitions to new states based on the words and sentences recognized, and thus conditional execution of code based on the current state and spoken input
from the user. The recognition and the extraction of the useful information from the recognized sentence is based on a simple grammar that can
be developed according to each application (Lallée et al., 2010).
The stimulus set used in the face recognition experiments
Real-time
The images used as input to the network were acquired using the robot’s
camera with a resolution of 640x480 pixels (Fig. 3.3). The faces were
detected and cropped using a cascade of boosted classifiers based on Haar-like features (Li et al., 2004). After the detection process the images are rescaled to a resolution of 128x128 pixels covering the whole scene.
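The thesis uses the boosted cascade of Li et al. (2004); a comparable detect-and-crop step with OpenCV's stock Haar cascade might look like the sketch below. The cascade file, the detection parameters and the output handling are assumptions for illustration.

```python
import cv2

# stock frontal-face Haar cascade shipped with OpenCV (an assumption,
# not necessarily the cascade used in the thesis)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(frame, size=(128, 128)):
    """Detect faces in a camera frame and return each face cropped
    and rescaled to a fixed resolution."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], size)
            for (x, y, w, h) in faces]
```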
We ran the experiments reported here with faces from 6 subjects, a sample size sufficient for most social robotics tasks such as games, interaction or
Figure 3.3: Model overview. The faces are detected and cropped from the input
provided by the iCub’s camera image. The cropped faces are resized to a fixed
resolution of 128x128 pixels and convolved with the orientation selective filters.
The output of each orientation layer is processed by separate neural populations
as explained above. The spike activity is summed over a specific time window
rendering the Temporal Population Code, or TPC.
human assistance. The stimulus set consists of 6 classes of faces, one per
subject, where each class has 7 variations of the same face under different
poses. For the statistics 2 faces are used as training set and 5 faces as
classification set for each class. The training stimuli are not used for
classification and vice-versa.
Standard face database
In order to compare the proposed system to standard face recognition
methods available in the literature, we use the cropped Yale Face Database
B (Georghiades et al., 2001) (Lee et al., 2005). In this dataset, there are
10 classes with different human subjects. Each class has variations of the same face under different angles and positions of light incidence, divided into four subsets of increasing difficulty (Fig. 3.4). In this quantitative study we use $l_t = 7$ faces of subset 1 and $l_c = 12$ faces of subset 2 as the training set.
The subsets 2, 3 and 4 are used as classification set. The training stimuli
are not used for classification and vice-versa, except when the subset 2
is used for classification. All images are in original resolution and aspect
ratio.
Figure 3.4: Cropped faces from Yale face database B used as a benchmark to
compare TPC with other methods of face recognition available in the literature
(Georghiades et al., 2001) (Lee et al., 2005).
3.3 Results
In the first experiment we investigate the capabilities of our model in a face recognition task using the iCub. We also evaluate the optimal number of time steps required for the recognition process, a free parameter that determines how many iterations the encoding process needs to support reliable classification. It directly affects both the processing time, and therefore the overall system performance, and the length of the TPC vector, and thus the performance/compression ratio of the data. We calculate the classification ratio over the network's activity trace for different time windows after stimulus onset, ranging from 2 to 130 ms in steps of 1 ms, used to compute the correlation between the TPCs of the training and test sets.
The simulations were performed with the parameters of Table 3.1. We are
not using any light normalization mechanism in the model. For that reason
we performed the simulations for different values of the input threshold Ti
(see methods).
The classification performance shows a maximum of 87 % within 21 ms
after stimulus onset using Ti = 0.6 (Fig. 3.5). This speed of optimal
stimulus encoding is compatible with the speed of processing observed in
the mammalian visual system (Thorpe et al., 1996). A TPC vector over
this time window represents a face with 22 points covering about 1%
of the total number of pixels in a 128x128 image. The misclassified faces
show strong deviations in position or light conditions (Fig. 3.7 and Fig.
3.6).
Figure 3.5: Speed of encoding. Average correct classification ratio given by
the network’s activity trace as a function of time for different values of the input
threshold Ti .
In the second experiment we show that the recognition performance can be enhanced by changing the readout mechanism. In this experiment the network spike activity is read out over $S_r$ adjacent square sub-regions of the same aspect ratio. The TPC vectors from the different sub-regions are concatenated to form the overall representation of the face. We used four different configurations to divide the sub-regions, using a time window of 21 ms (Fig. 3.8). This operation increases the size of the TPC vector that represents a face by a factor of $S_r$.
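Assuming the spike activity is available as a (time, height, width) array, this sub-region readout can be expressed compactly with a block reshape; the sketch and its names are illustrative assumptions, not the thesis code.

```python
import numpy as np

def subregion_tpc(activity, n_blocks):
    """Sum spike activity within each of n_blocks x n_blocks square
    sub-regions per time step, then concatenate the per-region traces.
    activity: array of shape (T, H, W); returns a vector of length
    T * n_blocks**2."""
    T, H, W = activity.shape
    bh, bw = H // n_blocks, W // n_blocks
    # reshape into blocks and sum the spikes inside each block
    blocks = activity[:, :bh * n_blocks, :bw * n_blocks].reshape(
        T, n_blocks, bh, n_blocks, bw).sum(axis=(2, 4))
    # per-region temporal traces, concatenated into one TPC vector
    return blocks.reshape(T, -1).T.reshape(-1)

# e.g. 4x4 = 16 sub-regions, each contributing one temporal trace
vec = subregion_tpc(np.random.rand(21, 128, 128), 4)
```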
The results show a maximum performance of 97 %. It also suggests an
Figure 3.6: Response clustering. The entries of the hit matrix represent the
number of times a stimulus class is assigned to a response class over the optimal
time window of 21 ms and Ti of 0.6.
optimal ratio of performance/compression with 16 sub-regions, i.e. 16x22 points per vector.
In the last part, we show the results obtained with the TPC model using cropped faces from the Yale Face Database B. We followed the same
procedure, reading out the network over Sr adjacent square sub-regions
of the same aspect ratio and also varying the values of the input threshold $T_i$ (Fig. 3.9). In Table 3.2, we summarize the TPC performance in comparison with other face recognition methods. The TPC results presented in the table were calculated using 64 sub-regions, totaling 64x22 = 1408 points per vector. This corresponds to approximately 5% of the number of pixels (192x168 = 32256) of the input image. The results show that both performance and compression ratio are fully compatible with state-of-the-art methods for face recognition.
Figure 3.7: Face data set. Training set (Green), Classification set (Blue) and
Misclassified faces (Red).
Figure 3.8: Classification ratio using a spatial subsampling strategy where the network activity is read out over multiple sub-regions.
Figure 3.9: Classification ratio using the cropped faces from the Yale database B, shown per subset for input thresholds Ti = 0.2, 0.4, 0.6 and 0.8 and for 1, 4, 16 and 64 sub-regions (Sr).
3.4 Discussion
In this paper, we addressed the question of whether we could reliably recognize faces using a biologically plausible cortical neural network exploiting the Temporal Population Code. Moreover, we investigated whether this could be performed in real time using the iCub robot. We have shown that in our cortical model, the representation of a static visual stimulus is generated based on the temporal dynamics of neuronal populations, given by a specific signature in the phase relationships of the neuronal responses.
In the specific benchmark evaluated here, the model showed a good performance both in terms of the ratio of correct classification and processing
Table 3.2: Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method. Entries are classification ratios (%) against illumination variation.

Method | Subset 1&2 | Subset 3 | Subset 4
Correlation | 100.0 | 76.7 | 23.7
Eigenfaces | 100.0 | 74.2 | 24.3
Eigenfaces w/o 1st 3 | 100.0 | 80.8 | 33.6
Linear Subspace | 100.0 | 100.0 | 85.0
Cones-attached | 100.0 | 100.0 | 91.4
Harmonic images (no cast shadows) | 100.0 | 100.0 | 96.4
9PL (simulated images) | 100.0 | 100.0 | 97.2
Harmonic images (with cast shadows) | 100.0 | 100.0 | 97.3
TPC | 100.0 | 100.0 | 98.0
Gradient Angle | 100.0 | 100.0 | 98.6
Cones-Cast | 100.0 | 100.0 | 100.0
9PL (real images) | 100.0 | 100.0 | 100.0
time. Importantly, we showed that the TPC model can recognize faces with a correct classification ratio of 97% without any face-specific mechanism in the network. Indeed, similar filter properties have been used for character recognition and spatial cognition tasks. The processing time needed for classifying an input of 128x128 pixels was about 200 ms. In comparison, generic face recognition methods such as correlation and eigenfaces show reduced recognition performance (Lee et al., 2005).
In the implementation reported here we used only one frequency resolution
to model the receptive field characteristics of simple cells. This reduces the
number of operations and therefore improves the recognition speed. The
optimal frequency response was based on previous studies of V1 (Bayerl
and Neumann, 2004).
We expect that the performance and generality of the model can be further improved in a future implementation using multiple processors in parallel, especially since the features could be extracted by Gaussian filters optimally arranged over different scales, similar to the SIFT transform (Lowe, 2004). With such a scheme we could achieve scale invariance without relying on re-scaling the cropped face image. However, the ideal architecture for TPC is a neuromorphic implementation where the integrate-and-fire units can work fully in parallel, which calls for more specialized parallel hardware.
Our results confirm that TPC is a relevant hypothesis on the encoding of
sensory events by the brain of both vertebrates and invertebrates given its
performance, speed of processing and consistency with cortical anatomy
and physiology. Moreover, it reinforced the main characteristic of the
TPC: its independence of the exact detailed properties of the input stimulus. The network does not need to be rewired to encode a very complex
natural stimulus that is a human face and thus qualifies as a canonical
cortical input processor.
Chapter 4

Using a temporal population code to recognize gestures on the humanoid robot iCub
4.1 Introduction
Gesture recognition is a core human ability that improves our capacity for interaction and communication. Through gestures we can reveal unspoken thoughts that are crucial for verbal communication (Mitra et al., 2012), or even interact with machines (Kurtenbach and Hulteen, 1990). Gesture also plays a role in shaping our knowledge during the learning phase of childhood (Goldin-Meadow, 2009).
Different studies have been dedicated to understanding and categorizing human gestures from a psychological point of view. According to Cadoz (1984) and McNeill (1996), gestures can be used to communicate meaningful information (semiotic), to manipulate the physical world and create artifacts (ergotic), or to learn from the environment through tactile or haptic exploration (epistemic).
Gonzalez Rothi et al. (1991) suggest the existence of different neural mechanisms for imitating either meaningful or meaningless actions. Both meaningless and meaningful gestures can be imitated using a direct route that connects the visual analysis to the innervatory patterns, whereas meaningful gestures can also be imitated using the semantic route, which comprises different processing stages.
Independent of the kind of gesture, meaningful or not, higher cognitive
areas require a sophisticated representation of the visual content. Furthermore, this representation must be invariant to a number of transformations caused by varying viewing angles, different scene configurations,
light sources and deformations in form.
In earlier work we proposed a model of the primary visual cortex that makes use of known cortical properties to perform a spatial-to-temporal transformation into a representation invariant to rotation, translation and
scale. We showed that the dense excitatory local connectivity found in the
primary visual cortex of the mammalian brain can play a specific role in
the rapid and robust transformation of spatial stimulus information into
a so-called Temporal Population Code (or TPC) (Wyss et al., 2003a).
In this study, we propose to apply the high-capacity encoding provided by TPC to a gesture recognition task. We want to evaluate whether the performance of the TPC, both in terms of recognition rate and computation speed, is feasible for real-time applications. To our knowledge, there are very few models of object recognition that take into account the anatomy of biological systems. Most current systems try to reproduce the characteristics of human gesture recognition abilities through purely algorithmic solutions (Maurer et al., 2005).
The model proposed here is benchmarked using the TPC library developed
for the iCub robot (Luvizotto et al., 2011). The task is motivated by the
Rock-Paper-Scissors hand game where the human player competes with
the robot. Here we show the initial results of the proposed system using
images obtained from the robot’s camera in a real game scenario. We
Figure 4.1: The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture which defeats that of the opponent: rock breaks scissors, scissors cut paper and paper covers rock. Unlike a truly random selection method, such as coin flipping or dice, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent.
present a detailed description of the model used in the experiments and
the statistical analysis of the recognition performance using a standard
cluster algorithm. The results are then compared to a predefined baseline
classification rate. We then finish the chapter with a discussion of the
main findings of this ongoing study.
4.2 Material and methods
The model consists of a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 4.2) (Wyss et al., 2003c,a). In the first stage we segment the hand from the background using color segmentation. The input image is transformed into hue, saturation and value (HSV) space and then thresholded. The threshold values for the three parameters H, S and V can be set at runtime using a graphical interface.
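The HSV thresholding can be sketched with OpenCV as below; the bounds shown are placeholders standing in for the values tuned at runtime through the graphical interface, and the function name is illustrative.

```python
import cv2
import numpy as np

def segment_hand(bgr, lo=(0, 40, 60), hi=(25, 255, 255)):
    """Segment skin-colored pixels by thresholding in HSV space.
    lo/hi are illustrative (H, S, V) bounds, not the thesis values."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
    # keep only the segmented hand, black out the background
    return cv2.bitwise_and(bgr, bgr, mask=mask)
```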
Figure 4.2: Visual model diagram. In the first step, the input image is color
filtered in the hue, saturation and value (HSV) space and segmented from the
background. The segmented image is then projected to the LGN stage where its
edges are enhanced. In the next stage, the LGN output passes through a set of
Gabor filters that resemble the orientation selectivity characteristics found in the
receptive field of V1 neurons. Here we show the output response of one Gabor
filter as input for the V1 spiking model. After the image onset, the sum of the
V1 network’s spiking activity over time gives rise to a temporal representation
of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet
readout system.
The segmented hand is then projected to the V1 spiking model, whose coding concept is illustrated in Fig. 4.3. The network is an array of N×N model neurons connected within a circular neighborhood with synapses of equal strength and instantaneous excitatory conductance. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons. The stimulus is continuously presented to the network and the spatially integrated spreading activity of the V1 units, as a sum of their action potentials, results in the so-called TPC signal.
In the network, each neuron is approximated using the simple spiking
model proposed by Izhikevich (Izhikevich, 2003). These modeled neurons
Figure 4.3: Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto a network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window.
are biologically plausible and as computationally efficient as integrate-and-fire models. Relying on only four parameters, the model can reproduce the spiking behavior of known types of cortical neurons (Izhikevich, 2004) using a system of ordinary differential equations of the form:
\[ v' = 0.04v^2 + 5v + 140 - u + I \tag{4.1} \]
\[ u' = a(bv - u) \tag{4.2} \]

with the auxiliary after-spike resetting:

\[ \text{if } v \geq 30 \text{ mV, then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases} \tag{4.3} \]
Here, $v$ and $u$ are dimensionless variables, $a$, $b$, $c$ and $d$ are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, and $' = \frac{d}{dt}$, where $t$ is the time. The parameter $a$ describes the time scale of the recovery variable $u$. The parameter $b$ describes the sensitivity of the recovery variable $u$ to the sub-threshold fluctuations of the membrane potential $v$. The parameter $c$ accounts for the after-spike reset value of $v$ caused by the fast high-threshold $K^+$ conductances, and $d$ for the after-spike reset of the recovery variable $u$ caused by slow high-threshold $Na^+$ and $K^+$ conductances. The mathematical analysis of the model can be found in (Izhikevich, 2006).
The excitatory input $I$ in Eq. 4.1 consists of two components: a constant driving excitatory input $g_i$ and the synaptic conductances given by the lateral interaction of the units, $g_c(t)$:

\[ I(t) = g_i + g_c(t) \tag{4.4} \]
For the simulations, we used the parameters suggested in (Izhikevich, 2004)
to reproduce regular (RS) spiking behavior (Fig. 4.4). All the parameters
used in the simulations are summarized in Table 4.1.
Figure 4.4: Computational properties after frequency adaptation of the integrate-and-fire neuron used in the simulations. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz).
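A minimal simulation of Eqs. 4.1–4.3 with the regular-spiking parameters of Table 4.1 can be written as follows. The 0.5 ms sub-stepping of v follows Izhikevich's published reference implementation; treating the input of Eq. 4.4 as a precomputed array is an assumption of this sketch.

```python
import numpy as np

def izhikevich_rs(I, T=200, dt=1.0):
    """Izhikevich neuron (Eqs. 4.1-4.3) with regular-spiking parameters.
    I: input per time step (Eq. 4.4, g_i + g_c); returns spike times (ms)."""
    a, b, c, d = 0.02, 0.2, -65.0, 8.0   # RS parameters (Table 4.1)
    v, u = -70.0, -16.0                  # initial state (Table 4.1)
    spikes = []
    for t in range(int(T / dt)):
        # two 0.5 ms sub-steps for v, as in Izhikevich's reference code
        v += 0.5 * dt * (0.04 * v * v + 5 * v + 140 - u + I[t])
        v += 0.5 * dt * (0.04 * v * v + 5 * v + 140 - u + I[t])
        u += dt * a * (b * v - u)
        if v >= 30.0:                    # after-spike resetting (Eq. 4.3)
            spikes.append(t * dt)
            v, u = c, u + d
    return spikes

# constant drive g_i = 20, no lateral input; Fig. 4.4 reports ~40 Hz firing
print(izhikevich_rs(np.full(200, 20.0)))
```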
In our model each unit is characterized by an orientation selectivity angle $\phi \in \{0, \frac{\pi}{6}, \frac{2\pi}{6}, \ldots, \pi\}$ and a bi-dimensional vector $x \in \mathbb{R}^2$ specifying its receptive field's center location within the input plane. So each unit in the network is defined by $u(x, \phi)$, which we envision as approximating the functional notion of a cortical column.
The receptive field of each unit is approximated using the second derivative
of an isotropic Gaussian filter $G_\sigma$. The input image is convolved with $G_\sigma$ to
reproduce the V1 receptive field’s characteristics for orientation selectivity.
The input stimulus is a retinal gray scale image that fills the whole visual
field.
The convolution with $\partial^2_{x(\phi)} G_\sigma$ is computed by twice applying a sampled first derivative of a Gaussian ($\sigma = 0.7$, size 5x5 pixels) over the six orientations $\phi$ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold $T_i \in [0, 1]$, where the values above $T_i$ are set to $E_x$. Each orientation-selective output is projected onto a specific network layer.
The lateral connectivity between V1 units is exclusively excitatory with strength $w$. A unit $u_a(x, \phi)$ connects with $u_b$ if all of the following conditions are met:

1. they are in the same population: $\phi_a = \phi_b$
2. they have different center positions: $x_a \neq x_b$
3. they lie within a region of a certain radius: $\|x_b - x_a\| < r$

In our model we set the connectivity radius $r$ to 11 units, as used in previous TPC studies (Luvizotto et al., 2011). The lateral synapses are of equal strength $w$ and the transmission delays $\tau_a$ are proportional to $\|x_b - x_a\|$ with 1 ms/cell.
The TPC is generated by summing the network's activity over a time window of 64 ms. Finally, the output vectors from the different orientation layers are summed, forming the TPC vector. In this way the temporal representation of the TPC is naturally invariant to position. Theoretically, rotation invariance is achieved by increasing the number of orientation-selective layers, approaching a fully invariant architecture as the number of layers grows.
Wavelet Readout
The wavelet transform is a spectral analysis technique that uses variable-sized regions as the kernel of the transformation. In comparison with a standard Short Time Fourier Transform (STFT), the wavelet transform does not use a time-frequency but rather a time-scale region (Mallat, 1998). One major advantage is the ability to perform local analysis with optimal resolution in both the time and frequency domains, rendering a time-frequency representation (Fig. 4.5).

In the model presented here, we used a neuronal implementation of a Haar wavelet transform. This circuit has recently been proposed in a TPC study (Luvizotto et al., 2012c), detailed in chapter 2. In this circuit, the TPC signal with $N = 64$ samples (sampling rate of 1 kHz) passes through a set of filters which slice the frequency spectrum into a desired number of bands, i.e., the approximation and detail levels. The approximations are the high-scale, low-frequency components of the signal spectrum; the details are the low-scale, high-frequency components. At each level $l$ of the filtering process a new frequency band emerges and is represented by $N/2^l$ wavelet coefficients. A dense representation is achieved by selecting the coefficients that carry the greatest amount of information needed to reconstruct the signal, without unwanted redundancies and artifacts.
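The dyadic Haar decomposition used for the readout can be sketched in plain NumPy as below; this is an algorithmic illustration of the transform, not the neuronal buffer circuit of chapter 2, and the orthonormal normalization convention is an assumption.

```python
import numpy as np

def haar_dwt(signal, levels=5):
    """Dyadic Haar decomposition of a length-N signal (N = 64 here).
    Returns the level-`levels` approximation and the details per level."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        pairs = approx.reshape(-1, 2)
        # orthonormal Haar filters: difference and average of sample pairs
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    return approx, details

# a 64-sample TPC trace: level 5 leaves 2 approximation coefficients
tpc = np.random.rand(64)
a5, _ = haar_dwt(tpc, levels=5)
print(a5.shape)  # (2,)
```

With six orientation layers contributing two level-5 approximation coefficients each, the readout used in the experiments below is a 12-value vector.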
Stimulus set
The stimulus set used in the experiment consists of the three different hand gesture classes used in the Rock-Paper-Scissors game (Fig. 4.1).
Figure 4.5: Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with a frequency range equal to the difference in frequency between the current and previous approximation levels.
For each class, 40 stimuli with natural variations of the hand's posture are used. There are in total 4 subjects. In the experiments, 75% of the images are used for training and the remaining 25% are used as the classification set. Given the compact representation of the TPC, the training set composed of 90 pictures can be generated offline and stored in memory (only 20 Kb) for online recognition.
Variable | Description | Value
N | network dimension | 96x72 neurons
a | time scale of the recovery variable | 0.02
b | sensitivity of the recovery variable | 0.2
c_rs | after-spike reset value of v | -65
d_rs | after-spike reset value of u | 8
v | membrane potential | -70
u | membrane recovery variable | -16
gi | excitatory input conductance | 20
Ti | minimum V1 input threshold | 0.4
r | lateral connectivity radius | 11 units
w | synapsis strength | 0.4

Table 4.1: Parameters used for the simulations.
Figure 4.6: The stimulus classes (Paper, Rock, Scissors) used in the experiments. Here 12 exemplars for each class are shown.
Cluster algorithm
For the classification we used the following algorithm. The network's responses to stimuli from $C$ stimulus classes $S_1, S_2, \ldots, S_C$ are assigned to $C$ response classes $R_1, R_2, \ldots, R_C$ of the training set, yielding a $C \times C$ hit matrix $N(S_\alpha, R_\beta)$, whose entries denote the number of times that a stimulus from class $S_\alpha$ elicits a response in class $R_\beta$. Initially, the matrix $N(S_\alpha, R_\beta)$ is set to zero. For each response $r \in S_\alpha$, we calculate the mean Euclidean distance of $r$ to the responses $r' \neq r$ elicited by stimuli of class $S_\gamma$:

\[ \rho(r, S_\gamma) = \langle \| \rho(r, r') \| \rangle_{r' \text{ elicited by } S_\gamma} \tag{4.5} \]

where $\langle \cdot \rangle$ denotes the average over the Euclidean distances $\rho(r, r')$ between $r$ and $r'$. The response $r$ is classified into the response class $R_\beta$ for which $\rho(r, S_\beta)$ is minimal, and $N(S_\alpha, R_\beta)$ is incremented by one. The overall classification ratio in percent is calculated by summing the diagonal of $N(S_\alpha, R_\beta)$ and dividing by the total number of elements in the classification set $R$. We chose the same metric that was used in previous studies to establish a direct comparison between the results over different scenarios.
4.3 Results
In the first part of the experiments, we established a classification baseline against which to compare and evaluate the classification performance provided by TPC. We define the baseline by applying the cluster algorithm (see section 4.2 for details) over the pixel intensity values of the gray-scale images used in the experiments. In this situation, an image of 96x72 pixels is represented by a vector of size 6912. Using 75% of the images as the training set, the results show a correct classification ratio of 70% (Fig. 4.7). In the next step, we run the simulations using the TPC to compare the classification, speed and compression ratio of the model against the baseline.
In our simulations we used the wavelet readout circuit at approximation level 5 (Dc5) to capture the intrinsic geometric properties of the stimulus. At this level, the signal ranges from 15 Hz to 31 Hz and is represented by only 2 wavelet coefficients. As we have 6 signals provided by the different orientation layers, the total size of the readout signal is 12.
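To make the dyadic reduction of Fig. 4.5 concrete, here is a minimal C++ sketch of the Haar approximation cascade, assuming a 64-sample TPC signal per orientation layer (the function name is ours, not part of the library):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Haar low-pass cascade: each level halves the signal length (dyadic).
    std::vector<double> haarApproximation(std::vector<double> x, int levels) {
        for (int l = 0; l < levels; ++l) {
            std::vector<double> a(x.size() / 2);
            for (std::size_t k = 0; k < a.size(); ++k)
                a[k] = (x[2 * k] + x[2 * k + 1]) / std::sqrt(2.0);
            x = a;
        }
        return x;
    }

    // With a 64-point TPC signal, haarApproximation(tpc, 5) yields the 2
    // level-5 approximation coefficients; concatenating the 6 orientation
    // layers gives the 12-dimensional readout vector used for classification.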
The simulations were performed with the parameters of Table 4.1, using a TPC library developed in C++.
[Figure 4.7 here: hit matrix, stimulus class vs. response class, for the baseline.]
Figure 4.7: Classification hit matrix for the baseline. In the classification process, the correlation between the classification set (25% of the stimuli) and the training set (75%) is calculated. A stimulus is assigned to the class with the highest mean correlation to the respective class in the training set.
The performance reaches a maximum of 93% correct classification. The misclassifications occur mostly for class one, i.e. the Rock gesture (Fig. 4.8). This gesture intersects to a great extent with the other two and is therefore more prone to misclassification. This classification ratio is compatible with previous TPC work, for instance on face and handwritten character recognition (Wyss et al., 2003c; Luvizotto et al., 2011). On a 2.67 GHz Core i7 processor, the processing time per hand is about 0.3 s. In comparison to the baseline, the TPC representation shows a gain in correct classification of 23% while reducing the representation by a factor of 576, i.e. from 6912 pixels at a resolution of 96x72 to 12 Dc5 wavelet coefficients.
[Figure 4.8 here: hit matrix, stimulus class vs. response class, for the TPC representation.]

Figure 4.8: Classification hit matrix. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.
4.4 Discussion
We have shown that in a cortical model, the representation of a static
visual stimulus can be generated using the temporal dynamics of a neuronal population. This temporal encoding has a specific signature in the
phase relationships between the active neurons of the underlying encoding
substrate.
The TPC is a relevant hypothesis on the encoding of visual information. It is consistent with known cortical characteristics, both anatomical and physiological. Moreover, the final representation given by the wavelet circuit compresses the signal, removing redundant and non-orthogonal components and greatly improving the trade-off between performance and size.
In conclusion, simple hand gestures can be reliably classified using the TPC approach. With the highly optimized C++ library we developed for the experiments, the computational time needed for the TPC computation is fully compatible with online gesture recognition. Hence, TPC has proved to be robust and fast enough as a gesture recognition mechanism for the iCub without affecting or compromising the game play.
Chapter 5

A framework for mobile robot navigation using a temporal population code
Recently, we have proposed that the dense local and sparse long-range
connectivity of the visual cortex accounts for the rapid and robust transformation of visual stimulus information into a temporal population code,
or TPC. In this paper, we combine the canonical cortical computational
principle of the TPC model with two other systems: an attention system and a hippocampus model. We evaluate whether the TPC encoding
strategy can be efficiently used to generate a spatial representation of the
environment. We benchmark our architecture using stimulus input from
a real-world environment. We show that the mean correlation of the TPC
representation in two different positions of the environment has a direct
relationship with the distance between these locations. Furthermore, we
show that this representation can lead to the formation of place cells. Our
results suggest that TPC can be efficiently used in a high complexity task
such as robot navigation.
5.1 Introduction
Biological systems have an extraordinary capacity for performing simple and complex tasks in a fast and reliable way. These capabilities are largely due to the robust, invariant processing of sensory information. In the mammalian brain, the representation of dynamic scenes in the visual world is processed invariantly to a number of deformations such as perspective, lighting conditions, rotations and scales. This robustness leads to successful strategies for optimal foraging. For instance, a foraging animal has to explore the environment, search for food, escape from possible predators and produce an internal representation of the world that allows it to successfully navigate through it.
Performing navigation with mobile robots using only the visual input from a camera is a complex task (Bonin-Font et al., 2008). Recently, a number of models have been proposed to solve this task (Jun et al., 2009; Koch et al., 2010). Unlike the visual system, most of these methods are based on brute-force feature matching, which is computationally expensive and stimulus dependent.
In contrast, we have proposed a model of the visual cortex that can generate reliable invariant representations of visual information. The temporal population code, or TPC, is based on the unique anatomical characteristics of the neo-cortex: dense local connectivity and sparse long-range connectivity (Binzegger et al., 2004). Physiological studies have estimated that only a few percent of the synapses that make up cortical circuits originate outside of the local volume (Schubert et al., 2007; Liu et al., 2007; Nauhaus et al., 2009). The TPC proposal has demonstrated the ability of densely coupled networks to rapidly transform the geometric organization of the population response, for instance as induced by external stimuli, into a robust and high-capacity encoding (Wyss et al., 2003a,c).
It has been shown that TPC provides an efficient encoding and can generalize to navigation tasks in virtual environments. In a previous study,
the TPC model applied to a simulated robot in a virtual arena could account for the formation of place cells (Wyss et al., 2006). It has also been shown that TPC can generalize to realistic tasks such as handwritten character classification (Wyss et al., 2003a). The key parameters that control this encoding, and the invariances that it can capture, are the topology and the transmission delays of the laterally coupled neurons that constitute an area. In this respect TPC emphasizes that the neo-cortex exploits specific network topologies, as opposed to the randomly connected networks found in examples of so-called reservoir computing (Lukoševičius and Jaeger, 2009).
In this paper, we evaluate whether the TPC can be generalized to a navigation task in a natural environment. In particular we assess whether it can be used to extract spatial information from the environment. In order to achieve this generalization we combine the TPC, as a model of the ventral visual system, with a feedforward attention system and a hippocampus-like model of episodic memory. The attention system uses early simple visual features to determine which salient regions in the scene the TPC framework will encode. The spatial representation is then generated by a model of the hippocampus that develops place cells. Our model has been developed to mimic the foraging capabilities of a rat. Rodents are capable of optimally exploring the environment and its resources using distal visual cues for orientation. The images used in the experiments were acquired in a real-world environment by the camera of a mobile robot called the Synthetic Forager (SF) (Rennó-Costa et al., 2011). Our results show that the combination of TPC with a saliency-based attention mechanism can be used to develop robust place cells. These results directly support the hypothesis that the computational principle proposed by TPC leads to a spatial-temporal coding of visual features that can reliably account for position information in a real-world environment. In contrast to previous TPC navigation experiments, here we stress the capabilities of position representation in a real-world environment.
In the next section we present a description of the three systems we use in the experiments, detailing how the environment was set up and how the images were generated. In the following section we present the main results obtained in the experiments, and we finish by discussing the core findings of the study.
5.2 Methods
Architecture overview
The architecture presented here comprises three areas (Fig. 5.1): an attention system, an early visual system and a model of the hippocampus. The attention system exchanges information with the visual system in a bottom-up fashion. It receives bottom-up input of simple visual features and provides saliency maps (Itti et al., 1998). The saliency maps determine the subregions that are rich in features such as colors, intensity and orientations, which pop out from the visual field and therefore will be encoded in the visual system.

The visual system also provides input to the hippocampal model in a bottom-up way. The subregions of interest determined by the attention system are translated into temporal representations that feed the hippocampal input stage. The temporal representation of the visual input is generated using the so-called Temporal Population Code, or TPC (Wyss et al., 2003a,c; Luvizotto et al., 2011). The hippocampus model translates the visual representation of the environment into a spatial representation based on the activity of place cells (de Almeida et al., 2009a; Rennó-Costa et al., 2010). These granule cells respond only to one or a few positions in the environment. In the following sections we present the details and modeling strategies used for each area.
[Figure 5.1 here: architecture scheme showing the attention system (early visual features → saliency maps), the visual system (salient subregions → TPC over time), and the hippocampus (place cells #1-#4), with bottom-up and top-down pathways between them.]

Figure 5.1: Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by the formation of place cells, which show high firing rates whenever the robot is in a specific location in the environment.
Attention System
In our context, attention is the process of focusing the sensory resources on a specific point in space. We propose a deterministic attention system that is driven by the local points of maximum saliency in the input image. To extract these subregions of interest, the attention system uses a saliency-based model proposed and described in detail in (Itti et al., 1998). The system makes use of image features extracted by the early primate visual system, combining multi-scale image features into a topographical saliency map (bottom-up). The output saliency map has the same aspect ratio as the input image. To select the center points of the subregions, we calculate the points of local maxima p_1, p_2, ..., p_21 of the saliency map. The minimum accepted distance between peaks in the local maxima search is 20 pixels. This competitive process among the salient points can be interpreted as a top-down push-pull mechanism (Mathews et al., 2011).
A subregion of size 41 × 41 is determined by a neighborhood of 20 pixels surrounding a local maximum p. Once a region is determined, the points inside this area are not used in the search for the next maximum p_{k+1}. This constraint avoids processing redundant neighboring regions with overlaps bigger than 50% in area. For the experiments, the first 21 local maxima are used by the attention system. These 21 points are the centers of the subregions of interest cropped from the input image and sent to the TPC network. We used a Matlab implementation of the saliency maps (Itti et al., 1998; the scripts can be found at http://www.klab.caltech.edu/~harel/share/gbvs.php). In the simulations we use the default parameters, except for the map width, which is set to the same size as the input image: 967 pixels.
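Here is a minimal C++ sketch of that greedy selection, assuming the local maxima of the saliency map have already been extracted (names and data layout are illustrative):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    struct Peak { int x, y; float value; };

    // Greedily accept up to 21 peaks, strongest first, rejecting any
    // candidate that falls inside the suppressed neighborhood of an
    // accepted peak so the 41x41 subregions overlap by less than 50%.
    std::vector<Peak> selectPeaks(std::vector<Peak> candidates,
                                  int maxPeaks = 21, int minDist = 20) {
        std::sort(candidates.begin(), candidates.end(),
                  [](const Peak& a, const Peak& b) { return a.value > b.value; });
        std::vector<Peak> selected;
        for (const Peak& p : candidates) {
            bool tooClose = false;
            for (const Peak& s : selected)
                if (std::abs(p.x - s.x) < minDist && std::abs(p.y - s.y) < minDist) {
                    tooClose = true;
                    break;
                }
            if (!tooClose)
                selected.push_back(p);
            if (static_cast<int>(selected.size()) == maxPeaks)
                break;
        }
        return selected; // centers of the subregions cropped for the TPC network
    }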
Visual System
The early visual system consists of two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 5.2) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray-scale image that covers the visual field. The approximation of the receptive field's characteristics is performed by convolving the input image with a difference-of-Gaussians (DoG) operator followed by a positive half-wave rectification. The positively rectified DoG operator resembles the properties of ON center-surround LGN cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and performs an edge enhancement on the
input image. In the simulations we use a kernel ratio of 4:1, a kernel size of 7x7 pixels and variance σ = 1 (for the smaller Gaussian).
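For illustration, a minimal C++ sketch of building such a DoG kernel under those parameters (the function name is ours, not part of the TPC library):

    #include <array>
    #include <cmath>

    // 7x7 difference-of-Gaussians kernel, sigma ratio 4:1, sigma = 1 for
    // the smaller (center) Gaussian: an ON center-surround profile.
    std::array<std::array<double, 7>, 7> dogKernel(double sigma = 1.0,
                                                   double ratio = 4.0) {
        const double pi = 3.14159265358979;
        std::array<std::array<double, 7>, 7> k{};
        for (int y = -3; y <= 3; ++y)
            for (int x = -3; x <= 3; ++x) {
                double r2 = double(x * x + y * y);
                double s1 = sigma, s2 = sigma * ratio;
                double center   = std::exp(-r2 / (2 * s1 * s1)) / (2 * pi * s1 * s1);
                double surround = std::exp(-r2 / (2 * s2 * s2)) / (2 * pi * s2 * s2);
                k[y + 3][x + 3] = center - surround;
            }
        return k;
    }

    // Convolving the image with this kernel and applying a positive
    // half-wave rectification, max(0, response), yields the LGN output.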
[Figure 5.2 here: pipeline from the input image through LGN filtering to the V1 array of spiking neurons, whose spikes over time form the TPC.]

Figure 5.2: Visual model overview. The salient regions are detected and cropped from the input image provided by the camera. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference-of-Gaussians (DoG) operator that approximates the properties of the receptive field of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and their spiking activity is summed over a specific time window, rendering the Temporal Population Code.
The LGN signal is projected to the V1 spiking model, which uses the same network based on Izhikevich neurons presented in chapters 2 and 4.

For the simulations, we used the parameters suggested in (Izhikevich, 2004) to reproduce regular spiking (RS) behavior. All the parameters used in the simulations are summarized in Table 5.1. The lateral connectivity between V1 units is exclusively excitatory with strength w. A unit u_a(x) connects with u_b if they have a different center position, x_a ≠ x_b, and if they are within a region of a certain radius, ‖x_b − x_a‖ < r.

The temporal population code is generated by summing the network activity in a time window of 128 ms. Thus a TPC is a vector with the spike count of the network for each time step. In discrete time, all the equations are integrated with Euler's method using a temporal resolution of 1 ms.
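A minimal C++ sketch of one 1-ms Euler step of such a neuron follows; the RS parameters are taken from Table 5.1, with the after-spike increment d = 8 assumed from the standard RS parameter set of (Izhikevich, 2004):

    struct Neuron { double v = -70.0, u = -16.0; };

    // One Euler step (dt = 1 ms) of the Izhikevich model:
    // dv/dt = 0.04 v^2 + 5 v + 140 - u + I,  du/dt = a (b v - u)
    bool eulerStep(Neuron& n, double I,
                   double a = 0.02, double b = 0.2,
                   double c = -65.0, double d = 8.0) {
        n.v += 0.04 * n.v * n.v + 5.0 * n.v + 140.0 - n.u + I;
        n.u += a * (b * n.v - n.u);
        if (n.v >= 30.0) {   // spike: reset and report
            n.v = c;
            n.u += d;
            return true;     // the TPC accumulates these spikes per 1-ms bin
        }
        return false;
    }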
Variable | Description                                  | Value
N        | network dimension                            | 41x41 neurons
a        | scale of the recovery                        | 0.02
b        | sensitivity of the recovery                  | 0.2
crs      | after-spike reset value of v for RS neurons  | -65
cbs      | after-spike reset value of v for BS neurons  | -55
v        | membrane potential                           | -70
u        | membrane recovery                            | -16
gi       | excitatory input conductance                 | 20
Ti       | minimum V1 input threshold                   | 0.4
r        | lateral connectivity radius                  | 7 units
w        | synaptic strength                            | 0.3

Table 5.1: Parameters used for the simulations.
Hippocampus model
The hippocampus model is an adaptation of a recent study on rate remapping in the dentate gyrus (de Almeida et al., 2009a,b; Rennó-Costa et al., 2010). In our implementation, the granule cells receive excitatory input from the grid cells of the entorhinal cortex. The place cells that are active for a given position in the environment are then determined according to the interaction of the summed excitation and inhibition, using the E%-max winner-take-all (WTA) rule based on the percentage of maximal suprathreshold excitation. The excitatory input received by the i-th place cell from the grid cells is given by:
I_i(r) = Σ_{j=1}^{n} W_{ij} G_j(r)          (5.1)
where W_ij is the synaptic weight of each input. In our implementation, the weights are either random values in the range [0, 1] (the maximum bin size among normalized histograms) or 0 in case of no connection, and we use the TPC histograms as the information provided by the grid cells G_j(r). The activity of the i-th place cell is given by:
F_i(r) = I_i(r) · H( I_i(r) − (1 − k) I_max(r) )          (5.2)

where H is the Heaviside step function and I_max(r) denotes the maximal excitation received by any cell at position r.
The constant k varies in the range of 5% to 15% (de Almeida et al., 2009b). It is referred to as E%-max and determines which cells will fire. Specifically, the rule states that a cell fires if its feedforward excitation is within E% of the cell receiving maximal excitation.
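A minimal C++ sketch of this rule, given the excitation vector I of Eq. 5.1 (names illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // E%-max WTA (Eq. 5.2): cell i fires with rate I[i] iff its excitation
    // is within a fraction k (0.05-0.15) of the maximally excited cell.
    std::vector<double> placeCellActivity(const std::vector<double>& I, double k) {
        double Imax = *std::max_element(I.begin(), I.end());
        std::vector<double> F(I.size(), 0.0);
        for (std::size_t i = 0; i < I.size(); ++i)
            if (I[i] >= (1.0 - k) * Imax)   // Heaviside threshold
                F[i] = I[i];
        return F;
    }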
Experimental environment
For the experiments we produced a dataset of pictures from a square indoor area of 3 m × 3 m. The area was divided into a 5x5 grid of 0.6 m × 0.6 m square bins, comprising 25 sampling positions (Fig. 5.3a). The distance to objects and walls ranged from 30 cm to 5 meters. To emulate the sensory input we approximated the rat's view at each position with 360-degree panoramic pictures. Rats use distal visual cues for orientation (Hebb, 1932), and such information is widely available to the rat given the spaced positioning of its eyes, resulting in a wide field of view and low binocular vision (Block, 1969). The pictures were built using the software Hugin (D'Angelo, 2010) to process 8 pictures with equally spaced orientations (45-degree steps) for each position. The images were acquired using the SF robot (Rennó-Costa et al., 2011). Each indoor image was segmented into 21 subregions of 41 × 41 pixels by the attention system as described earlier. The maximum overlap among the subregions is 50% (Fig. 5.3b).
Image representation using TPC
For every image, 21 TPC vectors are generated, i.e. one for each subregion selected by the attention system. In total, we acquired 25 panoramic images of the environment, leading to a matrix of 525 TPC vectors (25x21). We cluster the TPC matrix into 7 classes using the k-means clustering algorithm. Therefore, a panoramic image of the environment is
represented by the cluster distribution of its TPC vectors using a histogram of 7 bins. The histograms are used as a representation of each
position in the environment (Fig. 5.3b). They are the inputs for the
hippocampus model.
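Assuming the k-means labels of the 21 TPCs of an image have been computed offline, the position signature reduces to a label histogram; a minimal C++ sketch (names illustrative):

    #include <vector>

    // Build the 7-bin signature of one panoramic image from the k-means
    // cluster labels of its 21 TPC vectors.
    std::vector<double> positionHistogram(const std::vector<int>& labels,
                                          int k = 7) {
        std::vector<double> hist(k, 0.0);
        for (int label : labels)
            hist[label] += 1.0;
        for (double& h : hist)
            h /= labels.size();   // normalize; this histogram feeds the
        return hist;              // hippocampus model as G_j(r)
    }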
[Figure 5.3 here: a) layout of the 3 m × 3 m environment used in the experiments, with the 25 numbered sampling positions on the 0.6 m grid; b) panoramic image of position 2 with its 21 salient regions marked and the resulting histogram of TPCs.]
Figure 5.3: Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m × 0.6 m square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors using a histogram of 7 bins.
5.3 Results
In the first experiment, we explored whether the TPC response preserves bounded invariance in the observation of natural scenes, i.e. whether the representation is tolerant to certain degrees of visual deformation without losing specificity. More specifically, we investigate whether the correlation between the TPC histograms of visual observations at two positions can be associated with their distance. We calculate the mean pairwise correlation among the TPC histograms of the 25 sampling positions used for image acquisition in the environment. We sort the correlation values into
distance intervals of 0.6 m, the same distance intervals used for sampling
the environment.
Our results show that the TPC transformation can successfully account for position information in a real-world scenario. The mean correlation between two positions in the environment decays monotonically with increasing distance (Fig. 5.4). As expected, the error in the correlation measure increases with distance.
[Figure 5.4 here: mean correlation (0-1) as a function of distance (0-3 m).]
Figure 5.4: Pairwise correlation of the TPC histograms in the environment.
We calculate the correlation among all the possible combinations of two positions
in the environment. We average the correlation values into distance intervals,
according to the sampling distances used in the image acquisition. This result
suggests that we can produce a linear curve of distance based on the correlation
values calculated using the TPC histograms.
In the second experiment, we use the TPC histograms as input for the hippocampus model to acquire place cells. Neural representations of position, as observed in place cells, present a quasi-Gaussian transition between inactive and active locations, which makes bounded invariance a key property for a smooth reconstruction of position from visual input as observed in nature. Previous work has shown that the conjunction of TPC responses to synthetic figures in a virtual environment does preserve bounded invariance and that, with the use of reinforcement learning, place cells could emerge from these responses (Wyss and Verschure, 2004a).
We perform the simulations using 100 cells with random weights. The
results show that, of the 100 cells used, 12 have significant place-related activity (Fig. 5.5). These cells show high activity in specific areas of the environment, also called place fields. Thus the TPC representation of natural scenes can successfully lead to the formation of place fields.
[Figure 5.5 here: 3 m × 3 m firing-rate maps of the 12 place cells (#3, #7, #10, #14, #42, #66, #70, #74, #79, #88, #91, #97).]
Figure 5.5: Place cells acquired by the combination of TPC based visual
representations and the E%-max WTA model of the hippocampus.
5.4 Discussion
We addressed the question of whether we could combine the properties of TPC with an attention system and a hippocampus model to reliably provide position information in a navigation scenario. We have shown that in our cortical model, the TPC, the representation of a static visual stimulus is reliably generated based on the temporal dynamics of neuronal populations. The TPC is a relevant hypothesis on the encoding of sensory events given its consistency with cortical anatomy and contemporary physiology.
The information carried in this so-called Temporal Population Code provides a complete yet dense representation. The encoding of visual information is done through a subset of regions of high saliency that pop out from the visual field. Combining the TPC encoding
strategy with a hippocampus model, we have shown that this approach provides a robust representation of position in a natural environment.
Panoramic images have been proposed in other studies of robot navigation (Zeil et al., 2003). Similarly, the TPC strategy proposed here also makes use of pixel differences between panoramic images to provide a reliable cue to location in real-world scenes. In both cases, differences in the direction of illumination and environmental motion can often seriously degrade the position representation. In comparison, however, the representation provided by TPC is extremely compact. A subregion of 41x41 pixels is transformed into a TPC vector of 64 points. Finally, the visual field, an image of 96,000 pixels, is represented by a histogram of 7 bins.
In the specific benchmark evaluated here, the model showed that distance
information could be reliably recovered from the TPC. Importantly, we
showed that the model can produce reliable position information based on
the mean correlation value of TPC based histograms. Furthermore, this
method does not use any particular mechanism specific to the stimuli used
in the experiments. It is totally generic. Also, the processing time needed
for classifying an input of 41x41 pixels was about 50 ms using an offline
Matlab implementation. Therefore, we conclude that our model could be
efficiently used in a real-time task. Hence, our model shows that the brain
might use the smooth degradation of natural images represented by TPCs
as an estimate of distance.
Acknowledgment
This work was supported by EU FP7 projects GOAL-LEADERS (FP7-ICT-97732) and EFAA (FP7-ICT-270490).
Chapter 6

Conclusion
In this dissertation, we have investigated the computational principles that the early visual system employs in order to provide an invariant representation of the visual world. Our starting point was the Temporal Population Code (TPC) model proposed by Wyss et al. (2003b). We first investigated a central open issue: how the intrinsic elements that form the temporal representation provided by the TPC could be accessed and read out by further cognitive areas involved in visual processing. Indeed, since the introduction of the TPC, a neurobiologically compatible readout stage had been a persistent problem. In this thesis, we solved this issue by proposing a novel and biologically consistent readout system based on the Haar wavelet transform. We showed that the efficient representation provided by the TPC signal can be decoded by further areas through a subset of wavelet coefficients.
The implicit concept behind the proposed readout is that the brain uses different frequency bands to convey multiple types of information multiplexed into a temporal signal (Buzsáki, 2006). Indeed, recent results have shown that distinct network activity observed in primary visual cortex individually signals different properties of the input stimulus, such as motion direction and speed and the orientation of object contours, at the same time (Onat
et al., 2011). Therefore, the decoder must be able to select the suitable frequency range in order to access the information needed for each situation. We investigated this idea through a systematic spectral analysis of the TPC signal using the simple, yet powerful, Haar basis, which proved to be a feasible choice providing robust and reliable decoding.
In comparison with the original TPC readout, the wavelet circuit proposed here showed the same performance in the object recognition experiments. Although for bursting neurons the wavelet readout is slower than the original TPC readout, it could extract the key geometric properties of synthetic shapes without compromising the speed of encoding.
The fine details of the mechanism such as its integration time constant and
its wavelet resolution level depend on the network dynamics. How higher
areas in the visual pathways dynamically control the readout frequency
range remains a subject for further investigations. This interaction between bottom-up sensory information and top-down modulation could be
related to an attentional mechanism that would synchronize the inhibitory
neurons in the neuronal wavelet circuit similar to the interneuron gamma
mechanism (Tiesinga and Sejnowski, 2009b).
These findings make a strong point in terms of how the neo-cortex deals with sensory information. Our results show that stimuli can be represented and further classified using a strongly compressed temporal signal based on wavelet coefficients. This strategy allows multiplexing high-level information from the visual input and flexible storage and retrieval of information by a memory system.
In the next part of this dissertation, we explored the TPC as a candidate for a canonical computational mechanism used by the neo-cortex to provide invariant representations of visual information in different scenarios. In chapter 3, we specifically addressed this question using human faces as a test case. Indeed, we investigated whether the TPC model could be used to provide a real-time face recognition system for the iCub robot. In
the specific benchmark evaluated here using human faces, our results confirmed that the representation of a static visual stimulus can be generated
based on the temporal activity of neuronal populations.
One of the main issues solved was the reduction of the computational cost due to the recurrent connections in the large-scale spiking neural network. We optimized the TPC architecture with respect to the lateral connectivity among the modeled neurons using 2D convolutions performed in the frequency domain. Also, instead of using two different filters to simulate the LGN and V1 receptive field characteristics, we redesigned the model, merging these two stages into a single step based on derivatives of Gaussians. With this strategy we could achieve optimal edge enhancement selective to orientation in a single filtering operation, which considerably improves performance. Although the scenario was built around human faces, the TPC network used is totally generic. The model did not rely on any pre-built dictionary of features related to any specific stimulus set.
The developed TPC library was integrated into the brain-based cognitive architecture DAC (Verschure et al., 2003). The tightly coupled layered structure of DAC consists of reactive, adaptive and contextual memory mappings. At each level, more complex and memory-dependent mappings from sensory states are built depending on the internal state of the agent. Here the TPC plays a major role in building a compact representation of sensory and form primitives. In this sense, the TPC/DAC integration can serve as a unified perception system for the humanoid robot iCub.
Indeed, we further explored the capabilities and generality of this architecture in a further scenario to recognize gestures. The experiments were motivated by the game Rock-Paper-Scissors, where a human plays with the robot through a set of three hand gestures that symbolize these three elements. The results showed that the robot can reliably recognize gestures under realistic conditions with a precision of 92%. The only assumption at the moment is that the hands are shown against a black background. This way we can use the color information to indicate the time of stimulus onset. Also the wavelet readout circuit has
This promising work has not been published yet, as we are now implementing the motor behavior of the robot associated with the game. However, preliminary tests indicate that TPC is a highly feasible solution, both in terms of performance and processing time, that allows the iCub to play Rock-Paper-Scissors with humans. This scenario will provide a rich platform to study the interaction between humans and humanoid robots. The challenge of winning may have an important impact on the engagement of humans who play the game against the robot. Thus a reliable gesture recognition system is a crucial element in the assessment of interaction and engagement.
Indeed, given the speed of encoding, the computational cost, the classification performance and the consistency with cortical anatomy and physiology, TPC has proved to be a relevant hypothesis on the encoding of sensory events. Moreover, it reinforces the main aspect of the question we want to address: the generation of a robust representation independent of the exact detailed properties of the input stimulus.
Subsequently we moved on to exploit the generality of TPC in a completely different scenario. If TPC is a reliable way of encoding visual sensory input in a generic way, the representation should also account for bounded invariances and therefore for a smooth reconstruction of position from visual input as observed in nature.
In chapter 5, we proposed a framework that combines the TPC with an attention and a hippocampus model to provide position estimation for navigation tasks. The proposed framework was built to be used on mobile robots relying purely on single-camera input.
In a previous study, the TPC model was applied to a simulated robot in a virtual arena, accounting for the formation of place fields (Wyss et al., 2006). Our results here extended these achievements to a real-world situation, where images of a realistic environment were used as input to feed a fairly realistic hippocampus model (Rennó-Costa et al., 2010). The images captured by the camera were processed to emulate the sensory input
of a rat. At each position in the arena, 360-degree panoramic pictures were used (Hebb, 1932). In a bottom-up fashion, the attention system exchanged information with the visual system, receiving simple visual features and providing salience maps of the panoramic images (Itti et al., 1998). The saliency maps provided a map of subregions rich in features such as colors, intensity and orientations that could be compactly encoded in the visual system by the TPC. The subregions of interest determined by the attention system, translated into temporal representations, fed the hippocampal input stage using the TPC signal, leading to the final position representation expressed in the activity of place cells (de Almeida et al., 2009a).
This strategy, based on distal cues from panoramic images, has been proposed in other studies of robot navigation (Zeil et al., 2003). However, the compression versus performance obtained with the proposed framework is remarkable: the visual field, an image of 96,000 pixels, could be represented by a histogram of 7 bins. With such a compact representation we could achieve a monotonically decreasing correlation curve over distances in different parts of the arena. The smooth reconstruction of distance relationships indeed fulfilled the needs of the hippocampus input in order to generate place fields. This framework suggests that the brain might use the smooth degradation of natural images represented by TPCs to estimate distances and generate representations of position using place cells.
In summary, we proposed that a spatial-to-temporal transformation based on a Temporal Population Code could account for the generation of a robust, compact and non-redundant invariant representation of visual sensory input. We have extensively shown the capabilities of the proposed strategy over different scenarios, in real-world tasks and on different platforms ranging from humanoid and mobile robots to pure computational simulations using artificial objects.
The results presented in this doctoral thesis not only strongly support the idea
that time is indispensable in the construction of invariant representations of sensory inputs, but also suggest a clear hypothesis of how time can be used in parallel with spatial information. In recent neuroscience studies, there is increasing interest in the field of temporal population coding and in how time can be included in models of sensory encoding, as suggested by other authors (see (Buonomano and Maass, 2009) for a comprehensive review). In this sense, we are confident that this thesis will strongly contribute to this discussion, adding novel methods and results to the fields of vision research, computational neuroscience and robotics.
Appendix A

Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction
Robots will increasingly become more integrated in our society. This will require high levels of social competence. The understanding of the core aspects required for social interaction is therefore a crucial topic in the field of robotics. In this article we propose an architecture and an experimental paradigm which allow a humanoid robot to engage humans in well defined and quantified social interactions. The BASSIS architecture is a four-layer structure of increasingly more complex and memory-dependent mappings from sensory states to action. At its core, BASSIS allows the iCub robot to sustain dyadic interactions with humans using the tangible table Reactable, a novel musical instrument. In this setup, the iCub receives from the Reactable a complete set of data about the objects that
receives from the Reactable a complete set of data about the objects that
are on the table. This combination avoids issues regarding object categorization and provides a rich environment were humans can play music or
games with the robot. We benchmark the system performance in a col109
A
distributed adaptive control (dac) integrated into a novel
110
architecture for controlled human-robot interaction
laborative object manipulation and planning task using spoken language.
The results show that the architecture proposed here provides reliable and
complex interaction with controlled and precise action execution. We show
that the iCub can successfully interact with humans grasping and placing
objects on the table with a precision of 2 cm. The software implementation
of the architecture is available online.
A.1 Introduction
We are at the start of a period of radical social change where robots will enter more and more into our daily lives (Bekey and Yuh, 2007; Bar-Cohen and Hanson, 2009). The interactions between humans and robots will become more omnipresent, and robots will take on more roles in our houses, offices, hospitals or schools, requiring the creation of machines socially compatible with humans (Thrun, 2004; Dautenhahn, 2007).

Some studies suggest that humans prefer humanoid robots to other artificial systems in social interactions (Breazeal et al., 2006). Nowadays, this kind of machine is being developed by many research groups around the world and by a number of large corporations and smaller specialist robotics companies (Bekey and Yuh, 2007). In Europe, the FP6 RobotCub project made a major stride forward in providing a means for integrative humanoid research via the iCub platform (Metta et al., 2008). The iCub robot has rich motor and perceptual features that provide all core capabilities to efficiently interact with humans.
In the development of experimental setups for investigating social interaction with humanoid robots, we still face many unsolved technical issues in the treatment of sensory inputs that greatly interfere with the development of the processes of motor control. The categorization of objects is a good example of a hard task which is still not completely solved by current technologies (Dickinson et al., 2009; Pinto et al., 2011). Often the interaction scenario must be simplified to an unrealistic level in order to
be technically feasible. Therefore, to better understand and realize the
key elements of the interaction between humans and robots, we need to
overcome these technical obstacles.
In this paper, we propose a setup in which humans can interact with a robot in a very controlled environment. Together with this environment, we present the technical and methodological aspects of the Biomimetic Architecture for Situated Social Intelligence Systems, called BASSIS. This architecture is based on the DAC architecture (Verschure and Althaus, 2003) and incorporates and implements a set of capabilities which allows the humanoid to engage humans in social interactions, such as games and cognitive tasks.
BASSIS allows the humanoid robot iCub to interact with humans in a controlled environment using an electronic music instrument with a tangible user interface called the Reactable (a trademark of Reactable Systems 2009-2012; http://www.reactable.com/). The Reactable, conceptualized as a musical instrument (Geiger et al., 2010), provides a complete platform where humans and robots can interact, playing music or games and manipulating physical objects that are on the table. The different properties of the objects are displayed on the table-top in a visually alluring graphical interface (Fig. A.1).
In the proposed setup, the iCub is connected to the Reactable and receives precise information about the different properties of objects placed on the table directly from the Reactable software. In this way, issues regarding object properties such as identity and position are entirely avoided and the experiments can be totally focused on the interaction itself. Another important component of BASSIS is the spoken language module that allows the human and the robot to interact using natural language.

All the parameters describing the interaction can be stored in log files for further analysis, yielding a novel architecture for benchmarking experiments in human-robot interaction.
Here we provide a complete description of all the components of the proposed setup. We present in detail all the modules that compose the proposed architecture. Notably, we have made all the software publicly available (at http://efaa.sourceforge.net/). This allows the whole robotics community to benefit from the proposed architecture, test it on their own systems and benchmark our results.
In the results section, we present a detailed analysis of the overall system
performance. We benchmark the performance of the robot in a collaborative object manipulation task. We provide statistics about the precision
in grasping and position representation provided by the BASSIS architecture. We also benchmark the setup over 4 different types of interactions
between humans and the iCub. The interactions are performed using the
spoken language system to achieve cooperative actions. Finally, in the
last section, the highlights and drawbacks of the proposed system are discussed.
A.2 Methods
BASSIS architecture overview
BASSIS is an extended version of the Distributed Adaptive Control (DAC) architecture (Verschure and Althaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly coupled layers: reactive, adaptive and contextual. At each level of organization, increasingly more complex and memory-dependent mappings from sensory states to actions are generated, dependent on the internal state of the agent. The BASSIS architecture expands DAC with many new components. It incorporates the repertoire of capabilities available for the human to interact with the humanoid robot iCub. BASSIS is divided into four hierarchical layers: Contextual, Adaptive, Reactive and Soma. The different modules responsible for implementing the
Figure A.1: The iCub ready to play with the objects that are on the Reactable.
The table provides an appealing atmosphere for social interactions between humans and the iCub.
relevant competences of the humanoid are distributed over these layers
(Fig. A.2).
The Soma and the Reactive layers are the bottom layers of the BASSIS diagram. They are in charge of encoding the robot's motor primitives through the attentionSelector and pmpActions packages. They also acquire useful sensory information from the tactile sensors mounted on the iCub hands through the tactileInterface module.
In the Adaptive layer, a real-time affine transformation is performed in order to map the coordinates received from the Reactable into the robot's frame of reference. The iCub needs to be given the Cartesian coordinates within its own frame to execute any motor action on an object. This is accomplished by the ObjectLocationTransformer module of the Adaptive layer.
At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Object Properties Collector (OPC). The Supervisor module essentially manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data that ranges over object coordinates in multiple reference frames (both in the Reactable and robot domains). The OPC plays a central role as the database where a large variety of information is stored from the beginning of the experiments. Finally, the reactable2OPC module handles the communication interface required to exchange the information provided by the Reactable with the OPC.
In the following sections, we elaborate on the description of each component of the BASSIS architecture.
The iCub
The iCub is a humanoid robot one meter tall, with dimensions similar to a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over the hands, arms, head and legs (Metta et al., 2008).

The sensory inputs are provided by artificial skin, cameras and microphones. The robot is equipped with novel artificial skin covering the hands (fingertips and the palm) and the forearms (Cannata et al., 2008). Stereo vision is provided by stereo cameras in a swivel mounting, and stereo sound capture is done using two microphones. Both cameras and microphones are located where eyes and ears would be placed in a human. The facial expressions, mouth and eyebrows, are projected from behind the face panel using lines of red LEDs. It also has a sense of proprioception (body configuration) and movement (using accelerometers and gyroscopes).
The different software modules in the architecture are interconnected using YARP (Metta et al., 2006).
[Figure A.2 here: the BASSIS layer diagram, with the Supervisor and OPC in the Contextual layer, reactable2OPC and ObjectLocationTransformer in the Adaptive layer, pmpActions, tactileInterface and attentionSelector in the Reactive layer, and cartesianControl and gazeControl in the Soma layer.]
Figure A.2: BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.
This framework supports distributed computation with a focus on robot control and efficiency. Communication between two modules using YARP happens through objects called "ports". The ports can exchange data over different network protocols such as TCP and UDP.
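As a hedged illustration of this mechanism (the port names and message content below are invented for the example, not the actual BASSIS ports):

    #include <yarp/os/Network.h>
    #include <yarp/os/BufferedPort.h>
    #include <yarp/os/Bottle.h>

    int main() {
        yarp::os::Network yarp;                  // connect to the YARP name server
        yarp::os::BufferedPort<yarp::os::Bottle> out;
        out.open("/myModule/objects:o");         // writer side of the link
        yarp::os::Bottle& b = out.prepare();
        b.addString("guitar");                   // e.g. an object label
        b.addDouble(0.25);                       // e.g. its x position (m)
        b.addDouble(-0.10);                      // e.g. its y position (m)
        out.write();                             // sent over TCP or UDP
        // a reader connects with: yarp connect /myModule/objects:o /reader:i
        return 0;
    }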
Currently, the iCub is capable of reliably coordinating reach and grasp motions, producing facial expressions to express emotions, controlling forces by exploiting its force/torque sensors, and gazing at points in the visual field. The interaction can be performed using spoken language. With all these features, the iCub is an outstanding platform for research in social interaction between
humans and humanoid robots.
Reactable
Conceptualized as a new musical instrument, the Reactable is a tabletop tangible interface (Geiger et al., 2010). With the Reactable, the user can manipulate digital information by interacting with real-world objects; it gives a physical shape to the digital information. The instrument is composed of a round table with a translucent top, where objects and fingertips (cursors) are used to control its parameters. It has a back projector to display the properties of the objects through the translucent top and a camera used to track the different objects (Fig. A.3).

Object recognition is performed using the reacTIVision tracking system (Kaltenbrunner and Bencina, 2007). reacTIVision can track a large number of objects in real time using fiducial markers (Bencina and Kaltenbrunner, 2005). The objects receive an identification number (ID) according to their fiducial markers. The software provides precise information about the (x, y) position of the objects, their rotation angle, speed and acceleration.
The tracking system uses the TUIO protocol for exchanging data (Kaltenbrunner et al., 2005). This protocol has been specifically designed to meet the
requirements of a general versatile communication interface for tabletop
tangible interfaces. The information from the reacTIVision software can
be read by any software using TUIO.
Object Properties Collector (OPC)
The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and growing as the interaction with humans evolves over time. In this setup, the data provided by the Reactable feed the memory of the robot represented by the OPC.
[Figure A.3 here: the Reactable, with tagged objects and fingertip cursors on the translucent tabletop, a projector and camera underneath, and reacTIVision streaming object data (x, y, ø, ..., s) to a TUIO client application over the TUIO protocol.]
Figure A.3: Reactable system overview. See text for further explanation.
The OPC reflects what the system knows about the environment and what is available. The data collected range from object coordinates in multiple reference frames (both in the Reactable and robot domains) to the human's emotional state or position in the interaction process.
Entities in the OPC are addressed with unique identifiers and managed dynamically, meaning that they can be added, modified and removed at run-time, safely and easily, from multiple sources located anywhere in the network, practically without limitations. Entities consist of all the possible elements that enter or leave the scene: objects, table cursors, matrices used for geometric projection, as well as the robot and the human themselves. All the properties related to the entities can be manipulated and used to populate the centralized database. A property in the OPC vocabulary is identified by a tag specified as a string (a sequence of characters). The values that can be assigned to a tag can be either single strings or numbers, or even a nested list of strings and numbers.
Tag              | Description       | Type    | Examples
"entity"         | type of entity    | string  | "object", "table"
"name"           | object name       | string  | "drums", "guitar"
onTable          | presence/absence  | integer | 0, 1
rt_position_x    | x-table           | double  | [xmin, xmax] m
rt_position_y    | y-table           | double  | [ymin, ymax] m
robot_position_x | x-robot           | double  | [xmin, xmax] m
robot_position_y | y-robot           | double  | [ymin, ymax] m
rt_id_x          | unique table ID   | integer | ID 0

Table A.1: Excerpt of available OPC properties.
Table A.1 shows some examples of the tags used.
Reactable connection with OPC
The reactable2OPC module provides the required communication interface
between the Reactable and the OPC. The implementation is done using a
TUIO client module that receives the data from the reacTIVision software
and re-sends it over a YARP port.
The OPC is populated with the properties of tagged objects that are on the
table. Such properties include: object ID, the (x, y) position of each object
on the table within the Reactable coordinate frame, angle of rotation and
speed. When an object is removed from the table the module keeps a
memory of its ID. The system reassigns the same ID to the same object
when it reappears.
Coordinate frame transformation based on homography
In order for the iCub to explore the world and interact with it, it not only needs to use its sensors but also to coordinate perception and action across
several frames of reference. The iCub has its own coordinate frame; it is therefore required to execute a transformation between the table and the robot-centered coordinate frames, and vice versa, in order to match objects to actions. In the iCub's reference frame, the x axis points backward, the y axis points laterally and is chosen according to the right-hand rule, and the z axis is parallel to gravity but points upwards.

Within BASSIS, the objectLocationTransformer solves this mathematical problem using a coordinate frame transformation based on homography. This technique is commonly used in computer vision for mapping two coordinate frames and for camera calibration (Zhang, 2000). The kernel of the transformation is given by a 3x3 matrix H that rotates and translates the points B received from the table into the points A in the frame of the robot. The transformation can be formalized as A = H × B.
To obtain the positions A and B in order to compute H, we perform an a priori calibration session. In this phase, the iCub points at 3 random positions on the table with its fingertips, using the tactileInterface. Currently, the center of the palm is the known position where the iCub is pointing. We use the lengths of the iCub's fingers and the joint angles to calculate the fingertip position through a computation of the forward kinematics (Siciliano et al., 2011), and thereby estimate A. The points B are the corresponding coordinates in the Reactable frame obtained directly from the OPC.

Once the system is calibrated, the transformation between the two domains can be performed in real time with a simple multiplication by H, which is stored in the OPC as well.
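In homogeneous coordinates that multiplication looks as follows; a minimal C++ sketch (H is assumed to be the calibrated matrix, names illustrative):

    #include <array>

    using Mat3 = std::array<std::array<double, 3>, 3>;

    // Map a Reactable point (x, y) into the robot frame: A = H * B,
    // with B = (x, y, 1) in homogeneous coordinates.
    std::array<double, 2> tableToRobot(const Mat3& H, double x, double y) {
        const std::array<double, 3> b = {x, y, 1.0};
        std::array<double, 3> a = {0.0, 0.0, 0.0};
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                a[i] += H[i][j] * b[j];
        return {a[0] / a[2], a[1] / a[2]};   // normalize by the third component
    }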
The Passive Motion Paradigm (PMP) and the Cartesian Interface
The PmpActionsModule provides motor control mechanisms that allow the
robot to perform complex manipulation tasks. It relies on a revised version
of the so-called Passive Motion Paradigm (PMP). PMP is a biomimetic
synergy formation model (Mohan et al., 2009) conceived for solving redundancy in human movements. The neurobiological background is related
to the Equilibrium Point Hypothesis, developed in the 60s by Feldman
and then by Bizzi. In the model (Mussa Ivaldi et al., 1988) goals and
constraints are expressed as force fields simultaneously in different motor spaces. In this way the model can carry out kinematic inversion by
means of a differential approach that relies on Jacobians but never computes their inverse explicitly. In the current implementation the model was modified by breaking it into two modules: 1) a module where virtual
force fields are used in the workspace in order to perform reaching tasks
while avoiding obstacles as conceived by Khatib in the 80s (Khatib, 1985);
2) a module that carries out kinematic inversion via a modified pseudoinverse/transposed Jacobian. Within the PMP framework it is possible to
describe objects of the perceived world either as obstacles or as targets,
and to consequently generate proper repulsive or attractive force fields,
respectively. The PMP module can also include internal constraints, for
example related to bimanual coordination.
According to the composition of all the active fields, the robot's end-effector is eventually driven towards the selected target while bypassing the identified obstacles. The behavior and performance strictly depend on the mutual relationship among the tunable field parameters. This framework allows defining a trajectory in Cartesian space, which needs to be properly converted into joint space through a robust inverse kinematics solver.

Specifically, we resort to the Cartesian Controller (Pattacini et al., 2010) to solve the inverse kinematics problem while fulfilling additional constraints such as joint limits. The non-linear optimization algorithm used, IPOPT, allows solving the inverse kinematics problem robustly (Wächter and Biegler, 2005). The Cartesian Controller also provides a reliable controller that extends the Multi-Referential Dynamical Systems approach (Hersch and Billard, 2008), synchronizing two dynamical controllers, one in joint space and the other in task space. It
ensures that the robot actually follows the trajectory.
In order to reproduce human-like manipulation tasks, we built a library, namely pmpActions, modeling complex actions as combinations of simple action primitives that define high-level actions such as Reach, Grasp and Release. In fact, if we assume that, when an object is grasped, it becomes an appendage of the robot's body structure, complex actions can be performed as a sequence of simple primitives (Fig. A.4).
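As a hedged sketch of that compositional idea (the interface below is an illustrative stand-in, not the actual pmpActions API; only the primitive names come from the text):

    struct Point3D { double x, y, z; };

    // Illustrative client exposing the three primitives named in the text.
    struct PmpClient {
        void reach(const Point3D& target);  // attractive field toward target
        void grasp();                       // close the hand on the object
        void release();                     // open the hand
    };

    // "Move the object" as a sequence of primitives: once grasped, the
    // object travels as an appendage of the body while reaching again.
    void moveObject(PmpClient& pmp, const Point3D& from, const Point3D& to) {
        pmp.reach(from);
        pmp.grasp();
        pmp.reach(to);
        pmp.release();
    }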
[Figure A.4 here: the PMP stack, with the pmpActions library (grasp and release, requests for adding/removing objects) on top of a PMP client, the PMP server performing field and trajectory generation from target/obstacle positions and sizes, and the Cartesian Interface solving the inverse kinematics for the 3D point position and pose.]
Figure A.4: PMP architecture. At the lowest level, the pmpServer module computes the trajectory, targets and obstacles in space, feeding the Cartesian Interface with the trajectory that has to be followed. The user relies on a pmpClient in order to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level, the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives.
Spoken Language Interaction and Supervisor
The spoken language interaction is managed by the Supervisor. It is implemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application Development (RAD) environment in the form of a finite-state dialogue system, where speech synthesis is done by Festival and recognition by Sphinx-II. These state modules provide functions such as conditional transitions to new states based on the words and sentences recognized, and thus conditional execution of code based on the current state and the spoken input from the user. The recognition and the extraction of the useful information from the recognized sentence are based on a simple grammar that we have developed. For instance, an action order follows this structure:
1. $action1 = grasp | point ;
2. $action2 = put | drop ;
3. $object = drums | synthesizer | guitar | box ;
4. $where_rel = my | your ;
5. $location = here | $where_rel left | middle | $where_rel right ;
6. $actionOrder = $action1 [*sil%%|the%%] $object | $action2 [the%%] $object [on%% the%%] $location ;
$action1, $action2, $object, $where_rel and $location are variables which can take several values. The final recognized sentence has the grammar of $actionOrder. The words between square brackets are optional, and the ones followed by a %% sign are recognized but not stored.
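For illustration, the grammar above can be rendered as two regular expressions. This is only a sketch of the matching logic, since the actual recognition is performed by Sphinx-II within the RAD dialogue system:

```python
import re

# Regex rendering of the action-order grammar (variable names mirror the
# grammar rules above; the %%-marked filler words are simply optional here).
ACTION1 = r"(?P<action1>grasp|point)"
ACTION2 = r"(?P<action2>put|drop)"
OBJECT = r"(?P<object>drums|synthesizer|guitar|box)"
LOCATION = r"(?P<location>here|(?:my|your) left|middle|(?:my|your) right)"

P1 = re.compile(rf"^{ACTION1} (?:the )?{OBJECT}$")
P2 = re.compile(rf"^{ACTION2} (?:the )?{OBJECT}(?: on)?(?: the)? {LOCATION}$")

def parse_order(sentence):
    """Return the extracted variables, or None if the sentence has no parse."""
    m = P1.match(sentence) or P2.match(sentence)
    return m.groupdict() if m else None

print(parse_order("grasp the guitar"))
print(parse_order("drop the box on the middle"))
```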
The Supervisor allows the human to interact with the robot. It requires access to and control over several modules and is thus in charge of the YARP connections, establishing connections with the following components:
1. attentionSelector, to control where the iCub is looking, providing a non-verbal channel of communication by gazing at the targets while speaking (e.g. looking at the drums while saying the name "drums"),

2. objectsPropertiesCollector, to obtain all the knowledge of the iCub, i.e. what it knows about the world, in order to answer the human (e.g. to know whether the synthesizer is on the table),

3. pmpActions, to move its hands when executing orders from the human that require manipulating objects (e.g. grasping the guitar after the proper command spoken by the human),

4. tactileInterface, to carry out the calibration process between the Reactable and the iCub, in which the human takes part (e.g. when the iCub asks him to touch the table).
Visualizations in real time
Displaying the state of the world and of the iCub is important for debugging purposes, since it provides a useful tool to visualize in real time the elements available in the memory of the robot. All objects detected by the Reactable or stored as static locations are loaded from the OPC and displayed in a graphical user interface, the iCubGUI. The robot is also rendered in real time, which gives the user an accurate visual representation of the whole scenario in the coordinate frame of the robot. The objLocationTransformer module is responsible for sending the data directly to the iCubGUI, labeling each entity by type, which allows us to customize the display for the different entity types.
Experimental scenario
We performed 2 different experiments in order to benchmark the architecture's capability to maintain social interaction. We first defined 3 fixed positions on the table: left, middle and right (Fig. A.5). They are used as spatial references for placing the objects the robot has to interact with.
In the initial experiments, we evaluate the precision of the grasping tasks performed by the iCub. The robot had to perform the following tasks: grasp an object, lift it from the table and put it back at a desired location. We use an object of dimensions 4×6×7 cm (Fig. A.1).
In the first experiment, we divided the tasks into two: the grasping task and the place task. In the former, the iCub grasps an object placed at one of the three fixed positions and then releases it at the exact same point from which it picked it up, thereby assessing the robot's capability of object manipulation as well as its precision. For each of these three positions, the robot had to perform 4 grasps. The robot executed a total of 36 grasping actions, equally divided between the right, middle and left positions. We removed 4 samples corresponding to non-successful grasps, 3 at the middle and 1 at the left position. At these positions the robot slightly hit the object after placing it on the table; as the data collection is performed after the arm has returned to the rest position, the recorded values would not reflect the precision of the grasping. To measure the precision, we define the grasping error (GE) as the difference between the target position and the final position where the object is placed, as formalized below.
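Written out per axis, and consistent with the signed mean errors reported in Section A.3, this reads:

```latex
\[
  GE_x = x_{\mathrm{final}} - x_{\mathrm{target}}, \qquad
  GE_y = y_{\mathrm{final}} - y_{\mathrm{target}}.
\]
```

The place error PE introduced below is defined analogously.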
To further test the robot's ability to move an object from one point to another, we designed the place task. In this task, the robot had to grasp an object and move it from one spatial point to another. We designed the experiment to cover all possible combinations among right, middle and left. The robot executed a total of 72 grasping actions: 12 repetitions of each of the 6 possible source-target combinations (where source and target are necessarily different). The place error (PE) is calculated in the same way as the grasping error, i.e., as the difference between the target and the final position.
The system was tested on a cluster of 4 machines with quad-core Intel Core i7-820QM processors (1.73 GHz).
In the last experiment, we test the iCub's learning capabilities as well as the spoken language interaction. For the scope of this experiment, we have used shared plans. Also called cooperative plans, these correspond to interleaved actions which are scheduled by two cooperating agents (here a human and the iCub) in order to achieve a common goal (Tomasello et al., 2005).
A shared plan can be taught in real time by the human. After the user announces a shared plan by telling the iCub "Shared Plan", the Supervisor enters the sub-dialogue node of the cooperative plan. The user can then describe the plan he wants to execute with the iCub and execute it if the iCub already knows it; otherwise the robot asks the human to teach it. For the learning part, the user describes the sequence of actions one by one, letting the iCub ask for confirmation each time. When all the actions have been given, the user ends the teaching by saying "finished" (Petit et al., 2012). The newly learnt shared plan is then saved in a file in the following form:
• Shared plan name, e.g. "Swap"

• Steps, e.g. "{I put {{drums middle}}} {You put {{guitar right}}} {I put {{drums left}}}"

• Subjects, e.g. "I You"

• Arguments, e.g. "drums guitar"
The subjects and arguments are written so as to be parsed and replaced in the shared plan according to the sentence spoken by the user when he invokes the plan. Thus, one could for instance learn the swap plan using "I and You swap the drums and the guitar" but execute it with "You and I swap the drums and the guitar": in this way there is a role reversal, and the robot performs the actions originally planned to be executed by the human, and vice versa; the role reversal test is thus passed (Dominey and Warneken, 2009). This aspect is very important because this capability is the signature of a shared plan (Tomasello et al., 2005; Carpenter et al., 2005).
Moreover, the arguments can also be replaced: with the same original shared plan learnt, the iCub and the user can carry out the cooperative plan "You and I swap the synthesizer and the box", showing the capacity to generalize, in this example over objects, within the cooperative plan: the robot has learnt how to swap something with something, not just the drums and the guitar (Lallée et al., 2010). A sketch of this substitution mechanism is given below.
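In this sketch, the subjects and arguments of the stored plan are simply remapped onto those of the spoken sentence; the data layout follows the saved-file format listed above, while function and variable names are illustrative assumptions:

```python
# A stored shared plan, laid out as in the saved file described above.
plan = {
    "name": "swap",
    "steps": [("I", "put", "drums", "middle"),
              ("You", "put", "guitar", "right"),
              ("I", "put", "drums", "left")],
    "subjects": ["I", "You"],
    "arguments": ["drums", "guitar"],
}

def instantiate(plan, subjects, arguments):
    """Remap subjects and arguments onto the learned plan; swapping
    'I' and 'You' yields role reversal, new objects yield generalization."""
    smap = dict(zip(plan["subjects"], subjects))
    amap = dict(zip(plan["arguments"], arguments))
    return [(smap[s], verb, amap[obj], loc)
            for s, verb, obj, loc in plan["steps"]]

# "You and I swap the synthesizer and the box": role reversal + new objects.
print(instantiate(plan, ["You", "I"], ["synthesizer", "box"]))
```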
For the experiment, we designed 4 different shared plans. We repeated the learning and execution of all four shared plans with three non-naive humans. Here we present the 4 shared plans in both the teaching and the execution phases, exploring both role reversal and generalization of objects:
Teaching:
1. You and I swap drums guitar
• You put drums left and I put guitar right
2. You and I split drums guitar
• You put drums right and I put guitar left
3. You and I group guitar drums
• You put guitar middle and I put drums middle
4. You and I pass guitar left
• You put guitar middle and I put guitar left
Execution:
1. You and I swap drums guitar
2. I and You split drums guitar
3. You and I group synthesizer guitar
4. I and You pass guitar right
Figure A.5: A cartoon of the experimental scenario. The iCub stands at x = −35 cm facing the Reactable; the left (L), middle (M) and right (R) positions lie at y = 20 cm, y = 0 cm and y = −20 cm respectively, given in the reference frame of the robot.
A.3 Results
In the first part of this section, we present an analysis of the error in the grasping tasks. We perform the analysis separately for the two axes x and y. Initially, we perform a Shapiro-Wilk test to check for normality in GEx and GEy. The results show a p-value of 0.68 for GEx and of 0.85 for GEy; hence normality cannot be rejected and the data are treated as normally distributed. We further perform a t-test to evaluate whether the distributions have zero mean. The mean values are equal to −1.20 cm and 1.57 cm for GEx and GEy respectively. This suggests that the robot has a tendency to place the objects to the front and right of the target.
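For illustration, the snippet below reproduces this test pipeline with scipy on synthetic placeholder samples; only the means match the reported values, while the spreads and the 32 retained samples are stand-ins for the recorded data:

```python
import numpy as np
from scipy import stats

# Placeholder samples standing in for the 32 successful grasps.
rng = np.random.default_rng(0)
ge_x = rng.normal(-1.20, 0.7, size=32)
ge_y = rng.normal(1.57, 0.8, size=32)

w_x, p_x = stats.shapiro(ge_x)      # large p: normality is not rejected
w_y, p_y = stats.shapiro(ge_y)
t_x = stats.ttest_1samp(ge_x, 0.0)  # does the error have zero mean?
t_y = stats.ttest_1samp(ge_y, 0.0)
print(p_x, p_y, t_x.pvalue, t_y.pvalue)
```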
Analyzing the 3 different conditions individually, we see that the mean of the horizontal error GEx decreases from right to left (Fig. A.6). That is, the horizontal error is larger on the side corresponding to the arm used in the task (the right arm). In our interpretation, this reflects the geometrical arrangement of the arm in relation to the torso when the target position is on the side opposite to the arm used for grasping. On the left side, the error in x is reduced to its cosine component; therefore, the same deviation caused by possible imprecisions in the movement of the torso represents a smaller error on the left side than in the middle or on the right side.
We confirm this observation by performing a multiple regression analysis to investigate the impact of the different positions on those deviations. The only significant impact is on GEx when the action is performed at the left side (with p-value = 2.55 × 10−7). No impact of position was observed on GEy. Hence, we conclude that the iCub is significantly more precise when grasping objects that are on the left side.
Figure A.6: Box plots of GEx and GEy for the grasp experiments at the left, middle and right positions on the table. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y.

In the second part, we present an analysis of the error in the place experiment (PE). The Shapiro-Wilk test gives a p-value of 0.25 and 0.0015, respectively, for PEx and PEy. In this case, PEy is not normally distributed. For this reason, we use parametric tests for PEx and non-parametric tests for PEy.
A linear regression is performed in order to see whether the distance of the trajectory, defined as one for middle to right/left and two from left to right, has an impact on the error. The analysis gives a p-value of 0.80, suggesting that the distance has no impact on the error (Fig. A.7).
Figure A.7: Box plots of the errors in x and y for the place experiment under the different conditions analyzed: source and target positions (left, middle, right). Both conditions have an impact on the error.
We go further and perform another linear regression using the source and the target as variables, followed by an ANOVA. The analysis shows an impact of both parameters (p-values of 0.020 and 0.001 for source and target respectively). The system nevertheless remains very precise, with errors ranging from −1 to 1 cm.
For the analysis of PEy we use Kruskal-Wallis tests. The tests reveal no significant impact of the distance (p-value = 0.099), but a significant influence of the source (p-value = 4.8 × 10−10) and the target (p-value = 2.06 × 10−9). The precision along this axis is lower than along the x axis, but still very acceptable: the errors range from about −1 to 3 cm when moving an object with a 4×6 cm base over a distance of at least 20 cm (or 40 cm).
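A corresponding sketch of the non-parametric analysis, again on placeholder data rather than the recorded errors:

```python
import numpy as np
from scipy import stats

# One group of PEy errors per source position (synthetic placeholder data).
rng = np.random.default_rng(1)
pe_y_by_source = [rng.normal(m, 0.5, size=24) for m in (0.0, 1.0, 2.0)]
h, p = stats.kruskal(*pe_y_by_source)  # Kruskal-Wallis across the 3 groups
print(f"H = {h:.2f}, p = {p:.3g}")
```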
For the second experiment, three subjects were asked to teach the robot 4 different shared plans and then execute them. In total, the robot learned and executed 12 shared plans. To evaluate the learning capabilities of the iCub as well as its performance in executing each plan, we measured the time of completion for each shared plan. Learning time is measured from the beginning of the human command to the indication of "finished", which marks the completion of a learning sequence. Accordingly, execution time is measured from the start of the human command to the execution of the last action, followed by the next request by the iCub.

The iCub was able to learn a shared plan of two actions in approximately half a minute and to execute the learned plan in a minute (Fig. A.8). During the learning phase only two speech-recognition failures occurred, whilst during the execution phase three occurred. These failures are mainly attributable to the Sphinx-II speech recognizer and were therefore not taken into account. During the execution of the 12 shared plans, with N = 18 distinct actions by the iCub, no physical errors occurred. Finally, it is worth noting that a shared plan execution takes roughly the same amount of time with a role reversal, with replaced arguments, or with both, stressing the robustness of the system.
A.4 Conclusions
In this paper, we proposed an interaction paradigm and architecture for human-robot interaction, called BASSIS, designed to provide a novel and controlled environment for research on social interaction with humanoid robots. The core of BASSIS is the integration between the iCub robot and the tangible interface Reactable. This combination provides a
rich context for the execution of cooperative plans between the iCub and the human, with games or collaborative music generation.

Figure A.8: Average time needed for the shared plan tasks. The learning box represents the time needed by the human to teach the task; the execution box represents the time needed by the interaction to perform the learned actions.

We presented a
detailed technical description of all the components that build the 4 layers
of BASSIS. Moreover, we exercised the system's capacities in 2 different kinds of experiments.
In our results we have assessed both the precision of the overall system and its usability in the context of acquiring and executing shared plans. We showed that the position of the objects provided by the combination of the Reactable, objectLocationTransformer, pmpActions and cartesianControl is reliably mapped to the robot's frame, allowing precise motor actions. In a series of grasping tasks, the setup showed a maximum error of 2 cm between the pre-defined targets and the final position to which the objects were displaced.
In the experiments, we observed an impact of the position of the objects on the grasping precision. For simple grasping at the same position, i.e., taking an object off the table and putting it back on the same spot, the precision was significantly better on the left side of the table (the side opposite to the arm used). Assuming that the robot is perfectly bilaterally symmetric, this difference could be compensated for by using both arms.
The processes ran smoothly without affecting the dynamics of the interaction. Over 40 hours of tests and 360 executions of grasping, we had no mechanical issues and no hardware or software failures.
In the next version of the architecture, 2 new modules that are already in a test phase will be integrated. The first is based on the integration of the HAMMER model (Hierarchical Attentive Multiple Models for Execution and Recognition of actions) (Sarabia et al., 2011) with the PmpActionsModule. With HAMMER the system will be able to understand human movements by internally simulating the robot's own actions and predicting future human states. The other module will add face recognition capabilities to the robot; it is based on a canonical model of cortical computation called the Temporal Population Code (Luvizotto et al., 2011). In addition, we will add a multi-scale bottom-up and top-down attention system (Mathews et al., 2011). With these new features we expect to increase the level of social connection between the human and the robot during the interaction.
Bibliography
Each reference indicates the pages where it appears.
Yoseph Bar-Cohen and David Hanson. The Coming Robot Revolution: Expectations and Fears About Emerging Intelligent, Humanlike Machines. Springer, 2009. ISBN 0387853480. 110

Omri Barak, Misha Tsodyks, and Ranulfo Romo. Neuronal population coding of parametric working memory. The Journal of Neuroscience, 30(28):9424–30, July 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1875-10.2010. URL http://www.jneurosci.org/cgi/content/abstract/30/28/9424. 14, 23

Emmanuel J Barbeau, Margot J Taylor, Jean Regis, Patrick Marquis, Patrick Chauvel, and Catherine Liégeois-Chauvel. Spatio-temporal dynamics of face recognition. Cerebral Cortex, 18(5):997–1009, May 2008. ISSN 1460-2199. doi: 10.1093/cercor/bhm140. URL http://cercor.oxfordjournals.org/cgi/content/abstract/18/5/997. 55

Francesco P Battaglia and Bruce L McNaughton. Polyrhythms of the brain. Neuron, 72(1):6–8, October 2011. ISSN 1097-4199. doi: 10.1016/j.neuron.2011.09.019. URL http://dx.doi.org/10.1016/j.neuron.2011.09.019. 16

Pierre Bayerl and Heiko Neumann. Disambiguating visual motion through contextual feedback modulation. Neural Computation, 16(10):2041–66, October 2004. ISSN 0899-7667. doi: 10.1162/0899766041732404. URL http://www.citeulike.org/user/toerst/article/4199040. 58, 72, 81

G. Bekey and J. Yuh. The Status of Robotics: Report on the WTEC international study: Part I. IEEE Robotics and Automation Magazine, 14(4):76–81, 2007. 110

Karim Benchenane, Paul H Tiesinga, and Francesco P Battaglia. Oscillations in the prefrontal cortex: a gateway to memory and attention. Current Opinion in Neurobiology, 21(3):475–85, June 2011. ISSN 1873-6882. doi: 10.1016/j.conb.2011.01.004. URL http://dx.doi.org/10.1016/j.conb.2011.01.004. 16

Ross Bencina and Martin Kaltenbrunner. The Design and Evolution of Fiducials for the reacTIVision System. In 3rd International Conference on Generative Systems in the Electronic Arts, 2005. 116

Andrea Benucci, Robert A Frazor, and Matteo Carandini. Standing waves and traveling waves distinguish two circuits in visual cortex. Neuron, 55(1):103–117, 2007. ISSN 0896-6273. doi: 10.1016/j.neuron.2007.06.017. 23

Andrea Benucci, Dario L Ringach, and Matteo Carandini. Coding of stimulus sequences by population responses in visual cortex. Nature Neuroscience, 12(10):1317–24, October 2009. ISSN 1546-1726. doi: 10.1038/nn.2398. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2847499&tool=pmcentrez&rendertype=abstract. 14, 23
Cyrus P Billimoria, Benjamin J Kraus, Rajiv Narayan, Ross K Maddox, and Kamal Sen. Invariance and sensitivity to intensity in neural discrimination of natural sounds. The Journal of Neuroscience, 28(25):6304–6308, 2008. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.0961-08.2008. 15, 23

Tom Binzegger, Rodney J Douglas, and Kevan A C Martin. A quantitative map of the circuit of cat primary visual cortex. The Journal of Neuroscience, 24(39):8441–53, September 2004. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1400-04.2004. 90

M Block. A note on the refraction and image formation of the rat's eye. Vision Research, 9(6):705–711, June 1969. ISSN 0042-6989. doi: 10.1016/0042-6989(69)90127-8. URL http://dx.doi.org/10.1016/0042-6989(69)90127-8. 97

Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual Navigation for Mobile Robots: A Survey. Journal of Intelligent and Robotic Systems, 53(3):263–296, November 2008. ISSN 0921-0296. doi: 10.1007/s10846-008-9235-4. 90

Jeffrey S. Bowers. On the biological plausibility of grandmother cells: Implications for neural network theories in psychology and neuroscience. Psychological Review, 116(1):220, 2009. 10

Cynthia Breazeal, Matt Berlin, Andrew Brooks, Jesse Gray, and Andrea L. Thomaz. Using perspective taking to learn from ambiguous demonstrations. Robotics and Autonomous Systems, 54(5):385–393, May 2006. ISSN 0921-8890. doi: 10.1016/j.robot.2006.02.004. 110

Farran Briggs and Edward M. Callaway. Layer-Specific Input to Distinct Cell Types in Layer 6 of Monkey Primary Visual Cortex. The Journal of Neuroscience, 21(10):3600–3608, May 2001. URL http://www.jneurosci.org/cgi/content/abstract/21/10/3600. 7

Dean V Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience, 10(2):113–25, February 2009. ISSN 1471-0048. doi: 10.1038/nrn2558. URL http://dx.doi.org/10.1038/nrn2558. 108

G Buzsáki. Rhythms of the Brain. Oxford University Press, New York, New York, USA, 2006. ISBN 9780195301069. URL http://books.google.es/books?id=ldz58irprjYC. 15, 51, 103

C. Cadoz. Le geste canal de communication homme/machine: la communication instrumentale. TSI. Technique et Science Informatiques, 13(1):31–61, 1984. ISSN 0752-4072. URL http://cat.inist.fr/?aModele=afficheN&cpsidt=3595521. 75

Edward M. Callaway. Local circuits in primary visual cortex of the macaque monkey. Annual Review of Neuroscience, 21:47–74, November 1998. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.neuro.21.1.47. 7
Edward M Callaway. Structure and function of parallel pathways in the primate early visual system. The Journal of Physiology, 566(Pt 1):13–9, July 2005. ISSN 0022-3751. doi: 10.1113/jphysiol.2005.088047. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1464718&tool=pmcentrez&rendertype=abstract. 5

Giorgio Cannata, Marco Maggiali, Giorgio Metta, and Giulio Sandini. An embedded artificial skin for humanoid robots. In 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 434–438. IEEE, August 2008. ISBN 978-1-4244-2143-5. doi: 10.1109/MFI.2008.4648033. 63, 114

Mikael A Carlsson, Philipp Knusel, Paul F M J Verschure, and Bill S Hansson. Spatio-temporal Ca2+ dynamics of moth olfactory projection neurones. European Journal of Neuroscience, 22(3):647–657, 2005. ISSN 0953-816X. doi: 10.1111/j.1460-9568.2005.04239.x. 15, 24

Malinda Carpenter, Michael Tomasello, and Tricia Striano. Role Reversal Imitation and Language in Typically Developing Infants and Children With Autism. Infancy, 8(3):253–278, November 2005. ISSN 1525-0008. doi: 10.1207/s15327078in0803_4. 126

Jayaram Chandrashekar, Mark A Hoon, Nicholas J P Ryba, and Charles S Zuker. The receptors and cells for mammalian taste. Nature, 444(7117):288–94, November 2006. ISSN 1476-4687. doi: 10.1038/nature05401. URL http://dx.doi.org/10.1038/nature05401. 27

Shi-Huang Chen and Jhing-Fa Wang. Extraction of pitch information in noisy speech using wavelet transform with aliasing compensation. In Acoustics, Speech, and Signal Processing, 2001. Proceedings., volume 1, pages 89–92, Salt Lake City, USA, 2001. doi: 10.1109/ICASSP.2001.940774. 50

Taishih Chi, Powen Ru, and Shihab A Shamma. Multiresolution spectrotemporal analysis of complex sounds. The Journal of the Acoustical Society of America, 118(2):887–906, 2005. doi: 10.1121/1.1945807. URL http://link.aip.org/link/?JAS/118/887/1. 25

S Chikkerur, T Serre, C Tan, and T Poggio. What and where: a Bayesian inference theory of attention. Vision Research, 50(22):2233–47, October 2010. ISSN 1878-5646. doi: 10.1016/j.visres.2010.05.013. URL http://www.ncbi.nlm.nih.gov/pubmed/20493206. 11, 54

Bevil R Conway, Soumya Chatterjee, Greg D Field, Gregory D Horwitz, Elizabeth N Johnson, Kowa Koida, and Katherine Mancuso. Advances in color science: from retina to behavior. The Journal of Neuroscience, 30(45):14955–63, November 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.4348-10.2010. URL http://www.jneurosci.org/cgi/content/abstract/30/45/14955. 3

A. Cowey and E.T. Rolls. Human cortical magnification factor and its relation to visual acuity. Experimental Brain Research, 21(5), December 1974. ISSN 0014-4819. doi: 10.1007/BF00237163. URL http://www.springerlink.com/content/q7v12n66365h123h/. 6
Pablo D'Angelo. Hugin, 2010. URL http://hugin.sourceforge.net/. 97

J Daugman. Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20(10):847–856, 1980. ISSN 0042-6989. doi: 10.1016/0042-6989(80)90065-6. URL http://dx.doi.org/10.1016/0042-6989(80)90065-6. 25

Kerstin Dautenhahn. Socially intelligent robots: dimensions of human-robot interaction. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 362(1480):679–704, April 2007. ISSN 0962-8436. doi: 10.1098/rstb.2006.2004. 110

Licurgo de Almeida, Marco Idiart, and John E Lisman. The input-output transformation of the hippocampal granule cells: from grid cells to place fields. The Journal of Neuroscience, 29(23):7504–12, June 2009a. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.6048-08.2009. URL http://www.jneurosci.org/cgi/content/abstract/29/23/7504. 92, 96, 107

Licurgo de Almeida, Marco Idiart, and John E Lisman. A second function of gamma frequency oscillations: an E%-max winner-take-all mechanism selects which cells fire. The Journal of Neuroscience, 29(23):7497–503, June 2009b. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.6044-08.2009. URL http://www.jneurosci.org/cgi/content/abstract/29/23/7497. 96, 97

R L De Valois, D G Albrecht, and L G Thorell. Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22(5):545–59, January 1982. ISSN 0042-6989. URL http://www.ncbi.nlm.nih.gov/pubmed/7112954. 57

Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, and Michael J. Tarr, editors. Object Categorization. Cambridge University Press, Cambridge, 2009. ISBN 9780511635465. doi: 10.1017/CBO9780511635465. 110

G S Doetsch. Patterns in the brain. Neuronal population coding in the somatosensory system. Physiology and Behavior, 69(1-2):187–201, 2000. ISSN 0031-9384. 16, 24

Peter Ford Dominey and Felix Warneken. The basis of shared intentions in human and robot cognition. New Ideas in Psychology, 29(3):260–274, December 2009. ISSN 0732-118X. doi: 10.1016/j.newideapsych.2009.07.006. 126

Armin Duff and Paul F.M.J. Verschure. Unifying perceptual and behavioral learning with a correlative subspace learning rule. Neurocomputing, 73(10-12):1818–1830, June 2010. ISSN 0925-2312. doi: 10.1016/j.neucom.2009.11.048. 61, 112

Armin Duff, César Rennó-Costa, Encarni Marcos, Andre Luvizotto, Andrea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and Paul Verschure. From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7. doi: 10.1007/978-3-642-05181-4. URL http://www.springerlink.com/content/v348576tk12u628h.
Armin Duff, Marti Sanchez Fibla, and Paul F M J Verschure. A biologically based model for the integration of sensory-motor contingencies in rules and plans: A prefrontal cortex based extension of the Distributed Adaptive Control architecture. Brain Research Bulletin, 85(5):289–304, June 2011. ISSN 1873-2747. doi: 10.1016/j.brainresbull.2010.11.008. URL http://www.ncbi.nlm.nih.gov/pubmed/21138760. 27

Gaute T. Einevoll and Hans E. Plesser. Extended difference-of-Gaussians model incorporating cortical feedback for relay cells in the lateral geniculate nucleus of cat. Cognitive Neurodynamics, pages 1–18, November 2011. ISSN 1871-4080. doi: 10.1007/s11571-011-9183-8. URL http://www.springerlink.com/content/g64572342p61vk40/. 6, 27, 94

M Fabre-Thorpe, G Richard, and S J Thorpe. Rapid categorization of natural images by rhesus monkeys. Neuroreport, 9(2):303–8, January 1998. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/9507973. 17

Winrich A Freiwald, Doris Y Tsao, and Margaret S Livingstone. A face feature space in the macaque temporal lobe. Nature Neuroscience, 12(9):1187–96, September 2009. ISSN 1546-1726. doi: 10.1038/nn.2363. URL http://dx.doi.org/10.1038/nn.2363. 10, 55

M. Frigo and S.G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, February 2005. ISSN 0018-9219. doi: 10.1109/JPROC.2004.840301. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1386650. 60

K Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, January 1980a. ISSN 0340-1200. URL http://www.ncbi.nlm.nih.gov/pubmed/7370364. 2, 23

K Fukushima. Neocognitron for handwritten digit recognition. Neurocomputing, 51(1):161–180, 2003. ISSN 0925-2312. doi: 10.1016/S0925-2312(02)00614-8. URL http://linkinghub.elsevier.com/retrieve/pii/S0925231202006148. 11, 54

Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, April 1980b. ISSN 0340-1200. doi: 10.1007/BF00344251. URL http://www.springerlink.com/content/r6g5w3tt54528137/. 11, 54

G Geiger, N Alber, S Jordà, and M Alonso. The Reactable: A Collaborative Musical Instrument for Playing and Understanding Music. Her&Mus. Heritage & Museography, 2(4):36–43, 2010. 111, 116

A S Georghiades, P N Belhumeur, and D J Kriegman. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001. xxviii, 19, 66, 67

A. Georgopoulos, A. Schwartz, and R. Kettner. Neuronal population coding of movement direction. Science, 233(4771):1416–1419, September 1986. ISSN 0036-8075. doi: 10.1126/science.3749885. URL http://www.sciencemag.org/content/233/4771/1416.abstract. 12
CD Gilbert and TN Wiesel. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. The Journal of Neuroscience, 9(7):2432–2442, July 1989. URL http://www.jneurosci.org/cgi/content/abstract/9/7/2432. 7, 8

Susan Goldin-Meadow. How gesture promotes learning throughout childhood. Child Development Perspectives, 3(2):106–111, August 2009. ISSN 1750-8606. doi: 10.1111/j.1750-8606.2009.00088.x. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2835356&tool=pmcentrez&rendertype=abstract. 75

Tim Gollisch and Markus Meister. Rapid neural coding in the retina with relative spike latencies. Science, 319(5866):1108–11, February 2008. ISSN 1095-9203. doi: 10.1126/science.1149639. URL http://www.sciencemag.org/content/319/5866/1108.abstract. 4, 14

Tim Gollisch and Markus Meister. Eye smarter than scientists believed: neural computations in circuits of the retina. Neuron, 65(2):150–64, January 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.12.009. URL http://www.ncbi.nlm.nih.gov/pubmed/20152123. 4

Leslie J. Gonzalez Rothi, Cynthia Ochipa, and Kenneth M. Heilman. A Cognitive Neuropsychological Model of Limb Praxis. Cognitive Neuropsychology, 8(6):443–458, November 1991. ISSN 0264-3294. doi: 10.1080/02643299108253382. URL http://dx.doi.org/10.1080/02643299108253382. 75

Charles G Gross. Genealogy of the "grandmother cell". The Neuroscientist, 8(5):512–8, October 2002. ISSN 1073-8584. URL http://www.ncbi.nlm.nih.gov/pubmed/12374433. 10

Alfred Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 71(1):38–53, March 1911. ISSN 0025-5831. doi: 10.1007/BF01456927. URL http://www.springerlink.com/content/q1n417x61q8r2827/. 26

Ileana L Hanganu, Yehezkel Ben-Ari, and Rustem Khazipov. Retinal waves trigger spindle bursts in the neonatal rat visual cortex. The Journal of Neuroscience, 26(25):6728–36, June 2006. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.0752-06.2006. URL http://www.jneurosci.org/cgi/content/abstract/26/25/6728. 26

D O Hebb. Studies of the organization of behavior. I. Behavior of the rat in a field orientation. Journal of Comparative Psychology, 25:333–353, 1932. 97, 107

Micha Hersch and Aude G. Billard. Reaching with multi-referential dynamical systems. Autonomous Robots, 25(1-2):71–83, January 2008. ISSN 0929-5593. doi: 10.1007/s10514-007-9070-7. 120

D H Hubel and T N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160:106–54, January 1962. ISSN 0022-3751. URL http://www.ncbi.nlm.nih.gov/pubmed/14449617. 7, 54
Chou P Hung, Gabriel Kreiman, Tomaso Poggio, and James J DiCarlo. Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863–6, November 2005. ISSN 1095-9203. doi: 10.1126/science.1117593. URL http://www.sciencemag.org/content/310/5749/863.abstract. 22

J M Hupé, A C James, P Girard, and J Bullier. Response modulations by static texture surround in area V1 of the macaque monkey do not depend on feedback connections from V2. Journal of Neurophysiology, 85(1):146–63, January 2001. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/11152715. 8

L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998. ISSN 0162-8828. doi: 10.1109/34.730558. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=730558. 92, 93, 94, 107

Giuliano Iurilli, Fabio Benfenati, and Paolo Medini. Loss of Visually Driven Synaptic Responses in Layer 4 Regular-Spiking Neurons of Rat Visual Cortex in Absence of Competing Inputs. Cerebral Cortex, pages bhr304–, November 2011. ISSN 1460-2199. doi: 10.1093/cercor/bhr304. URL http://cercor.oxfordjournals.org/cgi/content/abstract/bhr304v1. 26

E M Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6):1569–72, January 2003. ISSN 1045-9227. doi: 10.1109/TNN.2003.820440. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1257420. 28, 78

Eugene Izhikevich. Dynamical Systems in Neuroscience: The Geometry of Excitability and Bursting. The MIT Press, Cambridge, USA, 1 edition, 2006. ISBN 0262090430. URL http://www.citeulike.org/user/sdaehne/article/1396879. 30, 80

Eugene M Izhikevich. Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15(5):1063–70, September 2004. ISSN 1045-9227. doi: 10.1109/TNN.2004.832719. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1333071. 28, 30, 79, 80, 95

J P Jones and L A Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–58, December 1987. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/3437332. 25
Sewoong Jun, Youngouk Kim, and Jongbae Lee. Difference of wavelet SIFT based mobile robot navigation. In 2009 IEEE International Conference on Control and Automation, pages 2305–2310. IEEE, December 2009. ISBN 978-1-4244-4706-0. doi: 10.1109/ICCA.2009.5410318. 90

Martin Kaltenbrunner and Ross Bencina. reacTIVision: a computer-vision framework for table-based tangible interaction. In Proceedings of the 1st International Conference on Tangible and Embedded Interaction - TEI '07, page 69, New York, New York, USA, February 2007. ACM Press. ISBN 9781595936196. doi: 10.1145/1226969.1226983. 116

Martin Kaltenbrunner, Till Bovermann, Ross Bencina, and Enrico Costanza. TUIO - A Protocol for Table Based Tangible User Interfaces. In Proceedings of the 6th International Workshop on Gesture in Human-Computer Interaction and Simulation (GW 2005), Vannes, France, 2005. 116

O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. In IEEE International Conference on Robotics and Automation, volume 2, pages 500–505. Institute of Electrical and Electronics Engineers, 1985. doi: 10.1109/ROBOT.1985.1087247. 120

Philipp Knüsel, Reto Wyss, Peter König, and Paul F M J Verschure. Decoding a temporal population code. Neural Computation, 16(10):2079–100, October 2004. ISSN 0899-7667. doi: 10.1162/0899766041732459. URL http://portal.acm.org/citation.cfm?id=1119706.1119711. 16, 24

Philipp Knusel, Mikael A Carlsson, Bill S Hansson, Tim C Pearce, and Paul F M J Verschure. Time and space are complementary encoding dimensions in the moth antennal lobe. Network, 18(1):35–62, 2007. ISSN 0954-898X. doi: 10.1080/09548980701242573. 15, 24

Olivier Koch, Matthew R Walter, Albert S Huang, and Seth Teller. Ground robot navigation using uncalibrated cameras. In 2010 IEEE International Conference on Robotics and Automation, pages 2423–2430. IEEE, May 2010. ISBN 978-1-4244-5038-1. doi: 10.1109/ROBOT.2010.5509325. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5509325. 90

Randal A Koene and Michael E Hasselmo. An integrate-and-fire model of prefrontal cortex neuronal activity during performance of goal-directed decision making. Cerebral Cortex, 15(12):1964–81, December 2005. ISSN 1047-3211. doi: 10.1093/cercor/bhi072. URL http://cercor.oxfordjournals.org/cgi/content/abstract/15/12/1964. 33

Gord Kurtenbach and Eric Hulteen. Gestures in Human-Computer Communication. In Brenda Laurel, editor, The Art and Science of Interface Design, pages 309–317. Addison-Wesley Publishing Co., Boston, 1 edition, 1990. 75

Stephane Lallée, Séverin Lemaignan, Alexander Lenz, Chris Melhuish, Lorenzo Natale, S Skachek, T Van Der Zant, Felix Warneken, and P F Dominey. Towards a Platform-Independent Cooperative Human-Robot Interaction System: I. Perception. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4444–4451, 2010. 65, 126

M Laubach. Wavelet-based processing of neuronal spike trains prior to discriminant analysis. Journal of Neuroscience Methods, 134(2):159–168, April 2004. ISSN 0165-0270. doi: 10.1016/j.jneumeth.2003.11.007. URL http://dx.doi.org/10.1016/j.jneumeth.2003.11.007. 48

Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto, Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl, Ulysses Bernardet, and Paul Verschure. Disembodied and Collaborative Musical Interaction in the Multimodal Brain Orchestra. In Proceedings of the International Conference on New Interfaces for Musical Expression, 2010. URL http://www.citeulike.org/user/slegroux/article/8492764.
A B Lee, B Blais, H Z Shouval, and L N Cooper. Statistics of lateral geniculate nucleus (LGN) activity determine the segregation of ON/OFF subfields for simple cells in visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 97(23):12875–9, November 2000. ISSN 0027-8424. doi: 10.1073/pnas.97.23.12875. URL http://www.pnas.org/cgi/content/abstract/97/23/12875. 4

K C Lee, J Ho, and D Kriegman. Acquiring Linear Subspaces for Face Recognition under Variable Lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005. xxviii, xxxiii, 19, 66, 67, 72

Dong Zhang Li, S. Z. Li, and Daniel Gatica-Perez. Real-Time Face Detection Using Boosting in Hierarchical Feature Spaces, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2088. 65

Bao-hua Liu, Guangying K Wu, Robert Arbuckle, Huizhong W Tao, and Li I Zhang. Defining cortical frequency tuning with recurrent excitatory circuitry. Nature Neuroscience, 10(12):1594–600, December 2007. ISSN 1097-6256. doi: 10.1038/nn2012. URL http://dx.doi.org/10.1038/nn2012. 90

N K Logothetis and D L Sheinberg. Visual object recognition. Annual Review of Neuroscience, 19:577–621, January 1996. ISSN 0147-006X. doi: 10.1146/annurev.ne.19.030196.003045. URL http://www.ncbi.nlm.nih.gov/pubmed/8833455. 9

David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94. URL http://portal.acm.org/citation.cfm?id=993451.996342. 73

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009. ISSN 1574-0137. doi: 10.1016/j.cosrev.2009.03.005. URL http://linkinghub.elsevier.com/retrieve/pii/S1574013709000173. 24, 91

Andre Luvizotto, César Rennó-Costa, Ugo Pattacini, and Paul Verschure. The encoding of complex visual stimuli by a canonical model of the primary visual cortex: temporal population coding for face recognition on the iCub robot. In IEEE International Conference on Robotics and Biomimetics, page 6, Thailand, 2011. 19, 23, 76, 81, 86, 92, 132

Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, and et al. Experimental and Functional Android Assistant: I. A Novel Architecture for a Controlled Human-Robot Interaction Environment. In IROS 2012 (Submitted), 2012a. 61

Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A framework for mobile robot navigation using a temporal population code. In Springer Lecture Notes in Computer Science - Living Machines, page 12, 2012b. 20

Andre Luvizotto, César Rennó-Costa, and Paul F.M.J. Verschure. A wavelet based neural model to optimize and read out a temporal population code. Frontiers in Computational Neuroscience, 6(21):14, 2012c. 17, 82
Sean P MacEvoy, Thomas R Tucker, and David Fitzpatrick. A precise form of divisive suppression supports population coding in the primary visual cortex. Nature Neuroscience, 12(5):637–45, May 2009. ISSN 1546-1726. doi: 10.1038/nn.2310. URL http://dx.doi.org/10.1038/nn.2310. 23

Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, Burlington, USA, 1998. 17, 25, 31, 82

Gary Marsat and Leonard Maler. Neural heterogeneity and efficient population codes for communication signals. Journal of Neurophysiology, 104(5):2543–55, November 2010. ISSN 1522-1598. doi: 10.1152/jn.00256.2010. URL http://jn.physiology.org/cgi/content/abstract/104/5/2543. 15

Marialuisa Martelli, Najib J. Majaj, and Denis G. Pelli. Are faces processed like words? A diagnostic test for recognition by parts. Journal of Vision, 5(1):58–70, 2005. 19, 55

Zenon Mathews, Sergi Bermúdez i Badia, and Paul F.M.J. Verschure. PASAR: An integrated model of prediction, anticipation, sensation, attention and response for artificial sensorimotor systems. Information Sciences, 186(1):1–19, March 2011. ISSN 0020-0255. doi: 10.1016/j.ins.2011.09.042. 94, 132

André Maurer, Micha Hersch, and Aude G Billard. Extended hopfield network for sequence learning: Application to gesture recognition. In Proceedings of ICANN, 1, 2005. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.7770. 76

David McNeill. Hand and Mind: What Gestures Reveal about Thought. University Of Chicago Press, 1996. ISBN 0226561348. URL http://www.amazon.com/Hand-Mind-Gestures-Reveal-Thought/dp/0226561348. 75

Giorgio Metta, Paul Fitzpatrick, and Lorenzo Natale. YARP: Yet Another Robot Platform. International Journal of Advanced Robotic Systems, 3(1), 2006. URL http://eris.liralab.it/yarp/. 64, 115

Giorgio Metta, Giulio Sandini, David Vernon, Lorenzo Natale, and Francesco Nori. The iCub humanoid robot: an open platform for research in embodied cognition. In Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems - PerMIS '08, page 50, New York, New York, USA, 2008. ACM Press. doi: 10.1145/1774674.1774683. 55, 63, 110, 114

Ethan M Meyers, David J Freedman, Gabriel Kreiman, Earl K Miller, and Tomaso Poggio. Dynamic population coding of category information in inferior temporal and prefrontal cortex. Journal of Neurophysiology, 100(3):1407–19, September 2008. ISSN 0022-3077. doi: 10.1152/jn.90248.2008. URL http://jn.physiology.org/cgi/content/abstract/100/3/1407. 23

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, and Louis Goldstein. Recognizing articulatory gestures from speech for robust speech recognition. The Journal of the Acoustical Society of America, 131(3):2270–87, March 2012. ISSN 1520-8524. doi: 10.1121/1.3682038. URL http://www.ncbi.nlm.nih.gov/pubmed/22423722. 75

V. Mohan, P. Morasso, G. Metta, and G. Sandini. A biomimetic, force-field based computational model for motion planning and bimanual coordination in humanoid robots. Autonomous Robots, 27(3):291–307, August 2009. ISSN 0929-5593. doi: 10.1007/s10514-009-9127-x. 63, 120
V B Mountcastle. Modality and topographic properties of single neurons of cat's somatic sensory cortex. Journal of Neurophysiology, 20(4):408–34, July 1957. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/13439410. 6

F.A. Mussa Ivaldi, P. Morasso, and R. Zaccaria. Kinematic networks. A distributed model for representing and regularizing motor redundancy. Biological Cybernetics, 60(1):1–16, November 1988. ISSN 0340-1200. 120

Jonathan J Nassi and Edward M Callaway. Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5):360–372, April 2009. ISSN 1471-003X. doi: 10.1038/nrn2619. URL http://dx.doi.org/10.1038/nrn2619. xxiv, 3, 4, 5, 6

Ian Nauhaus, Andrea Benucci, Matteo Carandini, and Dario L Ringach. Neuronal selectivity and local map structure in visual cortex. Neuron, 57(5):673–9, March 2008. ISSN 1097-4199. doi: 10.1016/j.neuron.2008.01.020. URL http://www.cell.com/neuron/fulltext/S0896-6273(08)00105-0. 7

Ian Nauhaus, Laura Busse, Matteo Carandini, and Dario L Ringach. Stimulus contrast modulates functional connectivity in visual cortex. Nature Neuroscience, 12(1):70–6, January 2009. ISSN 1546-1726. doi: 10.1038/nn.2232. URL http://dx.doi.org/10.1038/nn.2232. 7, 90

Andreas Nieder and Katharina Merten. A labeled-line code for small and large numerosities in the monkey prefrontal cortex. The Journal of Neuroscience, 27(22):5986–93, May 2007. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1056-07.2007. URL http://www.jneurosci.org/cgi/content/abstract/27/22/5986. 27

Behrad Noudoost, Mindy H Chang, Nicholas A Steinmetz, and Tirin Moore. Top-down control of visual attention. Current Opinion in Neurobiology, 20(2):183–90, April 2010. ISSN 1873-6882. doi: 10.1016/j.conb.2010.02.003. URL http://dx.doi.org/10.1016/j.conb.2010.02.003. 16

Selim Onat, Nora Nortmann, Sascha Rekauzke, Peter König, and Dirk Jancke. Independent encoding of grating motion across stationary feature maps in primary visual cortex visualized with voltage-sensitive dye imaging. NeuroImage, 55(4):1763–70, April 2011. ISSN 1095-9572. doi: 10.1016/j.neuroimage.2011.01.004. URL http://dx.doi.org/10.1016/j.neuroimage.2011.01.004. 17, 49, 103

R C O'Reilly. Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11):455–62, November 1998. ISSN 1364-6613. URL http://www.ncbi.nlm.nih.gov/pubmed/21227277. 10

C.P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision (ICCV). Narosa Publishing House, Bombay, India, January 1998. ISBN 81-7319-221-9. doi: 10.1109/ICCV.1998.710772. URL http://www.computer.org/portal/web/csdl/doi/10.1109/ICCV.1998.710772. 26

U Pattacini, F Nori, L Natale, G Metta, and G Sandini. An experimental evaluation of a novel minimum-jerk cartesian controller for humanoid robots. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1668–1674. IEEE, October 2010. ISBN 978-1-4244-6674-0. doi: 10.1109/IROS.2010.5650851. 63, 120

Maxime Petit, Stéphane Lallée, Jean-David Boucher, Grégoire Pointeau, Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini, Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron, Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris, and Peter Ford Dominey. The Coordinating Role of Language in Real-Time Multi-Modal Learning of Cooperative Tasks. IEEE Transactions on Autonomous Mental Development (TAMD), 2012. 125

N Pinto, Y Barhomi, D D Cox, and J J DiCarlo. Comparing state-of-the-art visual features on invariant object recognition tasks. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 463–470, 2011. doi: 10.1109/WACV.2011.5711540. 110
Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):e27, January 2008. ISSN 1553-7358. doi: 10.1371/journal.pcbi.0040027. URL http://dx.plos.org/10.1371/journal.pcbi.0040027. 2

Tomaso Poggio and Emilio Bizzi. Generalization in vision and motor control. Nature, 431(7010):768–74, October 2004. ISSN 1476-4687. doi: 10.1038/nature03014. URL http://dx.doi.org/10.1038/nature03014. 11, 54

P Reinagel and R C Reid. Temporal coding of visual information in the thalamus. The Journal of Neuroscience, 20(14):5392–400, July 2000. ISSN 0270-6474. URL http://www.ncbi.nlm.nih.gov/pubmed/10884324. 12

César Rennó-Costa, John E Lisman, and Paul F M J Verschure. The mechanism of rate remapping in the dentate gyrus. Neuron, 68(6):1051–8, December 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2010.11.024. URL http://www.cell.com/neuron/fulltext/S0896-6273(10)00940-2. 92, 96, 106

César Rennó-Costa, André L Luvizotto, Encarni Marcos, Armin Duff, Martí Sánchez-Fibla, and Paul F M J Verschure. Integrating Neuroscience-based Models Towards an Autonomous Biomimetic Synthetic. In 2011 IEEE International Conference on Robotics and Biomimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011. IEEE. 91, 97

César Rennó-Costa, André Luvizotto, Alberto Betella, Marti Sanchez Fibla, and Paul F. M. J. Verschure. Internal drive regulation of sensorimotor reflexes in the control of a catering assistant autonomous robot. In Lecture Notes in Artificial Intelligence: Living Machines, 2012.

M Riesenhuber and T Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–25, November 1999. ISSN 1097-6256. doi: 10.1038/14819. URL http://www.ncbi.nlm.nih.gov/pubmed/10526343. 2, 23

M Riesenhuber and T Poggio. Models of object recognition. Nature Neuroscience, 3 Suppl:1199–204, November 2000. ISSN 1097-6256. doi: 10.1038/81479. URL http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1199.html#B47. 2

Dario L Ringach. Spatial Structure and Symmetry of Simple-Cell Receptive Fields in Macaque Primary Visual Cortex. Journal of Neurophysiology, 88(1):455–463, 2002. URL http://jn.physiology.org/cgi/content/abstract/88/1/455. 25

Dario L. Ringach, Robert M. Shapley, and Michael J. Hawken. Orientation Selectivity in Macaque V1: Diversity and Laminar Dependence. The Journal of Neuroscience, 22(13):5639–5651, July 2002. URL http://www.jneurosci.org/cgi/content/abstract/22/13/5639. 7

R W Rodieck and J Stone. Analysis of receptive fields of cat retinal ganglion cells. Journal of Neurophysiology, 28(5):833, 1965. ISSN 0022-3077. URL http://jn.physiology.org/cgi/reprint/28/5/833.pdf. 6, 27, 94

F Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, November 1958. ISSN 0033-295X. URL http://www.ncbi.nlm.nih.gov/pubmed/13602029. 2

David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations (Volume 1). A Bradford Book, 1986. ISBN 0262181207. URL http://www.amazon.com/Parallel-Distributed-Processing-Explorations-Microstructure/dp/0262181207. 2
P A Salin, J Bullier, and H Kennedy. Convergence and divergence in the
afferent projections to cat area 17. The Journal of comparative neurology, 283(4):486–512, May 1989. ISSN 0021-9967. doi: 10.1002/cne.
902830405. URL http://www.ncbi.nlm.nih.gov/pubmed/2745751. 8
Jason M Samonds and A B Bonds. From another angle: Differences in
cortical coding between fine and coarse discrimination of orientation.
152
bibliography
Journal of neurophysiology, 91(3):1193–202, March 2004. ISSN 00223077. doi: 10.1152/jn.00829.2003. URL http://www.ncbi.nlm.nih.
gov/pubmed/14614106. 23
Miguel Sarabia, Raquel Ros, and Yiannis Demiris. Towards an open-source
social middleware for humanoid robots. In 11th IEEE-RAS International Conference on Humanoid Robots, pages 670 – 675, Bled, 2011.
132
Alan B Saul. Temporal receptive field estimation using wavelets. Journal of Neuroscience Methods, 168(2):450–464, 2008. ISSN 0165-0270
(Print). doi: 10.1016/j.jneumeth.2007.11.014. 25
Dirk Schubert, Rolf Kötter, and Jochen F Staiger.
Mapping func-
tional connectivity in barrel-related columns reveals layer- and cell
type-specific microcircuits.
Brain structure & function, 212(2):107–
19, September 2007. ISSN 1863-2653. doi: 10.1007/s00429-007-0147-z.
URL http://www.ncbi.nlm.nih.gov/pubmed/17717691. 90
Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber,
and Tomaso Poggio.
Robust Object Recognition with Cortex-Like
Mechanisms. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29:411–426, 2007.
ISSN 0162-8828.
doi: http://doi.
ieeecomputersociety.org/10.1109/TPAMI.2007.56. 2, 23
Robert Shapley and Michael J Hawken. Color in the cortex: single- and
double-opponent cells. Vision research, 51(7):701–17, April 2011. ISSN
1878-5646. doi: 10.1016/j.visres.2011.02.012. URL /pmc/articles/
PMC3121536/?report=abstract. 5
S Shushruth, Pradeep Mangapathy, Jennifer M Ichida, Paul C Bressloff,
Lars Schwabe, and Alessandra Angelucci. Strong recurrent networks
compute the orientation tuning of surround modulation in the primate primary visual cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience, 32(1):308–21, January
2012. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.3789-11.2012. URL
http://www.jneurosci.org/content/32/1/308.full. 8
Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo.
153
bibliography
Robotics: Modelling, Planning and Control (Advanced Textbooks in Control and Signal Processing). Springer, 2011. ISBN 1846286417. 119
Kyriaki Sidiropoulou, Fang-Min Lu, Melissa A Fowler, Rui Xiao, Christopher Phillips, Emin D Ozkan, Michael X Zhu, Francis J White, and
Donald C Cooper. Dopamine modulates an mGluR5-mediated depolarization underlying prefrontal persistent activity. Nature Neuroscience,
12(2):190–199, January 2009. ISSN 1097-6256. doi: 10.1038/nn.2245.
URL http://dx.doi.org/10.1038/nn.2245. 33
Markus Siegel, Tobias H Donner, and Andreas K Engel. Spectral fingerprints of large-scale neuronal interactions. Nature reviews. Neuroscience,
13(2):121–34, February 2012. ISSN 1471-0048. doi: 10.1038/nrn3137.
URL http://dx.doi.org/10.1038/nrn3137. 17
Lawrence C Sincich and Jonathan C Horton.
The circuitry of V1
and V2: integration of color, form, and motion. Annual review of
neuroscience, 28:303–26, January 2005. ISSN 0147-006X. doi: 10.
1146/annurev.neuro.28.061604.135731.
URL http://www.ncbi.nlm.
nih.gov/pubmed/16022598. 8
Lawrence Sirovich and Marsha Meytlis. Symmetry, probability, and recognition in face space. Proceedings of the National Academy of Sciences of the United States of America, 106(17):6895–9, April 2009. ISSN 1091-6490. doi: 10.1073/pnas.0812680106. URL http://www.pnas.org/cgi/content/abstract/106/17/6895. 19, 55
Olaf Sporns and Jonathan D Zwi. The small world of the cerebral cortex. Neuroinformatics, 2(2):145–62, January 2004. ISSN 1539-2791. doi: 10.1385/NI:2:2:145. URL http://www.ncbi.nlm.nih.gov/pubmed/15319512. 7, 23
Dan D Stettler, Aniruddha Das, Jean Bennett, and Charles D Gilbert. Lateral Connectivity and Contextual Interactions in Macaque Primary Visual Cortex. Neuron, 36(4):739–750, November 2002. ISSN 0896-6273. doi: 10.1016/S0896-6273(02)01029-2. URL http://dx.doi.org/10.1016/S0896-6273(02)01029-2. 7, 31
Charles F Stevens. Preserving properties of object shape by computations in primary visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 101(43):15524–15529, 2004. doi: 10.1073/pnas.0406664101. URL http://www.pnas.org/content/101/43/15524.abstract. 25
Stephen Sutton, Ronald Cole, Jacques De Villiers, Johan Schalkwyk, Pieter Vermeulen, Mike Macon, Yonghong Yan, Ed Kaiser, Brian Rundle, Khaldoun Shobaki, Paul Hosom, Alex Kain, Johan Wouters, Dominic Massaro, and Michael Cohen. Universal Speech Tools: The CSLU Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 7:3221–3224, 1998. 65, 122
Carolina Tamara and William Timberlake. Route and landmark learning by rats searching for food. Behavioural processes, 86(1):125–32, January 2011. ISSN 1872-8308. doi: 10.1016/j.beproc.2010.10.007. URL http://dx.doi.org/10.1016/j.beproc.2010.10.007. 1
K Tanaka. Inferotemporal cortex and object vision. Annual review of neuroscience, 19:109–39, January 1996. ISSN 0147-006X. doi: 10.1146/annurev.ne.19.030196.000545. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.ne.19.030196.000545. 10, 11
S Thorpe, D Fize, and C Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–2, June 1996. ISSN 0028-0836. doi: 10.1038/381520a0. URL http://www.ncbi.nlm.nih.gov/pubmed/8632824. 17, 41, 68
Sebastian Thrun. Toward a framework for human-robot interaction. Human-Computer Interaction, 19, 2004. 110
Paul Tiesinga and Terrence J Sejnowski. Cortical enlightenment: are attentional gamma oscillations driven by ING or PING? Neuron, 63(6):727–32, September 2009a. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.09.009. URL http://dx.doi.org/10.1016/j.neuron.2009.09.009. 16
Paul Tiesinga and Terrence J Sejnowski. Cortical enlightenment: are attentional gamma oscillations driven by ING or PING? Neuron, 63(6):727–32, September 2009b. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.09.009. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2778762&tool=pmcentrez&rendertype=abstract. 104
Paul Tiesinga, Jean-Marc Fellous, and Terrence J Sejnowski. Regulation of spike timing in visual cortical circuits. Nature reviews. Neuroscience, 9(2):97–107, February 2008. ISSN 1471-0048. doi: 10.1038/nrn2315. URL http://dx.doi.org/10.1038/nrn2315. 12
Thomas Töllner, Michael Zehetleitner, Klaus Gramann, and Hermann J Müller. Stimulus saliency modulates pre-attentive processing speed in human visual cortex. PloS one, 6(1):e16276, January 2011. ISSN 1932-6203. doi: 10.1371/journal.pone.0016276. URL http://dx.plos.org/10.1371/journal.pone.0016276. 41
Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: the origins of cultural cognition. The Behavioral and brain sciences, 28(5):675–91; discussion 691–735, October 2005. ISSN 0140-525X. doi: 10.1017/S0140525X05000129. 125, 126
Doris Y Tsao, Winrich A Freiwald, Tamara A Knutsen, Joseph B Mandeville, and Roger B H Tootell. Faces and objects in macaque cerebral cortex. Nature neuroscience, 6(9):989–95, September 2003. ISSN 1097-6256. doi: 10.1038/nn1111. URL http://dx.doi.org/10.1038/nn1111. 10
Doris Y Tsao, Winrich A Freiwald, Roger B H Tootell, and Margaret S Livingstone. A cortical region consisting entirely of face-selective cells. Science (New York, N.Y.), 311(5761):670–4, February 2006. ISSN 1095-9203. doi: 10.1126/science.1119983. URL http://www.sciencemag.org/content/311/5761/670.abstract. 10
P F M J Verschure and P Althaus. A real-world rational agent: Unifying
old and new AI. Cognitive Science, 27:561–590, 2003. 61, 111, 112
Paul F M J Verschure, Thomas Voegtlin, and Rodney J Douglas. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425(6958):620–4, October 2003. ISSN 1476-4687. doi: 10.1038/nature02024. URL http://dx.doi.org/10.1038/nature02024. 27, 105
J Victor and K Purpura. Estimation of information in neuronal responses. Trends in Neurosciences, 22(12):543, 1999. ISSN 0166-2236 (Print). 41, 51
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Comput. Soc, Kauai, HI, USA, 2001. ISBN 0-7695-1272-0. doi: 10.1109/CVPR.2001.990517. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=990517. 26
Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, April 2005. ISSN 0025-5610. doi: 10.1007/s10107-004-0559-y. 120
T N Wiesel and D H Hubel. Spatial and chromatic interactions in the lateral geniculate body of the rhesus monkey. Journal of neurophysiology, 29(6):1115–56, November 1966. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/4961644. 4
Ben Willmore, Ryan J Prenger, Michael C-K Wu, and Jack L Gallant. The Berkeley wavelet transform: a biologically inspired orthogonal wavelet transform. Neural Computation, 20(6):1537–1564, 2008. ISSN 0899-7667 (Print). doi: 10.1162/neco.2007.05-07-513. 25
Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science (New York, N.Y.), 316(5831):1609–12, June 2007. ISSN 1095-9203. doi: 10.1126/science.1139597. URL http://www.sciencemag.org/content/316/5831/1609.abstract. 16
Reto Wyss and Paul F. M. J. Verschure. Bounded Invariance and the Formation of Place Fields. In Advances in Neural Information Processing Systems 16. MIT Press, 2004a. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.4379. 99
Reto Wyss and Paul F M J Verschure. Bounded Invariance and the Formation of Place Fields. In Neural Information Processing Systems 16, Vancouver, B.C. Canada, 2004b. 18, 22
Reto Wyss, Peter König, and Paul F M J Verschure. Invariant representations of visual patterns in a temporal population code. Proceedings of the National Academy of Sciences of the United States of America, 100(1):324–9, January 2003a. ISSN 0027-8424. doi: 10.1073/pnas.0136977100. URL http://www.pnas.org/cgi/content/abstract/100/1/324. 12, 13, 18, 22, 27, 51, 55, 56, 76, 77, 90, 91, 92, 94
Reto Wyss, Paul F M J Verschure, and Peter König. Properties of a temporal population code. Reviews in the neurosciences, 14(1-2):21–33, January 2003b. ISSN 0334-1763. URL http://www.ncbi.nlm.nih.gov/pubmed/12929915. xxiv, 12, 13, 18, 103
Reto Wyss, Paul F M J Verschure, and Peter König. Properties of a temporal population code. Reviews in the neurosciences, 14(1-2):21–33, January 2003c. ISSN 0334-1763. URL http://www.ncbi.nlm.nih.gov/pubmed/12929915. 22, 27, 35, 41, 50, 51, 55, 56, 77, 86, 90, 92, 94
Reto Wyss, Peter König, and Paul F M J Verschure. A model of the ventral visual system based on temporal stability and local memory. PLoS biology, 4(5):e120, May 2006. ISSN 1545-7885. doi: 10.1371/journal.pbio.0040120. URL http://dx.plos.org/10.1371/journal.pbio.0040120. 18, 91, 106
T Yousef, E Tóth, M Rausch, U T Eysel, and Z F Kisvárday. Topography of orientation centre connections in the primary visual cortex of the cat. Neuroreport, 12(8):1693–9, July 2001. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/11409741. 7
Jochen Zeil, Martin I. Hofmann, and Javaan S. Chahl. Catchment areas of panoramic snapshots in outdoor scenes. Journal of the Optical Society of America A, 20(3):450, March 2003. ISSN 1084-7529. doi: 10.1364/JOSAA.20.000450. URL http://josaa.osa.org/abstract.cfm?URI=josaa-20-3-450. 101, 107
S Zeki and S Shipp. The functional logic of cortical connections. Nature, 335(6188):311–7, September 1988. ISSN 0028-0836. doi: 10.1038/335311a0. URL http://dx.doi.org/10.1038/335311a0. 6
Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000. ISSN 0162-8828. doi: 10.1109/34.888718. 119
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys (CSUR), 35(4), 2003. ISSN 0360-0300. URL http://portal.acm.org/citation.cfm?id=954342. 19, 55
Davide Zoccolan, Nadja Oertelt, James J DiCarlo, and David D Cox. A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences of the United States of America, 106(21):8748–53, May 2009. ISSN 1091-6490. doi: 10.1073/pnas.0811583106. URL http://www.pnas.org/cgi/content/abstract/106/21/8748. 1