J Neurophysiol 114: 3296–3305, 2015. First published October 7, 2015; doi:10.1152/jn.00378.2015.
Efficient reinforcement learning of a reservoir network model of parametric working memory achieved with a cluster population winner-take-all readout mechanism

Zhenbo Cheng,1,2 Zhidong Deng,1 Xiaolin Hu,1 Bo Zhang,1 and Tianming Yang3

1State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China; 2Department of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China; and 3Institute of Neuroscience, Key Laboratory of Primate Neurobiology, CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Submitted 17 April 2015; accepted in final form 7 October 2015
The brain often has to make decisions based on information stored in working memory, but the neural circuitry underlying working memory is not fully understood. Many theoretical efforts have focused on modeling the persistent delay period activity in the prefrontal areas that is believed to represent working memory. Recent experiments reveal that the delay period activity in the prefrontal cortex is neither static nor homogeneous as previously assumed. Models based on reservoir networks have been proposed to capture such dynamical activity patterns. The connections between neurons within a reservoir are random and do not require explicit tuning, and information storage does not depend on the stable states of the network. However, it is not clear how the encoded information can be retrieved for decision making with a biologically realistic algorithm. We therefore built a reservoir-based neural network to model the neuronal responses of the prefrontal cortex in a somatosensory delayed discrimination task. We first illustrate that the neurons in the reservoir exhibit the heterogeneous and dynamical delay period activity observed in previous experiments. Then we show that a cluster population circuit decodes the information from the reservoir with a winner-take-all mechanism and contributes to the decision making. Finally, we show that the model rapidly achieves good performance by shaping only the readout with reinforcement learning. Our model reproduces important features of previous behavioral and neurophysiological data and illustrates for the first time how task-specific information stored in a reservoir network can be retrieved with a biologically plausible reinforcement learning training scheme.

working memory; reservoir network; reinforcement learning
DECISION MAKING often requires one to compare sensory inputs
against contents stored in memory. For example, when shopping at a store, one often has to inspect the features of an item and compare them mentally against the features of another item one has just looked at. Such behavior requires the capability of storing sensory information in memory and retrieving it later for decision making.
Recent advances in neuroscience show that the persistent activity of neurons in the prefrontal cortex (PFC) plays an important
role in working memory (Baddeley 1992; Miller et al. 1996;
Padmanabhan and Urban 2010). There have been many theoretical efforts to explain the neural mechanisms of working memory
(Barak and Tsodyks 2014). Classic models assume that information in working memory is represented by the persistent activity of
neurons in stable network states, which are called attractors. The
attractor network models have received some experimental support (Deco et al. 2010; Machens et al. 2005; Miller et al. 2003;
Verguts 2007; Wang 2008). However, further studies show that
the persistent delay activity in the PFC is highly heterogeneous
and dynamical, which is not reflected by this category of recurrent
network models (Brody et al. 2003).
A different approach based on reservoir networks has recently
started to gain attention in the theoretical field (Buonomano and
Maass 2009; Laje and Buonomano 2013). The connections within a reservoir are fixed and task independent, and information is encoded by the network dynamics; retrieving task-relevant information depends on an appropriate readout mechanism. Reservoir networks can model the dynamical and heterogeneous neural activity observed in experimental data (Barak et al. 2013). However, it has been a challenge to implement the training of a task-dependent readout mechanism with a biologically plausible learning scheme, so the application of reservoir networks in modeling neural circuits in the brain remains controversial.
In the present study, we show a solution of the readout
and learning problem by modeling a somatosensory delayed
discrimination task (Romo and Salinas 2003) with a multilayer network. The key components of our network are a
reservoir network layer (RNL) and a cluster population layer
(CPL) serving as its readout interface (Fig. 1A). First, we
show that neurons in the RNL exhibit similar heterogeneity
and dynamics as the PFC neuronal activity reported by
Romo and colleagues (Brody et al. 2003; Hernandez et al.
2002; Jun et al. 2010). Second, we show that the CPL
neurons achieve appropriate readouts via a winner-take-all
mechanism within each cluster, with the most sensitive neurons selected for decision making by simple reinforcement learning. Finally, we demonstrate that the CPL greatly improves the efficiency of the reinforcement learning. Together, we create a biologically plausible working memory model based on a reservoir network with a readout layer that can be efficiently trained by reinforcement learning.
MATERIALS AND METHODS
Behavior task. We use the same behavior task as described by Romo
and colleagues (Hernandez et al. 2002; Romo and Salinas 2003). Two
vibration stimuli are presented with a delay in between. Each stimulus
lasts for 500 ms. The delay is set to 3,000 ms in the training sessions and
randomly varied between 3,000 ms and 4,000 ms during the testing
sessions. The first stimulus occurs at 500 ms after the trial starts. Each
trial lasts for 5,000 ms. The stimulus frequencies are randomly chosen
between 0.1 and 1.2 Hz. The task is to discriminate whether the frequency
of the first or the second vibration is larger.
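For concreteness, this trial protocol can be sketched in Python/NumPy as below; the 10-ms time grid, the padding after f2, and all names are our own choices rather than specifications from the paper.

```python
import numpy as np

def make_trial(rng, dt=0.01, training=True):
    """Input time course I(t) for one trial; times are in seconds."""
    f1, f2 = rng.uniform(0.1, 1.2, size=2)      # stimulus frequencies (Hz)
    delay = 3.0 if training else rng.uniform(3.0, 4.0)
    t_f1, stim = 0.5, 0.5                       # f1 onset and stimulus duration
    t_f2 = t_f1 + stim + delay                  # f2 onset
    t = np.arange(0.0, t_f2 + stim + 0.5, dt)   # pad 0.5 s after f2 offset
    I = np.zeros_like(t)                        # IL rate: frequency during stimulus, else 0
    I[(t >= t_f1) & (t < t_f1 + stim)] = f1
    I[(t >= t_f2) & (t < t_f2 + stim)] = f2
    return t, I, f1, f2

t, I, f1, f2 = make_trial(np.random.default_rng(0))
```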
Neural network model. The model consists of four layers: an input
layer (IL), an RNL, a CPL, and a decision making output layer (DML)
(Fig. 1A).
The IL is just one neuron, whose firing rate I is proportional to the
vibrational frequencies of stimulus inputs, mimicking the responses of
frequency-tuning cells found in S1 (Mountcastle et al. 1990; Salinas
et al. 2000). For simplicity, its firing rate I is set to the vibrational
frequency during the presentation of stimulus; otherwise, it is set to 0.
The input neuron projects to the RNL. The synaptic weight $w_i^{(I)}$ of the synapse between the input neuron and neuron i in the RNL circuit is set to 0 with a probability of 0.5. Nonzero weights are assigned independently from a standard uniform distribution on [0,1].
In the RNL, there are N = 1,000 neurons. Each neuron is described by an activation variable $x_i$ for i = 1, 2, ..., N, satisfying

$$\tau \frac{dx_i}{dt} = -x_i + g \sum_{j=1}^{N} w_{ij} y_j + w_i^{I} I + \sigma_{\text{noise}} dW_i \quad (1)$$

where τ = 500 ms represents the time constant, $w_{ij}$ is the synaptic weight between neurons i and j, $dW_i$ stands for white noise, and $\sigma_{\text{noise}}$ (= 0.001) is the noise variance. The firing rate $y_i$ of neuron i is a function of the activation variable $x_i$ relative to a fixed baseline firing rate $y_0$ (= 0.1) and the maximal rate $y_{\max}$ (= 1):

$$y_i = \begin{cases} y_0 + y_0 \tanh(x_i / y_0), & x_i \le 0 \\ y_0 + (y_{\max} - y_0) \tanh\left(x_i / (y_{\max} - y_0)\right), & x_i > 0 \end{cases} \quad (2)$$
The firing rate thus varies between 0 and 1. Neurons in the RNL circuit are sparsely connected with probability p = 0.1; in other words, the weight $w_{ij}$ of the synaptic connection between a presynaptic neuron j and a postsynaptic neuron i is set to and held at 0 with probability 0.9. Nonzero weights are chosen randomly and independently from a Gaussian distribution with zero mean and a variance of g²/N, where the gain g acts as the control parameter of the RNL circuit. In our simulations, we set g = 1.5.
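A minimal simulation sketch of Eqs. 1 and 2 follows, using simple Euler-Maruyama integration with a 1-ms step (our choice). The extracted equation shows a gain factor while the text places the gain in the weight variance; to avoid applying it twice, the sketch applies the gain once, through the weight variance, which is our reading.

```python
import numpy as np

N, g, tau, sigma = 1000, 1.5, 0.5, 0.001          # tau = 500 ms, in seconds
y0, ymax = 0.1, 1.0
rng = np.random.default_rng(1)

W = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # Gaussian weights, variance g^2/N
W[rng.random((N, N)) < 0.9] = 0.0                 # sparse connectivity, p = 0.1
w_in = rng.uniform(0.0, 1.0, size=N)              # input weights, uniform [0, 1]
w_in[rng.random(N) < 0.5] = 0.0                   # half of them set to 0

def rate(x):
    """Piecewise-tanh transfer function of Eq. 2; output lies in (0, 1)."""
    return np.where(x <= 0.0,
                    y0 + y0 * np.tanh(x / y0),
                    y0 + (ymax - y0) * np.tanh(x / (ymax - y0)))

def step(x, I, dt=0.001):
    """One Euler-Maruyama step of Eq. 1 with scalar input I."""
    drift = (-x + W @ rate(x) + w_in * I) / tau
    return x + drift * dt + (sigma / tau) * np.sqrt(dt) * rng.normal(size=N)
```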
The RNL neurons project to the CPL. The synaptic weight $w_{ki}^{(2)}$ for the synapse between neuron i in the RNL and neuron k in the CPL is set to 0 with probability $p_M$ = 0.9. Nonzero weights are set independently from a standard uniform distribution on [0,1]. The firing rate of neuron k in the CPL is given by

$$u_k = \sum_i w_{ki}^{(2)} y_i \quad \text{for } k = 1, 2, \ldots, M$$

There are M = 50,000 neurons in the CPL, grouped into 500 clusters; each cluster therefore has 100 neurons. Within each cluster, a winner-take-all mechanism selects a winning neuron as its output. The neuron with the maximal average response from the onset of f2 to the end of a trial is selected as the winner, and its firing rate is set to 1.0. The activity of all other neurons within the same cluster is set to 0 at the end of the trial.
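A sketch of the CPL readout and the within-cluster winner-take-all step, with sizes scaled down from the paper's N = 1,000 and M = 50,000 to keep the example light; all names are our own.

```python
import numpy as np

N, n_clusters, cluster_size = 200, 50, 20       # paper: N = 1,000; 500 clusters of 100
M = n_clusters * cluster_size
rng = np.random.default_rng(2)
W2 = rng.uniform(0.0, 1.0, size=(M, N))         # RNL -> CPL weights, uniform [0, 1]
W2[rng.random((M, N)) < 0.9] = 0.0              # zeroed with probability p_M = 0.9

def cpl_output(y_avg):
    """y_avg: RNL rates averaged from f2 onset to trial end (length N).
    Returns the CPL output: the winner of each cluster fires at 1, losers at 0."""
    u = (W2 @ y_avg).reshape(n_clusters, cluster_size)
    out = np.zeros_like(u)
    out[np.arange(n_clusters), u.argmax(axis=1)] = 1.0
    return out.ravel()
```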
The CPL projects to the final layer, the DML, which has two competing neurons that correspond to the choices f1 > f2 and f1 < f2, respectively. The firing rate of neuron l is

$$v_l = \sum_k w_{lk}^{(3)} u_k \quad \text{for } l = 1, 2 \quad (3)$$

where $w_{lk}^{(3)}$ represents the weight of the synapse between neuron k in the CPL circuit and neuron l in the decision making circuit. Synaptic weights are randomly initialized from a uniform distribution on [0, 0.001] and then updated based on a reward signal during the training phase. We hold these synaptic weights fixed in the test phase for analyses.
Fig. 1. A: schematic diagram of the model. The 4 layers in the model are the input layer (IL), the reservoir network layer (RNL), the cluster population readout layer (CPL), and the decision making output layer (DML). See MATERIALS AND METHODS for details. B: schematic diagram of the delayed discrimination task. Two vibration stimuli are presented, each lasting 0.5 s, with a 3-s delay between them. The task is to report which stimulus has the higher frequency. C: performance of the model. Each number indicates the average % of correct trials for a pair of stimuli (f1, f2) across all simulation rounds during the test phase. The vertical column is the performance when f1 is fixed at 0.6 Hz, and the horizontal column is the performance when f2 is fixed at 0.6 Hz. D: similar to C, but the difference between f1 and f2 is fixed at ±0.3 Hz. E: performance of the model when f1 (gray curve) or f2 (black curve) is fixed at 0.6 Hz and the other stimulus frequency changes. Error bars show SE.
The stochastic choice behavior of our model is described by a sigmoidal function:

$$p_1 = \frac{1}{1 + \exp(-\alpha \Delta v)} \quad (4)$$

where $p_1$ represents the probability of the choice f1 > f2, $\Delta v = v_1 - v_2$ is the difference between the firing rates of the two DML units, and α is set to 20.
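A sketch of the decision stage (Eqs. 3 and 4); mapping choice 0 to "f1 > f2" is our own convention.

```python
import numpy as np

rng = np.random.default_rng(3)
M, alpha = 1000, 20.0                         # M CPL outputs (scaled down)
W3 = rng.uniform(0.0, 0.001, size=(2, M))     # CPL -> DML weights, uniform [0, 0.001]

def decide(u, rng):
    """u: CPL output vector. Returns (choice, p1, v); choice 0 means f1 > f2."""
    v = W3 @ u                                # Eq. 3: v_l = sum_k w_lk u_k
    p1 = 1.0 / (1.0 + np.exp(-alpha * (v[0] - v[1])))   # Eq. 4
    return (0 if rng.random() < p1 else 1), p1, v
```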
Reinforcement learning. At the end of trial n, the plastic weights in Eq. 3 are updated for trial n + 1 as follows:

$$w_{lk}^{(3)}(n+1) = w_{lk}^{(3)}(n) + \Delta w_{lk} \quad (5)$$

The update term $\Delta w_{lk}$ depends on the reward prediction error and the responses of the neurons in the CPL circuit:

$$\Delta w_{lk} = \eta (r - E[r]) u_k v_l \quad (6)$$

where η (= 0.001) is the learning rate and r is the reward signal for the current trial (1 if a reward is given and 0 otherwise). E[r] denotes the average of the previously received rewards, which represents the predicted probability of making a correct choice given f1 and f2 for that trial (Law and Gold 2009; Seung 2003). Based on the choice for that trial, E[r] is set to the probability $p_1$ for the choice f1 > f2 and to 1 − $p_1$ for the choice f1 < f2. The difference between r and E[r] thus represents the reward prediction error. After each update, the weights $w_{lk}^{(3)}(n)$ are normalized by

$$w_{lk}^{(3)}(n) \leftarrow w_{lk}^{(3)}(n) \Big/ \sqrt{\sum_{k=1}^{M} \left[w_{lk}^{(3)}(n)\right]^2} \quad (7)$$

so that the vector length of $w_{lk}^{(3)}(n)$ remains constant throughout the training phase. The normalization keeps the weights from growing without bound (Royer and Pare 2003).
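A sketch of the update of Eqs. 5-7. The extracted text leaves the postsynaptic factor in Eq. 6 ambiguous (it prints "v1"); the sketch reads it as $v_l$ and updates both DML units, which is our interpretation.

```python
import numpy as np

eta = 0.001

def rl_update(W3, u, v, choice, p1, reward):
    """Reward-modulated update of the CPL->DML weights (Eqs. 5-7).
    reward is 1 or 0; E[r] is the predicted reward for the chosen option."""
    Er = p1 if choice == 0 else 1.0 - p1
    for l in range(2):
        W3[l] += eta * (reward - Er) * u * v[l]   # Eq. 6 (our reading of the v factor)
        W3[l] /= np.linalg.norm(W3[l])            # Eq. 7: fix the weight-vector length
    return W3
```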
Model without CPL. This model consists of only three layers: IL, RNL, and DML. The IL and RNL are the same as described above, but the neurons in the DML are directly connected with all neurons in the RNL. The firing rate of neuron l in the DML is

$$v_l = \sum_i w_{li}^{(3)} y_i \quad \text{for } l = 1, 2 \text{ and } i = 1, 2, \ldots, N \quad (8)$$

where $w_{li}^{(3)}$ represents the weight of the synapse between neuron i in the RNL circuit and neuron l in the DML. Synaptic weights are randomly initialized from a uniform distribution on [0, 0.001] and then updated as in the full model according to Eqs. 5–7.
Data analysis. The results we present are based on 50 simulation
rounds unless otherwise noted. In each round, 200 training trials are
run. Subsequently, the model is tested with 1,000 trials. All the
synaptic weights in the model are reset for each round. The stimulus
pairs used in the test phase might not all be presented during the
training phase. Our analyses reported here are done with the test data
sets, unless otherwise stated.
Chaos analysis. We determine whether the RNL is chaotic by calculating its maximal Lyapunov exponent (MLE) (Verstraeten et al. 2007). A positive MLE indicates that the network is chaotic. We numerically compute the MLE of the RNL as follows (Dechert et al. 1999). Let Y be the N-dimensional vector of the firing rates of the RNL neurons at a given time point. A copy $Y_c$ of Y is generated, and every component of $Y_c$ is perturbed by a small positive number such that the Euclidean distance between Y and $Y_c$ equals $d_0$ = 10⁻⁸. At each time step, Eq. 1 is solved twice to obtain the new values of Y and $Y_c$, and the distance $d_1$ between them is calculated. After each iteration, $Y_c$ is readjusted to lie in the direction of the vector $Y_c$ − Y but at separation $d_0$, by setting each component according to $Y_{cj} = Y_j + (d_0/d_1)(Y_{cj} - Y_j)$. The average of log|$d_1$/$d_0$| over 4 s gives the MLE.
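A sketch of this procedure for a deterministic one-step update x -> step(x) (for example, the RNL step above with input and noise held at zero, wrapped as lambda x: step(x, 0.0)); dividing the average log-expansion by the step size to obtain a rate is our interpretation of "the average of log|d1/d0| in 4 s".

```python
import numpy as np

def mle(step, x0, dt=0.001, T=4.0, d0=1e-8):
    """Numerical maximal Lyapunov exponent; positive values indicate chaos."""
    x = x0.copy()
    xc = x + d0 / np.sqrt(x.size)               # perturbed copy at distance d0
    logs = []
    for _ in range(int(T / dt)):
        x, xc = step(x), step(xc)               # advance both trajectories (Eq. 1)
        d1 = np.linalg.norm(xc - x)
        logs.append(np.log(d1 / d0))
        xc = x + (d0 / d1) * (xc - x)           # renormalize separation back to d0
    return np.mean(logs) / dt
```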
Simple linear regression model. Following the approach of Romo and Salinas (2003), we model the response of each neuron in the RNL during the f1 and delay periods as firing rate = b1 × f1 + b0. A neuron is considered frequency tuned if b1 is significantly different from zero over more than two-thirds of the time points during the period being considered. The neuron is called early encoding if b1 is significant over more than two-thirds of the time points during the first second but not during the last 2 s of the delay period. The neuron is called late encoding if b1 is significant over more than two-thirds of the time points during the last second but not during the first 2 s of the delay period. A neuron is called persistent encoding if b1 is significant over more than two-thirds of the time points during the entire delay period. Significance is determined at P < 0.01.

Full regression model and encoding types. We model the firing rates during the f2 period as a linear function of the frequencies of both stimuli, f1 and f2: FR(t) = b1(t)f1 + b2(t)f2 + b0(t), where t represents time and the coefficients b1(t) and b2(t) are calculated at each time point in a trial. To assess the coefficients b1 and b2 through multivariate regression analysis, we consider the confidence region of the regression, which reduces to an ellipse that specifies the uncertainty as follows (Jun et al. 2010):

$$\Delta\beta_1(t)^2 S_{11} + \Delta\beta_2(t)^2 S_{22} + 2\Delta\beta_1(t)\Delta\beta_2(t) S_{12} \le p\, s^2(t)\, F(p, \nu, 1-\alpha) \quad (9)$$

where $\Delta\beta_j \equiv \beta_j - b_j$ and

$$S_{jj} = \sum_{i=1}^{N_T} \left(f_j^{(i)} - \bar{f}_j\right)^2 \quad (10)$$

for j = 1, 2, and

$$S_{12} = \sum_{i=1}^{N_T} \left(f_1^{(i)} - \bar{f}_1\right)\left(f_2^{(i)} - \bar{f}_2\right) \quad (11)$$

where $N_T$ is the number of trials used in the regression, $\bar{f}_j$ is the mean frequency of the stimulus ensemble over the $N_T$ trials, and b = [b1, b2] is the center of an ellipse in the coordinates described by β = [β1, β2]. F(·) is the 1 − α percentile of the F distribution with p = 3 parameters and degrees of freedom ν = $N_T$ − 3. The variance is estimated from

$$s^2(t) = \left[\left(X b^T(t) - FR(t)\right)^T \left(X b^T(t) - FR(t)\right)\right] \big/ (3 N_T) \quad (12)$$

where the regression model X is defined in the same way as Equation 4 in Jun et al. (2010).

The encoding type of a neuron is characterized by b1 and b2. A neuron is defined as +f1 (−f1) encoding at a given time point if its response depends on f1 and b1 > 0 (b1 < 0) at that time point. Similarly, a neuron is +f2 (−f2) encoding if its response depends on f2 and b2 > 0 (b2 < 0) at that time point. A neuron is f1 − f2 (f2 − f1) encoding if b1 and b2 have opposite signs and b1 > 0 (b2 > 0).
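The encoding-type labels above reduce to sign checks on b1 and b2 once significance is known; a minimal sketch, with the significance flags assumed to come from the regression described above:

```python
def encoding_label(b1, b2, sig1, sig2):
    """Label one neuron at one time point from regression coefficients
    b1, b2 and their significance flags, per the definitions above."""
    if sig1 and sig2 and b1 * b2 < 0:       # opposite signs: difference coding
        return "f1-f2" if b1 > 0 else "f2-f1"
    if sig1:
        return "+f1" if b1 > 0 else "-f1"
    if sig2:
        return "+f2" if b2 > 0 else "-f2"
    return "untuned"
```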
Winner-take-all competition. We study the importance of the winner-take-all competition in the CPL by varying the competition strength with a softmax function. In each cluster, the average firing rate û of each neuron is computed from the onset of f2 to the end of a trial and then normalized as follows:

$$\bar{u} = \frac{1}{1 + e^{-\beta(\hat{u} - E(\hat{u}))}} \quad (13)$$

where the parameter β adjusts the competition strength and E(û) is the mean firing rate of all neurons within the cluster. As β increases, the normalization approaches the winner-take-all competition.
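Equation 13 is a logistic squashing of each neuron's average rate around the cluster mean. A one-line sketch follows; the sign convention, which pushes above-mean neurons toward 1 as β grows, is our reading of the extracted exponent:

```python
import numpy as np

def compete(u_hat, beta):
    """Graded within-cluster competition (Eq. 13). Large beta drives
    above-mean neurons toward 1 and below-mean neurons toward 0."""
    return 1.0 / (1.0 + np.exp(-beta * (u_hat - u_hat.mean())))
```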
Tuning index. We quantify the discriminative sensitivity of each neuron in the CPL by its tuning index. If a neuron wins its cluster a times when f1 is larger and b times when f2 is larger, its tuning index is defined as (a − b)/(a + b).
Neuronal correlation. To show how the CPL activity differentiates stimulus conditions, we calculate the correlation of the CPL neural activity between trials in which f1 and f2 should lead to the same decision (both f1 < f2 or both f1 > f2 trials) and between trials in which the stimuli should lead to different decisions (an f1 < f2 trial vs. an f1 > f2 trial). The calculation is based on 100 trials randomly picked from each run. The same trials are used in the calculation of both the same- and the different-condition correlations, but paired differently.
Fig. 2. A: performance of the model with different gains in the RNL. The dotted line indicates the transition to chaotic networks. B: performance of the model with different time constants (τ) of the RNL neurons. Each data point is based on 10 simulation runs.
RESULTS
Model's behavioral performance. Our model reproduces the main characteristics of monkeys' behavioral data in the delayed discrimination task (Hernandez et al. 2002). Performance in different conditions during the test phase is shown in Fig. 1, C–E. Similar to the monkeys' performance, the model's performance depends on the difference between the frequencies of the two stimuli, f1 and f2. It approaches 100% when the frequency difference is large and is at chance level when the two frequencies are equal (Fig. 1, C and E). The performance does not depend on the absolute magnitude of the stimulus frequencies (Fig. 1D; F(9,100) = 0.97, P > 0.1). We reset all synaptic weights of the model in each simulation round, and the behavioral results are not significantly different across rounds (LSD t-test, P > 0.1). Thus the network performance is robust and does not depend on initial conditions.
To further test the model's robustness, we vary two critical parameters of the RNL, namely the gain and the time constant, and calculate the model's performance. We find that the gain of the RNL does not critically affect the model's performance when it is between 0.5 and 3.5 (Fig. 2A). Sompolinsky et al. (1988) have shown that such a network is in a chaotic regime in the absence of external input when the gain of the weight matrix is >1. Similarly, we calculate the maximal Lyapunov exponent for the RNL network and find that the RNL is chaotic when the gain is >3. However, the model achieves a reasonable performance even with a large gain. The model's performance also does not depend critically on a specific setting of the time constant: it reaches >80% when the time constant is >100 ms (Fig. 2B).

Fig. 3. Neuronal responses of 6 example neurons from the RNL. The darkness of each curve indicates the stimulus frequency f1; darker grays indicate lower frequencies. The light gray box indicates the f1 presentation period, and the black bar above each plot indicates the time points when the neuron's response is significantly modulated by f1. A, C, and E: positive monotonic neurons. B, D, and F: negative monotonic neurons. A and B: early neurons. C and D: persistent neurons. E and F: late neurons. A grayscale map for the f1 frequency is shown at right.
Dynamical and heterogeneous representation of working
memory in RNL. One of the key aspects of the delayed
discrimination task is that it requires the maintenance of the
first stimulus f1 in working memory during the delay period.
The mnemonic representation of f1 is dynamical and heterogeneous during the delay period (Barak et al. 2010; Jun et al.
2010; Romo et al. 1999). Neurons in the RNL exhibit similar
features.
The responses of many neurons in the RNL are monotonically modulated by the f1 frequency during both the stimulus
and delay periods. Most of them fire persistently during the
delay period. To quantify the tuning, we use a simple linear
regression for each neuron (see MATERIALS AND METHODS for
details). Among the total of 1,000 neurons, on average 769 ± 22 (SE) neurons encode f1 during the f1 stimulus period and 687 ± 19 encode f1 during the delay period. The numbers of
positively and negatively tuned neurons do not differ
significantly.
The neuronal activity during the delay period is not static.
The activity patterns of six example neurons are shown in
Fig. 3. Some neurons encode f1 early in the delay period, but
the encoding disappears later during the delay period (early
neurons; average number of neurons = 22; Fig. 3, A and B). Some neurons exhibit sustained frequency encoding during the entire delay period (persistent neurons; average number of neurons = 200; Fig. 3, C and D). Finally, a number of neurons develop frequency encoding relatively late, during the last second of the delay (late neurons; average number of neurons = 39; Fig. 3, E and F).
We use the regression models to capture each neuron's frequency tuning at each time point during a trial (see MATERIALS AND METHODS for details). We find the same six types of neurons in the RNL as previously described in the PFC (Jun et al. 2010). Results from an example run are shown in Fig. 4 and Table 1.

Fig. 4. A: heterogeneity of single-neuron activity of the RNL in an example simulation run. Each row represents a neuron's tuning at different time points during a trial. Yellow (green) indicates time points when a neuron is negatively (positively) tuned to f1. Magenta (red) indicates time points when a neuron is negatively (positively) tuned to f2. Blue indicates time points when a neuron is tuned to f1 − f2, and cyan indicates time points when a neuron is tuned to f2 − f1. Black indicates time points when a neuron is tuned to neither the stimulus frequencies nor the frequency difference. Neurons are sorted into 6 categories; the numbers of neurons in each category for the example run are shown at right. The total number of neurons in the RNL is 1,000; only neurons that we can categorize are shown. B: the number of type V and VI neurons varies with the gain. Note that the trend is similar to the gain modulation of the model performance shown in Fig. 2A.

Table 1. Neurons of each type (%) compared between the model and previously reported data (Jun et al. 2010)

                              Type I   Type II   Type III   Type IV   Type V   Type VI   Unsorted
Model (n = 1,000)               6.8      5.5      12.5        7        2.5       1.5       64.2
Jun et al. (2010) (n = 912)     1.8      4.6       5.2       6.2      24.9      28.6       28.7

Neurons of each type (%) compared between the model and the previously reported data [numbers of type I–IV neurons are read off from Fig. 3 in Jun et al. (2010); numbers of type V and VI neurons are quoted directly from Jun et al. (2010)].

Type I and II neurons [average across 10 runs (same below):
n = 123, type I = 68, type II = 55] start to encode f1 tuning with relatively long latencies but maintain their f1 encoding throughout the delay period, even after f2 is presented. They differ only in the signs of their f1 tuning. Type III and IV neurons (n = 195, type III = 125, type IV = 70) have short latencies and are positively and negatively tuned to f1 during the delay period, respectively. However, as soon as the second stimulus is presented, they switch to encoding the frequency of f2. The signs of the f1 and f2 tunings for these types of neurons are always consistent. Type V and VI neurons (type V = 25, type VI = 15) initially show modulations similar to type I and II neurons. However, halfway through the delay period the signs of their modulation reverse. More interestingly, after f2 is presented, they begin to encode the difference between f1 and f2. These neurons provide useful information for the frequency difference discrimination task. A further 642 neurons may show frequency modulation at some point during the task but cannot be easily categorized into any of the above six groups. This result can be compared with the previous experimental data (Jun et al. 2010): the proportions differ greatly between the two, but each observed neuronal type exists in the RNL (Table 1).

The number of type V and VI neurons should correlate with the model's performance, as they encode the frequency difference between f1 and f2. Indeed, when we vary the gain in the RNL, we find that the number of type V and VI neurons varies similarly to the model performance (Fig. 4B). When the gain is set to 2, both the model's performance and the number of type V and VI neurons reach a maximum.
In addition to the great heterogeneity described above, the
RNL exhibits a dynamical encoding of stimulus frequencies.
To compare the network dynamics with previous findings in
the PFC neurons, we show how the population activity represents stimulus frequencies and how this representation evolves
during a trial. We calculate an instantaneous population state as
a vector of the firing rates of all neurons in the RNL circuit at
a given time point (Barak et al. 2010). For each time point in
the trial until the end of the delay period, we calculate the
correlation between the population states at two frequencies.
The correlation reflects how different stimuli drive the population response in the RNL to different states. The dynamics of
the network mimics that of the PFC neurons. At the onset of f1,
the correlations for all pairs of frequencies drop rapidly,
diverging according to the degree of the frequency difference
(Fig. 5A). The larger the difference, the faster the correlations drop and the lower the values they reach, indicating a higher level of response heterogeneity and a better differentiated encoding of the frequencies. The low correlation states last until the stimulus is switched off. Afterwards, the correlations again rise and
plateau toward the end of the delay period. The different levels
of correlations among different frequency pairs are maintained
but degrade throughout the delay period.
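The population-state correlation used here is a Pearson correlation across neurons at each time point; a minimal sketch, with hypothetical (time, N) rate arrays for two stimulus conditions:

```python
import numpy as np

def state_correlation(Ya, Yb):
    """Pearson correlation between two population states at each time point.
    Ya, Yb: (time, N) arrays of RNL firing rates for two stimulus conditions."""
    a = Ya - Ya.mean(axis=1, keepdims=True)
    b = Yb - Yb.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
```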
To further compare our model against the dynamics of neuronal activity during the delay period described previously, we follow Barak and colleagues' analysis and calculate two cross-correlations of the population states of the RNL neurons using two reference times (Barak et al. 2010). One is calculated against the population state at the onset of f1 (Fig. 5B, sensory), and the other against the population state at the end of the delay period (Fig. 5B, memory). The former is a measure of how well the stimulus encoding is maintained, while the latter shows how the memory representation evolves. Consistent with the experimental observations (Barak et al. 2010), the stimulus representation decays during the delay period while the memory representation develops. Both representations coexist in the network, suggesting a mixture of states.

Fig. 5. Dynamical population state in the RNL. A: average correlation coefficients of all pairs of stimuli. Frequency pairs with the same frequency difference are grouped together. Gray shades of the curves indicate the frequency difference. B: cross-correlation of the population responses. The black curve is calculated using the f1 stimulus onset time as the reference time; the dark gray curve is calculated using the end of the delay period as the reference time. Light gray shading indicates f1 presentation. The delay period lasts for 3.5 s in each trial.
Decoding information from RNL by CPL. In this section, we show how the CPL decodes the information from the RNL for decision making.

We examine the activity of the CPL neurons after the f2 stimulus is presented (Fig. 6). We quantify the discriminative sensitivity of each neuron with a tuning index that indicates how often it wins its cluster when f1 is larger and when f2 is larger (see MATERIALS AND METHODS). The index values of some neurons in the CPL circuit are close to 0, indicating that their winning is independent of the frequency order between f1 and f2. These neurons do not provide a useful readout for the task. Many neurons, however, win their competitions more consistently depending on which of the two frequencies is larger (Fig. 6A). These neurons are the biggest contributors to the learning. The fraction of clusters in which different neurons win the competition under the f1 > f2 and f1 < f2 conditions increases as the difference between the two frequencies grows (Fig. 6B).

Fig. 6. Activities of the CPL after f2 presentation. A: distribution of the CPL neurons' tuning indexes. The dashed line indicates the mean. B: fraction of the clusters in which the winning neuron depends on whether f1 or f2 is larger, plotted as a function of the frequency difference between f1 and f2. Error bars show SE.

The performance of the model depends on the parameters of the CPL. First of all, it grows with both the cluster size and the number of clusters in the CPL (Fig. 7A). With the total number of neurons fixed, however, the number of clusters turns out to be the larger factor in determining the model's performance. Moreover, the performance reaches >80% when the number of clusters exceeds 100, regardless of the cluster size. Second, the CPL depends critically on the winner-take-all mechanism. We vary the competition strength and find that the network performs poorly when the within-cluster competition is weak (Fig. 7B).

To further illustrate how the CPL helps decode the information from the RNL, we calculate and compare the Pearson correlation coefficients between pairs of neurons for both the CPL and the RNL. While the RNL neurons show high correlations, the distribution of correlation coefficients of the CPL neurons spans a much wider range (Fig. 7C). We further calculate the correlation of the activities of neurons in the CPL between pairs of trials whose f1 and f2 should lead to the same decision (f1 < f2 or f1 > f2) and between pairs of trials whose stimuli should lead to different decisions (f1 < f2 vs. f1 > f2). The CPL neuronal responses show a significantly smaller correlation between the pairs of trials in which the stimuli should lead to different decision outcomes (Fig. 7D). These results indicate that the CPL circuit achieves substantial pattern decorrelation, which facilitates the learning that we describe below.

Fig. 7. CPL network characteristics. A: performance of the model depends on the number of clusters and the number of neurons within each cluster of the CPL. Colors indicate the performance. Dotted lines are iso-number lines indicating the model's performance when the total number of neurons is 10,000, 20,000, or 30,000. Each data point is the average of 10 runs. B: model performance critically depends on the strength of the competition. Each data point is based on 10 rounds of simulation runs, each of which includes 200 training and 200 testing trials. C: Pearson correlation coefficients between the activities of all neurons of the RNL and the CPL at the end of f2 presentation. D: mean correlation coefficients of the CPL neurons for same-stimulus-condition trials and different-stimulus-condition trials. The inset histogram shows the distribution of the difference between the correlation coefficients of the 2 conditions.
Reinforcement learning. In the final stage of our model, the
CPL neurons project to the DML and form a decision. The
weights of the connections between the CPL and DML units are initially chosen at random and are updated during the training phase. The update depends on both the neuronal activity and
the prediction error. During training, the connections between
the DML and the neurons in the CPL with high tuning index
values that are consistent with the correct choice are reinforced
(Fig. 8A). Such changes are absent in the connections between
the DML and the neurons in the CPL with opposite tuning
indexes. Within the former group, the ones that have greater
frequency sensitivity have stronger synaptic weights and contribute more to the decision after the training (Fig. 8, A and B).
The synaptic weights increase rapidly within the first 50 trials
and slow down to a steady rate as the training progresses (Fig.
8C). In summary, the training results in an increasingly selective readout of the most sensitive neurons in the CPL network.
The CPL is critical for the whole network to achieve adequate performance via reinforcement learning. To illustrate the
importance of the CPL, we test the performance of a model that
does not contain a CPL. The output DML neurons in this
J Neurophysiol • doi:10.1152/jn.00378.2015 • www.jn.org
model are directly connected with all neurons in the RNL. The synaptic weights are initialized randomly and then adjusted by the reinforcement learning algorithm. For the first 100 training trials, the accuracy of the model without the CPL remains at chance level (Fig. 9). In contrast, the accuracy of the model with the CPL reaches 80% rapidly with the same amount of training.

Fig. 8. Changes in synaptic strengths between the CPL and the DML neurons during the training phase in an example simulation run. A: evolution of the synaptic weights between the CPL neurons and the DML neurons. Left: weights of the synapses between the CPL neurons and the DML neuron f1 > f2. Right: weights of the synapses between the CPL neurons and the DML neuron f1 < f2. Each column from top to bottom shows the synaptic weight of 1 CPL neuron from the 1st to the 200th training trial. CPL neurons are sorted by their average tuning indexes across the 200 training trials. For clarity, only neurons with tuning indexes >0.5 are shown. A color scale of the weight magnitudes is shown at right. B: synaptic weights of the connections between the CPL neurons with positive tuning indexes (f1 > f2) and the f1 > f2 DML neuron, plotted against their tuning index values. Black dots are the weights at the beginning of the training phase and red dots those at the end. C: evolution of the mean synaptic strength during training between the CPL neurons with positive tuning indexes (f1 > f2) and the f1 > f2 DML neuron.

Fig. 9. The CPL improves performance. A: learning curves of the models with and without the CPL. The shaded area indicates SE calculated from 10 simulation runs. B: network performance with or without the CPL after 100 training trials, averaged across 50 runs.
DISCUSSION

In this study, we first show that a reservoir network reproduces the response heterogeneity and temporal dynamics observed in the PFC during a delayed discrimination task (Hernandez et al. 2002; Jun et al. 2010; Romo et al. 1999). More importantly, we show that a simple cluster population circuit combined with a reinforcement learning algorithm decodes information from the reservoir and achieves good performance.

Reservoir networks have distinct features that are not found in attractor-based neural network models. A reservoir consists of a large number of neurons that are connected sparsely with random weights. In contrast to previously proposed attractor networks that model similar tasks (Deco et al. 2010; Machens et al. 2005; Miller and Wang 2006; Verguts 2007), the weights
of synapses in our model do not need to be precisely tuned, and the inputs do not depend on specific values (Deco et al. 2010) or gating mechanisms (Machens et al. 2005). The information is sparsely encoded by the reservoir network. The randomness creates a population of mixed selectivity, which has been suggested to be essential in task rule encoding (Rigotti et al. 2010). The encoding of information is not specific to the task; thereby information stored in the reservoir can be used for different behavioral tasks with different readout schemes.

The dynamics of the RNL is the key to the frequency difference tuning observed in some neurons (types V and VI in Fig. 4). These neurons switch their tuning direction for the f1 frequency during the delay period, suggesting that they receive recurrent inputs from neurons with opposite frequency tunings that arrive at different time points. This temporal difference also allows the representations of f1 and f2 to be combined into the frequency difference when the f2 stimulus is presented. Thus the dynamical nature of the RNL is important in encoding task-relevant information.

There are, however, important discrepancies between the RNL neurons and the experimental data previously reported. First of all, although we show that all types of neurons reported previously can be found in the RNL, their proportions differ significantly (Table 1). The type V and VI neurons compose only 4% of the total RNL population, whereas over half of the population were found to be of these two types in the experiment (Jun et al. 2010). This discrepancy cannot be eliminated within the range of parameters we have tested. It is not surprising, as the RNL is not tuned specifically for this task and does not benefit from the training. The neurons recorded in the previous study might reflect an overtrained PFC. Perceptual learning has been shown to have many different stages, which can be explained by plasticity in different brain areas (Shibata et al. 2014). The learning in our model may be considered only the first stage of learning, and the RNL in our model might correspond to a naive PFC. By introducing feedback connections from the CPL and the DML to the RNL, we may be able to train the RNL and obtain results that match the experimental data more closely (Sussillo and Abbott 2009). It would also be interesting to introduce short-term plasticity into the RNL and see how it affects the network's performance. It has been suggested that short-term plasticity may play a role in working memory (Itskov et al. 2011; Mongillo et al. 2008). For simplicity, we omit plasticity in our model, but our preliminary results and a previous study (Barak et al. 2013) suggest that adding plasticity can increase the stimulus encoding efficiency of the RNL. Second, the RNL is agnostic to the temporal structure of the task. For example, the temporal dynamics of type V and VI neurons do not completely mirror those of real neurons in the data: the sign flipping of some type V and VI neurons in the RNL occurs throughout the delay period, while real neurons only flip their tuning at the f1 offset and the f2 onset (Jun et al. 2010). Again, this discrepancy may be due to the lack of plasticity or feedback inputs in our reservoir model.
The heterogeneous and dynamical responses of the RNL create an obstacle for extracting relevant information from the network. Previous studies used supervised learning to train readout networks, which is not biologically feasible (Deng and Zhang 2007; Jaeger and Haas 2004; Joshi 2007; Sussillo and Abbott 2009). Biologically feasible algorithms such as reinforcement learning applied directly to the RNL are inefficient. In our network, reinforcement learning without the CPL has to depend on type V and VI neurons, which encode the difference between f1 and f2 but are few in the reservoir. The inputs from these relevant neurons to the DML neurons are overwhelmed by those from the many irrelevant neurons.

The winner-take-all cluster structure of the CPL is critical to achieving good performance. Each individual neuron in the CPL is essentially a randomly sampled readout of the reservoir. The winner-take-all mechanism within a cluster selects the neuron with the largest response; the losing neurons are not relevant in the learning. As long as different neurons win when f1 > f2 and when f1 < f2, the cluster is able to contribute during the learning. The winner-take-all mechanism amplifies the response difference between the two task conditions f1 > f2 and f1 < f2 and greatly enhances the learning efficiency.

The cluster structure of the CPL also helps to improve the temporal robustness of the network performance. We test the network performance with a variable delay period. As the reservoir is dynamic, the responses of its neurons may change dramatically over the time course of the task, especially when the gain is large. Any learning mechanism based directly on the reservoir neurons' responses tends to fail when a temporal variation is introduced into the task. In contrast, the winner-take-all mechanism used in the CPL depends more on differences between neurons' firing rates, which change relatively slowly compared with the instantaneous firing rates. Therefore the CPL is much more tolerant to the dynamics of the network and allows the model to achieve reasonable performance even when the gain is large and when temporal variations exist.

The CPL itself is independent of learning. Many CPL neurons do not contribute to this particular task and appear to be wasted. However, these apparently task-irrelevant neurons can be recruited by similar reinforcement learning rules for other tasks in which different stimulus parameters are relevant. Therefore, just as the RNL can serve as a task-independent generic model for working memory, the CPL can serve as a generic readout mechanism for the RNL. The combination of the two can be trained to perform different tasks with a task-specific reinforcement learning paradigm.

Several algorithms in the machine learning field share features of the CPL. Conceptually, our model is analogous to boosting algorithms used in machine learning (Freund and Schapire 1999). Each cluster within the CPL serves as a weak classifier. The reinforcement learning adjusts the weights assigned to each cluster based on reward feedback and finds a set of weights that combines the outputs from all the clusters into a much stronger classifier. The winner-take-all mechanism in the CPL is also similar to the max-pooling operation used in convolutional neural networks, which helps to improve the learning efficiency and the robustness of the network (LeCun et al. 2015).

The architecture of the CPL is similar to the columnar organization of neocortex, which has been demonstrated in many brain areas and has been suggested to play important roles in learning (Johansson and Lansner 2007; Lucke and von der Malsburg 2004). One possible neural substrate of the CPL is the basal ganglia (BG), where similar structural features can be found. Within the BG, the striatum receives extensive projections from the PFC. Most striatal neurons send converging axons to the lateral pallidum as their first target (Levesque and Parent 2005). Compared with 31 million striatal spiny neurons, there are far fewer pallidal neurons; it was estimated that on average each pallidal neuron receives inputs from ~100 striatal neurons (Percheron et al. 1984), suggesting a converging information flow and a clustering organization within the striatum (Bolam et al. 2000). In addition, the involvement of the BG in reinforcement learning has been extensively studied. The primary mechanism of learning in BG models is believed to depend on dopaminergic modulation of neurons that encode reward prediction error (Cohen and Frank 2009; Joel et al. 2002; Schultz et al. 1997). A network similar to the CPL can also be found in the locust olfactory circuit, where the Kenyon cells in the mushroom body help to decode information in a sparse manner (Perez-Orive et al. 2002; Rabinovich et al. 2000).

Our network model can be extended in several directions. First of all, we can create a spiking neural network model based on our present model. A spiking model provides further opportunities to study how reservoir networks can model the real brain. Second, our realization of the CPL is simplified: the winner-take-all mechanism is not implemented with a full-scale firing rate model. However, there are plenty of existing implementations of winner-take-all competition that are physiologically feasible (Gold and Shadlen 2001; Salinas and Abbott 1996; Wang 2002). Our CPL could be implemented with a network in which an inhibitory unit delivers a common inhibitory signal to all of the excitatory units within a cluster.

In summary, we demonstrate here a biologically feasible way to decode heterogeneous and dynamic information from a reservoir network. This opens up future possibilities for modeling various brain functions with reservoir networks. Further experimental work may shed light on whether the brain uses computational mechanisms similar to our model.
ACKNOWLEDGMENTS
We thank Bo Yang for his helpful discussions about echo state networks.
We are grateful to Cheng Chen, Chengyu Li, Bailu Si, Xiao-Jing Wang, and
Peter Dayan for their comments on the drafts.
GRANTS
This work is supported by Strategic Priority Research Program of the Chinese Academy of Sciences Grant XDB02050500, the CAS Hundreds of Talents Program, and the Shanghai Pujiang Program to T. Yang, and by the National Science Foundation of China (NSFC) under Grants 91420106, 90820305, 60775040, and 61005085 and the Natural Science Foundation of Zhejiang (ZJNSF) under Grant Y2111013 to Z. Cheng. We thank the National Supercomputer Center in Tianjin for providing computing resources.
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
Author contributions: Z.C. conception and design of research; Z.C. performed experiments; Z.C., Z.D., X.H., and T.Y. analyzed data; Z.C. and T.Y.
interpreted results of experiments; Z.C. and T.Y. prepared figures; Z.C. and
T.Y. drafted manuscript; Z.C., Z.D., X.H., B.Z., and T.Y. edited and revised
manuscript; Z.C., Z.D., X.H., B.Z., and T.Y. approved final version of
manuscript.
REFERENCES
Baddeley A. Working memory. Science 255: 556 –559, 1992.
Barak O, Sussillo D, Romo R, Tsodyks M, Abbott LF. From fixed points to
chaos: three models of delayed discrimination. Prog Neurobiol 103: 214 –
222, 2013.
Barak O, Tsodyks M. Working models of working memory. Curr Opin
Neurobiol 25C: 20 –24, 2014.
Barak O, Tsodyks M, Romo R. Neuronal population coding of parametric
working memory. J Neurosci 30: 9424 –9430, 2010.
Bolam JP, Hanley JJ, Booth PA, Bevan MD. Synaptic organisation of the
basal ganglia. J Anat 196: 527–542, 2000.
Brody CD, Hernandez A, Zainos A, Romo R. Timing and neural encoding
of somatosensory parametric working memory in macaque prefrontal cortex.
Cereb Cortex 13: 1196 –1207, 2003.
Buonomano DV, Maass W. State-dependent computations: spatiotemporal
processing in cortical networks. Nat Rev Neurosci 10: 113–125, 2009.
Cohen MX, Frank MJ. Neurocomputational models of basal ganglia function
in learning, memory and choice. Behav Brain Res 199: 141–156, 2009.
Dechert WD, Sprott JC, Albers DJ. On the probability of chaos in large
dynamical systems: a Monte Carlo study. J Econ Dyn Control 23: 1197–
1206, 1999.
Deco G, Rolls ET, Romo R. Synaptic dynamics and decision making. Proc
Natl Acad Sci USA 107: 7545–7549, 2010.
Deng Z, Zhang Y. Collective behavior of a small-world recurrent neural
system with scale-free distribution. IEEE Trans Neural Netw 18: 1364 –
1375, 2007.
Freund Y, Schapire RE. A short introduction to boosting. J Jpn Soc Artif
Intell 14: 771–780, 1999.
Gold JI, Shadlen MN. Neural computations that underlie decisions about
sensory stimuli. Trends Cogn Sci 5: 10 –16, 2001.
Hernandez A, Zainos A, Romo R. Temporal evolution of a decision-making
process in medial premotor cortex. Neuron 33: 959 –972, 2002.
Itskov PM, Vinnik E, Diamond ME. Hippocampal representation of touch-guided behavior in rats: persistent and independent traces of stimulus and reward location. PLoS One 6: e16462, 2011.
Jaeger H, Haas H. Harnessing nonlinearity: predicting chaotic systems and
saving energy in wireless communication. Science 304: 78 – 80, 2004.
Joel D, Niv Y, Ruppin E. Actor-critic models of the basal ganglia: new
anatomical and computational perspectives. Neural Netw 15: 535–547,
2002.
Johansson C, Lansner A. Imposing biological constraints onto an abstract
neocortical attractor network model. Neural Comput 19: 1871–1896, 2007.
Joshi P. From memory-based decisions to decision-based movements: a model
of interval discrimination followed by action selection. Neural Netw 20:
298 –311, 2007.
Jun JK, Miller P, Hernandez A, Zainos A, Lemus L, Brody CD, Romo R.
Heterogenous population coding of a short-term memory and decision task.
J Neurosci 30: 916 –929, 2010.
Laje R, Buonomano DV. Robust timing and motor patterns by taming chaos
in recurrent neural networks. Nat Neurosci 16: 925–933, 2013.
Law CT, Gold JI. Reinforcement learning can account for associative and
perceptual learning on a visual-decision task. Nat Neurosci 12: 655– 663,
2009.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 521: 436 – 444, 2015.
Levesque M, Parent A. The striatofugal fiber system in primates: a reevaluation of its organization based on single-axon tracing studies. Proc Natl
Acad Sci USA 102: 11888 –11893, 2005.
Lucke J, von der Malsburg C. Rapid processing and unsupervised learning
in a model of the cortical macrocolumn. Neural Comput 16: 501–533, 2004.
Machens CK, Romo R, Brody CD. Flexible control of mutual inhibition: a neural model of two-interval discrimination. Science 307: 1121–1124, 2005.
Miller EK, Erickson CA, Desimone R. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J Neurosci 16: 5154–5167, 1996.
Miller P, Brody CD, Romo R, Wang XJ. A recurrent network model of somatosensory parametric working memory in the prefrontal cortex. Cereb Cortex 13: 1208–1218, 2003.
Miller P, Wang XJ. Inhibitory control by an integral feedback signal in prefrontal cortex: a model of discrimination between sequential stimuli. Proc Natl Acad Sci USA 103: 201, 2006.
Mongillo G, Barak O, Tsodyks M. Synaptic theory of working memory. Science 319: 1543–1546, 2008.
Mountcastle VB, Steinmetz MA, Romo R. Frequency discrimination in the sense of flutter: psychophysical measurements correlated with postcentral events in behaving monkeys. J Neurosci 10: 3032–3044, 1990.
Padmanabhan K, Urban NN. Intrinsic biophysical diversity decorrelates neuronal firing while increasing information content. Nat Neurosci 13: 1276–1282, 2010.
Percheron G, Yelnik J, Francois C. A Golgi analysis of the primate globus pallidus. III. Spatial organization of the striato-pallidal complex. J Comp Neurol 227: 214–227, 1984.
Perez-Orive J, Mazor O, Turner GC, Cassenaer S, Wilson RI, Laurent G. Oscillations and sparsening of odor representations in the mushroom body. Science 297: 359–365, 2002.
Rabinovich MI, Huerta R, Volkovskii A, Abarbanel HD, Stopfer M, Laurent G. Dynamical coding of sensory information with competitive networks. J Physiol (Paris) 94: 465–471, 2000.
Rigotti M, Ben Dayan Rubin D, Wang XJ, Fusi S. Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses. Front Comput Neurosci 4: 24, 2010.
Romo R, Brody CD, Hernandez A, Lemus L. Neuronal correlates of
parametric working memory in the prefrontal cortex. Nature 399: 470 – 473,
1999.
Romo R, Salinas E. Flutter discrimination: neural codes, perception, memory
and decision making. Nat Rev Neurosci 4: 203–218, 2003.
Royer S, Pare D. Conservation of total synaptic weight through balanced
synaptic depression and potentiation. Nature 422: 518 –522, 2003.
Salinas E, Abbott LF. A model of multiplicative neural responses in parietal
cortex. Proc Natl Acad Sci USA 93: 11956 –11961, 1996.
Salinas E, Hernandez A, Zainos A, Romo R. Periodicity and firing rate as
candidate neural codes for the frequency of vibrotactile stimuli. J Neurosci
20: 5503–5515, 2000.
Schultz W, Dayan P, Montague PR. A neural substrate of prediction and
reward. Science 275: 1593–1599, 1997.
Seung HS. Learning in spiking neural networks by reinforcement of stochastic
synaptic transmission. Neuron 40: 1063–1073, 2003.
Shibata K, Sagi D, Watanabe T. Two-stage model in perceptual learning:
toward a unified theory. Ann NY Acad Sci 1316: 18 –28, 2014.
Sompolinsky H, Crisanti A, Sommers HJ. Chaos in random neural networks.
Phys Rev Lett 61: 259 –262, 1988.
Sussillo D, Abbott LF. Generating coherent patterns of activity from chaotic
neural networks. Neuron 63: 544 –557, 2009.
Verguts T. How to compare two quantities? A computational model of flutter
discrimination. J Cogn Neurosci 19: 409 – 419, 2007.
Verstraeten D, Schrauwen B, D’Haene M, Stroobandt D. An experimental
unification of reservoir computing methods. Neural Netw 20: 391– 403,
2007.
Wang XJ. Probabilistic decision making by slow reverberation in cortical
circuits. Neuron 36: 955–968, 2002.
Wang XJ. Decision making in recurrent neuronal circuits. Neuron 60: 215–
234, 2008.