Temporal Dependent Plasticity:
An Information Theoretic Account
Gal Chechik
and
Naftali Tishby
Institute of Computer Science and Engineering
and the Interdisciplinary Center for Neural Computation
Hebrew University, Jerusalem, Israel
[email protected] [email protected]
Abstract
The fundamental paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre- and postsynaptic spikes [Markram et al. 1997; Zhang et al. 1998]. While this type of plasticity has already been studied in various computational frameworks, a generic computational motivation or its derivation from first principles is still missing. In this paper we derive temporally dependent learning from the basic principle of mutual information maximization and study its relation to the experimentally observed plasticity. The comparison shows that not only is the biological learning rule similar in form to the analytically derived one, but it also increases mutual information to a near-optimal level. The analysis yields new insights into the temporal characteristics of the observed learning rule, and experimental predictions as to the time constants involved, depending on neuronal biophysical parameters.
1 Introduction
Hebbian plasticity, the major learning paradigm in neuroscience, was until recently interpreted as learning driven by correlated neuronal activity. It has recently received a novel interpretation with the finding that changes in synaptic efficacies depend strongly on the relative timing of the pre- and postsynaptic spikes: the efficacy of a synapse between two excitatory neurons increases when the presynaptic spike precedes the postsynaptic one, but decreases when the converse is true [1]. Moreover, experiments both in hippocampal slices [2] and in retino-tectal synapses in vivo [3] have provided a quantitative characterization of synaptic efficacy changes as a function of the temporal difference between the pre- and postsynaptic spikes, with millisecond temporal resolution. These data showed that the magnitude of changes in synaptic efficacy decays exponentially with this temporal difference.
This novel learning rule, termed temporally dependent plasticity (TDP), has been studied in various computational frameworks, showing that it maintains irregularity of neuronal firing [4], normalizes synaptic efficacies [5], leads to synchronous subpopulation firing in a recurrent network [6], and plays an important role in sequence learning [7, 8, 9]. However, a generic computational motivation for this learning rule, or its derivation from basic principles, is still missing.
A natural candidate for this purpose is the fundamental concept of input-output
mutual information maximization. This idea, known as the Infomax principle [10],
states that the goal of a neural network's learning procedure is to maximize the
mutual information between its output and input. The information maximization
principle serves as a unifying learning goal for both supervised and unsupervised
learning, with applications ranging from self-organization of neural networks to
associative memory.
In this paper we derive an optimal TDP rule that maximizes mutual information for a leaky-integrator neuron with spiking inputs. We then compare the analytically derived rule with the experimentally observed one, showing that not only does the biological learning rule have a general form similar to the analytically derived TDP rule, but it also increases mutual information to a near-optimal level. This derivation of TDP from basic principles yields quantitative experimental predictions with regard to the shape of the learning curve, as well as new insights into its dynamics.
The following section describes our model. Section 3 derives an Infomax gradient-ascent rule for a generic supervised learning task, which is extended to the unsupervised case in Section 4.
2 The Model
We study a generic learning task in a network with $N \gg 1$ binary $\{0,1\}$ spiking input neurons and a single output (target) neuron. The output neuron is a perfect leaky integrator, and the inputs are Poisson spike trains. At any point in time, the target neuron accumulates its inputs with exponential decay due to voltage attenuation

$$Y(t) = \sum_{i=1}^{N} W_i X_i(t) \; ; \qquad X_i(t) = \int_{-\infty}^{t} e^{-(t-t')/\tau} X'_i(t')\,dt' \qquad (1)$$

where $W_i$ is the synaptic efficacy between the $i$th input neuron and the target neuron, $X'_i(t) = \sum_{t_{spike}} \delta(t - t_{spike})$ is the $i$th spike train, $X_i(t)$ is the total weighted input of the $i$th unit until time $t$, and $\tau$ is the membrane time constant. The learning task is to discriminate $M$ random input patterns $\xi^\eta \in \mathbb{R}_+^N$, $\eta = 0 \ldots M$, by setting the weights appropriately. Each pattern $\xi^\eta$ determines the firing rates of the input neurons, and $X'$ is then a noisy realization of $\xi^\eta$ due to the stochasticity of the Poisson process. The patterns are assumed to be uncorrelated. Clearly, the model can be extended to several target neurons, as in the case of hetero-associative memory networks.
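To make the model concrete, the following minimal Python sketch simulates Eq. 1 with time discretized into bins, much as in the simulations of Figure 1 below. The bin size, rates, and weight initialization here are illustrative assumptions, not values prescribed by the model.

```python
import numpy as np

def simulate_output(W, rates, T=50, dt=1.0, tau=10.0, rng=None):
    """Y(T) of Eq. 1: leaky-integrator response to Poisson spike trains.

    W     -- synaptic efficacies W_i, shape (N,)
    rates -- per-bin firing probabilities of the N inputs (one pattern xi)
    T     -- presentation period; tau -- membrane time constant (same units).
    """
    rng = rng if rng is not None else np.random.default_rng()
    X = np.zeros(len(W))                        # filtered inputs X_i(t)
    for _ in range(int(T / dt)):
        spikes = rng.random(len(W)) < rates     # binary {0,1} spike trains X'_i
        X = X * np.exp(-dt / tau) + spikes      # exponential decay plus new spikes
    return W @ X                                # output Y of Eq. 1

# Example: a pattern where 10% "foreground" inputs fire at a high rate
rng = np.random.default_rng(0)
N = 1000
W = rng.uniform(0.0, 0.02, size=N)              # random positive weights
rates = np.where(np.arange(N) < N // 10, 0.4, 0.05)
print(simulate_output(W, rates, rng=rng))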
3 Supervised Learning
3.1 Gradient Ascent Learning Rule
We consider the case where input patterns are presented for periods of length $T$, on the order of a few membrane time constants $\tau$. Within each period, a pattern $\xi^\eta$ is randomly chosen for presentation with probability $q_\eta$, where most of the patterns are rare ($\sum_{\eta=1}^{M} q_\eta \ll 1$) and the pattern $\xi^0$ is abundant. This reflects the biological scenario in which a neuron is continuously presented with noisy background activity (corresponding here to $\xi^0$), and is only rarely presented with one of its salient stimuli (corresponding to $\xi^\eta$, $\eta \geq 1$). However, we do not assume that the statistical properties of $\xi^0$ differ from those of the $\xi^\eta$.
Let us focus on a single presentation period, omitting the notation of $t$, and look at the value of $Y$ at the end of this period:

$$Y \equiv \sum_{i=1}^{N} W_i \int_{0}^{T} e^{-(T-t')/\tau}\, X'_i(t')\,dt' = \sum_{i=1}^{N} W_i X_i \; . \qquad (2)$$
The input-output mutual information [11] in this network is defined by

$$I(Y;\eta) = h(Y) - h(Y \mid \eta) \; ; \qquad h(Y) = -\int f(y)\, \log f(y)\,dy \qquad (3)$$

where the first term is the differential entropy of the distribution of $Y$, the second term is the differential entropy of the $Y$ distribution given that the network is presented with a known input pattern, and $f(y)$ is the probability density function of $Y$.

As the input neurons fire independently and their number is large, the input of the target neuron when the network is presented with the pattern $\xi^\eta$ is normally distributed, $f(Y \mid \eta) = N(\mu_\eta, \sigma_\eta^2)$, with mean $\mu_\eta = E_\eta[W \cdot X]$ and variance $\sigma_\eta^2 = E_\eta[(W \cdot X)(W \cdot X)^T] - E_\eta^2[W \cdot X]$, where averaging is over the possible realizations of the inputs $X$ when the network is presented with the pattern $\xi^\eta$. To calculate the entropy of $Y$ we note that $f(Y)$ is a mixture of Gaussians (each resulting from the presentation of an input pattern), $f(Y) = \sum_\eta q_\eta f(Y \mid \eta)$, and use the assumption that $\sum_{\eta=1}^{M} q_\eta$ is small to approximate the entropy of this distribution.
Differentiating the mutual information with respect to $W_i$, we obtain

$$\frac{\partial I(Y;\eta)}{\partial W_i} = \sum_{\eta=1}^{M} q_\eta \left[ K_\eta E_\eta(X_i) - K_0 E_0(X_i) \right] \qquad (4)$$

where the coefficients $K_\eta$ and $K_0$ are functions of the presentation probabilities and of the means $\mu_\eta, \mu_0$ and variances $\sigma_\eta^2, \sigma_0^2$ of the conditional output distributions.
This gradient on the mutual information manifold may be used for a "batch" gradient-ascent learning rule, by repeatedly changing $W_i$ according to $\Delta W_i \propto \partial I(Y;\eta)/\partial W_i$, while recalculating the distribution moments that depend on $W$.

What can be learned from the general form of this learning rule? First, the rule has two parts: it decreases weights that correspond to inputs firing more strongly at $\xi^0$, and increases the others. The main result of this process is an increase of the difference $\mu_\eta - \mu_0$, while constraining the variances $\sigma^2$, thus providing better discrimination of the $\xi^0$ pattern from the rest.¹ Thus, as in the case of real-valued inputs investigated in [12], a gradient-ascent rule on the mutual information manifold alternates between Hebbian learning, which increases the response to the signal patterns (in our model $\xi^\eta$, $\eta \geq 1$), and anti-Hebbian learning, which decreases the response to the background noise pattern.

¹ It can be shown that when starting from random positive weights, if $\mu_\eta > \mu_0$ then the $K$-terms are positive, the difference $\mu_\eta - \mu_0$ increases, and the weights remain positive. Clearly, a mirror solution also exists, with negative weights and $\mu_0 > \mu_\eta$.
Secondly, note that both the increase and the decrease of weights occur with a magnitude that depends on $\{q_\eta\}_{\eta=1}^{M}$. This can be interpreted as follows: each pattern $\xi^\eta$, $\eta \geq 1$, presented with probability $q_\eta$, contributes $K_\eta E_\eta(X_i) - K_0 E_0(X_i)$ to the gradient. It is therefore straightforward to implement the learning rule of Eq. 4 in an "on-line" manner by setting

$$\Delta W_i = K_\eta E_\eta(X_i) - K_0 E_0(X_i) \qquad (5)$$

whenever the pattern $\xi^\eta$ is presented, for $\eta = 1 \ldots M$ only. This rule implements a stochastic gradient ascent on the information manifold.

Interestingly, modifying the weights only when the patterns $\xi^\eta$, $\eta \geq 1$ are presented has the advantage of being robust to fluctuations in $\sum_{\eta=1}^{M} q_\eta$. This is in contrast with standard memory learning procedures, in which the weights are changed for every presented pattern but with alternating signs (a procedure that increases the weights when a pattern with $\eta > 0$ is presented and decreases them when $\eta = 0$ is presented). Quantitative results are not shown here due to space considerations.
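The simulations below trace $I(Y;\eta)$ numerically as learning proceeds. As an illustration of how Eq. 3 can be evaluated for a mixture of Gaussians, here is a minimal Python sketch; the function name and the integration grid are our own choices, not part of the original model.

```python
import numpy as np

def mixture_mutual_info(q, mu, sigma2):
    """I(Y; eta) = h(Y) - h(Y | eta) of Eq. 3, in nats.

    q, mu, sigma2: presentation probabilities, means and variances of the
    Gaussian conditionals f(Y | eta) = N(mu_eta, sigma2_eta), shape (M+1,).
    """
    q, mu, sigma2 = map(np.asarray, (q, mu, sigma2))
    # h(Y | eta): entropy of each Gaussian conditional, averaged over patterns
    h_cond = np.sum(q * 0.5 * np.log(2.0 * np.pi * np.e * sigma2))
    # h(Y): entropy of the mixture f(Y) = sum_eta q_eta f(Y | eta),
    # evaluated by a simple Riemann sum on a fine grid
    s = np.sqrt(sigma2)
    y = np.linspace(np.min(mu - 6 * s), np.max(mu + 6 * s), 20001)
    f = sum(qe * np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
            for qe, m, v in zip(q, mu, sigma2))
    h_marg = -np.sum(f * np.log(f + 1e-300)) * (y[1] - y[0])
    return h_marg - h_cond

# A rare, well-separated pattern carries little information per presentation:
print(mixture_mutual_info(q=[0.999, 0.001], mu=[0.0, 3.0], sigma2=[1.0, 1.0]))
```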
3.2 A Simplified Learning Rule
Implementing the above gradient-ascent rule is clearly difficult for a biological system, since in order to calculate the $K$-terms the system has to estimate the distribution moments for all input patterns. We now consider two simplifications of the analytically derived rule, and will later show numerically that the simplified learning rule approximates the gradient-ascent learning rule fairly well. First, we ignore the $K$-coefficients in Eq. 5 and consider the learning rule

$$\Delta W_i = \alpha_+ E_\eta(X_i) - \alpha_0 E_0(X_i) \quad \text{when } \xi^\eta \text{ is presented}, \; \forall \eta = 1 \ldots M \; , \qquad (6)$$

where $\alpha_+, \alpha_0$ determine the learning rates but, unlike the $K$'s, do not depend on $W$. Again, no changes occur when the pattern $\xi^0$ is presented. Secondly, expectations are approximated by averaging over presentations. $E_\eta(X_i)$ can clearly be approximated by averaging over $X_i$ when $\xi^\eta$ is presented. $E_0(X_i)$ could similarly be estimated by averaging over a period in which the pattern $\xi^0$ was presented; however, having the experimentally observed TDP rule in mind, we choose here a different approximation. Using the fact that $\xi^0$ is highly frequent, and hence with high probability will be presented in the consecutive time period, we treat spikes in that period as emerging from $\xi^0$. These spikes can be weighted in many ways, all yielding similar results due to averaging; we arbitrarily choose uniform weighting for the sake of generality (but see also [8]). Substituting $X_i(t) = \int_{-\infty}^{t} e^{-(t-t')/\tau} X'_i(t')\,dt'$ and approximating $E_0(X_i)$ by the uniform average over the consecutive period, we conclude that the following learning rule approximates Eq. 6:

$$\Delta W_i(t) = \alpha_+ \int_{t-T}^{t} e^{-(t-t')/\tau}\, X'_i(t')\,dt' \;-\; \alpha_0 \,\frac{1}{T} \int_{t}^{t+T} X'_i(t')\,dt' \; , \qquad (7)$$

applied when $\xi^\eta$, $\eta \geq 1$ is presented.
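Read in discrete time, Eq. 7 weights presynaptic spikes preceding the learning event by a decaying exponential and spikes in the following window uniformly. Below is a minimal sketch under this reading; the array layout and default constants are our assumptions, with $\alpha_+$ and $\alpha_0$ taken from the Figure 1 parameters.

```python
import numpy as np

def tdp_update(W, pre_spikes, t, alpha_p=0.01, alpha_0=0.0125, T=50, tau=10.0):
    """One application of the simplified TDP rule of Eq. 7 at time bin t.

    pre_spikes -- binary array (N, n_bins): the input spike trains X'_i.
    Spikes in (t-T, t] potentiate with weight exp(-(t - t') / tau);
    spikes in (t, t+T] depress with uniform weight 1/T.
    """
    before = pre_spikes[:, max(t - T, 0):t]            # pre before the event
    lags = np.arange(before.shape[1], 0, -1)           # t - t' for each bin
    after = pre_spikes[:, t:t + T]                     # pre after the event
    dW = alpha_p * before @ np.exp(-lags / tau) - alpha_0 * after.sum(axis=1) / T
    return W + dW
```

The uniform depressing lobe mirrors the arbitrary weighting choice made in the text; an exponential weighting would serve equally well here.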
How good are these approximations? Does learning with the approximated rule increase mutual information? To answer these questions we performed a simulation in which we traced the input-output mutual information of a 1000-neuron network, with 101 patterns presented. The two upper curves in Figure 1 compare the mutual information under the learning rule of Eq. 4 with that of Eq. 7. Apparently, the approximated learning rule achieves fairly good results compared to the optimal rule. Convergence is slower, though, as Eq. 7 does not ascend along the information gradient (i.e., the steepest ascent), but climbs along more moderate slopes.
[Figure 1: input-output mutual information I(X,Y) (y-axis, 0 to 0.4) versus time steps (x-axis, 0 to 5000), for three learning rules: "Analytically derived", "Simplified Supervised", and "Unsupervised".]
Figure 1: Comparing optimal and approximated learning rules. Poisson spike trains were simulated by discretizing time into bins. Patterns were constructed by setting 0.1 of the input neurons to fire at a high rate, with probability 0.4 per bin, while the rest fire at a rate of 0.05 per bin. Remaining simulation parameters: $M = 100$, $N = 1000$, $q_0 = 0.9$, $q_\eta = 0.001$, $\alpha_+ = 0.01$, $\alpha_0 = 0.0125$. Repeating the simulation for different realizations of the spike trains yielded almost identical results, thus error bars were omitted.
These results thus show that a simplified TDP rule, similar in form to the experimentally observed TDP, yields near-optimal input-output mutual information.
4 The Unsupervised Case
4.1 Using Postsynaptic Spikes As Learning Signals
The analysis so far has concentrated on the supervised learning case, where the
identity of the presented pattern was used by the learning rule. Could these results
be extended to the unsupervised case?
A possible replacement for the teacher's learning signal is the postsynaptic spike: if spikes are elicited when the input exceeds a threshold, then in our model a postsynaptic spike should correspond to one of the patterns $\xi^\eta$, $\eta \geq 1$, while spike absence corresponds to $\xi^0$ (since $\mu_\eta > \mu_0$). This yields a learning procedure identical to Eq. 7, but this time learning is triggered by the postsynaptic spike. This learning rule simulates the experimentally observed TDP rule [2, 3], with the possible exception of the temporal shape of the weakening part, which is not constrained in our model.
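In code, the only change from the supervised case is the trigger. The sketch below reuses the hypothetical tdp_update from Section 3.2 and assumes the neuron spikes whenever $Y$ crosses a fixed threshold theta; all sizes, rates, and the threshold are illustrative assumptions, and for simplicity the update peeks at the following window rather than buffering spikes online.

```python
import numpy as np

# Unsupervised variant: the Eq. 7 update triggered by the neuron's own spikes.
rng = np.random.default_rng(1)
N, n_bins, tau, theta = 1000, 5000, 10.0, 5.0
W = rng.uniform(0.0, 0.02, size=N)
pre_spikes = (rng.random((N, n_bins)) < 0.05).astype(float)

X = np.zeros(N)                              # filtered inputs X_i(t), Eq. 1
for t in range(n_bins):
    X = X * np.exp(-1.0 / tau) + pre_spikes[:, t]
    if W @ X > theta:                        # postsynaptic spike: learn now
        W = tdp_update(W, pre_spikes, t)     # the sketch from Section 3.2
```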
Clearly, if spikes faithfully signal the presence of input patterns, this procedure should yield effective learning. But at the beginning of the learning procedure, when the weights are not appropriately set, spikes do not necessarily signal those patterns. What, then, is the effectiveness of this learning mechanism? To answer this question, we compared the proposed unsupervised learning rule with the two supervised rules derived earlier. Figure 1 shows that this spike-dependent learning rule does increase the input-output mutual information, and achieves reasonable results compared to the optimum. We conclude that the biological TDP rule approximates an optimal information-maximization learning rule for the unsupervised task.
It can be seen that the learning rate of the unsupervised procedure is much slower. This is because at early stages the neuron does not reliably signal the presentation of patterns, which makes learning highly noisy. However, as learning proceeds, the neuron's learning signal becomes more faithful, allowing the system to increase its input-output mutual information. How, then, does the system depend on its initial conditions? The next section shows that a range of parameters exists in which the neuron learns to respond to rare events regardless of initial conditions.
4.2 Detecting Rare Events
Consider for simplicity the case where two patterns $\xi^0, \xi^1$ ($q_0 \gg q_1$) are presented to a neuron that uses the unsupervised learning procedure described above. Each of the patterns has $k$ "foreground" neurons that fire at a high rate, while the remaining $N - k$ neurons fire at a low rate.

Note that if the neuron responds to the presentation of the pattern $\xi^0$ (which happens with probability $q_0$), then weights corresponding to its foreground neurons are strengthened by $\alpha_+$, but are then weakened on average by $\alpha_0$ with probability $q_0$ (due to the weight decrease on the following time step). A similar calculation for $\xi^1$ yields on average

$$\Delta W_i = \begin{cases} \alpha_+ - q_0\,\alpha_0 & \text{if the neuron spikes given } \xi^0 \\ \alpha_+ - q_1\,\alpha_0 & \text{if the neuron spikes given } \xi^1 \end{cases} \qquad (8)$$

As $q_0 > q_1$, setting the learning rates in the range

$$\frac{1}{q_1} > \frac{\alpha_0}{\alpha_+} > \frac{1}{q_0} \quad \text{yields} \quad q_0(\alpha_+ - q_0\,\alpha_0) < 0 \; , \quad q_1(\alpha_+ - q_1\,\alpha_0) > 0 \; . \qquad (9)$$

In this regime the weights corresponding to the foreground neurons of $\xi^0$ are decreased while those of $\xi^1$ are strengthened, leading to a spiking response of the neuron when presented with the rare pattern $\xi^1$. Thus, the learning-rates ratio $\alpha_+ / \alpha_0$ sets a cutoff that allows only patterns with probabilities $q_\eta < \alpha_+ / \alpha_0$ to be strengthened, while the rest are weakened. Therefore, under the correct parameter range, the system converges to the same functional steady state regardless of initial conditions.
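As a quick numeric check of condition (9), take the Figure 1 values and assume they are the learning rates $\alpha_+$ and $\alpha_0$: the ratio $\alpha_+/\alpha_0 = 0.8$ lies strictly between $q_1$ and $q_0$, so background foreground weights shrink while rare-pattern foreground weights grow.

```python
q0, q1 = 0.9, 0.001                     # abundant vs. rare pattern
alpha_p, alpha_0 = 0.01, 0.0125         # learning rates (Figure 1 values)
assert q1 < alpha_p / alpha_0 < q0      # the regime of Eq. 9
print(alpha_p - q0 * alpha_0)           # -0.00125 < 0: xi^0 foreground decays
print(alpha_p - q1 * alpha_0)           # +0.0099875 > 0: xi^1 foreground grows
```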
5 Discussion
We have taken an information-theoretic approach to the study of TDP, a novel type of plasticity recently observed in brain tissues. Within the Infomax framework, we have derived a TDP rule that maximizes mutual information in a spiking neural network, and compared it with the biological TDP rule. This comparison shows that not only is the biological learning rule similar in form to the analytically derived one, but it also increases mutual information to a near-optimal level.
This derivation provides a new computational insight into the shape of the learning curve, showing that the strengthening of synaptic efficacy should depend exponentially on the time difference between pre- and postsynaptic spikes, with a time constant determined by the time constant of the postsynaptic membrane. Indeed, the time constants of the experimentally observed TDP rule are on the scale of 10-20 milliseconds, which fits well the membrane time constant of cortical neurons. This yields directly testable experimental predictions as to the shape of the learning curve, depending on the biophysical parameters of the neuron: neurons with larger membrane time constants should exhibit larger time constants in their TDP learning curves. It should be noted, however, that the precise shape of the learning curve should depend on synaptic transfer functions and delays, which were not incorporated into the current analysis.
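To state the prediction concretely: under Eq. 7 the potentiation lobe of the learning curve decays as $e^{-\Delta t / \tau}$, so a neuron with $\tau = 20$ ms should show a potentiation window roughly twice as wide as one with $\tau = 10$ ms. The millisecond values below are illustrative only.

```python
import numpy as np

dt = np.arange(0, 61, 10)                        # pre-before-post lag, ms
for tau in (10.0, 20.0):                         # candidate membrane constants
    print(tau, np.round(np.exp(-dt / tau), 3))   # predicted relative potentiation
```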
Cortical inhibitory neurons have not been found to exhibit TDP. Under the view that inhibitory cortical neurons do not participate in spatial coding but serve to limit network activity, our analysis predicts that indeed no temporal learning should occur in these neurons, as they are not needed for pattern discrimination. However, it predicts that in systems where inhibitory neurons do participate in spatial coding, as in the olfactory system, TDP should be observed.
Our analysis was performed for a single target neuron, but can be extended to
the case of several target neurons as in the case of hetero-associative memory networks. Moreover, as the input-output mutual information determines the ability of
a neuron to discriminate its inputs, this analysis bears relevance to a wide range of
computational paradigms such as self-organization of neural networks.
Although TDP was suggested to support the possibility of temporal coding in the
brain, it is shown here that even in a learning task with no temporal coding, TDP
is necessary for learning. It remains an intriguing question to explore the learning
rules that maximize mutual information for models with temporal structures in
their input spike trains.
References
[1] H. Markram, J. Lubke, M. Frotscher, and B. Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213-215, 1997.
[2] G.Q. Bi and M.M. Poo. Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18:10464-10472, 1998.
[3] L. Zhang, H.W. Tao, C.E. Holt, W.A. Harris, and M.-m. Poo. A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395:37-44, 1998.
[4] L.F. Abbott and S. Song. Temporally asymmetric Hebbian learning, spike timing and neural response variability. In S.A. Solla and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 69-75. MIT Press, 1999.
[5] R. Kempter, W. Gerstner, and J.L. van Hemmen. Hebbian learning and spiking neurons. Phys. Rev. E, 59(4):4498-4514, 1999.
[6] D. Horn, N. Levy, E. Ruppin, and I. Meilijson. Distributed synchrony in a Hebbian cell assembly of spiking neurons. Submitted to Neural Computation, 2000.
[7] M.R. Mehta, M. Quirk, and M. Wilson. From hippocampus to V1: Effect of LTP on spatio-temporal dynamics of receptive fields. In J.M. Bower, editor, Computational Neuroscience: Trends in Research 1999. Elsevier, 1999.
[8] P.M. Munro and G. Hernandez. LTD facilitates learning in a noisy environment. In S.A. Solla, T.K. Leen, and K.R. Muller, editors, Advances in Neural Information Processing Systems 12. MIT Press, 1999.
[9] R. Rao and T. Sejnowski. Predictive sequence learning in recurrent neocortical circuits. In S.A. Solla, T.K. Leen, and K.R. Muller, editors, Advances in Neural Information Processing Systems 12. MIT Press, 1999.
[10] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.
[11] C.E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379-423, 1948.
[12] R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4(5):691-702, 1992.