Temporal Dependent Plasticity: An Information Theoretic Account

Gal Chechik and Naftali Tishby
Institute of Computer Science and Engineering and the Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel
[email protected] [email protected]

Abstract

The fundamental paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre- and postsynaptic spikes [Markram et al. 1997; Zhang et al. 1998]. While this type of plasticity has already been studied in various computational frameworks, a generic computational motivation for it, or its derivation from first principles, is still missing. In this paper we derive temporally dependent learning from the basic principle of mutual information maximization and study its relation to the experimentally observed plasticity. The comparison shows that not only is the biological learning rule similar in form to the analytically derived one, but it also increases mutual information to a near-optimal level. The analysis yields new insights into the temporal characteristics of the observed learning rule, as well as experimental predictions about the time constants involved, depending on neuronal biophysical parameters.

1 Introduction

Hebbian plasticity, the major learning paradigm in neuroscience, was until recently interpreted as learning driven by correlated neuronal activity. It has now received a novel interpretation with the finding that changes in synaptic efficacies depend strongly on the relative timing of the pre- and postsynaptic spikes: the efficacy of a synapse between two excitatory neurons increases when the presynaptic spike precedes the postsynaptic one, but decreases when the converse is true [1].
Moreover, experiments both in hippocampal slices [2] and in retino-tectal synapses of an anesthetized rat [3] have provided a quantitative characterization of synaptic efficacy changes as a function of the temporal difference between the pre- and postsynaptic spikes, with millisecond temporal resolution. These data showed that the magnitude of changes in synaptic efficacy decays exponentially with this temporal difference. This novel learning rule, termed temporally dependent plasticity (TDP), has been studied in various computational frameworks, showing that it maintains the irregularity of neuronal firing [4], normalizes synaptic efficacies [5], leads to synchronous subpopulation firing in a recurrent network [6], and plays an important role in sequence learning [7, 8, 9]. However, a generic computational motivation for this learning rule, or its derivation from basic principles, is still missing. A natural candidate for this purpose is the fundamental concept of input-output mutual information maximization. This idea, known as the Infomax principle [10], states that the goal of a neural network's learning procedure is to maximize the mutual information between its output and input. The information maximization principle serves as a unifying learning goal for both supervised and unsupervised learning, with applications ranging from self-organization of neural networks to associative memory. In this paper we derive an optimal TDP rule that maximizes mutual information for a leaky integrator neuron with spiking inputs. We then compare the analytically derived rule with the experimentally observed one, showing that not only does the biological learning rule have the same general form as the analytically derived TDP rule, but it also increases mutual information to a near-optimal level. This derivation of TDP from basic principles yields quantitative experimental predictions with regard to the shape of the learning curve, as well as new insights into its dynamics.
The following section describes our model. Section 3 derives an Infomax gradient ascent rule for a generic supervised learning task, which is extended to the unsupervised case in Section 4.

2 The Model

We study a generic learning task in a network with N >> 1 binary {0, 1} spiking input neurons and a single output (target) neuron. The output neuron is a perfect leaky integrator, and the inputs are Poisson spike trains. At any point in time, the target neuron accumulates its inputs with exponential decay due to voltage attenuation,

  Y(t) = \sum_{i=1}^{N} W_i X_i(t),    X_i(t) = \int_{-\infty}^{t} e^{-(t-t')/\tau} X_i^0(t') \, dt',   (1)

where W_i is the synaptic efficacy between the i-th input neuron and the target neuron, X_i^0(t) = \sum_{t_{spike}} \delta(t - t_{spike}) is the i-th spike train, X_i(t) is the total weighted input of the i-th unit until time t, and \tau is the membrane time constant. The learning task is to discriminate between M + 1 random input patterns \eta^\nu \in R_+^N, \nu = 0..M, by setting the weights appropriately. Each pattern determines the firing rates of the input neurons, and X^0 is then a noisy realization of \eta^\nu due to the stochasticity of the Poisson process. The patterns are assumed to be uncorrelated. Clearly, the model can be extended to several target neurons, as in the case of hetero-associative memory networks.

3 Supervised Learning

3.1 Gradient Ascent Learning Rule

We consider the case where input patterns are presented for periods of length T, on the order of magnitude of a few time constants \tau. Within each period, a pattern \eta^\nu is randomly chosen for presentation with probability q^\nu, where most of the patterns are rare (\sum_{\nu=1}^{M} q^\nu << 1) and the pattern \eta^0 is abundant. This reflects the biological scenario where a neuron is continuously presented with noisy background activity (corresponding here to \eta^0), and is only rarely presented with one of its salient stimuli (corresponding to \eta^\nu, \nu >= 1). However, we do not assume that the statistical properties of \eta^0 differ from those of \eta^\nu.
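To make the setting concrete, the leaky integration of Eq. 1 can be sketched as a short discrete-time simulation. This is a minimal illustration, not the authors' code; the firing rates follow the Figure 1 caption, while the bin width and the value of \tau (in bins) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000      # number of input neurons
tau = 10.0    # membrane time constant, in units of time bins (assumed)
dt = 1.0      # bin width (assumed)
T = 50        # presentation period, a few tau's long

# A pattern eta: per-bin firing probability for each input neuron
rates = np.full(N, 0.05)   # background rate
rates[: N // 10] = 0.4     # "foreground" neurons fire at a high rate

W = rng.uniform(0.0, 1.0, N)  # random positive synaptic efficacies

# X0[t, i] = 1 if input neuron i spikes in bin t (discretized Poisson train)
X0 = (rng.uniform(size=(T, N)) < rates * dt).astype(float)

# Leaky integration, Eq. 1: X_i(t) = sum_{t' <= t} exp(-(t - t')/tau) X0_i(t')
decay = np.exp(-dt / tau)
X = np.zeros(N)
for t in range(T):
    X = X * decay + X0[t]

Y = W @ X  # output of the target neuron at the end of the period
```

With positive weights, Y is simply the exponentially weighted count of recent input spikes, which is the quantity the learning rules below manipulate.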
Let us focus on a single presentation period, omitting the notation of t, and look at the value of Y at the end of this period:

  Y = \sum_{i=1}^{N} W_i \int_{0}^{T} e^{-(T-t')/\tau} X_i^0(t') \, dt' \equiv \sum_{i=1}^{N} W_i X_i.   (2)

The input-output mutual information [11] in this network is defined by

  I(Y; \eta) = h(Y) - h(Y|\eta),    h(Y) = -\int f(y) \log f(y) \, dy,   (3)

where the first term is the differential entropy of the distribution of Y, the second term is the differential entropy of the Y distribution given that the network is presented with a known input pattern, and f(Y) is the p.d.f. of Y. As the input neurons fire independently and their number is large, the input of the target neuron when the network is presented with the pattern \eta^\nu is normally distributed, f(Y|\eta^\nu) = N(\mu^\nu, \sigma_\nu^2), with mean \mu^\nu = E[W X^\nu] and variance \sigma_\nu^2 = E[(W X^\nu)(W X^\nu)^T] - E^2[W X^\nu], where averaging is over the possible realizations of the inputs X when the network is presented with the pattern \eta^\nu. To calculate the entropy of Y we note that f(Y) is a mixture of Gaussians (each resulting from the presentation of one input pattern), f(Y) = \sum_\nu q^\nu f(Y|\eta^\nu), and use the assumption that \sum_{\nu=1}^{M} q^\nu is small to approximate the entropy of this distribution. Differentiating the mutual information with respect to W_i we obtain

  \partial I(Y; \eta) / \partial W_i = \sum_{\nu=1}^{M} q^\nu \left[ K^\nu E(X_i^\nu) - K^0 E(X_i^0) \right],   (4)

where the coefficients K^\nu and K^0 are functions of the means \mu^\nu, \mu^0 and variances \sigma_\nu^2, \sigma_0^2 of the conditional output distributions, obtained by differentiating the mixture entropy. This gradient of the mutual information manifold may be used for a "batch" gradient ascent learning rule, by repeatedly changing W_i according to \Delta W_i \propto \partial I(Y; \eta)/\partial W_i, while recalculating the distribution moments that depend on W. What can be learned from the general form of this learning rule? First, the learning rule has two parts: it decreases weights that correspond to inputs that fire more strongly at \eta^0, and increases the others. The main result of this process is an increase of the difference \mu^\nu - \mu^0, while constraining the variances \sigma^2, thus providing better discrimination of the \eta^0 pattern from the rest (see footnote 1).
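As a numerical sanity check of Eq. 3, the mutual information of a Gaussian-mixture output can be evaluated directly on a grid: h(Y|\eta) is a weighted sum of Gaussian entropies, while h(Y) requires integrating the mixture density. The probabilities, means, and variances below are illustrative assumptions, not values from the paper.

```python
import numpy as np

q = np.array([0.9, 0.05, 0.05])   # pattern probabilities (eta^0 abundant)
mu = np.array([0.0, 3.0, 4.0])    # mu^nu = E[Y | eta^nu]
sigma = np.array([1.0, 1.0, 1.0]) # sigma_nu, std of Y given each pattern

# Grid fine and wide enough to capture all three Gaussian components
y = np.linspace(-10.0, 15.0, 20001)
dy = y[1] - y[0]
gauss = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
f = gauss @ q  # mixture density f(Y) = sum_nu q^nu f(Y | eta^nu)

# h(Y): differential entropy of the mixture, by Riemann sum
h_Y = -np.sum(f * np.log(np.clip(f, 1e-300, None))) * dy
# h(Y | eta): each conditional is Gaussian, entropy 0.5 * log(2 pi e sigma^2)
h_Y_given = np.sum(q * 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))

I = h_Y - h_Y_given  # I(Y; eta) in nats, Eq. 3
```

The result is bounded by H(\eta) = -\sum_\nu q^\nu \log q^\nu, and shrinks as the component means \mu^\nu move closer to \mu^0, which is exactly the quantity the gradient of Eq. 4 pushes apart.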
Thus, as in the case of real-valued inputs investigated in [12], a gradient ascent rule on the mutual information manifold alternates between Hebbian learning, increasing the response to the signal patterns (in our model \eta^\nu, \nu >= 1), and anti-Hebbian learning, decreasing the response to the background noise pattern. Secondly, note that both increases and decreases of weights occur with a magnitude that depends on {q^\nu}_{\nu=1}^{M}. This can be interpreted as follows: each pattern \eta^\nu, \nu >= 1, presented with probability q^\nu, contributes K^\nu E(X_i^\nu) - K^0 E(X_i^0) to the gradient. It is therefore straightforward to implement the learning rule of Eq. 4 in an "on-line" manner by setting

  \Delta W_i = E(X_i^\nu) K^\nu - E(X_i^0) K^0   (5)

whenever the pattern \eta^\nu is presented, for \nu = 1..M only. This rule implements a stochastic gradient ascent on the information manifold. Interestingly, modifying the weights only when the patterns \eta^\nu, \nu >= 1, are presented has the advantage of being robust to fluctuations in \sum_{\nu=1}^{M} q^\nu. This is in contrast with standard memory learning procedures, in which weights are changed for any presented pattern but with alternating signs (a procedure that increases weights when one of the patterns \eta^\nu, \nu >= 1, is presented and decreases them when \eta^0 is presented). Quantitative results are not shown here due to space considerations.

Footnote 1: It can be shown that when starting from random positive weights, if \mu^\nu > \mu^0 then the K-terms are positive, the difference \mu^\nu - \mu^0 increases, and the weights remain positive. Clearly, a mirror solution also exists, with negative weights and \mu^0 > \mu^\nu.

3.2 A Simplified Learning Rule

Implementing the above gradient ascent rule is clearly difficult for a biological system, since in order to calculate the K-terms the system has to estimate the distribution moments for all input patterns. We now consider two simplifications of the analytically derived rule, and will later show numerically that the simplified learning rule approximates the gradient-ascent learning rule fairly well. First, we ignore the K-coefficients in Eq.
5 and consider the learning rule

  \Delta W_i = \lambda^+ E(X_i^\nu) - \lambda^0 E(X_i^0)  when \eta^\nu is presented, \forall \nu = 1..M,   (6)

where \lambda^+ and \lambda^0 determine the learning rates but, unlike the K's, do not depend on W. Again, no changes occur when the pattern \eta^0 is presented. Secondly, the expectations are approximated by averaging over presentations. E(X_i^\nu) can clearly be approximated by averaging over X_i when \eta^\nu is presented. E(X_i^0) could be similarly estimated by averaging over a period where the pattern \eta^0 was presented; however, having the experimentally observed TDP rule in mind, we choose here a different approximation. Using the fact that \eta^0 is highly frequent, and hence with high probability will be presented in the consecutive time period t = 0..T, we treat spikes in that period as emerging from \eta^0. These spikes can be weighted in many ways, all yielding similar results due to averaging; we arbitrarily choose here uniform weighting for the sake of generality (but see also [8]). Substituting X_i \approx \int_{-T}^{0} e^{t'/\tau} X_i^0(t') \, dt', we conclude that the following learning rule approximates Eq. 6:

  \Delta W_i = \lambda^+ \int_{-T}^{0} e^{t'/\tau} X_i^0(t') \, dt' - \lambda^0 \frac{1}{T} \int_{0}^{T} X_i^0(t') \, dt',   (7)

applied when \eta^\nu, \nu >= 1, is presented, with time measured relative to the end of the presentation period. How good are these approximations? Does learning with the approximated learning rule increase mutual information? To answer these questions we performed a simulation in which we traced the input-output mutual information of a 1000-neuron network, with 101 patterns presented. The two upper curves in Figure 1 compare the mutual information obtained with the learning rule of Eq. 4 with that of Eq. 7. Apparently, the approximated learning rule achieves fairly good results compared to the optimal rule. Convergence is slower, though, as Eq. 7 does not ascend along the information gradient (i.e., the steepest ascent), but climbs along more moderate slopes.
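In discrete time, the simplified rule of Eq. 7 can be sketched as a single update function: recent presynaptic spikes are weighted by a causal exponential kernel with time constant \tau, while spikes in the following window are averaged uniformly and subtracted. The learning rates match those quoted in the Figure 1 caption; the window length and \tau (in bins) are illustrative assumptions.

```python
import numpy as np

def simplified_tdp(X0_before, X0_after, tau=10.0, lam_plus=0.01, lam_zero=0.0125):
    """Discrete-time sketch of Eq. 7.

    X0_before: (T, N) binary spike array for the presentation period
               (the most recent bin is the last row).
    X0_after:  (T, N) binary spike array for the consecutive period,
               whose spikes are treated as background (eta^0) activity.
    Returns the weight update Delta W, shape (N,).
    """
    T = X0_before.shape[0]
    # Exponential kernel e^{t'/tau}: the most recent bin gets weight ~1,
    # older bins decay with time constant tau
    kernel = np.exp(-np.arange(T - 1, -1, -1) / tau)
    strengthen = lam_plus * (kernel @ X0_before)
    # Uniform weighting over the following window, (1/T) * sum of spikes
    weaken = lam_zero * X0_after.mean(axis=0)
    return strengthen - weaken
```

In the supervised case the update is applied whenever one of the patterns \eta^\nu, \nu >= 1, is presented; Section 4 applies the same update triggered by a postsynaptic spike instead.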
Figure 1: Comparing optimal and approximated learning rules. [The figure plots I(X,Y) against time steps (0 to 5000) for three curves: the analytically derived rule, the simplified supervised rule, and the unsupervised rule.] Poisson spike trains were simulated by discretizing time into bins. Patterns were constructed by setting 0.1 of the input neurons to fire at a high rate with probability 0.4 per bin, while the rest fire at a rate of 0.05 per bin. Remaining simulation parameters: M = 100, N = 1000, q^0 = 0.9, q^\nu = 0.001, \lambda^+ = 0.01, \lambda^0 = 0.0125. Repeating the simulation for different realizations of the spike trains yielded almost identical results; error bars were therefore omitted.

These results thus show that a simplified TDP rule, similar in form to the experimentally observed TDP, yields near-optimal input-output mutual information.

4 The Unsupervised Case

4.1 Using Postsynaptic Spikes As Learning Signals

The analysis so far has concentrated on the supervised learning case, where the identity of the presented pattern was used by the learning rule. Can these results be extended to the unsupervised case? A possible replacement for the teacher's learning signal is the postsynaptic spike: if spikes are elicited when the input exceeds a threshold, then in our model a postsynaptic spike should correspond to one of the patterns \eta^\nu, \nu >= 1, while spike absence corresponds to \eta^0 (since \mu^\nu > \mu^0). This yields a learning procedure identical to Eq. 7, but this time learning is triggered by the postsynaptic spike. This learning rule simulates the experimentally observed TDP rule [3, 2], with the possible exception of the temporal shape of the weakening part, which is not constrained in our model. Clearly, if spikes faithfully signal the presence of input patterns, this procedure should yield effective learning. But at the beginning of the learning procedure, when the weights are not yet appropriately set, spikes do not necessarily signal those patterns. What, then, is the effectiveness of this learning mechanism?
To answer this question, we compared the proposed unsupervised learning rule with the two supervised rules derived earlier. Figure 1 shows that this spike-dependent learning rule indeed increases the input-output mutual information, and achieves reasonable results compared to the optimum. We conclude that the biological TDP rule approximates an optimal information maximization learning rule for the unsupervised task. It can be seen that the learning rate of the unsupervised procedure is much slower. This is due to the fact that at early stages the neuron does not reliably signal the presentation of patterns, thus causing highly noisy learning. However, as learning proceeds, the neuron's learning signal becomes more faithful, allowing the system to increase its input-output mutual information. How, then, does the system depend on its initial conditions? The next section shows that a range of parameters exists in which the neuron learns to respond to rare events regardless of the initial conditions.

4.2 Detecting Rare Events

Consider for simplicity the case where two patterns \eta^0, \eta^1 (q^0 >> q^1) are presented to a neuron that uses the unsupervised learning procedure described above. Each of the patterns has k "foreground" neurons that fire at a high rate, while the remaining N - k neurons fire at a low rate. Note that if the neuron responds to the presentation of the pattern \eta^0 (which happens with probability q^0), then the weights corresponding to its foreground neurons are strengthened by \lambda^+, but are then weakened on average by \lambda^0 with probability q^0 (due to the weight decrease on the following time step).
A similar calculation for \eta^1 yields, on average,

  \Delta W_i = q^0 (\lambda^+ - q^0 \lambda^0)  if the neuron spikes given \eta^0,
  \Delta W_i = q^1 (\lambda^+ - q^1 \lambda^0)  if the neuron spikes given \eta^1.   (8)

As q^0 > q^1, setting \lambda^0 / \lambda^+ in the range

  1/q^1 > \lambda^0 / \lambda^+ > 1/q^0   yields   q^0 (\lambda^+ - q^0 \lambda^0) < 0,  q^1 (\lambda^+ - q^1 \lambda^0) > 0.   (9)

In this regime the weights corresponding to the foreground neurons of \eta^0 are decreased, while those of \eta^1 are strengthened, leading to a spiking response of the neuron when it is presented with the rare pattern \eta^1. Thus, the learning-rates ratio \lambda^0 / \lambda^+ sets a cutoff that allows only patterns with probabilities q^\nu < \lambda^+ / \lambda^0 to be strengthened, while the rest are weakened. Therefore, in the correct parameter range, the system converges to the same functional steady state regardless of the initial conditions.

5 Discussion

We have taken an information theoretic approach to the study of TDP, a novel type of plasticity recently observed in brain tissues. Within the Infomax framework, we have derived a TDP rule that maximizes mutual information in a spiking neural network, and compared it with the biological TDP rule. This comparison shows that not only is the biological learning rule similar in form to the analytically derived one, but it also increases mutual information to a near-optimal level. This derivation provides a new computational insight into the shape of the learning curve, showing that the strengthening of synaptic efficacy should depend exponentially on the time difference between pre- and postsynaptic spikes, with a time constant determined by the time constant of the postsynaptic membrane. Indeed, the time constants of the experimentally observed TDP rule are on the scale of 10-20 milliseconds, which fits well the membrane time constant of cortical neurons. This yields directly testable experimental predictions as to the shape of the learning curve as a function of the biophysical parameters of the neuron: neurons with larger membrane time constants should exhibit larger time constants in their TDP learning curves.
It should be noted, however, that the precise shape of the learning curve should depend on the synaptic transfer function and on delays, which were not incorporated into the current analysis. Cortical inhibitory neurons were not found to exhibit TDP. Under the view that inhibitory cortical neurons do not participate in spatial coding but serve to limit network activity, our analysis predicts that indeed no temporal learning should occur in these neurons, as they are not needed for pattern discrimination. However, it predicts that in systems where inhibitory neurons do participate in spatial coding, as in the olfactory system, TDP should be observed. Our analysis was performed for a single target neuron, but it can be extended to the case of several target neurons, as in hetero-associative memory networks. Moreover, as the input-output mutual information determines the ability of a neuron to discriminate its inputs, this analysis bears relevance to a wide range of computational paradigms, such as the self-organization of neural networks. Although TDP has been suggested to support the possibility of temporal coding in the brain, it is shown here that even in a learning task with no temporal coding, TDP is necessary for learning. It remains an intriguing question to explore the learning rules that maximize mutual information for models with temporal structure in their input spike trains.

References

[1] H. Markram, J. Lubke, M. Frotscher, and B. Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213-215, 1997.
[2] Q. Bi and M.M. Poo. Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18:10464-10472, 1999.
[3] L. Zhang, H.W. Tao, C.E. Holt, W.A. Harris, and M.M. Poo. A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395(3):37-44, 1998.
[4] L.F. Abbott and Sen Song.
Temporally asymmetric Hebbian learning, spike timing and neural response variability. In Sara A. Solla and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, Proceedings of the 1998 conference, pages 69-75. MIT Press, 1999.
[5] R. Kempter, W. Gerstner, and J.L. van Hemmen. Hebbian learning and spiking neurons. Phys. Rev. E, 59(4):4498-4514, 1999.
[6] D. Horn, N. Levy, E. Ruppin, and I. Meilijson. Distributed synchrony in a Hebbian cell assembly of spiking neurons. Submitted to Neural Computation, 2000.
[7] M.R. Mehta, M. Quirk, and M. Wilson. From hippocampus to V1: Effect of LTP on spatio-temporal dynamics of receptive fields. In J.M. Bower, editor, Computational Neuroscience: Trends in Research 1999. Elsevier, 1999.
[8] P.M. Munro and G. Hernandez. LTD facilitates learning in a noisy environment. In S.A. Solla, T.K. Leen, and K.R. Muller, editors, Advances in Neural Information Processing Systems 12, Proceedings of the 1999 conference. MIT Press, 1999.
[9] R. Rao and T. Sejnowski. Predictive sequence learning in recurrent neocortical circuits. In S.A. Solla, T.K. Leen, and K.R. Muller, editors, Advances in Neural Information Processing Systems 12, Proceedings of the 1999 conference, volume 12. MIT Press, 1999.
[10] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.
[11] C.E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379-423, 1948.
[12] R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4(5):691-702, 1992.