Supporting Text for the manuscript "A reservoir of time constants for memory traces in cortical neurons"

Alberto Bernacchia, Hyojung Seo, Daeyeol Lee, Xiao-Jing Wang

Contents

Figures
1) Distribution of timescales and amplitudes in the random network model
1.1) Description of the model
1.2) Choice of the network parameters
1.3) Theoretical Analysis
1.4) Interpretations and limits of modelling
2) Factorization of epoch code and readout of memory-epoch conjunctions
2.1) Integrating the activity of cortical neurons
2.2) Encoding of memory-epoch conjunctions
References

Figure S1: Panels (a) and (b) show the fraction of timescales for the single (a) and double (b) exponential functions. The two histograms are virtually identical. Note that panel (b) includes both τ1 (red) and τ2 (green) for the double exponential, where τ1 tends to be concentrated at small values and τ2 at large values. This is expected since, for each double exponential, we define τ1 as the short timescale and τ2 as the long timescale, so that τ1 < τ2 by definition for each neuron. Panels (c), (d) and (e) show the same histograms separately for the three areas (c: ACCd, d: DLPFC, e: LIP). Note that ACCd tends to have longer timescales, while LIP tends to have shorter timescales.

Figure S2: The distribution of timescales (panel (a)) and amplitudes (panel (b)) for the neural memory traces of choice (same format as Figs. 4 and 8 in the main text). Results are qualitatively similar to those for reward memory. However, the exponents fitted to the curves are slightly different, and the results are less consistent across areas. The fraction of neurons displaying memory for choice differs across areas (40% in ACCd, 71% in DLPFC, 73% in LIP; χ²-test, p ≈ 10⁻¹²).

Figure S3: Schematic illustration of the model. The model consists of a reservoir network of recurrently connected neurons (green), receiving an input pulse when a reward is delivered (red). The output (blue) illustrates a hypothetical system reading out the timescales in the reservoir network (not explicitly modelled). The memory traces of four example neurons, taken from actual simulations, are shown at the bottom.

Figure S4: Histograms of the on-diagonal (a) and off-diagonal (b) entries of one example matrix J; the former represent the self-interactions of nodes in the network, the latter the strength of the interactions among different nodes.

1) Distribution of timescales and amplitudes in the random network model

The purpose of this section is to build a mathematical model accounting for the distribution of memory timescales and amplitudes observed in the exponential fits of the experimental data. We compare the exponential decay ex(t) observed in the data with the variable v(t) of our model, which corresponds to the temporally integrated response of neurons to reward. Consistent with the data, the response v(t) shows an exponential decay, and we focus on its amplitude and timescale across neurons. In this section we model only the exponential factor of the memory trace in the experimental data; we do not consider the multiplicative effect of memory on neural activity. The label v(t) is reminiscent of the value v(t) (or reward expectation) in reinforcement learning theories, defined as an exponential filter of past rewards1-4.

1.1) Description of the model

Our model consists of a large number of neurons, all driven by the input signalling the reward. The neurons and the recurrent synaptic interactions among them form a network. The dynamics of the network activity is given by

dv/dt = Jv(t) + hRew(t)    (1.1)

where v(t) is a vector of M components, each corresponding to the activity of one neuron in response to the reward sequence Rew(t) (we set M = 1000 in simulations). The matrix J specifies the synaptic interactions among different neurons, including the self-interactions, and h is the vector of input weights to the different neurons. The dynamics of each separate neuron, i.e. the scalar form of Eq.(1.1), is written as

dvi/dt = ∑j Jij vj(t) + hi Rew(t)    (1.2)

where different neurons are labelled by the indices i (post-synaptic) and j (pre-synaptic). We consider a single reward event, modelling Rew(t) as a single pulse at time zero, i.e. Rew(t) = δ(t). The response to a sequence of rewards can be obtained by a straightforward summation of single-reward responses and is not considered further. Eq.(1.1) can be readily solved, and the response of neuron i is given by

vi(t) = ∑k Ai(k) e^(−t/τ(k))    (1.3)

Hence the response to a single reward is a sum of exponential functions, labelled by the index k, each characterized by a different amplitude Ai(k) and timescale τ(k). Eq.(1.3) is the focus of our model and is to be compared with the memory traces observed in the experimental data.
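To make Eqs.(1.1)-(1.3) concrete, here is a minimal numpy sketch (ours, not code from the study) that computes the impulse response of the linear network via the spectral decomposition of J. The matrix J used here is an arbitrary stable symmetric matrix chosen only for illustration; the construction actually used in the model is specified in section 1.2.

```python
import numpy as np

# Sketch of Eqs. (1.1)-(1.3): impulse response of dv/dt = J v + h Rew(t)
# with Rew(t) = delta(t), i.e. v(t) = exp(Jt) h.
rng = np.random.default_rng(0)
M = 100
W = rng.standard_normal((M, M))
J = (W + W.T) / (2 * np.sqrt(M)) - 2.0 * np.eye(M)  # illustrative stable symmetric J
h = rng.standard_normal(M) * np.sqrt(M)             # input weights (variance M)

lam, V = np.linalg.eigh(J)               # eigenvalues lam(k), eigenvectors R(k)
tau = -1.0 / lam                         # timescales, cf. Eq. (1.5) below
A = V * (V.T @ h)                        # amplitudes A_i(k) = R_i(k) (L(k).h), Eq. (1.6)
t = np.linspace(0.0, 5.0, 200)
v = A @ np.exp(-np.outer(1.0 / tau, t))  # v_i(t) = sum_k A_i(k) e^(-t/tau(k)), Eq. (1.3)
assert np.allclose(v[:, 0], h)           # at t = 0 the response equals the input pulse
```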
The values of the amplitudes and timescales depend on the choice of the matrix J and the vector h, which are specified in section 1.2, "Choice of the network parameters". In simulations, amplitudes and timescales are computed from the spectral decomposition of J (i.e. its eigenvalues and eigenvectors) following Eqs.(1.5) and (1.6). The distribution of amplitudes and timescales across neurons in the model is then plotted and compared with the experimental data.

The synaptic interactions Jij are described in detail in the next sections. We will show that three main features of the interactions allow the model to reproduce the observed data:

1) They are strong, implying that neurons can reverberate the reward input and memorize it for a long time. This gives rise to a wide (power-law) distribution of timescales.
2) They are heterogeneous, implying that each neuron displays a quantitatively different decay of the memory.
3) They are symmetric, implying that the decay is exponential and that the amplitudes are distributed exponentially.

1.2) Choice of the network parameters

The free parameters of the network model are the entries of the matrix J and of the vector h in Eq.(1.1). We pick each component hi of the vector of input weights independently from a Gaussian distribution with zero mean and variance equal to M. The specific choice of hi does not affect our results: a different choice would lead to the same distributions of timescales and amplitudes in the model, although the magnitude |h| sets the scale of the amplitudes (see section 1.3, "Theoretical Analysis"). In contrast, the choice of the matrix of synaptic interactions J strongly affects the distributions of timescales and amplitudes, and we give the specific prescriptions for J in the following.

We construct J from its spectral decomposition. As is well known in linear algebra, every (non-defective) square matrix J can be decomposed into its eigenvalues and eigenvectors as J = VDV⁻¹, where D is the diagonal matrix of eigenvalues λ(k) (i.e. Dkk = λ(k)), the columns of the matrix V are the right eigenvectors R(k) (i.e. Vik = Ri(k)), and the rows of its inverse V⁻¹ are the left eigenvectors L(k) (i.e. (V⁻¹)kj = Lj(k)). It is convenient to rewrite the spectral decomposition J = VDV⁻¹ in terms of the eigenvalues and the right and left eigenvectors as (k = 1,…,M)

Jij = ∑k Ri(k) λ(k) Lj(k)    (1.4)

Instead of setting the values of Jij directly, we set the eigenvalues and eigenvectors and compute Jij according to Eq.(1.4).
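As a small illustration of Eq.(1.4), the following helper (our sketch) assembles J from a prescribed spectrum; the distributions from which the eigenvalues and the orthogonal matrix V are drawn are specified in the remainder of this section.

```python
import numpy as np

def assemble_J(lam, V):
    """Build J from Eq. (1.4), J_ij = sum_k R_i(k) lam(k) L_j(k).

    For an orthogonal V (as assumed below), V^{-1} = V^T, so the left and
    right eigenvectors coincide and J = V diag(lam) V^T is symmetric."""
    return V @ np.diag(lam) @ V.T
```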
Eigenvalues λ(k) are taken independently from a uniform distribution in the interval [-2,0]. Instead of drawing M eigenvalues, we draw M/2 eigenvalues from [-2,0] and count each eigenvalue twice, i.e. each eigenvalue is two-fold degenerate. The choice of eigenvectors is specific: we assume that the eigenvectors form an orthogonal basis, i.e. each eigenvector is orthogonal to all other eigenvectors and is normalized to one. Formally, this corresponds to the expression VᵀV = I or, equivalently, Vᵀ = V⁻¹ (where the superscript T denotes the transpose and I is the identity matrix), implying that the matrix V is orthogonal and that the left and right eigenvectors are equal. Among the infinite number of possible orthogonal matrices of dimension M, we draw V randomly from the uniform distribution over orthogonal matrices (also known as the Haar measure).

In order to gain some intuition about the uniform distribution over orthogonal matrices V, it is useful to recognize that each orthogonal matrix corresponds to the rotation of a vector by a given set of angles. In a space of dimension M, a rotation is defined by specifying the values of M-1 angles, where each angle has a finite interval of allowed values. For example, in three dimensions (M = 3) the rotation of a vector is defined by two angles: the azimuth angle, varying in the interval [0,2π], and the zenith angle, varying in the interval [0,π]. The uniform distribution of V can be understood as the uniform distribution of all angles, each over its respective finite interval (see below for a recipe for generating V from this distribution).

Note that our model does not have any free parameter: the statistics of J is completely determined by the uniform distributions of its eigenvalues and eigenvectors. We set the left bound of the distribution of eigenvalues to -2 in order to have a mean eigenvalue equal to -1, which sets the mean self-interaction in the matrix J and hence the characteristic integration timescale of single neurons. Fig.S4a shows the distribution of the on-diagonal terms (self-interactions) of one example matrix J obtained by the above procedure: the self-interaction is approximately -1, corresponding to a single-neuron integration timescale equal to 1 in Eq.(1.2). The choice of zero for the right bound of the distribution of eigenvalues has important consequences: the magnitude of the synaptic interactions is large and the network is in a critical state (see section 1.3, "Theoretical Analysis"). Fig.S4b shows the off-diagonal terms (cross-interactions) of the matrix J: cross-interactions include both positive (excitatory) and negative (inhibitory) terms of order of magnitude M^(-1/2).

The choice of an orthogonal basis of eigenvectors, together with the fact that the eigenvalues are real, implies that the interactions between each pair of connected neurons i and j are symmetric, i.e. Jij = Jji. Weak asymmetries could be introduced by allowing complex values for the eigenvalues and eigenvectors.

We conclude this section with a recipe for generating a specific instance of the orthogonal matrix V. It consists of three simple steps: 1) Generate a square matrix W of dimension M by drawing each element independently from an arbitrary distribution (for instance, a Gaussian distribution with zero mean and unit variance). 2) Perform the orthogonal-triangular (QR) decomposition of the matrix W, namely W = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. 3) Obtain the final result V by multiplying each column of Q by the sign of the corresponding diagonal entry of R, i.e. V = QS, where S is a diagonal matrix whose diagonal elements are the signs of the corresponding diagonal elements of R. This procedure has been shown to be equivalent to drawing V from the uniform distribution over orthogonal matrices5.
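The three-step recipe translates directly into code. A possible numpy implementation (our sketch, reusing assemble_J from above) is:

```python
import numpy as np

def haar_orthogonal(M, rng):
    """Draw V uniformly (Haar measure) over the orthogonal matrices,
    following the three steps above (see ref. 5)."""
    W = rng.standard_normal((M, M))   # step 1: arbitrary Gaussian matrix
    Q, R = np.linalg.qr(W)            # step 2: QR decomposition, W = QR
    return Q * np.sign(np.diag(R))    # step 3: V = QS, S = sign(diag(R))

# Example: the full parameter set of the model, with a two-fold degenerate
# spectrum drawn uniformly from [-2, 0] and J assembled via Eq. (1.4).
rng = np.random.default_rng(0)
M = 1000
lam = np.repeat(rng.uniform(-2.0, 0.0, M // 2), 2)  # each eigenvalue counted twice
V = haar_orthogonal(M, rng)
J = assemble_J(lam, V)                               # symmetric: J_ij = J_ji
```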
1.3) Theoretical Analysis

We can predict the distribution of timescales and amplitudes in the model by a theoretical analysis. The first step is to express the amplitudes and timescales of the exponential functions in Eq.(1.3) as functions of the eigenvalues λ(k) and the left and right eigenvectors Li(k) and Ri(k) of the matrix J, defined by Eq.(1.4). By straightforward linear algebra, those expressions read

τ(k) = -1/λ(k)    (1.5)

Ai(k) = Ri(k) ∑j Lj(k) hj    (1.6)

Note that the amplitudes also depend on the vector of input weights hj. We assume that the eigenvalues of the matrix J are random and follow a distribution G(λ). We can calculate the distribution of timescales P(τ) as a function of the distribution of eigenvalues G(λ) by using Eq.(1.5) and the identity P(τ)dτ = G(λ)dλ. The result is

P(τ) = τ⁻² G(-1/τ)    (1.7)

Even before specifying the distribution of eigenvalues G, it is clear that the first factor of Eq.(1.7) corresponds to a power law with an exponent of -2, consistent with the experimental data. If we assume that the distribution of eigenvalues is uniform, i.e. G is constant in a given interval, the distribution of timescales is

P(τ) ≈ τ⁻²    (1.8)

in the corresponding interval of values of τ. The interval of τ is obtained by applying Eq.(1.5) to the interval for λ. Since the interval for λ is [-2,0], the interval for τ is [1/2,+∞). It is now clear that setting the right bound for λ to zero implies that the power-law distribution of τ extends to arbitrarily large values. Such a wide distribution of timescales is the signature of a critical system6. In the machine learning literature, it has been proposed that critical systems may perform useful types of computation (the so-called "computation at the edge of chaos"7-9).
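Eqs.(1.5), (1.7) and (1.8) admit a quick numerical check (our sketch): drawing eigenvalues uniformly from [-2,0] and histogramming τ = -1/λ should recover the τ⁻² power law on [1/2,+∞).

```python
import numpy as np

rng = np.random.default_rng(1)
lam = rng.uniform(-2.0, 0.0, 100_000)     # G(lam) uniform on [-2, 0]
tau = -1.0 / lam                          # Eq. (1.5)
edges = np.logspace(np.log10(0.5), 3.0, 30)
dens, _ = np.histogram(tau, bins=edges, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])
mask = dens > 0
slope, _ = np.polyfit(np.log(centers[mask]), np.log(dens[mask]), 1)
print(f"log-log slope of P(tau): {slope:.2f} (theory: -2, Eq. 1.8)")
```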
Now we investigate the distribution of amplitudes in the model. We assume that the eigenvectors of the matrix J are orthogonal (i.e. the matrix J is normal10), implying that the left and right eigenvectors are equal (or complex conjugate), and we normalize them to unit norm. Without loss of generality, we focus on a single eigenvector R and write the distribution of its components as

Q(R) ≈ δ(|R|² - 1)    (1.9)

This distribution corresponds to a uniform distribution in which the normalization of the eigenvector R is enforced (the uniform distribution on the (M-1)-sphere), where |R|² is the standard Euclidean norm. From Eqs.(1.6, 1.9), the distribution of the amplitudes can be computed; the result depends only on whether the eigenvector R(k) is complex or real, and on the degeneracy of the corresponding eigenvalue λ(k). We found that, in the case of a real eigenvector and a two-fold degenerate eigenvalue, the distribution of amplitudes is exponential, i.e.

P(A) ≈ e^(−β|A|)    (1.10)

where β = M/|h|. This expression is consistent with the distribution of amplitudes observed in the data. Note that, if the components of h are independently distributed following a Gaussian with zero mean and variance equal to M, then |h| ≈ M and β ≈ 1. It remains unknown whether other classes of matrices J (non-normal) could also give rise to an exponential distribution of amplitudes.

In summary, the theoretical analysis of the model shows that the power-law distribution of memory timescales and the exponential distribution of amplitudes are consistent with a random network model in which the matrix of interactions J satisfies two prescriptions. First, the power-law distribution of timescales is observed if the eigenvalues of J are allowed to approach zero, pushing the network to the critical state (edge of chaos). Second, the exponential distribution of amplitudes is observed if the matrix J is normal and its eigenvectors are distributed uniformly in the space of orthogonal matrices.
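The exponential form of Eq.(1.10) can also be checked numerically. The sketch below (ours) computes the amplitudes of Eq.(1.6) for a Haar-orthogonal V and sums them over each two-fold degenerate pair; the resulting histogram should follow e^(−β|A|) with β ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 1000
h = rng.standard_normal(M) * np.sqrt(M)       # component variance M, so |h| ~ M
Q, Rq = np.linalg.qr(rng.standard_normal((M, M)))
V = Q * np.sign(np.diag(Rq))                  # Haar-orthogonal eigenvectors (sec. 1.2)
A = V * (V.T @ h)                             # A_i(k) = R_i(k) (L(k).h), Eq. (1.6)
A = A[:, 0::2] + A[:, 1::2]                   # sum over each two-fold degenerate pair
# For P(A) ~ exp(-beta |A|), the mean of |A| equals 1/beta, and beta = M/|h| ~ 1 here.
print("mean |A|:", np.abs(A).mean())
```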
1.4) Interpretations and limits of modelling

The model we considered has several limitations. The first, obvious one is that the neural network does not incorporate the biophysics of the dynamics of neurons and synapses. Although the model does not reflect the biology of the cortex, its results can be quantitatively compared with the experimental observations, offering a detailed account of the heterogeneous picture of the reward responses of a large number of neurons. In this section, we discuss a few issues to be considered when comparing the model with the experimental results.

Eq.(1.3) is a sum of a large number of exponential functions. However, a single exponential, or a sum of two exponentials, is sufficient to fit the memory trace of each neuron observed in the experimental data. Furthermore, neurons in the model have different amplitudes but share the same set of timescales, while the experimental results show that different neurons feature different timescales. We list in the following a few possible scenarios, not mutually exclusive, able to reconcile the model and the experimental data. 1) Since the distribution of amplitudes within a single neuron is exponential, characterized by a strong peak at zero, some timescales have amplitudes close to zero. Each neuron then expresses only the timescales that correspond to a non-zero amplitude for that neuron, implying that different neurons may display different timescales, consistent with the experimental data. 2) Because the distribution of timescales is a power law, with high probability the largest timescale in the network is much larger than the second largest, which in turn is much larger than the third, and so on, down to a set of timescales of similar and small magnitude. Since exponentials with short timescales decay away much faster than those with long timescales, the very fast responses are negligible. Then only one or two timescales contribute to the sum in Eq.(1.3), consistent with the single or double exponential fits used for the experimental data. 3) Different neurons measured experimentally might be accounted for by different instances of the network model, i.e. different realizations of the synaptic interactions. This would explain why different neurons in the data are characterized by different timescales, and it is consistent with at least two different interpretations. First, different cortical neurons might belong to different and disconnected sub-networks. Second, Eq.(1.1) might describe not a neural network but an intracellular network: each neuron would be separately characterized by a network of intracellular reactions modelled by Eq.(1.1)11-14.

In our simulations we implemented scenarios 2) and 3) together, as sketched in the code below. We performed 1000 simulations, each characterized by a different instance of the interactions, i.e. of the matrix J and the vector h. In each simulation we consider only the longest timescale and a single neuron. Without loss of generality, we set k = 1 for the longest timescale and i = 1 for the single neuron, and in each simulation we record τ(1) and A1(1), discarding all other timescales and amplitudes. Over the 1000 simulations, we collect 1000 timescales and 1000 amplitudes, and we compute the histograms of A and τ across simulations. Hence we interpret each simulation as a separate measurement of the activity of a single neuron, and the number of simulations, rather than the number of neurons in a single simulation, determines the number of neurons measured in the model. Implementing this scheme yields no qualitative difference with respect to considering a single network: the distribution of timescales is still a power law and the distribution of amplitudes is still exponential. This result is obvious for the amplitudes, since in each network we pick an independent instance from the same exponential distribution. For the timescales, since we pick the longest timescale in each network, the distribution across networks equals the extreme-value density of a power-law distribution with exponent -2, which is the Cauchy distribution15. As shown in Fig.7b in the main text, this distribution shows a cut between a plateau and the power-law tail. Consistently, the cut is present also in the data (Fig.4), occurring at roughly τ = 0.5 trials. In the model, the characteristic integration timescale of a neuron is set to 10 ms, and timescales have been normalized such that a value of one corresponds to 3.4 s, the mean duration of one trial; the cut then occurs at τ = 0.5 trials, consistent with the data.
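A sketch of this simulation scheme (ours): for the timescale histogram only the spectrum of each network instance is needed, and the amplitude A1(1) would be collected analogously from Eq.(1.6).

```python
import numpy as np

rng = np.random.default_rng(3)
n_sims, M = 1000, 1000
tau_1 = np.empty(n_sims)
for s in range(n_sims):
    # a fresh network instance: two-fold degenerate spectrum, uniform on [-2, 0]
    lam = np.repeat(rng.uniform(-2.0, 0.0, M // 2), 2)
    tau_1[s] = -1.0 / lam.max()   # keep only the longest timescale, tau(1)
# Across instances, tau_1 follows the extreme-value density of a tau^(-2)
# power law (a Cauchy-like curve, ref. 15): a plateau with a power-law tail.
```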
2) Factorization of epoch code and readout of memory-epoch conjunctions

In this section we discuss possible computational advantages of the factorization of the epoch code observed in the data. We show that the factorization allows the computation of arbitrary memory-epoch conjunctions.

2.1) Integrating the activity of cortical neurons

Following Eq.(2) in the main text, the firing rate of a cortical neuron in trial n and epoch k can be expressed as the epoch code g(k) times a factor that depends on past rewards, namely

FR(n,k) = g(k)[1 + ∑n'=0:5 ex(t)Rew(n-n')]    (2.1)

where t denotes the time elapsed since the outcome n' trials in the past, and ex(t) is an exponential function (see Methods). Our goal is to describe the general properties of a readout neuron integrating signals of the type of Eq.(2.1). As observed in the data, different cortical neurons correspond to a variety of different epoch codes g(k) and exponential functions ex(t). We label cortical neurons by the index i, denote the firing rate of neuron i by FRi, and denote the corresponding epoch code and exponential by gi(k) and exi(t). We restrict our analysis to linear integration, namely the input Iout from cortical neurons to the readout neuron is a weighted sum of the firing rates of cortical neurons

Iout = ∑i wiFRi = ∑i wigi(k)[1 + ∑n'=0:5 exi(t)Rew(n-n')]    (2.2)

where wi denotes the synaptic weight of the connection from neuron i to the readout neuron. Because the contributions of different outcomes (Rew) to Iout add up linearly (the sum over n'), we can focus on the contribution of a single outcome, denoted by fout and given by

fout(k,t) = ∑i wi[gi(k)exi(t)]    (2.3)

This corresponds to the response of the readout neuron to a single outcome (note that we have dropped the 1 in Eq.(2.2), since it does not depend on the outcome). We aim to describe the encoding of the two variables, epoch k and elapsed time t, by the readout neuron. The former represents the present task context, while the latter represents the memory of a specific past event that occurred t seconds ago, and we are interested in the conjunction of information between that past event and the present task demands. Note that the two variables (k,t) are not independent, but we show below that, formally, they can be treated as independent in most cases, and for simplicity we assume that t measures the number of trials elapsed since a given outcome.

2.2) Encoding of memory-epoch conjunctions

In this section we show that the multiplicative form [gi(k)exi(t)] gives the readout neuron substantial power in computing arbitrary conjunctions of the two variables (k,t). For comparison, we show that the multiplicative form does better than the additive form [gi(k)+exi(t)]. Following Eq.(2.3), the problem can be stated as finding the weights wi for which the output fout(k,t) best approximates a given, arbitrary target function ftarget(k,t), i.e. the weights minimizing their average squared difference. The target function is chosen according to the requirements of an arbitrary task. For example, if the task requires learning the effect of the outcome t = 3 trials in the past on epoch k = 4 of the current trial, then we may require the readout neuron to be active only during epoch k = 4 and to signal only the outcome t = 3 trials in the past, by defining the target function as ftarget(4,3) = 1 and ftarget(k,t) = 0 for all other values of (k,t). We assume that k takes K values (k = 1,…,K) and t takes T values (t = 1,…,T), such that the function ftarget(k,t) is characterized by KT arbitrary values, and that the number of inputs in Eq.(2.3) (labelled by the index i) is large, close to KT. In general, solving this problem is a standard application of linear algebra, namely the pseudo-inversion of the matrix in square brackets in Eq.(2.3), where each row of the matrix corresponds to a combination of the two values (k,t) and each column corresponds to a different input i. The quality of the reconstruction of ftarget from the available inputs depends on the rank of the matrix: better results are obtained when the rank approaches the number KT of values of ftarget to be reconstructed. When the sum [gi(k)+exi(t)] is used instead of the product [gi(k)exi(t)], the matrix has only K+T independent entries and its rank is at most K+T, a very small number compared with KT. In contrast, the multiplicative matrix generically has full rank, and it guarantees the best possible reconstruction given the available number of input neurons.
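A small numerical illustration of the rank argument (our sketch, with arbitrary illustrative tuning curves): the design matrix built from the products gi(k)exi(t) generically reaches rank KT, while the additive version saturates well below it.

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 5, 6
N = K * T                                  # number of input neurons, close to KT
g = rng.uniform(0.5, 1.5, (N, K))          # epoch codes g_i(k) (illustrative)
tau = rng.uniform(0.5, 5.0, N)             # one timescale per neuron
ex = np.exp(-np.outer(1.0 / tau, np.arange(1, T + 1)))  # ex_i(t) = e^(-t/tau_i)
# rows: (k,t) combinations; columns: inputs i, as in Eq. (2.3)
mult = (g[:, :, None] * ex[:, None, :]).reshape(N, K * T).T
add = (g[:, :, None] + ex[:, None, :]).reshape(N, K * T).T
print(np.linalg.matrix_rank(mult))  # KT = 30: any f_target(k,t) is reachable
print(np.linalg.matrix_rank(add))   # at most K + T: severely limited readout
```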
Given that the multiplicative form [gi(k)exi(t)] has full rank, the optimal reconstruction of the target function requires the inversion of the following matrix

Aij = [∑k gi(k)gj(k)][∑t exi(t)exj(t)]    (2.4)

Note that this expression measures how similar the encodings of the two variables (k,t) are between neurons i and j. We may call gi(k) and exi(t) the "tuning curves" of neuron i for the encoding of k and t, in analogy with other behavioural variables explored in the neurophysiology literature. If the tuning curves of different neurons are very different, or have small overlap (for example, if different neurons are tuned to different values of the variable of interest), then the on-diagonal elements of the matrix A are much larger than the corresponding off-diagonal terms (|Aii||Ajj| >> |Aij||Aji|), implying that the matrix is well conditioned and can be robustly inverted. In other words, heterogeneous encoding of k and t implies that the problem of finding the optimal synaptic connections is computationally easy. If neurons display a wide variety of tuning curves for epoch and for elapsed trials, then the reconstruction of the target function is robust. Indeed, consistent with previous work, we observe a wide heterogeneity of tuning curves, which lends experimental support to our arguments.

We conclude this section with a technical remark about the dependence between the two variables k and t. The time t elapsed since the outcome depends on the epoch k, because the epochs in the current trial follow one another in time. However, since the dependence of time on the epoch is linear, the exponential function ex(t) contributes one additional factor that can be absorbed into the epoch code g(k). More concretely, suppose that time can be expressed in terms of elapsed trials n' and epoch k in the form t = Δn' + δk, where Δ is the duration of one trial and δ is the duration of one epoch. Then, in the case of a single exponential, ex(t) = A exp(-t/τ), we can rewrite the term in square brackets in Eq.(2.3) as

g(k)ex(t) = g(k)A exp(-t/τ) = [g(k)exp(-δk/τ)][A exp(-Δn'/τ)] = h(k)e(n')    (2.5)

Now the two variables k and n' are factorized and independent, and the last equality defines the functions h(k) and e(n'). The case in which ex(t) is the sum of two exponentials can be treated similarly, with the two exponentials considered as two separate inputs. In the light of Eq.(2.5), we rewrite Eq.(2.3) in terms of the independent variables, epoch k and elapsed trials n', as

fout(k,n') = ∑i wi[hi(k)ei(n')]    (2.6)

Hence all the above considerations for the encoding of epoch and elapsed time hold in the case in which the two variables of interest are independent.
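The conditioning argument around Eq.(2.4) can also be illustrated numerically (our sketch, with illustrative tuning curves): heterogeneous tuning makes A easy to invert, while nearly identical tuning curves make it almost singular.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, T = 25, 5, 6
tau = rng.uniform(0.5, 5.0, N)
ex = np.exp(-np.outer(1.0 / tau, np.arange(1, T + 1)))   # heterogeneous ex_i(t)

g_het = rng.uniform(0.0, 1.0, (N, K))                    # heterogeneous epoch codes
g_sim = np.tile(g_het[:1], (N, 1)) + 0.01 * rng.standard_normal((N, K))

def overlap_matrix(g, ex):
    # A_ij = [sum_k g_i(k) g_j(k)] [sum_t ex_i(t) ex_j(t)], Eq. (2.4)
    return (g @ g.T) * (ex @ ex.T)

print(np.linalg.cond(overlap_matrix(g_het, ex)))  # heterogeneous: much better conditioned
print(np.linalg.cond(overlap_matrix(g_sim, ex)))  # nearly identical tuning: ill-conditioned
```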
References

1. Sutton, R. S. & Barto, A. G., Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998).
2. Lee, D. & Wang, X.-J., Neural circuit mechanisms for stochastic decision making in the primate frontal cortex. In Neuroeconomics: Decision Making and the Brain. Academic Press, New York, NY, pp. 481-501 (2009).
3. Rushworth, M. F. S. & Behrens, T. E. J., Choice, uncertainty, and value in prefrontal and cingulate cortex. Nat. Neurosci. 11, 389-397 (2008).
4. Kable, J. W. & Glimcher, P. W., The neurobiology of decision: consensus and controversy. Neuron 63, 733-745 (2009).
5. Mezzadri, F., How to generate random matrices from the classical compact groups. Not. Am. Math. Soc. 54, 592-604 (2007).
6. Sornette, D., Critical Phenomena in Natural Sciences. Springer Verlag, Berlin (2004).
7. Langton, C. G., Computation at the edge of chaos: phase transitions and emergent computation. Physica D 42, 12-37 (1990).
8. Bertschinger, N. & Natschläger, T., Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput. 16, 1413-1436 (2004).
9. Sussillo, D. & Abbott, L. F., Generating coherent patterns of activity from chaotic neural networks. Neuron 63, 544-557 (2009).
10. Trefethen, L. N. & Embree, M., Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press, Princeton, NJ (2005).
11. Curtis, C. E. & Lee, D., Beyond working memory: the role of persistent activity in decision making. Trends Cogn. Sci. (2010), in press.
12. Egorov, A. V., Hamam, B. N., Fransen, E., Hasselmo, M. E. & Alonso, A. A., Graded persistent activity in entorhinal cortex neurons. Nature 420, 173-178 (2002).
13. Wang, X.-J., Synaptic reverberation underlying mnemonic persistent activity. Trends Neurosci. 24, 455-463 (2001).
14. Goldman, M., Compte, A. & Wang, X.-J., Neural integrators: recurrent mechanisms and models. In New Encyclopedia of Neuroscience (eds. Squire, L., Albright, T., Bloom, F., Gage, F. & Spitzer, N.). MacMillan Reference Ltd, pp. 165-178 (2008).
15. Coles, S. G., An Introduction to Statistical Modeling of Extreme Values. Springer Series in Statistics, Springer, New York (2001).