LETTER

Communicated by Peter Latham

Nonlinear Population Codes

Maoz Shamir
maoz@fiz.huji.ac.il
Haim Sompolinsky
haim@fiz.huji.ac.il
Racah Institute of Physics and Center for Neural Computation, Hebrew University of Jerusalem, Jerusalem 91904, Israel

Theoretical and experimental studies of distributed neuronal representations of sensory and behavioral variables usually assume that the tuning of the mean firing rates is the main source of information. However, recent theoretical studies have investigated the effect of cross-correlations in the trial-to-trial fluctuations of the neuronal responses on the accuracy of the representation. Assuming that only the first-order statistics of the neuronal responses are tuned to the stimulus, these studies have shown that in the presence of correlations, similar to those observed experimentally in cortical ensembles of neurons, the amount of information in the population is limited, yielding nonzero error levels even in the limit of infinitely large populations of neurons. In this letter, we study correlated neuronal populations whose higher-order statistics, and in particular response variances, are also modulated by the stimulus. We ask two questions: Does the correlated noise limit the accuracy of the neuronal representation of the stimulus? And how can a biological mechanism extract most of the information embedded in the higher-order statistics of the neuronal responses? Specifically, we address these questions in the context of a population of neurons coding an angular variable. We show that the information embedded in the variances grows linearly with the population size despite the presence of strong correlated noise. This information cannot be extracted by linear readout schemes, including the linear population vector. Instead, we propose a bilinear readout scheme that involves spatial decorrelation, quadratic nonlinearity, and population vector summation.
We show that this nonlinear population vector scheme yields accurate estimates of stimulus parameters, with an efficiency that grows linearly with the population size. This code can be implemented using biologically plausible neurons.

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1105–1136 (2004)

1 Introduction

In many areas of the brain, information on certain stimulus features is coded in a distributed manner, by a population of neurons (e.g., Georgopoulos, Schwartz, & Kettner, 1986; Lee, Rohrer, & Sparks, 1988; Wilson & McNaughton, 1993; Fitzpatrick, Batra, Stanford, & Kuwada, 1997; Young & Yamane, 1992). It is generally accepted that the average response of each neuron is tuned to the stimulus; hence, the average firing rate codes for the stimulus. The firing rates of each neuron fluctuate from trial to trial about their average value. If the trial-to-trial fluctuations of different neurons are uncorrelated, then by pooling information from large populations of neurons, the brain can overcome the noise inherent in the single-neuron responses. However, experimental findings suggest that correlations between the fluctuations of different neurons are considerable (Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998). Theoretical studies that investigated the effect of correlations on population coding yielded conflicting results (e.g., see Zohary et al., 1994; Abbott & Dayan, 1999). Recently, Sompolinsky, Yoon, Kang, and Shamir (2001) studied the effect of correlations on the accuracy of population coding. They showed that long-range positive correlations that vary smoothly with the functional distance between the neurons lead to saturation of the accuracy by which the stimulus parameters can be extracted to a finite value, even in the limit of infinitely large ensembles. This work, however, was limited to stimulus-independent covariance matrices.
Experimental findings show that information exists not only in the mean firing rates but also in higher-order statistics of the neuronal responses. The approximate linear relationship between the average firing rate and its variance reported in many brain areas (e.g., see Vogels, Spileers, & Orban, 1989; Britten, Shadlen, Newsome, & Movshon, 1992) implies that the variance is also tuned to the stimulus. Maynard et al. (1999) showed a (weak) dependence of the cross-correlations of different neurons on the stimulus. Here we address two questions. Is information coded in the higher-order statistics of the neuronal responses bounded by the correlated noise in the system? And how can such a code be read by a biologically plausible mechanism? For concreteness, we discuss the coding for an angle, such as the direction of arm movement in M1 during a simple unimanual movement task or the coding for the direction of movement of visual stimuli in area V5. The outline of this letter is as follows. In section 1, the problem of coding information in a correlated population of neurons is introduced. First, the stochastic model of the neuronal population that codes for the angle is defined. Then the Fisher information of the system is calculated, and the effect of the correlated noise on linear readout mechanisms is discussed. The saturation of information coded in the first-order statistics of the neuronal behavior, on the one hand, and the inability of linear readout schemes to extract information coded in the higher-order statistics, on the other hand, encourage the study of nonlinear readout schemes in section 2. Motivated by the results, we suggest a nonlinear readout mechanism and study its efficiency. We then discuss the statistical properties of the neurons in the readout layer. A combination of linear and nonlinear readout schemes is presented and is shown to have superior performance.
In section 3, we summarize the results of our analysis, discuss the effects of different variations of this model, and compare our model to other existing models of nonlinear population readout. Preliminary results have been published in Shamir and Sompolinsky (2001a, 2001b).

1.1 The Statistical Model of the Input Layer.

We consider a population of N neurons that code for an angle θ, and denote by r_i the response of the ith neuron (i ∈ {1, 2, ..., N}) to the stimulus θ. We shall refer to this population of neurons hereafter as the input layer. Assuming the firing rates of the neurons are sufficiently high, we model the probability distribution of the neuronal activities according to a multivariate gaussian distribution,

    P({r_i} | θ) = 1/√((2π)^N det C) · exp( −(1/2) Σ_ij C⁻¹_ij (r_i − f_i(θ))(r_j − f_j(θ)) ),   (1.1)

where f_i(θ) is the average firing rate of neuron i, and C_ij(θ) is the covariance of the responses of neurons i and j for a given stimulus θ. The average firing rate of each neuron is modeled by a smooth stereotypical tuning curve with a single peak at the neuron's preferred direction (PD), denoted here by φ_i, as shown in Figure 1a for a neuron with a PD of 0 degrees. The tuning curve of a neuron with a PD of φ is obtained by a horizontal translation of this tuning curve by φ. In our numerical simulations, we use the following tuning curve:

    f_i(θ) = f(θ − φ_i)   (1.2)
    f(θ) = (f_max − f_ref) exp( (cos θ − 1) / σ_f² ) + f_ref,   (1.3)

where σ_f is the tuning width, f_max is the peak response at the neuron's PD, and f_ref determines the modulation amplitude of the tuning curve. We assume that the PDs of the neurons are evenly spaced on the ring: φ_k = −π(N − 1)/N + (2π/N)k. The covariance matrix, C, on the ring is, in general, a function of three angles, C_ij(θ) = C(φ_i, φ_j, θ). Assuming isotropy of the system implies that there is no preferred spatial direction. Hence, C should obey C(φ_i, φ_j, θ) = C(φ_i − ψ, φ_j − ψ, θ − ψ) for arbitrary ψ.
Thus, C is a function of only two angles:

    C_ij(θ) = C(φ_i − θ, φ_j − θ).   (1.4)

Figure 1: The statistics of neurons in the input layer in model 1. (a) The tuning curve of a neuron i with a PD of φ_i = 0 degrees. The average firing rate of the neuron, f_i(θ), is plotted as a function of the stimulus, θ. The tuning curve of a neuron with a PD of φ can be obtained by a horizontal shift of the above curve by φ. (b) The tuning of the variance. The variance, b_i(θ), of neuron i with a PD of φ_i = 0 degrees is plotted as a function of the stimulus θ. (c) The decay of the covariance, C_ij, between neuron i with a PD of 0 degrees and neuron j with a PD of φ_j, plotted as a function of φ_j. Note that in model 1, C_ij, for i ≠ j, is independent of θ. The above figures present the rate statistics, which are equivalent to the spike count statistics in a 1 s time interval. In order to obtain the spike count statistics in different time intervals, these tuning curves should be scaled by the relevant time window. Unless stated otherwise, the parameters that were used for the numerical simulations of the input layer are: for the tuning curve, f_max = 80 s⁻¹, f_ref = 40 s⁻¹, σ_f = √2; for the variance, b_max = 80 s⁻¹, b_ref = 40 s⁻¹, σ_b = √(1/2); for the cross-correlations, c = 0.3, ρ = 1. Note that these parameters are given in rates, that is, in numbers per 1 s. In all the simulations, we used a time window of T = 0.25 s, by which the averages and the correlations were scaled linearly.

A particularly simple example is a system in which only the variances of the neurons depend on the external stimulus, while the cross-correlations do not depend on the stimulus but depend on the angular coordinates of the neurons.
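A population of this kind is easy to simulate numerically. The following is a minimal sketch (all function and variable names are our own) that builds the tuning curves, the tuned variances, and the distance-dependent cross-correlations described in the next paragraphs, with the parameter values of Figure 1, and draws one trial of the gaussian response of equation 1.1:

```python
import numpy as np

# Parameters follow Figure 1 (rates per 1 s); names are ours.
N = 200
f_max, f_ref, sigma_f = 80.0, 40.0, np.sqrt(2.0)   # mean-rate tuning, eq. 1.3
b_max, b_ref, sigma_b = 80.0, 40.0, np.sqrt(0.5)   # variance tuning
c, rho = 0.3, 1.0                                  # correlation strength and length

# Preferred directions, evenly spaced on the ring.
phi = -np.pi * (N - 1) / N + 2.0 * np.pi * np.arange(N) / N

def mean_rates(theta):
    """Tuning curves f_i(theta) of equations 1.2-1.3."""
    return (f_max - f_ref) * np.exp((np.cos(theta - phi) - 1) / sigma_f**2) + f_ref

def variances(theta):
    """Tuned variances b_i(theta), unimodal around each neuron's PD."""
    return (b_max - b_ref) * np.exp((np.cos(theta - phi) - 1) / sigma_b**2) + b_ref

def covariance(theta):
    """Model-1 covariance: stimulus-independent off-diagonal, tuned diagonal."""
    d = np.abs(phi[:, None] - phi[None, :])
    d = np.minimum(d, 2 * np.pi - d)        # angular distance, in [0, pi]
    C = c * b_ref * np.exp(-d / rho)
    np.fill_diagonal(C, variances(theta))
    return C

# One trial of the gaussian population response (equation 1.1):
rng = np.random.default_rng(0)
r = rng.multivariate_normal(mean_rates(0.0), covariance(0.0))
```

The covariance built this way is symmetric and positive definite (the exponential kernel on the ring has positive Fourier coefficients, and the tuned diagonal only adds positive terms), so it is a valid gaussian covariance.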
We shall refer to this model hereafter as model 1. We assume that correlations are stronger for pairs of neurons with closer PDs, a property that has been observed experimentally in some systems (Zohary et al., 1994; Lee et al., 1998; van Kan, Scobey, & Gabor, 1985; Mastronarde, 1983). For concreteness, we have used the following form of C for model 1:

    C_ij = δ_ij b_i(θ) + c b_ref (1 − δ_ij) exp( −|φ_i − φ_j| / ρ ),   model 1   (1.5)
    b_i(θ) = b(φ_i − θ) = (b_max − b_ref) exp( (cos(θ − φ_i) − 1) / σ_b² ) + b_ref,   (1.6)

where δ_ij is a Kronecker delta. Here the variance, b_i(θ), has a unimodal tuning with regard to θ with a peak at the neuron's PD, as shown in Figure 1b. The tuning of the variance is determined by three parameters: σ_b is the tuning width, b_max is the peak variance, and b_ref determines the modulation of the variance tuning curve. This stimulus dependence of the neuronal variances is qualitatively similar to the monotonic relationship between variance and mean observed in many neurons. The dependence of the cross-correlations on the distance between neurons (i.e., the difference in their PDs) is modeled as an exponential decay with correlation length ρ and correlation strength c. Here |x| denotes angular distance and ranges from 0 to π. Figure 1c shows the decay of the correlation between the neurons as a function of the distance between them. We assume that the correlations are long range, that is, ρ = O(1) (note the maximum distance is π), and that ρ remains fixed when we change the size of the population. For example, for model 1 with the parameters c = 0.3, ρ = 1, b_max = 80, b_ref = 40, and σ_b = √(1/2), for a time interval of T = 1 s (see Figure 1), the average correlation coefficient (CC) across pairs of neurons with PD differences of less than 90 degrees is 0.12, while for pairs of neurons with PD differences of more than 90 degrees, the average CC is 0.025. For comparison, Zohary et al.
(1994) report an average CC of 0.18 for neurons with close PDs and 0.04 for neurons with PD differences of more than 90 degrees. Although these correlation values may seem weak, their effect on the population code is dramatic, as shown below. Another simple example is the multiplicative model (Abbott & Dayan, 1999), which we shall refer to as model 2. Here, the covariance matrix has the form

    C_ij(θ) = √(b_i(θ) b_j(θ)) [ δ_ij + c(1 − δ_ij) exp( −|φ_i − φ_j| / ρ ) ].   model 2   (1.7)

The neuronal variances are given by b_i(θ), as before. The cross-correlations consist of a stimulus-dependent multiplicative component, √(b_i(θ) b_j(θ)), multiplied by the distance-dependent function. The two models for the correlation matrix are introduced as two examples that will yield qualitatively similar results (see below). As will be clear from the analysis below, the essential features of the correlation matrices are the long range of the cross-correlations and the discontinuity of the correlation matrix on the diagonal resulting from the variance term.

1.2 Estimation Errors and Fisher Information.

Throughout this letter, we will evaluate the efficiency of an estimator θ̂ of the stimulus angle θ by the inverse of the mean square of its estimation error, ⟨(δθ̂)²⟩⁻¹, where δθ̂ = θ̂ − ⟨θ̂⟩ and ⟨...⟩ denotes averaging with regard to the distribution of the neuronal responses for a given stimulus θ. The Fisher information (FI) (Thomas & Cover, 1991; Kay, 1993),

    J = ⟨ ( ∂ log P(r|θ) / ∂θ )² ⟩,   (1.8)

provides an upper bound on the efficiency of any unbiased estimator of θ.
The FI of the gaussian ensemble, equation 1.1, can be written as a sum of the FI of the mean responses, denoted Jmean, and that of the tuning of the covariance matrix, Jcov (see, e.g., Kay, 1993):

    J = Jmean + Jcov   (1.9)
    Jmean = f′ᵀ C⁻¹ f′   (1.10)
    Jcov = (1/2) Trace{ C′ C⁻¹ C′ C⁻¹ }.   (1.11)

Here f′ and C′ are derivatives with regard to θ of the mean response vector and the covariance matrix, respectively. In the case of an uncorrelated population, the FI is given by J = Jmean + Jvar, where

    Jmean = N ∫_{−π}^{π} (dφ/2π) f′²(φ) / b(φ)   (1.12)

and

    Jvar = N ∫_{−π}^{π} (dφ/2π) b′²(φ) / (2b²(φ)).   (1.13)

Here and throughout the letter, we assume that the system size, N, is large; hence spatial summations can be converted to integrals over angles.

Figure 2: FI of a correlated population of neurons. The FI, J, of the input layer (model 1) is plotted as a function of the number of cells, N, in the pool by the solid line. The contributions of the different terms, Jmean and Jcov, are plotted by the dash-dot and the dotted lines, respectively. Notice that Jmean saturates for large pools, while Jcov grows linearly with the size of the pool, N. The quantities J, Jmean, and Jcov were calculated using equations 1.9–1.11 and the parameters appearing in Figure 1 for model 1. Qualitatively similar results are obtained in model 2 as well.

Figure 2 shows the values of J, Jmean, and Jcov for model 1 as a function of N. Note that while the FI of the covariance matrix, Jcov, grows linearly with the population size, N, the FI of the average firing rates, Jmean, saturates to a finite value, in contrast to the FI of an uncorrelated population, equation 1.12. The saturation of Jmean in the large-N limit is not specific to this model but is characteristic of systems with long-range positive correlations, as previously shown (Sompolinsky et al., 2001).
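Equations 1.10 and 1.11 are straightforward to evaluate numerically for any gaussian population. The sketch below (function names are ours) uses central finite differences for f′ and C′, and checks the result on a toy pair of independent, unit-variance neurons with cosine tuning, for which f′·f′ = 1 at every θ and a constant covariance carries no information:

```python
import numpy as np

def fisher_info(mean_fn, cov_fn, theta, eps=1e-5):
    """J_mean and J_cov of a gaussian population (eqs. 1.10-1.11),
    with derivatives taken by central finite differences."""
    fp = (mean_fn(theta + eps) - mean_fn(theta - eps)) / (2 * eps)
    Cp = (cov_fn(theta + eps) - cov_fn(theta - eps)) / (2 * eps)
    Cinv = np.linalg.inv(cov_fn(theta))
    J_mean = fp @ Cinv @ fp                        # equation 1.10
    J_cov = 0.5 * np.trace(Cp @ Cinv @ Cp @ Cinv)  # equation 1.11
    return J_mean, J_cov

# Toy check: two independent unit-variance neurons with cosine tuning.
mean_fn = lambda th: np.array([np.cos(th), np.sin(th)])
cov_fn = lambda th: np.eye(2)
J_mean, J_cov = fisher_info(mean_fn, cov_fn, 0.3)
```

For this toy pair, J_mean evaluates to 1 (up to finite-difference error) and J_cov vanishes, as expected from equations 1.10 and 1.11.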
Thus, in the present system, most of the information in the neuronal responses comes from the stimulus dependence of the covariance matrix. In this letter, we study readout models for correlated populations. As we will show below, a linear readout is incapable of extracting the information embedded in the second-order statistics; its performance is bounded by Jmean. This raises the question of whether there is a nonlinear readout scheme whose performance is close to the total J, and if so, how complex such a readout needs to be. Here we present a relatively simple bilinear readout that extracts a large fraction of the information embedded in the covariance matrix, providing a plausible model for readout of population codes with stimulus-dependent correlations.

1.3 Linear and Recurrent Network Readouts.

A linear readout is an estimator of the form Σ_i r_i w_i, where {w_i} is a set of readout weights. Here we shall consider the linear readout in the context of two tasks: estimation and discrimination of an angular variable.

1.3.1 Estimation. A well-known example of a linear estimator of θ is the linear population vector (LPV), first proposed by Georgopoulos et al. (1986) for the readout of arm-reaching-movement directions in two dimensions from the responses of motor cortical cells. The LPV can be written as

    ẑ = Σ_{i=1}^{N} r_i e^{iφ_i},   (1.14)

where we have used a complex representation of two-dimensional unit vectors. In this notation, ẑ is an estimator of the unit vector z = e^{iθ}, which represents a movement in the direction of the angle θ. Using the rotational symmetry of the system, one can show that the estimator ẑ/|⟨ẑ⟩| is an unbiased estimator of z = e^{iθ} and that it is the optimal linear readout in terms of squared error averaged over all angles θ, with a fixed set of weights, w (see appendix A.1).
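Equation 1.14 amounts to taking the phase of a complex sum. A minimal sketch (names are ours), checked on a hypothetical noise-free cosine tuning curve where, for evenly spaced PDs, the uniform term cancels and the phase recovers θ exactly:

```python
import numpy as np

def lpv_angle(r, phi):
    """Linear population vector (eq. 1.14), read out as the phase of the sum."""
    return np.angle(np.sum(r * np.exp(1j * phi)))

# Noise-free check: mean responses of a hypothetical cosine tuning curve.
N = 100
phi = 2 * np.pi * np.arange(N) / N - np.pi
theta = 0.7
rates = 40 + 20 * np.cos(theta - phi)   # mean responses only, no trial noise
theta_hat = lpv_angle(rates, phi)
```

Here Σ_i e^{iφ_i} = 0 for evenly spaced PDs, so only the cosine component survives, giving a sum proportional to e^{iθ} and hence θ̂ = θ exactly in the absence of noise.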
The efficiency of any estimator that respects the symmetry of the system (i.e., its parameters are periodic functions of φ_i) is independent of θ. Assuming that the fluctuations in the estimation of θ are small, one can write the average estimation error of such an estimator as a signal-to-noise relation for θ = 0 degrees (see appendix A.1),

    ⟨(δθ̂)²⟩⁻¹ = (signal / noise)²   (1.15)
    signal = ⟨x̂⟩   (1.16)
    noise² = ⟨(δŷ)²⟩,   (1.17)

where we have used the notation ẑ = x̂ + iŷ. For the LPV, these signal and noise terms are

    ⟨x̂⟩ = Σ_{i=1}^{N} f(φ_i) cos(φ_i)   (1.18)
    ⟨(δŷ)²⟩ = Σ_{i,j=1}^{N} C_ij(θ) sin(φ_i) sin(φ_j),   (1.19)

where C_ij(θ) is the covariance matrix given by equations 1.5 and 1.6 for model 1. In the case of an uncorrelated population of neurons, C(φ_i, φ_j) = b(φ_i)δ_ij; hence, both equations 1.16 and 1.17 grow linearly with N, yielding (see also Seung & Sompolinsky, 1993)

    ⟨(δθ̂)²⟩⁻¹ = N · 2f_1² / (b_0 − b_2),   (1.20)

where f_n = ∫ (dφ/2π) f(φ) e^{inφ}, and similarly b_n. This result should be compared to the upper bound provided by J = Jmean + Jvar of equations 1.12 and 1.13. Note that the performance of the LPV is bounded by Jmean (see section 1.3.2), reflecting the fact that the linear readout extracts information only from the mean responses. In fact, the slope of the LPV performance versus N, equation 1.20, is smaller than that of Jmean, equation 1.12, since, due to the smoothness of the LPV weights, it is sensitive only to the low Fourier components of f and b: the first Fourier component of f, and the zeroth and second Fourier components of b, as can be seen from equation 1.20. In the uncorrelated case, the efficiency of the LPV grows linearly with N, although with a suboptimal slope. This is not the case in the presence of the correlations. Figure 3 shows the effect of the correlations on the accuracy of the LPV readout (open circles) for model 1 for several values of the parameter c, equation 1.5.
As can be seen from the figure, for c = 0 the efficiency of the LPV grows linearly with N. However, in the case of c > 0, there exists some N_sat such that for N ≳ N_sat the efficiency saturates to a size-independent limit. Note that because the tuning curves of the mean rates for the chosen parameters are broad, they contain little power in high Fourier modes. Consequently, the LPV performs close to the bound of Jmean.

1.3.2 Discrimination. Another interesting readout problem is that of discriminating between two close angles. We model a two-interval discrimination task by a system that is given two sets of neuronal activities, r^(1) and r^(2), generated by two proximal stimuli θ and θ + δθ. From these responses, the system must infer which stimulus generated which activity. For cases where the FI is large, the maximum likelihood (ML) discrimination for two close angles yields a probability of error given by H(d′/√2), where H(x) = (2π)^{−1/2} ∫_x^∞ dt e^{−t²/2}, and the discriminability d′ equals

    d′ = |δθ| √J(θ),   (1.21)

where J(θ) is the FI of the system (Seung & Sompolinsky, 1993). A linear readout can be used to perform the discrimination task. We define the discrimination rule to be according to the sign of q = w·(r^(1) − r^(2)), where a plus sign is interpreted as θ + δθ having been presented first and a minus sign as θ having been presented first. The set of weights w that optimizes the discrimination for close angles, δθ ≪ 1, is given by w = C⁻¹f′. Note that these weights depend on the angle θ. The probability of error of the optimal

Figure 3: The effect of correlations on the efficiency of the LPV and RN estimators. The efficiency of the readouts is shown in terms of the reciprocal squared estimation error.
The efficiency of the LPV (circles) and the RN (boxes) is plotted as a function of the number of cells in the pool, for different values of correlation strength: from the bottom, c = 0.3, c = 0.03, and c = 0; all other parameters are as defined in Figure 1. The efficiency was calculated numerically by simulating the neuronal responses and averaging the LPV/RN squared estimation errors over 4000 trials, for model 1. The contribution of the average firing rates to the FI of the system, Jmean, is given by the solid lines, from bottom to top for c = 0.3, c = 0.03, and c = 0. Jmean was calculated using equation 1.10.

linear discriminator is given by H(d′_L/√2), where (see section A.2)

    d′_L = |δθ| √Jmean(θ),   (1.22)

which is inferior to the ML performance, equation 1.21. Note that the linear discriminator is the optimal local linear estimator (defined by a linear readout with weights that depend on the neighborhood of angles to be estimated). Thus, Jmean is an upper bound on the performance of any linear estimator of θ.

1.3.3 Recurrent Networks. Recently, a recurrent network (RN) model has been proposed (Deneve, Latham, & Pouget, 1999) as an improvement on the performance of the LPV in estimation tasks. Essentially, the RN readout is superior to the LPV in that it is sensitive to the finer structure of the neuronal tuning curves. The efficiency of the RN readout is presented in Figure 3 (open boxes) for different values of c, as a function of N. In this example, the RN performance is only slightly better than that of the LPV (circles). This is due to the fact that for smooth tuning curves, the LPV already captures almost all the information in the mean responses. The numerical results are in agreement with the claim (Deneve et al., 1999) that the efficiency of the optimal RN readout in the limit of small estimation errors approaches 1/⟨(δθ̂_RN)²⟩ = Jmean.
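The discrimination-error formulas of equations 1.21 and 1.22 can be sketched directly, writing the gaussian tail function H via the error function (names are ours):

```python
import numpy as np
from math import erf, sqrt

def H(x):
    """Gaussian tail function H(x) = (2*pi)^{-1/2} * integral_x^inf e^{-t^2/2} dt."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def ml_error_probability(dtheta, J):
    """ML discrimination error: P_err = H(d'/sqrt(2)) with d' = |dtheta| sqrt(J)."""
    d_prime = abs(dtheta) * np.sqrt(J)     # equation 1.21
    return H(d_prime / sqrt(2.0))
```

For δθ = 0 the two stimuli are indistinguishable and the error probability is 1/2; it decreases monotonically as either |δθ| or J grows, approaching zero for large discriminability.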
Thus, although the performance of the RN readout is in general superior to that of the LPV, it too saturates to a size-independent limit in the presence of long-range positive correlations. In summary, both the LPV and the RN schemes read out only the information that is embedded in the first-order statistics of the data, that is, the tuning curves of the neurons, and hence both are bounded by Jmean.

2 The Bilinear Readout

2.1 Covariance Fisher Information.

In order to understand how the information in the second-order statistics can be extracted, it is important to gain insight into the properties of the FI, equations 1.9–1.11. As observed above, the first-order statistics contain only a finite amount of information. The reason for the saturation of Jmean is that the tuning curves for the average firing rates of the neurons are relatively smooth functions of the stimulus angle; hence, most of the embedded information resides in the low Fourier modes of the network responses. On the other hand, since the cross-correlations vary smoothly with angular distance, these slowly varying modes contain noise whose standard deviation grows linearly with N, yielding a signal-to-noise ratio of order 1. This raises the question of why the second-order statistics of the same smooth system yield information that grows linearly with N. Indeed, our study shows that most of the information in Jcov resides in the stimulus dependence of the variances, not the cross-correlations. More precisely, the covariance matrix of a system with smoothly varying cross-correlations (such as equations 1.5 and 1.7) has a discontinuity on the diagonal resulting from the fact that the variance, b_i(θ), is larger than the zero-distance limit of the cross-correlations, which is c·b_ref in model 1 (see Figure 1) and c·b_i(θ) in model 2. It is useful to write the matrix C as

    C_ij = δ_ij D_i(θ) + S_ij,   (2.1)

with S the continuous (smooth) part of C and D the discontinuous part of C on the diagonal.
In the models described above, we obtain

    D_i(θ) = b(φ_i − θ) − c b_ref,   model 1   (2.2)
    S_ij = c b_ref exp( −|φ_i − φ_j| / ρ ),   model 1   (2.3)

for model 1, and

    D_i(θ) = b(φ_i − θ)(1 − c),   model 2   (2.4)
    S_ij = c √(b_i(θ) b_j(θ)) exp( −|φ_i − φ_j| / ρ ),   model 2   (2.5)

for model 2.

Figure 4: The difference between the contribution of the correlations to the total FI of the system, Jcov, and the FI of an uncorrelated gaussian population of neurons with zero means and variances D, denoted J_D, is shown as a function of the size of the system, N, for model 1 (left) and for model 2 (right). Note that the difference |Jcov − J_D| is a sublinear function of N. For model 2, the tuning curves of the mean and variance are as defined in Figure 1, with ρ = 1 and c = 0.3 (see equation 1.7).

As will be explained below, the discontinuity in the covariance matrix along the diagonal plays an important role in the system's behavior. In fact, for both models, we find numerically that for large N, Jcov can be approximated by

    Jcov ≈ J_D ≡ (1/2) Σ_{i=1}^{N} ( D_i′(θ) / D_i(θ) )² ≈ N ∫ (dφ/2π) D′²(φ) / (2D²(φ)).   (2.6)

Equation 2.6 is a central result of this article. Figure 4 shows the difference between Jcov and J_D as a function of N for models 1 (left) and 2 (right). In both cases, as can be seen from the figure, the difference |Jcov − J_D| is a sublinear function of N for large N. However, from equation 2.6 we find that J_D scales linearly with N; hence, to leading order in N, one obtains the result of equation 2.6. In section 2.4, we present an analytical calculation that bounds Jcov from below by J_D to leading order in N.
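The integral form of J_D in equation 2.6 is easy to evaluate numerically for model 1, where D(φ) = b(φ) − c·b_ref. A minimal sketch with the parameter values of Figure 1 (names are ours):

```python
import numpy as np

# Model-1 parameters, following Figure 1.
b_max, b_ref, sigma_b, c = 80.0, 40.0, np.sqrt(0.5), 0.3

def D(phi):
    """Discontinuous diagonal part of C for model 1 (equation 2.2)."""
    b = (b_max - b_ref) * np.exp((np.cos(phi) - 1) / sigma_b**2) + b_ref
    return b - c * b_ref

# Per-neuron contribution: the integral over dphi/(2*pi) in equation 2.6.
phi = np.linspace(-np.pi, np.pi, 20001)
dD = np.gradient(D(phi), phi)
J_D_per_neuron = np.mean(dD**2 / (2.0 * D(phi)**2))

# J_D itself then grows linearly with population size: J_D ~ N * J_D_per_neuron.
```

Because D(φ) stays strictly positive for these parameters (its minimum is b_ref − c·b_ref + small corrections), the integrand is bounded and the per-neuron contribution is a finite positive constant, which is exactly what makes J_D extensive in N.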
This result, equation 2.6, suggests that the source of the fact that the second-order statistics of the correlated system contain an extensive amount of information is the discontinuous nature of the variance. Because of this discontinuity, the information in the variance resides in all modes, including the modes with high Fourier numbers, and is therefore immune to the correlated noise, which resides mostly in the slowly varying modes. Comparing equation 2.6 with equation 1.13 indicates that the information in the second-order statistics of the correlated system is equivalent to that of an uncorrelated population of neurons with zero mean tuning but with stimulus-dependent variances equal to D_i(θ). This important insight motivates our proposal for a nonlinear readout in the next section.

2.2 The Bilinear Readout Mechanism.

Based on the above results, we propose a two-layer readout network (see Figure 5) for extracting information out of the second-order statistics. The first layer of the readout system receives as input the activities of neurons in the input layer, {r_i}, and performs a linear filtering of the inputs followed by a quadratic nonlinearity. Similar to the input-layer neurons, the output response of a neuron in the first readout layer is a real scalar value. Denoting the output of the ith neuron in this layer by R_i, we define R_i according to the deterministic mapping

    R_i = ( Σ_{j=1}^{N} W_ij r_j )²,   (2.7)

where {W_ij} is a set of linear filtering weights between the presynaptic neuron j in the input layer and the postsynaptic neuron i in the first processing layer. The form of the N × N filter W will be discussed below.
The output layer of the readout mechanism receives, via feedforward connections, the responses of neurons in the first layer (see Figure 5) and calculates their population vector, termed hereafter the nonlinear population vector (NLPV),

    Ẑ = Σ_{i=1}^{N} R_i e^{iφ_i},   Ẑ = X̂ + iŶ.   (2.8)

Note that we associate with the ith neuron of the first layer an angle, φ_i, which is the same as the PD of the ith input neuron. The rationale for this will be evident when we specify W below. Using the symmetry of the correlation matrix, C_ij(θ), equation 1.4, and constraining the linear filter W to be of the form W_ij = W(φ_i − φ_j), one obtains

    ⟨Ẑ⟩ = e^{iθ} ( Σ_{ijk} ( C(φ_j, φ_k) + f(φ_j) f(φ_k) ) W(φ_j − φ_i) W(φ_k − φ_i) e^{iφ_i} ).   (2.9)

Figure 5: The architecture of the NLPV readout model. The input to the first layer of the readout mechanism is a linear combination of the activities of neurons in the input layer. The first layer responds with a nonlinear transfer function, and its output is transmitted in a feedforward manner to the second layer, which calculates its population vector. At the bottom left appears a plot of the NLPV weight matrix, W_ij, as a function of (φ_i − φ_j), for N = 300 and p = 6.

Hence, the estimator Ẑ/|⟨Ẑ⟩| is an unbiased estimator of θ. As before, we calculate its variance for θ = 0 degrees by ⟨(δθ̂)²⟩ = ⟨(δŶ)²⟩ / ⟨X̂⟩² (see appendix A.1), the results of which are shown below.

2.3 The Efficiency of the Bilinear Readout.

In order to complete the definition of the readout, one has to choose the form of the linear filter W.
Motivated by the observation that most of the signal lies in the discontinuous part of the correlation matrix on the diagonal and that the noise resides mainly in the slowly varying modes of the network, that is, in the low-order Fourier modes, we choose

    W_kj = δ_kj − (1/N) Σ_{n=−p}^{p} e^{in(φ_k − φ_j)} = (1/N) Σ_{|n|>p} e^{in(φ_k − φ_j)}.   (2.10)

With this choice, the matrix W simply filters out the first (2p + 1) low-order Fourier modes of the inputs, {r_i}. The parameter p determines how many low-order Fourier modes are filtered out by W. Substituting equations 2.8 and 2.10 into equation 1.16, the signal term reduces to

    ⟨X̂⟩ = 2N Real{ Σ_{n>p} ( C̃_{n+1,n} + f_{n+1} f_n* ) },   (2.11)

where C̃_{m,n} is the double Fourier transform of the correlation matrix,

    C̃_{m,n} = (1/N²) Σ_{kj} e^{i(mφ_k − nφ_j)} C_kj.   (2.12)

In the large-N limit, the Fourier transform is given by (see equation 2.1)

    C̃_{m,n} = (1/N) D_{m−n} + S_{m,n}   (2.13)
    D_m = ∫ (dφ/2π) D(φ) e^{imφ}   (2.14)
    S_{m,n} = ∫ (dφ dψ/(2π)²) S(φ, ψ) e^{i(mφ − nψ)}.   (2.15)

Both S and f are relatively smooth functions of the angles; hence, S_{m,n} and f_n decay fast with increasing m and n. Therefore, for sufficiently large p, one can neglect the contribution of S and f to the signal, equation 2.11. Finally, noting that C̃_{n+1,n} = (1/N) D_1 + S_{n+1,n}, we obtain

    ⟨X̂⟩ = Σ_{|n|>p} D_1 = [N − (2p + 1)] D_1 ≈ N D_1 = N ∫ (dφ/2π) D(φ) e^{iφ},   (2.16)

where D is the discontinuous piece of the covariance matrix, equations 2.2 and 2.4. It is important to note the limits in which equation 2.16 is valid. First, we have assumed (in the last step) that p ≪ N, such that terms of order p are negligible compared to terms of order N. On the other hand, we have assumed that p is sufficiently large that the Fourier transforms of S and f can be neglected relative to the contribution of the terms containing D. As will be shown below, this latter condition is equivalent to the requirement
$N \ll N_{\mathrm{sat}}(p)$, where $N_{\mathrm{sat}}(p)$ is a strongly growing function of $p$. The form of this function depends on the asymptotic decay of the Fourier transform of $S$. Thus, the range of validity of equation 2.16 is $p \ll N \ll N_{\mathrm{sat}}(p)$.

Figure 6: Nonlinear PV efficiency as a function of the size of the pool for different values of $p$. Efficiency of the NLPV readout in terms of the reciprocal average squared estimation error, shown in solid lines for $p = 1, 3, 6$ from bottom to top, for model 1 (left) and model 2 (right). The dashed line is the asymptotic value of the NLPV efficiency, equation 2.19.

Similar but more tedious analysis of the noise term yields (see appendix A.3), for $p \ll N \ll N_{\mathrm{sat}}(p)$,

$$\langle(\delta\hat{Y})^2\rangle \approx N\sum_m \bigl(D_m D_{-m} - D_m D_{2-m}\bigr) = \frac{N}{2}\int\frac{d\phi}{2\pi}\, B(\phi)\,(1-\cos 2\phi), \tag{2.17}$$

where

$$B(\phi) = 2D^2(\phi). \tag{2.18}$$

Combining equations 2.16 and 2.17, we obtain

$$\langle(\delta\hat\theta)^2\rangle^{-1} = N\,\frac{2|D_1|^2}{B_0 - B_2} \equiv N J_0, \qquad p \ll N \ll N_{\mathrm{sat}}(p). \tag{2.19}$$

This result shows that for sufficiently large $p$ such that $N \ll N_{\mathrm{sat}}(p)$, the squared estimation error will scale like $1/N$. Figure 6 shows the efficiency of the NLPV readout as a function of the pooling size $N$, for models 1 (left) and 2 (right) and several values of $p$. The dashed line is the asymptotic limit of the performance of the NLPV, given by equation 2.19. As indicated by the results, the estimation efficiency of the NLPV initially grows linearly with $N$. However, for any given $p$, there exists some scale of $N$, $N_{\mathrm{sat}}(p)$, above which the efficiency of the NLPV begins to saturate.

Figure 7: Typical network size for saturation of the NLPV efficiency, $N_{\mathrm{sat}}$, as a function of $p$, for model 1. The exact expression, $N_{\mathrm{sat}} = N I_D/I_S$ (equation A.22, calculated for $N = 2000$), is shown by the solid line. The approximate result, equation A.19, is shown by the dashed line.
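The action of the filter in equation 2.10 can be illustrated with a short numerical sketch (ours, not part of the original study; the ring of $N$ equally spaced PDs and the parameter values are assumptions): $W$ is the identity minus the projector onto the $2p+1$ lowest Fourier modes, so it annihilates any mode of order $|n| \le p$ and passes higher-order modes through unchanged.

```python
import numpy as np

def highpass_filter(N, p):
    """Eq. 2.10: W_kj = delta_kj - (1/N) * sum_{n=-p..p} exp(i n (phi_k - phi_j)).

    Subtracts the projection onto the 2p+1 lowest Fourier modes of the ring."""
    phi = 2 * np.pi * np.arange(N) / N                 # equally spaced PDs
    dphi = phi[:, None] - phi[None, :]
    n = np.arange(-p, p + 1)
    low = np.exp(1j * n * dphi[:, :, None]).sum(-1).real / N
    return np.eye(N) - low

N, p = 64, 3
W = highpass_filter(N, p)
phi = 2 * np.pi * np.arange(N) / N
low_mode = np.cos(2 * phi)      # Fourier mode of order n = 2 <= p: filtered out
high_mode = np.cos(10 * phi)    # order n = 10 > p: passes unchanged
print(np.abs(W @ low_mode).max(), np.abs(W @ high_mode - high_mode).max())
```

Because the PDs sit on a uniform grid, the low-order exponentials are exactly orthogonal to the high-order ones, so the two printed residuals are zero up to floating-point error.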
Indeed, we show in the appendix that for any fixed $p$, if $N$ is made large enough, the standard deviation of the noise, $\sqrt{\langle(\delta\hat{Y})^2\rangle}$, eventually grows like $N$, saturating the signal-to-noise ratio. Analyzing this saturation in detail for model 1, we derive the following expression for the saturation size,

$$N_{\mathrm{sat}}(p) = \frac{B_0 - B_2}{\frac{4}{3}\bigl(\frac{c\,b_{\mathrm{ref}}}{\pi\rho}\bigr)^2\,\bigl(1 - e^{-2\pi/\rho}\bigr)}\; p^3, \qquad \text{model 1}, \tag{2.20}$$

where $B(\phi)$ is given in equation 2.18. The fact that the saturation value grows fast with $p$ implies that we can use the bilinear readout to obtain an accurate estimate of $\theta$ even with moderate values of $p$, as shown in Figure 6. Figure 7 shows $N_{\mathrm{sat}}$ as a function of $p$ in model 1. The exact expression for $N_{\mathrm{sat}}$ (equation A.23 in the appendix) is given by the solid line. The dashed line shows the approximate analytical calculation, equation 2.20. Finally, when $N$ is large compared with $N_{\mathrm{sat}}$, the saturated value of the efficiency of the NLPV is given by

$$\lim_{N\to\infty}\langle(\delta\hat\theta)^2\rangle^{-1} = J_0\, N_{\mathrm{sat}}, \tag{2.21}$$

where $J_0$ is defined in equation 2.19. Thus, $N_{\mathrm{sat}}$ determines both the linear regime of the efficiency of the NLPV readout in $N$ and the asymptotic efficiency of this readout.

Figure 8: Typical result of simulating the activities of neurons in the model during a single trial of a presentation of $\theta = 0$ degree. (a) Activities of a population of $N = 500$ neurons in the input layer during a single trial of presentation of $\theta = 0$ degree, model 1. The neuronal activities are plotted as a function of their PDs by the thin line. The thick smooth line is the average across-trial activity of the different neurons. (b) The activities of neurons in layer 1 of the readout are shown as a function of their PDs by the thin line. The input to the system in this single trial is given in a. The average activity of the neurons is shown by the thick line. For this graph, $p = 10$ was used.
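To make the construction concrete, the following Monte Carlo sketch runs the full NLPV pipeline of equation 2.8 on a synthetic population. This is our own illustration, not the simulation used in the figures: the tuning curves, the model-1-like covariance (a slowly decaying cross-covariance plus a tuned jump on the diagonal), and every parameter value below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, theta = 200, 6, 0.7                    # pool size, filter cutoff, true angle
phi = 2 * np.pi * np.arange(N) / N           # preferred directions

# Model-1-like statistics (illustrative): tuned means and variances, plus a
# slowly decaying, stimulus-independent cross-covariance.
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)             # circular distance between PDs
rho, c, b_ref = 1.0, 0.3, 8.0
f = 10 + 5 * np.cos(phi - theta)             # tuned mean rates
b = 8 + 4 * np.cos(phi - theta)              # tuned variances
C = c * b_ref * np.exp(-d / rho)             # smooth part S
C += np.diag(b - c * b_ref)                  # discontinuous diagonal piece D
Lc = np.linalg.cholesky(C)

# NLPV readout: high-pass filter (eq. 2.10), square, population vector (eq. 2.8).
n = np.arange(-p, p + 1)
W = np.eye(N) - np.exp(1j * n * (phi[:, None] - phi[None, :])[:, :, None]).sum(-1).real / N

T = 400                                      # trials
r = f[:, None] + Lc @ rng.standard_normal((N, T))
R = (W @ r) ** 2                             # first readout layer
Z = (R * np.exp(1j * phi)[:, None]).sum(0)   # nonlinear population vector
err = np.angle(np.exp(1j * (np.angle(Z) - theta)))
print(np.mean(err), np.sqrt(np.mean(err ** 2)))  # bias and rms error (radians)
```

With these assumed parameters, the single-trial rms error is a fraction of a radian and the bias vanishes, in line with the unbiasedness of $\hat{Z}/|\langle\hat{Z}\rangle|$ and the $1/N$ error scaling of equation 2.19.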
Qualitatively similar results are obtained for model 2.

2.4 Statistics of the Neuronal Responses in Layer 1 of the Readout. To obtain more insight into the performance of the NLPV, we compared the statistics of the responses of neurons in the input layer, $\{r_i\}$, with those in the first readout layer, $\{R_i\}$. An example of the two responses is displayed in Figure 8 for a network of 500 neurons during a presentation of stimulus $\theta = 0$ degree as a function of their PDs. Figure 8a shows the activities in the input layer (thin line) together with their trial-averaged activities (thick line). The rapid jittering of the neuronal responses is caused by fluctuations in the high-order Fourier modes of the system. However, it is the fluctuations in the low-order Fourier modes of the system that cause the population activity profile to shift left (as in the figure) or right from the average profile (i.e., the population tuning curve), and thus cause substantial errors in the estimation of $\theta$. In the specific example shown here, the input layer activities (thin line) look as though they were shifted from the average profile of activities (thick line) by 30 degrees, causing an error of $\approx 30$ degrees in the LPV estimation of $\theta$.

Figure 9: (a) Covariance matrix of the stochastic network. The covariance between the activities of neurons $i$ and $j$ is plotted on the vertical axis as a function of $\phi_i$ and $\phi_j$. The covariance is calculated with respect to the distribution of the $r_i$, given $\theta = 0$ degree, in model 1. Note that the covariance between the activities of different neurons (off-diagonal elements) is independent of $\theta$ and decays slowly with the functional distance between the neurons. The variance (diagonal elements) depends on $\theta$ and peaks at $\phi_i = \theta$. (b) Covariance matrix of neurons in layer 1 of the readout. The covariance is calculated with respect to the distribution of their inputs, $\{r_i\}$, given $\theta = 0$ degree, for $p = 15$. Note that the covariance between different neurons is negligible. The variance (diagonal elements of the matrix) is stimulus dependent and peaks at $\phi_i = \theta$.

Figure 8b shows the activities of neurons in the first readout layer. The rapid jittering in the activities is obvious, but the long-range fluctuations observed in the input layer are absent, suggesting that the fluctuations in $R_i$ are largely independent. Indeed, using calculations similar to those mentioned in the previous section, we show (see appendix A.3) that in the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$, the statistics of $R_i$ obey

$$\langle R_i\rangle = D_i \tag{2.22}$$

$$\langle\delta R_i\,\delta R_j\rangle = 2\delta_{ij}\,(D_i)^2. \tag{2.23}$$

These analytical results are supported by the numerical results of Figure 9b, which shows the covariance matrix of neurons in layer 1, $C^R_{ij}(\theta) = \langle\delta R_i\,\delta R_j\rangle$, for $\theta = 0$ degree. As can be seen, the matrix $C^R_{ij}(\theta)$ is almost diagonal; the covariance between different neurons is negligible. In contrast, the covariance matrix of the input neurons, Figure 9a, contains a substantial smooth off-diagonal part that gives rise to the collective fluctuations that limit the accuracy of the LPV (see also Figure 1c). Figures 10a and 10b show the average and standard deviation of the $R_i$, respectively, as

Figure 10: Statistics of neurons in the first layer of the readout. (a) The population tuning profile. The average activity of neurons in the first layer is plotted as a function of their PDs (solid line). The population profile peaks at $\phi_i = \theta$. For comparison, $D(\phi_i)$ is also plotted (dashed line).
(b) The standard deviation of neurons in the first layer is plotted as a function of their PDs (solid line). For comparison, $\sqrt{2(D(\phi_i))^2}$ is also plotted (dashed line). The statistics were calculated for a system in model 1 with $N = 300$ and $p = 4$.

a function of their PDs, for stimulus $\theta = 0$ degree. They show good fit with the analytical predictions of equations 2.22 and 2.23.

We conclude that for sufficiently large $p$, the linear filtering of the low-order Fourier modes effectively decorrelates the activities of different neurons. The subsequent squaring of the filtered activity is required in order to extract the information embedded in the second-order statistics of the $R_i$ by a linear readout downstream (the second layer of weights in our readout). The results of equations 2.22 and 2.23 also shed light on our expression for the efficiency of the bilinear readout, equation 2.19. Comparing equations 2.19 and 1.20, one observes that the efficiency of the bilinear readout is equivalent to that of a linear population vector of independent neurons with mean responses $D(\phi_i - \theta)$ and standard deviations $\sqrt{2}\,D(\phi_i - \theta)$, in line with equations 2.22 and 2.23.

2.4.1 The FI of the First Readout Layer Population. Our bilinear readout contains first a decorrelation step, then a nonlinear operation, and finally a population vector summation. We may inquire how much better we could do if we replaced the last layer by a more complicated operation (e.g., an RN). This question can be addressed by calculating the FI, $J_R$, of the $R$ layer. To do this, we use the fact that the $R_i$ are squares of gaussian variables: $R_i = t_i^2$, $t_i = \sum_j W_{ij} r_j$. Hence, the FI of the $R_i$ is equal to the FI of the $t_i$. In the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$, the $t_i$ become independent gaussian random variables with zero mean and variance $D_i$. Applying equation 1.13, we obtain in this limit $J_R = J_D$, where $J_D$ is given by equation 2.6. A corollary of this result is that $J_{\mathrm{cov}} \ge J_D + o(N)$.
In particular, one obtains that $J_{\mathrm{cov}}$ scales (at least) linearly with $N$. In fact, as indicated above (see equation 2.6), our numerical results indicate that $J_{\mathrm{cov}} \approx J_D$, implying that the filtering out of the low-lying modes of the inputs by $R_i$ leaves most of the information intact. Nevertheless, the NLPV estimator does not saturate the FI of the system (or of the $R_i$), as its variance, equation 2.19, is larger than $1/J_{\mathrm{cov}} \approx 1/J_D$. This is due to the choice of a population vector for the second-layer weights, which extracts only the low-order components in the tuning of the $R_i$.

2.4.2 Optimal Bilinear Readout and Discrimination. For a stimulus-independent set of second-layer weights, the NLPV weights are optimal in the sense of squared error averaged over all angles, similar to the case of a linear PV. However, one can obtain optimal results with a bilinear readout appropriate for local tasks. Considering a discrimination task, as described in section 1.3, we define a bilinear discriminator that is based on the sign of $q = W(R^{(1)} - R^{(2)})$. Minimizing the noise, $\langle(\delta q)^2\rangle \approx 2WC^R W^t$, while constraining the signal, $\langle q\rangle \approx \delta\theta\, W\langle R\rangle'$, for small angular deviations, one obtains the optimal weights $W_i = D_i'/(2D_i^2)$, for $p \ll N \ll N_{\mathrm{sat}}$. Calculating the discriminability in this limit (similar to the calculation in section A.2) yields

$$d_R' = |\delta\theta|\sqrt{J_D}. \tag{2.24}$$

This result shows once again that the $R$ layer retains most of the information coded in the correlated neurons of the input layer and, furthermore, that locally this information can be extracted by further linear pooling.

2.5 Combined Population Vector. So far we have focused on extracting information from the covariance matrix. In fact, linear and nonlinear readout schemes can be combined to implement a readout mechanism that is sensitive to the different aspects of the statistics of the neural responses.
We define a combined readout as a linear combination of the LPV estimator, $\hat{z}_L$, equation 1.14, and the NLPV estimator, $\hat{z}_B$, equation 2.8,

$$\hat{z} = a_L \hat{z}_L + a_B \hat{z}_B. \tag{2.25}$$

Note that this estimator is unbiased and its accuracy is independent of $\theta$. Minimizing the quadratic error with respect to the coefficients $a_L$ and $a_B$, we obtain the optimal set of coefficients,

$$\begin{pmatrix} a_L \\ a_B \end{pmatrix} = \begin{pmatrix} \langle(\delta\hat{y}_L)^2\rangle & \langle\delta\hat{y}_L\,\delta\hat{y}_B\rangle \\ \langle\delta\hat{y}_L\,\delta\hat{y}_B\rangle & \langle(\delta\hat{y}_B)^2\rangle \end{pmatrix}^{-1} \begin{pmatrix} \langle\hat{x}_L\rangle \\ \langle\hat{x}_B\rangle \end{pmatrix}, \tag{2.26}$$

where averages are taken with respect to the probability distribution of the neural responses, $\{r_i\}$, for $\theta = 0$ degree.

Figure 11: Comparison of the efficiency of different readout schemes. Efficiency of the LPV (boxes), NLPV (circles), and the combined readout (asterisks), as defined in section 2.5, was calculated by averaging the estimation errors of these readout schemes over 4000 trials. For comparison, we show the Fisher information, $J$ (solid line), and the asymptotic limit of the NLPV, equation 2.19 (dashed line). For the NLPV, $p = 12$ was used. In this figure, model 1 was used for the input-layer neurons.

The efficiency of the combined readout is shown in Figure 11 (asterisks). As can be seen, the performance of the combined readout is superior to that of the NLPV due to the addition of the information embedded in the first-order statistics.

3 Summary and Discussion

We have investigated readout mechanisms for correlated populations of neurons that code for an angle and whose second-order statistics vary with the stimulus. Using the Fisher information, we show that the total amount of information in the system about the stimulus is extensive: it grows linearly with population size (see equation 2.6).
This is in contrast to the case of stimulus-independent correlations, where the FI saturates to a finite value due to the long-range correlations (e.g., compare $J$ and $J_{\mathrm{mean}}$ in Figure 2), as has been shown previously (Sompolinsky et al., 2001). This information is embedded in the stimulus-dependent covariance matrix, since the information in the tuning of the mean rates saturates to a finite value (see Figure 2).

We find that the main source of information in the second-order statistics of the correlated populations is the stimulus dependence of the variances and not the cross-correlations (see Figure 4). This is because the variances, being larger than the cross-correlations between nearby neurons, induce a stimulus-dependent discontinuity in the spatial structure of the covariance matrix (see Figure 9a). This stimulus sensitivity is spread across all spatial modes and hence is not significantly suppressed by the slowly varying modes of correlated noise.

The above analysis has a direct bearing on the assessment of distributed coding in cortical neuronal ensembles. Although pairs of cortical neurons often exhibit significant cross-correlations, the degree and ubiquity of the modulation of the noise cross-correlograms by changing the stimulus are unclear (Aertsen, Gerstein, Habib, & Palm, 1989; Ahissar, Ahissar, Bergman, & Vaadia, 1992; Maynard et al., 1999). On the other hand, the variance of the spike rates of cortical cells typically changes monotonically with the mean rates; hence, it is strongly modulated by the stimulus.
In addition, the variance in spike counts is in general substantially larger than the mean cross-correlations of nearby neurons (see, e.g., Lee et al., 1998; note that the condition for the variance, $b(\phi_i)$, to be larger than the correlation with nearby neurons, $c\,b_{\mathrm{ref}}$, in model 1 is equivalent to the demand that the correlation coefficient of nearby neurons, $c\,b_{\mathrm{ref}}/b(\phi_i)$, be less than one in absolute value), validating our assumption of a discontinuity in the spatial structure of the neuronal covariance matrices. If this analysis is correct, then despite the presence of significant noise cross-correlations in cortical ensembles, the response of large cortical pools contains accurate information about the coded stimuli, mainly due to the tuning of the variance of the firing rates. This suggests that experimental characterization of the tuning properties of the response variances may be as important as the traditional characterization of the tuning properties of the trial-averaged responses.

Extracting the information embedded in the response variances requires more complex readouts than the ones usually assumed in modeling of neuronal population coding. We show that linear schemes (e.g., the population vector; see Figure 3) are incapable of extracting information from second-order statistics. Their efficiency is bounded by the relatively small Fisher information of the mean rates. Here we propose a bilinear readout model that we call a nonlinear population vector. The NLPV consists of two stages of processing (see Figure 5). In the first stage, a linear filtering of the input neurons is performed, such that the slowly varying Fourier modes of the inputs are subtracted out. This high-pass filtering is followed by a quadratic nonlinearity.
We show that this stage effectively decorrelates the neuronal responses and generates a population of uncorrelated responses with stimulus-dependent means and standard deviations (see Figures 9 and 10), both of which are proportional to the variances of the input neurons (see equations 2.22 and 2.23). In the second stage, the outputs of the first stage are linearly summed, similar to a linear population vector. The resultant estimate has an efficiency that is extensive in the pool size, although it is below the Fisher information of the system (see Figure 11).

Our readout model suggests that the degree of noise correlation in cortical ensembles will exhibit substantial heterogeneity, where ensembles higher in the sensory processing pathways are decorrelated by high-pass filtering mediated by the internal (or interlaminar) synaptic patterns. In addition, the decorrelated neurons should exhibit more significant input-output nonlinearity than correlated lower-level neurons. Our readout model employs a quadratic nonlinearity. The realization of such a computation by neurons has been discussed in several previous studies (Poirazi & Mel, 2000; Schwartz & Simoncelli, 2001; Deneve et al., 1999). We have also investigated the robustness of our readout to deviations from the precise form of the quadratic nonlinearity. We have studied numerically the efficiency of a more general nonlinearity by taking

$$R_i = \Bigl|\sum_{j=1}^{N} W_{ij}\, r_j\Bigr|^{\alpha}, \tag{3.1}$$

where $\alpha$ is a parameter of the neural responses that characterizes the neuron's input-output nonlinearity. The simulations have shown that moderate deviations of $\alpha$ from 2 yield NLPV performances that are qualitatively the same as the quadratic one, as shown in Figure 12. Further reasonable altering of other parameters of the model will not change the results presented here qualitatively.
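This robustness can be probed with a quick simulation of equation 3.1 (our sketch; the synthetic model-1-like population and every parameter value are illustrative assumptions, not those used for Figure 12): replacing the square by $|t_i|^{\alpha}$ with $\alpha$ moderately different from 2 leaves the readout accurate.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, theta = 200, 6, 0.7
phi = 2 * np.pi * np.arange(N) / N
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
b = 8 + 4 * np.cos(phi - theta)                      # tuned variances
C = 0.3 * 8.0 * np.exp(-d / 1.0) + np.diag(b - 2.4)  # model-1-like covariance
Lc = np.linalg.cholesky(C)
f = 10 + 5 * np.cos(phi - theta)                     # tuned mean rates

n = np.arange(-p, p + 1)
W = np.eye(N) - np.exp(1j * n * (phi[:, None] - phi[None, :])[:, :, None]).sum(-1).real / N

T = 400
t = W @ (f[:, None] + Lc @ rng.standard_normal((N, T)))  # filtered activities
e = np.exp(1j * phi)[:, None]
rms = {}
for alpha in (1.6, 2.0, 3.2):                        # eq. 3.1 with alpha != 2
    R = np.abs(t) ** alpha
    err = np.angle(np.exp(1j * (np.angle((R * e).sum(0)) - theta)))
    rms[alpha] = np.sqrt(np.mean(err ** 2))
print(rms)                                           # rms error stays moderate
```

In this toy setting, the rms estimation error for $\alpha = 1.6$ and $\alpha = 3.2$ remains of the same order as for the quadratic readout, mirroring the qualitative claim of Figure 12.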
For example, changing the DC component of the mean response tuning curves (i.e., changing $f_{\mathrm{ref}}$ while keeping $f_{\mathrm{max}} - f_{\mathrm{ref}}$ constant) will not affect any result. Increasing the modulation of the tuning curve, $f_{\mathrm{max}} - f_{\mathrm{ref}}$, by a factor $\beta$ will increase $J_{\mathrm{mean}}$ and the efficiency of the LPV by a factor $\beta^2$; however, they will still saturate to a finite limit at the same system size as before scaling.

An interesting feature of our readout scheme is its (approximate) scale invariance. As can be seen from equation 2.19, the performance of the readout depends on the tuning of the variances but is independent of their absolute scale. Such a change in the scale of the variances may be implemented by a change in the overall time window of the observed responses. In contrast, a linear readout will be sensitive to a change in the overall time window, since that change will affect its signal-to-noise ratio. This scale invariance of the bilinear readout can be traced to the form of the Fisher information, equations 1.10 and 1.11. Changing the time window by a factor of, say, $\gamma$ is expected to change both the mean spike counts and the covariance matrix by the same factor. Hence, whereas $J_{\mathrm{mean}}$ will change by $\gamma$, $J_{\mathrm{cov}}$ will not. This suggests that for short time intervals, the NLPV is superior to the LPV even for uncorrelated populations. It should be noted, however, that this scale invariance is a property of the gaussian statistics assumed here, which is no longer a valid model for the neuronal responses at sufficiently low spike counts (i.e., at very short observation times).

Figure 12: Efficiency of the NLPV readout for small deviations from the quadratic nonlinearity (see equation 3.1), as a function of the number of neurons in the pool. From the bottom, $\alpha = 1.6$ (diamonds), $\alpha = 3.2$ (squares), and $\alpha = 2$ (circles). The asymptotic limit of the NLPV, equation 2.19, is shown for comparison (dashed line). For all plots, $p = 10$ was used. In this figure, model 1 was used for the input-layer neurons.

Finally, we have shown that linear and nonlinear readout schemes can be combined to implement a readout mechanism that is sensitive to several aspects of the statistics of the neural responses (see equation 2.25). Even in this simple example, a wisely chosen weighted sum of the linear and nonlinear PVs yields a readout that is more accurate than either the LPV or the NLPV alone (see Figure 11).

In this article, we have chosen a high-pass filter form of the linear weight matrix, $W$, based on heuristic arguments. Alternatively, one may find the optimal weights by minimizing the squared error of the readout (averaged over all angles). However, finding the optimal weights is complicated. In a previous study, we used this approach for model 2 (Shamir & Sompolinsky, 2001a, 2001b). We found that the efficiency of the optimal readout is similar to that of the high-pass NLPV with an appropriate value of $p$ (for details, see Shamir & Sompolinsky, 2001a).

Recent studies have used model neurons with nonlinear input-output functions in order to obtain an improved population readout. Deneve et al. (1999) used a recurrent network (RN) of nonlinear neurons, followed by linear population vector pooling. They show that their network yields performance superior to the LPV. In fact, they showed that for low noise levels, their network extracts all the information that is coded in the average firing rates of the neurons, $J_{\mathrm{mean}}$. Interestingly, almost all of the information that can be read by their mechanism is obtained in the first iteration of the dynamics. The first iteration step of their dynamics can be modeled according to equation 2.7 with a feedforward linear filter $W$ identical to their recurrent weight matrix.
Because the information coded in the average firing rates of the neurons resides mainly in the low-order Fourier modes of the network, their choice of filter, $W$, leaves only these few slow modes of the system and filters out all of the higher Fourier modes. Thus, their readout suffers from two major drawbacks. First, it is unable to read information coded in the variance of the responses. Second, it does not overcome the problem of the strong correlated noise in the slow Fourier modes of the system (see Figure 3). Note that their study shows how bilinear readouts can be used to read information efficiently from the first-order statistics of the neuronal responses. In the context of this work, we have studied the effect of replacing the feedforward layer $R_i$ by an RN with recurrent weights equal to our high-pass filter $W$. We have found (results not shown) that iterating the dynamics of the network beyond the first step causes a deterioration of the readout accuracy of the system rather than an improvement. On the other hand, it is expected that feeding the output of our $R_i$ into an RN with weights similar to those of Deneve et al. (1999) might yield slightly improved performance, closer to the bound given by $J_{\mathrm{cov}} \approx J_D$ (see equation 2.6).

Appendix

A.1 Derivation of the Angular Estimation Error. Let $\hat\theta = \arg(\hat{z})$ be an unbiased estimator of $\theta$. Define $\hat{z} = \hat{x} + i\hat{y}$. For small fluctuations in $\hat{x}$ and $\hat{y}$, one can expand $\hat\theta$ in $\hat{x}$ and $\hat{y}$ around their mean, obtaining

$$\langle(\delta\hat\theta)^2\rangle = \frac{1}{|\langle\hat{z}\rangle|^2}\,(-\sin\theta,\ \cos\theta)\begin{pmatrix}\langle(\delta\hat{x})^2\rangle & \langle\delta\hat{x}\,\delta\hat{y}\rangle\\ \langle\delta\hat{x}\,\delta\hat{y}\rangle & \langle(\delta\hat{y})^2\rangle\end{pmatrix}\begin{pmatrix}-\sin\theta\\ \cos\theta\end{pmatrix}, \tag{A.1}$$

where averages are taken with respect to the distribution of the neuronal responses for a given $\theta$. If the estimation error, $\langle(\delta\hat\theta)^2\rangle$, is independent of $\theta$, one can evaluate it for $\theta = 0$ degree. In this case, equation A.1 reduces to

$$\langle(\delta\hat\theta)^2\rangle = \frac{\langle(\delta\hat{y})^2\rangle}{\langle\hat{x}\rangle^2}, \tag{A.2}$$

which is in the form of a signal, $\langle\hat{x}\rangle$, to noise, $\sqrt{\langle(\delta\hat{y})^2\rangle}$, relation.

A.1.1 Optimality of the LPV. We will now show the optimality of the LPV, assuming isotropy of the neural network. Let $\hat{z} = \sum_i w_i r_i$ be an estimator of $e^{i\theta}$. Define a cost function, $E$, as the squared estimation error of $\hat{z}$ averaged over all angles:

$$E = \frac{1}{2}\int\frac{d\theta}{2\pi}\,\bigl\langle|e^{i\theta} - wr|^2\bigr\rangle. \tag{A.3}$$

Minimizing $E$ with respect to $w$, one obtains that the optimal set of linear weights, $w^{\mathrm{opt}}$, is given by

$$w^{\mathrm{opt}} = Q^{-1}u \tag{A.4}$$

$$u_k = \int\frac{d\theta}{2\pi}\, e^{i\theta} f_k(\theta) \tag{A.5}$$

$$Q_{ij} = \int\frac{d\theta}{2\pi}\,\bigl(C_{ij}(\theta) + f_i(\theta)f_j(\theta)\bigr). \tag{A.6}$$

Now, using the symmetry of the tuning curves, $f_i(\theta) = f(\phi_i - \theta)$, one obtains from equation A.5 that $u_i = f_1 e^{i\phi_i}$. Both the correlation matrix $C_{ij}$ and the outer product $f_i f_j$ are functions of the two angles $(\phi_i - \theta)$ and $(\phi_j - \theta)$. After integration over all values of $\theta$ (e.g., equation A.6), one obtains by a change of variables, $\theta' = \theta - \phi_j$, that $Q$ is a function of only one angle, $Q_{ij} = Q(\phi_i - \phi_j)$. Hence, $u$ is an eigenvector of $Q$. Thus, the optimal linear weights are $w_i \propto e^{i\phi_i}$; that is, $\hat{z}$ is the LPV.

A.2 Derivation of the Discrimination Error. One can apply the linear readout to the task of discrimination. We define the readout according to the sign of $q = w(r^{(1)} - r^{(2)})$. Minimizing the noise, $\langle(\delta q)^2\rangle = 2wCw^t$, while constraining the signal, $\langle q\rangle \approx wf'\,\delta\theta$, one obtains that the optimal linear readout weights are given by $w = C^{-1}f'$. Thus, for the optimal discriminator, $q$ is a gaussian random variable with mean $\delta\theta\, f'^{\,t} C^{-1} f' = \delta\theta\, J_{\mathrm{mean}}$ and variance $2J_{\mathrm{mean}}$. The probability of discrimination error, $q\,\delta\theta < 0$, for the optimal linear discriminator is given by $H(d_L'/\sqrt{2})$, with a discriminability

$$d_L' = |\delta\theta|\sqrt{J_{\mathrm{mean}}(\theta)}, \tag{A.7}$$

thus yielding the result of equation 1.22.

A.3 Derivation of the NLPV Noise Term. For the calculation of the NLPV noise term, it is convenient to define

$$\hat{Z} = \mathrm{Tr}\{A\,rr^t\} \tag{A.8}$$

$$A_{ij} = \sum_{k=1}^{N} W_{ik}\, e^{i\phi_k}\, W_{kj}, \tag{A.9}$$

where $rr^t$ is an $(N\times N)$ matrix of rank 1.
Using the above definition of $A$, with the notation $A = A^R + iA^I$, where $A^R$ and $A^I$ are real matrices, the signal can be written as $\langle\hat{X}\rangle = \mathrm{Tr}(A^R C) + \mathrm{Tr}(A^R ff^t)$. The noise is given by

$$\langle(\delta\hat{Y})^2\rangle = \sum_{ijkl} A^I_{ij} A^I_{kl}\, M_{ijkl} \tag{A.10}$$

$$M_{ijkl} = \langle\delta(r_i r_j)\,\delta(r_k r_l)\rangle, \tag{A.11}$$

where $\delta(r_i r_j) = r_i r_j - \langle r_i r_j\rangle$. Using the gaussianity of the $r_i$, we obtain

$$M_{ijkl} = C_{ik}C_{jl} + C_{il}C_{jk} + C_{ik}f_j f_l + C_{il}f_j f_k + C_{jk}f_i f_l + C_{jl}f_i f_k. \tag{A.12}$$

Substituting equation A.12 into equation A.10 yields

$$\langle(\delta\hat{Y})^2\rangle = 2\,\mathrm{Tr}\{(A^I C)^2\} + 4\,\mathrm{Tr}\{(A^I C)A^I ff^t\}. \tag{A.13}$$

The last term on the right-hand side of equation A.13 contains only terms involving Fourier transforms of the tuning curve, $f(\theta)$, of order $> p$. Due to the smoothness of $f(\theta)$, these terms decay faster than any power of $p$ and hence will be neglected in our analysis. Using the form of $W$, equation 2.10, the expression for $A^I$, the imaginary part of equation A.8, reduces to

$$A^I_{ij} = \frac{1}{2iN}\sum_{n > p\ \mathrm{or}\ n < -p-1}\Bigl(e^{i(n+1)\phi_i - in\phi_j} - e^{in\phi_i - i(n+1)\phi_j}\Bigr). \tag{A.14}$$

Substituting equation A.14 into equation A.13, we obtain

$$\langle(\delta\hat{Y})^2\rangle = N^2\sum_{m,n > p\ \mathrm{or}\ m,n < -p-1}\Bigl\{\tilde{C}_{m,n}\tilde{C}_{n+1,m+1} - \tilde{C}_{m+1,n}\tilde{C}_{n+1,m}\Bigr\}, \tag{A.15}$$

where $\tilde{C}_{m,n}$ is the double Fourier transform of $C$, as defined in equation 2.12. We now consider the contribution of the different parts of the correlation matrix, namely $S$ and $D$, to the noise term, equation A.15. As mentioned above, equation 2.13, the transform of $C$ is a sum of the transforms of $S$ and $D$. Thus, the products $\tilde{C}_{m,n}\tilde{C}_{t,u}$ contain three different terms: products of transforms of $D$, products of transforms of $S$, and a mixed term containing products of transforms of $D$ with transforms of $S$. Using the relation, equation 2.13, we obtain the contribution of the terms containing only $D$, $I_D$, to the noise,

$$I_D = \sum_{m,n > p\ \mathrm{or}\ m,n < -p-1}\{D_{m-n}D_{n-m} - D_{m+1-n}D_{n+1-m}\} \approx N\sum_m\{D_m D_{-m} - D_m D_{2-m}\} = N\,\frac{B_0 - B_2}{2}, \tag{A.16}$$

where we have neglected terms of order $p$ relative to $N$. $B_n$ is the $n$th Fourier transform of $B(\phi) = 2D^2(\phi)$.
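The final step of equation A.16, in which the Fourier-space sum collapses to $(B_0 - B_2)/2$ per factor of $N$, is an exact identity (Parseval plus the convolution theorem) and can be verified numerically for any smooth $D$. The cosine profile below is our own arbitrary illustration:

```python
import numpy as np

M = 512                                        # quadrature points on the circle
phi = 2 * np.pi * np.arange(M) / M
D = 5.6 + 4 * np.cos(phi)                      # an arbitrary smooth D(phi)
Dm = np.fft.fft(D) / M                         # Fourier coefficients D_n
B = 2 * D ** 2                                 # B(phi) = 2 D(phi)^2, eq. 2.18
Bm = np.fft.fft(B) / M

# sum_m (D_m D_{-m} - D_m D_{2-m}) versus (B_0 - B_2)/2
lhs = sum(Dm[m] * Dm[-m] - Dm[m] * Dm[(2 - m) % M] for m in range(M))
rhs = (Bm[0] - Bm[2]) / 2
print(lhs.real, rhs.real)
```

The first sum is $\int \frac{d\phi}{2\pi} D^2 = B_0/2$ by Parseval, and the second is the order-2 Fourier coefficient of $D^2$, that is, $B_2/2$, so the two printed values agree to floating-point precision.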
Note that this term scales linearly with $N$ and does not decay fast as $p$ grows. The contribution of the terms containing only $S$, $I_S$, is given by

$$I_S = N^2\sum_{m,n > p\ \mathrm{or}\ m,n < -p-1}\bigl\{S_{m,n}S_{n+1,m+1} - S_{m+1,n}S_{n+1,m}\bigr\}. \tag{A.17}$$

This term, $I_S$, scales like $N^2$. However, since it contains only transforms of $S$ of order $> p$, it is a strongly decaying function of $p$. Below, we focus on model 1. In model 1, the Fourier modes are eigenvectors of $S$: $S_{mn} = \delta_{mn}S(n)$, thus yielding

$$I_S = 2N^2\sum_{n>p} S(n)S(n+1), \qquad \text{model 1.} \tag{A.18}$$

For large $n$, we can approximate $S(n) \approx \frac{c\,b_{\mathrm{ref}}}{\pi\rho}\bigl(1 - (-1)^n e^{-\pi/\rho}\bigr)n^{-2}$ in model 1. Replacing the summation in the last equation by an integral, we obtain

$$I_S = \frac{2N^2}{3}\Bigl(\frac{c\,b_{\mathrm{ref}}}{\pi\rho}\Bigr)^2\bigl(1 - e^{-2\pi/\rho}\bigr)\,p^{-3}, \qquad \text{model 1.} \tag{A.19}$$

The contribution of the mixed term, $I_{SD}$, is

$$I_{SD} = 2N(D_0 - D_2)\Bigl(2\sum_{n>p} S(n) - S(p)\Bigr) \propto N p^{-1}, \qquad \text{model 1.} \tag{A.20}$$

Note that although this term yields a contribution that scales linearly with $N$, this contribution is suppressed due to the algebraic decay of $I_{SD}$ with increasing $p$. Neglecting the contribution of $I_{SD}$, we obtain that the efficiency of the NLPV is given by

$$\langle(\delta\hat\theta)^2\rangle^{-1} = NJ_0\,\frac{1}{1 + N/N_{\mathrm{sat}}} \tag{A.21}$$

$$N_{\mathrm{sat}}(p) = \lim_{N\to\infty}\frac{N I_D}{I_S} \approx \frac{B_0 - B_2}{\frac{4}{3}\bigl(\frac{c\,b_{\mathrm{ref}}}{\pi\rho}\bigr)^2\bigl(1 - e^{-2\pi/\rho}\bigr)}\,p^3, \qquad \text{model 1.} \tag{A.22}$$

The above results are qualitatively the same for model 2 as well. The algebraic decay of the transforms of $S$ results from the discontinuity of the derivatives of $S$ on the diagonal. This discontinuity causes the $n$th Fourier component to decay asymptotically like $n^{-2}$ in model 1. In model 2, the Fourier components are not eigenvectors of $S$. However, the scaling of the different contributions to the noise term, namely $I_D$, $I_S$, and $I_{SD}$, is the same as in model 1. Thus, for model 2, the relation, equation A.21, still holds with the definition

$$N_{\mathrm{sat}}(p) = \lim_{N\to\infty}\frac{N I_D}{I_S}. \tag{A.23}$$

A.3.1 The Calculation of the Moments of the $R_i$.
The $R_i$ are squares of gaussian variables, $R_i = t_i^2$, $t_i = \sum_j W_{ij} r_j$, with average

$$\langle t_i\rangle = \sum_{|n|>p} e^{in\phi_i} f_n \tag{A.24}$$

and covariance

$$\langle\delta t_i\,\delta t_j\rangle = \sum_{|m|,|n|>p} e^{-in\phi_i + im\phi_j}\,\tilde{C}_{mn}. \tag{A.25}$$

In the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$, the averages of $t$ are exponentially small in $p$ and hence will be neglected,

$$\langle t_i\rangle \approx 0, \qquad p \ll N \ll N_{\mathrm{sat}}(p). \tag{A.26}$$

Using equation 2.13, we obtain from equation A.25, in the limit of $p \ll N \ll N_{\mathrm{sat}}(p)$,

$$\langle\delta t_i\,\delta t_j\rangle = D_i\,\delta_{ij}, \qquad p \ll N \ll N_{\mathrm{sat}}(p). \tag{A.27}$$

From equations A.26 and A.27 and the gaussianity of the $t_i$, we obtain the results of equations 2.22 and 2.23.

Acknowledgments

This research is partially supported by the Israel Science Foundation, Center of Excellence Grant no. 8006/00. M.S. is supported by a scholarship from the Clore Foundation.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
Aertsen, A. M., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound-source location and movement: Activity of single neurons and interactions between adjacent neurons in the monkey auditory cortex. J. Neurophysiol., 67(1), 203–215.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12(12), 4745–4765.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nat. Neurosci., 2(8), 740–745.
Fitzpatrick, D. C., Batra, R., Stanford, T. R., & Kuwada, S. (1997). A neuronal population code for sound localization. Nature, 388(6645), 871–874.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Kay, S. M. (1993). Fundamentals of statistical signal processing. London: Prentice Hall International.
Lee, C., Rohrer, W. H., & Sparks, D. L. (1988). Population coding of saccadic eye movements by neurons in the superior colliculus. Nature, 332(6162), 357–360.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neurosci., 18(3), 1161–1170.
Mastronarde, D. N. (1983). Correlated firing of cat retinal ganglion cells. II. Responses of X- and Y-cells to single quantal events. J. Neurophysiol., 49(2), 325–349.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. J. Neurosci., 19(18), 8083–8093.
Poirazi, P., & Mel, B. W. (2000). Choice and value flexibility jointly contribute to the capacity of a subsampled quadratic classifier. Neural Comput., 12(5), 1189–1205.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4(8), 819–825.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Shamir, M., & Sompolinsky, H. (2001a). Correlation codes in neuronal networks. In D. G. Thomas, B. Suzanna, & G. Zoubin (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Shamir, M., & Sompolinsky, H. (2001b). Nonlinear population vector for correlated neurons. Abstract presented at the Society for Neuroscience's 31st Annual Meeting.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5 Pt 1), 051904.
Thomas, J. A., & Cover, T. M. (1991). Elements of information theory. New York: Wiley.
van Kan, P. L., Scobey, R. P., & Gabor, A. J. (1985).
Response covariance in cat visual cortex. Exp. Brain. Res., 60(3), 559–563. Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res., 77(2), 432–436. Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261(5124), 1055–1058. Young, M. P., & Yamane, S. (1992). Sparse population coding of faces in the inferotemporal cortex. Science, 256(5061), 1327–1331. Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143. Received August 6, 2003; accepted November 6, 2003.
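The final step of the derivation rests on a standard property of squared gaussians: if $t_i$ is zero-mean gaussian with variance $D_i$, then $R_i = t_i^2$ has mean $D_i$ and variance $2 D_i^2$ (a scaled chi-square variable with one degree of freedom). A minimal numerical sketch of this fact (illustrative only; the value of `D` and the sample size are arbitrary choices, not taken from the paper):

```python
import numpy as np

# For zero-mean gaussian t with variance D, the square R = t**2
# is a scaled chi-square variable with one degree of freedom:
#   <R> = D,   Var(R) = 2 * D**2.
rng = np.random.default_rng(0)
D = 1.5                                  # hypothetical variance of t_i
t = rng.normal(0.0, np.sqrt(D), size=1_000_000)
R = t ** 2

print(abs(R.mean() - D) < 0.02)          # sample mean close to D
print(abs(R.var() - 2 * D ** 2) < 0.1)   # sample variance close to 2*D^2
```

With a large sample, both checks print `True`; the factor of 2 in the variance is exactly what makes the error analysis of the quadratic readout differ from that of a linear one.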