Connection Science, doi: 10.1080/09540091.2016.1243655
A neural implementation of Bayesian inference based on predictive coding
M. W. Spratling
King’s College London, Department of Informatics, London, UK. [email protected]
Abstract
Predictive coding is a leading theory of cortical function that has previously been shown to explain a great
deal of neurophysiological and psychophysical data. Here it is shown that predictive coding can perform almost exact Bayesian inference when applied to computing with population codes. It is demonstrated that the
proposed algorithm, based on predictive coding, can: decode probability distributions encoded as noisy population codes; combine priors with likelihoods to calculate posteriors; perform cue integration and cue segregation;
perform function approximation; be extended to perform hierarchical inference; simultaneously represent and
reason about multiple stimuli; and perform inference with multi-modal and non-Gaussian probability distributions. Predictive coding thus provides a neural network based method for performing probabilistic computation
and provides a simple, yet comprehensive, theory of how the cerebral cortex performs Bayesian inference.
Keywords: Bayes; priors; inference; multisensory integration; function approximation; predictive coding; population coding; neural networks
1 Introduction
It is widely believed that the brain performs Bayesian inference (Chater et al., 2006; Griffiths et al., 2008; Griffiths and Tenenbaum, 2006; Kersten et al., 2004; Knill and Richards, 1996; Lee and Mumford, 2003; Rao et al.,
2002; Vilares and Kording, 2011; Yuille and Kersten, 2006). This requires the representation and manipulation
of probability distributions. A leading theory of how the brain represents probability distributions suggests that it
does so using population coding (Anderson and Van Essen, 1994; Barber et al., 2003; Beck et al., 2011; Deneve
et al., 1999, 2001; Földiák, 1993; Ganguli and Simoncelli, 2014; Jazayeri and Movshon, 2006; Latham et al.,
2003; Ma et al., 2006, 2008; Pouget et al., 2013, 2003, 2000, 1998; Sanger, 1996; Seilheimer et al., 2014; Zemel
et al., 1998). In such a code, the activity across a population of neurons represents a probability distribution:
each neuron represents a value of the random variable, and its firing rate encodes the probability associated with
that value (although the firing rate may be corrupted with noise). Together, the population of neurons provide a
discretely sampled approximation to a continuous probability density function.
There have been a number of previous, neurally-based, accounts of how the brain manipulates such population
codes to perform probabilistic inference. For example, models of how probability distributions can be encoded
by neural firing rates (Anderson and Van Essen, 1994; Barber et al., 2003; Zemel et al., 1998); models of how
priors can be encoded into the receptive fields (RFs) of neurons (Ganguli and Simoncelli, 2010, 2014; Girshick
et al., 2011; Shi and Griffiths, 2009); and models of how separate sources of sensory evidence can be combined in
a statistically optimal manner. Notable models of the latter type employ a network of integrate-and-fire neurons
(Ma et al., 2006, 2008; Seilheimer et al., 2014), or a basis function neural network with attractor dynamics (Deneve
et al., 1999, 2001; Latham et al., 2003; Pouget et al., 2003, 2000, 1998). The latter type of network can also be
used to perform function approximation with probabilistically defined variable values (Deneve et al., 2001; Pouget
et al., 2002). For details of the wide range of different models of probabilistic computation the reader is referred
to recent reviews on this topic (Ma, 2012; Ma and Jazayeri, 2014; Pouget et al., 2013; Vilares and Kording, 2011).
In this article an alternative neural network model is proposed for computing with population codes that approximate probability distributions. The motivation for developing this new algorithm was to create a single,
biologically-plausible, method capable of performing all the probabilistic inference tasks mentioned in the previous paragraph, and hence, to provide a more comprehensive neural model of probabilistic computation (as
performed by the cerebral cortex) than has previously been proposed. The proposed model succeeds in being
able to: decode probability distributions encoded as noisy population codes; combine priors with likelihoods to
calculate posteriors; perform cue integration; and perform function approximation. Furthermore, it goes beyond
the existing algorithms in being able to additionally: simultaneously represent and reason about multiple stimuli; perform inference with non-Gaussian probability distributions; perform cue integration with a non-flat prior;
perform cue segregation as well as integration; and perform hierarchical inference.
[Figure 1 diagram omitted. Panel (a) shows the input x, the error neurons e, the prediction neurons y and the reconstruction neurons r (the output), linked by weights W and V (∝ W^T). Panel (b) shows the same circuit with the input, error and reconstruction populations partitioned into sub-populations (x_a, x_b, x_c; e_a, e_b, e_c; r_a, r_b, r_c).]
Figure 1: (a) A single processing stage in the PC/BC-DIM neural network architecture. Rectangles
represent populations of neurons and arrows represent connections between those populations. The population of prediction neurons constitute a model of the input environment. Individual neurons represent
distinct causes that can underlie the input. The belief that each cause explains the current input is encoded
in the activation level, y, and is used to reconstruct the expected input given the predicted causes. This
reconstruction, r, is calculated using a linear generative model (see equation 1). Each element of the
reconstruction is compared to the corresponding element of the actual input, x, in order to calculate the
residual error, e, between the predicted input and the actual input (see equation 2). The errors are subsequently used to update the predictions (via the feedforward weights W, see equation 3) in order to make
them better able to account for the input, and hence, reduce the error at the next iteration. The weights
V are the transpose of the weights W, but are normalised so that the maximum value of each column is
unity. The inputs to a processing stage may come from the prediction neurons of this or another processing stage, or the reconstruction neurons of another processing stage, or may be external, sensory-driven,
signals. The inputs can also be a combination of any of the above. (b) When inputs come from multiple
sources, it is sometimes convenient to consider the population of error neurons to be partitioned into
sub-populations which receive these separate sources of input. As there is a one-to-one correspondence
between error neurons and reconstruction neurons, this means that the reconstruction neuron population
can be partitioned similarly.
The proposed algorithm, PC/BC-DIM, is a version of Predictive Coding (PC; Huang and Rao, 2011; Rao and
Ballard, 1999) reformulated to make it compatible with Biased Competition (BC) theories of cortical function
(Spratling, 2008a,b) and that is implemented using Divisive Input Modulation (DIM; Spratling et al., 2009) as
the method for updating error and prediction neuron activations. DIM calculates reconstruction errors using division, which is in contrast to other implementations of PC that calculate reconstruction errors using subtraction
(Spratling, 2016). The PC/BC-DIM algorithm has previously been shown to explain a large range of neurophysiological and psychophysical data including: orientation tuning, surround suppression and cross-orientation
suppression in primary visual cortex (V1; Spratling, 2010, 2011, 2012a), the learning of Gabor-like RFs in V1
(Spratling, 2012c), gain modulation as is observed, for example, when a retinal RF is modulated by eye position
(De Meyer and Spratling, 2011, 2013), contour integration (Spratling, 2013b, 2014), the modulation of neural
response due to attention (Spratling, 2008a, 2014), and the saliency of visual stimuli (Spratling, 2012b). A second
motivation for the current work was to extend the range of phenomena that can be simulated by PC/BC-DIM to
include Bayesian inference. The current work also suggests that all the diverse biophysical behaviours that can be
explained by PC/BC-DIM may have a single, probabilistic, interpretation.
2 Methods
2.1 The PC/BC-DIM Algorithm
PC/BC-DIM is a hierarchical neural network. Each level, or processing stage, in the hierarchy is implemented
using the neural circuitry illustrated in Fig. 1a. A single PC/BC-DIM processing stage consists of three separate
neural populations (see footnote a), and the behaviour of the neurons in these three populations is determined by the following
equations:
$$r = V y \qquad (1)$$
$$e = x \oslash (\epsilon_2 + r) \qquad (2)$$
$$y \leftarrow (\epsilon_1 + y) \otimes W e \qquad (3)$$
where x is a (m by 1) vector of input activations; e is a (m by 1) vector of error neuron activations; r is a (m
by 1) vector of reconstruction neuron activations; y is a (n by 1) vector of prediction neuron activations; W is
a (n by m) matrix of feedforward synaptic weight values; V is a (m by n) matrix of feedback synaptic weight
values; $\epsilon_1$ and $\epsilon_2$ are parameters; and $\oslash$ and $\otimes$ indicate element-wise division and multiplication respectively. For
all the experiments described in this paper $\epsilon_1$ and $\epsilon_2$ were given the values $1 \times 10^{-6}$ and $1 \times 10^{-4}$ respectively.
Parameter $\epsilon_1$ prevents prediction neurons becoming permanently non-responsive. It also sets each prediction
neuron’s baseline activity rate and controls the rate at which its activity increases when an input stimulus is
presented within its RF. Parameter $\epsilon_2$ prevents division-by-zero errors and determines the minimum strength that
an input is required to have in order to effect prediction neuron response. As in all previous work with PC/BC-DIM, these parameters have been given small values compared to typical values of y and x, and hence, have
negligible effects on the steady-state activity of the network. The matrix V is equal to the transpose of W, but
each column is normalised to have a maximum value of one. Hence, the feedforward and feedback weights are
simply rescaled versions of each other. Given that the V weights are fixed to the W weights there is only one set
of free parameters, W, and references to the “synaptic weights” refer to the elements of W. Here, as in previous
work with PC/BC-DIM, only non-negative weights, inputs, and activations are used.
Footnote a: Previous work with this algorithm, and with other implementations of predictive coding, has proposed that each processing stage consists
of two neural populations: error neurons and prediction neurons. The operation performed by the reconstruction neurons in the current version
of PC/BC-DIM was performed within the error neurons in previous versions.
Initially the values of y were all set to zero, although random initialisation of the prediction node activations
can also be used with little influence on the results. Equations 1, 2 and 3 were then iteratively updated with the new
values of y calculated by equation 3 substituted into equations 1 and 3 to recursively calculate the neural activations.
To perform simulations with a hierarchical model, equations 1, 2 and 3 were evaluated for each processing stage in
turn (starting from the lowest stage in the hierarchy), and this process was repeated to iteratively update the neural
activations in each processing stage at each time-step. If the input remains constant, the network activity will
converge to a steady-state. The time taken to reach a steady-state is strongly influenced by the number of synaptic
weights. For small networks, like those used in sections 3.1–3.4, five iterations are sufficient. For medium-sized
networks, like those used in section 3.5, approximately 20 iterations are sufficient. For large networks, like that
used in section 3.6, approximately 50 iterations are required. To ensure each network reached a steady-state, for
simulations on small and medium sized networks, the iterative process was terminated after 25 iterations, and
for simulations on large networks (and those for the hierarchical network used in section 3.5) 50 iterations were
performed. It is the response of the network at the time when the iterative process was terminated that is reported
in the results.
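The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `pcbc_dim` is hypothetical, every row of W is assumed to contain at least one positive weight (so that V can be normalised), and r and e are recomputed once after the final update so that the returned values are mutually consistent. The row normalisation of W in the example is also an assumption made only for illustration.

```python
import numpy as np

def pcbc_dim(W, x, n_iter=25, eps1=1e-6, eps2=1e-4):
    """Iterate equations 1-3 for a single processing stage (illustrative sketch).

    W: (n, m) non-negative feedforward weights; x: (m,) non-negative input.
    Returns prediction activations y (n,), reconstruction r (m,), errors e (m,).
    """
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    # V is the transpose of W with each column rescaled to a maximum of one.
    V = W.T / W.T.max(axis=0, keepdims=True)
    y = np.zeros(W.shape[0])        # prediction neurons start at zero
    for _ in range(n_iter):
        r = V @ y                   # equation 1: reconstruction
        e = x / (eps2 + r)          # equation 2: divisive error
        y = (eps1 + y) * (W @ e)    # equation 3: multiplicative update
    # Recompute r and e from the final y (a choice made for this sketch).
    r = V @ y
    e = x / (eps2 + r)
    return y, r, e

# Example: three prediction neurons with overlapping Gaussian RFs over 5 inputs.
centres = np.array([1.0, 2.0, 3.0])
idx = np.arange(5)
W_demo = np.exp(-0.5 * (idx[None, :] - centres[:, None]) ** 2)
W_demo /= W_demo.sum(axis=1, keepdims=True)  # assumption: rows sum to one
x_demo = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
y_demo, r_demo, e_demo = pcbc_dim(W_demo, x_demo)
```

Because every quantity is non-negative and the update is multiplicative, the activations stay non-negative throughout, which matches the constraint stated in the text.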
PC/BC-DIM is an abstract, functional, model that aims to explore the computational, rather than the biophysiological, mechanisms which underlie cortical function (Spratling, 2011). However, it is possible to speculate
about the potential biological implementation of the model. There are many different ways in which the simple
circuitry of the PC/BC-DIM model could potentially be implemented in the much more complex circuitry of the
cortex (Kok and de Lange, 2015; Spratling, 2008b, 2011, 2012b, 2013a). However, the most straightforward
explanation would equate prediction neurons with the sub-population of cortical pyramidal cells (mostly found
in cortical layers II and III) whose axon projections form the feedforward connections between cortical regions,
and equate reconstruction neurons with the sub-population of cortical pyramidal cells (mostly found in cortical
layer VI) whose axon projections form the feedback connections between cortical regions (Barbas and Rempel-Clower, 1997; Barone et al., 2000; Budd, 1998; Crick and Koch, 1998; Felleman and Van Essen, 1991; Johnson
and Burkhalter, 1997; Markov et al., 2014; Mountcastle, 1998). This is consistent with previous work showing
that the behaviour of the prediction neurons in the PC/BC-DIM model can explain the response properties of
cortical pyramidal cells in both the ventral (Spratling, 2008a, 2010, 2011, 2012a,c, 2014), and dorsal (De Meyer
and Spratling, 2011, 2013) pathways of the cortical visual system, and is also consistent with brain imaging data
(Alink et al., 2010; Egner et al., 2010; Kok and de Lange, 2015; Kok et al., 2012; Smith and Muckli, 2010;
Summerfield and Egner, 2009; Summerfield et al., 2006; Wacongne et al., 2011). It is possible to equate the
error-detecting neurons with the spiny-stellate cells in cortical layer IV, which are the major targets of cortical
feedforward connections and sensory inputs. However, it is also possible that the error-detection is performed in
the dendrites of the superficial layer pyramidal cells (Spratling and Johnson, 2003) rather than in a separate neural
population; or via synaptic depression which can produce the specific form of divisive inhibition required by the
error-neurons in the PC/BC-DIM model (Rothman et al., 2009); or that the error neurons reside in the thalamus,
individual regions of which receive connections from layer VI pyramidal cells (putative reconstruction neurons)
as well as either sensory input or input from lower cortical regions (Olsen et al., 2012; Sherman, 2016; Shipp,
2004).
The mechanisms employed by the prediction and error neurons differ from those typically used in artificial
neural networks; i.e., linear summation of inputs followed by a nonlinear activation function (Rosenblatt, 1962;
Rumelhart et al., 1986). Specifically, the prediction neurons perform a multiplication operation and the error neurons perform a division operation. However, both these nonlinear mechanisms are biologically-plausible. Neurons
that perform multiplicative (Salinas and Sejnowski, 2001; Salinas and Thier, 2000), and divisive (Carandini and
Heeger, 1994; Heeger, 1992), operations are common throughout the brain. A range of biophysical mechanisms
have been proposed to underlie both these nonlinear operations, including the interplay between linear neurons in a
network (Brozović et al., 2008; Chance and Abbott, 2000; Murphy and Miller, 2003; Reynolds and Chelazzi, 2004;
Salinas and Abbott, 1996), nonlinear dendritic integration (Gabbiani et al., 2002; Jaffe and Carnevale, 1999; Koch
and Segev, 2000; Larkum et al., 2004; London and Häusser, 2005; Mehaffey et al., 2005; Mel, 1994; Mitchell and
Silver, 2003; Phillips, 2016; Spruston, 2008; Spruston and Kath, 2004; Stuart and Häusser, 2001), and synaptic
mechanisms (Alger, 2002; Branco and Staras, 2009; Rothman et al., 2009; Sherman and Guillery, 1998).
In previous work with this algorithm, the reconstruction has been used purely as a means to calculate the errors,
and hence, equations 1 and 2 have been combined into a single equation. Thus, the underlying mathematical model
is identical to that used in previous work, but the interpretation has changed in order to consider the reconstruction
to be represented by a separate neural population. Furthermore, in the current work the reconstruction neurons
constitute the output of the model, and provide inputs to other processing stages in a hierarchical model. In
contrast, previous work with PC/BC-DIM has used the prediction neurons as the outputs of each processing stage
(Spratling, 2008a, 2012c). This is also in contrast to other versions of predictive coding (Friston, 2005; Rao and
Ballard, 1999) that have used two sources of output from each processing stage: error neurons for the feedforward
connections from lower to higher processing stages, and prediction neurons for the feedback connections.
2.2 Representing Causes and Performing Explaining Away
The values of y represent predictions of the causes underlying the inputs to the network (i.e., latent variables).
The values of r represent the expected inputs given the predicted causes. The values of e represent the residual
error between the reconstruction, r, and the actual input, x. The full range of possible causes that the network can
represent is defined by the weights, W. Each row of W (which corresponds to the weights targeting an individual
prediction neuron, or its RF) can be thought of as a “basis vector” or “elementary component” or “preferred
stimulus”, and W as a whole can be thought of as a “dictionary” or “codebook” of possible representations, or
as a model of the external environment, or as the parameters of a generative model. The activation dynamics,
described by equations 1, 2 and 3, perform gradient descent on the reconstruction error in order to find prediction
neuron activations that accurately reconstruct the input (Achler, 2014; Spratling, 2012c; Spratling et al., 2009).
Specifically, the equations operate to minimise the Kullback-Leibler (KL) divergence between the input (x) and
the reconstruction of the input (r) (Solbakken and Junge, 2011; Spratling et al., 2009). Gradient descent can also be
implemented using subtraction rather than division to calculate the reconstruction errors (Achler, 2014; Harpur,
1997) as is the case in the Rao and Ballard (1999) version of predictive coding. In this case, gradient descent
attempts to find the prediction neuron activations that minimise the sum squared residual error (Achler, 2014;
Harpur, 1997). However, the subtractive method typically converges to a solution more slowly, and the solution
is less sparse. Furthermore, the subtractive version is less biologically-plausible as it requires error neurons to be
able to have negative firing rates (see footnote b), and it also successfully simulates far less neurophysiological data (Spratling,
2008a, 2013a).
At the steady-state, the PC/BC-DIM algorithm will have selected a subset of active prediction neurons whose
RFs (which correspond to basis functions) best explain the underlying causes of the sensory input. The strength
of activation, y, reflects the strength with which each basis function is required to be present in order to accurately
reconstruct the input. This strength of response also reflects the probability with which that basis function (the
preferred stimulus of the active prediction neuron) is believed to be present, taking into account the evidence provided by the input signal and the full range of alternative explanations encoded in the RFs of the whole population
of prediction neurons.
If prediction neurons represent distinct causes such as the presence of different objects in a visual scene
(Lochmann and Deneve, 2011), or different odours in an olfactory scene (Beck et al., 2012), then each prediction
neuron’s activation represents the probability that its preferred stimulus is present in the input (Spratling, 2008b,
2012c, 2013b, 2014; Spratling et al., 2009). This is consistent with the idea (Achler and Amir, 2008; Anastasio
et al., 2000; Barlow, 1969; Deneve, 2008; Lee and Mumford, 2003; Lochmann et al., 2012) that the brain computes
with “explicit probability codes” (Ma et al., 2008). The activation dynamics of the PC/BC-DIM algorithm enable
the prediction neurons to perform a form of perceptual inference, in which evidence that supports one cause is
explained away preventing responses from prediction neurons representing other, less likely, causes (Kersten et al.,
2004; Lochmann and Deneve, 2011; Spratling, 2014). In common with other neural networks that can perform
explaining away (Beck et al., 2012; Lochmann and Deneve, 2011; Lochmann et al., 2012), PC/BC-DIM employs
divisive normalisation that targets the inputs to the network (see equation 2). The mechanism of divisive input
normalisation employed by PC/BC-DIM is, however, slightly different as prediction neurons can inhibit their own
inputs (in contrast to Lochmann and Deneve, 2011; Lochmann et al., 2012) and each prediction neuron contributes
independently to the strength of inhibition (in contrast to Beck et al., 2012).
Footnote b: It is possible to re-implement the Rao and Ballard (1999) algorithm using only non-negative firing rates (Ballard and Jehee, 2012);
however, this results in a model that is extremely complex and requires a degree of coordination between the actions of different connections
that is unlikely to be feasible in a biological system.
Rather than representing distinct causes, the prediction neurons can alternatively be used to represent possible
values of a continuous variable, such as the orientation of a visual stimulus (Spratling, 2010, 2011, 2012a,b,c,
2013b). In this case, different prediction neurons can be tuned to different values. Each prediction neuron then
signals the belief that the input stimulus takes a particular value (e.g., is at a particular orientation), while the
population of prediction neurons represent the probabilities for the range of possible values.
Previous work with PC/BC-DIM has explored the ability of the prediction neurons to identify latent causes
and has shown that the behaviour of the prediction neurons is consistent with the response properties of cortical
pyramidal cells (De Meyer and Spratling, 2011; Spratling, 2010, 2011, 2012a,c, 2014; Spratling et al., 2009). In
contrast, this article explores how the reconstruction neurons can be used to calculate probability distributions,
and hence, how PC/BC-DIM can be used to perform Bayesian inference. This interpretation of the PC/BC-DIM
algorithm will be described in the next section.
2.3 Computing with Probability Distributions
As described in the preceding section, the weights in a PC/BC-DIM network are basis functions or elementary
components that can be combined together to reconstruct the input stimulus. If the inputs to a PC/BC-DIM network
are probability distributions, then the weights need to represent the elementary components of such probability
distributions so that any specific probability distribution that is presented to the network can be reconstructed from
those elementary components. Hence, when applied to population codes that encode probability distributions,
the PC/BC-DIM model can be seen to have strong similarities to the kernel density estimate (KDE) model of
encoding and decoding probability distributions (Anderson and Van Essen, 1994; Barber et al., 2003; Zemel et al.,
1998). Specifically, in this earlier model it is proposed that a probability distribution can be reconstructed by
summing basis functions in proportion to the firing rates of the neurons associated with each basis function. This
is the operation performed by equation 1 when the columns of V are interpreted as basis functions appropriate for
representing probability distributions. One difference is that in the current model each basis function is equal (up
to a scaling factor) to the synaptic weights of that neuron, whereas in the KDE model the basis functions are not
the weights (Barber et al., 2003; Zemel et al., 1998). It is necessary to find neural firing rates (the y values in the
PC/BC-DIM model) appropriate for representing an input probability distribution in terms of basis functions. One
method for doing this for the KDE model uses the expectation-maximisation algorithm to find neural responses
that minimise the Kullback-Leibler (KL) divergence between the input distribution and the reconstructed distribution (Zemel et al., 1997, 1998). For the PC/BC-DIM algorithm, prediction neuron firing rates appropriate for
reconstructing a probability distribution are found using equations 1 to 3. The PC/BC-DIM algorithm is closely related to the particular method of performing non-negative matrix factorisation (NMF) proposed by Lee and Seung
(2001) (Solbakken and Junge, 2011; Spratling et al., 2009). This form of NMF also minimises the KL divergence.
It would also be possible to apply the Rao and Ballard (1999) version of predictive coding, using subtraction
to calculate the reconstruction errors, to find the prediction neuron activities. This would find neural responses
that minimise the least squares error between the input probability distribution and the reconstructed distribution
(Achler, 2014; Harpur, 1997). This succeeds in reproducing most of the results presented in this article, but not
all. There are also other reasons for preferring the PC/BC-DIM implementation of predictive coding, as listed in
the preceding section.
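As a hedged worked step (not taken from the source), the link between a fixed point of equation 3 and KL minimisation can be sketched as follows. It assumes the generalised KL divergence used by Lee and Seung (2001), takes $\epsilon_1, \epsilon_2 \to 0$, and assumes that each row of W is the corresponding column of V rescaled to sum to one. That normalisation is an assumption made here for illustration and is not stated in this section.

$$D(x \,\|\, r) = \sum_i \left[ x_i \log\frac{x_i}{r_i} - x_i + r_i \right], \qquad r_i = \sum_j V_{ij}\, y_j$$
$$\frac{\partial D}{\partial y_j} = \sum_i V_{ij}\left(1 - \frac{x_i}{r_i}\right) = 0 \;\Longleftrightarrow\; \sum_i \frac{V_{ij}}{\sum_k V_{kj}}\,\frac{x_i}{r_i} = 1 \;\Longleftrightarrow\; (W e)_j = 1$$

Under these assumptions, any prediction neuron with $y_j > 0$ that is left unchanged by equation 3 satisfies $(We)_j = 1$, which is the stationarity condition of the divergence with respect to $y_j$.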
If a PC/BC-DIM network reconstructs the probability distribution that is presented to its inputs, then one
useful operation that may be performed is to reconstruct a less noisy version of a corrupted input distribution.
Experiments demonstrating this ability of the PC/BC-DIM network are described in section 3.1. However, simply
reconstructing the input distribution is otherwise not very useful. For example, it does not allow computations
to be performed, such as combining a likelihood with a prior to calculate a posterior in accordance with Bayes
theorem, or combining different sources of sensory evidence in a statistically optimal way, or computing functions
of variables whose values are defined in probabilistic terms. However, as will be seen in the Results section, it is
possible for the PC/BC-DIM algorithm to perform all of these forms of probabilistic inference. This is because
PC/BC-DIM networks can be wired-up so that rather than representing the input probability distribution (the
likelihood) the reconstruction neurons represent the posterior. In the simple case, where the PC/BC-DIM network
reconstructs the input distribution, the reconstruction neurons can be interpreted as representing the posterior
when the prior is uniform, and hence, the posterior is proportional to the likelihood. In order to calculate the
posterior with a non-uniform prior, the prediction neuron RFs are scaled differently (see section 3.2). This results
in prediction neurons with large weights (representing causes with a high prior probability) being preferentially
selected by the PC/BC-DIM algorithm to represent the input distribution. As the posterior is represented by the
reconstruction neurons and is calculated as a combination of active prediction neuron RFs, the posterior will
reflect the prior. When combining two sources of sensory evidence (see section 3.3), the prediction neurons have
RFs that receive input from both sources. When there is a small cue conflict, the prediction neurons that have
RFs which overlap with both input distributions are made active by the PC/BC-DIM algorithm. These prediction
neurons produce a probability distribution at the reconstruction neurons that is intermediate between the two input
distributions. Finally, to perform function approximation (see section 3.5), a PC/BC-DIM network is wired-up so
that each prediction neuron has RFs representing the values of multiple variables. When probability distributions
representing the likelihoods of a subset of variables are presented to the network, those prediction neurons whose
RFs most closely match the given inputs are activated. Because these active prediction neurons have RFs tuned to
the missing inputs, they also reconstruct a probability distribution representing the values of these missing inputs.
When multiple population codes are used as input to a PC/BC-DIM processing stage, it is convenient to
think of the input vector being partitioned into separate sub-vectors representing the separate population codes,
as illustrated in Fig. 1b. Each partition of the input encodes the probability distribution for a different variable.
If the input is partitioned into multiple population codes, then the reconstruction neurons also represent multiple
population codes, as illustrated in Fig. 1b. In all the experiments described in section 3, one or multiple probability
distributions (represented as population codes) are provided as input to the PC/BC-DIM neural network. Each
probability distribution is encoded as follows. Imagine a probability distribution p(s|ω) for a variable s. This is a
continuous function over all possible values for s. This function can be encoded as a population code by sampling
this continuous function at a finite number of locations, i.e., at specific values of s. In all the simulations reported
here the sampling locations were equally distributed.
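The encoding step can be illustrated with a short NumPy sketch that samples a Gaussian density at equally spaced locations. The function name `gaussian_population_code`, the unit peak amplitude, and the additive Gaussian noise model are assumptions made for this example; the text above specifies only that the distribution is sampled at equally distributed values of s and that firing rates may be corrupted by noise.

```python
import numpy as np

def gaussian_population_code(mu, sigma, centres, noise_sd=0.0, rng=None):
    """Sample a Gaussian-shaped distribution at the given RF centres.

    The peak is scaled to one (arbitrary scaling, an assumption). Optional
    additive noise is clipped at zero so that activations stay non-negative.
    """
    centres = np.asarray(centres, dtype=float)
    x = np.exp(-0.5 * ((centres - mu) / sigma) ** 2)
    if noise_sd > 0:
        rng = np.random.default_rng() if rng is None else rng
        x = x + noise_sd * rng.standard_normal(centres.shape)
    return np.clip(x, 0.0, None)

# Equally spaced sampling locations for the variable s (hypothetical range).
s_centres = np.arange(-90, 91, 5)
x_clean = gaussian_population_code(mu=10, sigma=20, centres=s_centres)
x_noisy = gaussian_population_code(10, 20, s_centres, noise_sd=0.1,
                                   rng=np.random.default_rng(0))
```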
2.4 Decoding and Quantitative Assessment Methods
Once a probability distribution has been represented by a population of neurons this can be used by the brain, and
by the PC/BC-DIM model, as input to further probabilistic computations. Hence, the activity of the reconstruction
neuron population does not need to be decoded. However, decoding is useful to demonstrate the accuracy of the
model. The mean and variance are sufficient to fully characterise Gaussian probability distributions. To obtain
these parameters the standard equations for calculating the mean (µ) and variance (σ²) of a discrete probability
distribution were used:
$$\mu = \frac{\sum_i z_i s_i}{\sum_i z_i} \qquad (4)$$
$$\sigma^2 = \frac{\sum_i z_i (s_i - \mu)^2}{\sum_i z_i} \qquad (5)$$
where $z_i$ is the activation of neuron i, and $s_i$ is the RF centre (the preferred stimulus value) of neuron i. The
denominator is necessary to normalise the reconstruction neuron responses (which can have arbitrary scaling)
to form a valid probability distribution. Equation 4 is equivalent to the standard method of population vector
decoding proposed by Georgopoulos et al. (1986).
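As a hedged illustration of equations 4 and 5, the sketch below decodes the mean and variance from a vector of activations. The function name, the 5° grid, and the arbitrary scale factor of 7 applied to the activations are assumptions made for the example; the scale factor is there only to show that the denominator removes it.

```python
import math

def decode_mean_variance(z, s):
    """Mean (equation 4) and variance (equation 5) of a population code."""
    total = sum(z)  # normalising denominator
    mu = sum(zi * si for zi, si in zip(z, s)) / total
    var = sum(zi * (si - mu) ** 2 for zi, si in zip(z, s)) / total
    return mu, var

# Illustrative code: a Gaussian with mean 30 and standard deviation 20,
# sampled every 5 degrees and multiplied by an arbitrary scale factor.
s = [-180.0 + 5.0 * k for k in range(73)]
z = [7.0 * math.exp(-0.5 * ((si - 30.0) / 20.0) ** 2) for si in s]
mu, var = decode_mean_variance(z, s)
```

Here `mu` and `var` recover the mean and variance of the sampled Gaussian regardless of the overall scaling of `z`.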
In section 3.6, the calculations were performed with a variable (orientation) that wraps around. In this case the
mean (in degrees) was calculated as:

$$\mu = \frac{180}{2\pi}\,\mathrm{phase}\!\left(\frac{\sum_i z_i \exp\!\left(\sqrt{-1}\,\frac{2\pi}{180}\,s_i\right)}{\sum_i z_i}\right) \tag{6}$$
The experiments preceding section 3.6 can also be performed using variables with periodic boundary conditions
with negligible effects on the results.
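The circular mean of equation 6 can be sketched as below, assuming a variable with a period of 180° (as for orientation). The helper `wrapped_difference`, the 5° grid, and the example bump centred at 85° are assumptions introduced for illustration. The comparison with the linear mean of equation 4 shows why a wrap-aware decoder is needed when the code straddles the boundary.

```python
import cmath
import math

def decode_circular_mean(z, s, period=180.0):
    """Circular mean (equation 6) for a variable that wraps with `period`."""
    k = 2.0 * math.pi / period  # maps the period onto a full circle
    resultant = sum(zi * cmath.exp(1j * k * si) for zi, si in zip(z, s)) / sum(z)
    return cmath.phase(resultant) / k

def wrapped_difference(a, b, period=180.0):
    """Signed difference a - b wrapped into [-period/2, period/2)."""
    return (a - b + period / 2.0) % period - period / 2.0

# Illustrative code: a bump centred at 85 degrees that wraps around the
# boundary of the range [-90, 90).
s = [-90.0 + 5.0 * k for k in range(36)]
z = [math.exp(-0.5 * (wrapped_difference(si, 85.0) / 15.0) ** 2) for si in s]

circular = decode_circular_mean(z, s)
linear = sum(zi * si for zi, si in zip(z, s)) / sum(z)  # equation 4, for contrast
```

In this example the circular estimate stays at the centre of the bump, whereas the linear estimate is pulled far towards the middle of the range.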
To decode the reconstruction neuron responses, the vector of neural activations, $\mathbf{z}$, was set equal to $\mathbf{r}^c$ in the
above equations, where $c$ is an integer equal to the number of cues to the same sensory stimulus. Hence, in
experiments where cue integration or segregation was performed with two cues (Figs. 6, 8, 9, 11c-d, 14d-g,
and 15) the squared response of the reconstruction neurons was used for decoding, i.e., $\mathbf{z} = \mathbf{r}^2$ was used in
equations 4–6. In the experiment on cue integration with three cues (Fig. 7) the cubed reconstruction neuron
responses were used, i.e., $\mathbf{z} = \mathbf{r}^3$. In all other cases $\mathbf{z} = \mathbf{r}$ was used.
Raising the responses to the power c when decoding the posterior in cue integration tasks is necessary in order to produce
an accurate estimate of the variance of the posterior distribution in those tasks (otherwise the variance would be
over-estimated). In the brain, downstream neurons performing further probabilistic computations would need to
know c in order to be able to raise the reconstruction neuron responses to the correct power. It is fairly easy to
imagine additional neural circuitry that could calculate c, but the need to raise the reconstruction neuron responses
to different powers for different computations is a limitation of the current model. However, it is a fairly minor
limitation compared to previous work which has either proposed completely different algorithms for performing
cue integration (Ma et al., 2006) and function approximation (Beck et al., 2011), or which has failed to compute
the variance of the posterior at all (Deneve et al., 2001; Pouget et al., 2003).
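The effect of the exponent on the decoded variance can be made explicit with a short worked equation. This is a sketch under the assumption that the reconstruction neuron responses have an approximately Gaussian profile; the symbols $\mu_r$ and $\sigma_r^2$ (the mean and variance of that profile) are introduced here for illustration only.

```latex
r(s) \propto \exp\!\left(-\frac{(s-\mu_r)^2}{2\sigma_r^2}\right)
\quad\Longrightarrow\quad
r(s)^c \propto \exp\!\left(-\frac{c\,(s-\mu_r)^2}{2\sigma_r^2}\right)
       = \exp\!\left(-\frac{(s-\mu_r)^2}{2\,(\sigma_r^2/c)}\right)
```

Under this assumption, raising the responses to the power $c$ leaves the decoded mean unchanged and divides the decoded variance by $c$, which is consistent with the statement that the variance would otherwise be over-estimated.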
2.5 Code
Open-source software, written in MATLAB, which performs the experiments described in this article is available
from: http://www.corinet.org/mike/Code/pcbc_prob.zip.
3 Results
The results of example simulations are presented in a standard format like that shown in Fig. 2b. In these figures the lower histogram shows the input to the PC/BC-DIM network which is a population code describing the
input probability distribution (or likelihood), p(s|ω). The length of each bar (indicated on the y-axis) represents
the probability p(s_i|ω) at a specific value of the variable s (indicated by the labels on the x-axis). The middle
histogram shows the responses of the prediction neurons. The y-axis is in arbitrary units representing firing rate
and the x-axis is labelled with neuron number. The upper histogram shows the responses of the reconstruction
neurons. The length of each bar (indicated on the y-axis) represents the firing rate of the neuron, and hence, the
value of the posterior probability distribution at a specific value of s (indicated by the labels on the x-axis). In
most simulations the values of s are measured in units of degrees. This is simply to give these values concrete
units and does not mean that PC/BC-DIM is limited to computing with variables measured in degrees. For each
experiment the weights of the PC/BC-DIM network have been set by trial and error to produce good results on
that task.
3.1 Decoding Noisy Population Coded Probability Distributions
One important issue when dealing with population codes that encode probability distributions is how to deal with
random fluctuations in the neural firing rates: how to accurately estimate the probability distribution despite the
samples being unreliable due to corruption by noise. A decoding method is thus required that can combine noisy
input samples to calculate an estimate of the underlying probability distribution. Previous work has shown that,
when the probability distribution is Gaussian, attractor neural networks can be used to convert a noisy population
code into a smooth Gaussian centred close to the maximum likelihood value (Deneve et al., 1999; Latham et al.,
2003; Pouget et al., 2000, 1998). Such decoding would allow the statistically optimum estimate to be easily read
off as the peak of the output distribution.
The PC/BC-DIM network can also perform near optimum decoding of a noisy, Gaussian, probability distribution. While PC/BC-DIM is not limited to dealing with Gaussian probability distributions, we consider 1D
Gaussian probability distributions as this allows direct comparison with previous work as well as with the results
that would be expected from exact Bayesian inference. The prediction neurons are given Gaussian RFs (all with
standard deviation 10°) covering the range of possible values (means distributed uniformly in the range −180°
to 180°), as shown in Fig. 2a. The PC/BC-DIM network reconstructs the input as a linear combination of basis
functions (the prediction neuron RFs, see Methods). In this case, the input probability distribution is reconstructed
as a combination of Gaussians. This fitting of a set of Gaussians to the data can be seen as a form of kernel density estimation of the input distribution. When the input is a noisy Gaussian population code, the reconstruction
is a smooth Gaussian, as shown for two specific examples in Fig. 2b and c. The smoothing effect results from
each Gaussian RF receiving input from a number of samples of the input distribution, which means that noise is
averaged out. The accurate reconstruction of the input distribution results from the PC/BC-DIM algorithm minimising the KL divergence between the reconstruction and the input. To confirm that, in general, the network’s estimate of the
probability distribution is close to the statistically optimal estimate, experiments were performed using
Gaussian input distributions with random mean values chosen uniformly from the range [−90° : 90°] and random standard deviations chosen uniformly from the range [15° : 45°]. Each input distribution was corrupted using
Poisson noise, which is commonly used to simulate noise in biological neurons. To do so, each input activation was
a sample taken from a Poisson distribution whose mean was the noise-free value of that input. Figs. 2d and e show
plots of the network’s estimate of the mean and variance of the probability distribution (given by the reconstruction
neurons) compared to the optimal estimates of these parameters (calculated from the input distribution) for 100
trials. It can be seen that both the mean and variance are very accurately estimated by the PC/BC-DIM network
in all 100 trials. To quantify the decoding accuracy the absolute difference between the mean of the probability
distribution given by PC/BC-DIM and the statistically optimal value was calculated for each of 100 000 trials.
The maximum absolute difference was 0.36°, the median absolute difference was 0.002° and the mean absolute
difference was 0.014°. The percentage absolute difference between the network’s estimate of the variance and the
statistically optimal value, over the same 100 000 trials, had a maximum of 1.8%, a median of 0.10%, and a mean
of 0.18%.

[Figure 2: panels (a)-(e); axes of (d): Optimal Estimate of Mean vs. Network Estimate of Mean; axes of (e): Optimal Estimate of σ² vs. Network Estimate of σ².]
Figure 2: Decoding noisy population coded probability distributions. (a) The uniform population of
Gaussian RFs used for the W weights in the simulations reported in (b)-(e), and Fig. 4. Note that for
clarity the RFs of every other neuron are shown at a finer sampling rate (1°) than has been used to define the weights used in the simulations (5°). (b) and (c) Example simulation results. Each example
shows an input population code representing a Gaussian probability distribution (bottom histograms)
with mean 93° that has been corrupted by Poisson noise. The standard deviations of the original distributions are (b) 20°, and (c) 30°. The middle histograms show the prediction neuron activations, and the
upper histograms show the reconstruction neuron responses. The numbers above each histogram show
the maximum likelihood estimate of the stimulus value calculated from that histogram (equation 4). Note
that if the input population code was normalised to sum to unity this would result in the population code
generated by the reconstruction neurons also summing to one. (d) Each dot shows a comparison between
the PC/BC-DIM network’s estimate of the mean of the probability distribution (given by the reconstruction neurons) and the statistically optimal estimate of the mean of the input distribution. 100 experiments
were performed using input distributions with randomly chosen means and standard deviations that were
encoded with noisy population codes. (e) As for (d) but for variance.
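The reconstruction of an input population code as a combination of Gaussian basis functions can be sketched as follows. This is not the released MATLAB code. The update rule is an assumption: it follows the general divisive-input-modulation form (error equals input divided by reconstruction, followed by a multiplicative update of the prediction neurons), and the exact equations, the values of `eps1` and `eps2`, the number of iterations, the RF spacing, and the normalisation of the rows of W to sum to one are all illustrative choices that may differ from those defined in the Methods. The input is noise-free for reproducibility.

```python
import math

def gaussian(s, centre, sd):
    return math.exp(-0.5 * ((s - centre) / sd) ** 2)

def build_weights(s_values, centres, rf_sd):
    """Gaussian RFs. Rows of W sum to one (assumed); rows of Vt, i.e. the
    columns of V, are the same RFs rescaled to a maximum of one."""
    W, Vt = [], []
    for c in centres:
        rf = [gaussian(s, c, rf_sd) for s in s_values]
        total, peak = sum(rf), max(rf)
        W.append([v / total for v in rf])
        Vt.append([v / peak for v in rf])
    return W, Vt

def reconstruct(y, Vt, n_in):
    """r = V y: a linear combination of the basis functions."""
    return [sum(Vt[j][i] * y[j] for j in range(len(y))) for i in range(n_in)]

def pcbc_dim(x, W, Vt, iterations=100, eps1=1e-6, eps2=1e-4):
    """Assumed divisive-input-modulation iteration (illustrative only)."""
    n_in = len(x)
    y = [0.0] * len(W)
    for _ in range(iterations):
        r = reconstruct(y, Vt, n_in)
        e = [x[i] / (eps2 + r[i]) for i in range(n_in)]        # error neurons
        y = [(eps1 + y[j]) * sum(w * ei for w, ei in zip(W[j], e))
             for j in range(len(W))]                           # prediction neurons
    return y, reconstruct(y, Vt, n_in)

def decode_mean_variance(z, s):
    total = sum(z)
    mu = sum(zi * si for zi, si in zip(z, s)) / total
    return mu, sum(zi * (si - mu) ** 2 for zi, si in zip(z, s)) / total

s_values = [-180.0 + 5.0 * k for k in range(73)]    # input sampling (assumed)
centres = [-180.0 + 10.0 * k for k in range(37)]    # RF centres (assumed)
W, Vt = build_weights(s_values, centres, rf_sd=10.0)
x = [gaussian(s, 30.0, 20.0) for s in s_values]     # noise-free input code
y, r = pcbc_dim(x, W, Vt)
mu, var = decode_mean_variance(r, s_values)
```

With these assumed settings the decoded mean and variance of `r` should be close to those of the input (30 and 400), because the input is wider than the basis functions and can therefore be represented by them.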
Note that the network has been used to estimate the parameters of a single noisy probability distribution (i.e.,
the same corrupted input was presented to the network during all iterations of the PC/BC-DIM algorithm). A
more biologically valid approach would model the variability of neural responses in the inputs and within the
PC/BC-DIM network. Doing this leads to a reduction in the accuracy in the estimate of the posterior probability
distribution. Specifically, the median absolute difference between the network’s estimate of the mean and the
optimal estimate of the mean rises to 2.46° if the noise on the input changes each iteration, and to 2.80° if,
additionally, Poisson noise is added to the prediction neuron responses at each iteration (the corresponding errors
in the estimates of the variance are 11.6% and 55.3%). To improve the performance in these circumstances it would
be necessary to modify the PC/BC-DIM algorithm to update neural responses slowly, and hence, to estimate mean
firing rates.
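Poisson corruption of a population code, as used in these experiments, can be sketched with the standard library alone. The sampler below uses Knuth's multiplication method, which is suitable only for modest mean values. The `gain` parameter, which converts code values into expected counts, is a hypothetical addition for illustration; the text states only that each activation is drawn from a Poisson distribution whose mean is the noise-free value.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson sample with mean `lam` (Knuth's method)."""
    if lam <= 0.0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def corrupt_population_code(x, gain, rng):
    """Replace each activation by a Poisson sample with the same mean.

    `gain` (hypothetical) scales activations to expected counts; the result
    is divided by `gain` so that it stays on the original scale.
    """
    return [poisson_sample(gain * xi, rng) / gain for xi in x]

rng = random.Random(1)
samples = [poisson_sample(4.0, rng) for _ in range(20000)]
noisy = corrupt_population_code([0.0, 0.5, 1.0], gain=20.0, rng=rng)
```

Calling `corrupt_population_code` afresh at every iteration would correspond to the case in which the input noise changes on each iteration, whereas calling it once corresponds to a single fixed corrupted input.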
Zemel et al. (1998) performed an experiment to show that the KDE model (Anderson and Van Essen, 1994;
Barber et al., 2003) is incapable of reconstructing probability distributions which are narrower than the basis functions. In common with the KDE model, PC/BC-DIM also reconstructs the probability distribution as a linear sum
of basis functions, and thus, it also suffers from this limitation. To demonstrate this, PC/BC-DIM was tested using
the experiment described in Zemel et al. (1998). In this experiment, the prediction neurons had Gaussian RFs

[Figure 3: panels (a)-(h); axes of (c) and (g): Width (σ) vs. Reconstruction Error; axes of (d) and (h): Amplitude vs. Reconstruction Error.]
Figure 3: Effects of width and height when decoding population coded probability distributions. Results
are for prediction neurons with Gaussian RFs that have a standard deviation of 0.3 (top row), and 0.08
(bottom row). (a) and (e) The input distribution has a standard deviation of 1. (b) and (f) The input
distribution has a standard deviation of 0.2. (c) and (g) The sum of the squared difference between x
and r as a function of the width of the input distribution. (d) and (h) The sum of the squared difference
between x and r as a function of the amplitude of the input distribution.

with a standard deviation of 0.3 uniformly spaced in the range [−10 : 10]. As in the preceding experiments, when
the input distribution was wider than the RFs the reconstruction was accurate (Fig. 3a). However, when the input
distribution was narrower than the prediction neuron RFs the distribution encoded by the reconstruction neurons
was too wide (Fig. 3b). To quantify this effect, Zemel et al. (1998) performed experiments with noisy population
codes of different widths and calculated the sum of the squared difference between the reconstruction and the
true (uncorrupted) distribution. For the PC/BC-DIM algorithm, this reconstruction error was large for narrow
input distributions (Fig. 3c), due to the inability of the PC/BC-DIM network to accurately represent distributions
narrower than the prediction neuron RFs. The prediction neuron weights are basis functions or elementary components that can be used to reconstruct a probability distribution. Clearly, these components need to be appropriate
for a given task, or the reconstruction will be poor. Hence, to represent narrow distributions a network would need
prediction neurons with narrow RFs. Repeating the preceding experiment using RFs with a standard deviation of
0.08 did produce more accurate reconstructions (Fig. 3e-g). However, this is at odds with biological data which shows that
discrimination is more finely tuned than the RFs of cortical neurons (Zemel et al., 1998).
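The width limitation discussed above can be stated as a short worked equation. This is a sketch under simplifying assumptions that are not taken from the article: s is treated as continuous, all RFs are Gaussians of equal width $\sigma_{RF}$ and equal height centred at $c_j$, and the prediction neuron activations $y_j$ are non-negative. The symbols $\hat{y}_j$, $\mu_r$ and $\sigma_r^2$ are introduced here for illustration.

```latex
r(s) = \sum_j y_j \exp\!\left(-\frac{(s-c_j)^2}{2\sigma_{RF}^2}\right),
\qquad
\hat{y}_j = \frac{y_j}{\sum_k y_k},
\qquad
\mu_r = \sum_j \hat{y}_j\, c_j

\sigma_r^2 = \sigma_{RF}^2 + \sum_j \hat{y}_j\,(c_j - \mu_r)^2 \;\ge\; \sigma_{RF}^2
```

Under these assumptions the reconstruction is a mixture of equal-width Gaussians, so its variance cannot fall below that of a single RF. For example, with $\sigma_{RF} = 0.3$ an input with a standard deviation of 0.2 cannot be reconstructed with a standard deviation smaller than 0.3.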
Another issue explored in Zemel et al. (1998) is the inability of the KDE model to accurately represent probability distributions with small amplitude. To assess this Zemel et al. (1998) performed experiments with Gaussian
input distributions of varying height, and they calculated the sum of the squared difference between the input
distribution and the one reconstructed by the KDE algorithm. Results for the same experiments performed with
PC/BC-DIM are shown in Figs 3d and h. It can be seen from these results that the PC/BC-DIM algorithm is
capable of very accurately representing probability distributions regardless of their amplitude.
Some previous neural models of decoding (Deneve et al., 1999; Latham et al., 2003; Pouget et al., 2000,
1998) have used attractor networks that can only generate a single mono-modal Gaussian distribution, and hence,
cannot simultaneously represent multiple, distinct, stimuli (Sahani and Dayan, 2003). Unlike these previous
algorithms, PC/BC-DIM is not limited to representing a mono-modal distribution, as illustrated in Fig. 4. When
encoding probability distributions using population codes there is an inherent ambiguity between a complex distribution that represents uncertainty about a single cause, and a complex distribution that represents multiple separate
causes (Sahani and Dayan, 2003). This work does not address this issue, but multi-modal distributions are treated
throughout as representations of multiple stimuli.

[Figure 4: panels (a) and (b), each showing histograms labelled x, y and r.]
Figure 4: Decoding multi-modal population coded probability distributions. Each example shows a tri-modal input distribution (bottom histograms) with Poisson noise (a), and without noise (b). The middle
histograms show the prediction neuron activations, and the upper histograms show the reconstruction
neuron responses. The network was identical to that used to produce the results shown in Fig. 2.
3.2 Combining Likelihoods and Priors to Calculate Posterior Probability Distributions
The defining characteristic of Bayesian inference is that it makes use of prior information. Specifically, Bayes
theorem expresses how the likelihood (the probability distribution derived from the current observation) should
be combined with the prior (the probability derived from our knowledge about the state of nature) in order to
calculate the posterior distribution. Surprisingly, many previous neural implementations of Bayesian inference
with population codes have ignored priors, or equivalently have assumed that the prior is flat, and have hence
failed to demonstrate an ability to perform Bayesian inference (Deneve et al., 1999, 2001; Jazayeri and Movshon,
2006; Latham et al., 2003; Pouget et al., 2013, 2003, 2000, 1998; Seilheimer et al., 2014; Zemel et al., 1998).
Other theories propose that priors are represented by spontaneous activity (Fiser et al., 2010) or by a separate
population of neurons whose activity is added to the activity of the population of neurons representing the likelihood (Ma et al., 2006). Representing the prior using the activity of a separate population of neurons would also
enable the prior to be combined with the likelihood in the same way that two sensory cues can be integrated (see
section 3.3). Alternatively, it has been proposed that the prior can be encoded by the distribution of RFs, such that
more neurons, typically with narrower tuning widths, are allocated to representing stimulus values that are more
probable (Ganguli and Simoncelli, 2010, 2014; Girshick et al., 2011; Shi and Griffiths, 2009). Storing priors in
the synaptic weights makes intuitive sense as priors result from previous experience and should change relatively
slowly (Vilares and Kording, 2011). The current model also incorporates the prior into the synaptic weights of the
network, however, this is done by scaling the weights of uniformly distributed RFs, rather than by changing the
distribution of the RFs.
Figure 5 shows a specific example of how PC/BC-DIM succeeds in calculating a posterior distribution (represented by the reconstruction neuron activations) by combining a likelihood (represented by the input population
code) and a prior (incorporated in the synaptic weights of the network). The network is identical to that used to
generate the results shown in Fig. 2, except that in this experiment the weights have been changed as shown in
Fig. 5a. The prior probability distribution is a Gaussian centred at 0° and with a standard deviation of 60°. Each
neuron’s weights (the rows of W) have simply been multiplied by this prior distribution (V is set equal to the
transpose of W and re-normalised so that each column has a maximum value of one, as in all other experiments).
For the two examples shown in Fig. 5b and c the firing rates of the reconstruction neurons provide an almost exact
approximation to the posterior that would be calculated via Bayes theorem. For an intuitive understanding of why
this happens consider the result in Fig. 5c. In this example, the two prediction neurons with the highest responses
have RFs centred at 80° and 70°. These RFs are less similar to the input distribution than the neuron with an RF
centred at 90°. However, the neurons with RFs centred at 80° and 70° have weights that are larger in magnitude.
The product of the weight vector and the input population code is thus greater for the prediction neurons with RFs
centred at 80° and 70° than it is for the prediction neuron with an RF centred at 90°. This results in the higher
firing rates of these two neurons. As the posterior (encoded by the reconstruction neuron responses) is a linear
combination of basis functions (RFs) weighted by the corresponding prediction neuron firing rates, the posterior
peaks between 80° and 70°, and is thus shifted towards the prior.

[Figure 5: panels (a)-(e); axes of (d): Optimal Estimate of Mean vs. Network Estimate of Mean; axes of (e): Optimal Estimate of σ² vs. Network Estimate of σ².]
Figure 5: Bayesian inference with a prior. (a) RFs incorporating a prior as used for the W weights
in the simulations reported in (b)-(e). Note that for clarity the RFs of every other neuron are shown at
a finer sampling rate (1°) than has been used to define the weights used in the simulations (5°). (b)
and (c) Example simulation results. Each example shows a Gaussian input distribution representing the
likelihood (bottom histograms), and the response of the reconstruction neurons which represent the posterior (upper histograms). The numbers above each histogram show the maximum likelihood estimate
of the stimulus value calculated from that histogram (equation 4). Due to the prior being centred at 0°
the posterior probability distributions in these experiments are shifted towards 0° compared to the corresponding experiments shown in Fig. 2. The Gaussian curves superimposed on the upper histograms show
the posterior distribution calculated via exact Bayesian inference (scaled in amplitude to fit the network’s
estimate). (d) Each dot shows a comparison between the PC/BC-DIM network’s estimate of the mean of
the posterior probability distribution (given by the reconstruction neurons) and the mean of the posterior
calculated via exact Bayesian inference. 100 experiments were performed using input distributions with
randomly chosen means and standard deviations that were encoded with noisy population codes. (e) As
for (d) but for variance.
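For reference, the exact posterior against which the network is compared can be computed in closed form when both the likelihood and the prior are Gaussian. The sketch below uses the standard precision-weighted formulas; the function name is hypothetical, and the likelihood mean of 93° with standard deviations of 20° and 30° are illustrative values assumed to mirror the examples, combined with the prior stated in the text (mean 0°, standard deviation 60°).

```python
def gaussian_posterior(like_mean, like_sd, prior_mean, prior_sd):
    """Exact posterior mean and variance for a Gaussian likelihood and prior."""
    like_precision = 1.0 / like_sd ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (like_precision + prior_precision)
    post_mean = post_var * (like_precision * like_mean
                            + prior_precision * prior_mean)
    return post_mean, post_var

# Prior from the text: mean 0, standard deviation 60 (degrees).
narrow = gaussian_posterior(93.0, 20.0, 0.0, 60.0)  # mean 83.7, variance 360
wide = gaussian_posterior(93.0, 30.0, 0.0, 60.0)    # mean 74.4, variance 720
```

The less reliable (wider) likelihood is pulled further towards the prior mean, which is the shift towards 0° described above.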
For clarity, Figs. 5b and c show results for calculations performed using likelihoods that have not been corrupted with noise. However, PC/BC-DIM performs almost exact Bayesian inference even when the input distributions are noisy. To confirm that the network’s estimate of the posterior probability distribution is close to the
statistically optimum estimate, 100 000 trials were performed using likelihood distributions corrupted using Poisson noise. In each trial the likelihood was given a random mean, chosen uniformly from the range [−90° : 90°],
and a random standard deviation chosen uniformly from the range [15° : 45°]. Over 100 000 trials, the absolute
difference between the optimal estimate of the mean of the posterior distribution (calculated by Bayes theorem)
and the network estimate of the mean (from the PC/BC-DIM reconstruction neuron responses) had a maximum
value of 1.23°. The median absolute difference was 0.07°, and the mean absolute difference was 0.11°. Over the
same 100 000 trials, the variance of the posterior distribution given by the network and by Bayes theorem had a
maximum, median, and mean, percentage absolute difference of 15.9%, 0.98%, and 1.35% respectively. Figs. 5d
and e show plots of the network’s estimate of the mean and variance of the probability distribution compared to
the optimal estimates of these parameters for 100 trials.

[Figure 6: panels (a)-(d); axes of (c): Optimal Estimate of Mean vs. Network Estimate of Mean; axes of (d): Optimal Estimate of σ² vs. Network Estimate of σ².]
Figure 6: Cue integration. (a) and (b) Example simulation results. Each example shows two Gaussian
input distributions (bottom histograms), representing two cues to the same sensory stimulus. The network produces, in the reconstruction neuron activations (upper histograms), probability distributions that
closely match those that would be obtained by the statistically optimal combination of the two cues. The
Gaussian curves superimposed on the upper-right histograms show the posterior distribution calculated
via exact Bayesian inference (scaled in amplitude to fit, approximately, the network’s estimate). The
numbers above each histogram show the maximum likelihood estimate of the stimulus value calculated
from that histogram (equation 4). (c) and (d) Show cue integration accuracy for noisy population coded
probability distributions. (c) Each dot shows a comparison of the PC/BC-DIM network’s estimate of the
mean of the combined probability distribution compared to the probabilistically optimal estimate of the
mean. 100 experiments were performed with randomly chosen cue conflicts and cue precisions. As in
Ma et al. (2006), each dot represents the average of 1008 trials performed with different noisy population
codes. (d) As for (c) but for variance.

3.3 Cue Integration
In many circumstances multiple sources of information may be available about the same sensory stimulus. Cue
integration results in these separate sources of sensory evidence being combined together to produce a single
estimate of the stimulus’ properties. The distinct cues may be derived from the same sensory modality, such as
when estimating depth from multiple visual cues like disparity and linear perspective, or may come from different
modalities, such as when estimating depth using vision and proprioception. Various experiments have shown that
human performance in cue integration tasks is optimal, which requires the reliability of each sensory cue to be
taken into account (Seilheimer et al., 2014). For cues that can be represented by Gaussian probability distributions,
and assuming a flat prior, the mean of the combined estimate is the average of the means of the two cues, each weighted
by its precision (the inverse of its variance) (Ernst and Jäkel, 2003; Ma et al., 2006; Ma and Pouget,
2008; Pouget et al., 2013).
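The precision-weighted combination described above can be sketched for any number of Gaussian cues under a flat prior. The function name and the example cue values are assumptions chosen for illustration.

```python
def combine_gaussian_cues(cues):
    """Optimal combination of Gaussian cues under a flat prior.

    `cues` is a list of (mean, standard deviation) pairs. Returns the mean
    and variance of the combined estimate.
    """
    precisions = [1.0 / sd ** 2 for _, sd in cues]
    total = sum(precisions)
    mean = sum(p * m for p, (m, _) in zip(precisions, cues)) / total
    return mean, 1.0 / total  # combined precision is the sum of precisions

equal = combine_gaussian_cues([(-20.0, 20.0), (-10.0, 20.0)])
unequal = combine_gaussian_cues([(0.0, 10.0), (10.0, 20.0)])
three = combine_gaussian_cues([(-20.0, 20.0), (-10.0, 20.0), (-15.0, 20.0)])
```

With equally reliable cues the combined mean lies midway between them, whereas with unequal reliabilities it lies closer to the more precise cue; in both cases the combined variance is smaller than that of either cue alone.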
Figure 6 illustrates that optimal cue integration can be performed using PC/BC-DIM. In these experiments the
input was partitioned into two in order to represent the two cues. For convenience both cues were measured in
the same units and could take values over the same range, but this is not a requirement. Each prediction neuron
had a Gaussian RF (with standard deviation 15°) centred at the same location in each input space. The population
of prediction neurons had RFs covering the range of possible values (means distributed uniformly in the range
−180° to 180°). When population codes representing Gaussian probability distributions were presented to the
two input spaces, the PC/BC-DIM network generated population codes (the reconstruction neuron responses)
that peaked very near to the optimal estimate obtained by probabilistically combining the two cues. Because
there are two partitions of the input, there are also two partitions of the reconstruction neuron populations. Both
represent the combined estimate of the stimulus values, and hence, both generate the same population code. For an
intuitive understanding of how PC/BC-DIM performs cue integration consider the result shown in Fig. 6a. Here,
the most responsive prediction neuron has an RF centred at −15° in both input spaces. This is the most active
prediction neuron because both its RFs overlap with the input distributions; it thus receives the most support. The
reconstruction neuron responses are a linear combination of all the active prediction neuron RFs, and hence, also
peak at −15° in both input spaces.

[Figure 7: panels (a)-(d); axes of (c): Optimal Estimate of Mean vs. Network Estimate of Mean; axes of (d): Optimal Estimate of σ² vs. Network Estimate of σ².]
Figure 7: Cue integration with three cues. The format of this figure is identical to, and described in
the caption of, Fig. 6, except that here there are three Gaussian input distributions (bottom histograms),
representing three cues to the same sensory stimulus.
For clarity, Figs 6a and b show results for calculations performed using input population codes that have not
been corrupted with noise. However, PC/BC-DIM performs near optimal cue integration even when the input
distributions are noisy. To confirm this the method used in Ma et al. (2006) was employed. The two Gaussian
input distributions were both corrupted by Poisson noise. One distribution had a fixed mean (0°) while the other
had a randomly chosen mean (uniformly selected from the range [−12° : 12°]), so that cue conflict varied by
up to ±12°. Each input distribution was assigned a random standard deviation, chosen uniformly from the range
[20° : 60°]. Figs. 6c and d show plots of the network’s estimate of the mean and variance of the combined
probability distribution compared to the optimal estimates of these parameters. To quantify these results, the
maximum, median, and mean absolute difference between the network’s and the optimal estimate of the mean
was 0.76°, 0.18°, and 0.24°. The maximum, median, and mean percentage absolute difference between the
network’s and the optimal estimate of the variance was 34.62%, 5.03%, and 7.26%.
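The summary statistics quoted for these comparisons (maximum, median, and mean of the absolute or percentage absolute difference) can be computed as sketched below. The function name is hypothetical, the example numbers are made up purely to exercise the code, and expressing the percentage relative to the optimal value is an assumption about how the percentages were defined.

```python
import statistics

def error_summary(network, optimal, percentage=False):
    """Maximum, median and mean absolute difference between two estimates.

    If `percentage` is True, each difference is expressed as a percentage
    of the optimal value (an assumed definition).
    """
    diffs = []
    for n, o in zip(network, optimal):
        d = abs(n - o)
        diffs.append(100.0 * d / abs(o) if percentage else d)
    return max(diffs), statistics.median(diffs), statistics.mean(diffs)

# Made-up values for illustration only.
network_estimates = [1.0, 2.5, 2.0]
optimal_estimates = [1.0, 2.0, 4.0]
absolute = error_summary(network_estimates, optimal_estimates)
relative = error_summary(network_estimates, optimal_estimates, percentage=True)
```

Reporting the median alongside the maximum and mean helps to show whether large errors are typical or confined to a few trials.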
Repeating the above experiment but for cue integration with three sensory cues produced the results shown
in Fig. 7. An identical network was used except that each prediction neuron received input from three partitions
representing the three cues. Again, the network’s estimate of the combined probability distribution was accurate.
Specifically, the maximum, median, and mean absolute difference between the network’s and the optimal estimate
of the mean was 0.58°, 0.07°, and 0.09°. The maximum, median, and mean percentage absolute difference
between the network’s and the optimal estimate of the variance was 20.60%, 2.69%, and 3.72%.
PC/BC-DIM can also perform optimal cue integration with a non-flat prior. As in section 3.2, the prior was
incorporated by modifying the relative strengths of the synaptic weights. Figures 8a and b show illustrative results
when the prior probability distribution, for both cues, was a Gaussian centred at 0° and with a standard deviation of
60°. Each neuron’s weight vector, in both input spaces, was multiplied by this prior distribution. To demonstrate
that the network’s cue integration with a prior is near optimal even when the input distributions are noisy, the
experiment described in the previous paragraphs was repeated, and the results are shown in Figs. 8c and d. The
maximum, median, and mean absolute difference between the network’s and the optimal estimate of the mean was
0.28°, 0.05°, and 0.06°. The maximum, median, and mean percentage absolute difference between the network’s
and the optimal estimate of the variance was 10.11%, 1.59%, and 2.35%.

[Figure 8: panels (a)-(d); axes of (c): Optimal Estimate of Mean vs. Network Estimate of Mean; axes of (d): Optimal Estimate of σ² vs. Network Estimate of σ².]
Figure 8: Cue integration in the presence of a prior. (a) and (b) Example simulation results. (c) and
(d) Show cue integration accuracy for noisy population coded probability distributions. The format of
this figure is identical to, and explained in the caption of, Fig. 6. Due to the prior being centred at
0° the posterior probability distributions in these experiments (represented by the reconstruction neuron
responses in the upper histograms in (a) and (b)) are shifted towards 0° compared to the corresponding
experiments shown in Fig. 6. The Gaussian curves superimposed on the upper-right histograms show the
posterior distribution calculated via exact Bayesian inference (scaled in amplitude to approximately fit
the network’s estimate).

3.4 Cue Segregation
In multisensory integration experiments with human subjects, large cue conflict results in cue segregation rather
than integration (Beierholm et al., 2008), in which case the two cues are not perceived to have the same cause.
Determining if two cues should be integrated or segregated (i.e., determining if cues arise from a single cause,
or from multiple, independent causes) can be posed in terms of an inference problem: causal inference (Ma and
Pouget, 2008; Seilheimer et al., 2014; Shams and Beierholm, 2010; Vilares and Kording, 2011). While there exist
Bayesian models of causal inference, there are no current neurally-based models.
Existing neural models of cue integration and the closely related task of function approximation (see section 3.5) either take the form of (radial) basis function networks (Deneve and Pouget, 2003; Pouget and Sejnowski,
1994, 1997; Pouget and Snyder, 2000; Salinas and Abbott, 1995; Salinas and Sejnowski, 2001) or attractor networks (Deneve et al., 1999; Latham et al., 2003; Ma et al., 2006; Pouget et al., 2002, 1998). Both types of model
require that all the sensory information originates from a single source. Hence, if sensory inputs originating from
multiple underlying causes are presented simultaneously, these networks will erroneously combine this information into a single, incorrect, estimate (Pouget et al., 2002). Thus, while such networks can simulate cue integration,
they are incapable of modelling cue segregation. In contrast, PC/BC-DIM can model cue segregation as well as
cue integration. Unlike the quantitative analysis of cue integration in the preceding section, the assessment of cue
segregation offered here is only qualitative. Specifically, segregation is assumed to have occurred if the probability
distribution represented by the reconstruction neurons is multi-modal: as mentioned in section 3.1, multi-modal
distributions are treated as representations of multiple stimuli.
Figure 9a shows that when two inputs encode very different stimulus values the reconstruction neurons generate a bi-modal probability distribution with peaks at the positions of the two cues. This is in contrast to when there
is less cue conflict, as in Fig 6a, where the reconstruction neurons generate a mono-modal distribution that peaks
at the weighted mean of the two cues. PC/BC-DIM can perform cue segregation and integration simultaneously,
as illustrated in Fig. 9b. Here, one input space receives a bi-modal population code that peaks at −30° and +50°,
and the other input space receives a mono-modal distribution with a mean of +60°. The reconstruction neurons
generate a bi-modal distribution with peaks at approximately −30° and +55°. The first peak is thus the result
of information presented to only one input space (i.e., cue segregation), while the second peak is an estimate of
the stimulus value based on combining two cues that have the same precision presented to different input spaces
[Figure 9 (image not reproduced): panels (a) and (b), each showing input histograms (bottom) and reconstruction neuron responses (top).]
Figure 9: Cue segregation. (a) Shows two Gaussian input distributions (bottom histograms), representing
two cues with a large cue conflict. The network produces, in the reconstruction neuron responses (upper
histograms), bi-modal probability distributions representing stimuli at both locations predicted by the
two cues. (b) As for (a) but with a second cause represented by the bi-modal distribution presented to the
first input space. This second cause is integrated with the cue presented to the other input space.
[Figure 10 (image not reproduced): panels (a) and (b), each a row of five histograms of reconstruction neuron responses plotted from −90 to 90.]
Figure 10: Causal inference. The input (not shown) consists of two cues encoded as Gaussian probability
distributions with equal precisions. Cue conflict increases from left to right, from 10° to 90° in steps of
20°. Each subplot shows the responses of the first partition of the reconstruction neurons. The Gaussian
curves superimposed on the histograms show the posterior distributions calculated via exact Bayesian
inference for cue integration. (a) Each cue has a standard deviation of 20°. The leftmost subplot,
showing results for a conflict of 10°, is a repeat of the experiment shown in Fig. 6a. The rightmost
subplot, showing results for a conflict of 90°, is a repeat of the experiment shown in Fig. 9a. (b) As
(a) except cues have a standard deviation of 30°. In both cases, as cue conflict increases the probability
distribution generated by PC/BC-DIM changes from mono-modal (cue integration) to bi-modal (cue
segregation). When cues have lower precision, as shown in (b), cue integration occurs for a wider range
of cue conflicts.
(i.e., cue integration). In PC/BC-DIM whether signals are integrated or segregated depends on the degree of cue
conflict and the precision of the two cues. For cues with small standard deviation, segregation will occur at a
smaller conflict than when the cues have larger standard deviation, as illustrated in Fig. 10.
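The dependence of segregation on both conflict and precision can be illustrated with a simple stand-in that is not the PC/BC-DIM computation: the sum of two equal-width Gaussians is bi-modal only when their separation exceeds twice their standard deviation. The sketch below counts the peaks of such a sum; the function name count_modes and the sampling grid are assumptions made purely for illustration, and the conflict at which the network itself switches from integration to segregation need not coincide with this threshold.

```python
import numpy as np

def count_modes(conflict, sd):
    """Count local maxima of the sum of two Gaussians separated by `conflict` degrees.

    Illustrative stand-in only: this is NOT the PC/BC-DIM reconstruction.
    """
    axis = np.linspace(-180.0, 180.0, 3601)  # assumed 0.1-degree sampling
    mix = (np.exp(-0.5 * ((axis + conflict / 2) / sd) ** 2)
           + np.exp(-0.5 * ((axis - conflict / 2) / sd) ** 2))
    mid = mix[1:-1]
    # A sample is a peak if it exceeds its left neighbour and is not below its right one.
    peaks = (mid > mix[:-2]) & (mid >= mix[2:])
    return int(peaks.sum())

# Same conflict (50 degrees), two different cue widths.
modes_narrow = count_modes(50.0, 20.0)  # separation > 2 * sd: two peaks
modes_wide = count_modes(50.0, 30.0)    # separation < 2 * sd: one peak
```

In this toy setting a 50° conflict separates cues of 20° standard deviation but not cues of 30° standard deviation, which is the same qualitative trend as described for Fig. 10.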
3.5 Function Approximation
Function approximation is required for many tasks faced by the brain. For example, it is needed to perform sensory-sensory
coordinate transformations in order to bring different sources of sensory information into a common reference
frame, or to perform sensory-motor mappings in order to control movement. Here, to allow direct comparison with
previous work (Beck et al., 2011; Deneve et al., 2001; Pouget et al., 2002; Pouget and Sejnowski, 1997; Pouget and
Snyder, 2000) we only consider simple, linear, functions of one-dimensional variables, encoded using Gaussian
distributions. However, like this previous work, PC/BC-DIM is not limited to linear function approximation, nor
is PC/BC-DIM limited to computing with one-dimensional Gaussian inputs as has been shown previously (De
Meyer and Spratling, 2011, 2013).
Previous work (Beck et al., 2011; Deneve et al., 2001; Pouget et al., 2002; Pouget and Sejnowski, 1997; Pouget
and Snyder, 2000) has considered approximating a function of three variables (A, B, and C) such that C=A+B. If,
for example, A is considered to be a representation of the retinal position of an object, and B a representation of
eye position, then C can be considered a representation of the head-centred bearing of the object. Given any two
of these values, existing models can calculate the third (Deneve et al., 2001; Pouget et al., 2002). Alternatively,
if supplied with all three values existing networks can perform cue integration (Deneve et al., 2001; Pouget et al.,
2002). Furthermore, it has been shown that neurons in such networks display gain modulated responses, similar
to those observed in the dorsal pathway of the cortex (Deneve et al., 2001; Pouget et al., 2002). PC/BC-DIM
can reproduce all of these results, as is illustrated in Fig. 11a-e. For this task we can consider the input to
the PC/BC-DIM network, and hence, the reconstruction produced by the network to be partitioned into three
parts, as illustrated in Fig. 1b. Each partition represents a population coded probability distribution encoding the
uncertainty about the value of a different variable (A, B, or C). Each prediction neuron has a Gaussian RF, of
standard deviation 5°, in each of the three partitions. These RFs are centred in each input space so as to encode the
relationship A+B=C for one specific set of values. The RFs of the population as a whole evenly tile the A and B
input spaces, so that the network can approximate C=A+B for all values of A and B. For an intuitive understanding
of how the PC/BC-DIM network performs function approximation, consider the result shown in Fig. 11a. The two
input distributions cause responses in the subset of prediction neurons with RFs that are centred near −30° in the
first partition and near 20° in the second partition. Each of these prediction neurons has an RF centred near −10°
in the third partition. The reconstruction neuron responses are a linear combination of all the active prediction
neuron RFs, and hence, will peak at the appropriate places in each of the three partitions.
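The wiring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the update rule is the divisive-error, multiplicative-update form of PC/BC-DIM, while the 1° sampling of each input space, the 5° spacing of RF centres, the weight normalisation, the values of eps1 and eps2, the number of iterations, and the 10° input standard deviation are all choices made here for illustration. Only the 5° RF standard deviation and the relationship A+B=C come from the text.

```python
import numpy as np

def gauss(axis, centre, sd):
    """Unnormalised Gaussian bump sampled on `axis` (degrees)."""
    return np.exp(-0.5 * ((axis - centre) / sd) ** 2)

# Assumed 1-degree sampling of the three input spaces; C spans twice the range.
AX_A = np.arange(-60.0, 61.0)
AX_B = np.arange(-60.0, 61.0)
AX_C = np.arange(-120.0, 121.0)

def build_weights(rf_sd=5.0, spacing=5.0):
    """One prediction neuron per (a, b) pair, with RFs centred at a, b, and a + b."""
    centres = np.arange(-60.0, 60.0 + spacing, spacing)
    rf = np.array([
        np.concatenate([gauss(AX_A, a, rf_sd),
                        gauss(AX_B, b, rf_sd),
                        gauss(AX_C, a + b, rf_sd)])
        for a in centres for b in centres
    ])
    W = rf / rf.sum(axis=1, keepdims=True)      # assumed: each row sums to one
    V = (rf / rf.max(axis=1, keepdims=True)).T  # assumed: each RF peaks at one
    return W, V

def pcbc_dim(W, V, x, n_iter=50, eps1=1e-6, eps2=1e-4):
    """Iterate the PC/BC-DIM updates and return predictions y and reconstruction r."""
    y = np.zeros(W.shape[0])
    for _ in range(n_iter):
        r = V @ y                 # reconstruction of the input
        e = x / (eps2 + r)        # divisive prediction error
        y = (eps1 + y) * (W @ e)  # multiplicative update of the prediction neurons
    return y, V @ y

W, V = build_weights()
# Inputs encode A = -30 and B = +20; the C partition is left empty.
x = np.concatenate([gauss(AX_A, -30.0, 10.0),
                    gauss(AX_B, 20.0, 10.0),
                    np.zeros(AX_C.size)])
y, r = pcbc_dim(W, V, x)
r_c = r[AX_A.size + AX_B.size:]  # third partition of the reconstruction
c_peak = AX_C[np.argmax(r_c)]    # location of the peak in the C partition
```

With these inputs the third partition of the reconstruction should peak close to −10°, because the most active prediction neurons are those whose third RF is centred near the sum of the two input values.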
To confirm that the PC/BC-DIM network can perform accurate function approximation even when the population codes are corrupted with noise, the method used in Deneve et al. (2001) was employed. Values for A
and B were chosen at random (uniformly from the range [−40° : 40°]) and encoded using Gaussian probability
distributions (with a fixed standard deviation of 10°). These population codes were corrupted with Poisson noise.
The statistically optimum estimate of C was found using the maximum likelihood estimate of A and B calculated
from the noisy input population codes. The network’s estimate of C was also calculated by taking the maximum
likelihood estimate from the reconstruction generated by the PC/BC-DIM algorithm. Across 100 000 trials, the
variance between the true value of C and the network’s estimate of C was only 0.01% worse than the variance between the true value of C and the maximum likelihood estimate calculated using the input distributions. Repeating
this experiment to estimate B given A and C showed that the network’s estimate was 0.08% poorer than the statistically optimum estimate. In comparison, Deneve et al. (2001) report corresponding values for their algorithm of
3.3% and 2.1%.
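The construction of a noisy input and its maximum likelihood readout can be sketched as follows. This is an illustration under assumptions rather than the procedure used for the reported figures: the peak spike count of 50, the 1° sampling, and the random seed are hypothetical, and the readout relies on the standard result that, for densely tiled Gaussian codes of known width with independent Poisson noise, the maximum likelihood estimate of the mean is the centre of mass of the counts.

```python
import numpy as np

def noisy_population_code(axis, mean, sd, peak_count, rng):
    """Gaussian population code scaled to `peak_count` and corrupted with Poisson noise."""
    rates = peak_count * np.exp(-0.5 * ((axis - mean) / sd) ** 2)
    return rng.poisson(rates).astype(float)

def centre_of_mass(axis, counts):
    """ML estimate of the mean for a dense Gaussian code of known width with Poisson noise."""
    return float(np.sum(axis * counts) / np.sum(counts))

rng = np.random.default_rng(0)        # hypothetical seed
axis = np.arange(-90.0, 91.0)         # assumed sampling, wide enough to avoid truncation
a_true, b_true = rng.uniform(-40.0, 40.0, size=2)

a_hat = centre_of_mass(axis, noisy_population_code(axis, a_true, 10.0, 50.0, rng))
b_hat = centre_of_mass(axis, noisy_population_code(axis, b_true, 10.0, 50.0, rng))
c_ml = a_hat + b_hat                  # reference estimate of C = A + B
```

The value c_ml plays the role of the statistically optimum estimate against which a network's estimate of C, read out from its reconstruction in the same way, could be compared across many trials.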
Further experiments were performed in which the input distributions encoding the values of A and B were
given randomly selected standard deviations in addition to randomly selected means (as in the experiments reported in previous sections). The mean was chosen from the range [−40° : 40°] and the standard deviation
chosen from the range [10° : 20°]. Each input was encoded using a population code corrupted with Poisson noise.
Fig. 11f plots the network’s estimate of the mean of the probability distribution for variable C compared to the
statistically optimal estimate for 100 trials. Fig. 11g shows a similar comparison for the variance of the probability distribution encoding variable C. It can be seen that the PC/BC-DIM network produces accurate estimates
of both parameters. To quantify these results, the maximum, median, and mean absolute difference between the
network’s and the optimal estimate of the mean was 0.04°, 0.009°, and 0.01° over 100 000 trials. The maximum,
median, and mean percentage absolute difference between the network’s and the optimal estimate of the variance
was 11.79%, 4.63%, and 4.99% over the same 100 000 trials. Results for an equivalent experiment to estimate B
given A and C, are shown in Figs. 11h and i. In this case, the maximum, median, and mean absolute difference
[Figure 11 (image not reproduced): panels (a)-(d) show input and reconstruction histograms; panel (e) plots response against A for B = −30.0, −22.5, −15.0, −7.5, and 0.0; panels (f) and (h) plot Network Estimate of Mean against Optimal Estimate of Mean; panels (g) and (i) plot Network Estimate of σ² against Optimal Estimate of σ².]
between the network’s and the optimal estimate of the mean was 0.52°, 0.03°, and 0.05° over 100 000 trials.
The maximum, median, and mean percentage absolute difference between the network’s and the optimal estimate
of the variance was 13.60%, 5.43%, and 5.67% over the same 100 000 trials. While the poor performance in
estimating the variance is a clear limitation of the PC/BC-DIM model, the results presented here still go beyond
those reported for previous methods. Specifically, the attractor network model (Deneve et al., 2001; Pouget et al.,
2002) is unable to correctly calculate the uncertainty of each variable as the width of each output distribution is
fixed (Pouget et al., 2003). Hence, this model cannot estimate the variance of the posterior and it would fail on
the experiment reported in Fig. 11g as well as the experiment reported in Fig. 11i. Other existing methods for
function approximation (Beck et al., 2011; Pouget and Sejnowski, 1997; Pouget and Snyder, 2000) can correctly
represent the variance of the posterior, but are limited to performing function approximation in one direction (e.g.
calculating C from A and B), and hence, would not be able to perform the experiment shown in Fig. 11b or provide
any estimate of either the mean or the variance of B, like that shown in Fig. 11h and i.
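One way to read the "optimal estimate" of the derived variable, assuming that the inputs are independent Gaussians (an assumption made here, not a statement of how the reference values were computed), is that means and variances simply add:

```latex
\mu_C = \mu_A + \mu_B, \qquad \sigma_C^2 = \sigma_A^2 + \sigma_B^2 ,
\qquad\text{and, for the inverse mapping,}\qquad
\mu_B = \mu_C - \mu_A, \qquad \sigma_B^2 = \sigma_C^2 + \sigma_A^2 .
```

Under this reading, input standard deviations between 10° and 20° give a variance for C between 2 × 10² = 200 and 2 × 20² = 800, so the derived variable is always less certain than either input.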
Fig. 12 shows results for a PC/BC-DIM network performing function approximation with four variables (A,
Figure 11: Function approximation with three variables. A PC/BC-DIM network as
shown in Fig. 1b is used, where the three partitions of the input are used to represent probability distributions for three different variables. If these variables are denoted as A, B, and C, then the network has been
wired-up to approximate C=A+B. Note that C has a wider range of possible values than A and B, and
hence, the x-axes of the histograms representing C have a different scale than those representing A and
B. (a) When the two inputs representing A and B are presented (lower histograms), the reconstruction
neurons generate an output (upper histograms) that represents the correct estimate of the value of C (as
well as outputs representing the given values of A and B). (b) When the two inputs representing A and C
are presented (lower histograms), the reconstruction neurons generate an output (upper histograms) that
represents the correct estimate of the value of B (as well as outputs representing the given values of A
and C). In (a) and (b) the population codes generated by the reconstruction neurons have means which
correctly represent the maximum likelihood estimate of the corresponding stimulus value and standard
deviations that reflect the certainty in this estimate, such that the estimated values (C in (a) and B in (b))
are represented by population codes with larger variance, although this variance is less than the optimal
value. (c) and (d) When all inputs are presented simultaneously to the network, it has multiple (potentially conflicting) estimates of the true value for C. The network performs cue integration with these
separate sensory inputs. (c) The inputs to the first two partitions are consistent with a value of C equal
to −10 while the input to the third partition indicates that the most likely value of C is +10. In this case,
the optimal, combined, estimate of the true value of C is 0. (d) As for (c) except that the precisions of
the input cues are no longer equal. The precision of the input to the third partition has been reduced, and
hence, the combined estimate of C is now weighted more towards the estimate given by A+B. (e) Gain
modulation. Here the response of a single prediction neuron has been measured. Its response is plotted
as a function of the value of variable A, for a number of different values of variable B. The position and
width of the tuning curve is unaffected by the value of B, but the gain of the response is affected by B.
Such gain modulation is observed in various regions along the dorsal pathway, for example, when a retinal RF (variable A) is modulated by eye position (variable B). (f) 100 experiments were performed using
input distributions for A and B with randomly chosen means and standard deviations that were encoded
with noisy population codes. Each dot shows a comparison between the PC/BC-DIM network’s estimate
of the mean of the probability distribution encoding the estimated value of C and the optimal estimate for
C. (g) As for (f) but for variance. (h) and (i) As for (f) and (g) except using inputs encoding A and C and
estimating the probability distribution for B.
B, C, and D). The network encodes the relationship A+B+C=D. When any three variables are presented as population codes to the network, it calculates a Gaussian probability distribution encoded by the firing rates of the
reconstruction neurons that represents the correct value of the missing variable (Fig. 12a and b). Unlike previous
methods of function approximation (Deneve et al., 2001; Pouget et al., 2003, 2002; Pouget and Sejnowski, 1997;
Pouget and Snyder, 2000), PC/BC-DIM can simultaneously represent multiple stimuli. It can therefore simultaneously calculate two (or more) separate results, encoded as a bi-modal (or multi-modal) probability distribution,
as illustrated in Fig. 12c.
When fewer than three inputs are present the uncertainty about the missing values is large, and this is reflected
in the population codes calculated by the network (as shown in Fig. 12d). However, as for the previous network
(with three inputs) the variance of the output distributions is underestimated. Specifically, the probability distribution for B should be uniform over the full range of possible values [−60° : 60°]. However, edge effects result
in the probability distribution not being completely uniform. Given that B has an equal probability of taking any
value in the range −60° to +60°, and zero probability of taking a value from outside this range, the posterior for
D should be uniform in the range −70° to +50°. The population code produced by the reconstruction neurons is
approximately uniform between −50° and +30° which is less than the expected range (and is due to the underestimation of the range of B). However, as noted earlier in this section, previous algorithms that perform function
approximation (Beck et al., 2011; Deneve et al., 2001; Pouget et al., 2002; Pouget and Sejnowski, 1997; Pouget
and Snyder, 2000) would completely fail on this task.
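The expected range for D quoted above follows directly from the encoded relationship. As a worked step, treating the two supplied inputs as fixed values (a simplification that ignores their own uncertainty):

```latex
D = B + (A + C), \qquad B \in [-60^\circ,\, 60^\circ]
\;\Longrightarrow\;
D \in [-60^\circ + (A + C),\; 60^\circ + (A + C)] .
```

The stated range of −70° to +50° therefore corresponds to A + C = −10° and has a width of 120°, whereas the approximately uniform region of −50° to +30° produced by the network has a width of 80°, two thirds of the expected width.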
Figs. 12e and f compare the network’s estimate of the mean and variance of the probability distribution encoding variable D with the statistically optimal estimates of these parameters. These results are for 100 trials with
the values for A, B and C chosen at random (uniformly in the range [−25° : 25°]) and encoded using Gaussian
probability distributions (with a standard deviation chosen at random from the range [10° : 20°]). The equivalent
results for B, estimated from inputs encoding A, C, and D are shown in Figs. 12g and h. It can be seen that
PC/BC-DIM is accurate at calculating the parameters of each posterior distribution except that it underestimates
the variance of the probability distribution encoding variable B.
[Figure 12 (image not reproduced): panels (a)-(d) show input and reconstruction histograms for the four partitions; panels (e) and (g) plot Network Estimate of Mean against Optimal Estimate of Mean; panels (f) and (h) plot Network Estimate of σ² against Optimal Estimate of σ².]
It is possible to wire-up PC/BC-DIM networks to encode any function defined over any number of variables.
However, the number of prediction neurons required increases exponentially with the number of variables, as
is the case for any other method that computes with basis functions (Deneve and Pouget, 2003; Pouget and Sejnowski, 1997). To resolve this issue it is theoretically possible to decompose computations into several steps, and
implement each sub-task using a separate basis function network (Pouget et al., 2002). While previous methods of
computing with basis functions (e.g., Deneve et al., 2001; Deneve and Pouget, 2003; Pouget et al., 2002; Pouget
and Sejnowski, 1997; Pouget and Snyder, 2000) should be capable of operating in this way, it has not been demonstrated that they can. Here, it is shown that a PC/BC-DIM network can be decomposed into multiple sub-networks
to compute a function. Fig. 13a shows a single PC/BC-DIM network performing function approximation with
four variables (A, B, C, and D). While it is possible to provide inputs to any of the four partitions, and read outputs
from any of the four partitions of the reconstruction neurons, the particular combination of inputs and outputs
needed to estimate D given A, B and C is shown in Fig. 13a. The architecture required to allow the same function to be approximated with two interconnected PC/BC-DIM networks, forming a simple two-stage PC/BC-DIM
Figure 12: Function approximation with four variables. A PC/BC-DIM network has
been wired-up to approximate the function D=A+B+C. Note that D has a wider range of possible values
than A, B, and C, and hence, the x-axes of the histograms representing D have different scales to those
representing A, B, and C. (a) When the three inputs representing A, B, and C are presented (lower
histograms), the reconstruction neurons generate an output (upper histograms) that represents the correct
value of D (as well as outputs representing the given values of A, B, and C). (b) When the three inputs
representing A, C, and D are presented (lower histograms), the reconstruction neurons generate an output
(upper histograms) that estimates the correct value of B (as well as outputs representing the given values
of A, C, and D). (c) As (a) but with two values of A represented by a bi-modal population code presented
to the first partition of the input. The network correctly calculates two values for D represented by the
bi-modal population code produced by the reconstruction neurons in the last partition. (d) When the
two inputs representing A and C are presented (lower histograms), there is a large uncertainty about
the values of B and D and this is (partially) represented in the probability distributions encoding these
variables generated by the network (upper histograms). (e) 100 experiments were performed using input
distributions for A, B, and C with randomly chosen means and standard deviations, that were encoded
with noisy population codes. Each dot shows a comparison between the PC/BC-DIM network’s estimate
of the mean of the probability distribution encoding the approximated value of D and the optimal estimate
for D. (f) As for (e) but for variance. (g) and (h) As for (e) and (f) except using inputs encoding A, C,
and D and estimating the parameters of the probability distribution for variable B.
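[Figure 13 (diagram not reproduced): panel (a) shows a single-stage network with inputs xa, xb, and xc, error neurons ea to ed, prediction neurons y, weights W and V, and reconstruction neurons ra to rd; panel (b) shows two sub-networks, S1 and S2, linked through an intermediate partition i.]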
Figure 13: PC/BC-DIM neural network architectures for function approximation with four variables. (a)
A single-stage network to calculate D given A, B, and C. (b) A hierarchical architecture, consisting of
two interconnected PC/BC-DIM networks, for calculating the same function.
hierarchy, is shown in Fig. 13b. The first network calculates an intermediate result (A+B) in one partition of its
reconstruction neurons. This intermediate result provides input to one of the partitions of the second PC/BC-DIM
network. The second network’s reconstruction of this intermediate representation is fed-back as input to the first
PC/BC-DIM network. The resulting two-stage network has fewer prediction neurons in total than the equivalent
single-stage network (1850 compared to 15625 for the particular task used here); however, it produces almost
identical results (not shown) to those presented in Fig. 12.
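The saving in prediction neurons can be made concrete with a small counting sketch. The tiling used here is an assumption chosen because it reproduces the reported totals, not a description taken from the text: 25 RF centres per input variable, and an intermediate variable A+B that needs 2 × 25 − 1 = 49 centres because it spans twice the range at the same spacing.

```python
def single_stage_neurons(n_centres, n_inputs):
    """One prediction neuron per combination of RF centres across the input variables."""
    return n_centres ** n_inputs

def two_stage_neurons(n_centres):
    """First stage tiles (A, B); second stage tiles (A+B, C)."""
    n_intermediate = 2 * n_centres - 1  # distinct sums of two grid values
    stage1 = n_centres ** 2
    stage2 = n_intermediate * n_centres
    return stage1 + stage2

N = 25  # hypothetical number of RF centres per variable
single = single_stage_neurons(N, 3)  # 25**3 = 15625
hierarchy = two_stage_neurons(N)     # 625 + 1225 = 1850
```

Under this tiling the single-stage count grows as the cube of the number of centres, whereas the two-stage count grows only quadratically, which is why decomposition becomes more attractive as variables are added.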
To quantitatively compare the performance of the single-stage network and the hierarchical network, both
networks were used to perform function approximation with noisy input population codes. 100 000 trials were
performed for two conditions. In the first condition each network calculated D given noisy inputs encoding
randomly selected values for A, B, and C. The statistically optimum estimate of D was found using the maximum
likelihood estimate of A, B, and C calculated from the noisy input population codes and this was compared to the
network’s estimate of D calculated by taking the maximum likelihood estimate from the reconstruction generated
by the PC/BC-DIM algorithm. The median absolute difference between the network’s estimate and the optimal
estimate was 0.14° for the single-stage network, and 0.005° for the hierarchical network. The median percentage
absolute difference between the network’s estimate of the variance of the probability distribution encoding D
and the statistically optimum estimate was 1.48% for the single-stage network, and 5.02% for the hierarchical
network. In the second condition, each network calculated B given noisy inputs encoding randomly selected
values for A, C, and D. Across 100 000 trials, the median absolute difference between the network’s estimate and
the optimal estimate of the mean was 0.38° for the single-stage network, and 0.29° for the hierarchical network.
The median percentage absolute difference between the network’s estimate and the statistically optimum estimate
of the variance was 9.75% for the single-stage network, and 2.83% for the hierarchical network. Hence, both the
single-stage and hierarchical networks produce reasonable estimates of the posterior distributions.
3.6 Computations with Non-Gaussian Distributions
All the previous experiments have been performed using one-dimensional Gaussian population codes, and networks with Gaussian RFs. The results in this section demonstrate that PC/BC-DIM can perform Bayesian inference with stimuli and synaptic weights that are not Gaussian. The input is a greyscale image and the prediction
neurons are given synaptic weights defined using Gabor functions (see footnote c). The RFs of the whole population of prediction neurons tile the input image with Gabor RFs. A second set of inputs (a one-dimensional vector) is defined
to represent orientation. Each prediction neuron has a Gaussian RF in this second partition of the input,
with this Gaussian centred at the orientation corresponding to the orientation of that prediction neuron’s Gabor
RF (see footnote d). Defining this extra partition of the input automatically defines a second partition of the reconstruction neuron
population. The reconstruction in this second partition will be a population code representing the distribution of
orientations signalled by the responses of the prediction neurons. It would be possible to define further partitions
of the input and reconstruction neuron population to encode other features represented by the prediction neurons
(e.g., location or phase); however, here we only consider stimulus orientation. The reconstruction neurons that
represent orientation are like complex-cells in V1 as they pool the responses of multiple prediction neurons (corresponding to simple-cells in this analogy). The orientation-selective reconstruction neurons pool the responses
of prediction neurons with the same orientation preference, but with RFs at a range of spatial locations and with
a range of phase preferences. Here, so that we only have one population code representing orientation, spatial
pooling takes place over the whole image rather than in a small patch of image as would be the case for cortical
complex-cells.
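The two kinds of weights held by each prediction neuron in this network can be sketched as follows. This is a minimal illustration, not the parameterisation used in the experiments: the patch size, wavelength, envelope width, orientation sampling, 20° tuning width, and the wrap-around of orientation at 180° are all assumptions, and the function names are hypothetical. The split into ON and OFF channels follows the description in footnote c.

```python
import numpy as np

def gabor(size, wavelength, theta_deg, phase_deg, sigma):
    """Gabor patch: a cosine carrier at orientation theta under an isotropic Gaussian envelope."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    theta = np.deg2rad(theta_deg)
    xr = xx * np.cos(theta) + yy * np.sin(theta)  # coordinate along the carrier
    envelope = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + np.deg2rad(phase_deg))
    return envelope * carrier

def on_off_weights(g):
    """Positive and negative parts of the Gabor define RFs for the ON and OFF channels."""
    return np.maximum(g, 0.0), np.maximum(-g, 0.0)

def orientation_weights(pref_deg, axis_deg, sd_deg):
    """Gaussian RF over the orientation partition, wrapped at 180 degrees (assumed)."""
    diff = (axis_deg - pref_deg + 90.0) % 180.0 - 90.0
    return np.exp(-0.5 * (diff / sd_deg) ** 2)

g = gabor(size=21, wavelength=6.0, theta_deg=45.0, phase_deg=0.0, sigma=4.0)
w_on, w_off = on_off_weights(g)
ori_axis = np.arange(0.0, 180.0, 5.0)  # hypothetical orientation samples
w_ori = orientation_weights(45.0, ori_axis, 20.0)
# One prediction neuron's full RF: image weights followed by orientation weights.
rf = np.concatenate([w_on.ravel(), w_off.ravel(), w_ori])
```

Concatenating the image weights with the orientation weights gives a neuron whose activity contributes both to the reconstructed image and to the population code over orientation.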
If an input image is presented to this PC/BC-DIM network it generates reconstruction neuron responses that
represent both the image and a population code representing a probability distribution of orientations within the
image, as illustrated in Fig. 14. When the image contains a single sinusoidal grating, the population code is a
Gaussian with a peak approximately at the orientation of the grating (Fig. 14a). If the image is corrupted by
noise, then the population code is more distributed (Fig. 14b). If a second sinusoidal grating is superimposed
over the first, then the population code is bi-modal with peaks at approximately the orientation of both gratings
(Fig. 14c). If both an input image and a population code representing orientation are simultaneously presented
as inputs of the network, then cue integration can occur. For example, if the input image is a single sinusoidal
grating oriented at 37° from the vertical, and the population code is a Gaussian distribution centred at 49°, then
the combined estimate of the orientation represented by the second partition of the reconstruction neurons is
intermediate between the values signalled by the two cues (Fig. 14d). Furthermore, increasing the precision of the
input probability distribution (by reducing its standard deviation from 20° to 10°) causes the combined estimate to
shift further towards 49° (Fig. 14e). When cue conflict is large, the network performs cue segregation rather than
integration, generating a bi-modal distribution of responses in the second partition of reconstruction neurons. This
distribution has peaks at approximately the orientation of both cues (Fig. 14f). If multiple cues are supplied then
both cue integration and segregation can co-occur (Fig. 14g). To encode a prior into the weights of the network,
the Gaussian weights for each prediction neuron were scaled by a function of orientation (the prior). This function
of orientation was a Gaussian centred at 90° and with a standard deviation of 30°. When the image contains a
single sinusoidal grating, the population code representing orientation that is generated by the network is shifted
towards 90° (Fig. 14h and i).
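The direction and size of the shift expected in the cue integration example can be checked with the standard precision-weighted combination of two Gaussian cues. The sketch below assumes that the image cue behaves like a Gaussian with the 11.2° standard deviation adopted for it later in this section, and it ignores the circularity of orientation, which is reasonable only because the conflict is small. The function name fuse is hypothetical.

```python
import numpy as np

def fuse(means, sds):
    """Precision-weighted combination of independent Gaussian cues."""
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.square(np.asarray(sds, dtype=float))
    mean = float(np.sum(precisions * means) / np.sum(precisions))
    sd = float(np.sqrt(1.0 / np.sum(precisions)))
    return mean, sd

# Image cue at 37 degrees (assumed sd 11.2), population-coded cue at 49 degrees.
m20, s20 = fuse([37.0, 49.0], [11.2, 20.0])  # second cue with sd 20
m10, s10 = fuse([37.0, 49.0], [11.2, 10.0])  # second cue with sd 10
```

Under these assumptions the combined estimate is about 39.9° when the second cue has a standard deviation of 20° and about 43.7° when it has a standard deviation of 10°, so sharpening the second cue moves the estimate towards 49°.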
The above results are qualitatively consistent with Bayesian inference. To determine if the inference performed
by the network is optimal it would be necessary to compare the posterior calculated by the network with that
expected from exact Bayesian inference. However, the form of the true likelihood for the image cue, and hence,
the correct posterior is not known. The situation is analogous to that faced when psychophysical experiments are
performed to assess if human performance on cue integration tasks is optimal (e.g., Battaglia et al., 2003; Ernst and
Banks, 2002; Helbig and Ernst, 2007; Jacobs, 1999; Knill and Saunders, 2003). In such experiments it is assumed
that the probability distributions take a certain form (typically that they are all Gaussian), and the parameters of
the distributions encoding each cue (i.e., the means and variances) are then estimated from experiments in which
those cues are presented to the subject in isolation. From these estimates the maximum-likelihood estimate for the
cue integration task is determined and this value is compared to the subject’s response when presented with both
cues. The same procedure can be followed for the current simulation results. It is assumed that the image cue
Footnote c: As in previous work using PC/BC-DIM to model V1 (Spratling, 2010, 2011, 2012a,b,c), the input is divided into ON and OFF channels
containing the high and low contrast components of the image, respectively, and the positive and negative values of the Gabor function are
used to define separate RFs for these two input channels. The figures show the original image, which is equal to the ON channel minus the
OFF channel of the input to PC/BC-DIM, and the reconstruction of the ON channel minus the reconstruction of the OFF channel.
Footnote d: In contrast to previous experiments, here the V weights for each partition were scaled independently to allow both cues to have a similar
influence on prediction neuron response.
[Figure 14 (image not reproduced): panels (a)-(i).]
is equivalent to a Gaussian probability distribution with a mean given by the orientation of the image and with a
standard deviation of 11.2°. The latter value is an estimate derived from the width of the posterior produced by
the reconstruction neurons when only the image cue is presented to the PC/BC-DIM network (as in Fig. 14a). 100
cue integration trials were performed with randomly selected cues. In each trial the first cue was an image of a
single sinusoidal grating at a randomly chosen orientation, and the second cue was a Gaussian population code
representing an orientation with a randomly chosen cue conflict of up to ±10°, and a random standard deviation,
chosen uniformly from the range [10°, 20°]. In each case the optimal estimate of the orientation was calculated
assuming the first cue was equivalent to a Gaussian probability distribution centred at the orientation of the grating
and with a standard deviation of 11.2°. When this value was compared to the network’s estimate of the orientation,
they were found to be in very close agreement (Fig. 15a). Over the 100 trials the maximum, median, and mean
absolute difference between the network’s and the optimal estimate of the orientation was 0.63°, 0.20°, and 0.22°.
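The optimal estimate referred to above follows from the standard result for combining two independent Gaussian
cues: the combined mean is the precision-weighted average of the cue means, and the combined variance is the
reciprocal of the summed precisions. The following is a minimal Python sketch of that calculation, not the code
used for the simulations. The function name `combine_gaussian_cues` and the example orientations are illustrative
assumptions, and the sketch ignores the circularity of orientation, which is reasonable only for small cue conflicts.

```python
def combine_gaussian_cues(mu_a, sigma_a, mu_b, sigma_b):
    """Return (mean, variance) of the product of two Gaussian likelihoods.

    Assumes independent cues, a flat prior, and no wrap-around of orientation.
    """
    precision_a = 1.0 / sigma_a ** 2
    precision_b = 1.0 / sigma_b ** 2
    variance = 1.0 / (precision_a + precision_b)
    mean = variance * (precision_a * mu_a + precision_b * mu_b)
    return mean, variance


# Hypothetical trial: image cue at 40 deg (sd 11.2 deg, the value assumed in the text),
# population-code cue at 48 deg with sd 15 deg (within the 10 to 20 deg range used).
mean, variance = combine_gaussian_cues(40.0, 11.2, 48.0, 15.0)
print(f"combined mean = {mean:.2f} deg, combined sd = {variance ** 0.5:.2f} deg")
```

With equal standard deviations the estimate falls midway between the two cues; as one cue becomes more precise
the estimate moves towards it, which is the behaviour illustrated in Fig. 14d and e.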
Figure 14: Decoding, cue integration, cue segregation, and Bayesian inference with a
prior using image stimuli and neurons with Gabor RFs. The format of the diagrams is the same as used
in previous figures, except here the inputs to, and the reconstructions of, the first partition are shown as
2D images rather than 1D vectors. Also, here, the RFs of the most active prediction neurons are indicated
by the grey squares superimposed on the middle histograms. The numbers above the histograms show
the maximum likelihood estimate of the stimulus value calculated from that histogram (equation 6). (a)
The input image (lower left) is a circular sinusoidal grating oriented at 37° from the vertical. The
reconstruction neurons generate a population code (upper right histogram) that represents the orientation of
this stimulus. (b) As (a) but the input image is corrupted with speckle noise. The reconstruction neurons’
representation of the orientation (upper right histogram) is peaked at approximately the correct position,
but is wider than in (a). Compare this with the decoding of noisy Gaussian population codes illustrated
in Fig. 2. (c) The input image (lower left) is composed of two superimposed sinusoidal gratings, one
oriented at 37° and the other at 127° from the vertical. The reconstruction neurons’ representation of the
orientation (upper right histogram) is bi-modal and correctly represents the orientations of both gratings
(cf., Fig. 4). (d) and (e) The input image is the same as in (a) but a second Gaussian population code,
representing orientation, is also presented to the second partition of the input (lower right). The PC/BC-DIM
network integrates both cues to generate a combined estimate of the orientation (upper right histogram).
When the Gaussian population code that is provided as input to the second partition has greater precision,
as shown in (e), the estimate of the orientation moves towards the value indicated by that cue.
Compare these results with cue integration for Gaussian population codes illustrated in Fig. 6. (f) When
the cue conflict between the orientation of the image and that encoded by the Gaussian population code
is large, cue segregation occurs and the reconstruction neurons’ representation of the orientation (upper
right histogram) is bi-modal and correctly represents the orientations of both cues (cf., Fig. 9a). (g) The
orientation input contains two cues (those used in both (e) and (f)) represented by a bi-modal population
code. The first orientation cue is integrated with the orientation information extracted from the image, the
second orientation cue is represented by a separate peak in the reconstruction neurons’ representation of
the orientation (upper right histogram). Compare this with simultaneous cue integration and cue segregation
for Gaussian population codes illustrated in Fig. 9b. (h) and (i) To incorporate a prior the amplitude
of each prediction neuron’s Gaussian RF has been modulated by a Gaussian centred at an orientation of
90° from the vertical and with a standard deviation of 30°. The input (lower left images) is a circular
sinusoidal grating oriented at 37° (h) and 121.5° (i) from the vertical. Due to the prior being centred at
90° the network’s estimates of the posterior probability distributions (upper right histograms) are shifted
towards 90°. Compare this with Bayesian inference using Gaussian population codes illustrated in Fig. 5.
The preceding experiment was repeated using input images corrupted with noise, like that shown in Fig. 14b,
as the first cue and Gaussian input distributions corrupted by Poisson noise as the second cue. In this case, to
calculate the optimal estimate of the orientation it was assumed that the image was equivalent to a Gaussian
probability distribution with a standard deviation of 13.6°. Again, these optimal estimates were found to be
consistent with the network’s estimates of the orientation in this cue combination task, as shown in Fig. 15b. Over
the 100 trials the maximum, median, and mean absolute difference between the network’s and the optimal estimate
of the orientation was 1.40°, 0.40°, and 0.43°. The accuracy of the estimates of the variance of the posterior was
comparable to that of the same experiment performed with two cues defined by Gaussian probability distributions (see
Section 3.3). Specifically, over the 100 trials the maximum, median, and mean percentage absolute difference
between the network’s and the optimal estimate of the variance was 19.78%, 4.13%, and 5.78%.
Performing 100 trials with the only input being an image of a single sinusoidal grating at a randomly chosen
orientation, but with a prior imposed on the weights (as in the simulation results shown in Fig. 14h and i), also
produced estimates of the orientation close to the statistically optimal value that would be predicted by applying
Bayes’ theorem (as shown in Fig. 15c). Specifically, the maximum, median, and mean absolute difference between
the network’s and the optimal estimate of the orientation was 1.59°, 0.53°, and 0.64°. Hence, despite one cue
being defined in terms of a two-dimensional array of intensity values, rather than a one-dimensional Gaussian
population code, the PC/BC-DIM network is still capable of performing accurate inference.
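The statistically optimal value used in this comparison can be obtained by applying Bayes' theorem directly. The
sketch below, in plain Python, multiplies a Gaussian prior (centred at 90° with a standard deviation of 30°, as in
Fig. 14h and i) by a Gaussian likelihood on a discretised orientation axis, normalises the product, and reads off
the posterior mean. The likelihood standard deviation of 11.2° is the value assumed in the text; the grid step, the
function names, and the treatment of orientation as a non-circular variable on [0°, 180°) are assumptions made
for illustration only.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalised Gaussian; the constant cancels when the posterior is normalised."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def posterior_mean(stimulus, likelihood_sd=11.2, prior_mu=90.0, prior_sd=30.0, step=0.5):
    """Posterior mean over a grid of orientations from 0 (inclusive) to 180 (exclusive)."""
    grid = [i * step for i in range(int(180 / step))]
    # Bayes' theorem: posterior is proportional to likelihood times prior.
    unnormalised = [gaussian(x, stimulus, likelihood_sd) * gaussian(x, prior_mu, prior_sd)
                    for x in grid]
    total = sum(unnormalised)
    posterior = [p / total for p in unnormalised]
    return sum(x * p for x, p in zip(grid, posterior))


for stimulus in (37.0, 121.5):  # grating orientations used in Fig. 14h and i
    print(f"stimulus {stimulus:6.1f} deg -> posterior mean {posterior_mean(stimulus):6.1f} deg")
```

Both posterior means are pulled from the stimulus orientation towards the prior mean of 90°, which is the shift
described for Fig. 14h and i. A grid is used rather than the closed-form Gaussian product so that the same code
would also work with a non-Gaussian prior or likelihood.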
[Figure 15: three scatter plots, panels (a) to (c), with axes labelled Optimal Estimate of Mean and Network
Estimate of Mean and axis ticks at 45, 90, and 135.]
Figure 15: Accuracy of inference when using image stimuli and neurons with Gabor RFs. (a) and (b)
Cue integration accuracy. Each dot compares the PC/BC-DIM network’s estimate of the
mean of the combined probability distribution with the probabilistically optimal estimate of the
mean calculated by assuming that the probability distribution defined by the image cue is equivalent to
a Gaussian. 100 experiments were performed with randomly chosen image orientations, cue conflicts
and precisions for the Gaussian cue. In (a) there is no noise in the inputs, in (b) the image cue and the
Gaussian cue are corrupted by noise. (c) Accuracy of the posterior calculated from a likelihood and a
prior. The prediction neuron RFs were modified to incorporate a Gaussian prior centred at an orientation
of 90° and with a standard deviation of 30°. Each dot compares the PC/BC-DIM network’s
estimate of the posterior with the probabilistically optimal estimate obtained by assuming that
the image is equivalent to a Gaussian probability density with standard deviation 11.2°. 100 experiments
were performed with randomly chosen image orientations.
4 Discussion
Recently, Bayesian theories of cognition have been heavily criticised (e.g., Bowers and Davis, 2012; Jones and
Love, 2011; Marcus and Davis, 2013). The PC/BC-DIM model addresses many of these criticisms. For example,
one criticism is that human behaviour is seldom rational and optimal, and hence, there is little evidence that the
brain performs optimal, Bayesian, inference (Bowers and Davis, 2012; Jones and Love, 2011). However, PC/BC-DIM
proposes that the brain is engaged in predictive coding (Clark, 2013; Huang and Rao, 2011; Rao and Ballard,
1999), and that Bayesian inference is just one of the functions that can be achieved by predictive coding. Exact
Bayesian inference may be implemented in the brain, using predictive coding, only in specific circumstances,
allowing people to act optimally when their mental models of the environment are veridical, or to reason and act
optimally with respect to the sub-optimal models that they possess (Jones and Love, 2011). Jones and Love (2011)
point out that “the most substantial part of learning lies in constructing a generative model of one’s environment.”
Building such models is a task that predictive coding is particularly suited for (Clark, 2013).
Another criticism is that Bayesian models are tested using a breadth-first strategy that provides a superficial
explanation of a range of carefully selected tasks, rather than using a depth-first strategy where a model is tested
in detail in a challenging domain (Marcus and Davis, 2013). In contrast, PC/BC-DIM has been tested, in-depth,
as a model of V1 and has been shown to provide a comprehensive account of primary visual cortex function
(Spratling, 2010, 2011, 2012a,c). A further criticism is that different models of Bayesian inference have been used
to simulate different tasks (Marcus and Davis, 2013). In contrast, PC/BC-DIM has been used to simulate a range
of probabilistic inference tasks in the current work and a wide range of other neurophysiological and cognitive
processes in previous work (as listed in the Introduction). An additional criticism of Bayesian models is that
they have so many free parameters (such as the choice of priors, generative model, etc.) that they can simulate any
behaviour, and that these parameters are altered, post hoc, to fit the data (Bowers and Davis, 2012; Jones and Love,
2011; Marcus and Davis, 2013). PC/BC-DIM also has many free parameters, principally the synaptic weights of
the prediction neurons. However, in the PC/BC-DIM model of V1 these parameters were either learnt from natural
images, or were defined to be Gabor-like, and hence, to resemble the RFs of V1 neurons. The parameters were,
therefore, not chosen arbitrarily to fit the data, and were kept fixed across numerous experiments.
A further criticism of Bayesian brain models is that they are defined at the computational level of analysis
and make no predictions about, nor are constrained by, neural mechanisms and psychological processes (Bowers
and Davis, 2012; Jones and Love, 2011). However, previous work has described biologically plausible,
neurally-based implementations of Bayesian inference. For example, several models have proposed how priors can be
combined with likelihoods to calculate posteriors (Ganguli and Simoncelli, 2010, 2014; Girshick et al., 2011;
Shi and Griffiths, 2009); however, these models fail to perform other probabilistic computations such as cue
integration or function approximation. Other models do perform cue integration (Ma et al., 2006) but fail to
perform function approximation and vice versa (Beck et al., 2011). Still other models can perform both cue
integration and function approximation (Deneve et al., 1999, 2001; Latham et al., 2003; Pouget et al., 2003, 2002,
2000, 1998), but fail to calculate the variance of the posterior, are restricted to working with mono-modal Gaussian
distributions, and cannot perform Bayesian inference with a non-uniform prior. This article proposes an alternative
neural implementation of Bayesian inference that overcomes the limitations of these previous methods. The
PC/BC-DIM model is particularly simple while providing a particularly comprehensive account of probabilistic
computation that includes: inference with priors; inference with noisy population codes; hierarchical inference;
inference with more than one stimulus or cause; inference with non-Gaussian stimuli and non-Gaussian RFs;
cue integration; cue segregation; and function approximation. However, there remain a number of limitations of
the PC/BC-DIM model of probabilistic inference. Firstly, in all tasks the response of the reconstruction neurons
represents the posterior, except in cue integration tasks, where the posterior has been encoded by the response of
the reconstruction neurons raised to a power equal to the number of cues. Secondly, the estimate of the variance of
the posterior can be inaccurate, particularly when the PC/BC-DIM algorithm is used to perform certain function
approximations. Thirdly, the PC/BC-DIM model also fails to account for the capacity of the cortex to make fine
distinctions between stimuli using neurons with broadly tuned RFs. Finally, the current paper fails to provide
formal, mathematical, insights into why PC/BC-DIM succeeds in performing Bayesian inference.
Acknowledgements
Thanks to the organisers of, and the participants at, the Lorentz Centre Workshop on Perspectives on Human
Probabilistic Inference (May 2014) for discussions that inspired this work. Thanks also to Kris De Meyer and the
anonymous reviewers for helpful comments on earlier drafts of this paper.
References
Achler, T. (2014). Symbolic neural networks for cognitive capacities. Biologically Inspired Cognitive Architectures, 9(0):71–81.
Achler, T. and Amir, E. (2008). Input feedback networks: Classification and inference based on network structure.
In Wang, P., Goertzel, B., and Franklin, S., editors, Artificial General Intelligence, pages 15–26, Amsterdam,
The Netherlands. IOS Press.
Alger, B. E. (2002). Retrograde signaling in the regulation of synaptic transmission: focus on endocannabinoids.
Progress in Neurobiology, 68(4):247–86.
Alink, A., Schwiedrzik, C. M., Kohler, A., Singer, W., and Muckli, L. (2010). Stimulus predictability reduces
responses in primary visual cortex. The Journal of Neuroscience, 30:2960–6.
Anastasio, T. J., Patton, P. E., and Belkacem-Boussaid, K. (2000). Using Bayes’ rule to model multisensory
enhancement in the superior colliculus. Neural Computation, 12:1165–87.
Anderson, C. H. and Van Essen, D. C. (1994). Neurobiological computational systems. In IEEE World Congress
on Computational Intelligence, pages 213–22.
Ballard, D. H. and Jehee, J. (2012). Dynamic coding of signed quantities in cortical feedback circuits. Frontiers
in Psychology, 3:254.
Barbas, H. and Rempel-Clower, N. (1997). Cortical structure predicts the pattern of corticocortical connections.
Cerebral Cortex, 7:635–46.
Barber, M. J., Clark, J. W., and Anderson, C. H. (2003). Neural representation of probabilistic information. Neural
Computation, 15(8):1843–64.
Barlow, H. B. (1969). Pattern recognition and the responses of sensory neurons. Annals of the New York Academy
of Sciences, 156:872–81.
Barone, P., Batardiere, A., Knoblauch, K., and Kennedy, H. (2000). Laminar distribution of neurons in extrastriate
areas projecting to visual areas V1 and V4 correlates with the hierarchical rank and indicates the operation of a
distance rule. The Journal of Neuroscience, 20:3263–81.
Battaglia, P. W., Jacobs, R. A., and Aslin, R. N. (2003). Bayesian integration of visual and auditory signals for
spatial localization. Journal of the Optical Society of America. A, Optics, Image Science, and Vision, 20(7).
Beck, J., Heller, K., and Pouget, A. (2012). Complex inference in neural circuits with probabilistic population
codes and topic models. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in
Neural Information Processing Systems, volume 25, pages 3059–67. Curran Associates, Inc.
Beck, J. M., Latham, P. E., and Pouget, A. (2011). Marginalization in neural circuits with divisive normalization.
The Journal of Neuroscience, 31(43):15310–9.
Beierholm, U. R., Kording, K. P., Shams, L., and Ma, W. J. (2008). Comparing Bayesian models for multisensory
cue combination without mandatory integration. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors,
Advances in Neural Information Processing Systems, volume 20, pages 81–8. Curran Associates, Inc.
Bowers, J. S. and Davis, C. J. (2012). Bayesian just-so stories in psychology and neuroscience. Psychological
Bulletin, 138(3):389–414.
Branco, T. and Staras, K. (2009). The probability of neurotransmitter release: variability and feedback control at
single synapses. Nature Reviews Neuroscience, 10:373–83.
Brozović, M., Abbott, L. F., and Andersen, R. A. (2008). Mechanism of gain modulation at single neuron and
network levels. Journal of Computational Neuroscience, 25:158–68.
Budd, J. M. L. (1998). Extrastriate feedback to primary visual cortex in primates: a quantitative analysis of
connectivity. Proceedings of the Royal Society of London. Series B, Biological Sciences, 265(1400):1037–44.
Carandini, M. and Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science,
264(5163):1333–6.
Chance, F. S. and Abbott, L. F. (2000). Divisive inhibition in recurrent networks. Network: Computation in Neural
Systems, 11:119–29.
Chater, N., Tenenbaum, J. B., and Yuille, A. (2006). Probabilistic models of cognition: conceptual foundations.
Trends in Cognitive Sciences, 10:287–91.
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral
and Brain Sciences, 36(03):181–204.
Crick, F. and Koch, C. (1998). Constraints on cortical and thalamic projections: the no-strong-loops hypothesis.
Nature, 391:245–50.
De Meyer, K. and Spratling, M. W. (2011). Multiplicative gain modulation arises through unsupervised learning
in a predictive coding model of cortical function. Neural Computation, 23(6):1536–67.
De Meyer, K. and Spratling, M. W. (2013). A model of partial reference frame transforms through pooling of
gain-modulated responses. Cerebral Cortex, 23(5):1230–9.
Deneve, S. (2008). Bayesian spiking neurons I: Inference. Neural Computation, 20(1):91–117.
Deneve, S., Latham, P. E., and Pouget, A. (1999). Reading population codes: a neural implementation of ideal
observers. Nature Neuroscience, 2(8):740–5.
Deneve, S., Latham, P. E., and Pouget, A. (2001). Efficient computation and cue integration with noisy population
codes. Nature Neuroscience, 4(8):826–31.
Deneve, S. and Pouget, A. (2003). Basis functions for object-centered representations. Neuron, 37:347–59.
Egner, T., Monti, J. M., and Summerfield, C. (2010). Expectation and surprise determine neural population
responses in the ventral visual stream. The Journal of Neuroscience, 30(49):16601–8.
Ernst, M. and Jäkel, F. (2003). Learning to combine arbitrary signals from vision and touch. In 4th International
Multisensory Research Forum.
Ernst, M. O. and Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal
fashion. Nature, 415:429–33.
Felleman, D. J. and Van Essen, D. C. (1991). Distributed hierarchical processing in primate cerebral cortex.
Cerebral Cortex, 1:1–47.
Fiser, J., Berkes, P., Orban, G., and Lengyel, M. (2010). Statistically optimal perception and learning: from
behavior to neural representations. Trends in Cognitive Sciences, 14(3):119–30.
Földiák, P. (1993). The ‘ideal homunculus’: statistical inference from neural population responses. In Eeckman,
F. and Bower, J., editors, Computation and Neural Systems: Proceedings of the Computational Neuroscience
Meeting, pages 55–60, London, UK. Kluwer Academic Publishers.
Friston, K. J. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society of London.
Series B, Biological Sciences, 360(1456):815–36.
Gabbiani, F., Krapp, H. G., Koch, C., and Laurent, G. (2002). Multiplicative computation in a visual neuron
sensitive to looming. Nature, 420:320–4.
Ganguli, D. and Simoncelli, E. P. (2010). Implicit encoding of prior probabilities in optimal neural populations. In
Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural
Information Processing Systems, volume 23, pages 658–66. Curran Associates, Inc.
Ganguli, D. and Simoncelli, E. P. (2014). Efficient sensory encoding and Bayesian inference with heterogeneous
neural populations. Neural Computation, 26(10):2103–34.
Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. (1986). Neuronal population coding of movement
direction. Science, 233:1416–9.
Girshick, A., Landy, M., and Simoncelli, E. (2011). Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nature Neuroscience, 14(7):926–32.
Griffiths, T. L., Kemp, C., and Tenenbaum, J. B. (2008). Bayesian models of cognition. In Sun, R., editor,
Cambridge Handbook of Computational Cognitive Modeling. Cambridge University Press, Cambridge, UK.
Griffiths, T. L. and Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science,
17(9):767–73.
Harpur, G. F. (1997). Low Entropy Coding with Unsupervised Neural Networks. PhD thesis, Department of
Engineering, University of Cambridge.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181–97.
Helbig, H. B. and Ernst, M. O. (2007). Optimal integration of shape information from vision and touch. Experimental Brain Research, 179(4):595–606.
Huang, Y. and Rao, R. P. N. (2011). Predictive coding. WIREs Cognitive Science, 2:580–93.
Jacobs, R. A. (1999). Optimal integration of texture and motion cues to depth. Vision Research, 39:3621–9.
Jaffe, D. B. and Carnevale, N. T. (1999). Passive normalization of synaptic integration influenced by dendritic
architecture. Journal of Neurophysiology, 82:3268–85.
Jazayeri, M. and Movshon, J. A. (2006). Optimal representation of sensory information by neural populations.
Nature Neuroscience, 9:690–6.
Johnson, R. R. and Burkhalter, A. (1997). A polysynaptic feedback circuit in rat visual cortex. The Journal of
Neuroscience, 17(18):7129–40.
Jones, M. and Love, B. C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory status and
theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34(4):169–188.
Kersten, D., Mamassian, P., and Yuille, A. (2004). Object perception as Bayesian inference. Annual Review of
Psychology, 55(1):271–304.
Knill, D. C. and Richards, W. (1996). Perception as Bayesian Inference. Cambridge University Press, Cambridge,
UK.
Knill, D. C. and Saunders, J. A. (2003). Do humans optimally integrate stereo and texture information for judgments of surface slant? Vision Research, 43:2539–58.
Koch, C. and Segev, I. (2000). The role of single neurons in information processing. Nature Neuroscience,
3(supplement):1171–7.
Kok, P. and de Lange, F. P. (2015). Predictive coding in sensory cortex. In Forstmann, B. U. and Wagenmakers,
E.-J., editors, An Introduction to Model-Based Cognitive Neuroscience, pages 221–44. Springer, New York,
NY.
Kok, P., Rahnev, D., Jehee, J. F. M., Lau, H. C., and de Lange, F. P. (2012). Attention reverses the effect of
prediction in silencing sensory signals. Cerebral Cortex, 22:2197–206.
Larkum, M. E., Senn, W., and Lüscher, H.-R. (2004). Top-down dendritic input increases the gain of layer 5
pyramidal neurons. Cerebral Cortex, 14(10):1059–70.
Latham, P. E., Deneve, S., and Pouget, A. (2003). Optimal computation with attractor networks. Journal of
Physiology – Paris, 97(4–6):683–94.
Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Leen, T. K., Dietterich,
T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA.
MIT Press.
Lee, T. S. and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical
Society of America. A, Optics, Image Science, and Vision, 20:1434–48.
Lochmann, T. and Deneve, S. (2011). Neural processing as causal inference. Current Opinion in Neurobiology,
21(5):774–81.
Lochmann, T., Ernst, U. A., and Denève, S. (2012). Perceptual inference predicts contextual modulations of
sensory responses. The Journal of Neuroscience, 32(12):4179–95.
London, M. and Häusser, M. (2005). Dendritic computation. Annual Review of Neuroscience, 28:503–32.
Ma, W. J. (2012). Organising probabilistic models of perception. Trends in Cognitive Sciences, 16(10):511–8.
Ma, W. J., Beck, J., Latham, P. E., and Pouget, A. (2006). Bayesian inference with probabilistic population codes.
Nature Neuroscience, 9(11):1432–8.
Ma, W. J., Beck, J. M., and Pouget, A. (2008). Spiking networks for Bayesian inference and choice. Current
Opinion in Neurobiology, 18(2):217–22.
Ma, W. J. and Jazayeri, M. (2014). Neural coding of uncertainty and probability. Annual Review of Neuroscience,
37:205–20.
Ma, W. J. and Pouget, A. (2008). Linking neurons to behavior in multisensory perception: A computational
review. Brain Research, 1242(0):4–12.
Marcus, G. F. and Davis, E. (2013). How robust are probabilistic models of higher-level cognition? Psychological
Science, 24(12):2351–60.
Markov, N. T., Vezoli, J., Chameau, P., Falchier, A., Quilodran, R., Huissoud, C., Lamy, C., Misery, P., Giroud, P.,
Ullman, S., Barone, P., Dehay, C., Knoblauch, K., and Kennedy, H. (2014). Anatomy of hierarchy: Feedforward
and feedback pathways in macaque visual cortex. Journal of Comparative Neurology, 522(1):225–59.
Mehaffey, W. H., Doiron, B., Maler, L., and Turner, R. W. (2005). Deterministic multiplicative gain control with
active dendrites. The Journal of Neuroscience, 25:9968–77.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6:1031–85.
Mitchell, S. J. and Silver, R. A. (2003). Shunting inhibition modulates neuronal gain during synaptic excitation.
Neuron, 38(3):433–45.
Mountcastle, V. B. (1998). Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, Cambridge,
MA.
Murphy, B. K. and Miller, K. D. (2003). Multiplicative gain changes are induced by excitation or inhibition alone.
The Journal of Neuroscience, 23:10040–51.
Olsen, S. R., Bortone, D. S., Adesnik, H., and Scanziani, M. (2012). Gain control by layer six in cortical circuits
of vision. Nature, 483:47–52.
Phillips, W. A. (2016). On the cognitive functions of intracellular mechanisms for contextual amplification. Brain
and Cognition, (in press).
Pouget, A., Beck, J. M., Ma, W. J., and Latham, P. E. (2013). Probabilistic brains: knowns and unknowns. Nature
Neuroscience, 16:1170–8.
Pouget, A., Dayan, P., and Zemel, R. S. (2003). Inference and computation with population codes. Annual Review
of Neuroscience, 26:381–410.
Pouget, A., Deneve, S., and Duhamel, J. R. (2002). A computational perspective on the neural basis of multisensory spatial representations. Nature Reviews Neuroscience, 3:741–7.
Pouget, A. and Sejnowski, T. J. (1994). A neural model of the cortical representation of egocentric distance.
Cerebral Cortex, 4(3):314–29.
Pouget, A. and Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal
of Cognitive Neuroscience, 9(2):222–37.
Pouget, A. and Snyder, L. (2000). Computational approaches to sensorimotor transformations. Nature Neuroscience, 3(supplement):1192–8.
Pouget, A., Zemel, R. S., and Dayan, P. (2000). Information processing with population codes. Nature Reviews
Neuroscience, 2:125–32.
Pouget, A., Zhang, K., Deneve, S., and Latham, P. E. (1998). Statistically efficient estimation using population
coding. Neural Computation, 10:373–401.
Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some
extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87.
Rao, R. P. N., Olshausen, B. A., and Lewicki, M. S., editors (2002). Probabilistic Models of the Brain: Perception
and Neural Function. MIT Press, Cambridge, MA.
Reynolds, J. H. and Chelazzi, L. (2004). Attentional modulation of visual processing. Annual Review of Neuroscience, 27:611–47.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan Books, Washington, DC.
Rothman, J., Cathala, L., Steuber, V., and Silver, R. A. (2009). Synaptic depression enables neuronal gain control.
Nature, 457:1015–8.
Rumelhart, D. E., McClelland, J. L., and The PDP Research Group, editors (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press, Cambridge,
MA.
Sahani, M. and Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Computation, 15(10):2255–79.
Salinas, E. and Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. The Journal
of Neuroscience, 15:6461–74.
Salinas, E. and Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings
of the National Academy of Sciences USA, 93:11956–61.
Salinas, E. and Sejnowski, T. J. (2001). Gain modulation in the central nervous system: where behavior, neurophysiology and computation meet. The Neuroscientist, 7(5):430–40.
Salinas, E. and Thier, P. (2000). Gain modulation: a major computational principle of the central nervous system.
Neuron, 27:15–21.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. Journal of
Neurophysiology, 76(4):2790–2793.
Seilheimer, R. L., Rosenberg, A., and Angelaki, D. E. (2014). Models and processes of multisensory cue combination. Current Opinion in Neurobiology, 25:38–46.
Shams, L. and Beierholm, U. R. (2010). Causal inference in perception. Trends in Cognitive Sciences, 14(9):425–
32.
Sherman, S. M. (2016). Thalamus plays a central role in ongoing cortical functioning. Nature Neuroscience,
19(4):533–41.
Sherman, S. M. and Guillery, R. W. (1998). On the actions that one nerve cell can have on another: distinguishing
“drivers” from “modulators”. Proceedings of the National Academy of Sciences USA, 95:7121–6.
Shi, L. and Griffiths, T. L. (2009). Neural implementation of hierarchical Bayesian inference by importance
sampling. In Advances in Neural Information Processing Systems, volume 22, pages 1669–77. Curran Associates,
Inc.
Shipp, S. (2004). The brain circuitry of attention. Trends in Cognitive Sciences, 8(5):223–30.
Smith, F. W. and Muckli, L. (2010). Nonstimulated early visual areas carry information about surrounding context.
Proceedings of the National Academy of Sciences USA, 107(46):20099–103.
Solbakken, L. L. and Junge, S. (2011). Online parts-based feature discovery using competitive activation neural
networks. In Proceedings of the International Joint Conference on Neural Networks, pages 1466–73.
Spratling, M. W. (2008a). Predictive coding as a model of biased competition in visual selective attention. Vision
Research, 48(12):1391–408.
Spratling, M. W. (2008b). Reconciling predictive coding and biased competition models of cortical function.
Frontiers in Computational Neuroscience, 2(4):1–8.
Spratling, M. W. (2010). Predictive coding as a model of response properties in cortical area V1. The Journal of
Neuroscience, 30(9):3531–43.
Spratling, M. W. (2011). A single functional model accounts for the distinct properties of suppression in cortical
area V1. Vision Research, 51(6):563–76.
Spratling, M. W. (2012a). Predictive coding accounts for V1 response properties recorded using reverse correlation. Biological Cybernetics, 106(1):37–49.
Spratling, M. W. (2012b). Predictive coding as a model of the V1 saliency map hypothesis. Neural Networks,
26:7–28.
Spratling, M. W. (2012c). Unsupervised learning of generative and discriminative weights encoding elementary
image components in a predictive coding model of cortical function. Neural Computation, 24(1):60–103.
Spratling, M. W. (2013a). Distinguishing theory from implementation in predictive coding accounts of brain
function [commentary]. Behavioral and Brain Sciences, 36(3):231–2.
Spratling, M. W. (2013b). Image segmentation using a sparse coding model of cortical area V1. IEEE Transactions
on Image Processing, 22(4):1631–43.
Spratling, M. W. (2014). A single functional model of drivers and modulators in cortex. Journal of Computational
Neuroscience, 36(1):97–118.
Spratling, M. W. (2016). A review of predictive coding algorithms. Brain and Cognition, (in press).
Spratling, M. W., De Meyer, K., and Kompass, R. (2009). Unsupervised learning of overlapping image components using divisive input modulation. Computational Intelligence and Neuroscience, 2009(381457):1–19.
Spratling, M. W. and Johnson, M. H. (2003). Exploring the functional significance of dendritic inhibition in
cortical pyramidal cells. Neurocomputing, 52-54:389–95.
Spruston, N. (2008). Pyramidal neurons: dendritic structure and synaptic integration. Nature Reviews Neuroscience, 9:206–21.
Spruston, N. and Kath, W. L. (2004). Dendritic arithmetic. Nature Neuroscience, 7(6):567–9.
Stuart, G. and Häusser, M. (2001). Dendritic coincidence detection of EPSPs and action potentials. Nature
Neuroscience, 4(1):63–71.
Summerfield, C. and Egner, T. (2009). Expectation (and attention) in visual cognition. Trends in Cognitive
Sciences, 13(9):403–9.
Summerfield, C., Egner, T., Mangels, J., and Hirsch, J. (2006). Mistaking a house for a face: Neural correlates of
misperception in healthy humans. Cerebral Cortex, 16(4):500–8.
Vilares, I. and Kording, K. (2011). Bayesian models: the structure of the world, uncertainty, behavior, and the
brain. Annals of the New York Academy of Sciences, 1224(1):22–39.
Wacongne, C., Labyt, E., van Wassenhove, V., Bekinschtein, T., Naccache, L., and Dehaene, S. (2011). Evidence
for a hierarchy of predictions and prediction errors in human cortex. Proceedings of the National Academy of
Sciences USA, 108(51):20754–9.
Yuille, A. and Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive
Sciences, 10(7):301–8.
Zemel, R. S., Dayan, P., and Pouget, A. (1997). Probabilistic interpretation of population codes. In Mozer, M. C.,
Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems, volume 9, pages
676–84. MIT Press.
Zemel, R. S., Dayan, P., and Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2):403–30.