Multinomial Bayesian learning for modeling
classical and non-classical receptive field properties
Haruo Hosoya
Brain Science Institute, RIKEN
PRESTO, Japan Science and Technology Agency
Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan
[email protected]
Keywords: Bayesian inference, natural image learning, simple cells, V2, divisive normalization, filling-in
Abstract
We study the interplay between Bayesian inference and natural image learning in a hierarchical
vision system, in relation to the response properties of early visual cortex. We particularly focus on
a Bayesian network with multinomial variables that can represent discrete feature spaces similar to
hypercolumns combining minicolumns, enforce sparsity of activation to learn efficient representations,
and explain divisive normalization. We demonstrate that maximal-likelihood learning using sampling-based Bayesian inference gives rise to classical receptive field properties similar to V1 simple cells and V2
cells, while inference performed on the trained network yields non-classical context-dependent response
properties such as cross-orientation suppression and filling-in. Comparison with known physiological
properties reveals some qualitative and quantitative similarities.
1 Introduction
Bayesian inference and natural image learning are two important computations suggested in visual cortex.
On the one hand, Bayesian inference can capture interactions between bottom-up information from the sensory system and top-down information from the central control system, thereby explaining various complex
behaviors of the brain such as contextual and attentional effects (Lee and Mumford, 2003; Kersten and
Yuille, 2003; Doya et al., 2007; George and Hawkins, 2005; Rao, 2005; Chikkerur et al., 2010; Hosoya, 2010)
and multi-cue and multisensory integration (Rao et al., 2002; Deneve, 2004). On the other hand, natural
image learning gives rise to the receptive field properties of various visual areas including V1 (Olshausen and
Field, 1996; Olshausen and Field, 2003; Bell and Sejnowski, 1997; van Hateren and van der Schaaf, 1998;
Hoyer and Hyvärinen, 2000; Olshausen, 2002; Hurri and Hyvärinen, 2003; Spratling, 2011), V2 (Hoyer and
Hyvärinen, 2002b; Lee et al., 2008), MT (Cadieu and Olshausen, 2009), and MST (Park et al., 2000), from
the statistical properties hidden in sensory inputs. Our aim in this research is to study the mutual roles of
the two types of computation in a hierarchical vision system, in relation to the classical and non-classical
receptive field properties of early visual cortex.
We particularly focus on a multi-layer model using a multinomial Bayesian network. The multinomial
approach has been taken in several model studies of visual cortex (George and Hawkins, 2005; Rao, 2005;
Ichisugi, 2007; Röhrbein et al., 2007; Chikkerur et al., 2010; Hosoya, 2010). In these, multinomial variables
have typically played the “structural” role of bundling a number of discrete states representing individual
visual features so that the variable itself forms a discrete-finite feature space, much like hypercolumns gathering dozens of minicolumns as found in visual cortex of higher mammals (Mountcastle, 1997; Buxhoeveden
and Casanova, 2002; Tanaka, 2003). Multinomial variables in the present work are also endowed with two
“computational” roles.
The first is to enforce sparsity. We use a method of maximal-likelihood learning that alternately repeats
sampling-based Bayesian inference and weight updates. This is a simple extension of a classical method for
binary Bayesian networks (Neal, 1992). However, an efficient representation does not emerge in a binary
system unless sparsity is required. Our approach is to make use of the property that a multinomial variable
has many possible states and therefore the activation of each state is forced to be sparse. As a result,
our learning method applied to a three-layered multinomial Bayesian network successfully yielded oriented
Gabor filters similar to V1 simple cells in the second layer, and combinations of orientations like collinear or
parallel lines, contours, and angles similar to V2 in the third layer. Quantitative analyses of these receptive
field properties also revealed some similarities to the experimental data (DeValois et al., 1982; Parker and
Hawken, 1988; Ringach, 2002; Anzai et al., 2007).
The second role of multinomial variables is to reproduce divisive normalization phenomena (Heeger,
1992). That is, since a multinomial variable represents a probability distribution over a discrete-finite set of
features, the probability of each state being activated is automatically normalized. This effect is in parallel
with the property of actual neurons whose activations are adjusted as if they sum up to a constant. Indeed,
this observation has been exploited to explain attention-driven normalization in prior multinomial Bayesian
models (Rao, 2005; Chikkerur et al., 2010). The present work shows that cross-orientation suppression
(Bonds, 1989), which is another example of divisive normalization, can also be explained by the same
mechanism.
It is of great interest whether our network with the learned hierarchical representations can reproduce a
non-trivial physiological effect involving both feedforward and feedback processing. For this reason, we have
conducted a simulation of the filling-in phenomenon (Matsumoto and Komatsu, 2005), in which some cells
in V1 respond as if they recover a missing visual attribute in the blind spot from the surrounds. We explain
this phenomenon by Bayesian inference involving feedback processing, in which the high-level representation
learned in the top layer is used for estimating the orientation of the missing part of the stimulus to be similar
to the one appearing outside the receptive field.
Our model is most closely related to predictive coding (Rao and Ballard, 1999) in the sense that they
have also studied the interplay between Bayesian inference and natural image learning in relation to the
classical and non-classical receptive field properties of early visual cortex. However, there are several crucial
differences in the way information is encoded and processed; for example, they consider additive or subtractive
interaction between bottom-up and top-down signals, whereas we take multiplicative interaction; also, they
interpret neural activities as values of continuous states, whereas we regard them as probabilities of discrete
states. The relevance of the predictive coding model and other models will be thoroughly discussed in
Section 5.
The rest of this article is organized as follows. The next section introduces our computational model,
defines the structure, and explains the inference and learning methods. Then, after describing some detailed
settings (Section 3), we present the results of our simulations of learning and inference in comparison with
the experimental data (Section 4). Finally, we discuss related work (Section 5) and conclude the article with
some future possibilities (Section 6). The Appendix gives a formal derivation of our learning method.
2 Computational model
2.1 Network Structure
Let us consider a Bayesian network (directed acyclic graph) consisting of a set X of random variables (nodes).
We divide the set X of nodes into two disjoint sets, visible nodes V and hidden nodes H, where the visible
nodes have no children. We assume that all hidden nodes are multinomial variables ranging over discrete-finite states, while all visible nodes are continuous variables ranging over real numbers. The latter is because pixel values in gray-scale images are given to the visible nodes. An example of such a network is given in
Figure 1(a). We write st(X), pa(X), and ch(X) for the set of states, the set of parent nodes, and the set of
children of a given node X.
As usual, we postulate that the joint distribution of all variables can be factorized to the product of local conditional probabilities P(X|pa(X)) of each variable X given its parents:

$$P(\mathbf{X}) = \prod_{X \in \mathbf{X}} P(X \mid \mathrm{pa}(X)) \qquad (1)$$
We further assume that the conditional probability distribution P(X|pa(X)) is defined parametrically in
terms of a Softmax function for a hidden node and a Gaussian function for a visible node:
• For a hidden variable H, we assume a weight matrix {wh,u }h∈st(H),u∈st(U ) of real values for each
parent variable U , where the matrix elements are indexed by a child state h and a parent state u; see
Figure 1(b, up). Then, the following defines the probability of a hidden variable H to have state h
provided that its parent nodes pa(H) = {U1 , . . . , Up } have states u1 , . . . , up :
$$P(h \mid u_1, \ldots, u_p) = \frac{\exp\left(\sum_{j=1}^{p} w_{h,u_j}\right)}{\sum_{h' \in \mathrm{st}(H)} \exp\left(\sum_{j=1}^{p} w_{h',u_j}\right)} \qquad (2)$$
That is, when the states u1 , . . . , up of the parents are given, a state h of H is more likely to be chosen
when the summation of the weights wh,u1 , . . . , wh,up has a larger value. Intuitively, a “lower-level”
feature h is likely to be active when it is strongly supported by either of the “higher-level” features
u1 , . . . , up . The Softmax definition arises from forming a probability distribution out of the real-valued
summation of weights (exponentiation for making it positive, followed by normalization). We adopt
real-valued weights here for representing both excitatory and suppressive effects.
• For a visible variable V , we assume a weight vector {wuV }u∈st(U ) of real values for each parent variable
U , where the vector elements are indexed by a parent state u (but not a child state v, which is
continuous); see Figure 1(b, down). Then, the following defines the probability of a visible variable V
to have state v provided that its parent nodes pa(V ) = {U1 , . . . , Up } have states u1 , . . . , up :
$$P(v \mid u_1, \ldots, u_p) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(v - \sum_{j=1}^{p} w^V_{u_j}\right)^2}{2\sigma^2}\right) \qquad (3)$$
for a fixed hyperparameter σ. Intuitively, when the states u1 , . . . , up of the parents are given, a state
v of V is more likely to be chosen when the summation of the weights wuV1 , . . . , wuVp is closer to the
value v.
We assume that the prior P (X) of a root node (i.e., a node with no parent) is uniform. From now on, we
denote by w the vector of all the weights in the entire network, and since the distribution P(X) depends on
the weights, we will instead write P(X|w).
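For illustration, the two conditional distributions in equations (2) and (3) can be evaluated as in the following sketch (Python/NumPy; the array layouts and function names are assumptions made for this example, not part of the original implementation):

```python
import numpy as np

def hidden_conditional(weight_mats, parent_states):
    """P(h | u_1, ..., u_p) for one hidden node, as in equation (2).

    weight_mats: one array per parent U_j, with weight_mats[j][h, u] = w_{h,u}.
    parent_states: the sampled state u_j of each parent (one integer per parent).
    Returns a Softmax probability vector over the states of the hidden node.
    """
    score = sum(w[:, u] for w, u in zip(weight_mats, parent_states))  # sum_j w_{h,u_j}
    score = score - score.max()                                       # numerical stability
    p = np.exp(score)
    return p / p.sum()

def visible_conditional_mean(weight_vecs, parent_states):
    """Mean of the Gaussian P(v | u_1, ..., u_p) in equation (3): sum_j w^V_{u_j}.

    weight_vecs: one vector per parent U_j, with weight_vecs[j][u] = w^V_u.
    """
    return sum(w[u] for w, u in zip(weight_vecs, parent_states))
```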
It is important to emphasize that it is a state rather than a variable that represents each individual
visual feature, unlike the more conventional approach using continuous or binary variables. Thus, in our
setting, a multinomial variable represents a “feature space” that combines a set of features that are related
to each other. For this reason, we sometimes call a state a unit in the sequel. We can find a parallel between
such nested feature representation and the hypercolumn and minicolumn structures known in visual cortex of
higher mammals; a minicolumn is a set of vertically organized neurons with similar functions (visual features)
and a hypercolumn is a cluster of dozens of minicolumns that have different but closely related functions
(Mountcastle, 1997; Buxhoeveden and Casanova, 2002; Tanaka, 2003). However, it should be stressed that
the model presented here is only functional and its neural implementability is an open question.
2.2 Bayesian inference
Having defined the Bayesian network in Section 2.1, we intend to perform hierarchical Bayesian inference to
obtain the posterior of some variables given an input. For example, such inference can yield the posterior
distribution P (H|v) of all hidden variables H given a concrete input v. If we are interested in a particular
state a of a particular hidden variable h, then the inference can also give the marginal posterior probability P(h = a | v), which we relate with a neural firing rate. Note that this approach contrasts with prior work in which the values of continuous variables, rather than their probabilities, are related with neural activities (Rao and Ballard, 1999).

Figure 1: (a) An example of a multinomial Bayesian network. The visible nodes (bottom layer) are continuous and defined by a Gaussian function. The hidden nodes (upper layers) are multinomial and defined by a Softmax function. (b) A weight is assumed between each hidden state and each parent state (up) or between each visible node and each parent state (down). Such weights are used for defining conditional probabilities.
Of the various inference methods of Bayesian networks, we adopt Gibbs-sampling (Bishop, 2006). While
prior work has typically used belief propagation (George and Hawkins, 2005; Rao, 2005; Ichisugi, 2007;
Röhrbein et al., 2007; Chikkerur et al., 2010; Hosoya, 2010), we do not take this approach since belief
propagation can compute only marginal probabilities of a single variable (Pearl, 1997), while Gibbs-sampling
can compute joint probabilities of multiple variables, which are crucial in our case since the update rules of
our learning algorithm take into account the dependencies among different variables (Section 2.3).
To briefly introduce the method used here, recall first how Gibbs-sampling works in general. Let us fix
a configuration v̂ for the visibles V. At time t = 0, we start with some “initial” configuration ĥ(0) for
the hidden variables H. Then, at a subsequent time t ≥ 1, we take a “course” of sampling for all hidden
variables H1, H2, . . . , HN, one after another:

$$\begin{aligned}
\hat{h}_1(t) &\sim P(H_1 \mid \hat{\mathbf{v}}, \hat{h}_2(t-1), \ldots, \hat{h}_N(t-1), \mathbf{w}),\\
\hat{h}_2(t) &\sim P(H_2 \mid \hat{\mathbf{v}}, \hat{h}_1(t), \hat{h}_3(t-1), \ldots, \hat{h}_N(t-1), \mathbf{w}),\\
&\;\;\vdots\\
\hat{h}_N(t) &\sim P(H_N \mid \hat{\mathbf{v}}, \hat{h}_1(t), \hat{h}_2(t), \ldots, \hat{h}_{N-1}(t), \mathbf{w}).
\end{aligned}$$
It is known that, by repeating this procedure, the sampling asymptotically approaches the actual posterior distribution P(H|v̂, w). Then, note that, in each step, we need to compute the distribution P(H | V, H \ H, w) for given values of V and H \ H. This can be obtained by using property (1) of Bayesian networks:

$$P(H \mid \mathbf{V}, \mathbf{H} \setminus H, \mathbf{w}) = \frac{1}{A}\, P(H \mid \mathrm{pa}(H), \mathbf{w}) \prod_{X \in \mathrm{ch}(H)} P(X \mid \mathrm{pa}(X), \mathbf{w}) \qquad (4)$$

where the normalizing constant $A = \sum_{H} P(H \mid \mathrm{pa}(H), \mathbf{w}) \prod_{X \in \mathrm{ch}(H)} P(X \mid \mathrm{pa}(X), \mathbf{w})$ (where the set pa(X)
in each P(X|pa(X), w) may include the variable H). Note here that the computation in equation (4) is
local since it involves only the variables in H’s Markov blanket, that is, the children, the parents, and the
children’s parents except H itself (Bishop, 2006). However, we do not claim any biological plausibility here;
relevantly, Gibbs-sampling seems difficult to implement in cortex since this method requires samples to be taken in a sequential manner from variable to variable.
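To make the sequential nature of the procedure concrete, a minimal sketch of one sampling course is given below (Python; the helper `local_conditional`, which evaluates equation (4) from a node's Markov blanket, and the data structures are assumptions of this sketch):

```python
import numpy as np

def gibbs_course(hidden_nodes, state, rng, local_conditional):
    """One course of Gibbs sampling: resample every hidden node in turn.

    hidden_nodes: ordered list of hidden node identifiers H_1, ..., H_N.
    state: dict mapping each node to its current value; visible nodes stay clamped.
    local_conditional(node, state): probability vector over st(node), equation (4).
    """
    for node in hidden_nodes:
        p = local_conditional(node, state)       # P(H | V, H \ H, w) from the Markov blanket
        state[node] = rng.choice(len(p), p=p)    # draw the new state of this node
    return state

# Repeating gibbs_course many times yields samples that asymptotically follow
# the posterior P(H | v̂, w); early "burn-in" courses are usually discarded.
```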
2.3 Unsupervised learning
Assume that inputs given to the visible nodes V follow an external probability distribution P∗ (V). The
learning searches for a generative model reflecting the distribution of input data as precisely as possible.
Thus, the objective function is the expected log likelihood,
$$\mathcal{F}(\mathbf{w}) = E_*[\log P(\mathbf{V} \mid \mathbf{w})], \qquad (5)$$
where E∗ [·] denotes the expectation taken under the external distribution P∗ . The algorithm can be derived
as a stochastic gradient learning method on the objective function and formulated as an alternate iteration
of two steps.
Inference Take a random sample v̂ from the distribution P∗ (V); then, take a random sample ĥ from the
posterior distribution P(H|v̂, w) by using Gibbs-sampling as explained in Section 2.2. Note that a
complete configuration for all variables is thus given. We repeat this process several times.
Update For each pair of a hidden node H and a parent U ∈ pa(H), update each weight w_{h,u} by

$$\Delta w_{h,u} \propto \left\langle \delta_{u,\hat{u}} \left( \delta_{h,\hat{h}} - P(h \mid \mathrm{pa}(\hat{h})) \right) \right\rangle. \qquad (6)$$

Also, for each pair of a visible node V and a parent U ∈ pa(V), update each weight w^V_u by

$$\Delta w^V_u \propto \left\langle \frac{1}{\sigma^2}\, \delta_{u,\hat{u}} \left( \hat{v} - \sum_{u_j \in \mathrm{pa}(\hat{v})} w^V_{u_j} \right) \right\rangle. \qquad (7)$$

In both, δ·,· denotes Kronecker's delta, ⟨·⟩ is the average over the samples taken in the inference step, û and v̂ each denote the states of U and V in each sample, and pa(v̂) denotes the set of the states of the parent nodes pa(V) in each sample.
A formal derivation is given in the Appendix. Note that the update rules (6) and (7) involve information
only from proximal nodes. The above learning algorithm is a conservative extension of the algorithm for
binary networks presented in (Neal, 1992).
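Collecting the sampled configurations, the two rules can be applied as in the sketch below (Python/NumPy; a single hidden-node/parent pair and a single visible-node/parent pair, with hypothetical sample records):

```python
import numpy as np

def update_hidden_weight(w, samples, lr):
    """Rule (6) for the weight matrix w[h, u] between a hidden node H and a parent U.

    Each sample provides 'u' (sampled parent state), 'h' (sampled child state), and
    'p_h' (the vector P(h | pa(ĥ)) under the sampled states of all parents of H).
    """
    dw = np.zeros_like(w)
    for s in samples:
        one_hot = np.zeros(w.shape[0])
        one_hot[s['h']] = 1.0                 # delta_{h,ĥ}
        dw[:, s['u']] += one_hot - s['p_h']   # delta_{u,û} (delta_{h,ĥ} - P(h | pa(ĥ)))
    w += lr * dw / len(samples)               # <.> is the average over samples

def update_visible_weight(w_v, samples, lr, inv_sigma2):
    """Rule (7) for the weight vector w_v[u] between a visible node V and a parent U.

    Each sample provides 'u' (sampled parent state), 'v' (observed pixel value), and
    'recon' (the sum of w^V over the sampled states of all parents of V).
    """
    dw = np.zeros_like(w_v)
    for s in samples:
        dw[s['u']] += inv_sigma2 * (s['v'] - s['recon'])
    w_v += lr * dw / len(samples)
```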
In our simulation presented later, we apply the commonly used technique that trains each layer one after
another rather than simultaneously training all layers. The intuitive reason it works is as follows. When a
layer undergoes training, no top-down influence is exerted on the nodes of this layer since all the parameters
for the upper layers are fixed to zero. Therefore, the nodes in the trained layer are statistically independent
of each other by property (1) of Bayesian networks. This fact, together with the sparsity enforced by
multinomial variables, leads to an efficient feature representation.
3 Simulation Setting
For simulation, we employed the three-layer Bayesian network illustrated in Figure 2 that mimics early visual
cortex.
Layer 0 The bottom layer, called layer 0, has 24 × 24 visible (continuous) nodes that are arranged in
rectangular form. Inputs of image patches will be given to this layer.
Layer 1 The middle layer, called layer 1, has 3 × 3 nodes that are also arranged rectangularly. Each node
has 100 discrete states. Each layer-1 node is connected to a set of 12 × 12 layer-0 nodes such that two
adjacent layer-1 nodes have an overlapping set of subordinate layer-0 nodes; side-by-side adjacent ones
have 12 × 6 or 6 × 12 overlapping nodes and diagonally adjacent ones have 6 × 6 overlapping nodes.
This layer will be compared to V1 simple cells.
Layer 2 The top layer, called layer 2, has 4 nodes, each having 100 states. Each layer-2 node is connected to all layer-1 nodes. This layer will be compared to V2 cells.

Figure 2: The architecture of the network used in the simulation. Layer 0 has 24 × 24 continuous nodes. Layer 1 has 3 × 3 nodes (multinomial variables, with each depicted as a square subregion) each with 100 states (units, with each depicted as a circle in the side box). Layer 2 has 4 nodes each with 100 states. Layer-2 nodes are connected to all layer-1 nodes, while layer-1 nodes are connected to overlapping sets of 12 × 12 layer-0 nodes.
Note that, from these connectivities, layer-2 nodes can have receptive fields covering the whole visual field,
while layer-1 nodes can have smaller receptive fields partially overlapping with each other.
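For concreteness, the layer sizes and connectivity described above can be summarized as a small configuration (the field names and the helper below are illustrative, not taken from the original code):

```python
# A compact summary of the simulated architecture (hypothetical field names).
ARCHITECTURE = {
    "layer0": {"shape": (24, 24), "type": "continuous"},        # image pixels
    "layer1": {"grid": (3, 3), "states_per_node": 100,          # compared to V1 simple cells
               "patch": (12, 12), "stride": 6},                 # overlapping 12 x 12 inputs
    "layer2": {"nodes": 4, "states_per_node": 100,              # compared to V2 cells
               "connects_to": "all layer-1 nodes"},
}

def layer1_patch_origin(i, j, stride=6):
    """Top-left corner of the 12 x 12 layer-0 patch feeding layer-1 node (i, j).

    With a stride of 6, side-by-side neighbours share a 12 x 6 (or 6 x 12) band and
    diagonally adjacent nodes share a 6 x 6 block, as stated in the text.
    """
    return (i * stride, j * stride)
```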
We used the dataset of gray-scale natural images provided by Olshausen as inputs¹. The images were
preprocessed with the whitening and low-pass filter prescribed by (Olshausen and Field, 2003). The resulting
images had pixel intensities ranging over real numbers, with zero indicating gray. In each learning cycle, we
extracted a 24 × 24 patch from a random position in an image, skipping those with variances of less than 0.1
to speed up learning. Then, we gave the pixel intensities in the image patch to the visible nodes. For each
image patch, we took 10 samples from the posterior in the inference step. We updated the weights after
every 10 cycles of this process.
As already mentioned, we employed layer-by-layer learning. That is, after initializing the network (setting
all weights to zero), we first let layer 1 learn alone (fixing the weights between layers 1 and 2) and, after it
settled, let layer 2 start to learn (fixing the weights between layers 0 and 1). The first phase spent about 3M
cycles with a learning rate of 0.02. The second phase spent about 5M cycles with a learning rate of 5.0 for
the first 3M cycles and 1.0 for the rest. We fixed 1/σ² = 4.5 throughout the simulation.
¹ Available through http://redwood.berkeley.edu/bruno/sparsenet/.
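A rough sketch of one learning cycle under these settings is given below (Python/NumPy; the `model.infer` and `model.accumulate` methods are hypothetical stand-ins for the Gibbs-sampling inference and the statistics gathered for the weight update):

```python
import numpy as np

def learning_cycle(images, model, rng, n_samples=10, var_threshold=0.1):
    """One cycle: draw a 24 x 24 patch, run inference, and accumulate update statistics."""
    while True:
        img = images[rng.integers(len(images))]                 # whitened natural image
        y = rng.integers(img.shape[0] - 24 + 1)
        x = rng.integers(img.shape[1] - 24 + 1)
        patch = img[y:y + 24, x:x + 24]
        if patch.var() >= var_threshold:                        # skip nearly uniform patches
            break
    for _ in range(n_samples):                                  # 10 posterior samples per patch
        sample = model.infer(patch)
        model.accumulate(sample)
    # In the simulation, the weights are actually updated only after every 10 such cycles.
```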
4 Simulation Results
This section presents the results of our simulations using the methods given in Sections 2 and 3. Sections 4.1
and 4.2 show the hierarchical feature representations learned in layers 1 and 2 with quantitative comparisons
with the experimental data; Section 4.3 gives the result of a simple procedure to confirm that the learned
representations were reasonable. Sections 4.4 and 4.5 present the simulations of cross-orientation suppression
and filling-in phenomena using Bayesian inference on the trained network.
4.1 Layer 1 representations
Figure 3(a) shows part of the basis images represented in layer 1. Each block of 100 consecutive image
patches (“basis images”) corresponds to the 100 units of a layer-1 multinomial variable. Each image patch
displays the weights {wuV }V ∈ch(U ) from a layer-1 unit u to all its subordinate layer-0 nodes V . Since each
layer-1 unit is connected to 12 × 12 layer-0 nodes, the basis image has 12 × 12 pixels. Only the basis images
for the units of three multinomial variables are shown (300 units out of a total of 900 units); the remainder
are quite similar. We can see that the basis images resemble Gabor filters and have much variety in position,
orientation, spatial frequency, phase, and bandwidth; this is qualitatively similar to typical receptive field
shapes of V1 simple cells (Hubel and Wiesel, 1968; Jones and Palmer, 1987).
In principle, the basis images should not directly be compared with the receptive fields since the former
are internal representations while the latter are response properties. For this reason, in other model studies,
the receptive fields have been estimated as the inverse of the weight matrix in independent component
analysis (ICA) (van Hateren and van der Schaaf, 1998) or by performing reverse correlation in sparse coding
(Olshausen, 2002); in these cases, the receptive fields looked slightly different from the basis images. In
our case, however, the receptive fields and the bases were almost identical. We used reverse correlation to
estimate the receptive field profile of each layer-1 unit. Specifically, we approximately computed for each
unit u,
$$\mathbf{V}_{\mathrm{RF}} = \int \mathbf{V}\, P(u \mid \mathbf{V}, \mathbf{w})\, P_{\mathrm{rand}}(\mathbf{V})\, d\mathbf{V}, \qquad (8)$$
where P(u|V, w) is the posterior probability of the unit u given the input image V in the generative model
(the trained network), and Prand (V) denotes the distribution of Gaussian random images (with variance
1). In short, VRF gives the average of random images that activate the unit u. We approximated this by
using samples of random images and unit activations, where the latter were obtained through the Gibbs-sampling-based inference explained in Section 2.2. Figure 3(b) shows the estimated receptive field profiles
(using 50,000 image samples with 20 courses of Gibbs-sampling for each image) for ten sample units, in
comparison to their basis images. For these units, the shapes of both profiles look very similar, except that
the receptive field is localized in accordance with the spatial arrangement of the subordinate visible nodes.
In general, for all layer-1 units with clear basis shapes, the basis image and the receptive field profiles were
essentially identical at the level of pixel intensity. More precisely, we first judged 72 units with the maximal
pixel value smaller than 0.1 as having unclear basis shapes. For the remaining 828 units, we normalized the
pixel values in the basis image by dividing them by the maximal value and did the same on the pixel values
in the receptive field profile (properly clipped to the size of the basis image). We plotted all the normalized
pixel values y in the receptive field profiles of all units against the corresponding normalized pixel values x
in the basis images. Since it well fitted the linear function y = 0.92x + 0.0015 with R2 = 0.92, we concluded
that our basis images were effectively identical to the receptive field profiles. Therefore we will make no
distinction between these two in the sequel.
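The Monte-Carlo approximation of equation (8) can be sketched as follows (Python/NumPy; `sample_activation` is a hypothetical stand-in for the Gibbs-sampling-based inference returning whether the unit is active in a posterior sample):

```python
import numpy as np

def estimate_rf(unit, sample_activation, n_images=50000, image_shape=(24, 24), seed=0):
    """Estimate V_RF of equation (8) for one layer-1 unit by reverse correlation."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(image_shape)
    n_active = 0
    for _ in range(n_images):
        img = rng.normal(0.0, 1.0, image_shape)      # Gaussian random image, variance 1
        if sample_activation(unit, img):             # unit active in a posterior sample?
            acc += img
            n_active += 1
    return acc / max(n_active, 1)                    # average of the images that activated the unit
```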
To quantitatively compare the basis images with the physiological data, we fitted the bases with the
standard Gabor function
$$G(x, y;\, x_0, y_0, A, \sigma_x, \sigma_y, \theta, f, \phi) = A \exp\left(-\frac{x'^2}{2\sigma_x^2} - \frac{y'^2}{2\sigma_y^2}\right) \cos\left(2\pi f x' + \phi\right)$$
where x′ = (x − x0) cos θ + (y − y0) sin θ and y′ = −(x − x0) sin θ + (y − y0) cos θ; the parameters are center position (x0, y0), amplitude A, size (σx, σy), orientation θ, spatial frequency f, and phase φ.

Figure 3: (a) Learned basis images represented in layer 1. Each 100 consecutive images corresponds to a single layer-1 multinomial variable. Only the bases for three variables are shown. (b) The basis images (up) and the receptive field profiles (down) of ten layer-1 units. The receptive fields are obtained through reverse correlation.

We discarded
118 units (out of 900) for poor fitting (R2 < 0.9), and further 121 units since the filters were not reasonably
within the image boundary (the center position exceeded the boundary with one pixel of the inner margin
or the size in each dimension was greater than half the image width). For the remaining 661 basis images,
we pooled the parameters obtained from the fitting.
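The Gabor fitting can be done with standard nonlinear least squares; the sketch below (SciPy; parameter ordering, initialization, and the R² criterion are assumptions of this example) fits one basis image:

```python
import numpy as np
from scipy.optimize import curve_fit

def gabor(coords, x0, y0, A, sx, sy, theta, f, phi):
    """The Gabor function G(x, y; x0, y0, A, sigma_x, sigma_y, theta, f, phi)."""
    x, y = coords
    xp = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yp = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    env = np.exp(-xp**2 / (2 * sx**2) - yp**2 / (2 * sy**2))
    return (A * env * np.cos(2 * np.pi * f * xp + phi)).ravel()

def fit_gabor(basis_image, p0):
    """Fit one 12 x 12 basis image; p0 holds initial guesses for the 8 parameters."""
    h, w = basis_image.shape
    y, x = np.mgrid[0:h, 0:w]
    popt, _ = curve_fit(gabor, (x, y), basis_image.ravel(), p0=p0, maxfev=20000)
    resid = basis_image.ravel() - gabor((x, y), *popt)
    ss_tot = ((basis_image - basis_image.mean()) ** 2).sum()
    r2 = 1 - (resid ** 2).sum() / ss_tot
    return popt, r2              # units with r2 < 0.9 were discarded in the analysis
```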
Figures 4(a) and (b) show the histograms of spatial frequencies and lengths (2σy ) of the 661 units,
together with side-by-side replots of known physiological data, namely, the distributions of the “peak spatial
frequencies of foveal X cells” (DeValois et al., 1982, Fig. 6) and the “space constant in height dimension”
(Parker and Hawken, 1988, Fig. 4a). Since the frequency and the length data in the simulation were based on
pixels while those in the experiment were based on degrees, we first scaled the frequency data by coefficient
sfreq = 31.6 so as to minimize the Hellinger distance between the two histograms and then adjusted the length
data using scaling coefficient 60/sfreq = 1.9. For both quantities, the simulation data did not exactly match
the experimental data, although they did not deviate too much, either; note in particular the similarity in
the peak lengths arising without a direct fitting.
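The Hellinger distance between two histograms over the same bins, used here as the matching criterion, is a standard quantity; a minimal computation (not code from the original study) is:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two histograms defined over identical bins."""
    p = np.asarray(p, dtype=float) / np.sum(p)     # normalize to probabilities
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# The scaling coefficient s_freq was chosen by evaluating hellinger() on the two
# histograms over a range of candidate scales and keeping the minimizing value.
```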
Next, following (Ringach, 2002), we compared the receptive field shapes of model units with V1 simple
cells in the space of two-dimensional sizes in units of the sinusoidal wavelength, defined as (nx , ny ) =
(σx f, σy f ). Figure 5 plots the simulation and the experimental data (the latter is provided in Ringach’s
website). The overall tendencies toward the upper left were consistent in both data, while the simulation
data tended to shift to the left particularly for units close to the origin. This means that the aspect ratios
of well-tuned units were high in both data, but those of broadly tuned units were overly high in the model.
We made further comparisons using the aspect ratios and the spatial frequency bandwidths in octaves calculated from the values of (nx, ny), namely, σy/σx and log2((f + Δf)/(f − Δf)), where Δf = √((log 2)/2)/(σxπ), respectively. The last definition of bandwidth can be obtained analytically as the full width at half maximum (FWHM) in the Fourier transform of a Gabor function; see (Daugman, 1985; Petkov and Kruizinga, 1997). Figures 4(c) and (d) show the histograms of spatial frequency bandwidths and aspect ratios, respectively, of the model units and V1 simple cells. Only 387 model units were used to analyze the bandwidths since the rest had non-real values (with nx ≤ √((log 2)/2)/π). While the distributions of
bandwidths almost coincided, those of aspect ratios looked rather different, with a notable shift in peak values. Finally, Figure 4(e) plots the bandwidths against the frequencies of the 387 model units. The negative
correlation and the lower variability of bandwidths in high-frequency units were consistent with physiology
(DeValois et al., 1982, Fig. 7), even though the concrete values on both axes were different (probably because
the bandwidths were measured in different ways).
These analyses reveal both common and uncommon points between the model data and the related
experimental data. Inconsistency was mostly lesser variability in the model results; indeed, we noticed that
such quantitative tendencies were considerably sensitive to hyperparameters such as the number of units in
each variable and the number of variables in each layer. Note that, roughly, the inverse of the number of
units in a variable (0.01 in our network) corresponds to sparsity, and the total number of units in a layer
relative to the size of an image patch (900/24² ≈ 1.56) corresponds to the degree of overcompleteness in
existing models; the dependencies of quantitative tendencies on such hyperparameters are also common in
these models. (Relevantly, the distribution of spatial frequencies of independent component filters of natural
images in (van Hateren and van der Schaaf, 1998) has much less variety compared to our results, which
may also be because their model does not allow overcomplete representations.) We did not explore those
hyperparameters to find the best match since detailed reproduction of physiological properties was not the
main issue in this work.
4.2 Layer 2 representations
Since each layer-1 unit in the trained network represented a localized orientation, each layer-2 unit should
represent some combination of localized orientations. Figure 6 shows the basis representations of all 400
layer-2 units, where each image patch displays the weighted superposition of the basis images of the most
strongly connected layer-1 units (at most one unit per layer-1 node with a weight larger than 2.0),
where the basis images are properly translated according to the spatial arrangement of the layer-1 units.
Figure 4: The distributions of (a) spatial frequencies, (b) lengths, (c) spatial frequency bandwidths, (d) aspect ratios, and (e) bandwidths relative to frequencies of layer-1 units. Data (a–d) are compared with known experimental data (see text), with the degree of matching being indicated by the Hellinger distance ("h"). The spatial frequencies and the lengths in simulation are scaled for adjustment as indicated. Black: simulation, white: experiment. [Panel titles: (a) Spatial frequency (cy/deg) [scale=31.3, h=0.248]; (b) Length (arcmin) [scale=1.9, h=0.409]; (c) Spatial frequency bandwidth (octave) [h=0.231]; (d) Aspect ratio [h=0.413]; (e) Bandwidth vs spatial frequency.]

Figure 5: The distributions of shapes of receptive fields of (a) model units and (b) V1 simple cells (Ringach, 2002) in space of two-dimensional sizes in units of sinusoidal wavelength.
To which visual area is layer-2 the closest? At first, one might think that it corresponds to V1 complex cells since many units combine similar orientations and thus look like collinear or parallel lines. However, from tests using grating stimuli, we found that essentially no unit had phase invariance (data not shown), which would be the most important property of actual complex cells. In fact, a closer look at Figure 6 reveals that some units combine orientations that are slightly or even largely different; thus, such units look like contours or angles. Such properties are more prevalent in V2 (Hegdé and Essen, 2000; Ito and Komatsu, 2004; Anzai et al., 2007) and therefore comparison with this area would be more sensible.
To investigate the structure of the combined orientations in layer-2, we conducted the following test, adhering to the physiological study of V2 (Anzai et al., 2007). For each layer-2 unit, we recorded the response to a localized sinusoidal grating (with an amplitude of 1.0) of size 12 × 12 at one of 4 × 4 positions within the visual field. The response was measured as the probability of the unit being activated in 200 samples. The orientation of the grating was chosen from 12 orientations equally divided from 180 degrees; the phase and spatial frequency that evoked the maximal response were adopted. Figure 7 shows the receptive field profiles of six typical layer-2 units measured in this way, where each profile is shown as sixteen panels corresponding to the sixteen positions used in the test, in which a line segment indicates the response to one of the 12 orientations (local orientation tuning curve). Just like the bases, these receptive fields tend to combine similar orientations, but contain different orientations as well.
To analyze the receptive fields quantitatively, we first identified and pooled the optimal orientations at all 16 positions for each unit. That is, at each position, if the local orientation tuning curve well fitted a von Mises function

$$M(\theta; A, m, \theta_0) = A \exp\left[m\left(\cos(2(\theta - \theta_0)) - 1\right)\right] \qquad (9)$$

with R² > 0.5 and A > 0.5, then we pooled the peak orientation θ0; otherwise, if the tuning curve well fitted a mixture of two von Mises functions

$$M_2(\theta; A_1, A_2, m_1, m_2, \theta_1, \theta_2) = M(\theta; A_1, m_1, \theta_1) + M(\theta; A_2, m_2, \theta_2) \qquad (10)$$

with R² > 0.5, then we pooled each peak orientation θi (i = 1, 2) if M2(θi) > 0.5; if the curve fitted neither function, then we pooled no orientation at that position. The units with only zero or one orientation pooled were discarded for further analysis; a total of 132 out of 400 units remained. Then, for each unit, we calculated the maximal orientation difference, that is, the largest difference between each pair of pooled orientations.
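The single-peak case of this pooling procedure can be sketched as below (SciPy; array handling, the initial guess, and the fallback to the two-peak mixture of equation (10) are simplifications of this example):

```python
import numpy as np
from scipy.optimize import curve_fit

def von_mises(theta, A, m, theta0):
    """Equation (9): orientation tuning curve with a 180-degree period (theta in radians)."""
    return A * np.exp(m * (np.cos(2 * (theta - theta0)) - 1))

def pooled_orientation(thetas, responses):
    """Return [peak orientation] at one position if the single-peak fit is good, else []."""
    thetas, responses = np.asarray(thetas, float), np.asarray(responses, float)
    try:
        (A, m, t0), _ = curve_fit(von_mises, thetas, responses,
                                  p0=[responses.max(), 1.0, thetas[responses.argmax()]])
    except RuntimeError:
        return []
    resid = responses - von_mises(thetas, A, m, t0)
    r2 = 1 - (resid ** 2).sum() / ((responses - responses.mean()) ** 2).sum()
    if r2 > 0.5 and A > 0.5:
        return [t0 % np.pi]
    return []   # the full procedure then tries the two-peak mixture M2 of equation (10)
```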
Figure 8(a) shows the distribution of the maximal orientation differences for all layer-2 units, with the
replot of the analogous distribution reported on V2 cells (Anzai et al., 2007)². Both data are similar in the
sense that maximal orientation differences around 0 degree are the most frequent and then larger maximal
orientation differences are gradually less frequent, except for a weak peak at around 90 degrees. Further, to
examine the proportion of non-similar orientations in the combined orientations, we categorized the layer-2
units into those with maximal orientation differences of 30 degrees or less (uniform) and those that remained
(non-uniform); 75% of the units were categorized as uniform and 25% as non-uniform. Figure 8(b) shows
the distribution of the differences between all pairs of orientations pooled for all non-uniform units, with
analogous data from the same experimental study (Anzai et al., 2007). Again, both distributions have a
prominent peak at around 0 degree with a gradual decrease until a weaker peak at around 90 degrees. As a
check, we confirmed that the distributions obtained from simulation were significantly different from what we would expect if each unit's selectivity at each location were random; in such a case, the maximal orientation differences would give a peak only at 90 degrees and all pairwise orientation differences would give a uniform distribution (p < 10⁻⁶ for both; Kolmogorov-Smirnov test).
Why did such a property of orientation combination emerge in our model? Since the features captured by our learning algorithm are essentially the oriented edges that co-occur frequently, such a property should be due to the statistics of natural images. That is, although oriented edges are the elements of natural images at the finest level, those edges are, at the next level, combined into a straight line, a smooth contour, or an angle to constitute a part of an object. In this respect, the lack of phase invariance is not surprising at all since orientations with different phases would not occur simultaneously in natural images. Interestingly, our result, as well as the physiological result, is consistent with a more direct study on the statistics of co-occurring edges in natural images (Sigman et al., 2001), which also reports that iso-orientation co-occurrences are the most frequent and co-occurrences with larger orientation differences are progressively less frequent, though it shows no discernible peak at around 90 degrees. (The last point could be related to the fact that our quantitative result was somewhat sensitive to hyperparameters; in particular, the 90-degree peak disappeared when sparsity was set larger.)
4.3 Image generation
In order to confirm that the learned hierarchical representations presented so far reasonably captured natural
image statistics, we generated images from the model indicating how it might have interpreted a given image.
More precisely, after giving a random natural image patch V to layer-0, we first took a sample H1 , H2 from
the posterior P(H1 , H2 |V), where H1 and H2 are the layer-1 and layer-2 variables, respectively. Then, we
obtained the average of 100 images generated from the conditional probability P(V|H1 ) and also the average
of 100 images generated from P(V|H2 ). Figure 9 shows the result, where each column displays an input
natural image patch (top), three averaged images each generated from a different sample H1 (middle), and
three each generated from a different sample H2 (bottom). As one can see, the generated images were quite
similar to the input image, which indicates that the model had learned reasonable representations of natural
images. However, the model tended not to precisely reproduce the input image, but highlighted conspicuous
features and ignored obscure ones, which is not surprising since the basis representations did not include
tiny-sized edge features (Figure 3(a)).
4.4 Cross-orientation suppression
Cross-orientation suppression (Bonds, 1989) is a well-known divisive normalization phenomenon in V1
(Heeger, 1992). According to this, a typical cell reduces the response to an optimal grating when a mask
grating with a non-optimal orientation is superimposed upon it.
² The original experimental data contained both positive and negative orientation differences. We disregarded the signs in our analysis.

Figure 6: The learned basis representations in layer-2. Each image displays the weighted superposition of the basis images of strongly connected layer-1 units properly shifted according to the spatial arrangement of the layer-1 units.

Figure 7: The receptive field profiles of six exemplar layer-2 units in the trained network. Each panel shows the local orientation tuning curve at one of 16 positions within the visual field.

Figure 8: The distributions of maximum orientation differences for all layer-2 units (left) and of all pairwise orientation differences for all non-uniform units (right). Black: simulation, white: experiment. [Panel titles: (a) Maximum orientation differences [h=0.198]; (b) All orientation differences for non-uniforms [h=0.201].]

Figure 9: The model's "interpretation" of input images. Each column shows a random natural image patch (top), three averaged images generated from layer-1 each using a different sample from the posterior (middle), and three averaged images generated similarly from layer-2 (bottom).

This behavior can easily be explained in our model since the units are grouped together into multinomial variables. That is, consider a multinomial variable including two units, s1 and s2, that represent different
features. Then, if a stimulus only with s1 ’s feature is presented, then only s1 can be activated. However, if a
stimulus with both features is presented, then both s1 and s2 can be activated, in which case the probability
of s1 being activated would be reduced.
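A small numerical illustration of this effect under the Softmax of equation (2) (the weight values are invented for illustration only): adding support for a second unit of the same variable lowers the probability of the first, even though the first unit's own input is unchanged.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Three units of one multinomial variable; unit 0 prefers the optimal grating,
# unit 1 prefers the mask orientation, unit 2 is a background state.
grating_only = np.array([2.0, 0.0, 0.0])   # summed weight input with the optimal grating alone
plaid        = np.array([2.0, 2.0, 0.0])   # the mask adds equal support for unit 1

print(round(softmax(grating_only)[0], 2))  # ~0.79: unit 0 dominates
print(round(softmax(plaid)[0], 2))         # ~0.47: unit 0 is suppressed by the mask
```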
We confirmed that such a phenomenon of cross-orientation suppression could be robustly reproduced in
our network. Figure 10(a) shows the responses of 10 exemplar layer-1 units to a single sinusoidal grating
(covering the entire visual field) with a varied orientation (excitatory orientation tuning curve) and the
responses to a plaid composed of an optimal grating and a mask grating with a varied orientation (mask
orientation tuning curve). The orientation of the single grating or of the mask grating of the plaid was
chosen from the 16 orientations equally divided from 180 degrees; the spatial frequency obtained from Gabor
fitting and an amplitude of 0.5 were used; the phase that evoked the maximal response was adopted. Here,
the response of a unit was defined as the probability of the unit being activated in 200 samples. In the
figure, the mask orientation tuning curve (solid line) of each unit has a peak at a similar orientation to the
excitatory orientation tuning curve (broken line); thus the response to an optimal grating was suppressed
by a non-optimal mask grating, though the extent of suppression varies from unit to unit. Observe the
qualitative similarity between these simulated data and the responses of an actual V1 simple cell replotted
in Figure 10(b).
In general, most of the layer-1 units with a clear excitatory tuning curve also had a clear mask tuning curve
with a similar peak orientation. More precisely, defining clear tuning curves as those that well fit a von
Mises function (equation (9)) with R2 > 0.5 and A > 0.5, we found 533 units out of 900 having a clear
excitatory tuning curve, of which 490 (91.9%) also had a clear mask tuning curve whose peak orientation
was within 22.5 degrees from that of the excitatory tuning curve. The units with unclear excitatory tuning
curves mostly had either tiny, badly shaped, or low-frequency receptive fields.
Figure 10: The cross-orientation suppression response properties of ten exemplar layer-1 units (a) and a
V1 simple cell (b). The broken line shows the excitatory orientation tuning curve and the solid line shows
the mask orientation tuning curve. Note the similar peak orientations in both curves. The data for the
simple cell is a replot from the experimental study (Bonds, 1989). Abscissa: orientation in degree, ordinate:
response.
4.5 Filling-in
Filling-in is the phenomenon in which one perceives, within the blind spot, visual attributes similar to the surrounds even though the blind spot receives no retinal input (Ramachandran, 1992). Neural correlates have also been reported
in V1 (Komatsu and Kinoshita, 2000; Matsumoto and Komatsu, 2005). In this work, we take a particular
physiological result on filling-in using bar stimuli of varied lengths (Matsumoto and Komatsu, 2005) and
attempt to explain it by Bayesian inference.
The task is illustrated in Figure 11. Suppose that a unit selective to a horizontal orientation has a
receptive field that is partially overlapped with a blind spot. Then, we present a bar stimulus of varied
length over the receptive field while fixing the left end (a–d). When the bar is not long enough (c), one
would not perceive the part of the bar intersecting the blind spot (e). However, when the right end exceeds
the boundary of the blind spot (d), one would perceive the entire stimulus as a single bar (f). If the unit
correlates with the perception, then the unit would not be activated in the former case, while it would be in the latter.
Such neurons indeed exist in V1 (Matsumoto and Komatsu, 2005); an example is illustrated in Figure 13(b).
Note that cases (c) and (d) in Figure 11 are identical in terms of the retinal input to the receptive field
as shown in (e) and (f), since the right end of the bar appears outside the receptive field. Therefore the
information on the right end must be propagated to the unit through some indirect path. In our model,
we assume that such propagation is carried out by feedback from the higher layer. Importantly, layer-2 has
units representing combinations of similar orientations: this provides the ground that the orientation inside
the blind spot is likely to be similar to the one outside. The proposed mechanism is illustrated in Figure 12,
in which the activations of nodes in a simplified network are shown (A) before and (B) after the bar stimulus
exceeds the blind spot boundary. Pay special attention to the rightmost unit, representing a horizontal
orientation, of the layer-1 node (1b) indicated by the thick box. Before the bar stimulus exceeds the blind
spot, the target unit is only weakly activated since it receives almost only the weak bottom-up information
on the partially visible horizontal bar; it receives scarce top-down information since the lack of retinal input
on the right hand side activates no unit of the node (1c) and therefore the layer-2 nodes are not provided
with enough evidence of a horizontal orientation. However, once the right end of the bar becomes visible,
the target unit is strongly activated since the layer-2 nodes can now clearly detect the horizontal orientation
and therefore exert a strong top-down influence on the node (1b). In the biological visual system, feedback
connections from V2 are a possible mechanism causing filling-in, although horizontal connections give an
alternative (Komatsu and Kinoshita, 2000; Matsumoto and Komatsu, 2005; Satoh and Usui, 2008).
We have conducted a simulation of the above task using the network after learning all layers (where no blind spot was assumed during learning). We targeted a unit of the layer-1 node whose receptive field had a horizontal orientation in the central region (the rightmost unit shown in Figure 3(b)). We assumed a blind spot of size 7 × 4 located at position (13, 8) (with origin (0, 0)) and simulated it by “unfixing” the values of the layer-0 nodes within this region during inference; more precisely, we computed samples from the posterior distribution P(H, V_BS | V \ V_BS) instead of P(H|V), where V_BS denotes the layer-0 nodes within the blind spot. Figure 13(a) plots the average responses of the target unit to the bar stimuli (of thickness 2 at y-position 9) of varied lengths (where the average was taken over 10 separate trials with the same stimulus, each time measuring the probability of the unit’s activation in 2,000 samples taken after 1,000 unused samples). As shown in Figure 13(a, left), the response increased markedly after the bar stimulus exceeded the blind spot (solid line), even though the retinal input in the receptive field was the same.
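In terms of the inference procedure of Section 2.2, simulating the blind spot only changes which layer-0 nodes are clamped; the sketch below conveys the idea (the model interface is entirely hypothetical and only mirrors the description in the text):

```python
def blind_spot_response(model, image, blind_spot, n_burnin=1000, n_samples=2000):
    """Estimate the activation probability of the target unit with a blind spot present.

    blind_spot: set of layer-0 node coordinates covered by the blind spot; these nodes
    are left unclamped, so samples follow P(H, V_BS | V \\ V_BS) rather than P(H | V).
    """
    clamped = {v: image[v] for v in model.layer0_nodes if v not in blind_spot}
    state = model.initialize_state(clamped)
    free_nodes = list(model.hidden_nodes) + [v for v in model.layer0_nodes if v in blind_spot]
    active = 0
    for t in range(n_burnin + n_samples):
        state = model.gibbs_course(state, nodes=free_nodes)   # resample all unclamped nodes
        if t >= n_burnin and state[model.target_node] == model.target_state:
            active += 1
    return active / n_samples
```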
To confirm that the above was indeed a filling-in effect, we ran a few control simulations. First, when we
removed the blind spot (that is, fixing all the layer-0 nodes to some values during inference) as illustrated in
Figure 11(g–j), the average responses of the same unit recorded after the bar stimulus exceeded the receptive
field were not different from those recorded just before this excess (broken line in Figure 13(a, left)). This
means that the presence of the blind spot was essential. (Physiologically, the cases with and without a blind
spot correspond to monocular stimulation to the eye with a blind spot, called the “blind-spot eye condition,”
and binocular stimulation, called the “non-blind-spot eye condition,” respectively.) Furthermore, to check
that the effect was caused by feedback, we conducted the same task without top-down influence on layer-1.
This can be simulated by using the network before learning layer-2 but after learning layer-1, in which case
the layer-1 nodes would receive no top-down influence. As seen in Figure 13(a, right), the responses of the
same unit did not change after the bar stimulus exceeded the receptive field even when the blind spot was
present. Note the qualitative similarity between Figures 13(a, left) and (b).
5 Related work
5.1 Predictive coding models
A similar kind of interplay between hierarchical Bayesian inference and natural image learning has been
studied in the seminal work on the predictive coding model (Rao and Ballard, 1999). In this, although
the inference algorithm is deterministic, it can be reformulated as a Bayesian inference computing the most
probable explanation of a given input. Using their inference and learning algorithms, they have successfully
reproduced a variety of classical and nonclassical receptive field properties of early visual cortex including
end-stopping, pop-outs, and surround suppression.
However, one criticism of the original predictive coding model was the linear (additive or subtractive)
interaction between top-down and bottom-up signals, which was inconsistent with known multiplicative
effects of cortical feedback (McAdams and Maunsell, 1999; Treue, 2001). This issue was addressed by a more
recent proposal of nonlinear predictive coding model (Spratling, 2008; Spratling, 2010; Spratling, 2011),
which modified the update rules in the inference algorithm in a multiplicative fashion. Thus, the new model
successfully reproduced a wider range of nonclassical properties of V1, including attentional effects (Spratling,
2008) as well as cross-orientation suppression and surround modulation (Spratling, 2010). However, the
drawback was that it no longer had a clear Bayesian interpretation and had no objective function for inference
and learning; consequently, the convergence of computation was explained only informally (Spratling, 2011)
and the result of computation was rather difficult to interpret statistically.
In our model, the interaction between top-down and bottom-up signals is governed by Bayes’ theorem,
which has an inherently multiplicative nature. Consequently, feedforward signals do not represent prediction
errors but “prediction matches,” which have the effect of sharpening bottom-up sensory signals that agree
with the top-down predictions (Murray et al., 2004). This fact was crucial in our simulation of filling-in
(Section 4.5) as well as in prior work reproducing attentional effects (Rao, 2005; Chikkerur et al., 2010). However, one remaining issue is whether surround modulation, a prevalently observed phenomenon in V1, can also be explained in our model (see Section 6 for further discussion).

Figure 11: Filling-in task. Bar stimuli of varied length are presented over the receptive field (RF) of a unit, (a–d) and (g–j). Under conditions (a–d), the receptive field partially overlaps with a blind spot (BS). When the right end of the bar stays within BS (c), the part covered by BS is perceived to be absent (e). When the bar end exceeds the boundary of BS (d), the whole stimulus is perceived as a single bar even though the retinal input to BS is absent (f). The conditions (g–j) are the cases without BS.

Figure 12: The proposed mechanism causing the filling-in phenomenon. The rightmost unit of node (1b) is poorly activated when it only receives weak bottom-up support for a horizontal orientation (A), while the unit is strongly activated when the context provides enough information on the horizontal orientation through the top-down process.

Figure 13: (a) The responses of a layer-1 unit (the rightmost unit in Figure 3(b)) to bar stimuli of varied lengths (corresponding to Figures 11(a–d)) when top-down signals are enabled (left) or disabled (right). A blind spot was either present (solid line) or absent (broken line). The dotted box in each panel indicates the length range corresponding to the receptive field of the recorded unit, while the filled box indicates the length range corresponding to the blind spot. When top-down signals were enabled, the response increased markedly only after the bar exceeded the blind spot. (b) The responses of a V1 neuron adapted from the experimental study (Matsumoto and Komatsu, 2005). Abscissa: bar length (pixels), ordinate: response.
It is worth noting that predictive coding interprets neural activities as the values of continuous states,
whereas our model regards them as the probabilities of discrete states. From current knowledge in physiology,
it is far from clear which interpretation is more plausible. However, the probabilistic interpretation has
drawn much attention these days from both experimental and theoretical sides under the name of the “sampling
hypothesis” (Hoyer and Hyvärinen, 2002a; Fiser et al., 2010; Berkes et al., 2011; Buesing et al., 2011). In
particular, (Berkes et al., 2011) demonstrated that the spontaneous activities of multiple neurons (putatively
representing the prior distribution) became gradually closer to their average activities evoked by natural
images (the average posterior) as the animal grew up. This result appears to be quite compatible with our
learning scheme, which searches for an internal model as close as possible to the external visual world.
5.2 Restricted Boltzmann machine models
The sparse deep belief net (Lee et al., 2008) is a stack of symmetric graphical models (i.e., restricted Boltzmann machines). Its authors used a fast learning and inference algorithm based on contrastive divergence to reproduce a hierarchy of representations similar to V1 simple cells and V2; the latter was compared with a set of experimental data on V2 (Ito and Komatsu, 2004) complementary to the one we used. However, their model is not Bayesian in the sense that inference in the hierarchy needs to be performed in a layer-by-layer manner
and, as a result, the computation on the entire network does not yield a simple probabilistic interpretation.
This theoretical drawback may be even more pronounced in a task involving complex inference. For example,
in our filling-in simulation, a part of the visible variables are unfixed to represent a blind spot and exactly
the same inference algorithm can be used to obtain a posterior probability. In their case, on the other hand,
some modification would be needed since simple layer-by-layer computation is not sufficient and, even in
that case, the result would not have a straightforward interpretation.
5.3 Multinomial network models
Quite a few studies have used multinomial Bayesian networks for modeling visual cortex computation.
Multinomial variables have typically served as a handy way of modeling hypercolumns composed of feature-representing minicolumns (George and Hawkins, 2005; Rao, 2005; Ichisugi, 2007; Röhrbein et al., 2007;
Chikkerur et al., 2010; Hosoya, 2010) or as a mechanism for explaining attention-driven divisive normalization
(Rao, 2005; Chikkerur et al., 2010). However, to the best of our knowledge, there have been no proposals of
methods of natural image learning that work for hierarchical multinomial Bayesian networks.
One novelty in our work is to exploit multinomial variables to implicitly enforce sparsity. Sparsity is
known to be an important constraint on unit activities to obtain efficient feature representations such as
visual receptive field properties from natural image patches (Olshausen and Field, 1996). Indeed, in a separate
simulation using binary variables (sparsity 0.5), we did not observe the emergence of any meaningful features.
However, our mechanism for sparsity may look rather unusual since previous models typically control sparsity
through a regularization term (Olshausen and Field, 1996; Lee et al., 2008), while ours hard-wires sparsity
by the number of states allowed for a variable. This could be a potentially severe restriction since a single
sample of activation allows exactly one unit to be active for each variable, although our simulation nonetheless
yielded various receptive field properties similar to actual visual neurons. An alternative approach would be
to employ a network with binary variables with explicit sparsity enforced on each variable. Although similar
results should in principle arise from such binary model, it is widely considered that this approach requires
a prohibitively large computation time (Hinton, 2007).
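As an illustration of this hard-wired sparsity, the sketch below (a toy stand-in in Python with a hypothetical weight parameterization, not our actual implementation) samples a single "hypercolumn" with K competing states: since exactly one state fires per sample, the per-unit activation rate is 1/K by construction, with no regularization term.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multinomial_unit(inputs, weights):
    # weights[k] maps the input to the log-odds of state k (a hypothetical
    # parameterization, loosely analogous to the Softmax model of equation 2).
    logits = weights @ inputs
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = rng.choice(len(p), p=p)
    one_hot = np.zeros(len(p))
    one_hot[k] = 1.0                   # exactly one active state per sample
    return one_hot

K, D = 20, 64                          # 20 competing states, 64 inputs
weights = rng.normal(size=(K, D))
acts = np.array([sample_multinomial_unit(rng.normal(size=D), weights)
                 for _ in range(1000)])
print("mean activation per unit:", acts.mean())   # 1/K = 0.05 by construction

# Contrast: independent binary units with activation probability 0.5 would give
# a mean activation of 0.5, and sparsity would have to be imposed by an
# explicit regularization term instead.
```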
Lastly, it is worth pointing out that our model implicitly suggests that divisive normalization and sparsity,
which are apparently different phenomena, could be caused by the same neural mechanism. This might make
sense since the most plausible way of implementing either would be horizontal inhibition (Kisvárday et al.,
1997).
6  Conclusion
In this article, we have proposed a hierarchical probabilistic model of visual cortex and studied two kinds
of interplay between natural image learning and Bayesian inference. First, we have shown that maximal-likelihood learning using Gibbs-sampling-based Bayesian inference can yield a hierarchy of feature representations similar to the receptive fields of early visual cortex. Second, we have demonstrated that Bayesian inference
performed on the network after learning can reproduce two kinds of context-sensitive response properties,
namely, cross-orientation suppression and filling-in. In particular, the latter exploits feedback processing
from the higher layer that has learned V2-like features representing combinations of similar orientations.
One important nonclassical property of V1 not addressed in the current work is surround suppression, which has been explained by several other statistical models of natural images (Rao and Ballard, 1999; Schwartz and Simoncelli, 2001; Spratling, 2010). We speculate that this property could in principle be modeled by the so-called "explaining-away" effect, in which one explanation for a piece of evidence can rule out another potential explanation. In our case, an orientation appearing in the surround would explain away the same orientation appearing in the center, thus causing suppression. Note that explaining away occurs in a Bayesian model as an interaction among multiple variables at the same level (with common subordinate variables) and therefore could coexist with top-down effects like the one used in our filling-in simulation. However, in order to enable such a surround suppression effect in our multinomial framework, an extension seems to be needed such that a visual feature to be explained away can be "turned off," which we hope to incorporate in a future version of our model.
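To illustrate the explaining-away effect itself, the following toy computation (a generic collider network, unrelated to our trained model) shows that once a common observation is fixed, evidence for one cause lowers the posterior probability of the other:

```python
# Collider network A -> C <- B with binary variables: observing C makes the two
# candidate causes compete for the explanation of the evidence.
pA, pB = 0.2, 0.2

def pC_given(a, b):
    # A noisy-OR likelihood, chosen only for illustration.
    return 1.0 - (1 - 0.9 * a) * (1 - 0.9 * b)

# Enumerate the joint distribution over (A, B) together with the evidence C = 1.
joint = {(a, b): (pA if a else 1 - pA) * (pB if b else 1 - pB) * pC_given(a, b)
         for a in (0, 1) for b in (0, 1)}
Z = sum(joint.values())

p_A_given_C  = (joint[1, 0] + joint[1, 1]) / Z               # about 0.56
p_A_given_CB = joint[1, 1] / (joint[0, 1] + joint[1, 1])     # about 0.22
print(p_A_given_C, p_A_given_CB)   # the second is smaller: B explains A away
```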
Our eventual goal is to build a model with a much deeper hierarchy. For this, invariance properties are crucially needed, such as phase invariance in V1 complex cells and size invariance in higher areas. Although our aim is not to explain the full details of neural responses, invariance properties seem to be key to a robust reproduction of the major properties of higher areas. Since temporal structure is known to help acquire invariance, we plan to incorporate trace learning into our learning algorithm (Földiak, 1991; Hurri and Hyvärinen, 2003; Stringer et al., 2007; Cadieu and Olshausen, 2009).
Acknowledgments  This work has been supported by the PRESTO program of the Japan Science and Technology Agency in the research area of "Decoding and Controlling Brain Information." The author especially thanks
Florian Röhrbein, Taro Toyoizumi, and Shun-ichi Amari for valuable discussions.
References
Anzai, A., Peng, X., and Essen, D. C. V. (2007). Neurons in monkey visual area V2 encode combinations of
orientations. Nature Neuroscience, 10:1313–1321.
Bell, A. and Sejnowski, T. J. (1997). The ‘independent components’ of natural scenes are edge filters. Vision
Research, 37:3327–3338.
Berkes, P., Orbán, G., Lengyel, M., and Fiser, J. (2011). Spontaneous cortical activity reveals hallmarks of
an optimal internal model of the environment. Science, 331(6013):83–87.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Bonds, A. B. (1989). Role of inhibition in the specification of orientation selectivity of cells in the cat striate
cortex. Visual Neuroscience, 2:41–55.
Buesing, L., Bill, J., Nessler, B., and Maass, W. (2011). Neural dynamics as sampling: a model for stochastic
computation in recurrent networks of spiking neurons. PLoS Computational Biology, 7(11):e1002211.
Buxhoeveden, D. P. and Casanova, M. F. (2002). The minicolumn hypothesis in neuroscience. Brain,
125(5):935–951.
Cadieu, C. F. and Olshausen, B. A. (2009). Learning transformational invariants from natural movies. In
Advances in Neural Information Processing Systems, volume 21, pages 209–216.
Chikkerur, S., Serre, T., Tan, C., and Poggio, T. (2010). What and where: A Bayesian inference theory of
attention. Vision Research, 55(22):2233–2247.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized
by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7):1160–1169.
Deneve, S. (2004). Bayesian multisensory integration and cross-modal spatial links. Journal of Physiology-Paris, 98(1–3):249–258.
DeValois, R. L., Albrecht, D. G., and Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque
visual cortex. Vision Research, 22:545–559.
Doya, K., Ishii, S., Pouget, A., and Rao, R. P. N., editors (2007). Bayesian Brain: Probabilistic Approaches
to Neural Coding. MIT Press.
Fiser, J., Berkes, P., Orbán, G., and Lengyel, M. (2010). Statistically optimal perception and learning: from
behavior to neural representations. Trends in Cognitive Sciences, 14(3):119–130.
Földiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2):194–200.
George, D. and Hawkins, J. (2005). A hierarchical Bayesian model of invariant pattern recognition in the
visual cortex. In Proceedings of International Joint Conference on Neural Networks, volume 3, pages
1812–1817.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181–197.
Hegdé, J. and Essen, D. C. V. (2000). Selectivity for complex shapes in primate visual area V2. Journal of
Neuroscience, 20(5):RC61.
Hinton, G. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11(10):428–434.
Hosoya, H. (2010). Bayesian interpretation of border-ownership signals in early visual cortex. In International
Conference on Neural Information Processing (Lecture Notes in Computer Science, Volume 6443/2010),
pages 1–8.
Hoyer, P. O. and Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from
colour and stereo images. Network: Computation in Neural Systems, 11:191–210.
Hoyer, P. O. and Hyvärinen, A. (2002a). Interpreting neural response variability as Monte Carlo sampling
of the posterior. In Advances in Neural Information Processing Systems, volume 15, pages 277–284.
Hoyer, P. O. and Hyvärinen, A. (2002b). A multi-layer sparse coding network learns contour coding from
natural images. Vision Research, 42:1593–1605.
Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex.
Journal of Physiology, 195:215–243.
Hurri, J. and Hyvärinen, A. (2003). Simple-cell-like receptive fields maximize temporal coherence in natural
video. Neural Computation, 15(3):663–691.
Ichisugi, Y. (2007). A cerebral cortex model that self-organizes conditional probability tables and executes
belief propagation. International Joint Conference on Neural Networks (IJCNN 2007), pages 178–183.
Ito, M. and Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of
macaque monkeys. The Journal of Neuroscience, 24(13):3313–3324.
Jones, J. P. and Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple
receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258.
Kersten, D. and Yuille, A. (2003). Bayesian models of object perception. Current Opinion in Neurobiology,
13:150–158.
Kisvárday, Z. F., Tóth, É., Rausch, M., and Eysel, U. T. (1997). Orientation-specific relationship between
populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cerebral
Cortex, 7(7):605–618.
Komatsu, H. and Kinoshita, M. (2000). Neural responses in the retinotopic representation of the blind spot
in the macaque V1 to stimuli for perceptual filling-in. The Journal of Neuroscience, 20(24):9310–9319.
Lee, H., Ekanadham, C., and Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. In Advances
in Neural Information Processing Systems, volume 20, pages 873–880.
Lee, T. S. and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the
Optical Society of America A, 20(7):1434–1448.
Matsumoto, M. and Komatsu, H. (2005). Neural responses in the macaque V1 to bar stimuli with various
lengths presented on the blind spot. Journal of Neurophysiology, 93:2374–2387.
McAdams, C. J. and Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single
neurons in macaque cortical area V4. The Journal of Neuroscience, 19(1):431–441.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120:701–722.
Murray, S. O., Schrater, P., and Kersten, D. (2004). Perceptual grouping and the interactions between visual
cortical areas. Neural Networks, 17:695–705.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
Olshausen, B. A. (2002). Sparse codes and spikes. In Probabilistic Models of the Brain: Perception and
Neural Function, pages 257–272. MIT Press.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning
a sparse code for natural images. Nature, 381:607–609.
Olshausen, B. A. and Field, D. J. (2003). Sparse coding with an overcomplete basis set: A strategy employed
by V1? Vision Research, 37(23):3311–3325.
Park, K. Y., Jabri, M., Lee, S. Y., and Sejnowski, T. (2000). Independent components of optical flows have
MSTd-like receptive fields. In Proceedings of the international workshop on independent component
analysis and blind signal separation, pages 597–602.
Parker, A. J. and Hawken, M. J. (1988). Two-dimensional spatial structure of receptive fields in monkey
striate cortex. Journal of the Optical Society of America A, 5:598–605.
Pearl, J. (1997). Probabilistic Reasoning In Intelligent Systems: Networks of Plausible Inference. Morgan
Kaufmann.
Petkov, N. and Kruizinga, P. (1997). Computational models of visual neurons specialised in the detection of
periodic and aperiodic oriented visual stimuli: bar and grating cells. Biological Cybernetics, 76(2):83–96.
Ramachandran, V. S. (1992). Blind spots. Scientific American, 266:86–91.
Rao, R. P. (2005). Bayesian inference and attentional modulation in the visual cortex. Neuroreport, 16(16):1843–1848.
Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation
of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87.
Rao, R. P. N., Olshausen, B. A., and Lewicki, M. S., editors (2002). Probabilistic models of the brain:
Perception and neural function. MIT Press.
Ringach, D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary
visual cortex. Journal of Neurophysiology, 88:455–463.
Röhrbein, F., Eggert, J., and Körner, E. (2007). A cortex-inspired neural-symbolic network for knowledge
representation. In Proceedings of the IJCAI Workshop on Neural-Symbolic Learning and Reasoning.
Satoh, S. and Usui, S. (2008). Computational theory and applications of a filling-in process at the blind
spot. Neural Networks, 21:1261–1271.
Schwartz, O. and Simoncelli, E. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825.
Sigman, M., Cecchi, G. A., Gilbert, C. D., and Magnasco, M. O. (2001). On a common circle: natural scenes
and Gestalt rules. Proceedings of the National Academy of Sciences of the United States of America,
98(4):1935–1940.
Spratling, M. W. (2008). Predictive coding as a model of biased competition in visual attention. Vision
Research, 48(12):1391–1408.
Spratling, M. W. (2010). Predictive coding as a model of response properties in cortical area V1. Journal
of Neuroscience, 30(9):3531–3543.
Spratling, M. W. (2011). Unsupervised learning of generative and discriminative weights encoding elementary
image components in a predictive coding model of cortical function. Neural Computation. In press.
Stringer, S. M., Rolls, E. T., and Tromans, J. (2007). Invariant object recognition with trace learning and
multiple stimuli present during training. Network: Computation in Neural Systems, 18:161–187.
Tanaka, K. (2003). Columns for complex visual object features in the inferotemporal cortex: clustering of
cells with similar but slightly different stimulus selectivities. Cerebral Cortex, 13:90–99.
Treue, S. (2001). Neural correlates of attention in primate visual cortex. Trends in Neurosciences, 24(5):295–
300.
van Hateren, J. H. and van der Schaaf, A. (1998). Independent component filters of natural images compared
with simple cells in primary visual cortex. Proceedings of the Royal Society B: Biological Sciences,
265(1394):359–366.
A  Derivation of the learning algorithm
This section proves that our learning algorithm can be derived as a stochastic gradient method on the expected log likelihood. Below, we consider the derivation separately for the two kinds of parameters to optimize, namely, the weights $w_{h,u}$ used for hidden variables $H$ (Softmax) and the weights $w_u^V$ for visible variables $V$ (Gaussian).
Softmax case.  First, the partial derivative of the expected log likelihood $E^*_{\hat{v}}[\log P(\hat{v} \mid w)]$ with respect to a parameter $w_{h,u}$ can be transformed as follows.
$$
\begin{aligned}
\frac{\partial E^*_{\hat{v}}[\log P(\hat{v} \mid w)]}{\partial w_{h,u}}
&= E^*_{\hat{v}}\!\left[ \frac{1}{P(\hat{v} \mid w)} \frac{\partial P(\hat{v} \mid w)}{\partial w_{h,u}} \right] \quad (11)\\
&= E^*_{\hat{v}}\!\left[ \sum_{\hat{h}} \frac{1}{P(\hat{v} \mid w)} \frac{\partial P(\hat{v}, \hat{h} \mid w)}{\partial w_{h,u}} \right] \quad (12)\\
&= E^*_{\hat{v}}\!\left[ \sum_{\hat{h}} \frac{P(\hat{h} \mid \hat{v}, w)}{P(\hat{v}, \hat{h} \mid w)} \frac{\partial P(\hat{v}, \hat{h} \mid w)}{\partial w_{h,u}} \right] \quad (13)\\
&= E^*_{\hat{v},\hat{h}}\!\left[ \frac{1}{P(\hat{v}, \hat{h} \mid w)} \frac{\partial P(\hat{v}, \hat{h} \mid w)}{\partial w_{h,u}} \right] \quad (14)\\
&= E^*_{\hat{x}}\!\left[ \frac{1}{P(\hat{x} \mid w)} \frac{\partial P(\hat{x} \mid w)}{\partial w_{h,u}} \right] \quad (15)
\end{aligned}
$$
In step (14), the expectation is taken under the distribution $P^*$ generalized such that $P^*(\hat{v}, \hat{h} \mid w) = P(\hat{h} \mid \hat{v}, w)\, P^*(\hat{v})$.
Now, note that the joint distribution is factorized as $P(\hat{\mathbf{x}} \mid w) = \prod_{\hat{x} \in \hat{\mathbf{x}}} P(\hat{x} \mid \mathrm{pa}(\hat{x}), w)$ (equation 1), of which only the factor $P(\hat{h} \mid \mathrm{pa}(\hat{h}), w)$ uses the weight $w_{h,u}$ (equation 2). Therefore,
$$
(15) = E^*_{\hat{h},\hat{u},\hat{t}}\!\left[ \frac{1}{P(\hat{h} \mid \hat{u}, \hat{t}, w)} \frac{\partial P(\hat{h} \mid \hat{u}, \hat{t}, w)}{\partial w_{h,u}} \right], \qquad (16)
$$
where $T = \mathrm{pa}(H) \setminus \{U\}$. Further, from the Softmax model (equation 2), the following can be derived by a little manipulation.
$$
\frac{\partial P(\hat{h} \mid \hat{u}, \hat{t}, w)}{\partial w_{h,u}} =
\begin{cases}
P(h \mid u, \hat{t}, w)\,(1 - P(h \mid u, \hat{t}, w)) & (\hat{u} = u,\ \hat{h} = h)\\
-P(h \mid u, \hat{t}, w)\, P(\hat{h} \mid u, \hat{t}, w) & (\hat{u} = u,\ \hat{h} \neq h)\\
0 & (\hat{u} \neq u)
\end{cases} \qquad (17)
$$
Taken together, these can be rewritten as
$$
\frac{\partial P(\hat{h} \mid \hat{u}, \hat{t}, w)}{\partial w_{h,u}} = P(\hat{h} \mid \hat{u}, \hat{t}, w)\, \delta_{u,\hat{u}} \left( \delta_{h,\hat{h}} - P(h \mid \hat{u}, \hat{t}, w) \right). \qquad (18)
$$
Putting these back into formula (16), we obtain
$$
(16) = E^*_{\hat{h},\hat{u},\hat{t}}\!\left[ \delta_{u,\hat{u}} \left( \delta_{h,\hat{h}} - P(h \mid \hat{u}, \hat{t}, w) \right) \right]. \qquad (19)
$$
By replacing the computation of the expectation with the average over samples taken by Gibbs sampling, we derive the weight update rule (6).
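For concreteness, one stochastic-gradient step following (19) and estimated from Gibbs samples might be sketched as follows (a hypothetical Python sketch: the weight layout, the samples bookkeeping, and the learning rate are illustrative and not those of our actual implementation).

```python
import numpy as np

def softmax_weight_update(W_hu, samples, lr=0.01):
    """One stochastic-gradient step for the Softmax weights, following (19):
    the gradient for w_{h,u} is estimated as the average over samples of
    delta_{u,u_hat} * (delta_{h,h_hat} - P(h | u_hat, t_hat, w)).

    W_hu    : array of shape (K_H, K_U), weights from parent states u to child states h
    samples : list of (h_hat, u_hat, p_h), where h_hat and u_hat are the sampled
              states and p_h is the vector P(. | u_hat, t_hat, w) of length K_H,
              all recorded during the Gibbs sweep (illustrative bookkeeping).
    """
    grad = np.zeros_like(W_hu)
    for h_hat, u_hat, p_h in samples:
        one_hot_h = np.zeros(W_hu.shape[0])
        one_hot_h[h_hat] = 1.0
        # delta_{u,u_hat} restricts the update to the column of the sampled parent state.
        grad[:, u_hat] += one_hot_h - p_h
    grad /= len(samples)
    return W_hu + lr * grad
```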
Gaussian case.  Exactly in the same way as the Softmax case, the partial derivative of the expected log likelihood with respect to a parameter $w_u^V$ is transformed as
$$
\frac{\partial E^*_{\hat{v}}[\log P(\hat{v} \mid w)]}{\partial w_u^V} = E^*_{\hat{v},\hat{u},\hat{t}}\!\left[ \frac{1}{P(\hat{v} \mid \hat{u}, \hat{t}, w)} \frac{\partial P(\hat{v} \mid \hat{u}, \hat{t}, w)}{\partial w_u^V} \right]. \qquad (20)
$$
From the Gaussian model (equation 3), we can derive the following.
$$
\frac{\partial P(\hat{v} \mid \hat{u}, \hat{t}, w)}{\partial w_u^V} =
\begin{cases}
\dfrac{1}{\sigma^2} \left( \hat{v} - \Big( \textstyle\sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \Big) \right) P(\hat{v} \mid \hat{u}, \hat{t}, w) & (\hat{u} = u)\\[2ex]
0 & (\hat{u} \neq u)
\end{cases} \qquad (21)
$$
Taken together, these can be written as
$$
\frac{\partial P(\hat{v} \mid \hat{u}, \hat{t}, w)}{\partial w_u^V} = \frac{1}{\sigma^2}\, \delta_{u,\hat{u}} \left( \hat{v} - \Big( \textstyle\sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \Big) \right) P(\hat{v} \mid \hat{u}, \hat{t}, w). \qquad (22)
$$
Putting these back into (20), where the factor $P(\hat{v} \mid \hat{u}, \hat{t}, w)$ cancels, we obtain
$$
(20) = E^*_{\hat{v},\hat{u},\hat{t}}\!\left[ \frac{1}{\sigma^2}\, \delta_{u,\hat{u}} \left( \hat{v} - \Big( \textstyle\sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \Big) \right) \right]. \qquad (23)
$$
The weight update rule (equation 7) is derived by replacing the expectation with the average over samples.
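Analogously, the Gaussian update following (23) might be sketched as below (again a hypothetical sketch with an illustrative weight layout; the actual update rule is equation 7 in the main text).

```python
import numpy as np

def gaussian_weight_update(w_V, samples, sigma=1.0, lr=0.01):
    """One stochastic-gradient step for the Gaussian weights, following (23):
    the gradient for w_u^V is estimated as the average over samples of
    (1 / sigma^2) * delta_{u,u_hat} * (v_hat - (sum_i w_{t_hat_i}^V + w_{u_hat}^V)).

    w_V     : vector of weights indexed by parent state (an illustrative flat layout)
    samples : list of (v_hat, u_hat, t_hats) from the Gibbs sweep, where t_hats
              are the sampled states of the other parents T.
    """
    grad = np.zeros_like(w_V)
    for v_hat, u_hat, t_hats in samples:
        prediction = w_V[list(t_hats)].sum() + w_V[u_hat]
        grad[u_hat] += (v_hat - prediction) / sigma ** 2
        # Analogous updates would apply to the weights of the other parents in T.
    grad /= len(samples)
    return w_V + lr * grad
```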