Multinomial Bayesian learning for modeling classical and non-classical receptive field properties

Haruo Hosoya
Brain Science Institute, RIKEN
PRESTO, Japan Science and Technology Agency
Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan
[email protected]

Keywords: Bayesian inference, natural image learning, simple cells, V2, divisive normalization, filling-in

Abstract

We study the interplay between Bayesian inference and natural image learning in a hierarchical vision system, in relation to the response properties of early visual cortex. We particularly focus on a Bayesian network with multinomial variables that can represent discrete feature spaces similar to hypercolumns combining minicolumns, enforce sparsity of activation to learn efficient representations, and explain divisive normalization. We demonstrate that maximal-likelihood learning using sampling-based Bayesian inference gives rise to classical receptive field properties similar to V1 simple cells and V2 cells, while inference performed on the trained network yields non-classical context-dependent response properties such as cross-orientation suppression and filling-in. Comparison with known physiological properties reveals some qualitative and quantitative similarities.

1 Introduction

Bayesian inference and natural image learning are two important computations hypothesized to take place in visual cortex. On the one hand, Bayesian inference can capture interactions between bottom-up information from the sensory system and top-down information from the central control system, thereby explaining various complex behaviors of the brain such as contextual and attentional effects (Lee and Mumford, 2003; Kersten and Yuille, 2003; Doya et al., 2007; George and Hawkins, 2005; Rao, 2005; Chikkerur et al., 2010; Hosoya, 2010) and multi-cue and multisensory integration (Rao et al., 2002; Deneve, 2004). On the other hand, natural image learning gives rise to the receptive field properties of various visual areas, including V1 (Olshausen and Field, 1996; Olshausen and Field, 2003; Bell and Sejnowski, 1997; van Hateren and van der Schaaf, 1998; Hoyer and Hyvärinen, 2000; Olshausen, 2002; Hurri and Hyvärinen, 2003; Spratling, 2011), V2 (Hoyer and Hyvärinen, 2002b; Lee et al., 2008), MT (Cadieu and Olshausen, 2009), and MST (Park et al., 2000), from the statistical regularities hidden in sensory inputs.

Our aim in this research is to study the mutual roles of the two types of computation in a hierarchical vision system, in relation to the classical and non-classical receptive field properties of early visual cortex. We particularly focus on a multi-layer model using a multinomial Bayesian network. The multinomial approach has been taken in several model studies of visual cortex (George and Hawkins, 2005; Rao, 2005; Ichisugi, 2007; Röhrbein et al., 2007; Chikkerur et al., 2010; Hosoya, 2010). In these, multinomial variables have typically played the "structural" role of bundling a number of discrete states representing individual visual features so that the variable itself forms a discrete-finite feature space, much like hypercolumns gathering dozens of minicolumns as found in the visual cortex of higher mammals (Mountcastle, 1997; Buxhoeveden and Casanova, 2002; Tanaka, 2003). Multinomial variables in the present work are also endowed with two "computational" roles.

The first is to enforce sparsity. We use a method of maximal-likelihood learning that alternately repeats sampling-based Bayesian inference and weight updates.
This is a simple extension of a classical method for binary Bayesian networks (Neal, 1992). However, in a binary system, an efficient representation does not emerge unless sparsity is explicitly imposed. Our approach exploits the fact that a multinomial variable has many possible states, so that the activation of each state is forced to be sparse. As a result, our learning method applied to a three-layered multinomial Bayesian network successfully yielded oriented Gabor filters similar to V1 simple cells in the second layer, and combinations of orientations such as collinear or parallel lines, contours, and angles similar to V2 cells in the third layer. Quantitative analyses of these receptive field properties also revealed some similarities to the experimental data (DeValois et al., 1982; Parker and Hawken, 1988; Ringach, 2002; Anzai et al., 2007).

The second role of multinomial variables is to reproduce divisive normalization phenomena (Heeger, 1992). Since a multinomial variable represents a probability distribution over a discrete-finite set of features, the probability of each state being activated is automatically normalized. This effect parallels the property of actual neurons whose activations are adjusted as if they summed to a constant. Indeed, this observation has been exploited to explain attention-driven normalization in prior multinomial Bayesian models (Rao, 2005; Chikkerur et al., 2010). The present work shows that cross-orientation suppression (Bonds, 1989), another example of divisive normalization, can also be explained by the same mechanism.

It is of great interest whether our network with the learned hierarchical representations can reproduce a non-trivial physiological effect involving both feedforward and feedback processing. For this reason, we have conducted a simulation of the filling-in phenomenon (Matsumoto and Komatsu, 2005), in which some cells in V1 respond as if they recovered a missing visual attribute in the blind spot from the surrounds. We explain this phenomenon by Bayesian inference involving feedback processing, in which the high-level representation learned in the top layer is used for estimating that the orientation of the missing part of the stimulus is similar to the one appearing outside the receptive field.

Our model is most closely related to predictive coding (Rao and Ballard, 1999) in the sense that that work also studied the interplay between Bayesian inference and natural image learning in relation to the classical and non-classical receptive field properties of early visual cortex. However, there are several crucial differences in the way information is encoded and processed; for example, they consider additive or subtractive interaction between bottom-up and top-down signals, whereas we take multiplicative interaction; also, they interpret neural activities as values of continuous states, whereas we regard them as probabilities of discrete states. The relationship to the predictive coding model and other models is discussed thoroughly in Section 5.

The rest of this article is organized as follows. The next section introduces our computational model, defines its structure, and explains the inference and learning methods. Then, after describing some detailed settings (Section 3), we present the results of our simulations of learning and inference in comparison with the experimental data (Section 4). Finally, we discuss related work (Section 5) and conclude the article with some future possibilities (Section 6).
The Appendix gives a formal derivation of our learning method.

2 Computational model

2.1 Network structure

Let us consider a Bayesian network (directed acyclic graph) consisting of a set \mathcal{X} of random variables (nodes). We divide the set \mathcal{X} of nodes into two disjoint sets, visible nodes \mathcal{V} and hidden nodes \mathcal{H}, where the visible nodes have no children. We assume that all hidden nodes are multinomial variables ranging over discrete-finite states, while all visible nodes are continuous variables ranging over real numbers. The latter is because pixel values in gray-scale images are given to the visible nodes. An example of such a network is given in Figure 1(a). We write st(X), pa(X), and ch(X) for the set of states, the set of parent nodes, and the set of children of a given node X. As usual, we postulate that the joint distribution of all variables factorizes into the product of the local conditional probabilities P(X|pa(X)) of each variable X given its parents:

    P(\mathcal{X}) = \prod_{X \in \mathcal{X}} P(X \mid \mathrm{pa}(X))    (1)

We further assume that the conditional probability distribution P(X|pa(X)) is defined parametrically in terms of a Softmax function for a hidden node and a Gaussian function for a visible node:

• For a hidden variable H, we assume a weight matrix {w_{h,u}}_{h∈st(H), u∈st(U)} of real values for each parent variable U, where the matrix elements are indexed by a child state h and a parent state u; see Figure 1(b, top). Then, the following defines the probability of a hidden variable H having state h provided that its parent nodes pa(H) = {U_1, ..., U_p} have states u_1, ..., u_p:

    P(h \mid u_1, \ldots, u_p) = \frac{\exp\left(\sum_{j=1}^{p} w_{h,u_j}\right)}{\sum_{h' \in \mathrm{st}(H)} \exp\left(\sum_{j=1}^{p} w_{h',u_j}\right)}    (2)

That is, when the states u_1, ..., u_p of the parents are given, a state h of H is more likely to be chosen when the summation of the weights w_{h,u_1}, ..., w_{h,u_p} has a larger value. Intuitively, a "lower-level" feature h is likely to be active when it is strongly supported by any of the "higher-level" features u_1, ..., u_p. The Softmax definition arises from forming a probability distribution out of the real-valued summation of weights (exponentiation to make it positive, followed by normalization). We adopt real-valued weights here for representing both excitatory and suppressive effects.

• For a visible variable V, we assume a weight vector {w^V_u}_{u∈st(U)} of real values for each parent variable U, where the vector elements are indexed by a parent state u (but not by a child value v, which is continuous); see Figure 1(b, bottom). Then, the following defines the probability of a visible variable V having value v provided that its parent nodes pa(V) = {U_1, ..., U_p} have states u_1, ..., u_p:

    P(v \mid u_1, \ldots, u_p) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(v - \sum_{j=1}^{p} w^V_{u_j}\right)^2}{2\sigma^2}\right)    (3)

for a fixed hyperparameter σ. Intuitively, when the states u_1, ..., u_p of the parents are given, a value v of V is more likely to be chosen when the summation of the weights w^V_{u_1}, ..., w^V_{u_p} is closer to the value v.

We assume that the prior P(X) of a root node (i.e., a node with no parent) is uniform. From now on, we denote by w the vector of all the weights in the entire network, and since the distribution P(\mathcal{X}) depends on the weights, we will instead write P(\mathcal{X}|w).

Figure 1: (a) An example of a multinomial Bayesian network. The visible nodes (bottom layer) are continuous and defined by a Gaussian function. The hidden nodes (upper layers) are multinomial and defined by a Softmax function. (b) A weight is assumed between each hidden state and each parent state (top) or between each visible node and each parent state (bottom). Such weights are used for defining conditional probabilities.
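To make the two definitions concrete, here is a minimal sketch of both conditional distributions (our own illustration, not the author's simulation code; all names and data layouts are assumptions):

```python
import numpy as np

def hidden_conditional(W_list, parent_states):
    # Equation (2): P(H = h | u_1, ..., u_p) for every state h.
    # W_list[j] is the weight matrix {w_{h,u}} for parent j, of shape
    # (|st(H)|, |st(U_j)|); parent_states[j] is the sampled state u_j.
    drive = sum(W[:, u] for W, u in zip(W_list, parent_states))
    e = np.exp(drive - drive.max())  # subtract the max for numerical stability
    return e / e.sum()

def visible_conditional(v, w_list, parent_states, sigma):
    # Equation (3): Gaussian density of a visible value v, whose mean is
    # the sum of the weights of the sampled parent states.
    mean = sum(w[u] for w, u in zip(w_list, parent_states))
    return np.exp(-(v - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
```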
It is important to emphasize that it is a state rather than a variable that represents each individual visual feature, unlike the more conventional approach using continuous or binary variables. Thus, in our setting, a multinomial variable represents a "feature space" that combines a set of features related to each other. For this reason, we sometimes call a state a unit in the sequel. We can find a parallel between such a nested feature representation and the hypercolumn and minicolumn structures known in the visual cortex of higher mammals: a minicolumn is a set of vertically organized neurons with similar functions (visual features), and a hypercolumn is a cluster of dozens of minicolumns that have different but closely related functions (Mountcastle, 1997; Buxhoeveden and Casanova, 2002; Tanaka, 2003). However, it should be stressed that the model presented here is only functional and its neural implementability is an open question.

2.2 Bayesian inference

Having defined the Bayesian network in Section 2.1, we intend to perform hierarchical Bayesian inference to obtain the posterior of some variables given an input. For example, such inference can yield the posterior distribution P(\mathcal{H}|v) of all hidden variables \mathcal{H} given a concrete input v. If we are interested in a particular state a of a particular hidden variable H, then the inference can also give the marginal posterior probability P(H = a|v), which we relate to a neural firing rate. Note that this approach contrasts with prior work in which the values of continuous variables, rather than their probabilities, are related to neural activities (Rao and Ballard, 1999).

Of the various inference methods for Bayesian networks, we adopt Gibbs sampling (Bishop, 2006). While prior work has typically used belief propagation (George and Hawkins, 2005; Rao, 2005; Ichisugi, 2007; Röhrbein et al., 2007; Chikkerur et al., 2010; Hosoya, 2010), we do not take this approach since belief propagation can compute only marginal probabilities of a single variable (Pearl, 1997), while Gibbs sampling can compute joint probabilities of multiple variables, which are crucial in our case since the update rules of our learning algorithm take into account the dependencies among different variables (Section 2.3).

To briefly introduce the method used here, recall first how Gibbs sampling works in general. Let us fix a configuration v̂ for the visibles \mathcal{V}. At time t = 0, we start with some "initial" configuration ĥ(0) for the hidden variables \mathcal{H}. Then, at each subsequent time t ≥ 1, we take a "course" of sampling for all hidden variables H_1, H_2, ..., H_N, one after another:

    ĥ_1(t) from P(H_1 | v̂, ĥ_2(t−1), ..., ĥ_N(t−1), w),
    ĥ_2(t) from P(H_2 | v̂, ĥ_1(t), ĥ_3(t−1), ..., ĥ_N(t−1), w),
    ...
    ĥ_N(t) from P(H_N | v̂, ĥ_1(t), ĥ_2(t), ..., ĥ_{N−1}(t), w).

It is known that, by repeating this procedure, the sampled distribution asymptotically approaches the actual posterior distribution P(\mathcal{H}|v̂, w).
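A schematic of this procedure, assuming a helper local_conditional(i, state) that evaluates the distribution of H_i given everything else (which, as equation (4) below shows, requires only H_i's Markov blanket):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(hidden_state, local_conditional, n_courses):
    # hidden_state[i] holds the current sampled state of hidden node H_i;
    # the fixed visible configuration v̂ is closed over by local_conditional.
    for t in range(n_courses):
        for i in range(len(hidden_state)):  # one "course": every hidden node in turn
            p = local_conditional(i, hidden_state)     # P(H_i | v̂, other hiddens, w)
            hidden_state[i] = rng.choice(len(p), p=p)  # draw the new state of H_i
    return hidden_state  # after many courses, approximately ~ P(H | v̂, w)
```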
In each sampling step, we need to compute the distribution P(H | \mathcal{V}, \mathcal{H} \setminus H, w) for given values of \mathcal{V} and \mathcal{H} \setminus H. This can be obtained by using property (1) of Bayesian networks:

    P(H \mid \mathcal{V}, \mathcal{H} \setminus H, w) = \frac{1}{A}\, P(H \mid \mathrm{pa}(H), w) \prod_{X \in \mathrm{ch}(H)} P(X \mid \mathrm{pa}(X), w)    (4)

where the normalizing constant is A = \sum_{H} P(H \mid \mathrm{pa}(H), w) \prod_{X \in \mathrm{ch}(H)} P(X \mid \mathrm{pa}(X), w) (note that the set pa(X) in each P(X|pa(X), w) may include the variable H). The computation in equation (4) is local, since it involves only the variables in H's Markov blanket, that is, the children, the parents, and the children's parents except H itself (Bishop, 2006). However, we do not claim any biological plausibility here; relevantly, Gibbs sampling seems difficult to implement in cortex, since this method requires samples to be taken sequentially, variable by variable.

2.3 Unsupervised learning

Assume that inputs given to the visible nodes \mathcal{V} follow an external probability distribution P*(\mathcal{V}). The learning searches for a generative model reflecting the distribution of the input data as precisely as possible. Thus, the objective function is the expected log likelihood

    \mathcal{F}(w) = \mathrm{E}^*[\log P(\mathcal{V} \mid w)],    (5)

where E*[·] denotes the expectation taken under the external distribution P*. The algorithm can be derived as a stochastic gradient method on this objective function and formulated as an alternating iteration of two steps.

Inference. Take a random sample v̂ from the distribution P*(\mathcal{V}); then take a random sample ĥ from the posterior distribution P(\mathcal{H}|v̂, w) by using Gibbs sampling as explained in Section 2.2. A complete configuration for all variables is thus given. We repeat this process several times.

Update. For each pair of a hidden node H and a parent U ∈ pa(H), update each weight w_{h,u} by

    \Delta w_{h,u} \propto \left\langle \delta_{u,\hat{u}} \left( \delta_{h,\hat{h}} - P(h \mid \mathrm{pa}(\hat{h})) \right) \right\rangle.    (6)

Also, for each pair of a visible node V and a parent U ∈ pa(V), update each weight w^V_u by

    \Delta w^V_u \propto \left\langle \frac{1}{\sigma^2}\, \delta_{u,\hat{u}} \left( \hat{v} - \sum_{u_j \in \mathrm{pa}(\hat{v})} w^V_{u_j} \right) \right\rangle.    (7)

In both, δ_{·,·} denotes Kronecker's delta, ⟨·⟩ is the average over the samples taken in the inference step, û and v̂ denote the states of U and V in each sample, and pa(v̂) denotes the set of the states of the parent nodes pa(V) in each sample. A formal derivation is given in the Appendix. Note that the update rules (6) and (7) involve information only from proximal nodes.

The above learning algorithm is a conservative extension of the algorithm for binary networks presented in (Neal, 1992). In the simulation presented later, we apply the commonly used technique that trains each layer one after another rather than training all layers simultaneously. The intuitive reason this works is as follows. When a layer undergoes training, no top-down influence is exerted on the nodes of this layer, since all the parameters for the upper layers are fixed to zero. Therefore, the nodes in the trained layer are statistically independent of each other by property (1) of Bayesian networks. This fact, together with the sparsity enforced by multinomial variables, leads to an efficient feature representation.
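As an illustration, the update step for a single (H, U) pair under rule (6) and a single visible node V under rule (7) might look as follows (a sketch with our own variable names; samples is the batch collected in the inference step):

```python
import numpy as np

def update_hidden(W, samples, lr):
    # Rule (6): W[h, u] += lr * < delta_{u,û} (delta_{h,ĥ} - P(h | pa(ĥ))) >.
    dW = np.zeros_like(W)
    for h_hat, u_hat, p_given_parents in samples:
        dW[:, u_hat] -= p_given_parents  # the -P(h | pa(ĥ)) term, for every h
        dW[h_hat, u_hat] += 1.0          # the Kronecker delta term
    W += lr * dW / len(samples)

def update_visible(w_list, samples, lr, inv_sigma2):
    # Rule (7): distribute the residual v̂ - sum_j w^V_{u_j} to the weight
    # of each sampled parent state.
    for v_hat, parent_states in samples:
        residual = v_hat - sum(w[u] for w, u in zip(w_list, parent_states))
        for w, u in zip(w_list, parent_states):
            w[u] += lr * inv_sigma2 * residual / len(samples)
```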
Each layer-1 node is connected to a set of 12 × 12 layer-0 nodes such that two adjacent layer-1 nodes have an overlapping set of subordinate layer-0 nodes; side-by-side adjacent nodes share 12 × 6 or 6 × 12 layer-0 nodes and diagonally adjacent nodes share 6 × 6 layer-0 nodes. This layer will be compared to V1 simple cells.

Layer 2. The top layer, called layer 2, has 4 nodes, each having 100 states. Each layer-2 node is connected to all layer-1 nodes. This layer will be compared to V2 cells.

Figure 2: The architecture of the network used in the simulation. Layer 0 has 24 × 24 continuous nodes. Layer 1 has 3 × 3 nodes (multinomial variables, each depicted as a square subregion), each with 100 states (units, each depicted as a circle in the side box). Layer 2 has 4 nodes, each with 100 states. Layer-2 nodes are connected to all layer-1 nodes, while layer-1 nodes are connected to overlapping sets of 12 × 12 layer-0 nodes.

Note that, given these connectivities, layer-2 nodes can have receptive fields covering the whole visual field, while layer-1 nodes have smaller receptive fields partially overlapping with each other.

We used the dataset of gray-scale natural images provided by Olshausen as inputs.¹ The images were preprocessed with the whitening and low-pass filter prescribed by (Olshausen and Field, 2003). The resulting images had pixel intensities ranging over real numbers, with zero indicating gray. In each learning cycle, we extracted a 24 × 24 patch from a random position in an image, skipping patches with a variance of less than 0.1 to speed up learning. Then, we gave the pixel intensities in the image patch to the visible nodes. For each image patch, we took 10 samples from the posterior in the inference step. We updated the weights after every 10 cycles of this process.

¹Available through http://redwood.berkeley.edu/bruno/sparsenet/.

As already mentioned, we employed layer-by-layer learning. That is, after initializing the network (setting all weights to zero), we first let layer 1 learn alone (fixing the weights between layers 1 and 2) and, after it settled, let layer 2 start to learn (fixing the weights between layers 0 and 1). The first phase took about 3M cycles with a learning rate of 0.02. The second phase took about 5M cycles, with a learning rate of 5.0 for the first 3M cycles and 1.0 for the rest. We fixed 1/σ² = 4.5 throughout the simulation.
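For concreteness, the patch-sampling loop described above might be sketched as follows (whitening is assumed to have been applied to images already, following Olshausen and Field (2003); the function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_patch(images, size=24, min_var=0.1):
    # Draw a size x size patch from a random position of a random
    # (pre-whitened) image, skipping nearly blank patches.
    while True:
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - size + 1)
        x = rng.integers(img.shape[1] - size + 1)
        patch = img[y:y + size, x:x + size]
        if patch.var() >= min_var:
            return patch  # becomes the values of the 24 x 24 visible nodes
```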
4 Simulation Results

This section presents the results of our simulations using the methods given in Sections 2 and 3. Sections 4.1 and 4.2 show the hierarchical feature representations learned in layers 1 and 2, with quantitative comparisons with the experimental data; Section 4.3 gives the result of a simple procedure to confirm that the learned representations were reasonable. Sections 4.4 and 4.5 present the simulations of the cross-orientation suppression and filling-in phenomena using Bayesian inference on the trained network.

4.1 Layer 1 representations

Figure 3(a) shows part of the basis images represented in layer 1. Each block of 100 consecutive image patches ("basis images") corresponds to the 100 units of a layer-1 multinomial variable. Each image patch displays the weights {w^V_u}_{V∈ch(U)} from a layer-1 unit u to all its subordinate layer-0 nodes V. Since each layer-1 unit is connected to 12 × 12 layer-0 nodes, the basis image has 12 × 12 pixels. Only the basis images for the units of three multinomial variables are shown (300 units out of a total of 900); the remainder are quite similar. We can see that the basis images resemble Gabor filters and vary widely in position, orientation, spatial frequency, phase, and bandwidth; this is qualitatively similar to typical receptive field shapes of V1 simple cells (Hubel and Wiesel, 1968; Jones and Palmer, 1987).

In principle, the basis images should not be compared directly with receptive fields, since the former are internal representations while the latter are response properties. For this reason, in other model studies, the receptive fields have been estimated as the inverse of the weight matrix in independent component analysis (ICA) (van Hateren and van der Schaaf, 1998) or by performing reverse correlation in sparse coding (Olshausen, 2002); in these cases, the receptive fields looked slightly different from the basis images. In our case, however, the receptive fields and the bases were almost identical.

We used reverse correlation to estimate the receptive field profile of each layer-1 unit. Specifically, we approximately computed, for each unit u,

    V_{\mathrm{RF}} = \int V\, P(u \mid V, w)\, P_{\mathrm{rand}}(V)\, dV,    (8)

where P(u|V, w) is the posterior probability of the unit u given the input image V in the generative model (the trained network), and P_rand(V) denotes the distribution of Gaussian random images (with variance 1). In short, V_RF gives the average of the random images that activate the unit u. We approximated this by using samples of random images and unit activations, where the latter were obtained through the Gibbs-sampling-based inference explained in Section 2.2.

Figure 3(b) shows the estimated receptive field profiles (using 50,000 image samples with 20 courses of Gibbs sampling for each image) for ten sample units, in comparison with their basis images. For these units, the shapes of the two profiles look very similar, except that the receptive field is localized in accordance with the spatial arrangement of the subordinate visible nodes. In general, for all layer-1 units with clear basis shapes, the basis image and the receptive field profile were essentially identical at the level of pixel intensity. More precisely, we first judged the 72 units with a maximal pixel value smaller than 0.1 as having unclear basis shapes. For the remaining 828 units, we normalized the pixel values in the basis image by dividing them by the maximal value and did the same for the pixel values in the receptive field profile (properly clipped to the size of the basis image). We then plotted all the normalized pixel values y in the receptive field profiles of all units against the corresponding normalized pixel values x in the basis images. Since the data were well fitted by the linear function y = 0.92x + 0.0015 with R² = 0.92, we concluded that our basis images were effectively identical to the receptive field profiles. We will therefore make no distinction between the two in the sequel.

Figure 3: (a) Learned basis images represented in layer 1. Each run of 100 consecutive images corresponds to a single layer-1 multinomial variable. Only the bases for three variables are shown. (b) The basis images (top) and the receptive field profiles (bottom) of ten layer-1 units. The receptive fields are obtained through reverse correlation.
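Equation (8) admits a simple Monte Carlo approximation; a sketch, assuming a routine activation_prob(u, V) that runs the Gibbs-sampling inference of Section 2.2 and returns the activation probability of unit u:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_correlation(activation_prob, u, shape=(24, 24), n_images=50_000):
    # Approximates V_RF = integral of V * P(u | V, w) * P_rand(V) dV
    # using Gaussian random images of unit variance.
    rf = np.zeros(shape)
    for _ in range(n_images):
        V = rng.normal(0.0, 1.0, size=shape)
        rf += activation_prob(u, V) * V  # weight each image by the unit's response
    return rf / n_images
```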
To quantitatively compare the basis images with the physiological data, we fitted the bases with the standard Gabor function

    G(x, y; x_0, y_0, A, \sigma_x, \sigma_y, \theta, f, \phi) = A \exp\left(-\frac{x'^2}{2\sigma_x^2} - \frac{y'^2}{2\sigma_y^2}\right) \cos(2\pi f x' + \phi)

where x' = (x − x_0) cos θ + (y − y_0) sin θ and y' = −(x − x_0) sin θ + (y − y_0) cos θ; the parameters are center position (x_0, y_0), amplitude A, size (σ_x, σ_y), orientation θ, spatial frequency f, and phase φ. We discarded 118 units (out of 900) for poor fitting (R² < 0.9), and a further 121 units whose filters did not lie reasonably within the image boundary (the center position exceeded the boundary with one pixel of inner margin, or the size in either dimension was greater than half the image width). For the remaining 661 basis images, we pooled the parameters obtained from the fitting.

Figures 4(a) and (b) show the histograms of spatial frequencies and lengths (2σ_y) of the 661 units, together with side-by-side replots of known physiological data, namely, the distributions of the "peak spatial frequencies of foveal X cells" (DeValois et al., 1982, Fig. 6) and the "space constant in height dimension" (Parker and Hawken, 1988, Fig. 4a). Since the frequency and length data in the simulation were based on pixels while those in the experiment were based on degrees, we first scaled the frequency data by a coefficient s_freq = 31.6 chosen to minimize the Hellinger distance between the two histograms, and then adjusted the length data using the scaling coefficient 60/s_freq = 1.9. For both quantities, the simulation data did not exactly match the experimental data, although they did not deviate too much, either; note in particular the similarity in the peak lengths, which arose without any direct fitting.

Next, following (Ringach, 2002), we compared the receptive field shapes of the model units with those of V1 simple cells in the space of two-dimensional sizes in units of the sinusoidal wavelength, defined as (n_x, n_y) = (σ_x f, σ_y f). Figure 5 plots the simulation and the experimental data (the latter provided on Ringach's website). The overall tendencies toward the upper left were consistent in both data sets, while the simulation data tended to shift to the left, particularly for units close to the origin. This means that the aspect ratios of well-tuned units were high in both data sets, but those of broadly tuned units were overly high in the model.

Figure 5: The distributions of shapes of receptive fields of (a) model units and (b) V1 simple cells (Ringach, 2002) in the space of two-dimensional sizes in units of the sinusoidal wavelength.

We made further comparisons using the aspect ratios and the spatial frequency bandwidths in octaves calculated from the values of (n_x, n_y), namely, σ_y/σ_x and log₂((f + Δf)/(f − Δf)), where Δf = √((log 2)/2) / (σ_x π), respectively. The last definition of bandwidth can be obtained analytically as the full width at half maximum (FWHM) of the Fourier transform of a Gabor function; see (Daugman, 1985; Petkov and Kruizinga, 1997).
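Under these definitions, the derived shape statistics follow directly from the fitted Gabor parameters; a small sketch:

```python
import numpy as np

def shape_stats(sigma_x, sigma_y, f):
    # Dimensionless sizes (n_x, n_y), aspect ratio, and FWHM spatial
    # frequency bandwidth in octaves (None when the value is non-real).
    nx, ny = sigma_x * f, sigma_y * f
    aspect = sigma_y / sigma_x
    df = np.sqrt(np.log(2) / 2) / (sigma_x * np.pi)  # half width at half maximum
    bandwidth = np.log2((f + df) / (f - df)) if f > df else None
    return nx, ny, aspect, bandwidth
```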
Figures 4(c) and (d) show the histograms of spatial frequency bandwidths and aspect ratios, respectively, of the model units and V1 simple cells. Only 387 model units were used to analyze the bandwidths, since the rest had non-real bandwidth values (with n_x ≤ √((log 2)/2)/π). While the distributions of bandwidths almost coincided, those of aspect ratios looked rather different, with a notable shift in peak values. Finally, Figure 4(e) plots the bandwidths against the frequencies of the 387 model units. The negative correlation and the lower variability of bandwidths in high-frequency units were consistent with physiology (DeValois et al., 1982, Fig. 7), even though the concrete values on both axes were different (probably because the bandwidths were measured in different ways).

Figure 4: The distributions of (a) spatial frequencies, (b) lengths, (c) spatial frequency bandwidths, (d) aspect ratios, and (e) bandwidths relative to frequencies of layer-1 units. Data (a–d) are compared with known experimental data (see text), with the degree of matching indicated by the Hellinger distance ("h"). The spatial frequencies and lengths in simulation are scaled for adjustment as indicated. Black: simulation, white: experiment.

These analyses reveal both common and uncommon points between the model data and the related experimental data. The inconsistency was mostly lower variability in the model results; indeed, we noticed that such quantitative tendencies were considerably sensitive to hyperparameters such as the number of units in each variable and the number of variables in each layer. Note that, roughly, the inverse of the number of units in a variable (0.01 in our network) corresponds to sparsity, and the total number of units in a layer relative to the size of an image patch (900/24² ≈ 1.56) corresponds to the degree of overcompleteness in existing models; the dependencies of quantitative tendencies on such hyperparameters are also common in these models. (Relevantly, the distribution of spatial frequencies of independent component filters of natural images in (van Hateren and van der Schaaf, 1998) has much less variety compared with our results, which may also be because their model does not allow overcomplete representations.) We did not explore those hyperparameters to find the best match, since detailed reproduction of physiological properties was not the main issue in this work.

4.2 Layer 2 representations

Since each layer-1 unit in the trained network represented a localized orientation, each layer-2 unit should represent some combination of localized orientations. Figure 6 shows the basis representations of all 400 layer-2 units, where each image patch displays the weighted superposition of the basis images of the most strongly connected layer-1 units (at most one unit per layer-1 node, with a weight larger than 2.0), with the basis images properly translated according to the spatial arrangement of the layer-1 units.

Figure 6: The learned basis representations in layer 2. Each image displays the weighted superposition of the basis images of strongly connected layer-1 units, properly shifted according to the spatial arrangement of the layer-1 units.
To which visual area is layer 2 the closest? At first, one might think that it corresponds to V1 complex cells, since many units combine similar orientations and thus look like collinear or parallel lines. However, from tests using grating stimuli, we found that essentially no unit had phase invariance (data not shown), which would be the most important property of actual complex cells. In fact, a closer look at Figure 6 reveals that some units combine orientations that are slightly or even largely different; such units look like contours or angles. Such properties are more prevalent in V2 (Hegdé and Essen, 2000; Ito and Komatsu, 2004; Anzai et al., 2007), and therefore comparison with this area is more sensible.

To investigate the structure of the combined orientations in layer 2, we conducted the following test, adhering to the physiological study of V2 (Anzai et al., 2007). For each layer-2 unit, we recorded the response to a localized sinusoidal grating (with an amplitude of 1.0) of size 12 × 12 at one of 4 × 4 positions within the visual field. The response was measured as the probability of the unit being activated in 200 samples. The orientation of the grating was chosen from 12 orientations equally dividing 180 degrees; the phase and spatial frequency that evoked the maximal response were adopted. Figure 7 shows the receptive field profiles of six typical layer-2 units measured in this way, where each profile is shown as sixteen panels corresponding to the sixteen positions used in the test, in which a line segment indicates the response to one of the 12 orientations (local orientation tuning curve). Just like the bases, these receptive fields tend to combine similar orientations, but contain different orientations as well.

Figure 7: The receptive field profiles of six exemplar layer-2 units in the trained network. Each panel shows the local orientation tuning curve at one of 16 positions within the visual field.

To analyze the receptive fields quantitatively, we first identified and pooled the optimal orientations at all 16 positions for each unit. That is, at each position, if the local orientation tuning curve well fitted a von Mises function

    M(\theta; A, m, \theta_0) = A \exp\left[m(\cos(2(\theta - \theta_0)) - 1)\right]    (9)

with R² > 0.5 and A > 0.5, then we pooled the peak orientation θ_0; otherwise, if the tuning curve well fitted a mixture of two von Mises functions

    M_2(\theta; A_1, A_2, m_1, m_2, \theta_1, \theta_2) = M(\theta; A_1, m_1, \theta_1) + M(\theta; A_2, m_2, \theta_2)    (10)

with R² > 0.5, then we pooled each peak orientation θ_i (i = 1, 2) such that M_2(θ_i) > 0.5; if the curve fitted neither function, then we pooled no orientation at that position. Units with only zero or one orientation pooled were discarded from further analysis; a total of 132 out of 400 units remained. Then, for each unit, we calculated the maximal orientation difference, that is, the largest difference between any pair of pooled orientations.
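The single-peak case of this pooling procedure can be sketched with a standard least-squares fit (thresholds as in the text; the use of scipy and the initialization are our assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def von_mises(theta, A, m, theta0):
    # Equation (9); theta in radians over [0, pi).
    return A * np.exp(m * (np.cos(2 * (theta - theta0)) - 1))

def pool_peak(thetas, responses):
    # Fit equation (9) and pool the peak orientation theta0 only if the
    # fit is good (R^2 > 0.5) and the peak is strong (A > 0.5).
    p0 = [responses.max(), 1.0, thetas[np.argmax(responses)]]
    try:
        (A, m, theta0), _ = curve_fit(von_mises, thetas, responses, p0=p0)
    except RuntimeError:  # the fit did not converge
        return None
    ss_res = np.sum((responses - von_mises(thetas, A, m, theta0)) ** 2)
    ss_tot = np.sum((responses - responses.mean()) ** 2)
    if 1 - ss_res / ss_tot > 0.5 and A > 0.5:
        return theta0 % np.pi
    return None
```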
Figure 8(a) shows the distribution of the maximal orientation differences for all layer-2 units, together with a replot of the analogous distribution reported for V2 cells (Anzai et al., 2007).² Both data sets are similar in the sense that maximal orientation differences around 0 degrees are the most frequent and larger maximal orientation differences become gradually less frequent, except for a weak peak at around 90 degrees. Further, to examine the proportion of non-similar orientations among the combined orientations, we categorized the layer-2 units into those with maximal orientation differences of 30 degrees or less (uniform) and the rest (non-uniform); 75% of the units were categorized as uniform and 25% as non-uniform. Figure 8(b) shows the distribution of the differences between all pairs of orientations pooled for all non-uniform units, with analogous data from the same experimental study (Anzai et al., 2007). Again, both distributions have a prominent peak at around 0 degrees with a gradual decrease up to a weaker peak at around 90 degrees. As a check, we confirmed that the distributions obtained from simulation were significantly different from what we would expect if each unit's selectivity at each location were random; in that case, the maximal orientation differences would peak only at 90 degrees and the pairwise orientation differences would be uniformly distributed (p < 10⁻⁶ for both; Kolmogorov-Smirnov test).

²The original experimental data contained both positive and negative orientation differences. We disregarded the signs in our analysis.

Figure 8: The distributions of maximal orientation differences for all layer-2 units (left) and of all pairwise orientation differences for all non-uniform units (right). Black: simulation, white: experiment.

Why did such a property of orientation combination emerge in our model? Since the features captured by our learning algorithm are essentially the oriented edges that co-occur frequently, this property should be due to the statistics of natural images. That is, although oriented edges are the elements of natural images at the finest level, at the next level those edges are combined into a straight line, a smooth contour, or an angle to constitute a part of an object. In this respect, the lack of phase invariance is not surprising at all, since orientations with different phases would not occur simultaneously in natural images. Interestingly, our result, as well as the physiological result, is consistent with a more direct study on the statistics of co-occurring edges in natural images (Sigman et al., 2001), which also reports that iso-orientation co-occurrences are the most frequent and co-occurrences with larger orientation differences are progressively less frequent, though it shows no discernible peak at around 90 degrees. (The last point could be related to the fact that our quantitative result was somewhat sensitive to hyperparameters; in particular, the 90-degree peak disappeared when sparsity was set to be larger.)

4.3 Image generation

In order to confirm that the learned hierarchical representations presented so far reasonably captured natural image statistics, we generated images from the model indicating how it might have interpreted a given image. More precisely, after giving a random natural image patch V to layer 0, we first took a sample H₁, H₂ from the posterior P(H₁, H₂|V), where H₁ and H₂ are the layer-1 and layer-2 variables, respectively. Then, we obtained the average of 100 images generated from the conditional probability P(V|H₁) and also the average of 100 images generated from P(V|H₂). Figure 9 shows the result, where each column displays an input natural image patch (top), three averaged images each generated from a different sample H₁ (middle), and three each generated from a different sample H₂ (bottom). As one can see, the generated images were quite similar to the input image, which indicates that the model had learned reasonable representations of natural images.
However, the model tended not to reproduce the input image precisely; it highlighted conspicuous features and ignored obscure ones, which is not surprising since the basis representations did not include tiny edge features (Figure 3(a)).

Figure 9: The model's "interpretation" of input images. Each column shows a random natural image patch (top), three averaged images generated from layer 1, each using a different sample from the posterior (middle), and three averaged images generated similarly from layer 2 (bottom).

4.4 Cross-orientation suppression

Cross-orientation suppression (Bonds, 1989) is a well-known divisive normalization phenomenon in V1 (Heeger, 1992). According to this, a typical cell reduces its response to an optimal grating when a mask grating with a non-optimal orientation is superimposed upon it. This behavior can easily be explained in our model, since the units are grouped together into multinomial variables. That is, consider a multinomial variable including two units, s₁ and s₂, that represent different features. If a stimulus with only s₁'s feature is presented, then only s₁ can be activated. However, if a stimulus with both features is presented, then both s₁ and s₂ can be activated, in which case the probability of s₁ being activated is reduced.
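A toy two-feature calculation makes the normalization effect explicit (the drive values below are illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One multinomial variable with units s1, s2 and a "background" state.
drive_single = np.array([3.0, 0.0, 0.0])  # grating with s1's feature only
drive_plaid  = np.array([3.0, 3.0, 0.0])  # mask adds equal drive to s2

print(softmax(drive_single)[0])  # P(s1) ~ 0.91
print(softmax(drive_plaid)[0])   # P(s1) ~ 0.49: suppressed, though s1's own drive is unchanged
```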
We confirmed that such cross-orientation suppression could be robustly reproduced in our network. Figure 10(a) shows the responses of ten exemplar layer-1 units to a single sinusoidal grating (covering the entire visual field) with a varied orientation (excitatory orientation tuning curve) and the responses to a plaid composed of an optimal grating and a mask grating with a varied orientation (mask orientation tuning curve). The orientation of the single grating, or of the mask grating of the plaid, was chosen from 16 orientations equally dividing 180 degrees; the spatial frequency obtained from the Gabor fitting and an amplitude of 0.5 were used; the phase that evoked the maximal response was adopted. Here, the response of a unit was defined as the probability of the unit being activated in 200 samples.

In the figure, the mask orientation tuning curve (solid line) of each unit has a peak at an orientation similar to that of the excitatory orientation tuning curve (broken line); thus the response to an optimal grating was suppressed by a non-optimal mask grating, though the extent of suppression varies from unit to unit. Observe the qualitative similarity between these simulated data and the responses of an actual V1 simple cell replotted in Figure 10(b). In general, most layer-1 units with a clear excitatory tuning curve also had a clear mask tuning curve with a similar peak orientation. More precisely, defining clear tuning curves as those that well fit a von Mises function (equation (9)) with R² > 0.5 and A > 0.5, we found 533 units out of 900 having a clear excitatory tuning curve, of which 490 (91.9%) also had a clear mask tuning curve whose peak orientation was within 22.5 degrees of that of the excitatory tuning curve. The units with unclear excitatory tuning curves mostly had tiny, badly shaped, or low-frequency receptive fields.

Figure 10: The cross-orientation suppression response properties of ten exemplar layer-1 units (a) and a V1 simple cell (b). The broken line shows the excitatory orientation tuning curve and the solid line shows the mask orientation tuning curve. Note the similar peak orientations in both curves. The data for the simple cell are replotted from the experimental study (Bonds, 1989). Abscissa: orientation in degrees; ordinate: response.

4.5 Filling-in

Filling-in is the phenomenon in which one perceives within the blind spot visual attributes similar to the surrounds even though it receives no retinal input (Ramachandran, 1992). Neural correlates have also been reported in V1 (Komatsu and Kinoshita, 2000; Matsumoto and Komatsu, 2005). In this work, we take a particular physiological result on filling-in using bar stimuli of varied lengths (Matsumoto and Komatsu, 2005) and attempt to explain it by Bayesian inference.

The task is illustrated in Figure 11. Suppose that a unit selective to a horizontal orientation has a receptive field that partially overlaps a blind spot. We present a bar stimulus of varied length over the receptive field while fixing the left end (a–d). When the bar is not long enough (c), one would not perceive the part of the bar intersecting the blind spot (e). However, when the right end exceeds the boundary of the blind spot (d), one would perceive the entire stimulus as a single bar (f). If the unit correlates with the perception, then the unit would not be activated in the former case, while it would in the latter. Such neurons indeed exist in V1 (Matsumoto and Komatsu, 2005); an example is illustrated in Figure 13(b). Note that cases (c) and (d) in Figure 11 are identical in terms of the retinal input to the receptive field, as shown in (e) and (f), since the right end of the bar appears outside the receptive field. Therefore, the information on the right end must be propagated to the unit through some indirect path. In our model, we assume that such propagation is carried out by feedback from the higher layer. Importantly, layer 2 has units representing combinations of similar orientations: this provides the ground for estimating that the orientation inside the blind spot is likely to be similar to the one outside.

Figure 11: Filling-in task. Bar stimuli of varied length are presented over the receptive field (RF) of a unit, (a–d) and (g–j). Under conditions (a–d), the receptive field partially overlaps with a blind spot (BS). When the right end of the bar stays within BS (c), the part covered by BS is perceived to be absent (e). When the bar end exceeds the boundary of BS (d), the whole stimulus is perceived as a single bar even though the retinal input to BS is absent (f). Conditions (g–j) are the cases without BS.
The proposed mechanism is illustrated in Figure 12, which shows the activations of nodes in a simplified network (A) before and (B) after the bar stimulus exceeds the blind spot boundary. Pay special attention to the rightmost unit, representing a horizontal orientation, of the layer-1 node (1b) indicated by the thick box. Before the bar stimulus exceeds the blind spot, the target unit is only weakly activated, since it receives almost only the weak bottom-up information on the partially visible horizontal bar; it receives scarce top-down information, since the lack of retinal input on the right-hand side activates no unit of node (1c) and therefore the layer-2 nodes are not provided with enough evidence of a horizontal orientation. However, once the right end of the bar becomes visible, the target unit is strongly activated, since the layer-2 nodes can now clearly detect the horizontal orientation and therefore exert a strong top-down influence on node (1b). In the biological visual system, feedback connections from V2 are a possible mechanism causing filling-in, although horizontal connections are an alternative (Komatsu and Kinoshita, 2000; Matsumoto and Komatsu, 2005; Satoh and Usui, 2008).

Figure 12: The proposed mechanism causing the filling-in phenomenon. The rightmost unit of node (1b) is poorly activated when it receives only weak bottom-up support for a horizontal orientation (A), while the unit is strongly activated when the context provides enough information on the horizontal orientation through the top-down process (B).

We have conducted a simulation of the above task using the network after learning all layers (where no blind spot was assumed during learning). We targeted a unit of the layer-1 node whose receptive field had a horizontal orientation in the central region (the rightmost unit shown in Figure 3(b)). We assumed a blind spot of size 7 × 4 located at position (13, 8) (with origin (0, 0)) and simulated it by "unfixing" the values of the layer-0 nodes within this region during inference; more precisely, we computed samples from the posterior distribution P(H, V_BS | V \ V_BS) instead of P(H|V), where V_BS denotes the layer-0 nodes within the blind spot. Figure 13(a) plots the average responses of the target unit to bar stimuli (of thickness 2 at y-position 9) of varied lengths, where the average was taken over 10 separate trials with the same stimulus, each time measuring the probability of the unit's activation in 2,000 samples taken after 1,000 discarded burn-in samples. As shown in Figure 13(a, left), the response increased markedly after the bar stimulus exceeded the blind spot (solid line), even though the retinal input within the receptive field was the same.
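In implementation terms, "unfixing" means that the Gibbs sampler of Section 2.2 also resamples the blind-spot visibles; since visible nodes have no children, each is drawn directly from its Gaussian conditional (3). A sketch (names are ours; hidden_conditional_of plays the role of the local conditional in equation (4)):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_blind_spot(hidden_state, vbs, hidden_conditional_of, vbs_mean_of,
                     sigma=1 / np.sqrt(4.5), n_courses=20):
    # Samples from P(H, V_BS | V \ V_BS): hidden nodes are resampled as
    # usual, then each blind-spot visible is resampled given its parents.
    for _ in range(n_courses):
        for i in range(len(hidden_state)):
            p = hidden_conditional_of(i, hidden_state, vbs)
            hidden_state[i] = rng.choice(len(p), p=p)
        for v in vbs:
            # Gaussian conditional (3): mean = sum of the sampled parents' weights
            vbs[v] = rng.normal(vbs_mean_of(v, hidden_state), sigma)
    return hidden_state, vbs
```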
To confirm that the above was indeed a filling-in effect, we ran a few control simulations. First, when we removed the blind spot (that is, fixing all the layer-0 nodes to some values during inference) as illustrated in Figure 11(g–j), the average responses of the same unit recorded after the bar stimulus exceeded the receptive field were no different from those recorded just before this crossing (broken line in Figure 13(a, left)). This means that the presence of the blind spot was essential. (Physiologically, the cases with and without a blind spot correspond to monocular stimulation of the eye with a blind spot, called the "blind-spot eye condition," and binocular stimulation, called the "non-blind-spot eye condition," respectively.) Furthermore, to check that the effect was caused by feedback, we conducted the same task without top-down influence on layer 1. This can be simulated by using the network before learning layer 2 but after learning layer 1, in which case the layer-1 nodes receive no top-down influence. As seen in Figure 13(a, right), the responses of the same unit did not change after the bar stimulus exceeded the receptive field, even when the blind spot was present. Note the qualitative similarity between Figures 13(a, left) and (b).

Figure 13: (a) The responses of a layer-1 unit (the rightmost unit in Figure 3(b)) to bar stimuli of varied lengths (corresponding to Figures 11(a–d)) when top-down signals are enabled (left) or disabled (right). A blind spot was either present (solid line) or absent (broken line). The dotted box in each panel indicates the length range corresponding to the receptive field of the recorded unit, while the filled box indicates the length range corresponding to the blind spot. When top-down signals were enabled, the response increased markedly only after the bar exceeded the blind spot. (b) The responses of a V1 neuron adapted from the experimental study (Matsumoto and Komatsu, 2005). Abscissa: bar length (pixels); ordinate: response.

5 Related work

5.1 Predictive coding models

A similar kind of interplay between hierarchical Bayesian inference and natural image learning has been studied in the seminal work on the predictive coding model (Rao and Ballard, 1999). Although the inference algorithm there is deterministic, it can be reformalized as Bayesian inference computing the most probable explanation of a given input. Using their inference and learning algorithms, the authors successfully reproduced a variety of classical and non-classical receptive field properties of early visual cortex, including end-stopping, pop-outs, and surround suppression.

However, one criticism of the original predictive coding model was the linear (additive or subtractive) interaction between top-down and bottom-up signals, which was inconsistent with known multiplicative effects of cortical feedback (McAdams and Maunsell, 1999; Treue, 2001). This issue was addressed by a more recent proposal of a nonlinear predictive coding model (Spratling, 2008; Spratling, 2010; Spratling, 2011), which modified the update rules in the inference algorithm in a multiplicative fashion. Thus, the new model successfully reproduced a wider range of non-classical properties of V1, including attentional effects (Spratling, 2008) as well as cross-orientation suppression and surround modulation (Spratling, 2010). However, the drawback was that the model no longer had a clear Bayesian interpretation or an objective function for inference and learning; consequently, the convergence of the computation was explained only informally (Spratling, 2011), and the result of the computation was rather difficult to interpret statistically.

In our model, the interaction between top-down and bottom-up signals is governed by Bayes' theorem, which is inherently multiplicative. Consequently, feedforward signals represent not prediction errors but "prediction matches," which have the effect of sharpening the bottom-up sensory signals that agree with the top-down predictions (Murray et al., 2004). This fact was crucial in our simulation of filling-in (Section 4.5) as well as in prior work reproducing attentional effects (Rao, 2005; Chikkerur et al., 2010).
However, one remaining issue is whether surround modulation, a prevalently observed phenomenon in V1, can also be explained in our model (see Section 6 for further discussion).

It is worth noting that predictive coding interprets neural activities as the values of continuous states, whereas our model regards them as the probabilities of discrete states. From current knowledge in physiology, it is far from clear which interpretation is more plausible. However, the probabilistic interpretation has recently drawn much attention from both experimental and theoretical sides under the name of the "sampling hypothesis" (Hoyer and Hyvärinen, 2002a; Fiser et al., 2010; Berkes et al., 2011; Buesing et al., 2011). In particular, (Berkes et al., 2011) demonstrated that the spontaneous activities of multiple neurons (putatively representing the prior distribution) became gradually closer to their average activities evoked by natural images (the average posterior) as the animal grew up. This result appears to be quite compatible with our learning scheme, which searches for an internal model as close as possible to the external visual world.

5.2 Restricted Boltzmann machine models

The sparse deep belief net is a stack of symmetric graphical models (i.e., restricted Boltzmann machines). Its authors used a fast learning and inference algorithm based on contrastive divergence (Lee et al., 2008) to reproduce a hierarchy of representations similar to V1 simple cells and V2; the latter was compared with a set of experimental data on V2 (Ito and Komatsu, 2004) complementary to the one we used. However, their model is not Bayesian, in the sense that inference in the hierarchy needs to be performed in a layer-by-layer manner and, as a result, the computation on the entire network does not yield a simple probabilistic interpretation. This theoretical drawback may be even more pronounced in a task involving complex inference. For example, in our filling-in simulation, a part of the visible variables is unfixed to represent a blind spot, and exactly the same inference algorithm can be used to obtain a posterior probability. In their case, on the other hand, some modification would be needed, since simple layer-by-layer computation is not sufficient; even then, the result would not have a straightforward interpretation.

5.3 Multinomial network models

Quite a few studies have used multinomial Bayesian networks for modeling visual cortex computation.
Multinomial variables have typically served as a handy way of modeling hypercolumns composed of feature-representing minicolumns (George and Hawkins, 2005; Rao, 2005; Ichisugi, 2007; Röhrbein et al., 2007; Chikkerur et al., 2010; Hosoya, 2010) or as a mechanism for explaining attention-driven divisive normalization (Rao, 2005; Chikkerur et al., 2010). However, to the best of our knowledge, there have been no proposals of natural image learning methods that work for hierarchical multinomial Bayesian networks.

One novelty of our work is to exploit multinomial variables to implicitly enforce sparsity. Sparsity is known to be an important constraint on unit activities for obtaining efficient feature representations, such as visual receptive field properties, from natural image patches (Olshausen and Field, 1996). Indeed, in a separate simulation using binary variables (sparsity 0.5), no meaningful features arose. However, our mechanism for sparsity may look rather unusual, since previous models typically control sparsity through a regularization term (Olshausen and Field, 1996; Lee et al., 2008), while ours hard-wires sparsity through the number of states allowed for a variable. This could be a potentially severe restriction, since a single sample of activation allows exactly one unit to be active for each variable, although our simulation nonetheless yielded various receptive field properties similar to those of actual visual neurons. An alternative approach would be to employ a network of binary variables with explicit sparsity enforced on each variable. Although similar results should in principle arise from such a binary model, it is widely considered that this approach requires a prohibitively large computation time (Hinton, 2007).

Lastly, it is worth pointing out that our model implicitly suggests that divisive normalization and sparsity, which are apparently different phenomena, could be caused by the same neural mechanism. This might make sense, since the most plausible way of implementing either would be horizontal inhibition (Kisvárday et al., 1997).

6 Conclusion

In this article, we have proposed a hierarchical probabilistic model of visual cortex and studied two kinds of interplay between natural image learning and Bayesian inference. First, we have shown that maximal-likelihood learning using Gibbs-sampling-based Bayesian inference can yield a hierarchy of feature representations similar to the receptive fields of early visual cortex. Second, we have demonstrated that Bayesian inference performed on the network after learning can reproduce two kinds of context-sensitive response properties, namely, cross-orientation suppression and filling-in. In particular, the latter exploits feedback processing from the higher layer, which has learned V2-like features representing combinations of similar orientations.

One important non-classical property of V1 not addressed in the current work is surround suppression, which has been explained by several other statistical models of natural images (Rao and Ballard, 1999; Schwartz and Simoncelli, 2001; Spratling, 2010). We speculate that this property could in principle be modeled by the so-called "explaining-away" effect, in which one explanation for a piece of evidence can rule out another potential explanation. In our case, an orientation appearing in the surround would explain away the same orientation appearing in the center, thus causing suppression.
6 Conclusion

In this article, we have proposed a hierarchical probabilistic model of visual cortex and studied two kinds of interplay between natural image learning and Bayesian inference. First, we have shown that maximal-likelihood learning using Gibbs-sampling-based Bayesian inference can yield a hierarchy of feature representations similar to the receptive fields of early visual cortex. Second, we have demonstrated that Bayesian inference performed on the network after learning can reproduce two kinds of context-sensitive response properties, namely, cross-orientation suppression and filling-in. In particular, the latter exploits feedback processing from the higher layer, which has learned V2-like features representing combinations of similar orientations.

One important non-classical property of V1 not addressed in the current work is surround suppression, which has been explained by several other statistical models of natural images (Rao and Ballard, 1999; Schwartz and Simoncelli, 2001; Spratling, 2010). We speculate that this property could in principle be modeled by the so-called "explaining-away" effect, in which one explanation for a piece of evidence rules out another potential explanation. In our case, an orientation appearing in the surround would explain away the same orientation appearing in the center, thus causing suppression. Note that the explaining-away effect occurs in a Bayesian model as an interaction among multiple variables at the same level (with common subordinate variables) and therefore could coexist with top-down effects like the one used in our filling-in simulation. However, to enable such a surround-suppression effect in our multinomial framework, an extension seems necessary in which a visual feature to be explained away can be "turned off"; we hope to incorporate this in a future version of our model.

Our eventual goal is to build a model with a much deeper hierarchy. For this, invariance properties are crucially needed, such as phase invariance in V1 complex cells and size invariance in higher areas. Although our aim is not to explain the full details of neural responses, invariance properties seem key to a robust reproduction of the major properties of higher areas. Since temporal structure is known to help acquire invariance, we plan to incorporate trace learning into our learning algorithm (Földiak, 1991; Hurri and Hyvärinen, 2003; Stringer et al., 2007; Cadieu and Olshausen, 2009).

Acknowledgments

This work has been supported by the PRESTO program of the Japan Science and Technology Agency in the research area "Decoding and Controlling Brain Information." The author especially thanks Florian Röhrbein, Taro Toyoizumi, and Shun-ichi Amari for valuable discussions.

References

Anzai, A., Peng, X., and Essen, D. C. V. (2007). Neurons in monkey visual area V2 encode combinations of orientations. Nature Neuroscience, 10:1313–1321.
Bell, A. and Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37:3327–3338.
Berkes, P., Orbán, G., Lengyel, M., and Fiser, J. (2011). Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83–87.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Bonds, A. B. (1989). Role of inhibition in the specification of orientation selectivity of cells in the cat striate cortex. Visual Neuroscience, 2:41–55.
Buesing, L., Bill, J., Nessler, B., and Maass, W. (2011). Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Computational Biology, 7(11):e1002211.
Buxhoeveden, D. P. and Casanova, M. F. (2002). The minicolumn hypothesis in neuroscience. Brain, 125(5):935–951.
Cadieu, C. F. and Olshausen, B. A. (2009). Learning transformational invariants from natural movies. In Advances in Neural Information Processing Systems, volume 21, pages 209–216.
Chikkerur, S., Serre, T., Tan, C., and Poggio, T. (2010). What and where: A Bayesian inference theory of attention. Vision Research, 55(22):2233–2247.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 7(2):1160–1169.
Deneve, S. (2004). Bayesian multisensory integration and cross-modal spatial links. Journal of Physiology-Paris, 98(1–3):249–258.
DeValois, R. L., Albrecht, D. G., and Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22:545–559.
Doya, K., Ishii, S., Pouget, A., and Rao, R. P. N., editors (2007). Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press.
Fiser, J., Berkes, P., Orbán, G., and Lengyel, M. (2010). Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences, 14(3):119–130.
Földiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2):194–200.
George, D. and Hawkins, J. (2005). A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. In Proceedings of the International Joint Conference on Neural Networks, volume 3, pages 1812–1817.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181–197.
Hegdé, J. and Essen, D. C. V. (2000). Selectivity for complex shapes in primate visual area V2. Journal of Neuroscience, 20(5):RC61.
Hinton, G. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11(10):428–434.
Hosoya, H. (2010). Bayesian interpretation of border-ownership signals in early visual cortex. In International Conference on Neural Information Processing (Lecture Notes in Computer Science, Volume 6443/2010), pages 1–8.
Hoyer, P. O. and Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from colour and stereo images. Network: Computation in Neural Systems, 11:191–210.
Hoyer, P. O. and Hyvärinen, A. (2002a). Interpreting neural response variability as Monte Carlo sampling of the posterior. In Advances in Neural Information Processing Systems, volume 15, pages 277–284.
Hoyer, P. O. and Hyvärinen, A. (2002b). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42:1593–1605.
Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195:215–243.
Hurri, J. and Hyvärinen, A. (2003). Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3):663–691.
Ichisugi, Y. (2007). A cerebral cortex model that self-organizes conditional probability tables and executes belief propagation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2007), pages 178–183.
Ito, M. and Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. The Journal of Neuroscience, 24(13):3313–3324.
Jones, J. P. and Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258.
Kersten, D. and Yuille, A. (2003). Bayesian models of object perception. Current Opinion in Neurobiology, 13:150–158.
Kisvárday, Z. F., Tóth, É., Rausch, M., and Eysel, U. T. (1997). Orientation-specific relationship between populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cerebral Cortex, 7(7):605–618.
Komatsu, H. and Kinoshita, M. (2000). Neural responses in the retinotopic representation of the blind spot in the macaque V1 to stimuli for perceptual filling-in. The Journal of Neuroscience, 20(24):9310–9319.
Lee, H., Ekanadham, C., and Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, volume 20, pages 873–880.
Lee, T. S. and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434–1448.
Matsumoto, M. and Komatsu, H. (2005). Neural responses in the macaque V1 to bar stimuli with various lengths presented on the blind spot. Journal of Neurophysiology, 93:2374–2387.
McAdams, C. J. and Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. The Journal of Neuroscience, 19(1):431–441.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120:701–722.
Murray, S. O., Schrater, P., and Kersten, D. (2004). Perceptual grouping and the interactions between visual cortical areas. Neural Networks, 17:695–705.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
Olshausen, B. A. (2002). Sparse codes and spikes. In Probabilistic Models of the Brain: Perception and Neural Function, pages 257–272. MIT Press.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609.
Olshausen, B. A. and Field, D. J. (2003). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.
Park, K. Y., Jabri, M., Lee, S. Y., and Sejnowski, T. (2000). Independent components of optical flows have MSTd-like receptive fields. In Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, pages 597–602.
Parker, A. J. and Hawken, M. J. (1988). Two-dimensional spatial structure of receptive fields in monkey striate cortex. Journal of the Optical Society of America A, 5:598–605.
Pearl, J. (1997). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Petkov, N. and Kruizinga, P. (1997). Computational models of visual neurons specialised in the detection of periodic and aperiodic oriented visual stimuli: bar and grating cells. Biological Cybernetics, 76(2):83–96.
Ramachandran, V. S. (1992). Blind spots. Scientific American, 266:86–91.
Rao, R. P. (2005). Bayesian inference and attentional modulation in the visual cortex. Neuroreport, 16(16):1843–1848.
Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87.
Rao, R. P. N., Olshausen, B. A., and Lewicki, M. S., editors (2002). Probabilistic Models of the Brain: Perception and Neural Function. MIT Press.
Ringach, D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–463.
Röhrbein, F., Eggert, J., and Körner, E. (2007). A cortex-inspired neural-symbolic network for knowledge representation. In Proceedings of the IJCAI Workshop on Neural-Symbolic Learning and Reasoning.
Satoh, S. and Usui, S. (2008). Computational theory and applications of a filling-in process at the blind spot. Neural Networks, 21:1261–1271.
Schwartz, O. and Simoncelli, E. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825.
Sigman, M., Cecchi, G. A., Gilbert, C. D., and Magnasco, M. O. (2001). On a common circle: natural scenes and Gestalt rules. Proceedings of the National Academy of Sciences of the United States of America, 98(4):1935–1940.
Spratling, M. W. (2008). Predictive coding as a model of biased competition in visual attention. Vision Research, 48(12):1391–1408.
Spratling, M. W. (2010). Predictive coding as a model of response properties in cortical area V1. Journal of Neuroscience, 30(9):3531–3543.
Spratling, M. W. (2011). Unsupervised learning of generative and discriminative weights encoding elementary image components in a predictive coding model of cortical function. Neural Computation, in press.
Stringer, S. M., Rolls, E. T., and Tromans, J. (2007). Invariant object recognition with trace learning and multiple stimuli present during training. Network: Computation in Neural Systems, 18:161–187.
Tanaka, K. (2003). Columns for complex visual object features in the inferotemporal cortex: clustering of cells with similar but slightly different stimulus selectivities. Cerebral Cortex, 13:90–99.
Treue, S. (2001). Neural correlates of attention in primate visual cortex. Trends in Neurosciences, 24(5):295–300.
van Hateren, J. H. and van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society B: Biological Sciences, 265(1394):359–366.

A Derivation of the learning algorithm

This section proves that our learning algorithm can be derived as a stochastic gradient method on the expected log likelihood. Below, we consider the derivation separately for the two kinds of parameters to optimize, namely, the weights w_{h,u} used for hidden variables H (Softmax) and the weights w_u^V for visible variables V (Gaussian).

Softmax case. First, the partial derivative of the expected log likelihood \mathbb{E}^*_{\hat{v}}[\log P(\hat{v}|w)] with respect to a parameter w_{h,u} can be transformed as follows:

\frac{\partial \mathbb{E}^*_{\hat{v}}[\log P(\hat{v}|w)]}{\partial w_{h,u}}
  = \mathbb{E}^*_{\hat{v}}\left[ \frac{1}{P(\hat{v}|w)} \frac{\partial P(\hat{v}|w)}{\partial w_{h,u}} \right]   (11)
  = \mathbb{E}^*_{\hat{v}}\left[ \frac{1}{P(\hat{v}|w)} \sum_{\hat{h}} \frac{\partial P(\hat{v},\hat{h}|w)}{\partial w_{h,u}} \right]   (12)
  = \mathbb{E}^*_{\hat{v}}\left[ \sum_{\hat{h}} \frac{P(\hat{h}|\hat{v},w)}{P(\hat{v},\hat{h}|w)} \frac{\partial P(\hat{v},\hat{h}|w)}{\partial w_{h,u}} \right]   (13)
  = \mathbb{E}^*_{\hat{v},\hat{h}}\left[ \frac{1}{P(\hat{v},\hat{h}|w)} \frac{\partial P(\hat{v},\hat{h}|w)}{\partial w_{h,u}} \right]   (14)
  = \mathbb{E}^*_{\hat{x}}\left[ \frac{1}{P(\hat{x}|w)} \frac{\partial P(\hat{x}|w)}{\partial w_{h,u}} \right]   (15)

In step (14), the expectation is taken under the distribution P^* generalized such that P^*(\hat{v},\hat{h}|w) = P(\hat{h}|\hat{v},w)\,P^*(\hat{v}). Now, note that the joint distribution factorizes as P(\hat{x}|w) = \prod_{\hat{x} \in \hat{x}} P(\hat{x}|\mathrm{pa}(\hat{x}),w) (equation 1), of which only the factor P(\hat{h}|\mathrm{pa}(\hat{h}),w) uses the weight w_{h,u} (equation 2). Therefore,

(15) = \mathbb{E}^*_{\hat{h},\hat{u},\hat{t}}\left[ \frac{1}{P(\hat{h}|\hat{u},\hat{t},w)} \frac{\partial P(\hat{h}|\hat{u},\hat{t},w)}{\partial w_{h,u}} \right],   (16)

where T = \mathrm{pa}(H) \setminus \{U\}. Further, from the Softmax model (equation 2), the following can be derived by a little manipulation:

\frac{\partial P(\hat{h}|\hat{u},\hat{t},w)}{\partial w_{h,u}} = \begin{cases} P(h|u,\hat{t},w)\,(1 - P(h|u,\hat{t},w)) & (\hat{u}=u,\ \hat{h}=h) \\ -P(h|u,\hat{t},w)\,P(\hat{h}|u,\hat{t},w) & (\hat{u}=u,\ \hat{h}\neq h) \\ 0 & (\hat{u}\neq u) \end{cases}   (17)

Taken together, these cases can be rewritten as

\frac{\partial P(\hat{h}|\hat{u},\hat{t},w)}{\partial w_{h,u}} = P(\hat{h}|\hat{u},\hat{t},w)\,\delta_{u,\hat{u}}\,(\delta_{h,\hat{h}} - P(h|\hat{u},\hat{t},w)).   (18)

Putting this back into formula (16), the factor P(\hat{h}|\hat{u},\hat{t},w) cancels and we obtain

(16) = \mathbb{E}^*_{\hat{h},\hat{u},\hat{t}}\left[ \delta_{u,\hat{u}}\,(\delta_{h,\hat{h}} - P(h|\hat{u},\hat{t},w)) \right].   (19)

By replacing the computation of the expectation with the average over samples taken by Gibbs sampling, we derive the weight update rule (6).

Gaussian case. Exactly as in the Softmax case, the partial derivative of the expected log likelihood with respect to a parameter w_u^V is transformed as

\frac{\partial \mathbb{E}^*_{\hat{v}}[\log P(\hat{v}|w)]}{\partial w_u^V} = \mathbb{E}^*_{\hat{v},\hat{u},\hat{t}}\left[ \frac{1}{P(\hat{v}|\hat{u},\hat{t},w)} \frac{\partial P(\hat{v}|\hat{u},\hat{t},w)}{\partial w_u^V} \right].   (20)

From the Gaussian model (equation 3), we can derive the following:

\frac{\partial P(\hat{v}|\hat{u},\hat{t},w)}{\partial w_u^V} = \begin{cases} \frac{1}{\sigma^2} \left( \hat{v} - \left( \sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \right) \right) P(\hat{v}|\hat{u},\hat{t},w) & (\hat{u}=u) \\ 0 & (\hat{u}\neq u) \end{cases}   (21)

Taken together, these cases can be written as

\frac{\partial P(\hat{v}|\hat{u},\hat{t},w)}{\partial w_u^V} = \frac{1}{\sigma^2}\,\delta_{u,\hat{u}} \left( \hat{v} - \left( \sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \right) \right) P(\hat{v}|\hat{u},\hat{t},w).   (22)

Putting this back into (20), the factor P(\hat{v}|\hat{u},\hat{t},w) cancels and we obtain

(20) = \mathbb{E}^*_{\hat{v},\hat{u},\hat{t}}\left[ \frac{1}{\sigma^2}\,\delta_{u,\hat{u}} \left( \hat{v} - \left( \sum_i w_{\hat{t}_i}^V + w_{\hat{u}}^V \right) \right) \right].   (23)

The weight update rule (equation 7) is derived by replacing the expectation with the average over samples.
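For concreteness, the following minimal sketch shows how the sampled expectations (19) and (23) translate into stochastic-gradient weight updates of the kind given by rules (6) and (7). All names are illustrative assumptions rather than the simulation code: `samples` is assumed to hold tuples of Gibbs-sampled states, `posterior` is assumed to compute P(h | û, t̂, w), and parent states are assumed to be indexed into flat arrays for simplicity.

```python
import numpy as np

def softmax_update(w, samples, posterior, lr=0.01):
    """Gradient step for softmax weights w[h, u], following (19):
    the average over samples of
    delta(u, u_hat) * (delta(h, h_hat) - P(h | u_hat, t_hat, w))."""
    n_h, _ = w.shape
    grad = np.zeros_like(w)
    for h_hat, u_hat, t_hat in samples:   # Gibbs samples of (H, U, T)
        for h in range(n_h):
            # delta(u, u_hat) is implicit: only column u_hat is touched.
            grad[h, u_hat] += (h == h_hat) - posterior(h, u_hat, t_hat)
    return w + lr * grad / len(samples)

def gaussian_update(wV, samples, sigma2, lr=0.01):
    """Gradient step for visible weights w_u^V, following (23):
    the average over samples of
    delta(u, u_hat) / sigma^2 * (v_hat - (sum_i w_{t_i}^V + w_{u_hat}^V))."""
    grad = np.zeros_like(wV)
    for v_hat, u_hat, t_idx in samples:   # Gibbs samples of (V, U, T)
        residual = v_hat - (wV[list(t_idx)].sum() + wV[u_hat])
        grad[u_hat] += residual / sigma2
    return wV + lr * grad / len(samples)
```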