Report
Distinct Computational Principles Govern
Multisensory Integration in Primary Sensory and
Association Cortices
Highlights
- Multisensory interactions differ computationally across the cortical hierarchy
- In visual cortex, spatial disparity controlled the influence of auditory signals
- In parietal cortex, spatial disparity determined the effect of task-irrelevant signals
- Only parietal cortices weighted signals by their reliability and task relevance
Rohe & Noppeney, 2016, Current Biology 26, 509–514
February 22, 2016 © 2016 Elsevier Ltd All rights reserved
http://dx.doi.org/10.1016/j.cub.2015.12.056
Authors
Tim Rohe, Uta Noppeney
Correspondence
[email protected] (T.R.),
[email protected] (U.N.)
In Brief
Rohe and Noppeney show that
multisensory interactions are pervasive
but governed by distinct computational
principles across the cortical hierarchy.
Critically, only parietal cortices integrated
signals weighted by their bottom-up
sensory reliabilities and top-down task
relevance into multisensory spatial
priority maps.
Current Biology
Report
Distinct Computational Principles
Govern Multisensory Integration
in Primary Sensory and Association Cortices
Tim Rohe1,2,* and Uta Noppeney1,3,*
1Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany
2Department of Psychiatry and Psychotherapy, University of Tübingen, Calwerstrasse 14, 72076 Tübingen, Germany
3Centre for Computational Neuroscience and Cognitive Robotics, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
*Correspondence: [email protected] (T.R.), [email protected] (U.N.)
http://dx.doi.org/10.1016/j.cub.2015.12.056
SUMMARY
Human observers typically integrate sensory signals
in a statistically optimal fashion into a coherent
percept by weighting them in proportion to their reliabilities [1–4]. An emerging debate in neuroscience concerns the extent to which multisensory integration emerges already in primary sensory areas or is deferred to higher-order association areas [5–9]. This fMRI study
used multivariate pattern decoding to characterize
the computational principles that define how auditory and visual signals are integrated into spatial
representations across the cortical hierarchy. Our results reveal small multisensory influences that were
limited to a spatial window of integration in primary
sensory areas. By contrast, parietal cortices integrated signals weighted by their sensory reliabilities
and task relevance in line with behavioral performance and principles of statistical optimality. Intriguingly, audiovisual integration in parietal cortices
was attenuated for large spatial disparities when signals were unlikely to originate from a common
source. Our results demonstrate that multisensory
interactions in primary and association cortices are
governed by distinct computational principles. In primary visual cortices, spatial disparity controlled the
influence of non-visual signals on the formation of
spatial representations, whereas in parietal cortices,
it determined the influence of task-irrelevant signals.
Critically, only parietal cortices integrated signals
weighted by their bottom-up reliabilities and top-down task relevance into multisensory spatial priority maps to guide spatial orienting.
RESULTS
Our senses are exposed to a constant influx of signals. To make
sense of this cacophony, the brain needs to solve two computational challenges: first, it needs to determine which signals
emanate from a common source based on them co-occurring
in time (e.g., temporal synchrony) and space (e.g., spatial
disparity) [4, 10, 11]. Second, it needs to integrate signals from
a common source into a statistically optimal percept by weighting them in proportion to their reliabilities [1–3]. To determine the
functional relevance and computational principles that govern
multisensory interactions across the cortical hierarchy, we presented five participants with synchronous audiovisual spatial
signals that varied in their spatial disparity and visual reliability
(Figures 1A and 1B). On each trial, participants reported their
perceived location of the auditory or visual signal. The study
was approved by the human research review committee of the
University of Tübingen.
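For reference, the statistically optimal weighting referred to above (the "forced fusion" model of [1–3]) can be stated compactly as the reliability-weighted average

    S_AV = w_V S_V + w_A S_A,   with w_V = σ_V^{-2} / (σ_A^{-2} + σ_V^{-2}) and w_A = 1 − w_V,

where σ_A^2 and σ_V^2 denote the variances (inverse reliabilities) of the unisensory auditory and visual estimates, so that the less reliable signal receives proportionally less weight.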
Combining psychophysics and multivariate fMRI pattern decoding, we characterized how human observers integrate auditory and visual signals into spatial representations in terms of the
audiovisual weight index wAV that quantifies the influence of the
true auditory and visual locations on (1) the perceived/reported
auditory and visual spatial estimates (i.e., participants’ behavioral localization responses; Figure 1C) and (2) the spatial estimates decoded from regions of interest along the auditory [12]
and visual [13] dorsal processing hierarchy (Figure 2). This audiovisual weight index ranges from pure visual (90°) to pure auditory (0°) influence. We performed the statistics on the behavioral and neural audiovisual weight indices using a two (auditory versus visual report) × two (high versus low visual reliability) × two (large versus small spatial disparity) factorial design based on circular
statistics [14].
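A minimal sketch of how such an audiovisual weight index can be obtained (illustrative Python only, not the code used in the study; the full analysis used condition-wise regressions and circular statistics as described in the Supplemental Experimental Procedures, and all variable names here are hypothetical):

import numpy as np

def av_weight_index(reported, loc_a, loc_v):
    # Regress reported (or decoded) locations on the true auditory and
    # visual locations; wAV is the four-quadrant inverse tangent of the
    # visual over the auditory regression weight (90 deg = purely visual,
    # 0 deg = purely auditory influence).
    X = np.column_stack([loc_a, loc_v])
    beta_a, beta_v = np.linalg.lstsq(X, reported, rcond=None)[0]
    return np.degrees(np.arctan2(beta_v, beta_a))

# Hypothetical example: responses dominated by the visual location.
rng = np.random.default_rng(0)
loc_a = rng.choice([-10.0, -3.3, 3.3, 10.0], size=100)
loc_v = rng.choice([-10.0, -3.3, 3.3, 10.0], size=100)
reported = 0.8 * loc_v + 0.2 * loc_a + rng.normal(0, 1, size=100)
print(av_weight_index(reported, loc_a, loc_v))   # approximately 76 deg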
Behavioral Results
Our results demonstrate that participants integrated auditory
and visual signals weighted by their reliabilities and task relevance (see Table 1; see Figure S1 for histograms of reported
signal locations across all conditions). The relative influence of
the visual signal on participants’ perceived location was greater
for high relative to low visual reliability (main effect of visual
reliability: p < 0.001; permutation testing of a likelihood ratio
test statistic). Moreover, it was greater when the location of
the visual signal needed to be reported than when the location
of the auditory signal needed to be reported (main effect of
task relevance: p < 0.001). Thus, participants flexibly adjusted
the weights according to the task-relevant sensory modality.
As a consequence, they reported different auditory and
visual locations for identical audiovisual signals. Critically, this
difference significantly increased for large (>6.6°) relative to small (≤6.6°) spatial disparities. In other words, audiovisual
Figure 1. Example Trial, Experimental Design, and Behavioral Data
(A) In a ventriloquist paradigm, participants were presented with synchronous audiovisual signals originating from four possible locations along the azimuth. The
visual signal was a cloud of white dots. The auditory signal was a brief burst of white noise. Participants localized either the auditory or the visual signal (n.b. for
illustrational purposes the visual angles of the cloud have been scaled in a non-uniform fashion in this scheme).
(B) The four-factorial experimental design manipulated (1) the location of the visual (V) signal (−10°, −3.3°, 3.3°, and 10°), (2) the location of the auditory (A) signal (−10°, −3.3°, 3.3°, and 10°), (3) the reliability of the visual signal (high [VR+] versus low [VR−] reliability as defined by the spread of the visual cloud), and (4) task relevance (auditory versus visual report). Using fMRI, we measured activation patterns to audiovisual signals of all experimental conditions from voxels of regions along the auditory and visual spatial-processing hierarchies.
(C) Behavioral results: audiovisual weight index wAV (across-participants circular mean and double-bootstrapped 68% confidence interval; n = 5) was computed as the angle between the auditory and visual regression coefficients. Audiovisual weight index wAV as a function of audiovisual spatial disparity (small [≤6.6°] versus large [>6.6°]), task relevance (auditory versus visual report), and visual reliability (high [VR+] versus low [VR−]) are shown. For a purely visual influence, wAV is 90°. For a purely auditory influence, it is 0°.
See also Figure S1.
integration broke down when auditory and visual signals were
far apart and more likely to be caused by independent sources
(i.e., a significant interaction between task relevance and spatial
disparity; p = 0.015).
fMRI Decoding across the Cortical Hierarchy
To characterize how auditory and visual signals were integrated
into spatial representations at the neural level, we combined
fMRI with multivariate pattern decoding. Based on a support-vector regression model trained on audiovisual spatially congruent
trials, we decoded a brain area’s spatial estimate of spatially
disparate audiovisual signals. First, we ensured that we could
decode the spatial estimate for congruent trials significantly
better than chance in all eight regions of interest (Table S1). Using
the same analysis approach as for behavioral localization
responses, we then investigated how the neural audiovisual
weight wAV index was affected by visual reliability, task-relevant
sensory modality, and spatial disparity (Figures 2A–2D). As the
two (auditory versus visual report) × two (high versus low visual reliability) × two (large versus small spatial disparity) repeated-measures analysis did not reveal a significant three-way interaction (Table 1), Figure 2 presents the neural audiovisual weights
separately as a function of visual reliability, task relevance, and
spatial disparity (Figures 2A–2C) and of both task relevance and
spatial disparity (Figure 2D).
Effect of Sensory Reliability on Audiovisual Integration
First, we asked which regions integrate auditory and visual signals weighted by their reliability as expected from principles of
statistical optimality [1–3] and participants’ behavioral localiza-
tion responses (Figure 2A). Surprisingly, visual reliability did not
significantly influence audiovisual weighting in lower visual or
auditory areas. Only higher parietal cortices (IPS0–IPS4) were
governed by the classical reliability-driven reweighting with
more weight being given to the auditory signal when the visual
signal was unreliable. Whereas IPS0–4 mainly represented the
location of the visual signal for high visual reliability (i.e., the audiovisual weight index was approximately 90°), its spatial estimate shifted toward the location of the concurrent auditory
signal for low visual reliability.
Effect of Task Relevance on Audiovisual Integration and
Its Interaction with Spatial Disparity
Next, we asked where auditory and visual signals were integrated into spatial representations weighted by their task relevance (Figure 2B). Whereas we found a marginally significant
main effect of task relevance (i.e., visual versus auditory report)
on the audiovisual weight index already in higher-order auditory
areas (hA) encompassing the belt and the planum temporale,
the effect emerged predominantly in higher-order association
areas such as IPS0–4 (cf. Table 1). In these areas, the visual
signal exerted a stronger influence on the decoded location during visual than auditory report. Thus, both planum temporale and
IPS0–4 formed different spatial estimates for identical audiovisual stimuli depending on which sensory modality was attended
and reported.
Importantly, the difference between spatial estimates for auditory and visual report was further increased in IPS3–4, when the
spatial disparity between auditory and visual signals was large
(i.e., significant interaction between task relevance and spatial disparity; cf. Table 1). In other words, when audiovisual spatial disparity was large and signals were unlikely to emanate from a common event, audiovisual interactions broke down and IPS3–4 predominantly represented the location of the task-relevant signal. Thus, spatial disparity controlled the influence of the task-irrelevant signal on the spatial estimate in IPS3–4.
Effect of Spatial Disparity on Audiovisual Integration
In contrast to the interaction between task relevance and spatial
disparity that was found in parietal areas, we observed a main effect of spatial disparity in low-level visual areas V1 and, marginally
significant, in V2 (cf. solid lines are below dotted lines in V1 and
V2; Figure 2C; Table 1). Only for small spatial disparities did auditory signals exert an "attractive" influence on the spatial representations decoded from low-level visual areas (cf. solid lines below 90° in V1; p = 0.094 in a one-sample permutation test in Figure 2C). Likewise, we observed a limited but significant attractive
influence of visual signals on spatial representations decoded
from auditory areas for small spatial disparities (solid lines above
0° in A1 in Figure 2C; p = 0.032 for unidirectional hypothesis in a
one-sample permutation test). These results suggest that integration in low-level sensory areas depends on auditory and visual
signals co-occurring within a spatial window [15]. In short, spatial
disparity controls the influence of the non-preferred sensory signals on the spatial estimates in low-level sensory areas.
Note, however, that spatial disparity was inherently correlated
with the eccentricity of the audiovisual signals by virtue of the
factorial and spatially balanced nature of our design. Whereas
signals were presented para-foveally or peripherally for small-disparity trials, they were presented in the periphery for large-disparity trials.
Interaction between Spatial Disparity and Visual
Reliability
For completeness, we also observed an interaction between reliability and spatial disparity in V3AB. This interaction results from
a larger spatial window of integration for less-reliable sensory
signals (see [16]). Basically, it is easier to determine that two signals come from different sources when the visual input is reliable.
Figure 2. fMRI Results
Audiovisual weight index as a function of visual reliability, task relevance, and
disparity and its correlation with the corresponding behavioral weight index in
the regions of interest. Audiovisual weight index wAV (across-participants
circular mean and double-bootstrapped 68% confidence interval; n = 5) was
computed as the angle between the auditory and visual regression coefficients. For a purely visual region, wAV is 90°. For a purely auditory region, it is 0°. Asterisks indicate the statistical significance of effects on wAV derived
from a circular log-likelihood ratio statistic.
(A) Audiovisual weight index wAV as a function of visual reliability (high [VR+]
versus low [VR−]).
(B) Audiovisual weight index wAV as a function of task relevance (auditory [A]
versus visual [V] report).
(C) Audiovisual weight index wAV as a function of audiovisual spatial disparity
(small [≤6.6°; D−] versus large [>6.6°; D+]).
(D) Audiovisual weight index wAV in IPS3–4 as a function of task relevance and
disparity.
(E) Circular-circular correlation (across-participants mean after Fisher
z-transformation ± SEM; n = 5) between the neural weight index wAV and the
equivalent behavioral weight index in the regions of interest.
See also Figure S2 and Tables S1 and S2.
Relation of Neural and Behavioral Weight Indices of
Audiovisual Spatial Integration
Finally, we asked how and where along the cortical hierarchies
the neural audiovisual weights were related to the behavioral
audiovisual weights. Hence, we computed the correlation between the neural and behavioral weight indices for each of the regions of interest. The correlation coefficient increased along the
visual processing hierarchy culminating in IPS3–4 (Figure 2E).
Likewise, in the auditory system, the correlation between neural
and behavioral weights was enhanced in higher-order auditory
areas relative to primary auditory cortex.
To further investigate which region predominantly drove participants’ perceptual localization responses, we decoded the reported signal location from fMRI activation patterns while accounting for the physical signal location and visual reliability. IPS3–4
showed the greatest, and the only significant, correlation coefficient between true and decoded reported locations (Table S1).
Collectively, these results suggest that audiovisual integration
processes in higher-order visual (in particular IPS3–4) and auditory areas are closely related to participants’ trial-by-trial
perceived stimulus location.
Table 1. Statistical Significance of Main and Interaction Effects of the Factors Visual Reliability, Task Relevance, and Spatial Disparity for the Behavioral and Neural Audiovisual Weight Index wAV

            VR p     TR p      S p      VR × TR p  VR × S p  TR × S p  VR × TR × S p
Behavior    <0.001*  <0.001*   0.607    0.002*     0.186     0.015*    0.752
V1          1        0.942     0.005*   0.961      0.227     1         0.999
V2          0.227    1         0.059    0.92       0.824     0.424     0.847
V3          1        0.665     1        0.811      0.904     1         1
V3AB        0.974    0.745     0.589    0.992      0.040*    1         0.997
IPS0–2      0.022*   0.047*    1        0.103      0.430     1         1
IPS3–4      0.028*   <0.001*   0.999    0.949      0.994     0.021*    0.999
hA          0.997    0.066     1        0.984      1         0.979     1
A1          0.433    0.468     0.678    0.979      1         1         0.910

p values are based on permutation tests using a circular log-likelihood ratio statistic. For the neural weight index wAV, they are corrected for multiple comparisons across the eight regions of interest. n = 5. Asterisks indicate significant p values. VR, visual reliability; TR, task relevance; S, spatial disparity. See also Tables S2 and S3.

Controlling for Eye Movements and Hemifield of Signals as Potential Confounds
To address potential concerns that our decoding results may be confounded by eye movements, we performed a series of
control analyses. First, we evaluated participants’ eye movements based on eye-tracking data recorded concurrently during
fMRI acquisition. Fixation was well maintained throughout
the experiment with post-stimulus saccades detected in only
2.293% ± 1.043% (mean ± SEM) of the trials. Moreover, four (visual location) × four (auditory location) × two (visual reliability) × two (task relevance) repeated measures ANOVAs performed
separately for (1) % saccades or (2) % eye blinks revealed no significant main effects or interactions (Table S2). The repeated
measures ANOVA on post-stimulus mean horizontal eye position
(0–875 ms post-stimulus onset) revealed no significant effects
either. Small trends were observed for the main effects of task relevance and visual location.
As a further control analysis, we therefore re-performed the
linear regression analyses to compute the neural weight index
wAV (with fMRI-decoded spatial location as dependent variable;
see the Supplemental Experimental Procedures) and included
post-stimulus mean horizontal eye position as a nuisance covariate in addition to the true auditory and visual locations to predict
the fMRI-decoded locations. This analysis basically replicated
our initial results (Figure S2; Table S3).
Finally, we investigated the effect of within/across hemifield of
presentation on our results (i.e., wAV) by including a nuisance variable that coded whether the auditory and visual signals were
presented in the same or different hemifield in the linear regression analysis (with fMRI-decoded spatial location as dependent
variable). This analysis again basically replicated our initial results (Figure S2; Table S3).
DISCUSSION
This study combined psychophysics and multivariate fMRI
pattern decoding to characterize how the brain integrates audiovisual signals into spatial representations along the auditory [12]
and visual [13] processing hierarchies. Our results demonstrate
that distinct computational principles govern audiovisual interactions in primary sensory and higher-order association areas.
Accumulating evidence has demonstrated that multisensory
integration is not deferred to association cortices [17–22]
but starts already at the primary, putatively unisensory level
[23–28] via thalamo-cortical mechanisms [27], direct connectiv-
ity between sensory areas [29], or top-down influences from
higher-order association cortices [30]. Our data also reveal bidirectional audiovisual influences at the primary cortical level. In
particular, the auditory location influenced the spatial estimate
encoded in primary visual cortex. In line with the spatial principle
of multisensory integration [15], a concurrent auditory (resp. visual) signal attracted the spatial estimate in V1 (resp. A1) only
when the two signals co-occurred close in space. In other words,
spatial disparity controlled the influence of auditory signals on
spatial estimates in primary visual areas. Yet, even for low spatial
disparity, audiovisual influences in primary sensory areas were
relatively small when compared to parietal cortices. These findings dovetail nicely with previous neurophysiological studies
showing about 15% ‘‘multisensory’’ neurons in primary sensory
areas [31] but more than 50% in classical association areas such
as intraparietal or superior temporal sulci [32].
Critically, our study not only showed that multisensory interactions increased progressively along the cortical hierarchy but also that they changed their computational operations from primary
sensory to higher-order parietal areas. Only higher-order parietal
areas (IPS0–4) integrated auditory and visual signals weighted
by their reliability in line with principles of statistical optimality
[1–3]. Yet, despite profound audiovisual interactions in IPS0–4,
sensory signals were not fused into one unified amodal spatial
representation [1, 3]. Instead, the sensory weights in IPS0–4
depended on the sensory modality that needed to be reported
(cf. Figure 2B). This context-dependent weighting of the auditory
and visual signals led to different spatial estimates for identical
audiovisual stimuli under visual and auditory report.
Critically, spatial disparity increased this difference between
the spatial estimates. When auditory and visual signals were
far apart and hence likely to come from independent events,
audiovisual integration was attenuated and IPS3–4 encoded
predominantly the location of the task-relevant signal. Thus,
IPS3–4 gracefully transitions between information integration
and segregation depending on the probability of the two signals
being generated by a common cause [4, 10].
A recent model-based fMRI study showed that IPS3–4 is more
likely to encode spatial estimates formed by Bayesian causal
inference [33] than by traditional forced fusion [1, 3] or full segregation models. Yet, the principles that drove this result remained
unclear (for further discussion, see the Supplemental Experimental Procedures). Using "model-free" multivariate decoding,
the current study reveals three fundamental computational principles that determine audiovisual integration in IPS3–4: IPS3–4
integrates sensory signals weighted by (1) their bottom-up sensory reliabilities and (2) their top-down task relevance into multisensory spatial priority maps. (3) Critically, these spatial priority
maps take into account the causal structure of the environment
(i.e., interaction of task relevance with spatial disparity). The influence of task relevance and its interaction with spatial disparity
implicate different spatial estimates for auditory and visual report
in particular when the two signals are far apart, as expected under Bayesian causal inference. Yet, as we manipulated task relevance in long blocks, attention may potentially have increased
the reliability of the auditory (resp. visual) signal and thereby influenced its weight in the final IPS3–4 spatial estimate for visual
(resp. auditory) report [34]. Future studies manipulating task relevance in a cuing paradigm may help to further disentangle these
different explanatory mechanisms.
Collectively, these computational operations enable IPS3–4 to
form multisensory spatial priority maps that go functionally
beyond traditional unisensory spatial priority maps [35, 36].
Multisensory spatial priority maps define attentional priority in
space jointly based on the bottom-up salience and reliability of
signals from multiple sensory modalities [22], their current task
relevance, and their causal structure.
Indeed, the behavioral relevance of the IPS3–4 spatial estimates was further indicated by the correlation between the neural and behavioral weights, which progressively increased along
the auditory and visual processing hierarchies to culminate in
IPS3–4 [22, 37]. Likewise, IPS3–4 was the only region where
perceptual report was decoded significantly better than chance.
Previous studies focusing selectively on the auditory ventriloquist illusion have highlighted the importance of the planum temporale in audiovisual integration at the perceptual level [21, 38].
Indeed, our study also revealed a high correlation between neural and behavioral weight indices in higher auditory areas. Yet,
unlike IPS3–4, higher auditory areas integrated sensory signals
weighted by their task relevance but were not significantly
affected by visual reliability. As null results need to be interpreted
with caution, future studies are needed to further define the
computational similarities and differences of the planum temporale and IPS3–4 in audiovisual integration (for further discussion,
see the Supplemental Experimental Procedures).
In conclusion, our results reveal distinct computational principles governing multisensory interactions in primary and association cortices. In primary visual cortices, spatial disparity
controlled the influence of non-visual signals on the formation
of spatial estimates. In parietal cortices, it determined the influence of task-irrelevant signals. Critically, parietal cortices integrated signals weighted by their bottom-up sensory reliabilities
and top-down task relevance into multisensory spatial priority
maps guiding spatial orienting and effective interactions in a
multisensory world. Moving beyond identifying multisensory influences toward characterizing their computational principles reveals a hierarchical organization of multisensory perception in
human neocortex. Our results demonstrate that multisensory
interactions are pervasive in human neocortex but subserve
distinct computational tasks [5, 39].
SUPPLEMENTAL INFORMATION
Supplemental Information includes Supplemental Experimental Procedures,
two figures, and three tables and can be found with this article online at
http://dx.doi.org/10.1016/j.cub.2015.12.056.
AUTHOR CONTRIBUTIONS
T.R. and U.N. conceived and designed the experiments. T.R. performed the
experiments. T.R. and U.N. analyzed the data. T.R. and U.N. wrote the paper.
ACKNOWLEDGMENTS
This study was funded by the European Research Council (ERC-2012-StG_20111109) and the Max Planck Society. We thank C. Frith and U. Beierholm for helpful comments on a previous draft of this manuscript, P. Dayan for
helpful discussions, and P. Ehses for advice on the fMRI sequence.
Received: July 31, 2015
Revised: November 6, 2015
Accepted: December 22, 2015
Published: February 4, 2016
REFERENCES
1. Ernst, M.O., and Banks, M.S. (2002). Humans integrate visual and haptic
information in a statistically optimal fashion. Nature 415, 429–433.
2. Fetsch, C.R., Pouget, A., DeAngelis, G.C., and Angelaki, D.E. (2012).
Neural correlates of reliability-based cue weighting during multisensory
integration. Nat. Neurosci. 15, 146–154.
3. Alais, D., and Burr, D. (2004). The ventriloquist effect results from near-optimal bimodal integration. Curr. Biol. 14, 257–262.
4. Körding, K.P., Beierholm, U., Ma, W.J., Quartz, S., Tenenbaum, J.B., and
Shams, L. (2007). Causal inference in multisensory perception. PLoS ONE
2, e943.
5. Ghazanfar, A.A., and Schroeder, C.E. (2006). Is neocortex essentially
multisensory? Trends Cogn. Sci. 10, 278–285.
6. Kayser, C., Logothetis, N.K., and Panzeri, S. (2010). Visual enhancement
of the information representation in auditory cortex. Curr. Biol. 20, 19–24.
7. Liang, M., Mouraux, A., Hu, L., and Iannetti, G.D. (2013). Primary sensory
cortices contain distinguishable spatial patterns of activity for each sense.
Nat. Commun. 4, 1979.
8. Meyer, K., Kaplan, J.T., Essex, R., Webber, C., Damasio, H., and Damasio,
A. (2010). Predicting visual stimuli on the basis of activity in auditory
cortices. Nat. Neurosci. 13, 667–668.
9. Vetter, P., Smith, F.W., and Muckli, L. (2014). Decoding sound and imagery
content in early visual cortex. Curr. Biol. 24, 1256–1262.
10. Shams, L., and Beierholm, U.R. (2010). Causal inference in perception.
Trends Cogn. Sci. 14, 425–432.
11. Wallace, M.T., Roberson, G.E., Hairston, W.D., Stein, B.E., Vaughan, J.W.,
and Schirillo, J.A. (2004). Unifying multisensory signals across time and
space. Exp. Brain Res. 158, 252–258.
12. Tian, B., Reser, D., Durham, A., Kustov, A., and Rauschecker, J.P. (2001).
Functional specialization in rhesus monkey auditory cortex. Science 292,
290–293.
13. Mishkin, M., Ungerleider, L.G., and Macko, K.A. (1983). Object vision and
spatial vision: two cortical pathways. Trends Neurosci. 6, 414–417.
14. Anderson, C.M., and Wu, C.F.J. (1995). Measuring location effects from
factorial experiments with a directional response. Int. Stat. Rev. 63,
345–363.
15. Stein, B.E., and Meredith, M.A. (1993). The Merging of the Senses (The MIT
Press).
16. Rohe, T., and Noppeney, U. (2015). Sensory reliability shapes perceptual
inference via two mechanisms. J. Vis. 15, 22.
17. Sadaghiani, S., Maier, J.X., and Noppeney, U. (2009). Natural, metaphoric,
and linguistic auditory direction signals have distinct influences on visual
motion processing. J. Neurosci. 29, 6490–6499.
18. Beauchamp, M.S., Argall, B.D., Bodurka, J., Duyn, J.H., and Martin, A.
(2004). Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat. Neurosci. 7, 1190–1192.
19. Hein, G., Doehrmann, O., Müller, N.G., Kaiser, J., Muckli, L., and Naumer,
M.J. (2007). Object familiarity and semantic congruency modulate responses in cortical audiovisual integration areas. J. Neurosci. 27, 7881–
7887.
20. Werner, S., and Noppeney, U. (2010). Distinct functional contributions of
primary sensory and association areas to audiovisual integration in object
categorization. J. Neurosci. 30, 2662–2675.
21. Bonath, B., Noesselt, T., Martinez, A., Mishra, J., Schwiecker, K., Heinze,
H.J., and Hillyard, S.A. (2007). Neural basis of the ventriloquist illusion.
Curr. Biol. 17, 1697–1703.
22. Macaluso, E., Driver, J., and Frith, C.D. (2003). Multimodal spatial representations engaged in human parietal cortex during both saccadic and
manual spatial orienting. Curr. Biol. 13, 990–999.
23. Kayser, C., Petkov, C.I., Augath, M., and Logothetis, N.K. (2007).
Functional imaging reveals visual modulation of specific fields in auditory
cortex. J. Neurosci. 27, 1824–1835.
24. Noesselt, T., Rieger, J.W., Schoenfeld, M.A., Kanowski, M., Hinrichs, H.,
Heinze, H.J., and Driver, J. (2007). Audiovisual temporal correspondence
modulates human multisensory superior temporal sulcus plus primary
sensory cortices. J. Neurosci. 27, 11431–11441.
25. Lewis, R., and Noppeney, U. (2010). Audiovisual synchrony improves motion discrimination via enhanced connectivity between early visual and auditory areas. J. Neurosci. 30, 12329–12339.
26. Foxe, J.J., Morocz, I.A., Murray, M.M., Higgins, B.A., Javitt, D.C., and Schroeder, C.E. (2000). Multisensory auditory-somatosensory interactions in early cortical processing revealed by high-density electrical mapping. Brain Res. Cogn. Brain Res. 10, 77–83.
27. Lakatos, P., Chen, C.M., O'Connell, M.N., Mills, A., and Schroeder, C.E. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron 53, 279–292.
28. Lee, H., and Noppeney, U. (2014). Temporal prediction errors in visual and auditory cortices. Curr. Biol. 24, R309–R310.
29. Falchier, A., Clavagnier, S., Barone, P., and Kennedy, H. (2002). Anatomical evidence of multimodal integration in primate striate cortex. J. Neurosci. 22, 5749–5759.
30. Macaluso, E., and Driver, J. (2005). Multisensory spatial interactions: a window onto functional integration in the human brain. Trends Neurosci. 28, 264–271.
31. Bizley, J.K., Nodal, F.R., Bajo, V.M., Nelken, I., and King, A.J. (2007). Physiological and anatomical evidence for multisensory interactions in auditory cortex. Cereb. Cortex 17, 2172–2189.
32. Dahl, C.D., Logothetis, N.K., and Kayser, C. (2009). Spatial organization of multisensory responses in temporal association cortex. J. Neurosci. 29, 11924–11932.
33. Rohe, T., and Noppeney, U. (2015). Cortical hierarchies perform Bayesian causal inference in multisensory perception. PLoS Biol. 13, e1002073.
34. Martinez-Trujillo, J.C., and Treue, S. (2004). Feature-based attention increases the selectivity of population responses in primate visual cortex. Curr. Biol. 14, 744–751.
35. Bisley, J.W., and Goldberg, M.E. (2010). Attention, intention, and priority in the parietal lobe. Annu. Rev. Neurosci. 33, 1–21.
36. Gottlieb, J.P., Kusunoki, M., and Goldberg, M.E. (1998). The representation of visual salience in monkey parietal cortex. Nature 391, 481–484.
37. Snyder, L.H., Batista, A.P., and Andersen, R.A. (1997). Coding of intention in the posterior parietal cortex. Nature 386, 167–170.
38. Bonath, B., Noesselt, T., Krauel, K., Tyll, S., Tempelmann, C., and Hillyard, S.A. (2014). Audio-visual synchrony modulates the ventriloquist illusion and its neural/spatial representation in the auditory cortex. Neuroimage 98, 425–434.
39. Driver, J., and Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on 'sensory-specific' brain regions, neural responses, and judgments. Neuron 57, 11–23.
Current Biology, Volume 26
Supplemental Information
Distinct Computational Principles
Govern Multisensory Integration
in Primary Sensory and Association Cortices
Tim Rohe and Uta Noppeney
Supplemental Figures
Figure S1 (Related to Figure 1). Histograms of the reported signal locations (across-participants mean ± SEM, N = 5) for
the 64 experimental conditions. The four auditory locations are displayed along the columns and the four visual locations along
the rows. Blue indicates auditory report; red indicates visual report. (A) High visual reliability. (B) Low visual reliability.
Figure S2 (Related to Figure 2). Audiovisual weight index (after controlling for eye movements, left panel, or for whether
auditory and visual signals were presented in common or different hemifields, right panel) in the regions of interest.
Audiovisual weight index wAV (across-participants circular mean and double-bootstrapped 68% confidence interval, N = 5) was
computed as the angle between the auditory and visual regression coefficients (atan(ßV/ßA)). In order to control for horizontal eye
movements, we included the post-stimulus mean horizontal eye position as a nuisance covariate in addition to the true auditory and
visual locations to predict the fMRI decoded locations (left panel, (A)-(D)). In an additional analysis, we included a nuisance
variable that coded whether the auditory and visual signals were presented in common or different hemifields in addition to the true
auditory and visual locations in the linear regression analysis to predict the fMRI decoded locations (right panel, (E)-(H)). For a
purely visual region, wAV is 90°. For a purely auditory region, it is 0°. Asterisks indicate the statistical significance of effects on
wAV derived from a circular log-likelihood ratio statistic. (A, E) Audiovisual weight index wAV as a function of visual reliability
(high (VR+) vs. low (VR-)). (B, F) Audiovisual weight index wAV as a function of task-relevance (auditory (A) vs. visual (V)
report). (C, G) Audiovisual weight index wAV as a function of audiovisual spatial disparity (small (≤ 6.6°; D-) vs. large (> 6.6°;
D+)). (D, H) Audiovisual weight index wAV in IPS3-4 as a function of task-relevance and disparity.
Supplemental Tables
Table S1 (Related to Figure 2). Decoding accuracies (across-participants mean Fisher z-transformed correlation coefficients) for signal location in audiovisual congruent conditions and for perceptual localization reports in the regions of interest.

Region    Signal location (z)    Perceptual localization (z)
V1        0.700*                 0.141
V2        0.660*                 0.255
V3        0.720*                 0.208
V3AB      0.620*                 0.036
IPS0-2    0.400*                 0.338
IPS3-4    0.340*                 0.355*
hA        0.410*                 0.208
A1        0.140*                 0.036

Correlation coefficients were Fisher z-transformed before averaging. * = p < 0.05 in one-sample bootstrap test after multiple comparison correction across the eight regions of interest. N = 5.
Table S2 (Related to Figure 2 and Table 1). Main and interaction effects of the factors visual signal location (LV), auditory signal location (LA), visual reliability (VR), and task-relevance (TR) on post-stimulus eye-movement indices in repeated measures ANOVAs.

Percent saccades
Effect               F     df1   df2    p
TR                   3.82  1     4      0.12
VR                   0.10  1     4      0.77
LV                   1.86  1.1   4.4    0.24
LA                   1.06  1.2   4.6    0.37
TR X VR              0.02  1     4      0.89
TR X LV              1.70  1.0   4.2    0.26
TR X LA              1.12  1.7   6.6    0.37
VR X LV              1.02  2.0   7.8    0.40
VR X LA              1.56  1.3   5.3    0.28
LV X LA              0.79  2.3   9.1    0.50
TR X VR X LV         0.37  1.6   6.3    0.66
TR X VR X LA         0.13  1.7   6.6    0.84
TR X LV X LA         0.97  2.3   9.3    0.43
VR X LV X LA         1.08  2.2   8.6    0.39
TR X VR X LV X LA    1.63  2.4   9.7    0.25

Horizontal eye position
Effect               F     df1   df2    p
TR                   7.73  1     4      0.05
VR                   1.82  1     4      0.25
LV                   5.38  1.0   4.2    0.08
LA                   3.25  1.1   4.3    0.14
TR X VR              1.38  1     4      0.31
TR X LV              5.84  1.0   4.2    0.07
TR X LA              1.69  1.0   4.2    0.26
VR X LV              3.08  1.6   6.3    0.12
VR X LA              1.10  1.1   4.3    0.36
LV X LA              1.92  2.0   8.2    0.21
TR X VR X LV         0.27  1.4   5.5    0.69
TR X VR X LA         3.06  1.4   5.7    0.13
TR X LV X LA         0.62  2.4   9.8    0.59
VR X LV X LA         1.28  2.2   8.6    0.33
TR X VR X LV X LA    0.82  2.1   8.2    0.48

Percent blinks
Effect               F     df1   df2    p
TR                   0.04  1     4      0.86
VR                   0.46  1     4      0.53
LV                   1.56  1.5   6.1    0.28
LA                   0.88  1.5   6.1    0.43
TR X VR              3.29  1     4      0.14
TR X LV              0.79  1.9   7.4    0.48
TR X LA              0.57  1.4   5.7    0.54
VR X LV              3.26  2.0   7.9    0.09
VR X LA              0.75  1.7   6.6    0.49
LV X LA              1.79  2.5   10.0   0.22
TR X VR X LV         0.11  2.3   9.4    0.92
TR X VR X LA         1.01  1.5   5.9    0.39
TR X LV X LA         1.63  2.6   10.3   0.24
VR X LV X LA         1.17  2.2   8.7    0.36
TR X VR X LV X LA    1.49  2.2   8.6    0.28

Note: p values are Greenhouse-Geisser corrected. N = 5.
Table S3 (Related to Table 1). Statistical significance of main and interaction effects of the factors visual reliability (VR), task-relevance (TR) and spatial disparity (S) for the audiovisual weight index wAV when effects of the post-stimulus mean horizontal position of the eyes were controlled (upper panel) or when effects of the hemifield of the audiovisual signals were controlled (lower panel).

Control of eye movements
          VR p     TR p      S p      VR X TR p  VR X S p  TR X S p  VR X TR X S p
V1        1        0.938     0.004*   0.975      0.207     1         0.999
V2        0.253    1         0.065    0.953      0.811     0.496     0.861
V3        0.999    0.704     1        0.873      0.906     1         1
V3AB      0.981    0.769     0.678    0.994      0.041*    1         0.999
IPS0-2    0.035*   0.080     1        0.146      0.406     1         1
IPS3-4    0.016*   0.001*    0.998    0.986      0.965     0.020*    0.993
hA        1        0.301     1        0.959      1         0.953     1
A1        0.430    0.478     0.697    0.977      1         1         0.893

Control of hemifield
          VR p     TR p      S p      VR X TR p  VR X S p  TR X S p  VR X TR X S p
V1        1        0.942     0.005*   0.961      0.227     1         0.999
V2        0.227    1         0.064    0.920      0.824     0.424     0.847
V3        1        0.665     1        0.811      0.904     1         1
V3AB      0.974    0.745     0.589    0.992      0.040*    1         0.997
IPS0-2    0.022*   0.049*    1        0.103      0.430     1         1
IPS3-4    0.028*   <0.001*   0.999    0.949      0.994     0.021*    0.999
hA        0.997    0.066     1        0.984      1         0.979     1
A1        0.433    0.468     0.678    0.979      1         1         0.91

Note: p values (corrected for multiple comparisons across the eight regions of interest) were derived from maximum permutation tests using a circular log-likelihood ratio statistic. N = 5. Asterisks indicate significant p values.
Supplemental Experimental Procedures
Participants
After giving written informed consent, six healthy volunteers without a history of neurological or psychiatric
disorders (all university students or graduates; 2 female; mean age 28.8 years, range 22-36 years) participated in
the fMRI study. All participants had normal or corrected-to-normal vision and reported normal hearing. One
participant was excluded because of excessive head motion (4.21 / 3.52 STD above the mean of the translational
/ rotational volume-wise head motion based on the included 5 participants). The data have previously been
reported [S1]. The study was approved by the human research review committee of the University of Tübingen.
Stimuli
The visual stimulus was a cloud of 20 white dots (diameter: 0.43° visual angle) sampled from a bivariate
Gaussian with a vertical standard deviation of 2.5° and a horizontal standard deviation of 2° or 14° presented on
a black background (i.e., 100% contrast). The auditory stimulus was a burst of white noise with a 5 ms on/off
ramp. To create a virtual auditory spatial signal, the noise was convolved with spatially specific head-related
transfer functions (HRTFs) thereby providing binaural (interaural time and amplitude differences) and
monoaural spatial filtering signals. The HRTFs were pseudo-individualized by matching participants’ head
width, height, depth, and circumference to the anthropometry of participants in the CIPIC database [S2]. HRTFs
from the available locations in the database were interpolated to the desired location of the auditory signal.
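For illustration, the visual stimulus described above could be generated along the following lines (a sketch only, under the stated dot count and standard deviations; presentation details and the random seed are placeholders, not the study's actual stimulus code):

import numpy as np

def sample_dot_cloud(mean_azimuth_deg, high_reliability=True, n_dots=20, seed=None):
    # Dot positions (azimuth, elevation, in deg visual angle) drawn from a
    # bivariate Gaussian: vertical SD 2.5 deg; horizontal SD 2 deg (high
    # reliability) or 14 deg (low reliability), centred on the visual location.
    rng = np.random.default_rng(seed)
    sd_horizontal = 2.0 if high_reliability else 14.0
    cov = np.diag([sd_horizontal ** 2, 2.5 ** 2])
    return rng.multivariate_normal([mean_azimuth_deg, 0.0], cov, size=n_dots)

dots = sample_dot_cloud(-3.3, high_reliability=False, seed=1)   # 20 x 2 array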
Experimental design
In a spatial ventriloquist paradigm, participants were presented with synchronous, yet spatially congruent or
discrepant visual and auditory signals (Fig. 1A). On each trial, visual and auditory locations were independently
sampled from four possible locations along the azimuth (i.e., -10°, -3.3°, 3.3° or 10°) leading to four levels of
spatial disparity (i.e., 0°, 6.6°, 13.3° or 20°). In addition, we manipulated the reliability of the visual signal by
setting the horizontal standard deviation of the Gaussian cloud to 2° (high reliability) or 14° (low reliability)
visual angle. In an inter-sensory selective-attention paradigm, participants either reported their auditory or visual
perceived stimulus location. Hence, the 4 x 4 x 2 x 2 factorial design manipulated (1) the location of the visual
stimulus ({-10°, -3.3°, 3.3°, 10°}, i.e., the mean of the Gaussian) (2) the location of the auditory stimulus ({-10°,
-3.3°, 3.3°, 10°}) (3) the reliability of the visual signal ({2°,14°}, standard deviation of the Gaussian) and (4)
task-relevance (auditory- / visual-selective report) (Fig. 1B). To characterize the computational principles of
information integration, we reorganized these conditions into a two (visual reliability: high vs. low) x two (task-relevance: auditory vs. visual report) x two (spatial disparity: ≤ 6.6° vs. > 6.6°) factorial design for the statistical
analysis of the behavioural and fMRI data.
On each trial, synchronous audiovisual spatial signals were presented for 50 ms followed by a variable
inter-stimulus fixation interval from 1.75-2.75 s. Participants localized the signal in the task-relevant sensory
modality as accurately as possible by pushing one of four spatially corresponding buttons. Throughout the
experiment, they fixated a central cross (1.6° diameter).
To maximize design efficiency, stimuli and conditions were presented in a pseudo-randomized fashion.
Only the factor task-relevance was held constant within a session and counterbalanced across sessions. In each
session, each of the 32 audiovisual spatial stimuli was presented exactly 11 times. 5.9% null-events were
interspersed in the sequence of 352 stimuli per session. Each participant completed 20 sessions (10 auditory and
10 visual localization tasks, i.e. 110 trials for each of the 64 conditions were included in total; apart from one
participant that performed 9 auditory and 11 visual localization sessions). Before the fMRI study, the
participants completed one practice session outside the scanner.
The number of sessions of the main experiment (i.e., 10 sessions x 2 task-contexts = 20 sessions) was
determined based on a prior independent pilot study with one single subject that participated in 33 sessions in
total including 17 scanning sessions for the auditory localization task. Computing the decoding performance for
an increasing number of sessions demonstrated that reliable decoding performance was obtained approximately
for ≥ 10 sessions.
Experimental setup
Audiovisual stimuli were presented using Psychtoolbox 3.09 (www.psychtoolbox.org) [S3] running under
MATLAB R2010a (MathWorks). Auditory stimuli were presented at ~75 dB SPL using MR-compatible
headphones (MR Confon). Visual stimuli were back-projected onto a Plexiglas screen using an LCoS projector
(JVC DLA-SX21). Participants viewed the screen through an extra-wide mirror mounted on the MR head coil
resulting in a horizontal visual field of approx. 76° at a viewing distance of 26 cm. Participants performed the
localization task using an MR-compatible custom-built button device. Participants’ eye movements and fixation
were monitored by recording the participants’ pupil location using an MR-compatible custom-built infrared
camera (sampling rate 50 Hz) mounted in front of the participants’ right eye and iView software 2.2.4
(SensoMotoric Instruments).
General analysis strategy for behavioural and neural data
To characterize how human observers integrate auditory and visual signals into spatial representations at the
behavioural and neural level, we developed a common analysis strategy for both (i) reported auditory and visual
spatial estimates (i.e., participants’ behavioural localization responses) and (ii) the spatial estimates decoded
from regions of interest along the auditory and visual dorsal processing hierarchy.
We quantified the influence of the auditory and visual signals on the reported (behavioural) or decoded
(neural) auditory or visual spatial estimates using a linear regression model. In this regression model, the
perceived/decoded spatial locations were predicted by the true spatial locations separately for auditory and
visual signals for each of the eight conditions in the two (visual reliability: high vs. low) x two (task-relevance:
auditory vs. visual report) x two (spatial disparity: ≤ 6.6° vs. > 6.6°) factorial design. Hence, the regression
model included 16 regressors in total, i.e., 8 (conditions) x 2 (true auditory or visual spatial locations). The
auditory (ßA) and visual (ßV) parameter estimates quantified the influence of auditory and visual signals on the
perceived/decoded signal location for a particular condition.
To obtain a summary index for the relative audiovisual weights, we computed wAV as the four-quadrant
inverse tangens of the visual (ßV) and auditory (ßA) parameter estimates from the regression model for each of
the eight conditions based on participants’ behavioural responses (Fig. 1C) and their decoded spatial estimates
across the cortical hierarchy (Fig. 2A-D). For a purely visual region, wAV is 90°. For a purely auditory region, it
is 0°. Confidence intervals of the mean wAV were computed using a double bootstrap [S4] (to account for the
small number of participants) for circular measures.
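The following sketch illustrates the circular mean and a bootstrapped confidence interval for wAV (illustrative Python; for simplicity it uses an ordinary percentile bootstrap rather than the double bootstrap of [S4], and it assumes the angles lie well away from the 0°/360° wrap-around):

import numpy as np

def circ_mean_deg(angles_deg):
    # Circular mean of angles given in degrees.
    a = np.radians(angles_deg)
    return np.degrees(np.arctan2(np.mean(np.sin(a)), np.mean(np.cos(a))))

def bootstrap_ci_circ(angles_deg, n_boot=5000, alpha=0.32, seed=0):
    # Percentile bootstrap confidence interval (68% by default) for the
    # circular mean across participants.
    rng = np.random.default_rng(seed)
    angles_deg = np.asarray(angles_deg)
    boots = [circ_mean_deg(rng.choice(angles_deg, size=len(angles_deg), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical per-participant wAV values (deg) for one condition:
w_av = [72.0, 80.0, 65.0, 85.0, 78.0]
print(circ_mean_deg(w_av), bootstrap_ci_circ(w_av))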
Permutation tests of circular indices for behavioural and neural data
We performed the statistics on the behavioural and neural audiovisual weight indices using a two (auditory vs.
visual report) x two (high vs. low visual reliability) x two (large vs. small spatial disparity) factorial design
based on the likelihood ratio statistics for circular measures (LRTS) [S5]. Similar to an analysis of variance for
linear data, LRTS computes the difference in log-likelihood functions for the full model that allows differences
in the mean locations of circular measures between conditions and the reduced null model that does not model
any mean differences between conditions. LRTS were computed separately for the main effects and interactions.
To refrain from making any parametric assumptions, we evaluated the main effects of visual reliability,
task-relevance, spatial disparity and their interactions in the factorial design using permutation testing of the
LRTS. To account for the within-subject repeated measures design at the 2nd random effects level, permutations
were constrained to occur within each participant. For the main effects of visual reliability, task-relevance and
spatial disparity, wAV values were permuted within the levels of the non-tested factors (5000 random
permutations). For tests of the two-way interactions, we permuted within the levels of the third factor the simple
main effects of the two factors of interest which are exchangeable under the null-hypothesis of no interaction
[S6] (1024 permutations possible). For tests of the three-way interaction, values were freely permuted across all
conditions [S7] (5000 random permutations). Unless otherwise stated, results are reported at p < 0.05. Note that
we ensured the validity of these permutation approaches in simulations showing that simulated p values
converged to a nominal alpha-level of 0.05.
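A simplified sketch of this permutation scheme is given below (illustrative Python; the absolute circular difference of grand circular means is used here as a stand-in test statistic for the likelihood ratio statistic LRTS of [S5], and the constraint that permutations occur within participants is implemented by flipping condition labels within each subject):

import numpy as np

def circ_mean(a):
    return np.arctan2(np.mean(np.sin(a)), np.mean(np.cos(a)))

def circ_dist(a, b):
    # Signed circular distance between two angles (radians).
    return np.angle(np.exp(1j * (a - b)))

def perm_test_main_effect(w_level1, w_level2, n_perm=5000, seed=0):
    # w_level1, w_level2: arrays of shape (n_subjects, n_cells) holding wAV
    # (in radians) for the two levels of the tested factor, matched cell by
    # cell over the levels of the non-tested factors.  Condition labels are
    # flipped within each subject and cell, so permutations are constrained
    # to occur within participants.
    rng = np.random.default_rng(seed)
    observed = abs(circ_dist(circ_mean(w_level1), circ_mean(w_level2)))
    count = 0
    for _ in range(n_perm):
        flip = rng.integers(0, 2, size=w_level1.shape).astype(bool)
        a = np.where(flip, w_level2, w_level1)
        b = np.where(flip, w_level1, w_level2)
        count += abs(circ_dist(circ_mean(a), circ_mean(b))) >= observed
    return (count + 1) / (n_perm + 1)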
Behavioural data
To quantify the influence of the auditory and visual signals on the perceived signal location, we computed the
audiovisual weight index wAV based on participants’ localization responses. We then evaluated the effect of
visual reliability, task-relevance and spatial disparity on the behavioural audiovisual weight index using
permutation testing of a likelihood ratio test statistic (LRTS) (Tab. 1, for details see above).
MRI data acquisition
A 3T Siemens Magnetom Trio MR scanner was used to acquire both T1-weighted anatomical images and T2*-weighted axial echoplanar images (EPI) with BOLD contrast (gradient echo, parallel imaging using GRAPPA
with an acceleration factor of 2, TR = 2480 ms, TE = 40 ms, flip angle = 90°, FOV = 192×192 mm2, image
matrix 78×78, 42 transversal slices acquired interleaved in ascending direction, voxel size = 2.5×2.5×2.5 mm3 +
0.25 mm interslice gap).
In total, 353 volumes times 20 sessions were acquired for the ventriloquist paradigm, 161 volumes
times 2-4 sessions for the auditory localizer and 159 volumes times 10-16 sessions for the visual retinotopic
localizer resulting in approximately 18 hours of scanning in total per participant assigned over 7-11 days. The
first three volumes of each session were discarded to allow for T1 equilibration effects.
fMRI data analysis
Ventriloquist paradigm
The fMRI data were analysed with SPM8 (www.fil.ion.ucl.ac.uk/spm) [S8]. Scans from each participant were
corrected for slice timing, were realigned and unwarped to correct for head motion and spatially smoothed with
a Gaussian kernel of 3 mm FWHM. The time series in each voxel was high-pass filtered to 1/128 Hz. All data
were analysed in native participant space. The fMRI experiment was modelled in an event-related fashion with
regressors being entered into the design matrix after convolving each event-related unit impulse with a canonical
hemodynamic response function and its first temporal derivative. In addition to modelling the 32 conditions in
our 4 (auditory locations) x 4 (visual locations) x 2 (visual reliability) factorial design, the general linear model
included the realignment parameters as nuisance covariates to account for residual motion artefacts. The factor
task-relevance (visual vs. auditory report) was modelled across sessions. The parameter estimates pertaining to
the canonical hemodynamic response function defined the magnitude of the BOLD response to the audiovisual
stimuli in each voxel. For the multivariate decoding analysis, we extracted the parameter estimates of the
canonical hemodynamic response function for each condition and session from voxels of the regions of interest
(= fMRI activation vectors) defined in separate auditory and retinotopic localizer experiments (see below). Each
spatial fMRI activation vector for the 64 conditions in our 4x4x2x2 factorial design was based on 11 trials
within a particular session. To avoid the effects of image-wide activity changes, each spatial fMRI activation
vector was normalized to have mean zero and standard deviation one.
For the multivariate decoding analysis, we used a linear support-vector regression model (SVR, as
implemented in LIBSVM 3.14 [S9]). For each of the eight regions of interest in the visual [S10] and auditory
[S11] processing hierarchies we trained the SVR model to learn the mapping from the fMRI activation vectors
to the external spatial locations based on the audiovisually congruent conditions (including conditions of
auditory and visual report) from all but one session. This learnt mapping from activation pattern to external
spatial location was then used to decode the spatial location from the fMRI activation vectors of the spatially
congruent and incongruent audiovisual conditions of the remaining session. In a leave-one-session-out cross-validation scheme, the training-test procedure was repeated for all sessions. The SVRs’ parameters (C and ν)
were optimized using a grid search within each cross-validation fold (i.e., nested cross-validation).
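The decoding step can be sketched as follows (illustrative Python using scikit-learn's NuSVR in place of LIBSVM; hyperparameters are fixed here rather than optimized in a nested grid search, and all data structures are hypothetical):

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import NuSVR

def decode_locations(patterns, locations, sessions, congruent):
    # patterns  : (n_samples, n_voxels) normalized fMRI activation vectors of one ROI
    # locations : (n_samples,) external stimulus locations (deg)
    # sessions  : (n_samples,) session label of each sample
    # congruent : (n_samples,) boolean, True for audiovisually congruent conditions
    # The model is trained on the congruent samples of the training sessions only
    # and then decodes all samples of the held-out session.
    decoded = np.full(len(locations), np.nan)
    for train_idx, test_idx in LeaveOneGroupOut().split(patterns, locations, sessions):
        train_idx = train_idx[congruent[train_idx]]
        model = NuSVR(kernel="linear", C=1.0, nu=0.5)
        model.fit(patterns[train_idx], locations[train_idx])
        decoded[test_idx] = model.predict(patterns[test_idx])
    return decoded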
We then asked two questions: First, we investigated whether the true signal location of audiovisual
congruent signals can be decoded successfully from the fMRI activation patterns in our regions of interest. We
evaluated decoding accuracy in terms of the Pearson correlation between the true and the decoded spatial
location in audiovisual congruent conditions alone (n.b. we can compute audiovisual accuracy in a meaningful
way only for the congruent audiovisual trials). The Fisher z-transformed correlation coefficients for each subject
were entered into a one-sample t-test and tested against zero by bootstrapping (n = 1000 bootstraps) the one-sample t statistic (Tab. S1) [S12]. To correct for multiple comparisons across the eight regions of interest, we
compared the critical test statistic of each region against the bootstrap distribution of the maximum test statistic
across the eight regions of interest [S13, S14]. Please note that the correlation coefficients, the main effects and
interactions each tested for effects of interest and addressed a priori hypotheses. Therefore, we did not correct
the p values for the number of tested effects in our behavioural and fMRI analysis, but only corrected our fMRI
analysis for the number of regions investigated using the bootstrapping procedure described above.
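A compact sketch of this accuracy assessment and the maximum-statistic correction (illustrative Python; the exact resampling scheme of [S12–S14] may differ in detail):

import numpy as np

def decoding_accuracy_z(true_loc, decoded_loc):
    # Fisher-z-transformed Pearson correlation between true and decoded locations.
    return np.arctanh(np.corrcoef(true_loc, decoded_loc)[0, 1])

def one_sample_t(x):
    return np.mean(x) / (np.std(x, ddof=1) / np.sqrt(len(x)))

def max_stat_bootstrap(z_by_roi, n_boot=1000, seed=0):
    # z_by_roi: dict mapping ROI name -> array of per-subject Fisher-z accuracies.
    # Each ROI's observed one-sample t statistic is compared against the bootstrap
    # distribution of the maximum t across ROIs (data are mean-centred to enforce
    # the null hypothesis before resampling).
    rng = np.random.default_rng(seed)
    observed = {roi: one_sample_t(z) for roi, z in z_by_roi.items()}
    max_ts = np.empty(n_boot)
    for i in range(n_boot):
        ts = []
        for z in z_by_roi.values():
            centred = np.asarray(z) - np.mean(z)
            ts.append(one_sample_t(rng.choice(centred, size=len(z), replace=True)))
        max_ts[i] = max(ts)
    return {roi: float(np.mean(max_ts >= t)) for roi, t in observed.items()}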
Upon reviewers’ suggestion, we also investigated audiovisual spatial representations in prefrontal
regions, in particular in the frontal eye fields, in addition to our eight regions of interest along the auditory and
visual processing hierarchies. However, the decoding accuracies for audiovisual congruent conditions were not
significantly better than chance in any of the prefrontal regions. Therefore, we did not characterize the spatial
representations in prefrontal areas further.
Second, we investigated how the eight regions of interest in the visual [S10] and auditory [S11]
processing hierarchies integrate auditory and visual signals into spatial representations. For this, we focused on
the decoded spatial locations of the spatially congruent and incongruent conditions that provide information
about how a brain region combines visual and auditory spatial signals into spatial representations as a result of
our SVR decoding procedure. Using the same regression analysis approach as for behavioural localization
responses, we quantified the influence of the true auditory and visual signal location on the decoded spatial
estimates for each region (see section General analysis strategy for behavioural and neural data). To increase
the efficiency of the parameter estimates in higher-order association areas, we pooled the decoded spatial
locations in IPS0-2 and IPS3-4 in the regression model (see [S15] for a similar approach).
We computed the ‘neural’ relative audiovisual weight wAV as the angle between the auditory and visual
parameter estimates of the linear regression and computed confidence intervals of mean wAV using a double
bootstrap [S4] (Fig. 2). Next, we evaluated the main effects of visual reliability, task-relevance, spatial disparity
and their interactions using the same permutation testing of the LRTS that was used for the behavioural data
[S5] (Tab. 1). To correct for multiple comparisons across the eight regions of interest, we compared the critical
test statistic of each region against the permutation distribution of the maximum test statistic across the eight
regions of interest [S13, S14].
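Under the convention that 0° indicates a purely auditory and 90° a purely visual influence (our reading of the weight index; the double bootstrap for the confidence intervals is omitted), the angle can be obtained from the two regression slopes as:

import numpy as np

# relative audiovisual weight as the angle of the (auditory, visual) parameter-estimate vector
w_av = np.degrees(np.arctan2(beta_vis, beta_aud))   # 0 deg = purely auditory, 90 deg = purely visual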
Further, we identified multisensory influences in primary sensory areas (i.e., auditory influence on V1
and visual influence on A1) for small spatial discrepancies. More specifically, we tested whether wAV was
smaller than 90° (in V1) or larger than 0° (in A1) for small spatial discrepancies (≤ 6.6°) [S16] using a one-sided
one-sample permutation test, where we permuted the sign of the circular distance of wAV (averaged across task-relevance and spatial disparity) from the critical value in each subject (i.e., 32 permutations).
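A sketch of this sign-permutation test for the V1 case (assuming d holds one signed per-subject distance of wAV from 90°, so that negative values indicate an auditory influence; with five subjects there are 2^5 = 32 sign assignments; circular wrapping is omitted for brevity):

import numpy as np
from itertools import product

obs = np.mean(d)
null = np.array([np.mean(d * np.array(s))            # flip the sign of each subject's distance
                 for s in product([1, -1], repeat=len(d))])
p = np.mean(null <= obs)                             # one-sided: mean wAV significantly below 90 deg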
To correlate the neural and behavioural audiovisual weight indices, we computed the circular-circular
correlation coefficient [S17] between neural and behavioural weight indices over the conditions in the two
(visual reliability: high vs. low) x two (task-relevance: auditory vs. visual report) x two (spatial disparity: ≤ 6.6°
vs. > 6.6°) factorial design for each participant (Fig. 2E). The correlation coefficients were averaged across
participants after Fisher’s z-transformation.
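For reference, the circular-circular correlation coefficient of [S17] (CircStat's circ_corrcc, following Jammalamadaka and SenGupta) amounts to the following computation, assuming a and b are the neural and behavioural weight indices in radians across the eight conditions:

import numpy as np

def circ_mean(x):
    # circular mean as the angle of the resultant vector
    return np.angle(np.sum(np.exp(1j * x)))

def circ_corrcc(a, b):
    # circular-circular correlation (Jammalamadaka & SenGupta), as implemented in CircStat
    sa, sb = np.sin(a - circ_mean(a)), np.sin(b - circ_mean(b))
    return np.sum(sa * sb) / np.sqrt(np.sum(sa**2) * np.sum(sb**2))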
To further test which region drives participants’ perceptual localization reports, we decoded the
reported signal location from fMRI data. To this end, we supplemented our first-level general linear models (see
above) with four regressors each coding for one of the four possible reported signal locations (e.g., regressor 1
was a dummy variable coding the onset of each trial where participants reported location 1). Thus, each
regressor coded one response location unconfounded by the true physical location of the auditory and visual
signal components. As before, we used a linear support-vector regression model with leave-one-session-out
cross-validation to decode the reported signal locations from the fMRI activation patterns of the parameter estimates pertaining to the four response regressors in the eight regions of interest. We tested whether the decoding accuracy, as indexed by the correlation coefficient between the true and the decoded spatial reports, was better than chance using a bootstrap approach with multiple comparison correction as
described for decoding accuracies above (Tab. S1).
Auditory and visual retinotopic localizer
Auditory and visual retinotopic localizers were used to define regions of interest along the auditory and visual
processing hierarchies in a participant-specific fashion. In the auditory localizer, participants were presented
with brief bursts of white noise at -10° or 10° visual angle (duration 500 ms, stimulus onset asynchrony 2 s). In
a one-back task, participants indicated via a key press when the spatial location of the current trial was different
from the previous trial. 20 s blocks of auditory conditions (i.e., 20 trials) alternated with 13 s fixation periods.
The auditory locations were presented in a pseudorandomized fashion to optimize design efficiency. Similar to
the main experiment, the auditory localizer sessions were modelled in an event-related fashion with the onset
vectors of left and right auditory stimuli being entered into the design matrix after convolution with the
hemodynamic response function and its first temporal derivative. Auditory responsive regions were defined as
voxels in superior temporal and Heschl’s gyrus showing significant activations for auditory stimulation relative
to fixation (p < 0.05, family-wise error corrected). Within these regions, we defined primary auditory cortex
(A1) based on cytoarchitectonic probability maps [S18] and referred to the remainder (i.e., planum temporale
and posterior superior temporal gyrus) as higher-order auditory cortex (hA).
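As an illustration of the event-related modelling (a minimal sketch with an SPM-like double-gamma response function; the temporal derivative and the actual SPM implementation are omitted, and all parameter values are assumptions):

import numpy as np
from scipy.stats import gamma

def hrf(tr, duration=32.0):
    # canonical double-gamma haemodynamic response function (peak near 5 s, undershoot near 15 s)
    t = np.arange(0.0, duration, tr)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def onset_regressor(onsets, n_scans, tr):
    # stick function at the stimulus onsets, convolved with the HRF and truncated to the run length
    sticks = np.zeros(n_scans)
    sticks[(np.asarray(onsets) / tr).astype(int)] = 1.0
    return np.convolve(sticks, hrf(tr))[:n_scans]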
Standard phase-encoded retinotopic mapping [S19] was used to define visual regions of interest.
Participants viewed a checkerboard background flickering at 7.5 Hz through a rotating wedge aperture of 70°
width (polar angle mapping) or an expanding/contracting ring (eccentricity mapping). The periodicity of the
apertures was 42 s. Visual responses were modelled by entering a sine and cosine convolved with the
hemodynamic response function as regressors in a general linear model. The preferred polar angle was
determined as the phase lag for each voxel, i.e., the angle between the parameter estimates for the sine and
the cosine. The preferred phase lags for each voxel were projected on the reconstructed, inflated cortical surface
using Freesurfer 5.1.0 [S20]. Visual regions V1-V3, V3AB and IPS0-IPS4 were defined as phase reversal in
angular retinotopic maps. IPS0-4 were defined as contiguous, approximately rectangular regions based on phase
reversals along the anatomical IPS [S21]. For the decoding analyses, the auditory and visual regions were
combined from the left and right hemispheres.
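For the phase-lag computation described above, a minimal sketch (beta_sin and beta_cos denote the voxel-wise parameter estimates for the sine and cosine regressors; the names are hypothetical):

import numpy as np

# preferred phase per voxel: angle of the (cosine, sine) parameter-estimate vector, mapped to [0, 2*pi)
phase = np.mod(np.arctan2(beta_sin, beta_cos), 2 * np.pi)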
Control analyses to account for eye movements and hemifield of signals as potential confounds
Eye recordings were calibrated with standard eccentricities between ±3° and ±10° to determine the deviation
from the fixation cross. Fixation position was post-hoc offset corrected. Eye position data were automatically
corrected for blinks and converted to radial velocity. For each trial, the post-stimulus mean horizontal eye
position and the number of saccades (defined by a radial eye-velocity threshold of 15° s⁻¹ for a minimum of 60
ms duration and radial amplitude larger than 1°) were quantified (0-875 ms after stimulus onset). We analyzed
the eye movement indices including percent saccades, percent eye blinks and the post-stimulus mean horizontal
eye position in three separate four (visual location) x four (auditory location) x two (visual reliability) x two
(task-relevance) repeated measures ANOVAs. As the analysis of mean horizontal eye position revealed a non-significant trend of task-relevance (Tab. S2), we performed an additional control analysis where we included the
post-stimulus mean horizontal eye position as a nuisance covariate into the regression model in addition to the
true auditory and visual locations to predict the fMRI decoded locations (Fig. S2 and Tab. S3).
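A sketch of the saccade criterion (a simplified reimplementation under the stated thresholds, not the pipeline used here; x and y are eye positions in degrees sampled at fs Hz, and all names are hypothetical):

import numpy as np

def count_saccades(x, y, fs, vel_thresh=15.0, min_dur=0.06, min_amp=1.0):
    # runs where radial velocity exceeds 15 deg/s for at least 60 ms with radial amplitude above 1 deg
    vel = np.hypot(np.gradient(x), np.gradient(y)) * fs
    fast = vel > vel_thresh
    edges = np.diff(fast.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    if fast[0]:
        onsets = np.r_[0, onsets]
    if fast[-1]:
        offsets = np.r_[offsets, fast.size]
    count = 0
    for on, off in zip(onsets, offsets):
        amp = np.hypot(x[off - 1] - x[on], y[off - 1] - y[on])
        if (off - on) / fs >= min_dur and amp > min_amp:
            count += 1
    return count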
Finally, we investigated the effect of within/across hemifield of signal presentation on our results by
including a nuisance covariate that coded whether the auditory and visual signals were presented in the same or
different hemifield in the linear regression analysis (with fMRI decoded spatial location as dependent variable)
(Fig. S2 and Tab. S3).
Bayesian Causal Inference: Qualitative predictions for effects of visual reliability, task-relevance and
interaction between task-relevance and spatial disparity
To form a reliable representation of the environment, the brain is challenged to integrate signals originating
from a common source and segregate those from different sources [S22, S23]. Reliability-weighted integration
is statistically optimal for signals coming from a common source. Yet, it fails to capture more natural situations
where the brain is confronted with many signals that may or may not arise from a common source [S23]. Indeed,
it is well-established that information integration breaks down for conflicting sensory signals [S16, S24] that are
unlikely to emanate from a common source. Likewise, our study demonstrates that the perceived sound location
is strongly influenced by the location of the visual signal, when spatial disparity is small. Yet, for large spatial
disparities, this audiovisual bias is greatly attenuated [S25].
Bayesian Causal Inference provides a rational strategy of how the brain should arbitrate between
information integration and segregation in the face of uncertainty about the causal structure of the world. Under
the assumption of a common source, auditory and visual spatial estimates should be combined, weighted
according to their reliabilities; under the hypothesis of different sources, they should be treated independently.
Critically, on a particular instance the brain does not know the underlying causal structures. It needs to infer
their probabilities from the sensory inputs based on audiovisual correspondences such as spatial discrepancy.
The brain is then thought to compute a final estimate of the auditory (or visual) signal location by combining the
spatial estimates under the two causal assumptions (i.e., common vs. independent sources) weighted by their
probabilities [S22, S26].
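For concreteness, under the model-averaging formulation of [S22] the final auditory location estimate can be written as (a sketch in our notation, not necessarily the exact decision function used in [S1]):

\hat{S}_A = P(C = 1 \mid x_A, x_V)\,\hat{S}_{A,C=1} + \bigl(1 - P(C = 1 \mid x_A, x_V)\bigr)\,\hat{S}_{A,C=2},

where \hat{S}_{A,C=1} is the reliability-weighted (fused) estimate, \hat{S}_{A,C=2} the segregated unisensory auditory estimate, and P(C = 1 | x_A, x_V) the posterior probability that the auditory and visual signals x_A and x_V share a common cause.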
Using model-based decoding in a spatial localization task, a recent study [S1] suggested that Bayesian
Causal Inference can better account for participants’ behavioural localization responses and spatial estimates
decoded from IPS3-4 than traditional forced fusion models [S27, S28]. More specifically, this previous study
demonstrated that spatial estimates obtained from Bayesian Causal Inference can be more accurately decoded
from IPS3-4 than spatial estimates from forced fusion models. Yet, there are several factors that could have led
to higher decoding accuracies for Bayesian Causal Inference relative to forced fusion estimates. Therefore, the
current study focuses on the fundamental computational principles that drive audiovisual integration in IPS3-4.
Bayesian Causal Inference makes three qualitative key predictions. First, sensory signals should be
integrated depending on their reliability in particular when spatial disparity is small (i.e., effect of visual
reliability and potentially its interaction with disparity). Second, auditory and visual signals should only be
partially integrated, because formally the Bayesian Causal Inference estimate combines the forced fusion and
the full segregation estimate of the task-relevant signal. Hence, we would expect different spatial estimates for
visual and auditory report (i.e., effect of task-relevance). Third, the difference between auditory and visual
estimates should increase for large spatial disparities (i.e., interaction between task-relevance and spatial
disparity). For large spatial disparities it is unlikely that the two signals come from a common source, so that the
influence of the full segregation estimate on the final estimate is increased.
Importantly, our results demonstrate that IPS3-4 is the only area whose profile of the wAV weight index is consistent with all three of these key predictions. These results strongly support the
hypothesis that IPS3-4 integrates auditory and visual spatial signals in line with Bayesian Causal Inference.
Moreover, they demonstrate how Bayesian Causal Inference enables the formation of multisensory spatial
priority maps where spatial priority is jointly defined by the relative reliabilities and task-relevance of signals
from multiple sensory modalities.
Role of the planum temporale in audiovisual integration of spatial signals
Previous studies [S29, S30] have implicated the planum temporale in the ventriloquist illusion by demonstrating
that the univariate BOLD-response in the planum temporale correlates with participants’ perceived sound
location. Likewise, our results suggest that the activation pattern in higher auditory areas including the planum
temporale correlates with participants’ percept. Further, the neural and behavioural audiovisual weight indices
correlate in higher auditory areas. We also observe a substantial effect of task-relevance in higher order auditory
areas. Collectively, these results suggest that the planum temporale plays a critical role in integrating
audiovisual signals into behaviourally relevant spatial estimates.
However, in contrast to IPS3-4, higher auditory areas (e.g., planum temporale) do not appear to integrate audiovisual signals weighted by their reliability, an observation that is inconsistent with Bayesian Causal Inference. As null results need to be interpreted with caution (in particular given non-significant effects of
reliability in primary auditory areas), future studies will need to further investigate the computational similarities
and differences between planum temporale and IPS3-4 in audiovisual spatial integration. We may also need to
manipulate auditory reliability instead of visual reliability for direct comparison of reliability reweighting across
auditory and visual areas. Potentially, planum temporale and IPS3-4, both higher-order sensory processing
areas, play similar roles in integrating auditory and visual signals into spatial representations.
Supplemental References
S1. Rohe, T., and Noppeney, U. (2015). Cortical Hierarchies Perform Bayesian Causal Inference in Multisensory Perception. PLoS Biol 13, e1002073.
S2. Algazi, V.R., Duda, R.O., Thompson, D.M., and Avendano, C. (2001). The CIPIC HRTF database. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the (IEEE), pp. 99-102.
S3. Brainard, D.H. (1997). The Psychophysics Toolbox. Spatial Vision 10, 433-436.
S4. Martin, M.A. (1990). On the double bootstrap. In Computing Science and Statistics: Interface '90, Proceedings of the 22nd Symposium on the Interface, pp. 73-78.
S5. Anderson, C.M., and Wu, C.J. (1995). Measuring location effects from factorial experiments with a directional response. International Statistical Review 63, 345-363.
S6. Edgington, E., and Onghena, P. (2007). Randomization Tests (Boca Raton, FL: CRC Press).
S7. Gonzalez, L., and Manly, B. (1998). Analysis of variance by randomization with small data sets. Environmetrics 9, 53-65.
S8. Friston, K.J., Holmes, A.P., Worsley, K.J., Poline, J.P., Frith, C.D., and Frackowiak, R.S.J. (1994). Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping 2, 189-210.
S9. Chang, C.C., and Lin, C.J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 27.
S10. Mishkin, M., Ungerleider, L.G., and Macko, K.A. (1983). Object vision and spatial vision: two cortical pathways. Trends in Neurosciences 6, 414-417.
S11. Tian, B., Reser, D., Durham, A., Kustov, A., and Rauschecker, J.P. (2001). Functional specialization in rhesus monkey auditory cortex. Science 292, 290-293.
S12. Efron, B., and Tibshirani, R.J. (1994). An Introduction to the Bootstrap (London: Chapman and Hall).
S13. Nichols, T.E., and Holmes, A.P. (2002). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping 15, 1-25.
S14. Nichols, T., and Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research 12, 419-446.
S15. Liu, T., Hospadaruk, L., Zhu, D.C., and Gardner, J.L. (2011). Feature-specific attentional priority signals in human cortex. J Neurosci 31, 4484-4495.
S16. Stein, B.E., and Meredith, M.A. (1993). The Merging of the Senses (Cambridge, MA: The MIT Press).
S17. Berens, P. (2009). CircStat: a MATLAB toolbox for circular statistics. Journal of Statistical Software 31, 1-21.
S18. Eickhoff, S.B., Stephan, K.E., Mohlberg, H., Grefkes, C., Fink, G.R., Amunts, K., and Zilles, K. (2005). A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. NeuroImage 25, 1325-1335.
S19. Sereno, M.I., Dale, A.M., Reppas, J.B., Kwong, K.K., Belliveau, J.W., Brady, T.J., Rosen, B.R., and Tootell, R.B. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science 268, 889-893.
S20. Dale, A.M., Fischl, B., and Sereno, M.I. (1999). Cortical surface-based analysis. I. Segmentation and surface reconstruction. NeuroImage 9, 179-194.
S21. Swisher, J.D., Halko, M.A., Merabet, L.B., McMains, S.A., and Somers, D.C. (2007). Visual topography of human intraparietal sulcus. J Neurosci 27, 5326-5337.
S22. Kording, K.P., Beierholm, U., Ma, W.J., Quartz, S., Tenenbaum, J.B., and Shams, L. (2007). Causal inference in multisensory perception. PLoS One 2, e943.
S23. Shams, L., and Beierholm, U.R. (2010). Causal inference in perception. Trends Cogn Sci 14, 425-432.
S24. Welch, R.B., and Warren, D.H. (1980). Immediate perceptual response to intersensory discrepancy. Psychol Bull 88, 638-667.
S25. Bertelson, P., and Radeau, M. (1981). Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Attention, Perception, & Psychophysics 29, 578-584.
S26. Wozny, D.R., Beierholm, U.R., and Shams, L. (2010). Probability matching as a computational strategy used in perception. PLoS Comput Biol 6.
S27. Alais, D., and Burr, D. (2004). The ventriloquist effect results from near-optimal bimodal integration. Curr Biol 14, 257-262.
S28. Ernst, M.O., and Banks, M.S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415, 429-433.
S29. Bonath, B., Noesselt, T., Krauel, K., Tyll, S., Tempelmann, C., and Hillyard, S.A. (2014). Audio-visual synchrony modulates the ventriloquist illusion and its neural/spatial representation in the auditory cortex. NeuroImage 98, 425-434.
S30. Bonath, B., Noesselt, T., Martinez, A., Mishra, J., Schwiecker, K., Heinze, H.J., and Hillyard, S.A. (2007). Neural basis of the ventriloquist illusion. Curr Biol 17, 1697-1703.