Neural Coding:
Higher Order Temporal Patterns
in the Neurostatistics of Cell Assemblies
Laura Martignon
Max Planck Institute
for Human Development
Lentzeallee 94, 14195 Berlin
[email protected]
Gustavo Deco
Siemens AG
Corporate Technology ZT
Otto Hahn Ring 6, 81739 Munich
[email protected]
Kathryn Laskey
Dept. of Systems Engineering
George Mason University
Fairfax, VA
[email protected]
Mathew Diamond
Cognitive Neuroscience Sector, International
School of Advanced Studies
Via Beirut, 9 34014 Trieste
[email protected]
Winrich Freiwald
Dept. of Neurophysiology
Max-Planck-Institute for Brain Research
D-60528 Frankfurt/M.
[email protected]
Eilon Vaadia
Dept. of Physiology
The Hebrew University of Jerusalem
Jerusalem 91010
[email protected]
Abstract
Recent advances in the technology of multi-unit recordings make it possible to test Hebb’s
hypothesis that neurons do not function in isolation but are organized in assemblies. This has
created the need for statistical approaches to detecting the presence of spatiotemporal patterns of
more than two neurons in neuron spike train data. We examine three measures for the presence of
higher order patterns of neural activation -- coefficients of log-linear models, connected cumulants,
and redundancies -- and present arguments in favor of the coefficients of log-linear models. We
present test statistics for detecting the presence of higher order interactions in spike train data. We
also present a Bayesian approach for inferring the existence or absence of interactions and
estimating their strength. The two methods are shown to be consistent in the sense that highly
significant correlations are also highly probable. A heuristic for the analysis of temporal patterns is
also proposed. The methods are applied to experimental data, to synthetic data drawn from our
statistical models, and to synthetic data produced from biologically plausible simulations of neural
activity. Our experimental data are drawn from multi-unit recordings in the somato-sensory cortex
of anesthetized rats obtained by Diamond, from multi-unit recordings in the frontal cortex of
behaving monkeys obtained by Vaadia, and from multi-unit recordings in the visual cortex of
behaving monkeys obtained by Freiwald.
1 Introduction
Hebb conjectured that information processing in the brain is achieved through the collective action
of groups of neurons, which he called cell assemblies (Hebb, 1949). One of the assumptions on
which he based his argument was his other famous hypothesis that excitatory synapses are
strengthened when the involved neurons are frequently active in synchrony. Evidence in support of
this so-called "Hebbian learning" hypothesis has been provided by a variety of experimental
findings over the last three decades (see, for instance, Rauschecker, 1991). Evidence for collective
phenomena confirming the "cell assembly" hypothesis has only recently begun to emerge as a
result of the progress achieved by multi-unit recording technology. Hebb's followers had been left
with a twofold challenge:
• to provide an unambiguous definition of cell assemblies;
• to conceive and carry out the experiments that demonstrate their existence.
Cell assemblies have been defined in various ways, both in terms of anatomy and of shared
function. One persistent approach characterizes the cell assembly by near-simultaneity or some
other specific timing relation in the firing of the involved neurons. If two neurons converge on a
third one, their synaptic influence is much larger for near-coincident firing, due to the spatiotemporal summation in the dendrite (Abeles, 1991; Abeles, Bergman, Margalit and Vaadia, 1993).
Thus synchronized firing and other specific temporal relationships between active neurons have been posited
as mechanisms by which the brain codes information (Singer, 1994; Abeles and Gerstein, 1998;
Abeles, Prut, Bergman and Vaadia, 1994; Prut, Vaadia, Bergman, Haalman and Slovin, 1993).
In pursuit of experimental evidence for cell assembly activity in the brain, physiologists thus seek
to analyze the activation of many separate neurons simultaneously, preferably in awake, behaving
animals. These ”multi-neuron activities” are then inspected for possible signs of interactions
between neurons. Results of such analyses may be used to draw inferences regarding the
processes taking place within and between hypothetical cell assemblies. The conventional approach
is based on the use of cross correlation techniques, usually applied to the activity of pairs
(sometimes triplets) of neurons recorded under appropriate stimulus conditions. The result is a
time-averaged measure of the temporal correlation among the spiking events of the observed
neurons under those conditions. Application of these measures has revealed interesting instances of
time- and context-dependent synchronization dynamics in different cortical areas. For interactions
among more than two neurons, Gerstein, Perkel and Dayhoff (1985) devised the so-called gravity
method. Their method views neurons as particles that attract each other and measures
simultaneously the attractions between all possible pairs of neurons in the set. The method,
although it allows one to look at several neurons globally, does not account for nonlinear synapses
nor for synchronization of more than two neurons. Recent investigations have focused on the
detection of individual instances of synchronized activity called ”unitary events” (Grün, 1996;
Grün, Aertsen, Vaadia and Riehle, 1995) between groups of two and more neurons. Of special
interest are patterns involving three or more neurons, which cannot be described in terms of pair correlations. Such patterns are genuine higher order phenomena. The method developed by Prut,
Vaadia, Bergman, Haalman, Slovin and Abeles (1998) for detecting precise firing sequences of up
to three units adopts this view, subtracting from three-way correlations all possible pair
correlations. The models reported in this paper were developed for the purpose of describing and
detecting higher order correlations of any order in a unified way.
In the data we analyze, the spiking events (in the 1 ms range) are encoded as sequences of 0's
and 1's, and the activity of the whole group is described as a sequence of binary configurations.
This paper presents a family of statistical models for analyzing such data. In our models, the
parameters represent spatiotemporal firing patterns. We generalize the spatial correlation models
developed by Martignon et al. (Martignon, v. Hasseln, Grün, Aertsen and Palm, 1995; Laskey
& Martignon, 1996) to include a temporal dimension. We develop statistical tests for detecting the
presence of a genuine order-n correlation and distinguishing it from an artifact that can be explained
by lower order interactions. The tests compare observed firing frequencies on the involved neurons
with frequencies predicted by a distribution that maximizes entropy among all distributions
consistent with observed information on synchrony of lower order. We also discuss a Bayesian
approach in which hypotheses about interactions on a large number of subsets can be compared
simultaneously.
We compare our log-linear models with two other candidate measures of higher order
correlations. One of these candidates, drawn from statistical physics, is the connected cumulant.
The other, drawn from information theory, is mutual information or redundancy. We argue that the
coefficients of log-linear models, or effects, provide a more natural measure of higher order
phenomena than either of these alternate approaches.
We present the results of analyzing synthetic data generated from our models. These synthetic data
provide a test of the performance of our statistical methods on data of known distribution. We also
present results for synthetic data provided by the experimenter. These synthetic data permit us to
evaluate the ability of our models to detect known structure and avoid detecting spurious structure
in data generated by biologically plausible models. Finally, results are presented from applying the
models to multi-unit recordings in the frontal cortex of monkeys obtained by Vaadia and multi-unit
recordings in the somato-sensory cortex of rats obtained by Diamond.
2 Measures for Higher Order Synchronization
The term spatial correlation has been used to denote synchronous firing of a group of neurons,
while the term temporal correlation has been used to indicate chains of firing events at specific
temporal intervals. Terms like "couple" or "triplet" have been used to denote spatiotemporal
patterns of two or three neurons (Abeles et al., 1993; Grün, 1996) firing simultaneously or in
sequence. Establishing the presence of such patterns is not straightforward. For example, three
neurons may fire together more often than expected by chance¹ without exhibiting an authentic
third order interaction. For example, if a neuron participates in two couples, such that each pair
fires together more often than by chance, then the three involved neurons will fire together more
often than the independence hypothesis would predict. This is not, however, a genuine third-order
phenomenon. Authentic triplets, and, in general, authentic n-th order correlations, must therefore
be distinguished from correlations that can be explained in terms of lower order correlations. We
introduce log-linear models for representing firing frequencies on a set of neurons and show that
nonzero coefficients or effects of these log-linear models are a natural measure for synchronous
firing. We argue that effects of log-linear models are superior to other candidate approaches drawn
from statistical physics and information theory. In this section we consider only models for
synchronous firing. Generalization to temporal and spatiotemporal effects is treated in Section 3.
2.1 Effects or Coefficients of Log-Linear Models
Consider a set of n neurons. Each neuron is modeled as a binary unit that can take on one of two
states: 1 ("firing") or 0 ("silent"). The state of the n units is represented by the vector x = (x_1, …, x_n),
where each x_i can take on the value zero or one. There are 2^n possible states for the n neurons. If
all neurons fire independently of each other, the probability of configuration x is given by
p(x_1, …, x_n) = p(x_1) ⋯ p(x_n)   (1)
Methods for detecting correlations look for departures from this model of independence. Neurons
are said to be correlated if they do not fire independently. A correlation between two neurons,
labeled 1 and 2, is expressed mathematically as

p(x_1, x_2) ≠ p(x_1) p(x_2)   (2)

¹ That is to say, more often than predicted by the hypothesis of independence.
Extending this idea to larger sets of neurons introduces complications. It is not sufficient simply to
compare the joint probability p(x_1, x_2, x_3) with the product p(x_1)p(x_2)p(x_3) and declare the existence
of a triplet when the two are not equal. This would confuse authentic triplets with overlapping
doublets or combinations of doublets and singlets. Thus, neurons 1, 2, and 3 may fire together
more often than the independence model (1) would predict because neurons 1 and 2, and neurons 2
and 3, are each involved in a binary interaction (Figure 1).
Figure 1: Overlapping Doublets and Authentic Triplet. (a) Overlapping pairwise correlations: neurons 1 and 2 and neurons 2 and 3 each form a correlated pair. (b) Authentic third-order correlation among neurons 1, 2, and 3.
Equation (2) expresses in mathematical form the idea that two neurons are correlated if the joint
probability distribution for their states cannot be determined from the two individual probability
distributions p(x_1) and p(x_2). Now consider a set of three neurons. The probability distributions
for the three pair configurations are p(x_1,x_2), p(x_2,x_3), and p(x_1,x_3).² A genuine third-order
interaction among the three neurons would occur if it were impossible to determine the joint
distribution for the three-neuron configuration using only information on these pair distributions. A
canonical way to construct a joint distribution on three neurons from the pair distributions is to
maximize entropy subject to the constraint that the two-neuron marginal distributions are given by
p(x_1,x_2), p(x_2,x_3), and p(x_1,x_3). This is the distribution that adds the least information beyond the
information contained in the pair distributions. It is well known³ that the joint distribution thus
obtained can be written as

p(x_1, x_2, x_3) = exp{θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_12 x_1 x_2 + θ_13 x_1 x_3 + θ_23 x_2 x_3},   (3)
where the θ's are real-valued parameters, and where θ_0 is determined from the other θ's and the
constraint that the probabilities of all configurations sum to 1. In this model, there is a parameter
for each individual neuron and for each pair of neurons. Each of these parameters, which in the
statistics literature are called effects, can be thought of as measuring the tendency to be active of the
neuron(s) with which it is associated. Increasing the value of θ_i without changing any other θ's
increases the probability that neuron i is in its "on" state. Increasing the value of θ_ij without
changing any other θ's increases the probability of simultaneous firing of neurons i and j. It is
instructive to note that there is a second-order correlation between neurons i and j in the sense of
(2) precisely when θ_ij ≠ 0.

² Note that these pairwise distributions also determine the single-neuron firing frequencies p(x_1), p(x_2), and p(x_3), which can be expressed as their linear combinations.
³ See, for example, Good (1963) and also Bishop, Fienberg and Holland (1975).
Not all joint probability distributions on three neurons can be written in the form of (3). A general
joint distribution for three neurons can be expressed by including one additional parameter:⁴

p(x_1, x_2, x_3) = exp{θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_12 x_1 x_2 + θ_13 x_1 x_3 + θ_23 x_2 x_3 + θ_123 x_1 x_2 x_3}   (4)

Holding the other parameters fixed and increasing the value of θ_123 increases the probability that all
three neurons fire simultaneously. Equation (3) corresponds to the special case of (4) in which θ_123
is equal to zero, just as independence corresponds to the special case of (3) in which all second-order
terms θ_ij are equal to zero. It seems natural, then, to define a genuine order-3 correlation as a
joint probability distribution that cannot be expressed in the form of (3) because θ_123 ≠ 0. As will be
shown below in Section 3, this idea can be extended naturally to larger sets of neurons and to
temporal patterns.

⁴ This can be seen by noting that log p(x_1,x_2,x_3) can be expressed as a set of linear equations relating the configuration probabilities to the θ's. These equations give a unique set of θ's for any given probability distribution.
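To make the correspondence between configuration probabilities and effects concrete, here is a minimal Python sketch (ours, not from the paper; all names and numbers are illustrative). It builds a joint distribution of the form (4) from chosen effects and recovers the θ's by subset (Möbius) inversion, θ_A = Σ_{B⊆A} (−1)^{|A−B|} log p(χ_B), where χ_B is the configuration with ones exactly on B:

```python
import itertools
import numpy as np

def joint_from_effects(theta, n):
    """Build p(x) proportional to exp(sum_A theta_A * x_A) over all 2^n configurations."""
    configs = list(itertools.product([0, 1], repeat=n))
    logp = {x: sum(t for A, t in theta.items() if all(x[i] == 1 for i in A))
            for x in configs}
    Z = sum(np.exp(v) for v in logp.values())
    return {x: np.exp(v) / Z for x, v in logp.items()}

def effects_from_joint(p, n):
    """Recover theta_A by Moebius inversion over subsets: the effects are
    uniquely determined by a strictly positive joint distribution."""
    theta = {}
    for k in range(n + 1):
        for A in itertools.combinations(range(n), k):
            val = 0.0
            for j in range(len(A) + 1):
                for B in itertools.combinations(A, j):
                    chi_B = tuple(1 if i in B else 0 for i in range(n))
                    val += (-1) ** (len(A) - len(B)) * np.log(p[chi_B])
            theta[A] = val
    return theta

true = {(0,): -2.0, (1,): -2.0, (2,): -2.0, (0, 1): 1.0, (0, 1, 2): 0.8}
est = effects_from_joint(joint_from_effects(true, 3), 3)
print({A: round(v, 6) for A, v in est.items() if A and abs(v) > 1e-9})
# recovers exactly the effects in `true`, including theta_123 = 0.8
```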
Parameterizations of the general form (3) and (4) are called log-linear models because the logarithm
of the probabilities can be expressed as a linear sum of functions of the configuration values. They
correspond to the coefficients of the energy expansions of generalized Boltzmann machines in
statistical mechanics (Amari, 1992). The coefficients of the log-linear model are called effects. A
joint probability distribution on n neurons is represented as a log-linear model in which parameters
are associated with subsets of neurons. A genuine n-th order correlation on a set A of n neurons is
defined in terms of a log-linear model in which the parameter θ_A associated with the set A is not
equal to zero.
The information-theoretic justification for our proposed definition of higher order interaction is
theoretically satisfying. But if the idea is to have practical utility, we require a method for
measuring the presence of genuine correlations and distinguishing them from noise due to finite
sample size. Fortunately, there is a large statistical literature on estimation and hypothesis testing
using log-linear models. In the following sections, we discuss ways to detect and quantify the
presence of higher order correlations. We also apply the methods to real and synthetic data.
2.2 Other Measures for Higher Order Correlation
We mention two further approaches to measuring the existence of higher order correlations in a set
of binary neurons. Connected cumulants, a measure of higher order correlation used in statistical
mechanics, have been proposed by Aertsen (Aertsen, 1992) as a candidate parameter for
spatiotemporal patterns of neural activation. Mutual information has been proposed by Grün
(1996).
For two neurons with configurations denoted by (x_1, x_2) the connected cumulant is given by

C_12 = ⟨x_1 x_2⟩ − ⟨x_1⟩⟨x_2⟩   (5)

while for three neurons the corresponding formula is

C_123 = ⟨x_1 x_2 x_3⟩ − ⟨x_1⟩⟨x_2 x_3⟩ − ⟨x_2⟩⟨x_1 x_3⟩ − ⟨x_3⟩⟨x_1 x_2⟩ + 2⟨x_1⟩⟨x_2⟩⟨x_3⟩   (6)

In these expressions, the symbol ⟨·⟩ stands for expectation. For the single i-th neuron, ⟨x_i⟩ is the
expected firing frequency. For a set of k neurons, ⟨x_1 x_2 ⋯ x_k⟩ is the expected frequency with
which the corresponding k neurons are simultaneously active.
The general formula for connected cumulants can be obtained by induction by means of the
following construction rule:

C_{1,2,…,N} = ⟨x_1 ⋯ x_N⟩ − Σ { products of connected cumulants C_{i_1…i_M} with M < N },

where the sum ranges over all partitions of {1, …, N} into two or more disjoint subsets.
In the case of two neurons, the absence of correlation defined by means of cumulants is equivalent
to the absence of correlation established by the effects of log-linear models. That is, the connected
cumulant is equal to zero if and only if the neurons exhibit no interaction as defined by (2). For
three or more neurons things change. From (6) it can be shown that the connected cumulant may
be nonzero when there exist overlapping couples as shown in Figure 1. Such overlapping couples,
we argue, constitute a second-order phenomenon and not a true third-order interaction. The third-order
term of a log-linear model, in contrast, will be equal to zero for Figure 1a and nonzero for
Figure 1b. Thus, we argue that the effects of log-linear models are a better measure for authentic
higher order interactions.
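To illustrate, the following small computation (ours; the parameter values are arbitrary) builds a three-neuron distribution with only pairwise effects θ_12 = θ_23 = 1.2 (so θ_123 = 0) and shows that the connected cumulant C_123 of (6) is nevertheless nonzero:

```python
import itertools
import numpy as np

# Pairwise-only model: theta_12 and theta_23 nonzero, theta_123 = 0.
theta = {(0,): -1.5, (1,): -1.5, (2,): -1.5, (0, 1): 1.2, (1, 2): 1.2}
configs = list(itertools.product([0, 1], repeat=3))
w = np.array([np.exp(sum(t for A, t in theta.items() if all(x[i] for i in A)))
              for x in configs])
p = w / w.sum()

def E(f):
    """Expectation <f> under the joint distribution p."""
    return sum(pi * f(x) for pi, x in zip(p, configs))

m = [E(lambda x, i=i: x[i]) for i in range(3)]
m12 = E(lambda x: x[0] * x[1])
m13 = E(lambda x: x[0] * x[2])
m23 = E(lambda x: x[1] * x[2])
m123 = E(lambda x: x[0] * x[1] * x[2])
C123 = m123 - m[0]*m23 - m[1]*m13 - m[2]*m12 + 2*m[0]*m[1]*m[2]
print(round(C123, 6))  # ~0.0048: nonzero despite theta_123 = 0
```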
Another approach to measuring higher-order correlations is drawn from information theory. Grün
(1996) proposed the use of mutual information to characterize higher order phenomena. The
mutual information measure (Cover & Thomas, 1991) for two neurons is given by

I(x_1; x_2) = ∫ p(x_1, x_2) log [ p(x_1, x_2) / (p(x_1) p(x_2)) ] dx_1 dx_2 = H(x_1) + H(x_2) − H(x_1, x_2)   (7)

where H(x_1, x_2) denotes the joint Shannon entropy (Cover & Thomas, 1991). The mutual
information of two neurons is zero if and only if the neurons are stochastically independent.
Therefore, departures of (7) from zero measure correlation between pairs of neurons. The measure
(7) can be extended to three neurons as

I(x_1; x_2; x_3) = ∫ p(x_1, x_2, x_3) log [ p(x_1, x_2, x_3) / (p(x_1) p(x_2) p(x_3)) ] dx_1 dx_2 dx_3 = H(x_1) + H(x_2) + H(x_3) − H(x_1, x_2, x_3)   (8)

Mutual information as defined by (8) measures departures from independence and not true third-order
correlation. Therefore, it too is unable to distinguish true higher order correlation from
correlation due to lower order phenomena.
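The same confound can be checked numerically for the redundancy of (8) (again our illustrative computation, reusing the pairwise-only model from the cumulant example): H(x_1) + H(x_2) + H(x_3) − H(x_1, x_2, x_3) is strictly positive even though the generating model has no triplet effect.

```python
import itertools
import numpy as np

theta = {(0,): -1.5, (1,): -1.5, (2,): -1.5, (0, 1): 1.2, (1, 2): 1.2}  # theta_123 = 0
configs = list(itertools.product([0, 1], repeat=3))
w = np.array([np.exp(sum(t for A, t in theta.items() if all(x[i] for i in A)))
              for x in configs])
p = w / w.sum()

def H(idx):
    """Shannon entropy (bits) of the marginal distribution over neurons in idx."""
    marg = {}
    for pi, x in zip(p, configs):
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + pi
    return -sum(q * np.log2(q) for q in marg.values() if q > 0)

print(H([0]) + H([1]) + H([2]) - H([0, 1, 2]))  # > 0, yet no third-order effect
```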
3 A Mathematical Model for Spatiotemporal Firing Patterns
The previous section described several approaches to generalizing the concept of pairwise
correlations to larger numbers of neurons. Coefficients of log-linear models were proposed as a
measure of higher order correlations and justified by their ability to distinguish true higher order
correlations from apparent correlations that can be explained in terms of lower order correlations.
For clarity of exposition, the discussion was limited to synchronous firing on groups of two and
three neurons. In this section, we generalize the approach to arbitrary numbers of neurons and to
temporal correlations.
Consider a set Λ of n binary neurons and denote by p the probability distribution on sequences
of binary configurations of Λ. Assume that the sequence of configurations x^t = (x_1^t, …, x_n^t) of the n
neurons forms a Markov chain of order r. Let δ be the time step, and denote the conditional
distribution for x^t given previous configurations by p(x^t | x^(t−δ), x^(t−2δ), …, x^(t−rδ)). We assume that
all transition probabilities are strictly positive and expand the logarithm of the conditional
distribution as

p(x^t | x^(t−δ), x^(t−2δ), …, x^(t−rδ)) = exp{θ_0 + Σ_{A∈Ξ} θ_A x_A}   (9)

In this expression, each A is a subset of pairs of subscripts of the form (i, t−sδ) for r ≥ s ≥ 0 that
includes at least one pair of the form (i, t). The variable x_A = ∏_{1≤j≤k} x_(i_j, t−m_jδ) is equal to 1 in the
event that all neurons in A are active and zero otherwise. The set Ξ ⊆ 2^Λ of subsets A for which θ_A ≠ 0 is called
the interaction structure for the distribution p. The parameter θ_A is called the interaction strength for
the interaction on subset A. Clearly, θ_A = 0 means A∉Ξ and is taken to indicate absence of an
order-|A| interaction among the neurons in A. We denote the structure-specific vector of nonzero
interaction strengths by θ_Ξ.
Definition 1: We say that neurons {i_1, i_2, …, i_k} exhibit a spatiotemporal pattern if there is a set
of time intervals m_1δ, m_2δ, …, m_kδ with r ≥ m_i ≥ 0 and at least one m_i = 0, such that θ_A ≠ 0 in (9),
where A = {(i_1, t−m_1δ), …, (i_k, t−m_kδ)}.

Definition 2: A subset {i_1, i_2, …, i_k} of neurons exhibits a synchronization or spatial correlation
if θ_A ≠ 0 for A = {(i_1, 0), …, (i_k, 0)}.
In the case of absence of any temporal dependencies the configurations at different times are
independent and we drop the time index:

p(x) = exp{θ_0 + Σ_A θ_A x_A}   (10)

where each A is a nonempty subset of Λ and x_A = ∏_{i∈A} x_i.
Of course, we expect temporal correlation of some kind to be present, one such example being the
refractory period after firing. Nevertheless, (10) may be adequate in cases of weak temporal
correlation and generalizes naturally to the situation of temporal correlation. Increasing the bin
width can result in temporal correlations of short time intervals manifesting as synchronization.
It is clear that (3) and (4) are special cases of (9) for the case of three neurons and no temporal
dependence. The interaction structure Ξ for (4) consists of all non-empty subsets of neurons. For
(3) the interaction structure consists of all single-neuron and two-neuron subsets.
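As a concrete reading of Definition 1, the sketch below (our code; names are ours) evaluates the pattern indicator x_A = ∏_j x_(i_j, t−m_jδ) on a matrix of binned spike trains, taking δ to be one bin:

```python
import numpy as np

def pattern_indicator(X, A):
    """x_A(t) for a spatiotemporal pattern A = [(neuron, lag), ...].

    X: binary array of shape (T, n); X[t, i] = 1 if neuron i fired in bin t.
    Per Definition 1, lags lie in 0..r and at least one lag should be 0.
    Returns a length T-r vector; entry s refers to time bin t = s + r.
    """
    r = max(lag for _, lag in A)
    T = X.shape[0]
    ind = np.ones(T - r, dtype=int)
    for i, lag in A:
        ind &= X[r - lag: T - lag, i]
    return ind

# toy usage: neuron 1 fires one bin after neuron 0
X = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
print(pattern_indicator(X, [(0, 1), (1, 0)]))  # [1 0 1]
```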
Although the models (9) and (10) are statistical, not physiological, one would naturally expect
synaptic connection between two neurons to manifest as nonzero θ A for the set A composed of
these neurons with the appropriate delay. One example leading to nonzero θ A in (10) would be
simultaneous activation of the neurons in A due to common input (Figure 2a). Another example is
simultaneous activation of overlapping subsets covering A, each by a different unobserved input
(Figure 2b). An attractive feature of our models is that the two cases of Figure 2 can be
distinguished when the neurons in A exhibit no true lower order interactions. Suppose a model of
the form (10) is constructed for all neurons, both observed and unobserved. The model of Figure 2a
corresponds to a log-linear expansion with only individual neuron coefficients and the fourth-order
effect θ_H123, where H is a hidden unit, as in Figure 2a. The model of Figure 2b contains individual
neuron effects and the triplet effects θ_H12 and θ_K23, where K is another hidden unit different from
H. When the probabilities are summed over the unobserved neurons to obtain the distribution over
the three observed neurons, both models will contain third-order coefficients θ 123 . However, the
model of Figure 2a will contain no nonzero second-order coefficients, while the model of Figure
2b will contain nonzero coefficients of all orders up to and including order 3. Thus log-linear
models have the desirable feature that the existence of a nonzero θ A coupled with θ B =0 for B ⊂ A
indicates a correlation among the neurons in A possibly involving unobserved neurons. On the
other hand, a nonzero θ A together with θ B ≠0 for B ⊂ A may be explained away by interactions
involving subsets of A and unobserved neurons.
The probability distribution p* that maximizes entropy among distributions with the same
marginals as the distribution p on proper subsets of A has a log-linear expansion in which only
θ_B with B ⊂ A, B ≠ A, are nonzero.
Figure 2: Nonzero θ_123 Due to Common Hidden Inputs H and/or K. (a) A single hidden unit H drives neurons 1, 2, and 3 of the observed subset A ⊂ Λ simultaneously: θ_123 ≠ 0, θ_12 = θ_23 = 0. (b) Hidden units H and K drive the overlapping subsets {1,2} and {2,3}: θ_123 ≠ 0, θ_12 ≠ 0, θ_23 ≠ 0.
4 Estimation and Testing Based on Maximum Likelihood Theory
We consider two general approaches to estimating the correlation structure in a set of neurons. The
first approach is based on maximum likelihood theory. Interaction strength parameters are
estimated by the method of maximum likelihood and structures are compared using likelihood ratio
tests. This approach is grounded in the frequentist school of statistical inference. The second
approach, described in the following section, is based on the Bayesian school of statistics. Before
presenting the maximum likelihood methods, we take a brief digression to present the frequentist
approach to statistical inference and distinguish it from the Bayesian approach.
The frequentist school of statistics grew up in the early part of the twentieth century, and has its roots in
positivism and the accompanying philosophical stance that science should strive for strict
objectivity. Frequentist statistics has its theoretical foundation in the work of von Mises (1939) and
the pioneering work of Neyman and Pearson (1928). In frequentist statistics, one views the
parameters of statistical models as objective unknowns that are not directly observable, about
which one can draw inferences on the basis of observable random variables whose distributions
depend on the parameters. Thus, one cannot directly observe the probability that a coin will turn up
heads or that a neuron will fire, but one can observe a large sample of coin tosses or a large
number of time bins of neural recording, from which one can estimate the probability in question.
Frequentists take the philosophical position that only objectively random phenomena can be
assigned probabilities. Thus, it is incorrect to assign a probability to an individual toss of a coin, an
individual sample on a neuron, or any other one-time event. Probabilities are assigned to
hypothetically infinite collectives of repeatable events, such as a series of coin tosses or a series of
multi-unit recordings made under identical conditions.
The probability distribution of (9) (or in the case of absence of temporal correlation, (10)) can be
estimated by the empirical frequency distribution obtained via a sample taken from multi-unit
recordings. From this empirical distribution, estimates of the θ’s can be obtained by solving a
system of linear equations relating the logarithms of the frequencies to the θ’s. Although
straightforward, this naive approach leads to difficulties. These difficulties arise because of
instability in the parameter estimates due to statistical variation in the empirical frequency
distribution. In other words, if the same recording procedure were applied multiple times under
identical conditions to the identical animal, the results of the recordings would differ from each
other and from the assumed underlying ”true” probability distribution, simply due to chance
fluctuations. Therefore, even if θ A=0, we would expect an estimate derived from empirical
frequency data to differ from zero. It is important to be able to tell true correlations from estimates
θ A that are different from zero by chance.
If the observed data arise from either (9) or (10), the Law of Large Numbers implies that as the
sample size grows sufficiently large, the observed frequencies of coincident firing of neurons in
the set A eventually converge to the true probability of coincident firing as given by the model. The
theory of maximum likelihood implies that the maximum likelihood estimates of the θ’s also
converge to their true values. Statistical hypothesis tests can be constructed to evaluate whether a
particular nonzero estimate θ̂_A can be explained by a chance fluctuation in a model with true θ_A
equal to zero.
Estimation and hypothesis testing are complicated by instability due to the presence of a large
number of parameters to be estimated and/or hypotheses to be tested simultaneously. The
dimension of (9) is exponential in the number of neurons times the order of the Markov chain, and
(10) is exponential in the number of neurons. The estimates of θ A and θ B are correlated for
overlapping subsets A and B. This means that the number of observations required for stable
estimates for all θ’s grows rapidly with the number of neurons and the order of the Markov chain.
4.1 Statistical Tests Based on Maximum Likelihood Theory
We are interested in the problem of detecting nonzero θ A for subsets A of neurons in a set of n
neurons whose firing frequencies are governed by model (9) or (10). That is, for a given set A, we
are interested in testing which of the two hypotheses
H_0: θ_A = 0
H_1: θ_A ≠ 0

is the case. The hypothesis H_0 is called the null hypothesis and H_1 is called the alternative
hypothesis. Generally, the hypothesis selected as the null hypothesis is regarded as having the
benefit of the doubt, and the alternative hypothesis must bear the burden of proof. That is, we wish
to reject the null hypothesis in favor of the alternative hypothesis only when the experimental
results are unlikely to have been observed if the null hypothesis is true. In our situation, it seems
reasonable to take the hypothesis θ A=0 as the null hypothesis against the alternative θ A≠0. That is,
we wish to declare the existence of a higher order interaction only when the observed correlations
are unlikely to be a chance result of a model in which θ A=0.
In the theory of statistical hypothesis testing (Neyman & Pearson, 1928), one constructs a test
statistic whose distribution is known under the null hypothesis. Then one chooses a critical range
of values of the test statistic that are highly unlikely under the null hypothesis but are likely to be
observed under plausible deviations from the null hypothesis. If the observed value of the test
statistic falls into the critical range, one rejects the null hypothesis in favor of the alternative
hypothesis.
Consider the problem of testing θ_A = 0 against θ_A ≠ 0 in the log-linear model (10) for synchronous
firing in observations with no temporal correlation. A reasonable test statistic is the likelihood ratio
statistic, denoted by G². To construct the G² test statistic, we first obtain the maximum likelihood
estimates of the parameters θ under the null and alternative hypotheses. We construct our test
statistic by conditioning on the silence of neurons outside A. The maximum likelihood estimate
under the alternative hypothesis is obtained by setting the frequency distribution p_1(x) over
configurations of neurons in the set A to the sample frequencies and solving for θ. The maximum
likelihood estimate for θ under the null hypothesis may be obtained by a standard likelihood
maximization procedure such as iterative proportional fitting (IPF), or by a simple one-dimensional
optimization procedure described below, which we call the Constraint Perturbation Procedure
(CPP). From the estimate of θ, one can solve for the estimated configuration probabilities p_0(x) for
the null hypothesis. Using these probability estimates, we compute the likelihood ratio test statistic

G² = 2N Σ_x p_1(x) (ln p_1(x) − ln p_0(x))   (11)
where N is the sample size for the test (the number of configurations in which neurons outside of
A are silent). As the sample size for the test tends to infinity, the distribution of this test statistic
under p_0 tends to a chi-squared distribution, where the number of degrees of freedom is the
difference in the number of parameters in the two models being compared. In this case the larger
model has a single extra parameter, so the reference distribution has one degree of freedom. The
interpretation of test results is as follows: if the value of the test statistic is highly unusual for a
chi-squared distribution with one degree of freedom, this casts doubt on the null model p_0 and is
therefore regarded as evidence that the additional parameter θ_A is nonzero.
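A hedged sketch of this computation (ours; the frequencies below are invented for illustration, and in practice p_0 would come from IPF or the Constraint Perturbation Procedure of Section 4.2):

```python
import numpy as np
from scipy.stats import chi2

def g2_statistic(p1, p0, N):
    """Likelihood ratio statistic (11): G^2 = 2N sum_x p1(x)(ln p1(x) - ln p0(x))."""
    p1, p0 = np.asarray(p1, float), np.asarray(p0, float)
    mask = p1 > 0  # configurations never observed contribute nothing
    return 2.0 * N * np.sum(p1[mask] * (np.log(p1[mask]) - np.log(p0[mask])))

# hypothetical frequencies over the 8 configurations of three neurons
p1 = np.array([0.40, 0.12, 0.12, 0.12, 0.06, 0.06, 0.06, 0.06])  # observed
p0 = np.array([0.42, 0.11, 0.11, 0.11, 0.07, 0.07, 0.07, 0.04])  # max-ent null
G2 = g2_statistic(p1, p0, N=1000)
print(G2, chi2.sf(G2, df=1))  # small p-value casts doubt on theta_A = 0
```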
Discarding observations for which neurons outside A are active simplifies the construction of a test
statistic for θ_A = 0. It is easy to see that the distribution of the test statistic does not depend on
whether there are nonzero order-|A| or higher coefficients other than θ_A. The power of the test is
reduced relative to that of a test based on all the observations.⁵ In our data this is not a major difficulty because
overall firing frequencies are low. An alternative approach would be to construct a test using all the
data, in which the null hypothesis is that all coefficients of order |A| or larger are zero and the
alternative hypothesis is that all order-|A| or higher coefficients except θ_A are zero.

Both of these tests suffer from a common difficulty that arises in high-dimensional statistical
problems. We usually are not interested in inference about one specific coefficient θ_A. Rather, we
are interested in the question of which coefficients are nonzero. In other words, the hypothesis
testing procedure is applied repeatedly for a large number of sets A. Although the significance level
(the probability of incorrectly rejecting a true null hypothesis) may be low for each test individually, if
enough tests are performed one expects some spurious results. Thus, it is important not to take
significance levels at face value when performing multiple repeated hypothesis tests.⁶
Nevertheless, we regard statistical significance of a coefficient as evidence for existence of
correlation of the associated neurons.

⁵ A statistical hypothesis test has high power if the probability of correctly rejecting a false null hypothesis is high, and high significance if the probability of incorrectly rejecting a true null hypothesis is low.
⁶ There exist adjustments to significance tests that account for multiple comparisons, which we do not develop formally in this paper.
4.2 The Constraint Perturbation Procedure
In this section we construct the null hypotheses for the tests proposed above. Assume again that p
is a probability distribution defined on the configurations of a set A of vertices. Consider the
following problem: find a distribution p* such that

1. the marginals of order less than |A| of p* coincide with those of p;
2. p* maximizes entropy among all distributions whose marginals of order less than |A| coincide
with those of p.
This problem has a unique solution. In fact, the solution coincides with the unique distribution
whose marginals of order less than |A| coincide with those of p and for which θ*_A = 0, where θ*_A is
the coefficient corresponding to A in the log-linear expansion of p*. This was proven by Good (1963) in
the early sixties (for a proof of the uniqueness of the solution see Whittaker, 1989).
In what follows we explicitly construct the solution. If B is a nonempty subset of A, denote by χ_B
the configuration that has component 1 for every index in B and 0 elsewhere.
Define p*(χ_B) = p(χ_B) + (−1)^|B| ∆, where ∆ is to be determined by solving for θ*_A = 0 among
probability distributions. As we have observed, θ*_A can be written as

θ*_A = Σ_{B⊆A} (−1)^|A−B| ln p*(χ_B).   (12)

Furthermore, define p*(0) = 1 − Σ_{∅≠B⊆A} p*(χ_B). This distribution p* has the required properties:
it maximizes entropy among those with the same marginals as p on proper subsets of A.
Example 2. Suppose that A = V = {1,2,3} and that p is a strictly positive distribution on A. We
want to construct the distribution p* that maximizes entropy among all those consistent with the
marginals of order 2, denoted by Marg({{1,2},{1,3},{2,3}}).⁷ Define

p*(1,1,1) = p(1,1,1) − ∆
p*(1,1,0) = p(1,1,0) + ∆
p*(1,0,1) = p(1,0,1) + ∆
p*(1,0,0) = p(1,0,0) − ∆
p*(0,1,1) = p(0,1,1) + ∆
p*(0,1,0) = p(0,1,0) − ∆
p*(0,0,1) = p(0,0,1) − ∆
p*(0,0,0) = 1 − Σ_{∅≠B⊆{1,2,3}} p*(χ_B)

⁷ Observe that by fixing the second-order marginals, we are also fixing the first-order marginals, which are their linear combinations.
It is possible to see at a glance that all second-order marginals of p* coincide with those of p. The
only additional step we need to undertake is to solve for ∆ in the equation θ*_A = 0. Observe that
our construction requires a numerical approximation at only one step, namely solving θ*_A = 0 in
terms of ∆. This can be done by using any one-dimensional optimization procedure such as
Newton's approximation method.⁸ Newton's method has to be used in combination with the
condition that the perturbation ∆ is such that all candidate solutions are probability distributions,
and thus their values lie between 0 and 1.

⁸ Any one-dimensional optimization procedure can be used to solve the problem: find ∆ such that θ*_A = 0 or, equivalently, such that the resulting probability distribution maximizes entropy in the manifold of probability distributions whose marginals of order less than |A| coincide with those of p.
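One possible implementation of the CPP (our code, not the authors'). It exploits the fact that θ*_A is monotone in ∆ (every term of its derivative carries the same sign), so a bracketing root finder over the interval of admissible ∆ is a safe substitute for Newton's method:

```python
import itertools
import numpy as np
from scipy.optimize import brentq

def cpp_null(p, n):
    """Perturb p over the 2^n configurations of A = {0,...,n-1} by
    p*(chi_B) = p(chi_B) + (-1)^{|B|} Delta so that theta*_A = 0; all
    marginals on proper subsets of A are unchanged by construction."""
    configs = list(itertools.product([0, 1], repeat=n))
    sign = {x: (-1) ** sum(x) for x in configs}  # (-1)^{|B|} for chi_B = x

    def theta_A(delta):
        # theta*_A = sum over B of (-1)^{|A - B|} ln p*(chi_B), as in (12)
        return sum((-1) ** (n - sum(x)) * np.log(p[x] + sign[x] * delta)
                   for x in configs)

    # admissible Delta must keep every perturbed probability positive
    lo = -min(p[x] for x in configs if sign[x] > 0) + 1e-12
    hi = min(p[x] for x in configs if sign[x] < 0) - 1e-12
    delta = brentq(theta_A, lo, hi)
    return {x: p[x] + sign[x] * delta for x in configs}

# toy usage on a strictly positive distribution (values are illustrative)
p = dict(zip(itertools.product([0, 1], repeat=3),
             [0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]))
p_star = cpp_null(p, 3)
check = sum((-1) ** (3 - sum(x)) * np.log(q) for x, q in p_star.items())
print(round(check, 9))  # ~0: the order-3 effect has been removed
```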
5 Statistical Tests Based on Bayesian Model Comparison
An alternative approach to the Neyman-Pearson hypothesis testing described in Section 4.1 is Bayesian
estimation. The Bayesian approach is gaining in favor as computational methods for obtaining
Bayesian estimates grow in maturity. Before moving in Section 5.1 to a detailed treatment of the
Bayesian approach to our problem, a digression is necessary to discuss the Bayesian attitude in
statistics, how it differs from the frequentist attitude, and why it is especially appropriate for
problems like that treated in this paper.
The Bayesian school of statistics is named after its originator, the Reverend Thomas Bayes
(1696-1761). In the Bayesian view, probability is a theory of belief dynamics. A probability is a number
that can be attached to any uncertain proposition and measures the degree to which that proposition
is believed to be true. When additional evidence is obtained about the proposition, beliefs are
updated according to the probability calculus, so that a prior probability distribution is transformed
into a posterior probability distribution that reflects how beliefs have been revised in the light of the
new evidence. Most modern Bayesians are subjectivists. That is, they view probabilities as
subjective and adopt the stance that rational individuals may disagree about the probability to be
attached to a proposition.⁹ For collectives of events that a frequentist would model as random and
be willing to assign a probability distribution, de Finetti's theorem (de Finetti, 1937) shows that
with enough evidence any nondogmatic subjectivist¹⁰ will converge to nearly the same probability
estimate as a frequentist's.
Standard frequentist methods may run into difficulties on high-dimensional problems such as the
one treated in this paper. In contrast, high-dimensional problems pose no essential difficulty in the
Bayesian approach, as long as prior distributions are chosen appropriately. In the Bayesian
framework, the prior distribution can be thought of as a bias that keeps the model from jumping to
inappropriately strong conclusions. In our problem, then, we assign prior distributions that bias
the model toward the hypothesis that ”effects, that is, coefficients of the log-linear model, are
zero”. Inferences by the model that “a parameter is nonzero” indicate reasonably strong
observational evidence against the hypothesis that “the parameter is zero”. Another advantage of
the Bayesian approach we take in this paper is that nonnested hypotheses pose no special
difficulty.
In the Bayesian approach, we begin by assigning a prior distribution to the vector θ of parameters
of the log-linear model. In our prior distribution, we assume that all θ_A are independent of each
other. We also assume that each θ_A is probably equal to zero but has a small probability of being
nonzero. We then assume a continuous probability density for θ_A conditional on its being
nonzero, in which values very different from zero are less likely than values near zero. This prior
distribution assigns a probability to each interaction structure Ξ, in which the interaction structures
containing fewer subsets are most likely.

⁹ It is likely that early probabilists such as Pascal, Bernoulli, and Laplace did not distinguish between the frequentist, subjectivist, and logical views of probability but assumed that all rational individuals with the same information would agree on the "correct" probability to be attached to a proposition, and that with enough evidence would come to agree on an estimate near the true probability.
¹⁰ That is, one who does not place probability zero on a region including the correct parameter value.
A sample of observations from multi-unit recordings is used to update the prior distribution and
obtain a posterior distribution. The posterior distribution also assigns a probability to each
interaction structure and a continuous probability structure to the nonzero θ A given the interaction
structure. The posterior probability that θ A≠0 is the sum of the probability of all interaction
structures in which θ A≠0. Thus, there is no need to choose a specific null hypothesis against which
to test -- the Bayesian approach automatically tests all hypotheses in which θ A≠0 against all
hypotheses in which θ A=0.
The major difficulty with the Bayesian approach is its computational complexity. With n neurons
and considering only synchronicity with no temporal correlation, there are 2^(2^n) interaction
structures to consider. Clearly, it is infeasible to enumerate all of these. In Section 5.1 we discuss
Markov chain Monte Carlo sampling approaches designed for this type of high-dimensional
problem. These approaches still have a high computational cost, and we are working on heuristics
to increase the efficiency of sampling.
5.1 Detecting Synchronizations in the Bayesian Approach
We begin with the problem of determining synchronization structure in the absence of temporal
correlation. Thus we consider the simplified log-linear model of (10). We are concerned with two
problems: identifying the correlation structure Ξ and estimating the interaction strength vector θ Ξ.
Information about p(x ) prior to observing any data is represented by the prior distribution over Ξ
and the θ’s. Observations are used to update this probability distribution to obtain a posterior
distribution over structures and parameters. The posterior probability that neurons in the set A
exhibit an order-|A| interaction is the sum of the posterior probabilities of interaction structures Ξ
for which A∈Ξ.
In what follows, we confine our discussion to the case of spatial correlations. A heuristic treatment
of temporal correlations is discussed in Section 6.1.1, while full treatment of spatiotemporal
interactions will appear in a future paper. We estimate both structure and parameters in a unified
Bayesian framework. Results include a probability, given the observations, of an order |A|
interaction among neurons in A, together with a posterior distribution for the strength of the
interaction, conditional on its occurrence.
The prior distribution we use is a mixed discrete/continuous distribution. Each structure Ξ is
assigned a prior probability. Conditional on structure Ξ, each nonzero θ A is assigned a continuous
distribution. The posterior distribution is also a mixed discrete continuous distribution. The
posterior distribution for θ A represents information about the magnitude of the interaction. The
mean or mode of the posterior distribution can be used as a point estimate of the interaction
strength; the standard deviation of the posterior distribution reflects remaining uncertainty about the
interaction strength.
Computing the posterior probability of a structure Ξ requires integrating θ_Ξ out of the joint mass-density function for (Ξ, θ_Ξ, X). There are no closed-form solutions for the required integrals. We
use Laplace's method (Tierney & Kadane, 1986; Kass & Raftery, 1995) to estimate the posterior
probability of structures. The posterior distribution of θ A given frequency data also cannot be
obtained in closed form. We use the mode of the posterior distribution as a point estimate of θ A .
The standard deviation of θ A , which indicates how precisely θ A can be estimated from the given
data, is estimated using a normal approximation to the posterior distribution. The covariance matrix
of the θ A is estimated as the inverse Fisher information matrix evaluated at the mode of the
posterior distribution. The posterior probability of an interaction A is the sum of the posterior
probabilities of all structures containing A. The total number of structures is much too large to
enumerate explicitly. We used a Markov Chain Monte Carlo Model Composition (MC³) algorithm
to search over structures. This is a stochastic algorithm that converges to a stationary distribution in
which a structure Ξ is visited with probability equal to its posterior probability.
The prior distribution we used has two components. The first component assigns a prior
probability to each structure. In our model, interactions are independent of each other and each
interaction has a probability of 0.1. This reflects the prior expectation that not many interactions are
expected to be present. The second component of the prior distribution is the conditional
distribution of interaction strengths given the structure. If an interaction A is absent, the
corresponding strength parameter θ A is taken to be identically zero given structure Ξ. All
interactions A∈Ξ are taken to have independent and normally distributed interaction strengths with
mean zero and standard deviation 2. This reflects the prior expectation that interaction strength
magnitudes are rarely larger than 4 in absolute value.¹¹ Using data to infer the distribution involves
two subproblems: Given a fixed interaction structure Ξ, infer the parameters θ A for A∈Ξ; and
determine the posterior probability of the interaction structure Ξ. These problems are considered
separately in the following subsections. These subsections may be omitted by the reader not
interested in the technical details.
¹¹ The choice of distribution was informed by experience analyzing other data sets and by discussions with neuroscientists. The symmetry of the normal distribution is unrealistic in that most interaction parameters are expected to be positive. However, the convenience of the normal distribution outweighed this disadvantage.
5.2 Estimating Parameters for an Interaction Structure
In this section, we consider the problem of using observations to estimate the interaction strengths
{θ_A}_{A∈Ξ} conditional on a fixed interaction structure Ξ. Initial information about θ is expressed as
a prior probability distribution g(θ|Ξ). Let x_1, …, x_N be a sample of N independent observations
from the distribution p(x|θ,Ξ). The joint probability of the observed set of configurations,
p(x_1|θ,Ξ) ⋯ p(x_N|θ,Ξ), viewed as a function of θ, is called the likelihood function for θ. From (10), it
follows that the likelihood function can be written

L(θ) = p(x_1|θ,Ξ) ⋯ p(x_N|θ,Ξ) = exp{ Σ_i h(x_i|θ) } = exp{ N Σ_{A∈Ξ} θ_A t_A },   (13)

where t_A = (1/N) Σ_i t_A^i is the observed frequency of simultaneous activation of the neurons in the set A. The
expected value of T_A is the probability that T_A = 1 and is simply the marginal for the set A. The
observed value t_A is referred to as the marginal frequency for A.
The posterior distribution for θ given the sample x_1, …, x_N is given by

g(θ|Ξ, x_1, …, x_N) = K p(x_1|θ,Ξ) ⋯ p(x_N|θ,Ξ) g(θ|Ξ)   (14)

where the constant of proportionality K is chosen so that (14) integrates to 1:

K = [ ∫ p(x_1|θ′,Ξ) ⋯ p(x_N|θ′,Ξ) g(θ′|Ξ) dθ′ ]⁻¹   (15)
In this expression, θ′ is the vector of θ_A for nonempty A∈Ξ. Integration is performed only over
parameters that may vary independently. Because the probabilities are constrained to sum to 1, θ_∅
is a function of the other θ_A:

θ_∅ = −log Σ_x exp{ Σ_{∅≠A∈Ξ} θ_A x_A(x) }   (16)
The random vector T of marginal frequencies is called a sufficient statistic for θ because the
posterior distribution of θ depends on the data only through T.¹² Another way of expressing the
fact that T is a sufficient statistic for θ is that the max-ent model consistent with all marginals is the
same distribution that has θ as its effects (Cover & Thomas, 1991). Thus, for the purpose of
computing the posterior distribution for a fixed structure, no information is lost if the data are
summarized by the vector of observed cluster activation frequencies for those clusters belonging to
Ξ.

¹² The sufficient statistic T contains T_A only for those clusters A in Ξ. For notational simplicity the dependence of T on the structure Ξ is suppressed.
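Concretely (our sketch; names are ours), the sufficient statistic amounts to one number per cluster A ∈ Ξ, the fraction of observed time bins in which all neurons of A fire together:

```python
import numpy as np

def marginal_frequencies(X, structure):
    """t_A for each cluster A: X is a binary (N samples x n neurons) array."""
    return {A: float(np.mean(np.all(X[:, list(A)] == 1, axis=1)))
            for A in structure}

X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]])  # toy data
print(marginal_frequencies(X, [(0,), (0, 1), (0, 1, 2)]))
# {(0,): 0.75, (0, 1): 0.5, (0, 1, 2): 0.25}
```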
A structure Ξ is specified by enumerating those effects θ_A that are nonzero. The prior distribution
g(θ|Ξ) expresses prior information about the nonzero θ_A. We assume that the nonzero θ_A are
normally distributed with zero mean and standard deviation σ = 2. That is, we are assuming that the
θ_A are symmetrically distributed about zero (i.e., excitatory and inhibitory interactions are equally
likely) and that most θ_A lie between -4 and 4. The standard deviation σ = 2 is based on previous
experience applying this class of models (Martignon et al., 1995) to other data.
The posterior distribution g(θ|Ξ, x_1, …, x_N) cannot be obtained in closed form, so approximation
is necessary. Options for approximating the posterior distribution include analytical and
sampling-based methods. Because we must estimate posterior distributions for a large number of structures,
Monte Carlo methods are infeasibly slow. We therefore use an analytical approximation method:
we approximate g(θ|Ξ, x_1, …, x_N) by the standard large-sample normal approximation. The
approximate posterior mean is given by the mode θ̃ of the posterior distribution. The posterior
mode θ̃ can be obtained by using Newton's method to maximize the logarithm of the joint
mass-density function:

θ̃ = argmax_θ { log( p(x_1|θ,Ξ) ⋯ p(x_N|θ,Ξ) g(θ|Ξ) ) }   (17)

The approximate posterior covariance is equal to the inverse of the negative Hessian matrix of second
derivatives (which can be obtained in closed form):

Σ̃ = [ −D²_θ log( p(x_1|θ,Ξ) ⋯ p(x_N|θ,Ξ) g(θ|Ξ) ) ]⁻¹   (18)

The normal approximation is accurate for large sample sizes (De Groot, 1970). The posterior
density function is always unimodal, and we have observed that the conditional density for θ_A
given the other θ's is not too asymmetric even when the corresponding marginal t_A is small.
Nevertheless, the quality of the Laplace approximation is a question worthy of further
investigation.
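A minimal sketch of this step (ours, feasible only for small n, where the 2^n configurations can be enumerated; all names are ours). It finds the posterior mode by Newton's method, using the closed-form gradient N(t_A − E_θ[x_A]) − θ_A/σ² and Hessian −N Cov_θ(x_A, x_B) − I/σ², and returns the normal-approximation covariance of (18):

```python
import itertools
import numpy as np

def laplace_fit(X, structure, sigma=2.0, iters=50):
    """Posterior mode and covariance for model (10) under a fixed structure
    Xi, with independent N(0, sigma^2) priors on the nonzero effects."""
    N, n = X.shape
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    # F[c, j] = x_{A_j} evaluated at configuration c
    F = np.column_stack([np.all(configs[:, list(A)] == 1, axis=1).astype(float)
                         for A in structure])
    t = np.array([np.mean(np.all(X[:, list(A)] == 1, axis=1))
                  for A in structure])          # sufficient statistics t_A
    theta = np.zeros(len(structure))
    for _ in range(iters):
        logw = F @ theta
        w = np.exp(logw - logw.max())
        p = w / w.sum()                         # p(x | theta); theta_0 implicit
        mu = F.T @ p                            # E_theta[x_A]
        cov = F.T @ (F * p[:, None]) - np.outer(mu, mu)
        grad = N * (t - mu) - theta / sigma**2
        hess = -N * cov - np.eye(len(structure)) / sigma**2
        theta = theta - np.linalg.solve(hess, grad)   # Newton step
    return theta, np.linalg.inv(-hess)          # mode and covariance (18)

# toy usage: singleton effects plus a pair and a triplet cluster
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 3))          # placeholder data
mode, cov = laplace_fit(X, [(0,), (1,), (2,), (0, 1), (0, 1, 2)])
print(np.round(mode, 2), np.round(np.sqrt(np.diag(cov)), 3))
```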
5.3 Estimating the Posterior Probability of Correlations
To perform inference with multiple structures, it is natural to assign a prior distribution for θ that is
a mixture of the structure-conditional distributions g(θ|Ξ). That is, a structure probability π_Ξ and a prior distribution
g(θ|Ξ) are specified for each of the interaction structures under consideration. The structure-specific
parameter distribution g(θ|Ξ) was described above in Section 5.2. To obtain the prior distribution over
structures, we used discussions with neuroscientists and experience fitting similar models to other
data sets. We expected all θ_A for single-element A to be nonzero. The prior distribution we used
assumed that the θ_A were zero or nonzero independently, and each θ_A with |A| > 1 had prior
probability .1 of being nonzero. The results of the analysis were insensitive to the (nonzero)
prior probability of including singleton A.

The posterior distribution of θ given a sample of observations is also a mixture distribution:

g(θ|x_1, …, x_N) = Σ_Ξ π*_Ξ g(θ|Ξ, x_1, …, x_N)   (19)

where π*_Ξ is the posterior probability of structure Ξ and g(θ|Ξ, x_1, …, x_N) is the posterior
distribution of θ given structure Ξ as defined in (14) and (15).
The ratio of posterior probabilities for two structures Ξ_1 and Ξ_2 is given by

π*_Ξ1 / π*_Ξ2 = p(Ξ_1|x_1, …, x_N) / p(Ξ_2|x_1, …, x_N) = [ p(x_1, …, x_N|Ξ_1) / p(x_1, …, x_N|Ξ_2) ] · ( π_Ξ1 / π_Ξ2 )   (20)

Thus, the posterior odds ratio is obtained by multiplying the prior odds ratio by the ratio of
marginal probabilities of the observations under each of the two structures. This ratio of marginal
probabilities is called the Bayes factor (Kass & Raftery, 1995). The marginal probability of the
observations under structure Ξ is obtained by integrating θ′ out of the joint conditional density:

p(x_1, …, x_N|Ξ) = ∫ p(x_1|θ′,Ξ) ⋯ p(x_N|θ′,Ξ) g(θ′|Ξ) dθ′   (21)

We assumed in Section 5.2 that the joint mass-density function (and hence the posterior density
function for θ) is highly peaked about its maximum and approximately normally distributed. The
marginal probability p(x_1, …, x_N|Ξ) is approximated by integrating this normal density:

p̃(x_1, …, x_N|Ξ) ≈ (2π)^(d/2) |Σ̃|^(1/2) p(x_1|θ̃,Ξ) ⋯ p(x_N|θ̃,Ξ) g(θ̃|Ξ)   (22)

where d is the dimension of θ′. This approximation (22) is known as Laplace's approximation
(Kass & Raftery, 1995; Tierney & Kadane, 1986).
The posterior expected value of θ is a weighted average of the conditional posterior expected
values. As noted above, these expected values are not available in closed form. Using the
conditional MAP (maximum a posteriori) estimate θ̃_Ξ to approximate the conditional posterior
expected value, we obtain the point estimate

θ̃ = Σ_Ξ π*_Ξ θ̃_Ξ   (23)

In (23), the value θ̃_A = 0 is used for structures not containing A. Also of interest is an estimate of
the value of θ_A given that θ_A ≠ 0, which is obtained by dividing θ̃_A by the sum of the π*_Ξ for which
A∈Ξ. Equation (18) can be used to approximate the posterior covariance matrix Σ̃_Ξ for θ
conditional on structure Ξ. Again, the unconditional covariance can be approximated by a weighted
average:

Σ̃ = Σ_Ξ π*_Ξ Σ̃_Ξ   (24)
Computing the posterior distribution requires computing, for each possible structure Ξ, the
posterior probability of Ξ and the conditional distribution of θ given Ξ. This is clearly infeasible:
k
For k nodes there are 2k activation clusters, and therefore 2 2 possible structures. It is therefore
necessary to approximate (19) by sampling a subset of structures. We used a MC3 algorithm
(Madigan & York, 1993) to sample structures. A Markov chain on structures was constructed such
that it converges to an equilibrium distribution in which the probability of being at structure Ξ is
equal to the Laplace approximation to the structure posterior probability π *Ξ . We used a MetropolisHastings sampling scheme, defined as follows: When the process is at structure Ξ, a new structure
Ξ’ is sampled by either adding or deleting a single cluster. The cluster A’ to add or delete is chosen
by a probability distribution ρ(A’|Ξ). 13 The selected cluster A’ is then either accepted or rejected,
with the probability of acceptance given by
 p(x 1,K.x N | Ξ' )π Ξ' ρ(A'| Ξ) 
min 1,

 p(x1 ,K.x N | Ξ)π Ξ ρ(A' |Ξ') 
(25)
We ran the MC³ algorithm for 15,000 runs. We estimated the posterior probability of a structure as its frequency of occurrence over the 15,000 runs. We estimated interaction strength parameters and standard deviations using only the 100 highest probability structures. Although the number of possible structures is astronomical, the vast majority of structures have extremely small probabilities. The practice of using the best 100 models to estimate interaction strengths is justified first because estimates of the interaction probabilities based on the best 100 models were very similar to those based on sampling frequencies, and second because the between-structure variance components for the interaction strengths were generally very small (the only exceptions being low-probability interactions that were highly correlated with other interactions).

¹³ To improve acceptance probabilities, we modified the distribution ρ(A′|Ξ) as sampling progressed. The sampling probabilities were bounded away from zero to ensure convergence.
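The following Python sketch illustrates the MC³ sampler under a simplification we introduce here for brevity: the proposal distribution ρ(A′|Ξ) is uniform over candidate clusters, so the proposal ratio in (25) cancels (the paper's adaptive ρ would reinstate that ratio). The functions `log_marginal` and `log_prior` are assumed to be supplied by the user, e.g. via the Laplace sketch above.

```python
import numpy as np

def mc3(log_marginal, log_prior, clusters, n_iter=15000, seed=0):
    """Metropolis-Hastings search over interaction structures (MC^3 sketch).

    log_marginal(structure): Laplace-approximate log p(x_1..x_N | Xi), eq. (22)
    log_prior(structure):    log pi_Xi
    clusters: candidate clusters A (tuples of neuron indices)
    """
    rng = np.random.default_rng(seed)
    structure = frozenset()               # start from the independence model
    score = log_marginal(structure) + log_prior(structure)
    visits = {}
    for _ in range(n_iter):
        a = clusters[rng.integers(len(clusters))]
        proposal = structure ^ frozenset([a])      # add or delete cluster a
        new_score = log_marginal(proposal) + log_prior(proposal)
        if np.log(rng.random()) < new_score - score:   # acceptance, eq. (25)
            structure, score = proposal, new_score
        visits[structure] = visits.get(structure, 0) + 1
    return {s: v / n_iter for s, v in visits.items()}  # frequency estimates
```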
5.4 A heuristic for detecting temporal patterns
The correct way to model temporal structure is the Markov chain defined in (9). But the amount of data required to estimate this Markov chain for reasonable time delays (say, 50 to 100 ms) is enormous, and so is the time required for any statistical approach to this estimation task. We propose a heuristic that can be useful as an approximation: we view a temporal pattern as a "shifted synchrony". Suppose we frequently observe neuron 1 firing exactly two bins after neuron 2. When do we declare this phenomenon significant? In our simplified approach we shift the data of neuron 2 two bins ahead of neuron 1 and, neglecting the first two bins of neuron 1 and the last two of neuron 2, we obtain an artificially created parallel process for these two neurons in which the temporal pattern has become a synchrony. We declare a temporal pattern significant if the synchrony obtained by performing the corresponding shifts is significant. We introduce some notation: given four neurons 1, 2, 3, 4, we use the notation 1 2[1] 3[2] 4[-2] to indicate that we shift neuron 2 one bin after neuron 1, neuron 3 two bins after neuron 1, and neuron 4 two bins before neuron 1. The first number from left to right indicates the neuron with respect to which all shifts have been performed. A minimal sketch of this alignment appears below.
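This sketch assumes the spike trains have already been binned into a binary array (array and function names are ours); for the two-neuron example above, shifts = [0, -2] drops exactly the first two bins of neuron 1 and the last two of neuron 2.

```python
import numpy as np

def shift_to_synchrony(spikes, shifts):
    """Align binned spike trains so a temporal pattern becomes a synchrony.

    spikes: (n_neurons, n_bins) binary array of binned spike trains
    shifts: per-neuron lags (in bins) relative to the reference neuron;
            the pattern 1 2[1] 3[2] 4[-2] corresponds to shifts = [0, 1, 2, -2]
    Returns the overlapping portion of the realigned trains, to which the
    synchrony test can then be applied.
    """
    shifts = np.asarray(shifts)
    lead = max(0, -shifts.min())          # bins unusable at the start
    trail = max(0, shifts.max())          # bins unusable at the end
    n_bins = spikes.shape[1] - lead - trail
    aligned = np.empty((spikes.shape[0], n_bins), dtype=spikes.dtype)
    for i, s in enumerate(shifts):
        aligned[i] = spikes[i, lead + s : lead + s + n_bins]
    return aligned

# e.g. aligned = shift_to_synchrony(binned_data, [0, 1, 2, -2])
```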
[Figure 3 appears here: a histogram of estimated interaction strengths, with the horizontal axis ("Interaction Strength") running from -65 to 15.]

Figure 3: Histogram of Estimated Interactions of Two Neurons (1, 2 of Table 5)
6 Applying the Model to Simulated Data
We applied our models to synthetically generated data. We generated data randomly from a model of the form (2), where we specified Ξ and θ_A. Table 1 shows the nonzero θ_A in the model we used.
Cluster A      θ_A
4,6            0.45
3,4,6          0.74
2,3,4,5        1.79
2,3,4,5,6      0.61

Table 1: Nonzero Parameter Values for Synthetic Data
We used the θ_A from Table 1 to compute p(x) from (2) and generated random observations from p(x); a sketch of this generation step follows. Results from the analysis of a sample of 10,000 randomly generated observations appear in Table 2.
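For concreteness, here is a minimal sketch of the generation step, assuming the log-linear form p(x) ∝ exp(Σ_{A∈Ξ} θ_A Π_{i∈A} x_i) for (2) and enumerating all 2^k configurations to normalize (feasible for k = 6); the function name is ours.

```python
import numpy as np
from itertools import product

def sample_loglinear(theta, n_neurons, n_samples, seed=0):
    """Draw binary configurations x from the assumed log-linear form of (2):
    p(x) proportional to exp( sum_{A in Xi} theta_A * prod_{i in A} x_i ).

    theta: dict mapping clusters (1-based tuples) to theta_A, e.g. Table 1:
           {(4,6): 0.45, (3,4,6): 0.74, (2,3,4,5): 1.79, (2,3,4,5,6): 0.61}
    """
    rng = np.random.default_rng(seed)
    states = np.array(list(product([0, 1], repeat=n_neurons)))
    log_p = np.zeros(len(states))
    for cluster, t in theta.items():
        idx = [i - 1 for i in cluster]            # 1-based -> 0-based indices
        log_p += t * states[:, idx].prod(axis=1)  # theta_A * prod_{i in A} x_i
    p = np.exp(log_p - log_p.max())
    p /= p.sum()                                  # normalize over all 2^k states
    return states[rng.choice(len(states), size=n_samples, p=p)]
```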
Cluster A    G2       Posterior Probability   Posterior Probability    MAP            Standard Deviation
                      of A (Frequency)        of A (Best 100 models)   estimate θ̃_A   of θ_A
4,6          0.0001   0.99                    0.99                     0.46           0.10
3,4,6        <0.01    0.93                    0.96                     1.13           0.26

Table 2: Results for a randomly generated sample of 10,000 data from the distribution specified in Table 1
We ran the MC³ search algorithm described in Section 4 for 15,000 iterations. Posterior probabilities, estimated θ_A, and estimated σ_A are shown in Tables 1 and 2 for all interactions A with estimated posterior probability of at least .1 (i.e., interactions for which the posterior probability is at least as large as the prior probability). Posterior probabilities for interactions were estimated in two different ways: by frequencies collected over the 15,000 iterations of the MC³ algorithm and by normalizing probabilities for the 100 highest probability structures enumerated during the model search. Also shown in Table 2 are estimates of the interaction strength parameters θ_A conditional on θ_A ≠ 0. The estimate θ̃_A is a weighted average of the MAP estimates (7). Structures included in the weighted average were those from the 100 most probable models that contain θ_A. Structure weights are proportional to the structure posterior probability. The estimate σ̃_A is obtained from the sum of "within" and "between" variance components:
$$\tilde\sigma_A^2 \;=\; \sum_{A \in \Xi} \tilde p_\Xi \left[\, \tilde\sigma_{A\Xi}^2 + \left(\tilde\theta_{A\Xi} - \tilde\theta_A\right)^2 \right] \qquad (26)$$
The sum is over the 100 most probable enumerated structures that include A; the weight p̃_Ξ is proportional to the posterior probability of Ξ and normalized to sum to 1; the within-structure variance component is a weighted average of the appropriate diagonal elements of (18); and the between-structure variance component is the sum of squared deviations of the structure-specific parameter estimates about their mean. The results shown in Table 1 and Table 2 indicate that these two methods yield similar estimates.
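A minimal sketch of the variance composition in (26), assuming the per-structure weights, conditional MAP estimates, and conditional variances have been collected (names are illustrative):

```python
import numpy as np

def conditional_sd(weights, theta_cond, var_cond):
    """Within- plus between-structure variance for theta_A, eq. (26).

    weights:    posterior probabilities of the enumerated structures
                containing A (renormalized below to sum to one)
    theta_cond: conditional MAP estimates theta~_{A,Xi}
    var_cond:   conditional posterior variances sigma~^2_{A,Xi}
    """
    w = np.asarray(weights, float)
    w = w / w.sum()
    t = np.asarray(theta_cond, float)
    v = np.asarray(var_cond, float)
    theta_bar = w @ t                       # weighted average of MAP estimates
    return np.sqrt(w @ (v + (t - theta_bar) ** 2))
```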
In Table 2 the (4,6) and (3,4,6) interactions exhibit a very high posterior probability. The former
was estimated quite precisely, the latter to within two posterior standard deviations. Neither the
(2,3,4,5) nor the (2,3,4,5,6) interaction was detected by the frequentist/Bayesian analysis of the
synthetic data (their posterior probabilities were .03 and .04, respectively). This suggests that the
magnitude of these interactions was not sufficient to be detected in a sample of this size. No
interactions that were not present in the synthetic data model were detected by the data analysis.
6.1 Neuronal Interactions in Experimental Data
We begin by presenting data obtained by Vaadia. In this experiment, spiking events among groups of neurons were analyzed through simultaneous recordings of six neurons in the cortex of Rhesus
monkeys. The monkeys were trained to localize a source of light and, after a delay, to touch the
target from which the light was presented. At the beginning of each trial the monkeys touched a
"ready-key", then a central red light was turned on (ready signal). Three to six seconds later, a
visual cue was given in the form of a 200-ms light blink coming from either the left or the right.
After a delay of 1 to 32 seconds, the color of the ready signal changed from red to yellow and the
monkeys had to release the ready key and touch the target from which the cue was given.
The spikes of each neuron were encoded as discrete events by a sequence of zeros and ones with
time resolution of 1 millisecond. The activity of the whole group was described as a sequence of
configurations or vectors of these binary states. Since the method presented in this paper for
detecting synchronizations does not take into account nonstationarities, the data were selected by
cutting segments of time during which the cells did not show significant modulations of their firing rate. The segments were taken from 1000 ms before the cue onset until 1000 ms thereafter. The duration of each segment was 2 seconds. We used 94 such segments (total time: 188 seconds) for the analysis presented below. The data were then binned in time windows of 40 milliseconds. The choice of the bin length was determined in discussion with the experimenter.
The frequencies of configurations of zeros and ones in these windows are the data used for
analysis in this paper. We analyzed recordings prior to the ready signal separately from data
recorded after the ready signal. Each of these data sets is assumed to consist of independent trials from a model of the form (2). A sketch of the binning step appears below; Tables 3 and 4 then present the same type of results as those of Table 2 in the previous section.
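This sketch assumes spike times are given in milliseconds (the 40-ms bin width is the one used here; function and variable names are ours):

```python
import numpy as np
from collections import Counter

def configuration_counts(spike_times, t_start, t_stop, bin_ms=40.0):
    """Bin spike trains and count the binary configurations across neurons.

    spike_times: one array of spike times (ms) per neuron
    Returns a Counter over configurations (tuples of zeros and ones); these
    frequencies are the data to which the log-linear models are fitted.
    """
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    binned = np.array([(np.histogram(st, bins=edges)[0] > 0).astype(int)
                       for st in spike_times])    # 1 if >= 1 spike in the bin
    return Counter(map(tuple, binned.T))          # one configuration per bin
```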
Cluster A    G2      Posterior Probability   Posterior Probability    MAP            Standard Deviation
                     of A (Frequency)        of A (Best 100 models)   estimate θ̃_A   of θ_A
4,6          .001    0.98                    1.00                     0.49           0.11
2,3,4,5      >.09    0.33                    0.36                     2.34           0.67
3,4,6        >.05    0.22                    0.15                     0.80           0.27
2,3,4,5,6    >.05    0.16                    0.13                     2.53           0.88
1,4,5,6      >.05    0.16                    0.10                     6.67           1.19
3,4          >.05    0.12                    0.08                     0.63           0.23

Table 3: Results for Pre-Ready Signal Data, Coefficients with Posterior Probability > 0.1
There was a high probability second-order interaction in each data set: (4,6) in the pre-ready data
and (3,4) in the post-ready data. In the pre-ready data, a fourth-order interaction (2,3,4,5) had
posterior probability about 1/3 (this represents approximately three times the prior probability of
.1). The tables show a few other interactions with posterior probability larger than their prior
probability.
Cluster A    G2      Posterior Probability   Posterior Probability    MAP            Standard Deviation
                     of A (Frequency)        of A (Best 100 models)   estimate θ̃_A   of θ_A
3,4          0.03    0.86                    0.94                     1.00           0.27
2,5          >0.05   0.25                    0.18                     0.98           0.34
1,4,5,6      >0.05   0.18                    0.13                     1.06           0.36
1,4,6        >0.05   0.15                    0.08                     0.38           0.13

Table 4: Results for Post-Ready Signal Data: Effects with Posterior Probability > 0.1
Several data sets from this type of experiment were analyzed under the guidance of the sixth author, who conducted the experiments. With only one exception, interactions of order 2 or more were all characterized by positive effects. In data sets of 6-8 neurons from different segments of the experimental data described above, the presence of triplets was rare.
Comparison of Tables 3 and 4 reveals some similarities and some important differences. The pre-ready data show strong evidence for the (4,6) interaction; there is no evidence for this interaction in the post-ready data. The post-ready data show strong evidence for the (3,4) interaction; there is no evidence for this interaction in the pre-ready data. (Although the (3,4) interaction term appears in Table 3, its posterior probability is barely greater than its prior probability, indicating that the data show evidence neither for its presence nor for its absence.) Both these second-order interactions are positive. Single-element effect values are negative because the overall level of activation in these recordings was low. Results of this type, that is, of varying second-order structure with varying attentional state, have been obtained previously by several groups of researchers (Grün, 1995; Riehle et al., 1998), who have used the method introduced by Palm et al. (1988) for detecting surprising correlations. This method is essentially consistent with the method presented here for the second-order case. (A difference between the two approaches is the test used for determining significance: we use G², sketched below, while Palm et al. (1988) use a Fisher significance test comparing the data with the binomial distribution determined by the success probabilities fixed by the first-order model.)
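A minimal sketch of the G² computation, assuming observed configuration counts and the expected counts under the restricted model have already been tabulated (the chi-square reference distribution and the names below illustrate the standard likelihood-ratio test and are not code from the paper):

```python
import numpy as np
from scipy.stats import chi2

def g_squared(observed, expected, df):
    """Likelihood-ratio statistic G^2 = 2 * sum obs * log(obs / exp).

    observed: configuration counts in the data
    expected: counts predicted by the restricted model (e.g. the model
              without the interaction term under test)
    df:       difference in the number of free parameters
    """
    obs = np.asarray(observed, float)
    exp = np.asarray(expected, float)
    keep = obs > 0                          # empty cells contribute zero
    g2 = 2.0 * np.sum(obs[keep] * np.log(obs[keep] / exp[keep]))
    return g2, chi2.sf(g2, df)              # statistic and asymptotic p-value
```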
To double-check our methods, the fifth author also provided data from biologically plausible simulations of neural activity. Only second-order interactions were simulated, of different types: both synchronous and with small time delays. Both the frequentist and the Bayesian methods detected the simulated interactions and, in the case of overlapping pairs, did not falsely detect nonexistent triplets.
6.2 Neuronal Interactions in Anesthetized Rat Somatosensory Cortex
We also applied the models to data collected from the vibrissal region of the rat cortex. The
vibrissae are arranged in five rows on the snout (named A, B, C, D, and E) with four to seven
vibrissae in each row (Figure 4A). Neurons in the vibrissal area of the somatosensory cortex are
organized into columns (Figure 4B); layer IV of the column is composed of a tightly packed cluster
of neurons called a ”barrel” (Woolsey and Van der Loos, 1970). Each barrel-column is
topologically related to a single mystacial vibrissa on the contralateral snout; thus ”vibrissa D2”
(second whisker in row D) is linked to "column D2" (Welker, 1971). Our aim was to find out whether interacting neurons are distributed across cortical columns or restricted to single columns; furthermore, we examined whether the spatial distribution of interacting neurons varies as a function of the stimulus site.
The data analyzed here were recorded from the cortex of a rat lightly anesthetized with urethane
(1.5 g/kg) and held in a stereotaxic apparatus. Cortical neuronal activity was recorded through an
array of six tungsten microelectrodes inserted in cortical columns D1, D2, D3, C3, as well as in the
”septum” located between columns D1 and D2 (Figure 4). Using spike sorting methods, events
from a total of 15 separate neurons were discriminated. Locations of the neurons are shown in
Figure 4. During the recording session, individual whiskers were deflected across 50-55 trials by a
computer-controlled piezoelectric wafer that delivered an up-down movement lasting 100 ms; the
stimulus was repeated once per second (Diamond et al., 1993). We have restricted the analysis to
data collected during stimulation of vibrissae D1, D2, and D3 and only during the 200 ms
peristimulus period (100 ms after the up movement and 100 ms after the down movement). For
each stimulus site, then, there were 10,000-11,000 ms of data. As in the previous section, the
spiking events (in the 1 ms range) of each neuron were encoded as a sequence of zeros and ones,
and the activity of the whole group was described as a sequence of configurations or vectors of
these binary states.
The interactions illustrated in Figure 4 are those for which the likelihood test statistic is significant at the 0.0001 level. Figure 4A shows the interactions occurring during stimulation of whisker D1. Two neurons within column D1 showed a significant negative correlation. Interactions during stimulation of whisker D2 are given in Figure 4B. One neuron in column D3 and one in column C3 showed a pairwise positive interaction (firing together more frequently than expected by chance). In addition, a positive higher order interaction emerged, consisting of three neurons distributed across the D1-D2 septum and columns D2 and D3. During stimulation of whisker D3 a pairwise and a higher order interaction were detected, both of them negative (Figure 4C).
[Figure 4 appears here: three panels, each showing the 15 recorded neurons arranged by site across column D1, the D1-D2 septum, and columns D2, D3, and C3, with positive and negative correlations marked. Panel A: D1 stimulation; panel B: D2 stimulation; panel C: D3 stimulation.]

Figure 4: Selected Interactions (significance level 0.0001) from Results of the Analysis of Data on Spike Trains from the Somatosensory Cortex of Rats
From analysis of this limited data set, several interesting observations can be made: (1) both positive and negative interactions were detected; (2) interactions occurred both within a cortical columnar unit and across columns; (3) interactions were stimulus dependent (neurons were not continuously correlated with one another but became grouped together as a function of the stimulus parameter); (4) during different stimulus conditions a single neuron may take part in different groups (neuron 10 during stimulation of whiskers D2 and D3); (5) during one stimulus condition, the same neuron can participate in two separate interactions (neuron 10 during stimulation of whisker D3).
6.3 Higher order synchronization in the visual cortex of the behaving macaque
Precise patterns of neural activity of groups of neurons in the visual cortex of behaving macaques were detected by analyzing multi-unit recordings obtained by Freiwald. The experiment consisted of a simple fixation task: the monkey had to learn to detect a change in the intensity of a fixated light point on a screen and react with a motor response. The fixation task paradigm is the following: a fixation point appears on the screen and stays there for 8 to 9 seconds. The monkey fixates this point and then presses a bar. After the bar is pressed, an interval of approximately five to six seconds begins, during which the monkey fixates the point and keeps the bar pressed. Afterwards, the point darkens and disappears from the screen, and, as a consequence, the monkey releases the bar. During the five to six seconds a visual stimulus, which is not relevant for behavior, is presented on the screen. The data presented here were from four neurons recorded in the inferotemporal cortex (IT) on repeated trials of 5975 ms each, including 1000 ms before the stimulus onset. We decomposed each trial into a first chunk of 1000 ms, considered stationary (pre-stimulus), a non-stationary interval of 800 ms (phasic response), then four segments of 1000 ms each (tonic response), followed by a short last segment of 175 ms. We chose a bin width of 5 ms. We present here a comparison between the activity during the 1000 ms before the stimulus onset and the 1000 ms immediately after the non-stationary interval containing the phasic response. We performed a synchronization analysis with both the frequentist and the Bayesian method and obtained compatible results, in the sense that all synchronizations detected by the Bayesian method were declared significant at the .001 level.
Cluster A    Posterior Probability   Posterior Probability    MAP            Standard Deviation
             of A (Frequency)        of A (Best 100 models)   estimate θ̃_A   of θ_A
Fixation (1-1000 ms)
3,4          0.99                    1.00                     0.69           0.08
2,4          1.00                    1.00                     0.56           0.04
2,3          1.00                    1.00                     1.54           0.05
2,3,4        0.22                    0.22                    -0.66           0.02
1,4          0.99                    1.00                     0.70           0.01
1,3          0.99                    1.00                     1.72           0.01
1,2          0.99                    1.00                     1.45           0.00
1,2,3        0.99                    1.00                    -1.26           0.03

Table 5: Synchronies with probability larger than 0.1 in the first 1000 ms of the fixation task
Cluster A    Posterior Probability   Posterior Probability    MAP            Standard Deviation
             of A (Frequency)        of A (Best 100 models)   estimate θ̃_A   of θ_A
Phase 1 (1800-2800 ms)
3,4          0.26                    0.26                     0.29           0.30
2,4          1.00                    1.00                     0.53           0.02
2,3          0.99                    1.00                     1.06           0.05
2,3,4        0.17                    0.16                    -0.07           0.07
1,4          0.71                    0.70                     0.32           0.01
1,3          1.00                    1.00                     1.14           0.04
1,2          1.00                    1.00                     0.88           0.01
1,2,3        0.78                    0.78                     0.13           0.02

Table 6: Synchronies with probability larger than 0.1 in the first 1000 ms after the visual stimulus
7 Discussion
This paper proposes a parameterization of synchronization and, in general, of temporal correlation of neural activation of more than two neurons. This parameterization has a sound information-theoretical justification. If the problem is to analyze a group of neurons and establish whether they fire in synchrony, the statistical problem of assessing the significance of the corresponding effect being nonzero has a straightforward, reliable treatment. If the problem is to detect the whole interaction structure in a group of neurons (i.e., which subsets have interactions and how strong), the Bayesian approach has advantages over the frequentist approach because it does not have to face the problem of multiple null hypotheses and the multiple significance levels associated with them. Yet the Bayesian approach is of astronomical complexity when the temporal interaction structure of, say, 8 neurons is to be assessed. Even the heuristic method briefly discussed in Section 5.4 becomes basically infeasible for the Bayesian approach we exposed here, at least in its present form. New techniques for speeding up Bayesian search across structures will be presented in a forthcoming paper.
An important underlying assumption in our investigations has been that the data are basically
stationary. This is seldom the case. As we know from the important work by Abeles, Bergman,
Gat, Meilijson, Seidenmann, Tishby, and Vaadia (1995), cortical activity flips from one quasi-stationary state to the next. These authors developed a hidden Markov model technique to
determine these flips. A combination of their methods with the ones presented here would allow
the analysis of structure changes through quasi-stationary segments.
Another advancement in global structure analysis of spike activity has been obtained by de Sa’, de
Charms, and Merzenich (1997) by means of a Helmholtz machine that determines highly probable
patterns of activation based on spike train data. Here again a combination of methods would allow
analysis of the correlation structure in probable patterns.
While the first techniques for analysis of patterns in neural activation of recorded cells were simple
and transparent, the new techniques for determining correlation structure of more than two neurons
are complex and often even infeasible. The reason for our presentation of both the frequentist and
the Bayesian approach to detecting correlation structure is our wish to reinforce the use of the
frequentist method, in spite of all its shortcomings, because it is quick and can be used for large
numbers of neurons and long time delays. Our experience with the systematic comparison of the two methods has strengthened our confidence that a careful frequentist strategy is fairly reliable.
8 Acknowledgments
The authors thank Günther Palm for helpful discussions. An anonymous reviewer of an earlier
version of this paper is gratefully acknowledged for extremely useful comments that have
influenced the preparation of this paper.
Dr. Laskey’s work was supported by a Career Development Fellowship with the Krasnow
Institute for the study of cognitive science at George Mason University.
References
Abeles, M. (1991). Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge: Cambridge
University Press.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidenmann, E., Tishby, N. and Vaadia, E.
(1995). Cortical Activity Flips among quasi-Stationary States. Proc. Natl. Acad. Sci. 92:
8616-8620.
Abeles, M., Bergman, H., Margalit, E. and Vaadia, E. (1993). Spatiotemporal Firing Patterns in the Frontal Cortex of Behaving Monkeys. Journal of Neurophysiology 70, 4: 1629-1638.
Abeles, M. and Gerstein, G. (1988). Detecting Spatiotemporal Firing Patterns among Simultaneously Recorded Single Neurons. J. Neurophysiol. 60: 909-924.
Abeles, M., Prut, Y., Bergman, H. and Vaadia, E. (1994). Synchronization in Neural
Transmission and its Importance for Information Processing. Prog. Brain Res. 102: 395-404.
Aertsen, A., Gerstein, G., Habib, M. and Palm, G. (1989). Dynamics of Neural Firing Correlation: Modulation of "Effective Connectivity." J. Neurophysiol. 61: 900-917.
Aertsen, A. (1992). Detecting Triple Activity. Preprint.
Bishop, Y., Fienberg, S. and Holland, P. (1975) Discrete Multivariate Analysis. Cambridge, MA:
MIT Press.
Cover, T. and Thomas, J. (1991). Elements of Information Theory. New York: Wiley
Series in Communications.
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré 7: 1-68. (Reprinted in English translation in 1964 as "Foresight: Its Logical Laws, Its Subjective Sources," in Studies in Subjective Probability, edited by Kyburg, H. E., Jr., and Smokler, H. E.)
de Sa', R., de Charms, R. and Merzenich, M. (1997). Using Helmholtz Machines to Analyze Multi-Channel Neuronal Recordings. Neural Information Processing Systems 10: 131-136.
De Groot, M. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.
Diamond, M., Armstrong-James, M. and Ebner, F. (1993). Experience-Dependent Plasticity in the
Barrel Cortex of Adult Rats, Proc. Natl. Acad. Sci. 90: 2602-2606.
Gerstein, G., Perkel, D. and Dayhoff, J. (1985). Cooperative Firing Activity in Simultaneously
Recorded Populations of Neurons: Detection and Measurement. Journal of
Neuroscience, 5: 881-889.
Good, I. (1963). On the Population Frequencies of Species and the Estimation of
Population Parameters. Biometrika 40: 237-264.
Grün, S., Aertsen, A., Vaadia, E. and Riehle, A. (1995). Behavior-Related Unitary
Events in Cortical Activity. Learning and Memory: 23rd Göttingen
Neurobiology Conference (poster presentation).
Grün, S. (1996). Unitary Joint-Events in Multiple-Neuron Spiking Activity: Detection, Significance and Interpretation. Frankfurt: Verlag Harry Deutsch.
Hebb, D. (1949). The Organization of Behavior. New York: Wiley.
Kass, R. and Raftery, A. (1995). Bayes Factors. Journal of the American Statistical
Association. 90, no. 430: 773-795.
Laskey, K. and Martignon, L. (1996). Bayesian Learning of Log-linear Models for
Neural Connectivity. Proceedings of the XII Conference on Uncertainty in
Artificial Intelligence. Horvitz, E. (ed.). San Mateo: Morgan Kaufmann.
Madigan, D. and York, J. (1993). Bayesian Graphical Models for Discrete Data.
Technical Report 239, Department of Statistics, University of Washington.
Martignon, L., v. Hasseln, H., Grün, S., Aertsen, A. and Palm, G. (1995). Detecting Higher Order Interactions among the Spiking Events of a Group of Neurons. Biol. Cybern. 73: 69-81.
von Mises, R. (1939). Probability, Statistics and Truth. London: George Allen &
Unwin Ltd.
Neyman, J. and Pearson, E. (1928). On the Use and the Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika 20: 175-240 (Part I); 263-294 (Part II).
Palm, G., Aertsen, A. and Gerstein, G. (1988). On the Significance of Correlations
among Neuronal Spike Trains. Biol. Cybern. 59: 1-11.
Prut, Y., Vaadia, E., Abeles, M., Haalman, I. and Slovin, H. (1993). Dynamics of Neuronal
Interaction in the Frontal Cortex of Behaving Monkeys. Concepts in Neuroscience 4, no. 2: 131-158.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H. and Abeles, M. (1998).
Spatiotemporal Structure of Cortical Activity: Properties and Behavioral
Relevance. Journal of Neurophysiology. 79: 2857-2874.
Rauschecker, J. (1991). Mechanisms of Visual Plasticity: Hebb Synapses, NMDA Receptors and Beyond. Physiol. Rev. 71: 587-615.
Riehle, A., Grün, S., Diesmann, M. and Aertsen, A. (1998). Science. 281: 34-42.
Singer, W. (1994). Time as Coding Space in Neocortical Processing: A Hypothesis. In: Buzsaki, G. et al. (eds.), Temporal Coding in the Brain. Berlin, Heidelberg: Springer-Verlag.
Strangman, G. (1997). Detecting Synchronous Cell Assemblies with Limited Data and Overlapping Assemblies. Neural Computation 9, 1: 51-76.
Tierney, L. and Kadane, J. (1986). Accurate Approximations of Posterior Moments
and Marginal Densities. Journal of the American Statistical Association. 81:
82-86.
Welker, C. (1971). Microelectrode Delineation of Fine Grain Somatotopic Organization of SmI
Cerebral Neocortex in Albino Rat. Brain Res. 26: 259-275.
Whittaker, J. (1989). Graphical Models in Applied Multivariate Statistics. New York: Wiley.
Woolsey, T.A. and Van der Loos, H. (1970). The Structural Organization of Layer IV
in the Somatosensory Region, SI, of Mouse Cerebral Cortex. Brain Res. 17: 205-242.