Journal of Mathematical Psychology 72 (2016) 43–55
Contents lists available at ScienceDirect
Journal of Mathematical Psychology
journal homepage: www.elsevier.com/locate/jmp
An evaluation of alternative methods for testing hypotheses, from the
perspective of Harold Jeffreys
Alexander Ly ∗ , Josine Verhagen, Eric-Jan Wagenmakers
University of Amsterdam, Department of Psychology, Weesperplein 4, 1018 XA Amsterdam, The Netherlands
highlights
• Reply to Robert (2016) The expected demise of the Bayes factor.
• Reply to Chandramouli and Shiffrin (2016) Extending Bayesian induction.
• Further elaboration on Jeffreys’s Bayes factors.
Article info
Article history: Available online 15 February 2016
Keywords:
Bayes factors
Induction
Model selection
Replication
Statistical evidence
abstract
Our original article provided a relatively detailed summary of Harold Jeffreys’s philosophy on statistical
hypothesis testing. In response, Robert (2016) maintains that Bayes factors have a number of serious
shortcomings. These shortcomings, Robert argues, may be addressed by an alternative approach that
conceptualizes model selection as parameter estimation in a mixture model. In a second comment,
Chandramouli and Shiffrin (2016) seek to extend Jeffreys’s framework by also taking into consideration
data distributions that do not originate from either of the models under test. In this rejoinder we argue
that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than
he suggests, as they share the same ‘‘shortcomings’’. On the other hand, we show that the proposition
of Chandramouli and Shiffrin (2016) to extend the Bayes factor is in fact further removed from Jeffreys’s
view on testing than the authors suggest. By elaborating on these points, we hope to clarify our case for
Jeffreys’s Bayes factors.
© 2016 Elsevier Inc. All rights reserved.
In our original article (Ly, Verhagen, & Wagenmakers, 2016)
we outlined how Harold Jeffreys constructed his hypothesis tests.
Jeffreys’s tests contrast a precise, point-null hypothesis M0 versus
a more general alternative hypothesis M1 . Here the point-null
hypothesis represents a general law, an invariance, or a categorical
causal claim (e.g., ‘‘apple trees always bear apples’’; ‘‘people cannot
look into the future’’; ‘‘Alzheimer’s disease is caused by a fungal
infection of the central nervous system’’), whereas the alternative
hypothesis relaxes that law. Jeffreys’s tests require a thoughtful
specification of the prior distribution for the parameter of interest,
and much of Jeffreys’s work was concerned with providing good
default specifications—‘‘good’’ in the sense that they adhere to
general common-sense desiderata (e.g., Bayarri, Berger, Forte, &
García-Donato, 2012). We are pleased that our summary attracted
two comments by renowned researchers; below we respond to their ideas in a way that we hope is consistent with the overall philosophy of Harold Jeffreys himself.

∗ Corresponding author. E-mail address: [email protected] (A. Ly).
http://dx.doi.org/10.1016/j.jmp.2016.01.003
0022-2496/© 2016 Elsevier Inc. All rights reserved.
1. Rejoinder to Robert
In general, Robert’s (2016) comments highlight the inevitable
subtleties in constructing a Bayes factor. His alternative mixture
model procedure is practical and may be immensely valuable for
specific situations (i.e., hierarchical models) that are common in
psychological research. Nevertheless, we believe Robert’s suggestion about the demise of the Bayes factor to be an overstatement.
1.1. Robert’s critique on the Bayes factor
Our understanding of Jeffreys’s method is partly based on the
work by Robert and colleagues (2009), and it should, therefore,
not come as a surprise that Robert’s view and ours overlap to a
considerable degree. Robert’s arguments for dismissing the Bayes
factor can be grouped in terms of (1) its usage in making decisions
and (2) the care that needs to be taken in choosing the priors.
1.1.1. First critique: the distinction between inference and decision making
We share Robert’s discontent with the statistical practice that
emphasizes all-or-none decisions at some arbitrary threshold, and
we agree that scientific learning should instead be guided by a
continuous measure of evidence. In the process of eviscerating p-value null hypothesis tests, Rozeboom (1960, pp. 422–423) already
expressed a similar sentiment:
‘‘The null-hypothesis significance test treats ‘acceptance’ or
‘rejection’ of a hypothesis as though these were decisions
one makes. But a hypothesis is not something, like a piece
of pie offered for dessert, which can be accepted or rejected
by a voluntary physical action. Acceptance or rejection of a
hypothesis is a cognitive process, a degree of believing or
disbelieving which, if rational, is not a matter of choice but
determined solely by how likely it is, given the evidence, that
the hypothesis is true’’.
Our favorite continuous measure of evidence is of course a Bayes
factor constructed from a pair of priors selected according to
Jeffreys’s desiderata, or, in short, a Jeffreys’s Bayes factor. It is
important to note that this measure provides only the first of three
Bayesian ingredients needed for decision making. The other two
ingredients are the prior model probabilities (which, combined
with the Bayes factor, yield posterior model probabilities) and
the specification of a loss function (or, equivalently, a utility function; Berger, 1985; Lindley, 1977; Robert, 2007).
For instance, consider a Bayes factor of BF10 (d) = 4.6 for the observed data d. This Bayes factor can be converted to a posterior model probability of P (M0 | d) ≈ 0.18 when we set P (M0 ) = P (M1 ) = 1/2 (Ly et al., 2016). One possible subsequent decision rule is then to accept M1 because it has the highest
posterior model probability. We did not intend to suggest such a
procedure, as the decision is clearly sensitive to the prior model
probabilities. Furthermore, we do not recommend uniform prior
model probabilities regardless of scientific context. In fact, when
decision making is desired, the assignment of prior model probabilities is left to the substantive researcher. Such flexibility in assignment introduces subjectivity, and this may be seen either as a
disadvantage or as an advantage. At any rate, prior model probabilities can be used to formalize the adage that ‘‘extraordinary
claims require extraordinary evidence’’ (e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Moreover, the prior model
probabilities can be used to address the problem of multiplicity (e.g., Jeffreys, 1961; Scott & Berger, 2010; Stephens & Balding,
2009). A similar argument applies to utility functions: these may
be subjective and hard to elicit, but such difficulties do not sanction the practice of ignoring utility functions altogether, at least
not when the purpose is to make decisions.
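To make the arithmetic concrete, the conversion from a Bayes factor and prior model probabilities to posterior model probabilities can be sketched as follows (a minimal illustration of Bayes' rule on the model level; the function name is ours):

```python
def posterior_prob_m1(bf10, prior_m1=0.5):
    # Posterior odds = Bayes factor * prior odds; then convert odds to a probability.
    prior_odds = prior_m1 / (1 - prior_m1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)

p_m1 = posterior_prob_m1(4.6)              # BF10(d) = 4.6 with P(M0) = P(M1) = 1/2
print(round(p_m1, 2), round(1 - p_m1, 2))  # P(M1 | d) ≈ 0.82, P(M0 | d) ≈ 0.18
```

Varying `prior_m1` shows directly how sensitive any subsequent accept/reject decision is to the prior model probabilities.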
Thus, Robert worries that computation of Bayes factors may
tempt users to make all-or-none decisions while disregarding
prior model probabilities or loss functions. We agree with Robert
that there is a considerable difference between inference and
decision making, and that scientific learning should be guided by
a continuous measure of evidence that incorporates what we have
learned from the observed data. The Bayes factor is such a measure.
1.1.2. Second critique: the Jeffreys–Lindley–Bartlett paradox
We suspect that the Jeffreys–Lindley–Bartlett (henceforth JLB)
paradox is central to Robert’s (1993; 2014) dismissal of the Bayes
factor and it is the main motivation for the development of the
mixture model alternative. We take a closer look at the JLB paradox
and discuss two consequences foreseen by Jeffreys, who was
keenly aware of the ‘‘paradox’’ from the very beginning (Etz &
Wagenmakers, 2015).
First, the JLB paradox implies that we cannot use improper
priors to construct a Bayes factor. For instance, to estimate µ within
the normal model M1 : X ∼ N (µ, 1), we typically employ
Jeffreys’s (1946) prior π(µ) ∝ 1. The reason to do so stems from
the fact that Jeffreys’s prior is translation-invariant, leading to a
posterior that is independent of how researchers parameterize
the problem (Ly, Marsman, Verhagen, Grasman, & Wagenmakers,
2015). The JLB paradox implies that we cannot use this same
(estimation) prior on the test-relevant parameter for a Bayesian
test. More specifically, when we pit the aforementioned model
M1 against the null model M0 : X ∼ N (0, 1) the improper
prior π1 (µ) ∝ 1 then becomes useless. To see this we consider
the Jeffreys’s prior as the limit of proper priors µ ∼ N (0, τ 2 )
with τ tending to infinity. The Bayes factor for the observed data
d = (n, x̄) is then given by

$$\lim_{\tau\to\infty} \widetilde{BF}_{10;\tau}(d) = \lim_{\tau\to\infty} \frac{\int \exp\left[-\frac{n}{2}(\bar{x}-\mu)^{2}\right]\exp\left[-\frac{1}{2\tau^{2}}\mu^{2}\right]\mathrm{d}\mu}{\sqrt{2\pi}\,\tau\exp\left[-\frac{n}{2}\bar{x}^{2}\right]} \qquad (1)$$

$$= \lim_{\tau\to\infty} \frac{1}{\sqrt{1+n\tau^{2}}}\exp\left[\frac{(\tau n\bar{x})^{2}}{2(1+n\tau^{2})}\right] = 0, \qquad (2)$$
regardless of the fixed sample size n and the observed sample
mean x̄. As such, the Bayes factor constructed from the improper
Jeffreys’s prior will always favor the null model and this also holds
for other improper priors. Moreover, Eq. (2) shows that for fixed
data d = (n, x̄) and a Bayes factor constructed from a normal prior
with hyperparameter τ, we can obtain a Bayes factor in favor of the null hypothesis of arbitrary size (i.e., BF̃10;τ(d) < 1) simply by taking τ large enough.
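The vanishing limit in Eq. (2) is easy to check numerically. The sketch below (the function name is ours) evaluates the closed-form expression inside the limit of Eq. (2) for increasing τ:

```python
import math

def bf10_tau(n, xbar, tau):
    # Closed-form Bayes factor for M1: X ~ N(mu, 1) with mu ~ N(0, tau^2),
    # against M0: X ~ N(0, 1); the expression inside the limit of Eq. (2).
    num = math.exp((tau * n * xbar) ** 2 / (2 * (1 + n * tau ** 2)))
    return num / math.sqrt(1 + n * tau ** 2)

for tau in (1, 10, 100, 1000):
    print(tau, bf10_tau(n=10, xbar=0.5, tau=tau))
```

The exponential factor converges to exp(nx̄²/2), while the 1/√(1 + nτ²) factor drives the product to zero, regardless of n and x̄.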
Hence, the JLB paradox effectively implies that a testing
problem should be treated differently from one that is concerned
with estimation. As such, when π1 is interpreted as prior belief
about the parameters θ1 , in the example above θ1 = µ, one’s
belief about the parameter then changes depending on whether
one is concerned with testing or estimating. More generally,
this difference is due to the fact that estimation is typically a
within-model affair. Recall that a model Mi specifies a relationship
fi (d | θi ) that defines which parameters θi are relevant in the data
generating process of the data d. Hence, the function fi gives the
(only) context in which the parameters θi can be perceived.
In essence, fi is what makes it meaningful to calculate a posterior distribution for the parameter. To underline this point
we add subscripts to the parameters indicating model membership
in the next example, by taking θ0 = σ0 and θ1 = (µ1 , σ1 )
for f0 and f1 both normals. For example, when we assume that
M0 : X ∼ N (0, σ02 ), only a posterior for the standard deviation σ0 is worth pursuing, as the posterior for the population
mean remains zero, regardless of the data. Within M0 , the Jeffreys’s
prior for σ0 is given by π0 (σ0 ) ∝ σ0−1 , which can be updated
to a posterior π0 (σ0 | d). On the other hand, under M1 : X ∼
N (µ1 , σ12 ) we are dealing with two parameters of interest. Within
M1 , the Jeffreys’s prior for µ1 is π1 (µ1 ) ∝ 1, for σ1 it is π1 (σ1 ) ∝ 1/σ1
and we take π1 (µ1 , σ1 ) = π1 (µ1 )π1 (σ1 ). These priors can be
updated to posteriors π1 (µ1 | d) and π1 (σ1 | d). Even though the
two priors π0 (σ0 ) and π1 (σ1 ) have the same form, they do not
lead to the same posterior. In fact, due to the presence of µ1 as
a parameter, the posterior mean of π1 (σ1 | d) within M1 will be smaller than or equal to the posterior mean of π0 (σ0 | d) within M0 .
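This ordering of the two posterior means can be illustrated numerically. The sketch below is our own (the data, grid, and seed are arbitrary illustrative choices); it integrates both posteriors for σ on a grid, with µ1 integrated out analytically under its flat prior:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)        # illustrative data with a clearly nonzero mean
n = x.size

sigma = np.linspace(0.05, 10.0, 20_000)  # grid for simple quadrature
dx = sigma[1] - sigma[0]

# M0: X ~ N(0, sigma0^2) with pi0(sigma0) proportional to 1/sigma0.
post0 = sigma ** (-n - 1) * np.exp(-np.sum(x ** 2) / (2 * sigma ** 2))
post0 /= post0.sum() * dx

# M1: X ~ N(mu1, sigma1^2) with pi1(mu1) flat and pi1(sigma1) proportional to 1/sigma1;
# integrating mu1 out leaves sigma^(-n) * exp(-sum((x - xbar)^2) / (2 sigma^2)).
post1 = sigma ** (-n) * np.exp(-np.sum((x - x.mean()) ** 2) / (2 * sigma ** 2))
post1 /= post1.sum() * dx

mean0 = (sigma * post0).sum() * dx
mean1 = (sigma * post1).sum() * dx
print(mean1 <= mean0)  # the free mean mu1 absorbs spread that M0 must attribute to sigma0
```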
Thus, when we are interested in the standard deviation σi , it matters
whether we believe that M0 holds true or whether the population
mean µ1 plays a role in the data generating process as specified
by f1 . The Bayes factor helps us distinguish which of the two
models is better suited to the data and which posterior for σi we
should report. Hence, testing is a between-model matter. Jeffreys
himself was very clear about the distinction between estimation
and testing:
‘‘We are now concerned with the more difficult question: in
what circumstances do observations support a change of the
form of the law itself? This question is really logically prior to
the estimation of the parameters, since the estimation problem
presupposes that the parameters are relevant’’. (Jeffreys,
1961, p. 245).
Hence, testing implies that we are uncertain about which of the
two functional relationships defined by the models M0 and M1 is
adequate for the data under study. This uncertainty is expressed
through the prior statement P (M0 ), P (M1 ) > 0 and when M0
and M1 are the only models under consideration we require that
P (M0 ) + P (M1 ) = 1. The priors π1 , π0 in a Bayes factor are,
thus, chosen to guide scientific learning, that is, how one transitions from prior model odds to posterior model odds and are not
designed to yield posteriors that are good for estimation. To simplify notation, we drop the subscripts indicating model membership when the context is clear.
Second, the separation of estimation and testing and the resulting separation of models led us to instantiate the hypotheses Hi
with their respective models Mi as discussed in Ly et al. (2016). In
effect, we have different contexts in which the respective parameters exist and, therefore, a philosophical conundrum in what is
meant by common parameters. The difference between the posteriors π0 (σ | d) and π1 (σ | d) discussed above showed that one
should not be fooled by the fact that the Greek letters are identical.
We therefore agree with Robert’s warning concerning the treatment of common parameters.
For the t-test the commonality between the two σ s within M0
and M1 is given by their meaning as a scaling parameter within
either model. Furthermore, the nesting of π0 (σ ) as π1 (µ, σ ) =
π1 (δ)π0 (σ ) can be considered a practical choice. In effect, the
Bayes factor BF10 (d) is then given by the ratio of the following two
marginal likelihoods

$$p(d \mid M_1) = (2\pi)^{-n/2} \int_0^\infty \sigma^{-n} \int_{-\infty}^{\infty} \exp\left[-\frac{n}{2}\left\{\left(\frac{\bar{x}}{\sigma}-\delta\right)^{2}+\left(\frac{s}{\sigma}\right)^{2}\right\}\right] \pi_1(\delta)\,\mathrm{d}\delta\;\pi_0(\sigma)\,\mathrm{d}\sigma, \qquad (3)$$

$$p(d \mid M_0) = (2\pi)^{-n/2} \int_0^\infty \sigma^{-n} \exp\left[-\frac{n}{2\sigma^{2}}\left(\bar{x}^{2}+s^{2}\right)\right] \pi_0(\sigma)\,\mathrm{d}\sigma, \qquad (4)$$

where d = (n, x̄, s²). We would like to thank Robert for pointing out our notational inaccuracy, as Eq. (9) in Ly et al. (2016) should actually be Eq. (3), that is, the marginal likelihood of the alternative model and, thus, the numerator of the Bayes factor BF10(d) after π1(δ) and π0(σ) are specified. In the original text we had already filled in π0(σ) ∝ 1/σ, a choice which we elaborated on in Section 3.2.2 of Ly et al. (2016).
With the nesting of π0 within π1 we made the following recommendation explicit: ‘‘It is to be understood that in pairs of equations of this type [such as Eqs. (3) and (4)] the sign of proportionality indicates the same constant factor, which can be adjusted to make the total probability 1’’ (Jeffreys, 1961, p. 247). More precisely, an improper prior π0(σ) ∝ 1/σ has a suppressed normalization constant π0(σ) = c0/σ, and we not only take π1(σ) = c1/σ of the same form, but also choose to set c1 = c0, which allows us to use improper priors on the nuisance parameters (see Berger, Pericchi, & Varshavsky, 1998 for a theoretical justification). More examples of this type of nesting can be found in Dawid and Lauritzen (2001), Consonni and Veronese (2008), and references therein.
1.2. Jeffreys’s common-sense desiderata
‘‘Rejection of a null hypothesis is best when it is interocular’’. Edwards, Lindman, and Savage (1963, p. 240).
In conclusion, the JLB paradox prohibits the usage of improper priors for testing, separates the estimation practice from a testing concern, and challenges the idea of common parameters. As noted above, we first require a justification before we can use the same prior on the nuisance parameters. After doing so, we then create an exception to the ban on improper priors, allowing us to assign improper priors to the nuisance parameters, say, θ0 = σ. Furthermore, let δ denote the test-relevant parameter with, say, θ1 = (θ0, δ). Hence, after specifying Jeffreys’s translation-invariant priors on the nuisance parameters θ0, which we would use for estimation within each model, we need only set the prior π1(δ) in order to define the Bayes factor BF10(d). We suspect that Jeffreys’s underlying reason for the choice of π1(δ) was to have a test that passes ‘‘the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes’’. Edwards et al. (1963, p. 217).
We believe that the information consistency criterion makes explicit which data hit us right between the eyes. This criterion leads to a Bayes factor that is consistent for a finite sample, a requirement that is much harder to fulfill than the asymptotic consistency criterion, at least for parametric models (e.g., Bickel & Kleijn, 2012; Yang & Le Cam, 2000). We agree with Robert that information consistency is in some cases an approximate statement. In particular, when the data are either distributed according to M0 : X ∼ N(0, σ²) or M1 : X ∼ N(µ, σ²), then the interocular data set with n > 2, x̄ ≠ 0 and, in particular, s² = 0 occurs with zero probability under both models, due to the assumption σ > 0. However, when M0 and M1 are the only two models under consideration, the observation x̄ ≠ 0 with n > 2, in addition to s² = 0, should then lead to the logical exclusion of M0, thus, BF01(d) = 0.
To appreciate the information consistency criterion, we revisit the Bayesian t-test with Bayes factors BF̃10;τ(d) that lack this property, constructed from π0(σ) ∝ 1/σ and π1(δ, σ) = π1(δ)π0(σ), where π1(δ) is normal around zero with a standard deviation τ, i.e.,

$$\widetilde{BF}_{10;\tau}(d) = (1+n\tau^{2})^{\frac{n-1}{2}} \left(\frac{1+\frac{n\bar{x}^{2}}{ns^{2}}}{(1+n\tau^{2})+\frac{n\bar{x}^{2}}{ns^{2}}}\right)^{\frac{n}{2}}. \qquad (5)$$
As before, letting τ tend to infinity, while keeping n, x̄ and s² fixed, yields the JLB paradox, i.e., limτ→∞ BF̃10;τ(d) = 0.
To simplify the discussion we suppose that τ is set to one. The resulting Bayes factor BF̃10;τ=1(d) is then asymptotically consistent. This means that if we repeatedly sample from the null model, we let n tend to infinity and simultaneously let nx̄²/(ns²) = t²/(n − 1) tend to zero, yielding a Bayes factor of zero, where t is the usual t-statistic t = √n x̄/s_{n−1}. Similarly, if we repeatedly sample from the alternative model, we let n tend to infinity and simultaneously let t²/(n − 1) tend to infinity, yielding a Bayes factor of infinity. Thus, this Bayes factor BF̃10;τ=1(d) is able to detect the correct model when the number of data points tends to infinity.
The Bayes factor BF̃10;τ=1(d), however, is not information consistent. For the t-test, information consistency is concerned with having a fixed number of data points n > 2, an observed sample mean, say, x̄ ≠ 0, and s² tending to zero. With τ, n and x̄ fixed, this Bayes factor BF̃10;τ=1(d) is a decreasing function of s² that attains its maximum in the limit s² → 0. For instance, when n = 4, x̄ = 7 the maximum is given by lim_{s²→0} BF̃10;τ=1(d) = 11.18. Note that the data set with n = 4, x̄ = 7 and s² → 0
is interocular, as it leads to an observed sample effect size, a realization of the t-statistic, that tends to infinity, which should therefore lead to infinite support for the alternative over the null model. The fact that the information inconsistent Bayes factor BF̃10;τ=1(d) is bounded makes it hard to interpret. For instance, the observations n = 4, x̄ = 7 and s² = 1 yield a Bayes factor of BF̃10;τ=1(d) = 9.6, which does not seem like a lot of evidence against the null, but with respect to its maximum of 11.18 it might be considered substantial.
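Both numbers can be reproduced directly from Eq. (5); a short numerical sketch (the helper name is ours):

```python
def bf10_tilde(n, xbar, s2, tau=1.0):
    # Eq. (5): BF~_{10;tau}(d) = (1 + n tau^2)^((n-1)/2)
    #   * [(1 + n xbar^2/(n s^2)) / ((1 + n tau^2) + n xbar^2/(n s^2))]^(n/2)
    ratio = (n * xbar ** 2) / (n * s2)
    scale = 1 + n * tau ** 2
    return scale ** ((n - 1) / 2) * ((1 + ratio) / (scale + ratio)) ** (n / 2)

print(round(bf10_tilde(n=4, xbar=7, s2=1.0), 1))    # 9.6
print(round(bf10_tilde(n=4, xbar=7, s2=1e-12), 2))  # 11.18, the bound (1 + n tau^2)^((n-1)/2)
```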
On the other hand, a Jeffreys’s Bayes factor is by construction information consistent and has a supremum at infinity, which makes it easier to interpret. Jeffreys referred to this and other desiderata as common sense, as they came naturally to him (Etz & Wagenmakers, 2015), but it took a long time before
his intuition was formalized by Berger and Pericchi (2001) and
extended by Bayarri et al. (2012).
Recall that information consistency in a t-test requires us to
construct a Bayes factor from a heavy-tailed prior on δ and we
agree with Robert that the Cauchy prior with scale γ = 1 is only
one of many possible choices. This is why we included a robustness
analysis in our open-source software package JASP (https://jasp-stats.org/). However, we believe that the merit of a Jeffreys’s Bayes
factor (with γ fixed) is due to the fact that it kickstarts scientific
learning.
‘‘In any of these cases it would be perfectly possible to give a
form of [π1 (δ)] that would express the previous information
satisfactorily, and consideration of the general argument of
[Chapter] 5.0 will show that it would lead to common-sense
results, but they would differ in scale. As we are aiming chiefly
at a theory that can be used in the early stages of a subject,
we shall not at present consider the last type of case’’ (Jeffreys,
1961, p. 252).
Thus, Jeffreys was not opposed to incorporating previously
acquired data in a Bayesian hypothesis test, but to do so he first
designed a starting Bayes factor, for a first data set, say, dorig . After
observing dorig , we can then straightforwardly update a Jeffreys’s
Bayes factor for a future, not yet observed, data set, say, drep .
This informed Bayes factor BF10 (drep | dorig ) is then constructed
from the priors π1 (θ1 | dorig ) and π0 (θ0 | dorig ). This idea forms the
basis of the replication Bayes factors introduced in Verhagen and
Wagenmakers (2014) and is further exploited in Ly et al. (2015).
Hence, the man who discovered the origin of the earth also provided us with the starting point for scientific learning.
1.3. Robert’s alternative approach
‘‘Prior distributions must always be chosen with the utmost
care when dealing with mixtures and their bearings on
the resulting inference assessed by a sensitivity study. The
fact that some noninformative priors are associated with
undefined posteriors, no matter what the sample size, is a clear
indicator of the complex nature of Bayesian inference for those
models’’ (Marin & Robert, 2014, p. 199).
As an alternative to Bayes factors, Robert (2016) suggests using a
mixture model approach elaborated upon in Kamary, Mengersen,
Robert, and Rousseau (2014). The data generating process of a
mixture model can be envisioned as a stepwise procedure. First,
a membership variable zj is realized; in a two-component mixture,
zj assumes either the value zero or one. Next, given the outcome
zj = 0 (or zj = 1) a data point xj is generated according to
M0 : Xj ∼ f0 (xj | θ0 ) (or M1 : Xj ∼ f1 (xj | θ1 )). This means that
the complete data should consist of n-pairs (z1 , x1 ), . . . , (zn , xn ),
but in reality we only have the observations d = x1 , . . . , xn .
As a result of not observing the membership variables zj , the
observations are perceived as if each of the data points were
generated from the (arithmetic) mixture model Ma : Xj ∼
(1 − α)f0 (xj | θ0 ) + α f1 (xj | θ1 ), where α is the mixture proportion.
The artificial encompassing model Ma therefore contains the two
competing models, M0 and M1 , as special cases; when α = 0 and
α = 1 respectively. Hence, to uncover whether the observations
are more consistent with M0 or M1 , Kamary et al. (2014) suggest focusing on estimating α within the encompassing model Ma .
Inferring α amounts to a missing data problem which is
in principle computationally intensive as there are 2n different
combinations for the membership variables zj s. Luckily, one can
resort to a completion method pioneered by Diebolt and Robert
(1994). When this stochastic exploration method allocates n0 and n1 observations to M0 and M1 , respectively, and we use a beta prior on the mixture proportion, α ∼ B (a, a), the posterior for α is then given by B (a + n1 , a + n0 ). When n0 is large and n1 small or zero, the posterior for α then concentrates most of its mass near zero, indicating more evidence for M0 , as one would expect.
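A minimal sketch of this completion scheme for a toy two-component case may help. Everything below is our own illustrative assumption: the component densities f0 = N(0, 1) and f1 = N(2, 1) are fully known, the data all come from f0, and we set a = 1 so that the natural Gibbs sampler behaves well:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)  # all observations actually come from f0

a = 1.0        # Beta(a, a) prior on alpha, the mixture weight of f1
alpha = 0.5
draws = []
for _ in range(2000):
    # Complete the data: sample each membership z_j given alpha and x_j.
    p1 = alpha * np.exp(-0.5 * (x - 2.0) ** 2)   # f1 = N(2, 1), known
    p0 = (1.0 - alpha) * np.exp(-0.5 * x ** 2)   # f0 = N(0, 1), known
    z = rng.random(x.size) < p1 / (p0 + p1)
    n1 = int(z.sum())
    n0 = x.size - n1
    # Conjugate update: alpha | z ~ Beta(a + n1, a + n0).
    alpha = rng.beta(a + n1, a + n0)
    draws.append(alpha)

posterior = np.array(draws[500:])  # discard burn-in
print(posterior.mean())            # mass concentrates near zero, favoring M0
```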
Kamary et al. (2014) note that the data generative view
of the mixture model is theoretically justified, but that the
resulting natural Gibbs sampler has convergence problems when
the hyperparameter a is smaller than one. To circumvent this problem,
Kamary et al. (2014) propose to use a Metropolis–Hastings
algorithm instead and illustrate its use by examples followed by
a proof that shows that the method is asymptotically consistent.
Thus, the work by Kamary et al. (2014) impressively introduces
an alternative view on testing, an algorithmic resolution, and a
theoretical justification.
1.3.1. Testing versus estimation
We believe that the Kamary et al. (2014) mixture approach will
be especially useful in psychological research. In particular, consider a hierarchical model where each participant’s performance xj
on a psychological task is captured by a particular model or strategy represented by fi . The posterior for α then gives an indication of
the prevalence of the model or strategy. When the posterior for α
is near zero or near one, this suggests that one model or strategy is
dominant; when the posterior for α is near 1/2, this suggests that
some participants are better described by one strategy, and some
are better described by another (for similar approaches see Friston
& Penny, 2011; Lee, Lodewyckx, & Wagenmakers, 2015).
The advantage of the mixture model approach is particularly
acute when it is reasonable to assume that not all participants will
follow one or the other strategy. In this special issue for Journal
of Mathematical Psychology alone, the articles by Kary, Taylor,
and Donkin (2016) and Turner, Sederberg, and McClelland (2016)
demonstrate considerable heterogeneity among participants: the
behavior of some participants is predicted much better by one
model, the behavior of other participants is predicted much better
by the competing model, and the behavior of a third set of
participants is predicted by the models about equally well (see
also Steingroever, Wetzels, & Wagenmakers, in press).
The standard Bayes factor test determines whether all
participants are better predicted by model M0 or whether all
participants are better predicted by model M1 . Therefore, one can
construct situations in which the data support model M0 for 99
out of 100 participants, and nevertheless the Bayes factor strongly
prefers model M1 . We believe that in these hierarchical scenarios,
the mixture model approach is a valuable technique that can offer
additional insight.
The above considerations suggest that the mixture approach
relaxes Jeffreys’s conceptualization of a hypothesis test. More
precisely, Jeffreys viewed the null hypothesis as a general law,
which by definition implies that the membership variables zj are
either all zeros or all ones. Note that by embedding the models into
an artificial encompassing model, Kamary et al. (2014) transformed
the testing problem into one of estimation. Jeffreys, however, did
not feel that estimation is appropriate when the test of a general
law is at hand:
‘‘Broad used Laplace’s theory of sampling, which supposes that
if we have a population of n members, r of which may have a
property ϕ , and we do not know r, the prior probability of any
particular value of r(0 to n) is 1/(n + 1). Broad showed that on
this assessment, if we take a sample of number m and find all of
them with ϕ , the posterior probability that all n are ϕ ’s is (m +
1)/(n+1). A general rule would never acquire a high probability
until nearly the whole of the class had been sampled. We could
never be reasonably sure that apple trees would always bear
apples (if anything). The result is preposterous, and started the
work of Wrinch and myself in 1919–1923’’. (Jeffreys, 1980, p.
452).
Wrinch and Jeffreys (1919, 1921, 1923) argued that within an
estimation framework, a general law such as H0 : ‘‘All swans are
white’’ cannot gain much evidence until almost all swans have
been inspected.1 Moreover, common sense prescribes that the
plausibility of a general law increases with every observation in
accordance with the law, that is, s = n number of successes within
n trials. Jeffreys (1961, p. 256) operationalized the general law as a
binomial model M0 with θ0 fixed and its negation as the binomial
model M1 with a θ free to vary. With a uniform prior on θ this then leads to a Bayes factor of

$$BF_{01}(d) = \frac{(n+1)!}{s!\,f!}\,\theta_0^{\,s}(1-\theta_0)^{\,f},$$

where n denotes the total number of trials, s the number of successes, and f the number of failures.
When only successes are observed (i.e., observations consistent
with the general law H0 : θ0 = 1), the Bayes factor simplifies
to n + 1; a single failure, on the other hand, indicates infinite
evidence against the general law: the observation of a single black
swan is interocular, as it conclusively falsifies the general law
‘‘all swans are white’’. Hence, Jeffreys’s Bayes factor formalizes
inductive reasoning and the logic of proof by contradiction.
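Jeffreys's binomial Bayes factor is easy to verify numerically; a short sketch (the function name is ours):

```python
from math import factorial

def bf01(s, f, theta0=1.0):
    # BF01(d) = (n + 1)! / (s! f!) * theta0^s * (1 - theta0)^f, with n = s + f.
    n = s + f
    return factorial(n + 1) / (factorial(s) * factorial(f)) * theta0 ** s * (1 - theta0) ** f

print(bf01(s=10, f=0))  # 11.0: ten white swans give BF01 = n + 1
print(bf01(s=10, f=1))  # 0.0: a single black swan conclusively falsifies the law
```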
The discussion above indicates that the mixture model
approach does not formalize inductive reasoning and the logic of
proof by contradiction: after having observed 10,000 white swans,
the observation of a single black swan will not greatly affect the
mixture proportion—the mixture proportion still reflects the fact
that there is a great preponderance of white swans. However, in
Jeffreys’s conceptualization, the single exception utterly destroys
the general law.
Another concern with the mixture model approach is that it
is relatively insensitive to the shape of the prior distributions.
Of course, this is also its strength, as this is needed to avoid
the JLB paradox. However, models that make correct predictions
should receive more reward when these predictions are risky,
and the degree of risk is partly encoded in the shape of the
prior distributions. For instance, suppose we model a binomial
parameter θ and assume that M1 : θ ∼ U [1/2, 1] and M2 : θ ∼
U [0, 1]; further, suppose the observed data are highly consistent
with the simpler model M1 . Because the predictions from M1 are
twice as risky as those from M2 we would want to prefer M1
over M2 , and in fact, the Bayes factor in favor of M1 against M2
is asymptotically equal to 2 (e.g., Heck, Wagenmakers, & Morey,
2015; Shiffrin, Chandramouli, & Grünwald, 2016).
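This asymptotic factor of two can be checked with a crude numerical integration; the sketch below (the counts and grid size are our own arbitrary choices) integrates the binomial likelihood against each uniform prior:

```python
def binom_lik(theta, s, f):
    return theta ** s * (1 - theta) ** f

def marginal(s, f, lo, hi, grid=20_000):
    # Midpoint-rule integral of the likelihood against a uniform prior on [lo, hi].
    width = hi - lo
    step = width / grid
    total = sum(binom_lik(lo + (i + 0.5) * step, s, f) for i in range(grid))
    return total * step / width  # the uniform prior density is 1 / width

s, f = 80, 20  # data strongly consistent with theta > 1/2
bf12 = marginal(s, f, 0.5, 1.0) / marginal(s, f, 0.0, 1.0)
print(bf12)    # close to 2: M1's predictions are twice as risky as M2's
```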
1.4. Conclusion
Scientific learning involves more than just testing general laws
and invariances. Estimation and exploration are important and the
mixture approach has a lot to offer in this respect, particularly in
hierarchical settings where the general law is unlikely to hold for
all participants simultaneously. Other advantages of the mixture
approach are apparent as well. For instance, Example 3.1 in Kamary
et al. (2014) compares a Poisson distribution with parameter λ to
a geometric distribution with parameter p (see Robert, 2015 for
R code). The comparison begins by relating the parameterizations
to each other by setting p = (1 + λ)−1 , which allows the
use of the improper Jeffreys’s prior (with respect to the Poisson
distribution) π (λ) ∝ λ−1 over the two models. Note how
this procedure resembles Jeffreys’s recommendation for common
parameters even though the arguments differ. Moreover, the
resulting posterior π (λ | d) is then calculated from the mixture
of the likelihoods of both models. The simulations show that the
mixture approach performs well. We do not know how a Jeffreys’s
Bayes factor can be constructed to deal with a test between two
models of different relational forms as Jeffreys was only concerned
with nested model comparisons (e.g., Robert, 2016).
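To give a feel for the mixture approach, here is a minimal grid-based sketch of our own (Kamary et al. use MCMC; the simulation settings and grids are our assumptions). It links the parameters via p = (1 + λ)−1, places the improper Jeffreys's prior π(λ) ∝ λ−1 on the shared λ, and approximates the posterior of the mixture weight α, the weight of the Poisson component, from simulated Poisson data.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=50)  # simulated data, truly Poisson

lams = np.linspace(0.1, 15.0, 300)    # grid for the shared parameter lambda
alphas = np.linspace(0.01, 0.99, 99)  # grid for the mixture weight alpha

# per-observation component log densities on the lambda grid, shape (n, L)
log_fact = np.array([math.lgamma(int(k) + 1) for k in x])
lp_pois = np.outer(x, np.log(lams)) - lams[None, :] - log_fact[:, None]
p = 1.0 / (1.0 + lams)                # the link p = (1 + lambda)^(-1)
lp_geom = np.log(p)[None, :] + np.outer(x, np.log(1.0 - p))

# mixture log-likelihood at every (alpha, lambda) grid point
ll = np.empty((alphas.size, lams.size))
for i, a in enumerate(alphas):
    ll[i] = np.log(a * np.exp(lp_pois) + (1.0 - a) * np.exp(lp_geom)).sum(axis=0)

# improper Jeffreys's prior pi(lambda) ∝ 1/lambda on the shared parameter,
# uniform prior on the weight alpha; normalize over the grid
log_post = ll - np.log(lams)[None, :]
post = np.exp(log_post - log_post.max())
post /= post.sum()

alpha_mean = float((post.sum(axis=1) * alphas).sum())  # posterior mean of alpha
```

With data that are truly Poisson, the posterior mass of α piles up near 1, which is how the mixture approach expresses support for the Poisson component.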
The mixture approach is not fully automatic, however, and requires some thought on how the priors should be chosen. In particular, one cannot naively use improper priors on the test-relevant parameters, as this may yield posteriors that are also improper (Grazian & Robert, 2015). This was acknowledged by Robert (2016), who used an (arbitrary) standard normal prior on µ in a t-test. Our implementation of this recommendation leads to a posterior median ranging from 0.3 to 0.9, for the interocular data with n = 4, x̄ = 7 and s2 = 0, while α should be 1.0 if it were information consistent. More recently, Kamary, Eun, and Robert (2016) proposed a noninformative reparametrization for location-scale mixtures to resolve the aforementioned arbitrariness. Hence, as with a Jeffreys's Bayes factor, one should choose the priors carefully when one conceptualizes model selection as parameter estimation in a mixture model.
Lastly, Robert notes that the mixture approach is superior to the Bayes factor as it leads to a faster accumulation of α to the null. The parametric convergence rate of √n follows immediately from casting the testing problem as one of estimation. Similarly, it should be noted that Johnson and Rossell (2010) also use the rate of convergence as a motivation for their Bayes factor approach. We are unsure whether this rate is relevant as we do not consider a testing problem as one of estimation. In the end the Bayes factor and the mixture approach of Kamary et al. (2014) simply answer different questions. The choice of which method to use should not be based on the rate of convergence, but on the research question the user seeks to address.2
2. Rejoinder to Chandramouli and Shiffrin
Chandramouli and Shiffrin (2016) put forward a thought-provoking proposal which aims to explain and extend Bayesian
induction using simple matrix algebra. We have given this novel
idea considerable thought and outline some of our reservations
below.3
We believe that Chandramouli and Shiffrin (henceforth C&S)
put forward a belief propagation procedure that allows us to verify
1 We now know that this particular statement does not hold true, since Australia is home to many black swans. The statement itself however cannot be discarded until the first exception is actually observed.
2 We thank Joris Mulder for alerting us to this.
3 The second and third authors are in a state of perpetual confusion regarding the details of the Chandramouli and Shiffrin proposal. All credit concerning this section goes to the first author, who, as such, takes full responsibility for any errors here. For a thorough understanding of our reply, we recommend to have the comment of Chandramouli and Shiffrin (2016) on hand.
A. Ly et al. / Journal of Mathematical Psychology 72 (2016) 43–55
whether two given models, say, M1 and M2 align with a scientist’s
prior belief about the true data generating process p∗ (X ). Instead of
setting the priors onto the two given models M1 and M2 directly,
C&S recommend to first elicit a scientist’s prior belief about the
true data generating p∗ (X ) in the most general setting. This prior
belief is then subsequently translated into priors on the models.
Hence, the resulting prior model probabilities P (M1 ) and P (M2 )
are derived.
In contrast, a Jeffreys’s Bayes factor follows from a top-tobottom procedure, where the top level is concerned with the
comparison between two models (i.e., model classes) for which one
has to (subjectively) choose prior model probabilities P (Mi ). Based
on top level desiderata, i.e., a coherent comparison between the
two models, we then derive the pair of priors π1 and π2 on the
lower level that are concerned with the parameters (i.e., model
instances) within the models M1 and M2 respectively. In effect,
the sole purpose of the pair π1 , π2 is to mediate scientific learning
through the Bayes factor, that is, to update the prior model odds to
posterior model odds.
On the other hand, the C&S induction scheme is a bottom-up
approach based on the philosophy that the whole is the sum of its
parts. At the lowest level, one has to subjectively elicit the scientist’s
prior belief about the true data generating process. The procedure
then elaborates on how this lowest level belief can be used to derive
the model instance priors π1 and π2 at the intermediate level. By
aggregating the model instance priors of π1 and π2 we then get
the prior model probabilities P (M1 ) and P (M2 ) at the top level. As
such, this method is not free from subjective input on the lowest
level.
Our major concern with the C&S method is the lack of invariance, which stems from their recommendation to operationalize their procedure with a seemingly innocent-looking finite-dimensional matrix with M rows and W number of columns.4 By
using a finite-dimensional matrix, C&S basically made a choice in
how they tackle the statistical modeling problem. The resulting
model priors P (Mi ) are sensitive to this choice. More specifically,
by initializing their procedure with a finite-dimensional matrix,
they use discretized approximations of quantities that are essentially continuous. The approximation error due to discretization is
non-negligible, as it permeates through all subsequent steps due
to the bottom-up nature leading to an ill-defined Bayes factor.
In brief, we believe that the C&S approach has to overcome some
challenges before their procedure can be perceived as an extension
of a traditional Bayes factors, let alone Jeffreys’s Bayes factors. We
have three remarks: (1) The C&S procedure is not invariant to how
one discretizes the statistical modeling problem; (2) the subjective
assessment of the priors on the lowest level and the resulting
prior model probabilities P (Mi ) on the top level are, therefore, ill-defined, and (3) model selection based on posterior predictive p-statistics does not lead to a proper measure of evidence.
This paper continues as follows: We first apply the C&S
induction scheme to a concrete example. Then we show that we
get different results when we choose a different finite-dimensional
matrix to operationalize the C&S induction scheme. Lastly, we
argue that the implicit discretization necessary for the finite-dimensional matrix is the main culprit of the resulting lack of
invariance.
2.1. Running example
4 We deviate from the C&S notation, where the matrix is M × N dimensional, as the number of columns does not correspond with the number of samples in a data set. Instead, the number of columns refers to the number of possible outcomes a random variable can take on; we use w and W instead.
5 C&S call the function p(X | 0.9, M1 ) a data distribution predicted by the model instance ϑ = 0.9. When we use a capital X we mean the three probabilities simultaneously. On the other hand, a small letter x refers to the probability with which it is generated, for instance, p(xw | 0.9, M1 ) = 0.18 when w = 2.
To illustrate why we believe that the C&S method is essentially
a belief propagation procedure, we consider a random variable
X with a finite number of outcomes W . This W is denoted as n
in Chandramouli and Shiffrin (2016) and defines the number of
columns in their matrix representations (i.e., their Figures 1 and
2). To simplify matters, we use an example (taken from Ly et al.,
2015) where X has W = 3 number of outcomes.
Example 1 (A Psychological Task with Three Outcomes). In the
training phase of a source-memory task, the participant is
presented with two lists of words on a computer screen. List L
is projected on the left-hand side and list R is projected on the
right-hand side. In the test phase, the participant is then presented
with two words, side by side, that can stem from either list, thus,
ll, lr , rl, rr. At each trial, the participant is asked to categorize these
pairs as either:
• x1 meaning both words come from the left list, i.e., ‘‘ll’’,
• x2 meaning the words are mixed, i.e., ‘‘lr’’ or ‘‘rl’’,
• x3 meaning both words come from the right list, i.e., ‘‘rr’’.
Thus, the random variable X has W = 3 outcomes. To ease the
discussion, we assume that the words presented to the participant
are ‘‘rr’’.
As model M1 we take the so-called individual-word strategy. A
participant guided by this strategy will consider each word individually and compare it with list R only. Within this model M1 ,
the parameter is given by θ1 = ϑ , which we interpret as the
participant’s ‘‘right-list recognition ability’’. Hence, when the participant is presented with the pair ‘‘rr’’ she will respond x1 with
probability (1 − ϑ)2 , thus, two failed recollections; x2 with probability 2ϑ(1 − ϑ), thus, one failed and one successful recollection;
x3 with probability ϑ 2 , thus, two successful recollections.
More compactly, a participant guided by this strategy generates the outcomes [x1 , x2 , x3 ] with the following three probabilities p(X | ϑ, M1 ) = [(1 − ϑ)2 , 2ϑ(1 − ϑ), ϑ 2 ], respectively. Note
the data generative formulation. For instance, when the participant’s true ability is ϑ ∗ = 0.9, the three outcomes [x1 , x2 , x3 ]
are then generated with the three probabilities p(X | 0.9, M1 ) =
[0.01, 0.18, 0.81] respectively. We call the function p(X | θi , Mi )
with θi fixed a probability mass function (pmf) or model instance
of Mi .5 Hence, every ϑ in (0, 1) yields a pmf that defines W number
of probabilities. In effect, the model M1 consists of a collection of
pmfs, which C&S refer to as a model class.
As a competing model M2 , we take the so-called only-mixed
strategy. Within this model M2 , the parameter is given by θ2 = a,
which we interpret as the participant’s ‘‘mixed-list differentiability
ability’’. With probability a the participant first checks whether the
presented pair of words is mixed. If she perceives it as mixed, she
then produces the outcome x2 with probability a. If she does not
perceive the pair of words as mixed, the participant then randomly
chooses x1 or x3 each with probability (1 − a)/2.
More compactly, a participant guided by this strategy generates the outcomes [x1 , x2 , x3 ] with the following three probabilities p(X | a, M2 ) = [(1 − a)/2, a, (1 − a)/2], respectively. Again
we formulated the model as a data generative process. For instance, when the participant’s true ability is a∗ = 1/3, the three
outcomes [x1 , x2 , x3 ] are then generated with the same probability, i.e., p(X | 1/3, M2 ) = [1/3, 1/3, 1/3]. Note that this last pmf
p(X | 1/3, M2 ) is not in the collection of pmfs defined by M1 . Similarly, the pmf p(X | 0.9, M1 ) is not a member of the collection of
pmfs defined by M2 .
The two models M1 and M2 share only one pmf (model
instance), that is, the pmf indexed by ϑ = 0.5 within M1 and,
coincidentally, when a = 0.5 within M2 . We use these two models
M1 and M2 to explain the C&S belief propagation procedure.
Table 1
The matrix is a simplified version of the matrix found in Figure 1 of C&S with M = 66
and W = 3. The quantities under the columns with G(ψm , M1 ) and G(ψm , M2 )
at the top refer to the KL-divergences, see the main text. The parameter under θi
refers to the model instance that the pmf p(X | ψm ) is allocated to within the model
under Mi . For example, the candidate true pmf p(X | ψ18 ) is allocated to the model
instance p(X | ϑ = 0.60, M1 ) of model class M1 .
2.2. Chandramouli and Shiffrin’s procedure for induction
For a Bayesian analysis we need priors on the model instances,
which we denote by πi (θi ) as we have done before,6 and the priors
on the models P (Mi ). Instead of doing so directly, C&S recommend
to first (Step 1) elicit the scientist’s prior belief about the true data
generating process p∗ (X ) in the most general setting. Next (Step 2)
this subjectively chosen prior belief about p∗ (X ) is used to derive
the model instance priors πi (θi ) and, subsequently, the model class
priors P (Mi ). Lastly, (Step 3) C&S recommend to use posterior
p-statistics for inference.
2.2.1. Step 1: Eliciting the prior on candidate true data generating pmfs
In our example, the true data generating pmf p∗ (X ) defines
three probabilities p∗ (X ) = [p∗ (x1 ), p∗ (x2 ), p∗ (x3 )] with which
it generates the three outcomes [x1 , x2 , x3 ]. For instance, a
first candidate true data generating pmf could be p(X | ψ1 ) =
[0.0, 0.0, 1.0], where ψ1 is an indicator for later reference. A
second candidate true data generating pmf could be p(X | ψ2 ) =
[0.0, 0.1, 0.9] and so forth and so on. This method yields a
candidate set of true pmfs that we depicted in Table 1. The
‘‘matrix’’ depicted in Table 1 is a simplification of the table in
Figure 1 in Chandramouli and Shiffrin (2016) with M = 66 rows
and W = 3 columns. Please ignore the quantities to the right of
the double vertical line for the moment. Note that the number of
rows M = 66 is a result of our arbitrary choice of using a step size of
0.1 on the probabilities. Furthermore,recall that the pmfs p(X | ψm )
are candidates for the true data generating pmf p∗ (X ) and may not
have any connection with the models M1 and M2 specified above.
Of particular interest is the pmf p(X | ψ62 ) = [0.8, 0.1, 0.1], which
is neither a member of M1 nor of M2 ,7 but because it defines a valid
pmf it is, nonetheless, a candidate true data generating pmf.
Given this finite-dimensional matrix of Table 1, C&S then
recommend to elicit a scientist’s prior belief by setting prior beliefs
λ(ψm ) for m = 1, . . . , M, thus, on each candidate true data
generating pmf p(X | ψm ).8 For example, λ(ψ62 ) = 0.7 means that
the scientist bestows a large portion of belief to the pmf indexed by
ψ62 as being the true generating pmf p∗ (X ). Furthermore, λ(ψ61 )+
λ(ψ62 ) + λ(ψ63 ) = 0.90 means that the scientist is quite sure that
the participant will generate the response x1 with 80% chance. As λ represents the scientist's prior belief, we necessarily require that Σ_{m=1}^{M} λ(ψm ) = 1.
6 C&S denote this by p0 (θi ). Instead, we use the Greek letter πi to distinguish this model instance prior from the prior model probability P (Mi ) on the top level. The subscript i refers to the model membership.
7 A pmf of M1 with p(x1 | ϑ, M1 ) = 0.8 requires ϑ ≈ 0.11. However, this automatically yields p(x2 | 0.11, M1 ) = 0.19. Hence, there is no ϑ in M1 that leads to the pmf indexed by ψ62 . Similarly, a pmf of M2 necessarily has p(x1 | a, M2 ) = p(x3 | a, M2 ), which is clearly not the case for the pmf indexed by ψ62 .
8 C&S denote this prior pmf probability as p0 (ψm ). Instead, we use the Greek letter λ to distinguish this prior pmf probability on the lowest level from the model instance prior πi (θi ) on the intermediate level and the prior model probabilities P (Mi ) on the top level.
2.2.2. Step 2: Propagating the prior belief to yield the prior model
probabilities
Once the prior beliefs λ(ψm ) about the true data generating
p∗ (X ) are chosen, C&S commence their belief propagation procedure by redistributing λ(ψm ) over the two models. Recall that a
model (class) Mi defines a collection of pmfs (model instances) denoted as p(X | θi , Mi ). The allocation of the prior pmf belief of the
first candidate true pmf in Table 1, that is, λ(ψ1 ), is easy, because
the associated pmf p(X | ψ1 ) = [0.0, 0.0, 1.0] does not belong to
M2 , but it is a member of M1 ; the pmf indexed by ψ1 is a model
instance of M1 when θ1 = ϑ = 1. C&S therefore allocate the prior
pmf probability λ(ψ1 ) to the model instance π1 (ϑ = 1) of M1 . On
the other hand, the pmf p(X | ψ62 ) = [0.8, 0.1, 0.1] is neither a
member of M2 nor does it belong to M1 . To nonetheless allocate
this prior pmf belief λ(ψ62 ) to a model instance of either M1 or
M2 , C&S use a divergence measure denoted by G. For simplicity we
take as G the Kullback–Leibler (KL) divergence, which is a measure
of dissimilarity. The KL-divergence from a candidate true pmf indexed by ψm to a model instance of Mi is defined as
G(ψm , θi | Mi ) = Σ_{w=1}^{W} p(xw | ψm ) log[ p(xw | ψm ) / p(xw | θi , Mi ) ],   (6)
and the larger this divergence, the more dissimilar the model instance p(X | θi , Mi ) is from the candidate true data generating pmf p(X | ψm ). For example, a direct calculation shows that the KL-divergence from the candidate true p(X | ψ1 ) in Table 1 to the model instance of M1 with θ1 = ϑ = 1.0 is given by G(ψ1 , θ1 = 1.0 | M1 ) = 0. The KL-divergence is zero if and only if the pmf indexed by ψm and the model instance indexed by θi are exactly the same, hence, their dissimilarity is zero.
The KL-divergence between the candidate true p(X | ψm ) and
a collection of pmfs defined by the model Mi is given by
G(ψm , Mi ) = min_{θi} G(ψm , θi | Mi ). That is, the dissimilarity between the candidate true data generating pmf ψm and the model
Mi is the smallest dissimilarity between p(X | ψm ) and the model
instances p(X | θi , Mi ) of model Mi . For example, a direct calculation shows that the KL-divergence from the candidate true
data generating pmf p(X | ψ62 ) to M1 is given by G(ψ62 , M1 ) =
G(ψ62 , θ1 = 0.15 | M1 ) = 0.137. Similarly, the KL-divergence between the same candidate true data generating pmf to M2 is given
by G(ψ62 , M2 ) = G(ψ62 , θ2 = 0.1 | M2 ) = 0.310. Because the divergence from the candidate true pmf p(X | ψ62 ) to M1 is smaller
than the divergence to M2 , the C&S procedure implies that we
should allocate the prior pmf probability λ(ψ62 ) to the prior model
instance probability π1 (ϑ = 0.15) belonging to M1 .
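The divergences just reported are easy to reproduce. The following sketch is our own, with a grid minimization standing in for an exact minimizer; it recovers G(ψ62 , M1 ) = 0.137 at ϑ = 0.15 and G(ψ62 , M2 ) = 0.310 at a = 0.1.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between two pmfs, Eq. (6) with theta fixed."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0  # terms with p(x_w) = 0 contribute nothing
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

target = [0.8, 0.1, 0.1]  # the candidate true pmf p(X | psi_62)
grid = np.linspace(0.001, 0.999, 999)

m1 = lambda t: [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]  # individual-word strategy
m2 = lambda a: [(1 - a) / 2, a, (1 - a) / 2]            # only-mixed strategy

g1, theta1 = min((kl(target, m1(t)), t) for t in grid)  # G(psi_62, M1) ≈ 0.137 at theta ≈ 0.15
g2, a2 = min((kl(target, m2(a)), a) for a in grid)      # G(psi_62, M2) ≈ 0.310 at a ≈ 0.10
```

Because G(ψ62 , M1 ) < G(ψ62 , M2 ), the belief λ(ψ62 ) is allocated to π1 (ϑ = 0.15), exactly as described above.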
We suspect that the underlying idea of this belief allocation procedure is based on chaining. Thus, if λ(ψ62 ) = 0.70, the scientist has much faith in p(X | ψ62 ) being the true data
generating pmf. However, as p(X | ψ62 ) is not in the model M1 nor
in M2 , the C&S procedure then recommends to go for the next best
thing; assigning the pmf prior λ(ψ62 ) to the model instance that is
most similar to p(X | ψ62 ), in this case, π1 (ϑ) with ϑ = 0.15.
This redistribution of the pmf prior λ(ψm ) to model instance
priors can be read from their table in Figure 1 in Chandramouli and
Shiffrin (2016) from left to right.9
In our Table 1 the numbers under G(ψm , M1 ) and G(ψm , M2 )
represent the KL-divergence from the candidate true pmf indexed
by ψm to the models M1 and M2 respectively. The parameter value
under θi indicates which parameter value ϑ within M1 or a within
M2 corresponds to the model instance that is closest to the pmf
of ψm . The last column indicates whether the ψm is eventually
allocated to M1 or M2 .
As in their table in Figure 1 of Chandramouli and Shiffrin (2016),
note that there are multiple candidates ψm s allocated to certain
parameter values in our Table 1. For example, the candidate pmfs
indexed by ψ3 and ψ12 are both allocated to the same model
instance indexed by ϑ = 0.90 within M1 . As such, C&S derive the prior on the model instances as π1 (ϑ) = Σ λ(ψm ), where the sum is over the candidates ψm which have the same ϑ in the column under θi . For example, π1 (ϑ = 0.90) = λ(ψ3 ) + λ(ψ12 ).
After allocating all the M number of prior pmf probability λ(ψm )
to the model instances of either model classes, we have π1 (ϑk ) and
π2 (ak̃ ) for k = 1, . . . , K and k̃ = 1, . . . , K̃ . The K indicates the
number of unique values of ϑ s in the column under θi . As there
are multiple candidates allocated to certain parameter values we
typically have K + K̃ < M. With the model instance priors at hand, the C&S scheme tells us to aggregate them to yield prior model probabilities, i.e., P (M1 ) = Σ_{k=1}^{K} π1 (ϑk ) and P (M2 ) = Σ_{k̃=1}^{K̃} π2 (ak̃ ). As a result of Σ_{m=1}^{M} λ(ψm ) = 1 we have P (M1 ) + P (M2 ) = 1.
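The full allocation-and-aggregation step can be sketched in a few lines. The code below is our own reading of the C&S scheme, not their implementation: it enumerates the 66 candidate pmfs of Table 1, gives each a uniform belief λ(ψm ) = 1/66, allocates each candidate to the model containing the KL-closest instance, and aggregates the allocated beliefs into P (M1 ) and P (M2 ).

```python
import numpy as np
from itertools import product

def kl(p, q):
    """Kullback-Leibler divergence between two pmfs (Eq. (6) with theta fixed)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# the M = 66 candidate true pmfs of Table 1 (step size 0.1)
step = [round(0.1 * k, 1) for k in range(11)]
cands = [(b, c, round(1 - b - c, 2)) for b, c in product(step, step) if b + c <= 1 + 1e-9]
assert len(cands) == 66

# grids of model instances for M1 and M2 (open interval avoids log(0))
theta = np.linspace(0.005, 0.995, 199)
M1 = [((1 - t) ** 2, 2 * t * (1 - t), t ** 2) for t in theta]
M2 = [((1 - a) / 2, a, (1 - a) / 2) for a in theta]

lam = 1.0 / len(cands)  # uniform prior belief lambda(psi_m)
P1 = P2 = 0.0
for p in cands:
    g1 = min(kl(p, q) for q in M1)  # G(psi_m, M1)
    g2 = min(kl(p, q) for q in M2)  # G(psi_m, M2)
    if g1 <= g2:
        P1 += lam                   # psi_m allocated to an instance of M1
    else:
        P2 += lam                   # psi_m allocated to an instance of M2
```

Rerunning this with a non-uniform λ(ψm ) immediately shows how strongly the resulting P (Mi ) depend on the subjectively chosen beliefs at the lowest level.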
2.2.3. Step 3: Posterior predictive p-statistics
So far, we only discussed the C&S belief propagation procedure
as a method to translate a scientist’s prior belief λ(ψ) about the
true data generating p∗ (X ) to prior beliefs on the model instances
πi (θi ), which can then be used to define prior beliefs on the models
P (Mi ). These priors can be used for inference after we observe data
d. As in C&S, we simplify the discussion by supposing that the data
consist of one observation where the participant responded with
x1 .
To invert the data generative view of pmfs, we fix the data part of each pmf at the observation p(X | ψm ) = p(d | ψm ) and consider the pmfs as a function of ψm , i.e., as a likelihood function. Bayes' rule then allows us to update the subjectively chosen pmf prior to a pmf posterior using all specified candidate likelihood functions indexed by the ψm s, that is, λ(ψm | d) = p(d | ψm )λ(ψm )/C , for m = 1, . . . , M, where the normalization constant C is given by C = Σ_{m=1}^{M} p(d | ψm )λ(ψm ). Recall that the rows p(X | ψm ), thus, the likelihood functions, themselves do not need to belong to the models M1 and M2 . In fact, most of them do not, as most of the entries under G(ψm , M1 ) and G(ψm , M2 ) are non-zero.
9 We are unsure what ϕ in their table indicates.
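This updating step can be sketched as follows (our own code; it uses the Table 1 grid, a uniform prior belief, and a single observation d = x1 ):

```python
import numpy as np
from itertools import product

# the M = 66 candidate true pmfs of Table 1 (step size 0.1)
step = [round(0.1 * k, 1) for k in range(11)]
cands = [(b, c, round(1 - b - c, 2)) for b, c in product(step, step) if b + c <= 1 + 1e-9]

lam = np.full(len(cands), 1.0 / len(cands))  # uniform prior belief lambda(psi_m)
lik = np.array([p[0] for p in cands])        # d = x1, so p(d | psi_m) = p(x1 | psi_m)
post = lik * lam / np.sum(lik * lam)         # lambda(psi_m | d) by Bayes' rule
```

The posterior simply favors candidates with a large p(x1 ), and most of its mass sits on pmfs that belong to neither M1 nor M2 .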
For inference concerning replication studies, C&S recommend
using posterior predictive p-statistics. For example, the observations dorig of the original experiment might suggest that a participant’s ‘‘right-list recognition ability’’ ϑ is a half. To test whether
this postulate ϑ = 0.5 can be reproduced, C&S recommend to
update the subjectively chosen pmf prior about the true p∗ (X ) to
a posterior yielding λ(ψm | dorig ). Recall that this posterior is also
based on likelihood functions p(d | ψm ) that do not belong to M1
as discussed above. For example, if λ(ψ62 ) > 0 then p(X | ψ62 ) =
[0.8, 0.1, 0.1] in Table 1 is used as a likelihood to relate the observations dorig to ψ62 . Because there is no ϑ that leads to p(X | ψ62 ),
see the footnote at the end of Section 2.2.1, the likelihood function
at ψ62 does not and cannot extract information about ϑ from dorig .
Nonetheless, C&S use the posterior λ(ψm | dorig ) to weight all the
candidate true pmfs in Table 1 resulting in a posterior predictive p(xw | dorig ) = Σ_{m=1}^{M} p(xw | ψm )λ(ψm | dorig ) for w = 1, . . . , W .
This posterior predictive is used as a sampling distribution, i.e., it
defines the probabilities with which new data are generated.
If the actual observation drep is very improbable under this
predictive, then the C&S procedure prescribes this as a failure of
reproducibility. The problem with this prediction is that it is also
calculated from the predictions of p(X | ψ62 ), even though this pmf
ψ62 has no connection to ϑ whatsoever.
In sum, it seems that the C&S recommendation for replication
boils down to comparing the observed data drep of a replication attempt against the posterior predictive as a sampling distribution,
which is based on irrelevant likelihood functions and subjective
belief λ(ψm ). Moreover, by using the posterior predictive as a
sampling distribution to assess replication, this method shares
many pitfalls with common p-value tests and therefore does not
quantify evidence (e.g., Bayarri & Berger, 2000; Wagenmakers,
2007).
2.2.4. C&S Bayes factors
Although C&S do not recommend to use Bayes factors for
inference, they note that Bayes factors can be constructed from
their belief propagation procedure. The main idea is to reuse the
belief propagation procedure, but this time to redistribute the
posterior beliefs λ(ψm | d) about the true data generating p∗ (X )
to posterior beliefs for the ‘‘model instances’’ πi (θ̂i | d), which can
then be used to define posterior beliefs on the ‘‘models’’ P (M̂i | d).
We are reluctant to call P (M̂i | d) the posterior model probabilities,
because they are calculated using likelihood functions that do not
belong to Mi (hence, the hats in our notation). There are now two
ways to derive a Bayes factor based on the quantities resulting from
the C&S belief propagation procedure.
The first method involves the ratio of the posterior and prior
model odds, that is,
BF̂12 (d) = [P (M̂1 | d)/P (M̂2 | d)] / [P (M1 )/P (M2 )].   (7)
This Bayes factor depends on the subjectively chosen prior beliefs
λ(ψm ) about p∗ (X ), the chosen divergence measure G, and – most
troublesome – on the collection of candidate likelihood functions
p(d | ψm ) rather than on the likelihoods that belong to the respective models.
The second method involves the ratio of marginal likelihoods,
that is,
BF̃12 (d) = [Σ_{k=1}^{K} p(d | ϑk , M1 )π1 (ϑk )] / [Σ_{k̃=1}^{K̃} p(d | ak̃ , M2 )π2 (ak̃ )].   (8)
In contrast to BF̂12 (d), this Bayes factor is calculated from the likelihoods p(d | θi , Mi ) that actually do belong to the respective models. Hence, BF̂12 (d) and BF̃12 (d) will differ from each other.
We have some reservations about the Bayes factor as defined
in Eq. (7) or Eq. (8) as a generalization of traditional Bayes factors.
First, a traditional Bayes factor leads to the same quantity whether
it is computed as the ratio of the posterior and prior model
odds or as the ratio of marginal likelihoods. Second, a traditional
Bayes factor would involve continuous integrals, whenever the
parameters ϑ and a are free to vary in continuous intervals. The
replacement of the integrals by finite sums is an artifact of only
considering a finite number M of candidate true pmfs p(X | ψm ).
2.3. Lack of invariance
Our major concern with Bayes factors calculated from the
C&S approach, however, is rooted in its operationalization using
a finite-dimensional matrix (e.g., Table 1), as it causes a lack
of invariance affecting every step of their belief propagation
procedure. As such, two scientists with the same subjective belief
λ(ψ) about the true p∗ (X ) using the same divergence measure
G, but with a different finite-dimensional matrix will calculate
different Bayes factors.
We appreciate the attempt by C&S to assess how well models
represent the true data generating process. Their procedure
considers all possible data generating pmfs and as such can
account for model misspecification. Although attractive, such an
unrestrictive view leads to complications when one is concerned
with testing models for which one has to set priors. The C&S
recommendation is to do so subjectively, which we consider nigh
impossible. More specifically, the collection of all possible data
generative pmfs P is typically hard to describe and without a
proper description even harder to subjectively assign prior beliefs to. Our paper continues as follows: (1) We first characterize P and simplify it with a parameterization; (2) a different parameterization of P is then given leading to a different finite-dimensional matrix. (3) In effect, this leads to different prior beliefs
and (4) different allocations, thus, different Bayes factors. (5) Lastly,
we remark how this problem is related to the invariance problem
already solved by Jeffreys (1946) and what his solution implies for
the C&S procedure.
2.3.1. Characterizing the collection of all possible pmfs
When X has W = 3 number of outcomes, its true distribution p∗ (X ) can then be characterized by W − 1 = 2 parameters. Recall that a pmf for X then defines the three chances
p(X ) = [p(x1 ), p(x2 ), p(x3 )] with which it generates the outcomes
[x1 , x2 , x3 ]. The pmf must therefore satisfy two conditions: (i) it has
to be non-negative and bounded by one, i.e., 0 ≤ p(xw ) ≤ 1 for
each outcome xw of X with w = 1, . . . , W , and (ii) the probabilities must sum to one, i.e., Σ_{w=1}^{W} p(xw ) = 1. Note that this holds
true for any candidate true pmf p(X | ψm ) in Table 1. We call the
collection of functions for which the conditions (i) and (ii) hold the
collection of all possible pmfs or the full model and denote it by P .
The collection P has an uncountably infinite number of members,
each capable of being the true data generating process p∗ (X ). By
using a finite-dimensional matrix such as the one in Table 1, C&S,
thus, restrict their prior belief elicitation to only M = 66 candidate
true pmfs.
To show that even for W = 3 the full model P is uncountable,
we first parameterize P , that is, we identify each possible true pmf
of P with a two dimensional parameter ψ = (b, c ). Given any pmf
p(x) = [p(x1 ), p(x2 ), p(x3 )], we define b = p(x1 ), c = p(x2 ) and set
ψ = (b, c ). This construction is essentially a function ξ that maps a
member of the full model P into a parameter space Ψ of dimension
W − 1 = 2. Using the inverse parameterization ξ −1 we can identify
every parameter ψ = (b, c ), where (i’) 0 ≤ b, c ≤ 1 and (ii’)
b + c ≤ 1, with a pmf such that the three outcomes [x1 , x2 , x3 ]
are generated with the probabilities p(X | ψ) = [b, c , 1 − b − c ].
As there are an uncountable number of ψ = (b, c )s for which the
conditions (i’) and (ii’) hold, we conclude that there are also an
uncountable number of pmfs p(X | ψ) in the full model P for which
(i) and (ii) hold.
2.3.2. Different parameterizations, different representation of P : A
different set of candidate true pmfs
The aforementioned parameterization ξ : P → Ψ relates
to the candidate true pmfs of Table 1 as we have actually chosen
ψ1 = (0.0, 0.0), ψ2 = (0.0, 0.1), . . . , ψ62 = (0.8, 0.1), ψ63 =
(0.8, 0.2), ψ64 = (0.9, 0.0), ψ65 = (0.9, 0.1), ψ66 = (1.0, 0.0).
The resulting M = 66 number of rows is due to the dependence
between b and c.
A different parameterization ξ̃ from the full model P to a
parameter space Ψ̃ is based on a ‘‘stick-breaking’’ approach. Given
a p(X ) we then choose b̃ = p(x1 ), c̃ = p(x2 )/[1 − p(x1 )] and define
ψ̃ = (b̃, c̃ ).10 Using the inverse parameterization ξ̃ −1 we can also
′
′
identify every parameter ψ̃ = (b̃, c̃ ), where (i + ii ) 0 ≤ b̃, c̃ ≤ 1,
with a pmf such that the three outcomes [x1 , x2 , x3 ] are generated
with the probabilities p(X | ψ) = [b̃, (1 − b̃)c̃ , (1 − b̃)(1 − c̃ )].
Note that every parameter ψ̃ lies within the unit square Ψ̃ =
[0, 1] × [0, 1], and that b̃ and c̃ can be chosen independently from
each other. Again, as there are an uncountable number of elements
in the unit square, we have an uncountable collection of candidate
true pmfs P . With this stick-breaking representation of P and a
step size of 0.1 we get the matrix depicted in Table 2.
This new matrix differs substantially from the previous one.
First, it has more rows, thus, a larger number of candidate true
pmfs; M = 111 compared to M = 66 in Table 1. Second, there
are more candidate pmfs that imply that the first response x1 is
generated with 80% chance; eleven in Table 2 compared to three
in Table 1.
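The dependence on the chosen discretization is easy to demonstrate. The sketch below (our own) enumerates both grids with step size 0.1 and reproduces the counts above: M = 66 versus M = 111 candidate pmfs, and three versus eleven candidates that generate x1 with 80% chance.

```python
from itertools import product

step = [round(0.1 * k, 1) for k in range(11)]

# Table 1 grid: direct parameterization psi = (b, c) with b + c <= 1
grid1 = {(b, c, round(1 - b - c, 2)) for b, c in product(step, step) if b + c <= 1 + 1e-9}

# Table 2 grid: stick-breaking psi~ = (b~, c~), pmf = [b~, (1-b~)c~, (1-b~)(1-c~)]
grid2 = {(b, round((1 - b) * c, 2), round((1 - b) * (1 - c), 2))
         for b, c in product(step, step)}

counts = (len(grid1), len(grid2))          # (66, 111) candidate pmfs
n1 = sum(1 for p in grid1 if p[0] == 0.8)  # 3 candidates with p(x1) = 0.8
n2 = sum(1 for p in grid2 if p[0] == 0.8)  # 11 candidates with p(x1) = 0.8
```

The two grids are equally legitimate discretizations of the same full model P, yet they present the scientist with different candidate sets over which to spread her prior belief.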
2.3.3. Different representation, different prior beliefs
Expanding on these observations, we suspect that a scientist
would subjectively set different prior beliefs depending on
whether she is confronted with the matrix of Table 1 or with
the matrix of Table 2. In particular, when confronted with the
matrix of Table 1 the scientist might subjectively set λ(ψ61 ) =
λ(ψ63 ) = 0.1 and λ(ψ62 ) = 0.7, meaning that she is quite sure that the participant will generate the response x1 with 80% chance,
i.e., P (p∗ (x1 ) = 0.80) = 0.9. To cohere to this belief the scientist
would simply set λ(ψ̃89 ) = λ(ψ̃99 ) = 0.1 and λ(ψ̃94 ) = 0.7 and,
subsequently, set the prior belief of all the ‘‘in-between’’ pmfs that
generate x1 with 80% to zero in Table 2. We highly doubt that any
scientist would be so specific in formulating her prior beliefs and,
thus, doubt that a subjective assessment of the prior beliefs will
work here.
10 This only works if p(x1 ) ̸= 1. When p(x1 ) = 1, we simply set c̃ = 0 and define ψ̃ = (1, 0).
Table 2
The matrix is a simplified version of the matrix found in Figure 1 of C&S based on the different parameterization ξ̃ defined in the text. Note how the pmf p(X | ψ̃19 ) is allocated
to M2 .
As an alternative, we might think that we are noninformative if
we give each candidate true pmf the same prior probability. This
means that each candidate true pmf of Table 1 receives a
prior probability of λ(ψm ) = 1/66 ≈ 0.0152. The pmfs under which
the participant generates the response x1 with 80% chance then
receive a total prior probability of 3/66 ≈ 0.0455. In Table 2, on
the other hand, a uniform prior yields λ(ψ̃m ) = 1/111 ≈ 0.009,
and the pmfs under which the participant generates the response x1
with 80% chance then receive a total prior probability of 11/111 ≈
0.099. Hence, a different set of candidate true pmfs leads to a
different assessment of prior beliefs. This lack of invariance depends
on how many and which candidate true pmfs are chosen from P in
constructing the finite-dimensional matrices of Tables 1 and 2.
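Under the same assumed grid representations (our reading of the parameterizations behind Tables 1 and 2), the mismatch between the two uniform-prior masses can be computed exactly:

```python
from fractions import Fraction

grid = [Fraction(i, 10) for i in range(11)]  # step size 0.1
t1 = {(b, c, 1 - b - c) for b in grid for c in grid if b + c <= 1}      # Table 1
t2 = {(b, (1 - b) * c, (1 - b) * (1 - c)) for b in grid for c in grid}  # Table 2

# Uniform prior over candidate pmfs: total mass on "x1 is generated with 80% chance"
p80 = Fraction(8, 10)
mass_1 = Fraction(sum(p[0] == p80 for p in t1), len(t1))  # 3/66
mass_2 = Fraction(sum(p[0] == p80 for p in t2), len(t2))  # 11/111
print(float(mass_1), float(mass_2))  # ~0.0455 versus ~0.099: not invariant
```

The same substantive belief thus receives roughly twice the prior mass under the second representation.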
2.3.4. Different representation, different prior model probabilities,
thus, different Bayes factors
Applying the C&S belief propagation procedure to the matrix
of Table 1 yields different allocations, and thus different Bayes
factors, than when we use the matrix of Table 2. For example, a
scientist might believe that the true data generating pmf is close to
p(X | ψ18 ) = [0.1, 0.6, 0.3] of Table 1 and thus chooses λ(ψ18 ) =
0.50. This prior belief then gets allocated to the model instance
p(X | ϑ = 0.6, M1 ) of M1 . Similarly, we would expect that the
scientist would also set λ(ψ̃19 ) ≈ 0.50 when confronted with
Table 2, because the candidate pmf p(X | ψ̃19 ) = [0.10, 0.63, 0.27]
in the second matrix does not differ that much from the pmf
p(X | ψ18 ) of the first matrix. However, according to the second
matrix the prior pmf probability λ(ψ̃19 ) is then allocated to
the model instance p(X | a = 0.63, M2 ) of M2 . In effect, a
different representation leads to a different belief allocation and,
thus, to different priors πi (θi ), different prior model probabilities
P (Mi ), different posteriors P (Mi | d) and, consequently, different
Bayes factors. As such, our understanding of the C&S belief
propagation procedure leads to an inadequate definition of Bayes
factors, one that depends on how we choose to represent P .
2.3.5. Jeffreys’s prior and the C&S procedure
This lack of invariance is due to errors incurred
from (1) the parameterization ξ itself, and (2) the discretization
of the parameter space. For example, the matrices depicted in
Tables 1 and 2 were derived from the parameterizations ξ and ξ̃ ,
respectively, followed by a discretization of the parameter space
with a step size of 0.1 in each coordinate. The first point can be
repaired, as Jeffreys (1946) showed that the Fisher information can
be used to neutralize the parameterization error. This solution is
more commonly known as Jeffreys's prior. In Ly et al. (2015) we
showed that Jeffreys's prior on the parameters, say, ψ = (b, c )
in Ψ leads to a uniform prior on pmfs in P . The second point,
however, cannot be fixed.
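For a multinomial with W categories, Jeffreys's prior is the Dirichlet(1/2, . . . , 1/2) distribution, which is uniform with respect to the Fisher information metric. A minimal sketch of drawing candidate pmfs from this prior, via the standard gamma-normalization construction (illustrative code, not from the original article):

```python
import random

def sample_jeffreys_pmf(W, rng):
    """Draw a pmf from the Jeffreys prior, i.e., Dirichlet(1/2, ..., 1/2),
    by normalizing independent Gamma(1/2, 1) variates."""
    g = [rng.gammavariate(0.5, 1.0) for _ in range(W)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(1)  # fixed seed for reproducibility
p = sample_jeffreys_pmf(3, rng)
print(p)  # a random pmf on three outcomes; components are nonnegative and sum to 1
```

No discretization of the parameter space is involved: each draw is a point in the full, uncountable model P.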
To elaborate on this latter point, recall that the collection of all
data generating pmfs P is uncountably large, which means that
the scientist’s actual prior belief λ(ψ) is a continuous quantity. By
using a finite number M of candidate true data generating pmfs,
the target continuous random variable λ(ψ) is then approximated
by a discretized version λ(ψm ). The corresponding discretization
errors are comparable to the errors incurred when histograms are
used to approximate a smooth density function. Moreover, because
the actual belief about ψ is continuous, there is zero probability
that the true data generating process p∗ (X ) is exactly equal to any
one of the finite number of candidate pmfs p(X | ψm ).
As such, we cannot construct the actual belief λ(ψ) from point
masses. Note that this continuity issue was already alluded to in
Section 2.3.3 as one would expect that if the pmfs indexed by
ψ̃89 , ψ̃94 , ψ̃99 in Table 2 are assigned some prior mass, the pmfs
in between would also receive some prior mass. The implication is
that the C&S procedure might only work if we use a ‘‘matrix’’ with
an uncountable number of rows.
Furthermore, the discretization leads to another type of
approximation error, which we refer to as the geometric approximation
error because it depends on the chosen divergence measure G. This error
was alluded to in Section 2.3.4, where a small change in the
candidate true data generating pmf p(X | ψ18 ) = [0.1, 0.6, 0.3]
to p(X | ψ̃19 ) = [0.10, 0.63, 0.27] leads to a completely different
allocation of the prior belief; from a model instance of M1 to
M2 . The geometrical interpretation stems from the fact that
KL-divergence can be thought of as a generalization of the
Fisher information metric.11 Moreover, it follows directly from
the geometric interpretation that the C&S belief propagation
procedure favors the more complex model, as it will attract a larger
number of candidate data generating pmfs indexed by ψm , see Ly
et al. (2015). This a priori boosting of the more complex model is at
odds with the simplicity postulate that seems to be central in the
foundations of the C&S procedure, see Shiffrin et al. (2016) in this
special issue.
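The geometric allocation can be sketched as follows: each candidate pmf is assigned to the model containing its KL-closest instance. The instance sets below are hypothetical stand-ins chosen only to reproduce the flip described above; they are not the actual rows of Tables 1 and 2:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical model instances (illustrative only)
instances_m1 = [[0.10, 0.60, 0.30], [0.20, 0.50, 0.30]]
instances_m2 = [[0.10, 0.63, 0.27], [0.30, 0.40, 0.30]]

def allocate(p):
    """Assign a candidate pmf to the model with the KL-nearest instance."""
    d1 = min(kl(p, q) for q in instances_m1)
    d2 = min(kl(p, q) for q in instances_m2)
    return "M1" if d1 <= d2 else "M2"

print(allocate([0.10, 0.60, 0.30]))  # M1
print(allocate([0.10, 0.63, 0.27]))  # M2: a small perturbation flips the allocation
```

Because the allocation is a hard nearest-instance rule, small perturbations of a candidate pmf near the boundary between the two instance sets move its entire prior mass from one model to the other.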
The fact that we cannot construct the actual belief λ(ψ) from
point masses is at odds with the C&S idea that P (Mi ) is the sum of
its parts. This bottom-up view is what caused Shiffrin et al. (2016)
to avoid overlapping models; when M1 and M2 share a pmf and the
shared instance receives some prior mass, this prior mass will be
accounted for twice. As a result, the prior model probabilities will
then exceed one, i.e., P (M1 )+P (M2 ) > 1. To deal with overlapping
models, Shiffrin et al. (2016) suggested removing the common
pmfs from the larger model. This idea is elaborated on with a
toy example where M3 is a binomial model with the chance of
success θ fixed at θ = 0.5 and where M4 represents the binomial
model in which θ is free to vary between zero and one. They then
reformulate M4 as the binomial model M̃4 in which θ is restricted
to (0, 0.49) ∪ (0.51, 1). This replacement of M4 by
M̃4 leads to another approximation error. One solution would be
to allow M̃4 to converge to M4 by allowing θ to be in (0, 0.5 − ϵ) ∪
(0.5 + ϵ, 1). This construction however depends on how ϵ goes
to zero and induces the Borel–Kolmogorov paradox (e.g., Lindley,
1997; Wetzels, Grasman, & Wagenmakers, 2010). This paradox
is another indication of how the C&S belief propagation scheme
depends on how we as scientists represent the problem in terms
of the chosen parameterization and, subsequently, discretize the
parameter space.
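The ε-dependence is easy to probe numerically. Under a uniform prior, the marginal likelihood of the full binomial model M4 equals 1/(n + 1) for any number of successes k; the sketch below (with hypothetical data k = 6 successes in n = 10 trials, chosen purely for illustration) excises an ε-neighborhood of θ = 0.5, renormalizes the prior, and shows how the marginal likelihood of M̃4 varies with ε:

```python
import math

def binom_lik(theta, k=6, n=10):
    """Binomial likelihood of k successes in n trials at success probability theta."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def marginal_tilde_m4(eps, k=6, n=10, m=20000):
    """Marginal likelihood of M~4: uniform prior on (0, 0.5 - eps) U (0.5 + eps, 1),
    renormalized to total mass one; midpoint-rule quadrature with m grid points."""
    total = 0.0
    for i in range(m):
        t = (i + 0.5) / m
        if abs(t - 0.5) > eps:
            total += binom_lik(t, k, n) / m
    return total / (1 - 2 * eps)

for eps in (0.1, 0.01, 0.001):
    print(eps, marginal_tilde_m4(eps))  # approaches 1/(n + 1) ~ 0.0909 as eps -> 0
```

The marginal likelihood, and hence the Bayes factor against M3, changes with ε, so the result depends on exactly how the neighborhood of θ = 0.5 is shrunk away.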
In other words, we believe that the lack of invariance is
inescapable when the C&S approach is operationalized with a
finite-dimensional matrix; this oversimplification of the
problem results in a representation that is not on par with the
sophisticated ideas behind the C&S approach.
2.4. Conclusion
Based on the different strategies used to set priors πi (θi ) within
the models Mi , we conclude that the C&S belief propagation
procedure answers a different question than a traditional Bayes
factor. We believe that C&S are mostly concerned with how
a scientist's subjective knowledge of the true data generating
process p∗ (X ) is reflected in the models M1 and M2 . Hence, C&S
focus on checking whether the models M1 and M2 give a good
representation of expert knowledge.
As such, we think that the C&S approach can be valuable at the
preliminary stage of model building. In particular, by considering
11 The KL-divergence is not a metric in the formal sense; only its infinitesimal
version can be related to the Fisher information metric, which underlies Jeffreys's prior.
all possible data generating pmfs for the random variable X ,
the C&S procedure forces the statistician to focus on building a
model that is relevant for the problem at hand, rather than being
restricted by the standard models. We would like to emphasize
that our remarks are not aimed at the aspiration of C&S to construct
good models that mimic nature well.
Our major concern deals with the finite-dimensional representation
that C&S use to operationalize their procedure and with the
recommendation to set λ(ψ) subjectively. The idea behind considering
the full model P is to account for misspecification; as a result,
however, the subjective assessment of prior beliefs is nigh impossible.
Note that the subjective belief λ(ψ) is necessarily a continuous random
variable, because the full model P contains an uncountable
number of candidate true pmfs p(X | ψ). To make their procedure
viable, C&S oversimplify the problem with a finite-dimensional matrix,
yielding approximation errors that cannot be ignored.
The problem worsens when X is also continuous. In that case,
the full model should then be represented by a ‘‘matrix’’ with
an uncountable number of rows and columns. Moreover, this full
model is far too complex, as it does not even allow for consistent
inference (Dvoretzky, Kiefer, & Wolfowitz, 1956). This is why regularization
methods were invented and alternative models were proposed
that grow with the number of samples (e.g., Bickel, 2006). The goal
set by C&S to compare models in a totally unrestrictive setting
is ambitious and an active area of research that is progressing
slowly, see Borgwardt and Ghahramani (2009), Ghosal, Lember,
and Van Der Vaart (2008), Holmes, Caron, Griffin, and Stephens
(2015), Labadi, Masuadi, and Zarepour (2014), Salomond (2013,
2014) for some recent results.
For estimation problems, one solution would be to forgo the
finite matrix representation and consider the prior on P as a
continuous random variable instead. As a replacement of the
subjective assessment, we then recommend Jeffreys’s prior as it
is uniform on P when X has a finite number of outcomes W . A
Jeffreys prior for the full model P is viable when W < ∞, as
the distribution of X is then at most a multinomial distribution
with W categories. When X is continuous the Jeffreys prior can
then be extended by a method described in Ghosal, Ghosh, and
Ramamoorthi (1997), which has been used successfully to justify
Bayesian nonparametric estimation methods, see also Ghosal,
Ghosh, and Van Der Vaart (2000) and Kleijn (2013). However, this
replacement of the discretized λ(ψm ) by a continuous version
λ(ψ) is at odds with the philosophy that the prior on the whole,
P (Mi ), is a sum of its parts πi (θi ) as the individual model instances
then necessarily receive zero mass. Furthermore, we do not know
how to translate a continuous λ(ψ) on all pmfs P to the model
instances πi of Mi without an explicitly defined relationship
between the true data generating p∗ (X ) and the model instances of
Mi . In effect, we doubt that the C&S procedure extends traditional
Bayes factors and that it is capable of yielding a Jeffreys's Bayes
factor that formalizes inductive reasoning and the logic of proof by
contradiction. The reason for this doubt is due to the fact that C&S
do not focus on the two models under test, instead, they embed
these two models within a larger encompassing model as Robert
did, see Section 1.
In conclusion, we believe that a Jeffreys’s Bayes factor remains
the preferred method of inference, because a Jeffreys’s Bayes factor
does not depend on how the full model P is represented and
discretized. Thus, it does not suffer from the lack of invariance
as discussed above. Furthermore, a Jeffreys’s Bayes factor does
not require a subjective elicitation of prior beliefs. Note that
the Bayes factor focuses on comparing the models M1 and M2 ;
no reference is made to any true data generating process p∗ (X ).
Jeffreys was mostly concerned with quantifying the (relative)
evidence provided by the observations for either model. The Bayes
factor is not concerned with the true data generating process p∗ (X )
and it does not aspire to do so. Both M1 and M2 could be poor
descriptions of the true data generating pmf p∗ (X ), but fortunately
it has been shown that the model selected with a Bayes factor is the
model closest to the true p∗ (X ) in terms of KL-divergence (e.g., Dass
& Lee, 2004). Hence, the model that is preferred by the Bayes factor
will be able to generalize better to yet unseen data—a guarantee
that aligns with the spirit of the C&S approach.
3. Conclusion
We would like to thank the authors of both comments for
their stimulating remarks and for their creative alternatives and
extensions to Jeffreys’s Bayes factors. We hope this discussion
has resulted in a renewed appreciation for Harold Jeffreys’s
foundational contributions to model selection and hypothesis
testing, and we look forward to future developments in this
exciting and important area of research.
Acknowledgments
This work was supported by the starting grant ‘‘Bayes or
Bust’’ awarded by the European Research Council (grant number:
283876). Correspondence concerning this article may be addressed
to Alexander Ly. We thank Maarten Marsman and Joris Mulder for
their comments on an earlier draft of the manuscript.
References
Bayarri, M., & Berger, J. O. (2000). P values for composite null models. Journal of the
American Statistical Association, 95(452), 1127–1142.
Bayarri, M., Berger, J., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian
model choice with application to variable selection. The Annals of Statistics,
40(3), 1550–1577.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer Verlag.
Berger, J. O., & Pericchi, L. R. (2001). Objective Bayesian methods for model
selection: Introduction and comparison. In Lecture notes-monograph series
(pp. 135–207).
Berger, J. O., Pericchi, L. R., & Varshavsky, J. A. (1998). Bayes factors and marginal
distributions in invariant situations. Sankhyā: The Indian Journal of Statistics,
Series A, 307–321.
Bickel, P. (2006). Regularization in statistics. Test, 15(2), 271–344. With discussion
by Li, Tsybakov, van de Geer, Yu, Valdés, Rivero, Fan, van der Vaart, and a
rejoinder by the author.
Bickel, P. J., & Kleijn, B. J. K. (2012). The semiparametric Bernstein–von Mises
theorem. The Annals of Statistics, 40(1), 206–237.
Borgwardt, K.M., & Ghahramani, Z. (2009). Bayesian two-sample tests. arXiv
preprint arXiv:0906.4032.
Chandramouli, S., & Shiffrin, R. (2016). Extending Bayesian induction. Journal of
Mathematical Psychology, 72, 38–42.
Consonni, G., & Veronese, P. (2008). Compatibility of prior specifications across
linear models. Statistical Science, 23(3), 332–353.
Dass, S. C., & Lee, J. (2004). A note on the consistency of Bayes factors for testing
point null versus non-parametric alternatives. Journal of Statistical Planning and
Inference, 119(1), 143–152.
Dawid, A. P., & Lauritzen, S. L. (2001). Compatible prior distributions. In Bayesian
methods with applications to science, policy and official statistics (pp. 109–118).
Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through
Bayesian sampling. Journal of the Royal Statistical Society. Series B. Statistical
Methodology, 363–375.
Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic minimax character of
the sample distribution function and of the classical multinomial estimator. The
Annals of Mathematical Statistics, 642–669.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for
psychological research. Psychological Review, 70, 193–242.
Etz, A., & Wagenmakers, E.-J. (2015). Origin of the Bayes factor. arXiv preprint
arXiv:1511.08180.
Friston, K. J., & Penny, W. (2011). Post hoc Bayesian model selection. NeuroImage,
56, 2089–2099.
Ghosal, S., Ghosh, J., & Ramamoorthi, R. (1997). Non-informative priors via sieves
and packing numbers. In Advances in statistical decision theory and applications
(pp. 119–132). Springer.
Ghosal, S., Ghosh, J. K., & Van Der Vaart, A. W. (2000). Convergence rates of posterior
distributions. The Annals of Statistics, 28(2), 500–531.
Ghosal, S., Lember, J., Van Der Vaart, A., et al. (2008). Nonparametric Bayesian model
selection and averaging. Electronic Journal of Statistics, 2, 63–89.
Grazian, C., & Robert, C. P. (2015). Jeffreys’ priors for mixture estimation. In Bayesian
statistics from methods to models and applications (pp. 37–48). Springer.
Heck, D., Wagenmakers, E.-J., & Morey, R. D. (2015). Testing order constraints:
Qualitative differences between Bayes factors and normalized maximum
likelihood. Statistics & Probability Letters, 105, 157–162.
Holmes, C. C., Caron, F., Griffin, J. E., Stephens, D. A., et al. (2015). Two-sample
Bayesian nonparametric hypothesis testing. Bayesian Analysis, 10(2), 297–320.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation
problems. Proceedings of the Royal Society of London. Series A. Mathematical and
Physical Sciences, 186(1007), 453–461.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University
Press.
Jeffreys, H. (1980). Some general points in probability theory. In A. Zellner &
J. B. Kadane (Eds.), Bayesian analysis in econometrics and statistics: Essays in honor
of Harold Jeffreys (pp. 451–453). Amsterdam: North-Holland.
Johnson, V. E., & Rossell, D. (2010). On the use of non-local prior densities
in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 72(2), 143–170.
Kamary, K., Eun, J., & Robert, C.P. (2016). Non-informative reparameterisations for
location-scale mixtures. arXiv preprint arXiv:1601.01178.
Kamary, K., Mengersen, K., Robert, C.P., & Rousseau, J. (2014). Testing hypotheses
via a mixture estimation model. arXiv preprint arXiv:1412.2044.
Kary, A., Taylor, R., & Donkin, C. (2016). Using Bayes factors to test the predictions
of models: A case study in visual working memory. Journal of Mathematical
Psychology, 72, 210–219.
Kleijn, B. (2013). Criteria for Bayesian consistency. arXiv preprint arXiv:1308.1263.
Labadi, L.A., Masuadi, E., & Zarepour, M. (2014). Two-sample Bayesian nonparametric goodness-of-fit test. arXiv preprint arXiv:1411.3427.
Lee, M. D., Lodewyckx, T., & Wagenmakers, E.-J. (2015). Three Bayesian analyses
of memory deficits in patients with dissociative identity disorder. In J.
R. Raaijmakers, A. Criss, R. Goldstone, R. Nosofsky, & M. Steyvers (Eds.),
Cognitive modeling in perception and memory: A festschrift for Richard M. Shiffrin
(pp. 189–200). Psychology Press.
Lindley, D. V. (1977). The distinction between inference and decision. Synthese, 36,
51–58.
Lindley, D. V. (1997). Some comments on Bayes factors. Journal of Statistical Planning
and Inference, 61(1), 181–189.
Ly, A., Etz, A., Marsman, M., Epskamp, S., Gronau, Q., & Matzke, D., et al., (2015).
Replication Bayes factors. (in preparation).
Ly, A., Marsman, M., Verhagen, A., Grasman, R., & Wagenmakers, E.-J. (2015). A
tutorial on Fisher information. Journal of Mathematical Psychology (submitted
for publication).
Ly, A., Verhagen, A., & Wagenmakers, E.-J. (2016). Harold Jeffreys’s default Bayes
factor hypothesis tests: Explanation, extension, and application in psychology.
Journal of Mathematical Psychology, 72, 19–32.
Marin, J.-M., & Robert, C. P. (2014). Bayesian essentials with R. Springer.
Robert, C. (2007). The Bayesian choice: from decision-theoretic foundations to
computational implementation. Springer Science & Business Media.
Robert, C. (2016). The expected demise of the Bayes factor. Journal of Mathematical
Psychology, 72, 33–37.
Robert, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3(2),
601–608.
Robert, C. P. (2014). On the Jeffreys-Lindley paradox. Philosophy of Science, 81(2),
216–232.
Robert, C.P. (2015). The Metropolis–Hastings algorithm. arXiv preprint
arXiv:1504.01896.
Robert, C. P., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys’s theory of probability
revisited. Statistical Science, 141–172.
Rozeboom, W. W. (1960). The fallacy of the null–hypothesis significance test.
Psychological Bulletin, 57, 416–428.
Salomond, J.-B. (2013). Bayesian testing for embedded hypotheses with application
to shape constrains. arXiv preprint arXiv:1303.6466.
Salomond, J.-B. (2014). Adaptive Bayes test for monotonicity. In The contribution of
young researchers to Bayesian statistics (pp. 29–33). Springer.
Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment
in the variable-selection problem. The Annals of Statistics, 38(5), 2587–2619.
Shiffrin, R., Chandramouli, S., & Grünwald, P. (2016). Bayes factors, relations
to minimum description length, and overlapping model classes. Journal of
Mathematical Psychology, 72, 56–77.
Steingroever, H., Wetzels, R., & Wagenmakers, E.-J. (in press). Bayes factors for
reinforcement-learning models of the Iowa gambling task. Decision.
Stephens, M., & Balding, D. J. (2009). Bayesian statistical methods for genetic
association studies. Nature Reviews Genetics, 10, 681–690.
Turner, B. M., Sederberg, P. B., & McClelland, J. L. (2016). Bayesian analysis of
simulation-based models. Journal of Mathematical Psychology, 72, 191–199.
Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a
replication attempt. Journal of Experimental Psychology: General, 143(4), 1457.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p
values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why
psychologists must change the way they analyze their data: The case of psi.
Journal of Personality and Social Psychology, 100, 426–432.
Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2010). An encompassing
prior generalization of the Savage–Dickey density ratio test. Computational
Statistics & Data Analysis, 54, 2094–2102.
Wrinch, D., & Jeffreys, H. (1919). On some aspects of the theory of probability.
Philosophical Magazine, 38, 715–731.
Wrinch, D., & Jeffreys, H. (1921). On certain fundamental principles of scientific
inquiry. Philosophical Magazine, 42, 369–390.
Wrinch, D., & Jeffreys, H. (1923). On certain fundamental principles of scientific
inquiry. Philosophical Magazine, 45, 368–375.
Yang, G. L., & Le Cam, L. (2000). Asymptotics in statistics: Some basic concepts. Berlin,
Germany: Springer.