Journal of Mathematical Psychology 72 (2016) 43–55

An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys

Alexander Ly ∗, Josine Verhagen, Eric-Jan Wagenmakers
University of Amsterdam, Department of Psychology, Weesperplein 4, 1018 XA Amsterdam, The Netherlands

highlights
• Reply to Robert (2016) The expected demise of the Bayes factor.
• Reply to Chandramouli and Shiffrin (2016) Extending Bayesian induction.
• Further elaboration on Jeffreys’s Bayes factors.

article info
Article history: Available online 15 February 2016
Keywords: Bayes factors; Induction; Model selection; Replication; Statistical evidence

abstract
Our original article provided a relatively detailed summary of Harold Jeffreys’s philosophy on statistical hypothesis testing. In response, Robert (2016) maintains that Bayes factors have a number of serious shortcomings. These shortcomings, Robert argues, may be addressed by an alternative approach that conceptualizes model selection as parameter estimation in a mixture model. In a second comment, Chandramouli and Shiffrin (2016) seek to extend Jeffreys’s framework by also taking into consideration data distributions that do not originate from either of the models under test. In this rejoinder we argue that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than he suggests, as they share the same ‘‘shortcomings’’. On the other hand, we show that the proposition of Chandramouli and Shiffrin (2016) to extend the Bayes factor is in fact further removed from Jeffreys’s view on testing than the authors suggest. By elaborating on these points, we hope to clarify our case for Jeffreys’s Bayes factors.

© 2016 Elsevier Inc. All rights reserved.
In our original article (Ly, Verhagen, & Wagenmakers, 2016) we outlined how Harold Jeffreys constructed his hypothesis tests. Jeffreys’s tests contrast a precise, point-null hypothesis M0 versus a more general alternative hypothesis M1. Here the point-null hypothesis represents a general law, an invariance, or a categorical causal claim (e.g., ‘‘apple trees always bear apples’’; ‘‘people cannot look into the future’’; ‘‘Alzheimer’s disease is caused by a fungal infection of the central nervous system’’), whereas the alternative hypothesis relaxes that law. Jeffreys’s tests require a thoughtful specification of the prior distribution for the parameter of interest, and much of Jeffreys’s work was concerned with providing good default specifications—‘‘good’’ in the sense that they adhere to general common-sense desiderata (e.g., Bayarri, Berger, Forte, & García-Donato, 2012). We are pleased that our summary attracted two comments by renowned researchers; below we respond to their ideas in a way that we hope is consistent with the overall philosophy of Harold Jeffreys himself.

∗ Corresponding author. E-mail address: [email protected] (A. Ly). http://dx.doi.org/10.1016/j.jmp.2016.01.003 0022-2496/© 2016 Elsevier Inc. All rights reserved.

1. Rejoinder to Robert

In general, Robert’s (2016) comments highlight the inevitable subtleties in constructing a Bayes factor. His alternative mixture model procedure is practical and may be immensely valuable for specific situations (i.e., hierarchical models) that are common in psychological research. Nevertheless, we believe Robert’s suggestion about the demise of the Bayes factor to be an overstatement.

1.1. Robert’s critique on the Bayes factor

Our understanding of Jeffreys’s method is partly based on the work by Robert and colleagues (2009), and it should, therefore, not come as a surprise that Robert’s view and ours overlap to a considerable degree.
Robert’s arguments for dismissing the Bayes factor can be grouped in terms of (1) its usage in making decisions and (2) the care that needs to be taken in choosing the priors.

1.1.1. First critique: the distinction between inference and decision making

We share Robert’s discontent with the statistical practice that emphasizes all-or-none decisions at some arbitrary threshold, and we agree that scientific learning should instead be guided by a continuous measure of evidence. In the process of eviscerating p-value null hypothesis tests, Rozeboom (1960, pp. 422–423) already expressed a similar sentiment: ‘‘The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true’’.

Our favorite continuous measure of evidence is of course a Bayes factor constructed from a pair of priors selected according to Jeffreys’s desiderata, or a Jeffreys’s Bayes factor for short. It is important to note that this measure provides only the first of three Bayesian ingredients needed for decision making. The other two ingredients are the prior model probabilities (which, combined with the Bayes factor, yield posterior model probabilities) and the specification of a loss function (or equivalently, a utility function; Berger, 1985; Lindley, 1977; Robert, 2007). For instance, consider a Bayes factor of BF10(d) = 4.6 for the observed data d. This Bayes factor can be converted to a posterior model probability of P(M0 | d) = 0.17 when we set P(M0) = P(M1) = 1/2 (Ly et al., 2016).
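As a sketch, the conversion from a Bayes factor to posterior model probabilities can be written out in a few lines (the function name and rounding are our own illustrative choices):

```python
def posterior_model_probs(bf10, prior_m1=0.5):
    """Turn a Bayes factor BF10 and the prior model probability P(M1)
    into the posterior model probabilities P(M0 | d) and P(M1 | d)."""
    prior_odds = prior_m1 / (1.0 - prior_m1)   # P(M1) / P(M0)
    post_odds = bf10 * prior_odds              # posterior odds for M1 over M0
    p_m1 = post_odds / (1.0 + post_odds)
    return 1.0 - p_m1, p_m1

# equal prior model probabilities, as in the example above
p0, p1 = posterior_model_probs(4.6)
print(p0, p1)  # P(M0 | d) = 1/5.6, roughly 0.18
```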
One possible subsequent decision rule is then to accept M1 because it has the highest posterior model probability. We did not intend to suggest such a procedure, as the decision is clearly sensitive to the prior model probabilities. Furthermore, we do not recommend uniform prior model probabilities regardless of scientific context. In fact, when decision making is desired, the assignment of prior model probabilities is left to the substantive researcher. Such flexibility in assignment introduces subjectivity, and this may be seen either as a disadvantage or as an advantage. At any rate, prior model probabilities can be used to formalize the adage that ‘‘extraordinary claims require extraordinary evidence’’ (e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Moreover, the prior model probabilities can be used to address the problem of multiplicity (e.g., Jeffreys, 1961; Scott & Berger, 2010; Stephens & Balding, 2009). A similar argument applies to utility functions: these may be subjective and hard to elicit, but such difficulties do not sanction the practice of ignoring utility functions altogether, at least not when the purpose is to make decisions. Thus, Robert worries that computation of Bayes factors may tempt users to make all-or-none decisions while disregarding prior model probabilities or loss functions. We agree with Robert that there is a considerable difference between inference and decision making, and that scientific learning should be guided by a continuous measure of evidence that incorporates what we have learned from the observed data. The Bayes factor is such a measure.

1.1.2. Second critique: the Jeffreys–Lindley–Bartlett paradox

We suspect that the Jeffreys–Lindley–Bartlett (henceforth JLB) paradox is central to Robert’s (1993; 2014) dismissal of the Bayes factor and that it is the main motivation for the development of the mixture model alternative.
We take a closer look at the JLB paradox and discuss two consequences foreseen by Jeffreys, who was keenly aware of the ‘‘paradox’’ from the very beginning (Etz & Wagenmakers, 2015).

First, the JLB paradox implies that we cannot use improper priors to construct a Bayes factor. For instance, to estimate µ within the normal model M1 : X ∼ N(µ, 1), we typically employ Jeffreys’s (1946) prior π(µ) ∝ 1. The reason to do so stems from the fact that Jeffreys’s prior is translation-invariant, leading to a posterior that is independent of how researchers parameterize the problem (Ly, Marsman, Verhagen, Grasman, & Wagenmakers, 2015). The JLB paradox implies that we cannot use this same (estimation) prior on the test-relevant parameter for a Bayesian test. More specifically, when we pit the aforementioned model M1 against the null model M0 : X ∼ N(0, 1), the improper prior π1(µ) ∝ 1 becomes useless. To see this we consider Jeffreys’s prior as the limit of proper priors µ ∼ N(0, τ²) with τ tending to infinity. The Bayes factor for the observed data d = (n, x̄) is then given by

$$\lim_{\tau\to\infty} \widetilde{\mathrm{BF}}_{10;\tau}(d) = \lim_{\tau\to\infty} \frac{\int \exp\left[-\tfrac{n}{2}(\bar{x}-\mu)^{2}\right] \tfrac{1}{\sqrt{2\pi}\,\tau} \exp\left[-\tfrac{1}{2\tau^{2}}\mu^{2}\right] \mathrm{d}\mu}{\exp\left[-\tfrac{n}{2}\bar{x}^{2}\right]} \tag{1}$$

$$= \lim_{\tau\to\infty} \frac{1}{\sqrt{1+n\tau^{2}}} \exp\left[\frac{(\tau n \bar{x})^{2}}{2(1+n\tau^{2})}\right] = 0, \tag{2}$$

regardless of the fixed sample size n and the observed sample mean x̄. As such, the Bayes factor constructed from the improper Jeffreys’s prior will always favor the null model, and this also holds for other improper priors. Moreover, Eq. (2) shows that for fixed data d = (n, x̄) and a Bayes factor constructed from a normal prior with hyperparameter τ, we can obtain a Bayes factor in favor of the null hypothesis of arbitrary size (i.e., BF̃10;τ(d) < 1) simply by taking τ large enough. Hence, the JLB paradox effectively implies that a testing problem should be treated differently from one that is concerned with estimation.
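The limit in Eq. (2) can be made concrete by evaluating the closed form for a fixed data set while τ grows; the sample size and sample mean below are our own hypothetical choices:

```python
import math

def bf10_tau(n, xbar, tau):
    """Closed form of Eq. (2): N(0, tau^2) prior on mu, known sigma = 1."""
    log_bf = (tau * n * xbar) ** 2 / (2.0 * (1.0 + n * tau ** 2)) \
        - 0.5 * math.log(1.0 + n * tau ** 2)
    return math.exp(log_bf)

# fixed hypothetical data: n = 10 observations with sample mean 0.5
for tau in (1.0, 10.0, 100.0, 1000.0):
    print(tau, bf10_tau(10, 0.5, tau))  # shrinks toward 0 as tau grows
```

However strongly the data favor the alternative, widening the prior eventually drives the Bayes factor toward the null, which is the paradox in a nutshell.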
As such, when π1 is interpreted as prior belief about the parameters θ1 (in the example above, θ1 = µ), one’s belief about the parameter then changes depending on whether one is concerned with testing or estimating. More generally, this difference is due to the fact that estimation is typically a within-model affair. Recall that a model Mi specifies a relationship fi(d | θi) that defines which parameters θi are relevant in the data generating process of the data d. Hence, the function fi gives the (only) context in which the parameters θi can be perceived. In essence, the fi justifies that it is meaningful to calculate a posterior distribution for the parameter. To underline this point we add subscripts to the parameters indicating model membership in the next example, by taking θ0 = σ0 and θ1 = (µ1, σ1) for f0 and f1 both normals. For example, when we assume that M0 : X ∼ N(0, σ0²), only a posterior for the standard deviation σ0 is worth pursuing, as the posterior for the population mean remains zero, regardless of the data. Within M0, the Jeffreys’s prior for σ0 is given by π0(σ0) ∝ σ0⁻¹, which can be updated to a posterior π0(σ0 | d). On the other hand, under M1 : X ∼ N(µ1, σ1²) we are dealing with two parameters of interest. Within M1, the Jeffreys’s prior for µ1 is π1(µ1) ∝ 1, for σ1 it is π1(σ1) ∝ 1/σ1, and we take π1(µ1, σ1) = π1(µ1)π1(σ1). These priors can be updated to posteriors π1(µ1 | d) and π1(σ1 | d). Even though the two priors π0(σ0) and π1(σ1) have the same form, they do not lead to the same posterior. In fact, due to the presence of µ1 as a parameter, the posterior mean of π1(σ1 | d) within M1 will be smaller than or equal to the posterior mean of π0(σ0 | d) within M0. Thus, when we are interested in the standard deviation σi, it matters whether we believe that M0 holds true or whether the population mean µ1 plays a role in the data generating process as specified by f1.
The Bayes factor helps us distinguish which of the two models is better suited to the data and which posterior for σi we should report. Hence, testing is a between-model matter. Jeffreys himself was very clear about the distinction between estimation and testing:

‘‘We are now concerned with the more difficult question: in what circumstances do observations support a change of the form of the law itself? This question is really logically prior to the estimation of the parameters, since the estimation problem presupposes that the parameters are relevant’’. (Jeffreys, 1961, p. 245).

Hence, testing implies that we are uncertain about which of the two functional relationships defined by the models M0 and M1 is adequate for the data under study. This uncertainty is expressed through the prior statement P(M0), P(M1) > 0, and when M0 and M1 are the only models under consideration we require that P(M0) + P(M1) = 1. The priors π1, π0 in a Bayes factor are thus chosen to guide scientific learning, that is, how one transitions from prior model odds to posterior model odds, and are not designed to yield posteriors that are good for estimation. To simplify notation, we drop the subscripts indicating model membership when the context is clear.

Second, the separation of estimation and testing and the resulting separation of models led us to instantiate the hypotheses Hi with their respective models Mi as discussed in Ly et al. (2016). In effect, we have different contexts in which the respective parameters exist and, therefore, a philosophical conundrum in what is meant by common parameters. The difference between the posteriors π0(σ | d) and π1(σ | d) discussed above showed that one should not be fooled by the fact that the Greek letters are identical. We therefore agree with Robert’s warning concerning the treatment of common parameters.
For the t-test the commonality between the two σs within M0 and M1 is given by their meaning as a scaling parameter within either model. Furthermore, the nesting of π0(σ) as π1(δ, σ) = π1(δ)π0(σ) can be considered a practical choice. In effect, the Bayes factor BF10(d) is then given by the ratio of the following two marginal likelihoods

$$p(d \mid M_1) = (2\pi)^{-\frac{n}{2}} \int_0^\infty \int_{-\infty}^\infty \sigma^{-n} \exp\left[-\frac{n}{2}\left(\left(\frac{s}{\sigma}\right)^{2} + \left(\frac{\bar{x}}{\sigma}-\delta\right)^{2}\right)\right] \pi_1(\delta)\, \mathrm{d}\delta\, \pi_0(\sigma)\, \mathrm{d}\sigma, \tag{3}$$

$$p(d \mid M_0) = (2\pi)^{-\frac{n}{2}} \int_0^\infty \sigma^{-n} \exp\left[-\frac{n}{2\sigma^{2}}\left(\bar{x}^{2} + s^{2}\right)\right] \pi_0(\sigma)\, \mathrm{d}\sigma, \tag{4}$$

where d = (n, x̄, s²). We would like to thank Robert for pointing out our notational inaccuracy, as Eq. (9) in Ly et al. (2016) should actually be Eq. (3), that is, the marginal likelihood of the alternative model, thus, the numerator of the Bayes factor BF10(d), after π1(δ) and π0(σ) are specified. In the original text we already filled in π0(σ) ∝ σ⁻¹, a choice which we elaborated on in Section 3.2.2 of Ly et al. (2016). With the nesting of π0 within π1 we made the following recommendation explicit:

‘‘It is to be understood that in pairs of equations of this type [such as Eqs. (3) and (4)] the sign of proportionality indicates the same constant factor, which can be adjusted to make the total probability 1’’. (Jeffreys, 1961, p. 247)

More precisely, an improper prior π0(σ) ∝ σ⁻¹ has a suppressed normalization constant π0(σ) = c0 σ⁻¹ and we not only take π1(σ) = c1 σ⁻¹ of the same form, but also choose to set c1 = c0, which allows us to use improper priors on the nuisance parameters (see Berger, Pericchi, & Varshavsky, 1998 for a theoretical justification). More examples of this type of nesting can be found in Dawid and Lauritzen (2001), Consonni and Veronese (2008), and references therein.

In conclusion, the JLB paradox prohibits the usage of improper priors for testing, separates the estimation practice from a testing concern, and challenges the idea of common parameters. As noted above, we first require a justification before we can use the same prior on the nuisance parameters. After doing so, we then create an exception to the ban on improper priors, allowing us to assign improper priors to the nuisance parameters, say, θ0 = σ. Furthermore, let δ denote the test-relevant parameter with, say, θ1 = (θ0, δ). Hence, after specifying Jeffreys’s translation-invariant priors on the nuisance parameters θ0, which we would use for estimation within each model, we only need to set the prior π1(δ) in order to define the Bayes factor BF10(d).

1.2. Jeffreys’s common-sense desiderata

‘‘Rejection of a null hypothesis is best when it is interocular’’. Edwards, Lindman, and Savage (1963, p. 240).

We suspect that Jeffreys’s underlying reason for the choice of π1(δ) was to have a test that passes ‘‘the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes’’ (Edwards et al., 1963, p. 217). We believe that the information consistency criterion makes explicit which data hit us right between the eyes. This criterion leads to a Bayes factor that is consistent for a finite sample, a requirement that is much harder to fulfill than the asymptotic consistency criterion, at least for parametric models (e.g., Bickel & Kleijn, 2012; Yang & Le Cam, 2000). We agree with Robert that information consistency is in some cases an approximate statement. In particular, when the data are distributed according to either M0 : X ∼ N(0, σ²) or M1 : X ∼ N(µ, σ²), the interocular data set with n > 2, x̄ ≠ 0 and, in particular, s² = 0 occurs with zero probability under both models, due to the assumption σ > 0. However, when M0 and M1 are the only two models under consideration, the observation x̄ ≠ 0 with n > 2, in addition to s² = 0, should lead to the logical exclusion of M0, thus, BF01(d) = 0.

To appreciate the information consistency criterion, we revisit the Bayesian t-test with Bayes factors BF̃10;τ(d) that lack this property, constructed from π0(σ) ∝ σ⁻¹ and π1(δ, σ) = π1(δ)π0(σ), where π1(δ) is normal around zero with standard deviation τ, i.e.,

$$\widetilde{\mathrm{BF}}_{10;\tau}(d) = (1+n\tau^{2})^{-\frac{1}{2}} \left(\frac{1+\frac{n\bar{x}^{2}}{ns^{2}}}{1+\frac{n\bar{x}^{2}}{(1+n\tau^{2})\,ns^{2}}}\right)^{\frac{n}{2}}. \tag{5}$$

As before, letting τ tend to infinity, while keeping n, x̄ and s² fixed, yields the JLB paradox, i.e., limτ→∞ BF̃10;τ(d) = 0.

To simplify the discussion we suppose that τ is set to one. The resulting Bayes factor BF̃10;τ=1(d) is then asymptotically consistent. This means that if we repeatedly sample from the null model, we let n tend to infinity and simultaneously let nx̄²/(ns²) = t²/(n − 1) tend to zero, yielding a Bayes factor of zero, where t is the usual t-statistic t = √n x̄/s_{n−1}. Similarly, if we repeatedly sample from the alternative model, we let n tend to infinity and simultaneously let t²/(n − 1) tend to infinity, yielding a Bayes factor of infinity. Thus, this Bayes factor BF̃10;τ=1(d) is able to detect the correct model when the number of data points tends to infinity.

The Bayes factor BF̃10;τ=1(d), however, is not information consistent. For the t-test, information consistency is concerned with having a fixed number of data points n > 2, an observed sample mean, say, x̄ ≠ 0, and s² tending to zero. With τ, n and x̄ fixed, this Bayes factor BF̃10;τ=1(d) is a decreasing function of s² that attains its maximum when s² = 0. For instance, when n = 4, x̄ = 7 the maximum is given by lim_{s²→0} BF̃10;τ=1(d) = 11.18. Note that the data set with n = 4, x̄ = 7 and s² → 0
is interocular as it leads to an observed sample effect size, a realization of the t-statistic, that tends to infinity, which should therefore lead to infinite support for the alternative compared to the null model. The fact that the information inconsistent Bayes factor BF̃10;τ=1(d) is bounded makes it hard to interpret. For instance, the observations n = 4, x̄ = 7 and s² = 1 yield a Bayes factor of BF̃10;τ=1(d) = 9.6, which does not seem a lot of evidence against the null, but with respect to its maximum 11.18 might be considered substantial. On the other hand, a Jeffreys’s Bayes factor is by construction information consistent and has a supremum (i.e., maximum) at infinity, which makes it easier to interpret. Jeffreys referred to this and other desiderata as common sense, as they came naturally to him (Etz & Wagenmakers, 2015), but it took a long time before his intuition was formalized by Berger and Pericchi (2001) and extended by Bayarri et al. (2012).

Recall that information consistency in a t-test requires us to construct a Bayes factor from a heavy-tailed prior on δ, and we agree with Robert that the Cauchy prior with scale γ = 1 is only one of many possible choices. This is why we included a robustness analysis in our open-source software package JASP (https://jaspstats.org/). However, we believe that the merit of a Jeffreys’s Bayes factor (with γ fixed) is due to the fact that it kickstarts scientific learning.

‘‘In any of these cases it would be perfectly possible to give a form of [π1(δ)] that would express the previous information satisfactorily, and consideration of the general argument of [Chapter] 5.0 will show that it would lead to common-sense results, but they would differ in scale. As we are aiming chiefly at a theory that can be used in the early stages of a subject, we shall not at present consider the last type of case’’ (Jeffreys, 1961, p. 252).
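As a numerical aside, the two benchmark values discussed above (9.6 and the bound 11.18) follow directly from Eq. (5) with τ = 1; a minimal check:

```python
import math

def bf10_tau(n, xbar, s2, tau=1.0):
    """Eq. (5): t-test Bayes factor with a N(0, tau^2) prior on delta."""
    g = 1.0 + n * tau ** 2
    num = 1.0 + (n * xbar ** 2) / (n * s2)
    den = 1.0 + (n * xbar ** 2) / (g * n * s2)
    return (num / den) ** (n / 2.0) / math.sqrt(g)

print(round(bf10_tau(4, 7, 1.0), 1))    # 9.6, the value reported above
print(round(bf10_tau(4, 7, 1e-12), 2))  # 11.18, the bound as s^2 -> 0
```

The bound equals (1 + nτ²)^((n−1)/2), here 5^(3/2) ≈ 11.18, which is why the Bayes factor cannot register the interocular nature of s² = 0.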
Thus, Jeffreys was not opposed to incorporating previously acquired data in a Bayesian hypothesis test, but to do so he first designed a starting Bayes factor, for a first data set, say, dorig. After observing dorig, we can then straightforwardly update a Jeffreys’s Bayes factor for a future, not yet observed, data set, say, drep. This informed Bayes factor BF10(drep | dorig) is then constructed from the priors π1(θ1 | dorig) and π0(θ0 | dorig). This idea forms the basis of the replication Bayes factors introduced in Verhagen and Wagenmakers (2014) and is further exploited in Ly et al. (2015). Hence, the man who discovered the origin of the earth also provided us with the starting point for scientific learning.

1.3. Robert’s alternative approach

‘‘Prior distributions must always be chosen with the utmost care when dealing with mixtures and their bearings on the resulting inference assessed by a sensitivity study. The fact that some noninformative priors are associated with undefined posteriors, no matter what the sample size, is a clear indicator of the complex nature of Bayesian inference for those models’’ (Marin & Robert, 2014, p. 199).

As an alternative to Bayes factors, Robert (2016) suggests using a mixture model approach elaborated upon in Kamary, Mengersen, Robert, and Rousseau (2014). The data generating process of a mixture model can be envisioned as a stepwise procedure. First, a membership variable zj is realized; in a two-component mixture, zj assumes either the value zero or one. Next, given the outcome zj = 0 (or zj = 1) a data point xj is generated according to M0 : Xj ∼ f0(xj | θ0) (or M1 : Xj ∼ f1(xj | θ1)). This means that the complete data should consist of n pairs (z1, x1), . . . , (zn, xn), but in reality we only have the observations d = (x1, . . . , xn).
As a result of not observing the membership variables zj, the observations are perceived as if each of the data points were generated from the (arithmetic) mixture model Ma : Xj ∼ (1 − α)f0(xj | θ0) + α f1(xj | θ1), where α is the mixture proportion. The artificial encompassing model Ma therefore contains the two competing models, M0 and M1, as special cases, namely when α = 0 and α = 1 respectively. Hence, to uncover whether the observations are more consistent with M0 or M1, Kamary et al. (2014) suggest focusing on estimating α within the encompassing model Ma. Inferring α amounts to a missing data problem, which is in principle computationally intensive as there are 2^n different combinations for the membership variables zj. Luckily, one can resort to a completion method pioneered by Diebolt and Robert (1994). When this stochastic exploration method yields n0 and n1 numbers of observations allocated to M0 and M1, respectively, the posterior for α is then given by B(a + n1, a + n0), when we use a beta prior on the mixture proportion, α ∼ B(a, a). When n0 is large and n1 small or zero, the posterior for α then concentrates most of its mass near zero, indicating more evidence for M0, as one would expect. Kamary et al. (2014) note that the data generative view of the mixture model is theoretically justified, but that the resulting natural Gibbs sampler has convergence problems when the hyperparameter a is smaller than one. To circumvent this problem, Kamary et al. (2014) propose to use a Metropolis–Hastings algorithm instead and illustrate its use by examples, followed by a proof that shows that the method is asymptotically consistent. Thus, the work by Kamary et al. (2014) impressively introduces an alternative view on testing, an algorithmic resolution, and a theoretical justification.

1.3.1. Testing versus estimation

We believe that the Kamary et al. (2014) mixture approach will be especially useful in psychological research.
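A toy version of this completion scheme can be sketched in a few lines. Everything below (the fixed component means, sample size, and seed) is our own illustrative choice rather than the Kamary et al. (2014) implementation, and we set a = 1 because the natural Gibbs sampler is reported to misbehave for a < 1:

```python
import math
import random

random.seed(1)

# hypothetical data: 50 observations from the null model M0: N(0, 1)
x = [random.gauss(0.0, 1.0) for _ in range(50)]

def gibbs_mixture(x, mu1=1.0, a=1.0, draws=2000):
    """Sample the weight alpha of M1 in (1-alpha)*N(0,1) + alpha*N(mu1,1).

    Component parameters are held fixed for simplicity; only the mixture
    weight alpha and the latent memberships z_j are updated."""
    alpha, keep = 0.5, []
    f0 = [math.exp(-0.5 * xi * xi) for xi in x]          # N(0, 1) kernel
    f1 = [math.exp(-0.5 * (xi - mu1) ** 2) for xi in x]  # N(mu1, 1) kernel
    for _ in range(draws):
        # complete the data: z_j = 1 with probability alpha*f1 / mixture
        n1 = sum(
            random.random() < alpha * g1 / (alpha * g1 + (1 - alpha) * g0)
            for g0, g1 in zip(f0, f1)
        )
        n0 = len(x) - n1
        # conjugate update of the Beta(a, a) prior given the memberships
        alpha = random.betavariate(a + n1, a + n0)
        keep.append(alpha)
    return keep

post = gibbs_mixture(x)
print(sum(post[500:]) / len(post[500:]))  # posterior mass near 0 favors M0
```

With null-generated data the sampled memberships mostly land in the first component, so the posterior for α piles up well below 1/2, mirroring the behavior described above.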
In particular, consider a hierarchical model where each participant’s performance xj on a psychological task is captured by a particular model or strategy represented by fi. The posterior for α then gives an indication of the prevalence of the model or strategy. When the posterior for α is near zero or near one, this suggests that one model or strategy is dominant; when the posterior for α is near 1/2, this suggests that some participants are better described by one strategy, and some are better described by another (for similar approaches see Friston & Penny, 2011; Lee, Lodewyckx, & Wagenmakers, 2015). The advantage of the mixture model approach is particularly acute when it is reasonable to assume that not all participants will follow one or the other strategy. In this special issue for Journal of Mathematical Psychology alone, the articles by Kary, Taylor, and Donkin (2016) and Turner, Sederberg, and McClelland (2016) demonstrate considerable heterogeneity among participants: the behavior of some participants is predicted much better by one model, the behavior of other participants is predicted much better by the competing model, and the behavior of a third set of participants is predicted by the models about equally well (see also Steingroever, Wetzels, & Wagenmakers, in press). The standard Bayes factor test determines whether all participants are better predicted by model M0 or whether all participants are better predicted by model M1. Therefore, one can construct situations in which the data support model M0 for 99 out of 100 participants, and nevertheless the Bayes factor strongly prefers model M1. We believe that in these hierarchical scenarios, the mixture model approach is a valuable technique that can offer additional insight. The above considerations suggest that the mixture approach relaxes Jeffreys’s conceptualization of a hypothesis test.
More precisely, Jeffreys viewed the null hypothesis as a general law, which by definition implies that the membership variables zj are either all zeros or all ones. Note that by embedding the models into an artificial encompassing model, Kamary et al. (2014) transformed the testing problem into one of estimation. Jeffreys, however, did not feel that estimation is appropriate when the test of a general law is at hand:

‘‘Broad used Laplace’s theory of sampling, which supposes that if we have a population of n members, r of which may have a property ϕ, and we do not know r, the prior probability of any particular value of r (0 to n) is 1/(n + 1). Broad showed that on this assessment, if we take a sample of number m and find all of them with ϕ, the posterior probability that all n are ϕ’s is (m + 1)/(n + 1). A general rule would never acquire a high probability until nearly the whole of the class had been sampled. We could never be reasonably sure that apple trees would always bear apples (if anything). The result is preposterous, and started the work of Wrinch and myself in 1919–1923’’. (Jeffreys, 1980, p. 452).

Wrinch and Jeffreys (1919, 1921, 1923) argued that within an estimation framework, a general law such as H0 : ‘‘All swans are white’’ cannot gain much evidence until almost all swans have been inspected.1 Moreover, common sense prescribes that the plausibility of a general law increases with every observation in accordance with the law, that is, s = n successes within n trials. Jeffreys (1961, p. 256) operationalized the general law as a binomial model M0 with θ0 fixed and its negation as the binomial model M1 with θ free to vary. With a uniform prior on θ this then leads to a Bayes factor of BF01(d) = [(n + 1)!/(s! f!)] θ0^s (1 − θ0)^f, where n denotes the total number of trials, s the number of successes, and f the number of failures.
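To sketch this Bayes factor numerically (the trial numbers below are our own toy choice; Python’s exact integer factorials keep the arithmetic honest):

```python
from math import factorial

def bf01(n, s, theta0):
    """BF01 for the general law M0: theta = theta0 versus the binomial
    M1 with a uniform prior on theta; s successes and f = n - s failures."""
    f = n - s
    return (factorial(n + 1) / (factorial(s) * factorial(f))
            * theta0 ** s * (1.0 - theta0) ** f)

print(bf01(10, 10, 1.0))  # only successes: n + 1 = 11 in favor of the law
print(bf01(10, 9, 1.0))   # a single failure: 0.0, the law is falsified
```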
When only successes are observed (i.e., observations consistent with the general law H0 : θ0 = 1), the Bayes factor simplifies to n + 1; a single failure, on the other hand, indicates infinite evidence against the general law: the observation of a single black swan is interocular, as it conclusively falsifies the general law ‘‘all swans are white’’. Hence, Jeffreys’s Bayes factor formalizes inductive reasoning and the logic of proof by contradiction.

The discussion above indicates that the mixture model approach does not formalize inductive reasoning and the logic of proof by contradiction: after having observed 10,000 white swans, the observation of a single black swan will not greatly affect the mixture proportion—the mixture proportion still reflects the fact that there is a great preponderance of white swans. However, in Jeffreys’s conceptualization, the single exception utterly destroys the general law.

Another concern with the mixture model approach is that it is relatively insensitive to the shape of the prior distributions. Of course, this is also its strength, as this is needed to avoid the JLB paradox. However, models that make correct predictions should receive more reward when these predictions are risky, and the degree of risk is partly encoded in the shape of the prior distributions. For instance, suppose we model a binomial parameter θ and assume that M1 : θ ∼ U[1/2, 1] and M2 : θ ∼ U[0, 1]; further, suppose the observed data are highly consistent with the simpler model M1. Because the predictions from M1 are twice as risky as those from M2 we would want to prefer M1 over M2, and in fact, the Bayes factor in favor of M1 against M2 is asymptotically equal to 2 (e.g., Heck, Wagenmakers, & Morey, 2015; Shiffrin, Chandramouli, & Grünwald, 2016).

1.4. Conclusion

Scientific learning involves more than just testing general laws and invariances.
Estimation and exploration are important, and the mixture approach has a lot to offer in this respect, particularly in hierarchical settings where the general law is unlikely to hold for all participants simultaneously. Other advantages of the mixture approach are apparent as well. For instance, Example 3.1 in Kamary et al. (2014) compares a Poisson distribution with parameter λ to a geometric distribution with parameter p (see Robert, 2015 for R code). The comparison begins by relating the parameterizations to each other by setting p = (1 + λ)⁻¹, which allows the use of the improper Jeffreys’s prior (with respect to the Poisson distribution) π(λ) ∝ λ⁻¹ over the two models. Note how this procedure resembles Jeffreys’s recommendation for common parameters even though the arguments differ. Moreover, the resulting posterior π(λ | d) is then calculated from the mixture of the likelihoods of both models. The simulations show that the mixture approach performs well. We do not know how a Jeffreys’s Bayes factor can be constructed to deal with a test between two models of different functional forms, as Jeffreys was only concerned with nested model comparisons (e.g., Robert, 2016).

The mixture approach is not fully automatic, however, and requires some thought on how the priors should be chosen. In particular, one cannot naively use improper priors on the test-relevant parameters, as this may yield posteriors that are also improper (Grazian & Robert, 2015). This was acknowledged by Robert (2016), who used an (arbitrary) standard normal prior on µ in a t-test. Our implementation of this recommendation leads to a posterior median ranging from 0.3 to 0.9, for the interocular data with n = 4, x̄ = 7 and s² = 0, while α should be 1.0 if it were information consistent. More recently, Kamary, Eun, and Robert (2016) proposed a noninformative reparametrization for location-scale mixtures to resolve the aforementioned arbitrariness.
Hence, as with a Jeffreys’s Bayes factor, one should choose the priors carefully when one conceptualizes model selection as parameter estimation in a mixture model. Lastly, Robert notes that the mixture approach is superior to the Bayes factor as it leads to a faster accumulation of α to the null. The parametric convergence rate of √n follows immediately from casting the testing problem as one of estimation. Similarly, it should be noted that Johnson and Rossell (2010) also use the rate of convergence as a motivation for their Bayes factor approach. We are unsure whether this rate is relevant, as we do not consider a testing problem to be one of estimation. In the end, the Bayes factor and the mixture approach of Kamary et al. (2014) simply answer different questions. The choice of which method to use should not be based on the rate of convergence, but on the research question the user seeks to address.2

2. Rejoinder to Chandramouli and Shiffrin

Chandramouli and Shiffrin (2016) put forward a thought-provoking proposal which aims to explain and extend Bayesian induction using simple matrix algebra. We have given this novel idea considerable thought and outline some of our reservations below.3

Footnote 1: We now know that this particular statement does not hold true, since Australia is home to many black swans. The statement itself, however, cannot be discarded until the first exception is actually observed.
Footnote 2: We thank Joris Mulder for alerting us to this point.
Footnote 3: The second and third authors are in a state of perpetual confusion regarding the details of the Chandramouli and Shiffrin proposal. All credit concerning this section goes to the first author, who, as such, takes full responsibility for any errors here.

For a thorough understanding of our reply, we recommend having the comment of Chandramouli and Shiffrin (2016) on hand. We believe that Chandramouli and Shiffrin (henceforth C&S) put forward a belief propagation procedure that allows us to verify
whether two given models, say, M1 and M2, align with a scientist’s prior belief about the true data generating process p∗(X). Instead of setting the priors on the two given models M1 and M2 directly, C&S recommend to first elicit a scientist’s prior belief about the true data generating process p∗(X) in the most general setting. This prior belief is then subsequently translated into priors on the models. Hence, the resulting prior model probabilities P(M1) and P(M2) are derived. In contrast, a Jeffreys’s Bayes factor follows from a top-to-bottom procedure, where the top level is concerned with the comparison between two models (i.e., model classes) for which one has to (subjectively) choose prior model probabilities P(Mi). Based on top-level desiderata, i.e., a coherent comparison between the two models, we then derive the pair of priors π1 and π2 on the lower level that are concerned with the parameters (i.e., model instances) within the models M1 and M2, respectively. In effect, the sole purpose of the pair π1, π2 is to mediate scientific learning through the Bayes factor, that is, to update the prior model odds to posterior model odds. On the other hand, the C&S induction scheme is a bottom-up approach based on the philosophy that the whole is the sum of its parts. At the lowest level, one has to subjectively elicit the scientist’s prior belief about the true data generating process. The procedure then elaborates on how this lowest-level belief can be used to derive the model instance priors π1 and π2 at the intermediate level. By aggregating the model instance priors π1 and π2 we then get the prior model probabilities P(M1) and P(M2) at the top level. As such, this method is not free from subjective input on the lowest level.
Our major concern with the C&S method is the lack of invariance, which stems from their recommendation to operationalize their procedure with a seemingly innocent-looking finite-dimensional matrix with M rows and W columns.4 By using a finite-dimensional matrix, C&S effectively made a choice in how they tackle the statistical modeling problem, and the resulting model priors P(Mi) are sensitive to this choice. More specifically, by initializing their procedure with a finite-dimensional matrix, they use discretized approximations of quantities that are essentially continuous. The approximation error due to discretization is non-negligible, as it permeates through all subsequent steps due to the bottom-up nature of the procedure, leading to an ill-defined Bayes factor. In brief, we believe that the C&S approach has to overcome some challenges before their procedure can be perceived as an extension of traditional Bayes factors, let alone Jeffreys’s Bayes factors. We have three remarks: (1) the C&S procedure is not invariant to how one discretizes the statistical modeling problem; (2) the subjective assessment of the priors on the lowest level and the resulting prior model probabilities P(Mi) on the top level are, therefore, ill-defined; and (3) model selection based on posterior predictive p-statistics does not lead to a proper measure of evidence. This paper continues as follows: We first apply the C&S induction scheme to a concrete example. Then we show that we get different results when we choose a different finite-dimensional matrix to operationalize the C&S induction scheme. Lastly, we argue that the implicit discretization necessary for the finite-dimensional matrix is the main culprit of the resulting lack of invariance.

2.1. Running example

Footnote 4: We divert from the C&S notation, where the matrix is M × N dimensional, as the number of columns does not correspond with the number of samples in a data set.
Instead, the number of columns refers to the number of possible outcomes a random variable can take on; we use w and W instead.

Footnote 5: C&S call the function p(X | 0.9, M1) a data distribution predicted by the model instance ϑ = 0.9. When we use a capital X we mean the three probabilities simultaneously. On the other hand, a small letter x refers to the probability with which it is generated, for instance, p(xw | 0.9, M1) = 0.18 when w = 2.

To illustrate why we believe that the C&S method is essentially a belief propagation procedure, we consider a random variable X with a finite number of outcomes W. This W is denoted as n in Chandramouli and Shiffrin (2016) and defines the number of columns in their matrix representations (i.e., their Figures 1 and 2). To simplify matters, we use an example (taken from Ly et al., 2015) where X has W = 3 outcomes.

Example 1 (A Psychological Task with Three Outcomes). In the training phase of a source-memory task, the participant is presented with two lists of words on a computer screen. List L is projected on the left-hand side and list R is projected on the right-hand side. In the test phase, the participant is then presented with two words, side by side, that can stem from either list, thus, ll, lr, rl, rr. At each trial, the participant is asked to categorize these pairs as either:
• x1, meaning both words come from the left list, i.e., ‘‘ll’’,
• x2, meaning the words are mixed, i.e., ‘‘lr’’ or ‘‘rl’’,
• x3, meaning both words come from the right list, i.e., ‘‘rr’’.
Thus, the random variable X has W = 3 outcomes. To ease the discussion, we assume that the words presented to the participant are ‘‘rr’’. As model M1 we take the so-called individual-word strategy. A participant guided by this strategy will consider each word individually and compare it with list R only. Within this model M1, the parameter is given by θ1 = ϑ, which we interpret as the participant’s ‘‘right-list recognition ability’’.
Hence, when the participant is presented with the pair ‘‘rr’’ she will respond x1 with probability (1 − ϑ)², thus, two failed recollections; x2 with probability 2ϑ(1 − ϑ), thus, one failed and one successful recollection; and x3 with probability ϑ², thus, two successful recollections. More compactly, a participant guided by this strategy generates the outcomes [x1, x2, x3] with the three probabilities p(X | ϑ, M1) = [(1 − ϑ)², 2ϑ(1 − ϑ), ϑ²], respectively. Note the data generative formulation. For instance, when the participant’s true ability is ϑ∗ = 0.9, the three outcomes [x1, x2, x3] are generated with the three probabilities p(X | 0.9, M1) = [0.01, 0.18, 0.81], respectively. We call the function p(X | θi, Mi) with θi fixed a probability mass function (pmf) or model instance of Mi.5 Hence, every ϑ in (0, 1) yields a pmf that defines W probabilities. In effect, the model M1 consists of a collection of pmfs, which C&S refer to as a model class. As a competing model M2, we take the so-called only-mixed strategy. Within this model M2, the parameter is given by θ2 = a, which we interpret as the participant’s ‘‘mixed-list differentiability ability’’. A participant guided by this strategy first checks whether the presented pair of words is mixed: with probability a she perceives the pair as mixed and produces the outcome x2; otherwise, she randomly chooses x1 or x3, each with probability (1 − a)/2. More compactly, a participant guided by this strategy generates the outcomes [x1, x2, x3] with the three probabilities p(X | a, M2) = [(1 − a)/2, a, (1 − a)/2], respectively. Again we formulated the model as a data generative process. For instance, when the participant’s true ability is a∗ = 1/3, the three outcomes [x1, x2, x3] are generated with the same probability, i.e., p(X | 1/3, M2) = [1/3, 1/3, 1/3]. Note that this last pmf
p(X | 1/3, M2) is not in the collection of pmfs defined by M1. Similarly, the pmf p(X | 0.9, M1) is not a member of the collection of pmfs defined by M2. The two models M1 and M2 share only one pmf (model instance), that is, the pmf indexed by ϑ = 0.5 within M1 and, coincidentally, a = 0.5 within M2. We use these two models M1 and M2 to explain the C&S belief propagation procedure.

Table 1: The matrix is a simplified version of the matrix found in Figure 1 of C&S with M = 66 and W = 3. The quantities under the columns with G(ψm, M1) and G(ψm, M2) at the top refer to the KL-divergences; see the main text. The parameter under θi refers to the model instance that the pmf p(X | ψm) is allocated to within the model Mi. For example, the candidate true pmf p(X | ψ18) is allocated to the model instance p(X | ϑ = 0.60, M1) of model class M1.

2.2. Chandramouli and Shiffrin’s procedure for induction

For a Bayesian analysis we need priors on the model instances, which we denote by πi(θi) as we have done before,6 and priors on the models P(Mi). Instead of doing so directly, C&S recommend to first (Step 1) elicit the scientist’s prior belief about the true data generating process p∗(X) in the most general setting. Next (Step 2), this subjectively chosen prior belief about p∗(X) is used to derive the model instance priors πi(θi) and, subsequently, the model class priors P(Mi). Lastly (Step 3), C&S recommend to use posterior p-statistics for inference.

2.2.1. Step 1: Eliciting the prior on candidate true data generating pmfs

In our example, the true data generating pmf p∗(X) defines three probabilities p∗(X) = [p∗(x1), p∗(x2), p∗(x3)] with which it generates the three outcomes [x1, x2, x3]. For instance, a first candidate true data generating pmf could be p(X | ψ1) = [0.0, 0.0, 1.0], where ψ1 is an indicator for later reference.
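Continuing this enumeration mechanically with a step size of 0.1 yields the full candidate set; the following sketch (our own illustration, with names of our choosing) recovers Table 1’s M = 66 rows.

```python
def candidate_pmfs(step=0.1):
    # enumerate all pmfs [b, c, 1 - b - c] on three outcomes whose
    # probabilities are multiples of the chosen step size
    n = round(1 / step)
    pmfs = []
    for i in range(n + 1):          # b = i * step
        for j in range(n + 1 - i):  # c = j * step, with b + c <= 1
            b, c = i * step, j * step
            pmfs.append((round(b, 10), round(c, 10), round(1 - b - c, 10)))
    return pmfs
```

With step 0.1 this produces 66 candidate pmfs, among them [0.8, 0.1, 0.1], the pmf indexed by ψ62 below.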
A second candidate true data generating pmf could be p(X | ψ2) = [0.0, 0.1, 0.9], and so forth. This method yields the candidate set of true pmfs depicted in Table 1. The ‘‘matrix’’ depicted in Table 1 is a simplification of the table in Figure 1 in Chandramouli and Shiffrin (2016) with M = 66 rows and W = 3 columns. Please ignore the quantities to the right of the double vertical line for the moment. Note that the number of rows M = 66 is a result of our arbitrary choice of using a step size of 0.1 on the probabilities. Furthermore, recall that the pmfs p(X | ψm) are candidates for the true data generating pmf p∗(X) and need not have any connection with the models M1 and M2 specified above. Of particular interest is the pmf p(X | ψ62) = [0.8, 0.1, 0.1], which is neither a member of M1 nor of M2,7 but because it defines a valid pmf it is, nonetheless, a candidate true data generating pmf. Given the finite-dimensional matrix of Table 1, C&S then recommend to elicit a scientist’s prior belief by setting prior beliefs λ(ψm) for m = 1, . . . , M, thus, on each candidate true data generating pmf p(X | ψm).8 For example, λ(ψ62) = 0.7 means that the scientist bestows a large portion of belief on the pmf indexed by ψ62 as being the true generating pmf p∗(X). Furthermore, λ(ψ61) + λ(ψ62) + λ(ψ63) = 0.90 means that the scientist is quite sure that the participant will generate the response x1 with 80% chance. As λ represents the scientist’s prior belief, we necessarily require that Σ_{m=1}^{M} λ(ψm) = 1.

Footnote 6: C&S denote this by p0(θi). Instead, we use the Greek letter πi to distinguish this model instance prior from the prior model probability P(Mi) on the top level. The subscript i refers to the model membership.

Footnote 7: A pmf of M1 with p(x1 | ϑ, M1) = 0.8 requires ϑ ≈ 0.11. However, this automatically yields p(x2 | 0.11, M1) = 0.19. Hence, there is no ϑ in M1 that leads to the pmf indexed by ψ62.
Similarly, a pmf of M2 necessarily has p(x1 | a, M2) = p(x3 | a, M2), which is clearly not the case for the pmf indexed by ψ62.

Footnote 8: C&S denote this prior pmf probability as p0(ψm). Instead, we use the Greek letter λ to distinguish this prior pmf probability on the lowest level from the model instance prior πi(θi) on the intermediate level and the prior model probabilities P(Mi) on the top level.

2.2.2. Step 2: Propagating the prior belief to yield the prior model probabilities

Once the prior beliefs λ(ψm) about the true data generating process p∗(X) are chosen, C&S commence their belief propagation procedure by redistributing λ(ψm) over the two models. Recall that a model (class) Mi defines a collection of pmfs (model instances) denoted as p(X | θi, Mi). The allocation of the prior pmf belief of the first candidate true pmf in Table 1, that is, λ(ψ1), is easy, because the associated pmf p(X | ψ1) = [0.0, 0.0, 1.0] does not belong to M2, but it is a member of M1; the pmf indexed by ψ1 is a model instance of M1 when θ1 = ϑ = 1. C&S therefore allocate the prior pmf probability λ(ψ1) to the model instance π1(ϑ = 1) of M1. On the other hand, the pmf p(X | ψ62) = [0.8, 0.1, 0.1] is a member of neither M1 nor M2. To nonetheless allocate this prior pmf belief λ(ψ62) to a model instance of either M1 or M2, C&S use a divergence measure denoted by G. For simplicity we take as G the Kullback–Leibler (KL) divergence, which is a measure of dissimilarity. The KL-divergence from a candidate true pmf indexed by ψm to a model instance of Mi is defined as

G(ψm, θi | Mi) = Σ_{w=1}^{W} p(xw | ψm) log[ p(xw | ψm) / p(xw | θi, Mi) ],   (6)

and the larger this divergence, the more dissimilar the model instance p(X | θi, Mi) is from the candidate true data generating pmf p(X | ψm).
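Eq. (6) translates directly into code. The sketch below is our own illustration (names are ours); it uses the usual KL convention 0 · log(0/q) = 0, and returns infinity when a model instance assigns zero probability to an outcome the candidate pmf generates.

```python
import math

def pmf_m1(theta):
    # individual-word strategy of Example 1: [x1, x2, x3] for an ''rr'' pair
    return [(1 - theta)**2, 2 * theta * (1 - theta), theta**2]

def G(p_candidate, p_instance):
    # Eq. (6): KL divergence from a candidate pmf to a model instance
    total = 0.0
    for pw, qw in zip(p_candidate, p_instance):
        if pw == 0.0:
            continue  # convention: 0 * log(0/q) = 0
        if qw == 0.0:
            return math.inf  # candidate puts mass where the instance puts none
        total += pw * math.log(pw / qw)
    return total
```

As the main text notes, the divergence is zero exactly when the two pmfs coincide, and strictly positive otherwise.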
For example, a direct calculation shows that the KL-divergence from the candidate true p(X | ψ1) in Table 1 to the model instance of M1 with θ1 = ϑ = 1.0 is given by G(ψ1, θ1 = 1.0 | M1) = 0. The KL-divergence is zero if and only if the pmf indexed by ψm and the model instance indexed by θi are exactly the same; hence, their dissimilarity is zero. The KL-divergence between the candidate true p(X | ψm) and the collection of pmfs defined by the model Mi is given by G(ψm, Mi) = min_{θi} G(ψm, θi | Mi). That is, the dissimilarity between the candidate true data generating pmf ψm and the model Mi is the smallest dissimilarity between p(X | ψm) and the model instances p(X | θi, Mi) of model Mi. For example, a direct calculation shows that the KL-divergence from the candidate true data generating pmf p(X | ψ62) to M1 is given by G(ψ62, M1) = G(ψ62, θ1 = 0.15 | M1) = 0.137. Similarly, the KL-divergence from the same candidate true data generating pmf to M2 is given by G(ψ62, M2) = G(ψ62, θ2 = 0.1 | M2) = 0.310. Because the divergence from the candidate true pmf p(X | ψ62) to M1 is smaller than the divergence to M2, the C&S procedure implies that we should allocate the prior pmf probability λ(ψ62) to the prior model instance probability π1(ϑ = 0.15) belonging to M1. We suspect that the underlying idea of this belief allocation procedure is based on the idea of chaining. Thus, if λ(ψ62) = 0.70, the scientist has much faith in p(X | ψ62) being the true data generating pmf. However, as p(X | ψ62) is in neither M1 nor M2, the C&S procedure recommends the next best thing: assigning the pmf prior λ(ψ62) to the model instance that is most similar to p(X | ψ62), in this case, π1(ϑ) with ϑ = 0.15.
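The reported values G(ψ62, M1) = 0.137 and G(ψ62, M2) = 0.310 can be reproduced numerically. The sketch below is our own illustration: it takes G to be the KL divergence of Eq. (6) and replaces the exact minimization over θi by a parameter grid of step 0.01.

```python
import math

def kl(p, q):
    # Eq. (6) with G taken as the Kullback-Leibler divergence
    total = 0.0
    for pw, qw in zip(p, q):
        if pw == 0.0:
            continue  # convention: 0 * log(0/q) = 0
        if qw == 0.0:
            return math.inf
        total += pw * math.log(pw / qw)
    return total

def divergence_to_model(p, model_pmf, grid_size=100):
    # G(psi, Mi) = min over theta of G(psi, theta | Mi), on a parameter grid
    return min((kl(p, model_pmf(k / grid_size)), k / grid_size)
               for k in range(grid_size + 1))

psi62 = [0.8, 0.1, 0.1]
g1, theta1 = divergence_to_model(psi62, lambda t: [(1 - t)**2, 2*t*(1 - t), t**2])
g2, theta2 = divergence_to_model(psi62, lambda a: [(1 - a)/2, a, (1 - a)/2])
```

This recovers G(ψ62, M1) ≈ 0.137 at ϑ = 0.15 and G(ψ62, M2) ≈ 0.310 at a = 0.10, so λ(ψ62) is allocated to M1.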
This redistribution of the pmf prior λ(ψm) to model instance priors can be read from their table in Figure 1 in Chandramouli and Shiffrin (2016) from left to right.9 In our Table 1, the numbers under G(ψm, M1) and G(ψm, M2) represent the KL-divergences from the candidate true pmf indexed by ψm to the models M1 and M2, respectively. The parameter value under θi indicates which parameter value ϑ within M1 or a within M2 corresponds to the model instance that is closest to the pmf of ψm. The last column indicates whether ψm is eventually allocated to M1 or M2. As in their table in Figure 1 of Chandramouli and Shiffrin (2016), note that multiple candidates ψm are allocated to certain parameter values in our Table 1. For example, the candidate pmfs indexed by ψ3 and ψ12 are both allocated to the same model instance indexed by ϑ = 0.90 within M1. As such, C&S derive the prior on the model instances as π1(ϑ) = Σ λ(ψm), where the sum is over the candidates ψm that have the same ϑ in the column under θi. For example, π1(ϑ = 0.90) = λ(ψ3) + λ(ψ12). After allocating all M prior pmf probabilities λ(ψm) to the model instances of either model class, we have π1(ϑk) and π2(ak̃) for k = 1, . . . , K and k̃ = 1, . . . , K̃. Here K indicates the number of unique values of ϑ in the column under θi. As multiple candidates are allocated to certain parameter values, we typically have K + K̃ < M. With the model instance priors at hand, the C&S scheme tells us to aggregate them to yield the prior model probabilities, i.e., P(M1) = Σ_{k=1}^{K} π1(ϑk) and P(M2) = Σ_{k̃=1}^{K̃} π2(ak̃). As a result of Σ_{m=1}^{M} λ(ψm) = 1, we have P(M1) + P(M2) = 1.

2.2.3. Step 3: Posterior predictive p-statistics
So far, we have only discussed the C&S belief propagation procedure as a method to translate a scientist’s prior belief λ(ψ) about the true data generating process p∗(X) into prior beliefs on the model instances πi(θi), which can then be used to define prior beliefs on the models P(Mi). These priors can be used for inference after we observe data d. As in C&S, we simplify the discussion by supposing that the data consist of one observation, where the participant responded with x1. To invert the data generative view of pmfs, we fix the data part of each pmf at the observation, p(X | ψm) = p(d | ψm), and consider the pmfs as functions of ψm, i.e., as likelihood functions. Bayes’ rule then allows us to update the subjectively chosen pmf prior to a pmf posterior using all specified candidate likelihood functions indexed by the ψm, that is, λ(ψm | d) = p(d | ψm)λ(ψm)/C, for m = 1, . . . , M, where the normalization constant is given by C = Σ_{m=1}^{M} p(d | ψm)λ(ψm). Recall that the rows p(X | ψm), thus, the likelihood functions, themselves do not need to belong to the models M1 and M2. In fact, most of them do not, as most of the entries under G(ψm, M1) and G(ψm, M2) are non-zero.

Footnote 9: We are unsure what ϕ in their table indicates.

For inference concerning replication studies, C&S recommend using posterior predictive p-statistics. For example, the observations dorig of the original experiment might suggest that a participant’s ‘‘right-list recognition ability’’ ϑ is one half. To test whether this postulate ϑ = 0.5 can be reproduced, C&S recommend to update the subjectively chosen pmf prior about the true p∗(X) to a posterior, yielding λ(ψm | dorig). Recall that this posterior is also based on likelihood functions p(d | ψm) that do not belong to M1, as discussed above. For example, if λ(ψ62) > 0, then p(X | ψ62) = [0.8, 0.1, 0.1] in Table 1 is used as a likelihood to relate the observations dorig to ψ62.
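The update of λ(ψm) to λ(ψm | d), and the posterior predictive that C&S build from it, can be sketched as follows. This is our own illustration; the candidate pmfs and prior in the usage below are hypothetical toy values, not entries of Table 1.

```python
def posterior_over_candidates(candidates, lam, obs):
    """Bayes' rule over the candidate pmfs: lam(psi_m | d) is proportional to
    p(d | psi_m) * lam(psi_m), for a single observed outcome with index obs."""
    weights = [pmf[obs] * prior for pmf, prior in zip(candidates, lam)]
    C = sum(weights)  # normalization constant
    return [w / C for w in weights]

def posterior_predictive(candidates, post):
    # p(x_w | d) = sum over m of p(x_w | psi_m) * lam(psi_m | d)
    n_outcomes = len(candidates[0])
    return [sum(pmf[w] * q for pmf, q in zip(candidates, post))
            for w in range(n_outcomes)]

# toy usage: two candidate pmfs with equal prior mass, one observation of x1
candidates = [[0.8, 0.1, 0.1], [1/3, 1/3, 1/3]]
post = posterior_over_candidates(candidates, [0.5, 0.5], 0)
pred = posterior_predictive(candidates, post)
```

Note that every candidate pmf contributes to the predictive, whether or not it belongs to M1 or M2; this is exactly the feature criticized next.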
Because there is no ϑ that leads to p(X | ψ62) (see the footnote at the end of Section 2.2.1), the likelihood function at ψ62 does not and cannot extract information about ϑ from dorig. Nonetheless, C&S use the posterior λ(ψm | dorig) to weight all the candidate true pmfs in Table 1, resulting in a posterior predictive p(xw | dorig) = Σ_{m=1}^{M} p(xw | ψm)λ(ψm | dorig) for w = 1, . . . , W. This posterior predictive is used as a sampling distribution, i.e., it defines the probabilities with which new data are generated. If the actual observation drep is very improbable under this predictive, then the C&S procedure labels this a failure of reproducibility. The problem with this prediction is that it is also calculated from the predictions of p(X | ψ62), even though this pmf ψ62 has no connection to ϑ whatsoever. In sum, it seems that the C&S recommendation for replication boils down to assessing the observed data drep of a replication attempt against the posterior predictive as a sampling distribution, which is based on irrelevant likelihood functions and subjective beliefs λ(ψm). Moreover, by using the posterior predictive as a sampling distribution to assess replication, this method shares many pitfalls with common p-value tests and therefore does not quantify evidence (e.g., Bayarri & Berger, 2000; Wagenmakers, 2007).

2.2.4. C&S Bayes factors

Although C&S do not recommend Bayes factors for inference, they note that Bayes factors can be constructed from their belief propagation procedure. The main idea is to reuse the belief propagation procedure, but this time to redistribute the posterior beliefs λ(ψm | d) about the true data generating process p∗(X) to posterior beliefs for the ‘‘model instances’’ πi(θ̂i | d), which can then be used to define posterior beliefs on the ‘‘models’’ P(M̂i | d).
We are reluctant to call P(M̂i | d) posterior model probabilities, because they are calculated using likelihood functions that do not belong to Mi (hence the hats in our notation). There are two ways to derive a Bayes factor based on the quantities resulting from the C&S belief propagation procedure. The first method involves the ratio of the posterior and prior model odds, that is,

BF̂12(d) = [P(M̂1 | d)/P(M̂2 | d)] / [P(M1)/P(M2)].   (7)

This Bayes factor depends on the subjectively chosen prior beliefs λ(ψm) about p∗(X), the chosen divergence measure G, and – most troublesome – on the collection of candidate likelihood functions p(d | ψm) rather than on the likelihoods that belong to the respective models. The second method involves the ratio of marginal likelihoods, that is,

BF̃12(d) = [Σ_{k=1}^{K} p(d | ϑk, M1)π1(ϑk)] / [Σ_{k̃=1}^{K̃} p(d | ak̃, M2)π2(ak̃)].   (8)

In contrast to BF̂12(d), this Bayes factor is calculated from the likelihoods p(d | θi, Mi) that actually do belong to the respective models. Hence, BF̂12(d) and BF̃12(d) will differ from each other. We have some reservations about the Bayes factor as defined in Eq. (7) or Eq. (8) as a generalization of traditional Bayes factors. First, a traditional Bayes factor leads to the same quantity whether it is computed as the ratio of the posterior and prior model odds or as the ratio of marginal likelihoods. Second, a traditional Bayes factor would involve continuous integrals whenever the parameters ϑ and a are free to vary in continuous intervals. The replacement of the integrals by finite sums is an artifact of considering only a finite number M of candidate true pmfs p(X | ψm).

2.3. Lack of invariance
Our major concern with Bayes factors calculated from the C&S approach, however, is rooted in their operationalization using a finite-dimensional matrix (e.g., Table 1), as it causes a lack of invariance affecting every step of the belief propagation procedure. As such, two scientists with the same subjective belief λ(ψ) about the true p∗(X), using the same divergence measure G, but with different finite-dimensional matrices, will calculate different Bayes factors. We appreciate the attempt by C&S to assess how well models represent the true data generating process. Their procedure considers all possible data generating pmfs and as such can account for model misspecification. Although attractive, such an unrestrictive view leads to complications when one is concerned with testing models for which one has to set priors. The C&S recommendation is to do so subjectively, which we consider nigh impossible. More specifically, the collection of all possible data generating pmfs P is typically hard to describe, and without a proper description it is even harder to subjectively assign prior beliefs to. Our paper continues as follows: (1) we first characterize P and simplify it with a parameterization; (2) a different parameterization of P is then given, leading to a different finite-dimensional matrix; (3) in effect, this leads to different prior beliefs and (4) different allocations, thus, different Bayes factors; (5) lastly, we remark how this problem is related to the invariance problem already solved by Jeffreys (1946) and what his solution implies for the C&S procedure.

2.3.1. Characterizing the collection of all possible pmfs

When X has W = 3 outcomes, its true distribution p∗(X) can be characterized by W − 1 = 2 parameters. Recall that a pmf for X defines the three probabilities p(X) = [p(x1), p(x2), p(x3)] with which it generates the outcomes [x1, x2, x3].
The pmf must therefore satisfy two conditions: (i) it has to be non-negative and bounded by one, i.e., 0 ≤ p(xw) ≤ 1 for each outcome xw of X with w = 1, . . . , W, and (ii) the probabilities must sum to one, i.e., Σ_{w=1}^{W} p(xw) = 1. Note that this holds true for any candidate true pmf p(X | ψm) in Table 1. We call the collection of functions for which conditions (i) and (ii) hold the collection of all possible pmfs, or the full model, and denote it by P. The collection P has an uncountably infinite number of members, each capable of being the true data generating process p∗(X). By using a finite-dimensional matrix such as the one in Table 1, C&S thus restrict their prior belief elicitation to only M = 66 candidate true pmfs. To show that even for W = 3 the full model P is uncountable, we first parameterize P, that is, we identify each possible true pmf of P with a two-dimensional parameter ψ = (b, c). Given any pmf p(X) = [p(x1), p(x2), p(x3)], we define b = p(x1), c = p(x2) and set ψ = (b, c). This construction is essentially a function ξ that maps a member of the full model P into a parameter space Ψ of dimension W − 1 = 2. Using the inverse parameterization ξ⁻¹ we can identify every parameter ψ = (b, c), where (i′) 0 ≤ b, c ≤ 1 and (ii′) b + c ≤ 1, with a pmf such that the three outcomes [x1, x2, x3] are generated with the probabilities p(X | ψ) = [b, c, 1 − b − c]. As there are an uncountable number of pairs ψ = (b, c) for which conditions (i′) and (ii′) hold, we conclude that there is also an uncountable number of pmfs p(X | ψ) in the full model P for which (i) and (ii) hold.

2.3.2. Different parameterizations, different representation of P: A different set of candidate true pmfs

The aforementioned parameterization ξ : P → Ψ relates to the candidate true pmfs of Table 1 as we have actually chosen ψ1 = (0.0, 0.0), ψ2 = (0.0, 0.1), . . . , ψ62 = (0.8, 0.1), ψ63 = (0.8, 0.2), ψ64 = (0.9, 0.0), ψ65 = (0.9, 0.1), ψ66 = (1.0, 0.0).
The resulting M = 66 rows are due to the dependence between b and c. A different parameterization ξ̃ from the full model P to a parameter space Ψ̃ is based on a ‘‘stick-breaking’’ approach. Given a p(X) we choose b̃ = p(x1), c̃ = p(x2)/[1 − p(x1)] and define ψ̃ = (b̃, c̃).10 Using the inverse parameterization ξ̃⁻¹ we can also identify every parameter ψ̃ = (b̃, c̃), where (i′ + ii′) 0 ≤ b̃, c̃ ≤ 1, with a pmf such that the three outcomes [x1, x2, x3] are generated with the probabilities p(X | ψ̃) = [b̃, (1 − b̃)c̃, (1 − b̃)(1 − c̃)]. Note that every parameter ψ̃ lies within the unit square Ψ̃ = [0, 1] × [0, 1], and that b̃ and c̃ can be chosen independently from each other. Again, as there are an uncountable number of elements in the unit square, we have an uncountable collection of candidate true pmfs P. With this stick-breaking representation of P and a step size of 0.1 we get the matrix depicted in Table 2. This new matrix differs substantially from the previous one. First, it has more rows, thus a larger number of candidate true pmfs: M = 111 compared to M = 66 in Table 1. Second, there are more candidate pmfs that imply that the first response x1 is generated with 80% chance: eleven in Table 2 compared to three in Table 1.

2.3.3. Different representation, different prior beliefs

Expanding on these observations, we suspect that a scientist would subjectively set different prior beliefs depending on whether she is confronted with the matrix of Table 1 or with the matrix of Table 2. In particular, when confronted with the matrix of Table 1 the scientist might subjectively set λ(ψ61) = λ(ψ63) = 0.1 and λ(ψ62) = 0.7, meaning that she is quite sure that the participant will generate the response x1 with 80% chance, i.e., P(p∗(x1) = 0.80) = 0.9.
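A sketch of the stick-breaking discretization (our own illustration, with names of our choosing) confirms the counts just mentioned: M = 111 distinct candidate pmfs with step 0.1, of which eleven have p(x1) = 0.8, versus three out of M = 66 under the parameterization ξ.

```python
def stick_breaking_pmfs(step=0.1):
    # parameterization xi-tilde: psi = (b, c) on the unit square maps to
    # the pmf [b, (1-b)c, (1-b)(1-c)]; the duplicates at b = 1 (all c give
    # [1, 0, 0]) collapse because we collect the pmfs in a set
    n = round(1 / step)
    pmfs = set()
    for i in range(n + 1):
        for j in range(n + 1):
            b, c = i * step, j * step
            pmfs.add((round(b, 10),
                      round((1 - b) * c, 10),
                      round((1 - b) * (1 - c), 10)))
    return pmfs
```

The 121 grid points of the unit square yield 111 distinct pmfs because the ten duplicates at b̃ = 1 collapse, matching the row count of Table 2.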
To cohere to this belief, the scientist would simply set λ(ψ̃89) = λ(ψ̃99) = 0.1 and λ(ψ̃94) = 0.7 and, subsequently, set the prior belief of all the ‘‘in-between’’ pmfs that generate x1 with 80% chance to zero in Table 2. We highly doubt that any scientist would be so specific in formulating her prior beliefs and, thus, doubt that a subjective assessment of the prior beliefs will work here.

Footnote 10: This only works if p(x1) ≠ 1. When p(x1) = 1, we simply set c̃ = 0 and define ψ̃ = (1, 0).

Table 2: The matrix is a simplified version of the matrix found in Figure 1 of C&S based on the different parameterization ξ̃ defined in the text. Note how the pmf p(X | ψ̃19) is allocated to M2.

As an alternative, we might think that we are noninformative if we give each candidate true pmf the same prior probability. This means that we give each candidate true pmf of Table 1 a prior probability of λ(ψm) = 1/66 ≈ 0.0152; the pmfs under which the participant generates the response x1 with 80% chance then receive a total prior probability of 3/66 ≈ 0.0455. On the other hand, a uniform prior on Table 2 assigns λ(ψ̃m) = 1/111 ≈ 0.009, and the pmfs under which the participant generates the response x1 with 80% chance then receive a total prior probability of 11/111 ≈ 0.099. Hence, a different set of candidate true pmfs will lead to a different assessment of prior beliefs. This lack of invariance depends on how many and which true candidate pmfs are chosen from P in constructing the finite-dimensional matrices of Tables 1 and 2.

2.3.4. Different representation, different prior model probabilities, thus, different Bayes factors

Applying the C&S belief propagation procedure to the matrix of Table 1 yields different allocations, thus, different Bayes factors than when we use the matrix of Table 2.
For example, a scientist might believe that the true data generating pmf is close to p(X | ψ18) = [0.1, 0.6, 0.3] of Table 1 and thus chooses λ(ψ18) = 0.50. This prior belief then gets allocated to the model instance p(X | ϑ = 0.6, M1) of M1. Similarly, we would expect that the scientist would also set λ(ψ̃19) ≈ 0.50 when confronted with Table 2, because the candidate pmf p(X | ψ̃19) = [0.10, 0.63, 0.27] in the second matrix does not differ much from the pmf p(X | ψ18) of the first matrix. However, according to the second matrix the prior pmf probability λ(ψ̃19) is then allocated to the model instance p(X | a = 0.63, M2) of M2. In effect, a different representation leads to a different belief allocation, thus different priors πi(θi) and prior model probabilities P(Mi), different posteriors P(Mi | d), and, consequently, different Bayes factors. As such, our understanding of the C&S belief propagation procedure leads to an inadequate definition of Bayes factors, one that depends on how we choose to represent P.

2.3.5. Jeffreys's prior and the C&S procedure

This lack of invariance is due to errors incurred from (1) the parameterization ξ itself, and (2) the discretization of the parameter space. For example, the matrices depicted in Tables 1 and 2 were derived from the parameterizations ξ and ξ̃, respectively, followed by a discretization of the parameter space with a step size of 0.1 in each coordinate. The first point can be repaired, as Jeffreys (1946) showed that the Fisher information can be used to neutralize the parameterization error. This solution is more commonly known as Jeffreys's prior. In Ly et al. (2015) we showed that Jeffreys's prior on the parameters, say, ψ = (b, c) in Ψ, leads to a uniform prior on pmfs in P. The second point, however, cannot be fixed.
To elaborate on this latter point, recall that the collection of all data generating pmfs P is uncountably large, which means that the scientist's actual prior belief λ(ψ) is a continuous quantity. By using a finite number M of candidate true data generating pmfs, the target continuous random variable λ(ψ) is approximated by a discretized version λ(ψm). The corresponding discretization errors are comparable to the errors incurred when histograms are used to approximate a smooth density function. Moreover, because the actual belief about ψ is continuous, the probability that the true data generating process p∗(X) is exactly equal to one of the finitely many candidate pmfs p(X | ψm) is zero. As such, we cannot construct the actual belief λ(ψ) from point masses. Note that this continuity issue was already alluded to in Section 2.3.3, as one would expect that if the pmfs indexed by ψ̃89, ψ̃94, ψ̃99 in Table 2 are assigned some prior mass, the pmfs in between would also receive some prior mass. The implication is that the C&S procedure might only work if we use a ''matrix'' with an uncountable number of rows.

Furthermore, the discretization leads to another type of approximation error that we refer to as geometric approximation error, due to the chosen divergence measure G. This error was alluded to in Section 2.3.4, where a small change in the candidate true data generating pmf, from p(X | ψ18) = [0.1, 0.6, 0.3] to p(X | ψ̃19) = [0.10, 0.63, 0.27], leads to a completely different allocation of the prior belief: from a model instance of M1 to one of M2.
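The geometric approximation error can be made concrete with a generic KL-based allocator. The instance sets below are invented for illustration only, since the text does not fully specify the instances of M1 and M2:

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between pmfs on the same outcomes."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def allocate(p, instances):
    """Assign candidate pmf p to the model whose instance is KL-closest,
    mimicking a divergence-based allocation step; `instances` maps a model
    name to a list of its instance pmfs (hypothetical values below)."""
    pairs = [(name, q) for name, qs in instances.items() for q in qs]
    return min(pairs, key=lambda nq: kl(p, nq[1]))[0]

# Two nearby candidate pmfs end up allocated to different models:
instances = {"M1": [[0.1, 0.6, 0.3]], "M2": [[0.10, 0.63, 0.27]]}
print(allocate([0.1, 0.61, 0.29], instances))  # KL-closest to the M1 instance
print(allocate([0.1, 0.62, 0.28], instances))  # KL-closest to the M2 instance
```

The KL-divergence between the two instances themselves is only about 0.002, yet candidate pmfs lying between them flip from one model to the other, which is exactly the lack of robustness described above.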
The geometrical interpretation stems from the fact that the KL-divergence can be thought of as a generalization of the Fisher information metric.¹¹ Moreover, it follows directly from the geometric interpretation that the C&S belief propagation procedure favors the more complex model, as it will attract a larger number of candidate data generating pmfs indexed by ψm; see Ly et al. (2015). This a priori boosting of the more complex model is at odds with the simplicity postulate that seems to be central in the foundations of the C&S procedure; see Shiffrin et al. (2016) in this special issue.

The fact that we cannot construct the actual belief λ(ψ) from point masses is at odds with the C&S idea that P(Mi) is the sum of its parts. This bottom-up view is what caused Shiffrin et al. (2016) to avoid overlapping models; when M1 and M2 share a pmf and the shared instance receives some prior mass, this prior mass will be accounted for twice. As a result, the prior model probabilities will then exceed one, i.e., P(M1) + P(M2) > 1. To deal with overlapping models, Shiffrin et al. (2016) suggested removing the common pmfs from the larger model. They elaborate on this idea with a toy example in which M3 is a binomial model with the chance of success θ fixed at θ = 0.5 and M4 represents the binomial model in which θ is free to vary between zero and one. They then reformulate M4 as the binomial model M̃4 in which θ is free to vary within (0, 0.49) ∪ (0.51, 1). This replacement of M4 by M̃4 leads to another approximation error. One solution would be to allow M̃4 to converge to M4 by allowing θ to be in (0, 0.5 − ϵ) ∪ (0.5 + ϵ, 1). This construction, however, depends on how ϵ goes to zero and induces the Borel–Kolmogorov paradox (e.g., Lindley, 1997; Wetzels, Grasman, & Wagenmakers, 2010).
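For what it is worth, the M3-versus-M4 toy comparison can be carried out without removing the shared instance: under a continuous uniform prior on θ the point θ = 0.5 receives zero prior mass, and the Bayes factor is available in closed form. A minimal sketch (our illustration, not the C&S procedure):

```python
from math import comb

def bf_point_vs_uniform(n, s):
    """Bayes factor for M3 (binomial, theta fixed at 0.5) against M4
    (binomial, theta uniform on (0, 1)) after s successes in n trials.
    The uniform-prior marginal likelihood integrates to 1 / (n + 1)."""
    marg_m3 = comb(n, s) * 0.5 ** n   # likelihood at theta = 0.5
    marg_m4 = 1.0 / (n + 1)           # integral of C(n,s) t^s (1-t)^(n-s) dt
    return marg_m3 / marg_m4

print(bf_point_vs_uniform(10, 5))   # data near 0.5 favor the point null
print(bf_point_vs_uniform(10, 9))   # extreme data favor the free theta
```

This only shows that overlapping support poses no problem for the Bayes factor itself; it does not repair the sum-of-parts bookkeeping that motivated the carve-out, nor the Borel–Kolmogorov issue raised by the ϵ construction.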
The Borel–Kolmogorov paradox is another indication of how the C&S belief propagation scheme depends on how we as scientists represent the problem in terms of the chosen parameterization and, subsequently, discretize the parameter space. In other words, we believe that the lack of invariance is inescapable when the C&S approach is operationalized with a finite-dimensional matrix; this oversimplification of the problem results in a representation that is not on par with the sophisticated ideas behind the C&S approach.

2.4. Conclusion

Based on the different strategies used to set priors πi(θi) within the models Mi, we conclude that the C&S belief propagation procedure answers a different question than a traditional Bayes factor. We believe that C&S are mostly concerned with how a scientist's subjective knowledge of the true data generating p∗(X) is permeated in the models M1 and M2. Hence, C&S focus on checking whether the models M1 and M2 give a good representation of expert knowledge. As such, we think that the C&S approach can be valuable at the preliminary stage of model building. In particular, by considering all possible data generating pmfs for the random variable X, the C&S procedure forces the statistician to focus on building a model that is relevant for the problem at hand, rather than being restricted by the standard models.

¹¹ The KL-divergence is not a metric in the formal sense; only its infinitesimal version can be related to the Fisher information as a metric, i.e., Jeffreys's prior.

We would like to emphasize that our remarks are not aimed at the aspiration of C&S to construct good models that mimic nature well. Our major concern deals with the finite-dimensional representation that C&S use to operationalize their procedure and the recommendation to set λ(ψ) subjectively. The idea of considering the full model P is to account for misspecification; as a result, however, the subjective assessment of prior beliefs is nigh impossible.
Note that the subjective belief λ(ψ) is necessarily a continuous random variable, because the full model P contains an uncountable number of candidate true pmfs p(X | ψ). To make their procedure viable, C&S oversimplify the problem with a finite-dimensional matrix, yielding approximation errors that cannot be ignored.

The problem worsens when X is also continuous. In that case, the full model should be represented by a ''matrix'' with an uncountable number of rows and columns. Moreover, this full model is far too complex, as it does not even allow for consistent inference (Dvoretzky, Kiefer, & Wolfowitz, 1956). This is why regularization methods were invented and alternative models were proposed that grow with the number of samples (e.g., Bickel, 2006). The goal set by C&S to compare models in a totally unrestricted setting is ambitious and an active area of research that is progressing slowly; see Borgwardt and Ghahramani (2009), Ghosal, Lember, and Van Der Vaart (2008), Holmes, Caron, Griffin, and Stephens (2015), Labadi, Masuadi, and Zarepour (2014), and Salomond (2013, 2014) for some recent results.

For estimation problems, one solution would be to forgo the finite matrix representation and consider the prior on P as a continuous random variable instead. As a replacement for the subjective assessment, we then recommend Jeffreys's prior, as it is uniform on P when X has a finite number of outcomes W. A Jeffreys prior for the full model P is viable when W < ∞, as the distribution of X is then at most a multinomial distribution with W categories. When X is continuous, the Jeffreys prior can be extended by a method described in Ghosal, Ghosh, and Ramamoorthi (1997), which has been used successfully to justify Bayesian nonparametric estimation methods; see also Ghosal, Ghosh, and Van Der Vaart (2000) and Kleijn (2013).
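To make this recommendation concrete: when X has W < ∞ outcomes, the Jeffreys prior of the full multinomial model is the Dirichlet(1/2, ..., 1/2) distribution, which is uniform on P in the Fisher-information geometry. A sketch of sampling from it, using the standard normalized-Gamma construction (the function name is ours):

```python
import random

def sample_jeffreys_pmf(w, rng=None):
    """Draw one pmf over w outcomes from the Jeffreys prior of the full
    multinomial model, i.e. Dirichlet(1/2, ..., 1/2), by normalizing
    independent Gamma(1/2, 1) variates."""
    rng = rng or random.Random()
    g = [rng.gammavariate(0.5, 1.0) for _ in range(w)]
    total = sum(g)
    return [x / total for x in g]

p = sample_jeffreys_pmf(3, random.Random(2016))
print(p)  # a random point in the simplex of three-outcome pmfs
```

In contrast to the discretized matrices above, every pmf in P receives positive prior density under this continuous prior, and no representation-dependent rounding is involved.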
However, this replacement of the discretized λ(ψm) by a continuous version λ(ψ) is at odds with the philosophy that the prior on the whole, P(Mi), is a sum of its parts πi(θi), as the individual model instances then necessarily receive zero mass. Furthermore, we do not know how to translate a continuous λ(ψ) on all pmfs P to the model instances πi of Mi without an explicitly defined relationship between the true data generating p∗(X) and the model instances of Mi. In effect, we doubt that the C&S procedure extends traditional Bayes factors and that it is capable of yielding a Jeffreys's Bayes factor that formalizes inductive reasoning and the logic of proof by contradiction. The reason for this doubt is that C&S do not focus on the two models under test; instead, they embed these two models within a larger encompassing model, as Robert did (see Section 1).

In conclusion, we believe that a Jeffreys's Bayes factor remains the preferred method of inference, because a Jeffreys's Bayes factor does not depend on how the full model P is represented and discretized. Thus, it does not suffer from the lack of invariance discussed above. Furthermore, a Jeffreys's Bayes factor does not require a subjective elicitation of prior beliefs. Note that the Bayes factor focuses on comparing the models M1 and M2; no reference is made to any true data generating process p∗(X). Jeffreys was mostly concerned with quantifying the (relative) evidence provided by the observations for either model. The Bayes factor is not concerned with the true data generating process p∗(X) and it does not aspire to be. Both M1 and M2 could be poor descriptions of the true data generating pmf p∗(X), but fortunately it has been shown that the model selected with a Bayes factor is the model closest to the true p∗(X) in terms of KL-divergence (e.g., Dass & Lee, 2004).
Hence, the model that is preferred by the Bayes factor will be able to generalize better to yet unseen data, a guarantee that aligns with the spirit of the C&S approach.

3. Conclusion

We would like to thank the authors of both comments for their stimulating remarks and for their creative alternatives and extensions to Jeffreys's Bayes factors. We hope this discussion has resulted in a renewed appreciation for Harold Jeffreys's foundational contributions to model selection and hypothesis testing, and we look forward to future developments in this exciting and important area of research.

Acknowledgments

This work was supported by the starting grant ''Bayes or Bust'' awarded by the European Research Council (grant number: 283876). Correspondence concerning this article may be addressed to Alexander Ly. We thank Maarten Marsman and Joris Mulder for their comments on an earlier draft of the manuscript.

References

Bayarri, M., & Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452), 1127–1142.
Bayarri, M., Berger, J., Forte, A., & García-Donato, G. (2012). Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), 1550–1577.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer-Verlag.
Berger, J. O., & Pericchi, L. R. (2001). Objective Bayesian methods for model selection: Introduction and comparison. In Lecture notes–monograph series (pp. 135–207).
Berger, J. O., Pericchi, L. R., & Varshavsky, J. A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhyā: The Indian Journal of Statistics, Series A, 307–321.
Bickel, P. (2006). Regularization in statistics. Test, 15(2), 271–344. With discussion by Li, Tsybakov, van de Geer, Yu, Valdés, Rivero, Fan, van der Vaart, and a rejoinder by the author.
Bickel, P. J., & Kleijn, B. J. K. (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics, 40(1), 206–237.
Borgwardt, K. M., & Ghahramani, Z. (2009). Bayesian two-sample tests. arXiv preprint arXiv:0906.4032.
Chandramouli, S., & Shiffrin, R. (2016). Extending Bayesian induction. Journal of Mathematical Psychology, 72, 38–42.
Consonni, G., & Veronese, P. (2008). Compatibility of prior specifications across linear models. Statistical Science, 23(3), 332–353.
Dass, S. C., & Lee, J. (2004). A note on the consistency of Bayes factors for testing point null versus non-parametric alternatives. Journal of Statistical Planning and Inference, 119(1), 143–152.
Dawid, A. P., & Lauritzen, S. L. (2001). Compatible prior distributions. In Bayesian methods with applications to science, policy and official statistics (pp. 109–118).
Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 363–375.
Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 642–669.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Etz, A., & Wagenmakers, E.-J. (2015). Origin of the Bayes factor. arXiv preprint arXiv:1511.08180.
Friston, K. J., & Penny, W. (2011). Post hoc Bayesian model selection. NeuroImage, 56, 2089–2099.
Ghosal, S., Ghosh, J., & Ramamoorthi, R. (1997). Non-informative priors via sieves and packing numbers. In Advances in statistical decision theory and applications (pp. 119–132). Springer.
Ghosal, S., Ghosh, J. K., & Van Der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2), 500–531.
Ghosal, S., Lember, J., & Van Der Vaart, A. (2008). Nonparametric Bayesian model selection and averaging. Electronic Journal of Statistics, 2, 63–89.
Grazian, C., & Robert, C. P. (2015). Jeffreys' priors for mixture estimation. In Bayesian statistics from methods to models and applications (pp. 37–48). Springer.
Heck, D., Wagenmakers, E.-J., & Morey, R. D. (2015). Testing order constraints: Qualitative differences between Bayes factors and normalized maximum likelihood. Statistics & Probability Letters, 105, 157–162.
Holmes, C. C., Caron, F., Griffin, J. E., & Stephens, D. A. (2015). Two-sample Bayesian nonparametric hypothesis testing. Bayesian Analysis, 10(2), 297–320.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, 186(1007), 453–461.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Oxford University Press.
Jeffreys, H. (1980). Some general points in probability theory. In A. Zellner, & J. B. Kadane (Eds.), Bayesian analysis in econometrics and statistics: Essays in honor of Harold Jeffreys (pp. 451–453). Amsterdam: North-Holland.
Johnson, V. E., & Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(2), 143–170.
Kamary, K., Eun, J., & Robert, C. P. (2016). Non-informative reparameterisations for location-scale mixtures. arXiv preprint arXiv:1601.01178.
Kamary, K., Mengersen, K., Robert, C. P., & Rousseau, J. (2014). Testing hypotheses via a mixture estimation model. arXiv preprint arXiv:1412.2044.
Kary, A., Taylor, R., & Donkin, C. (2016). Using Bayes factors to test the predictions of models: A case study in visual working memory. Journal of Mathematical Psychology, 72, 210–219.
Kleijn, B. (2013). Criteria for Bayesian consistency. arXiv preprint arXiv:1308.1263.
Labadi, L. A., Masuadi, E., & Zarepour, M. (2014). Two-sample Bayesian nonparametric goodness-of-fit test. arXiv preprint arXiv:1411.3427.
Lee, M. D., Lodewyckx, T., & Wagenmakers, E.-J. (2015). Three Bayesian analyses of memory deficits in patients with dissociative identity disorder. In J. R. Raaijmakers, A. Criss, R. Goldstone, R. Nosofsky, & M. Steyvers (Eds.), Cognitive modeling in perception and memory: A festschrift for Richard M. Shiffrin (pp. 189–200). Psychology Press.
Lindley, D. V. (1977). The distinction between inference and decision. Synthese, 36, 51–58.
Lindley, D. V. (1997). Some comments on Bayes factors. Journal of Statistical Planning and Inference, 61(1), 181–189.
Ly, A., Etz, A., Marsman, M., Epskamp, S., Gronau, Q., Matzke, D., et al. (2015). Replication Bayes factors (in preparation).
Ly, A., Marsman, M., Verhagen, A., Grasman, R., & Wagenmakers, E.-J. (2015). A tutorial on Fisher information. Journal of Mathematical Psychology (submitted for publication).
Ly, A., Verhagen, A., & Wagenmakers, E.-J. (2016). Harold Jeffreys's default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology, 72, 19–32.
Marin, J.-M., & Robert, C. P. (2014). Bayesian essentials with R. Springer.
Robert, C. (2007). The Bayesian choice: From decision-theoretic foundations to computational implementation. Springer Science & Business Media.
Robert, C. (2016). The expected demise of the Bayes factor. Journal of Mathematical Psychology, 72, 33–37.
Robert, C. P. (1993). A note on Jeffreys–Lindley paradox. Statistica Sinica, 3(2), 601–608.
Robert, C. P. (2014). On the Jeffreys–Lindley paradox. Philosophy of Science, 81(2), 216–232.
Robert, C. P. (2015). The Metropolis–Hastings algorithm. arXiv preprint arXiv:1504.01896.
Robert, C. P., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's theory of probability revisited. Statistical Science, 141–172.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416–428.
Salomond, J.-B. (2013). Bayesian testing for embedded hypotheses with application to shape constraints. arXiv preprint arXiv:1303.6466.
Salomond, J.-B. (2014). Adaptive Bayes test for monotonicity. In The contribution of young researchers to Bayesian statistics (pp. 29–33). Springer.
Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5), 2587–2619.
Shiffrin, R., Chandramouli, S., & Grünwald, P. (2016). Bayes factors, relations to minimum description length, and overlapping model classes. Journal of Mathematical Psychology, 72, 56–77.
Steingroever, H., Wetzels, R., & Wagenmakers, E.-J. (in press). Bayes factors for reinforcement-learning models of the Iowa gambling task. Decision.
Stephens, M., & Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10, 681–690.
Turner, B. M., Sederberg, P. B., & McClelland, J. L. (2016). Bayesian analysis of simulation-based models. Journal of Mathematical Psychology, 72, 191–199.
Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143(4), 1457.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100, 426–432.
Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2010). An encompassing prior generalization of the Savage–Dickey density ratio test. Computational Statistics & Data Analysis, 54, 2094–2102.
Wrinch, D., & Jeffreys, H. (1919). On some aspects of the theory of probability. Philosophical Magazine, 38, 715–731.
Wrinch, D., & Jeffreys, H. (1921). On certain fundamental principles of scientific inquiry. Philosophical Magazine, 42, 369–390.
Wrinch, D., & Jeffreys, H. (1923). On certain fundamental principles of scientific inquiry. Philosophical Magazine, 45, 368–375.
Yang, G. L., & Le Cam, L. (2000). Asymptotics in statistics: Some basic concepts. Berlin, Germany: Springer.