Putting the Brakes on the Breakthrough

Deborah Mayo

ONE: A Conversation between Sir David Cox and D. Mayo (June 2011)

Toward the end of this exchange, the issue of the Likelihood Principle (LP)[1] arose:

COX: It is sometimes claimed that there are logical inconsistencies in frequentist theory, in particular surrounding the strong Likelihood Principle (LP). I know you have written about this; what is your view at the moment?

MAYO: What contradiction?

COX: Well, that frequentist theory does not obey the strong LP.

MAYO: The fact that the frequentist rejects the strong LP is no contradiction.

COX: Of course, but the alleged contradiction is that from frequentist principles (sufficiency, conditionality) you should accept the strong LP. The argument for the strong LP has always seemed to me totally unconvincing, but it is still considered one of the most powerful arguments against the frequentist theory.

MAYO: Do you think so?

COX: Yes, it's a radical idea, if it were true.

MAYO: You're not asking me to discuss where Birnbaum goes wrong (are you)?

COX: Where did Birnbaum go wrong?

MAYO: I am not sure it can be talked through readily, even though in one sense it is simple; so I relegate it to an appendix. It turns out that the premises are inconsistent, so it is not surprising the result is an inconsistency. The argument is unsound: it is impossible for the premises all to be true at the same time. Alternatively, if one allows the premises to be true, the argument is not deductively valid. You can take your pick.

[1] I will always mean the "strong" likelihood principle.

Thus arose the challenge to sketch the bare bones of this complex business, even though I must direct you to appropriate details elsewhere.

TWO: The Birnbaum result, heralded as a breakthrough in statistics! (Indeed it would undo the fundamental feature of error statistics, as will be explained.)

Savage: "Without any intent to speak with exaggeration it seems to me that this is really a historic occasion. This paper is a landmark in statistics … I myself, like other Bayesian statisticians, have been convinced of the truth of the likelihood principle for a long time. Its consequences for statistics are very great. … I can't stop without saying once more that this paper is really momentous in the history of statistics. It would be hard to point to even a handful of comparable events." (Savage 1962)

"… people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability … All error statistical notions, p-values, significance levels, … all violate the likelihood principle." (ibid.)

The Birnbaum argument has long been treated, by Bayesians and likelihoodists at least, as a great breakthrough, a landmark, and a momentous event; I have no doubt that revealing the flaw in the alleged proof will not be greeted with anything like the same recognition (Mayo 2010).

THREE: (Frequentist) Error Statistical Methods

Probability arises (in inference) to quantify how frequently methods are capable of discriminating between alternative hypotheses and how reliably they detect errors.
These probabilistic properties of inference procedures are error frequencies or error probabilities. Formally, the probabilities refer to the distribution of a statistic T(x): its sampling distribution.

Behavioristic rationale: to control the rate of erroneous inferences (or decisions).

Inferential or testing rationale: to control and appraise the probativeness or severity of tests, for a given inference (about some aspect of a data-generating procedure, as modeled).

The general idea of appraising rules probabilistically is very Popperian (so should be familiar to philosophers of science).

In contrast to "probabilism," on which inferring a hypothesis H is warranted only by showing it is true or probably true, we may assign probabilities to rules for testing (or estimating) H.

Good fits between H and x are "too cheap to be worth having"; they only count if they result from serious attempts to refute H. (I see error statistical methods as allowing us to make good on the Popperian idea, although his tools did not.)

Severity Principle (weakest): Data x do not provide good evidence for hypothesis H if x results from a test procedure with a very low probability or capacity of having uncovered the falsity of H (even if H is incorrect). Such a test, we would say, is insufficiently stringent or severe.

Formal error statistical tools may be regarded as providing systematic ways to evaluate and promote this goal.

FOUR: Error Statistical Methods Violate the LP (by considering outcomes other than the one observed)

Critics of frequentist error statistics rightly accuse us of insisting on considering outcomes other than the one observed, because that is what is needed to assess probativeness.

A test statistic or distance measure T(x) may be regarded as a measure of fit; once we get its value we still want to know how often such a fit with H would occur even if H is false, i.e., we need the sampling distribution of T(x) (see the simulation sketch at the end of this section).

Likelihoods (likelihood ratios) yield measures of fit, but crucial information is given by the distribution of that fit measure: if so good a fit (between x and H) would very probably arise even if H were specifiably false, then the good fit is poor evidence for H.[i]

Aspects of the data and hypothesis generation can alter the probing capacities of tests, e.g., double-counting, ad hoc adjustments, selection effects, hunting for significance, etc., and error probabilities pick this up.

This immediately takes us to the core issue of the LP:

"Those who do not accept the likelihood principle believe that the probabilities of sequences that might have occurred, but did not, somehow affect the import of the sequence that did occur." (Edwards, Lindman, and Savage 1963, 238)

The error statistician is "guilty as charged!":

"The question of how often a given situation would arise is utterly irrelevant to the question how we should reason when it does arise. I don't know how many times this simple fact will have to be pointed out before statisticians of 'frequentist' persuasions will take note of it." (Jaynes 1976, 247)

What we wonder is how many times we will have to point out that, to us, reasoning from the result that arose is crucially dependent on how often it would have arisen.

Error statistical methods consider outcomes other than the one observed, but that does not mean averaging over any and all experiments, even ones never performed!
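To make the role of the sampling distribution concrete, here is a minimal simulation sketch. It is my own construction, not from the text; the sample size, sigma, and the observed value t_obs are hypothetical choices for illustration. It estimates the distribution of the fit measure T(X) = sqrt(n)|X̄|/σ under H0: μ = 0 and reads off the p-value as a tail probability:

```python
# A minimal sketch (hypothetical settings): the p-value as a tail
# probability of the sampling distribution of a fit measure T(X).
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 10, 1.0, 100_000

# Simulate the sampling distribution of T = sqrt(n)*|Xbar|/sigma under H0: mu = 0.
xbar = rng.normal(0.0, sigma / np.sqrt(n), size=reps)  # distribution of the sample mean
T = np.sqrt(n) * np.abs(xbar) / sigma

t_obs = 1.5  # hypothetical observed value of the statistic
p_value = np.mean(T >= t_obs)  # how often so good a "fit" would arise even under H0
print(f"estimated p-value at t_obs = {t_obs}: {p_value:.3f}")  # close to 0.134
```

The point of the sketch is only that the report about the observed fit depends on the whole distribution of T(x), not on the observed value alone.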
One of the most common criticisms of frequentist error statistics assumes that they do. Cox had to construct a special principle to make this explicit.

FIVE: Weak Conditionality Principle (WCP): you should not get credit (or be blamed) for something you don't deserve

A mixture experiment: Toss a fair coin to determine whether to make 10 or 10,000 observations of Y, a normally distributed random variable with unknown mean μ. For any given result y, one could report an overall p-value {p'(y) + p''(y)}/2: the convex combination of the p-values, averaged over the two sample sizes.

Weak Conditionality Principle (WCP): If a mixture experiment (of the above type) is performed, then, if it is known which experiment produced the data, inferences about μ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed.

Once we know which tool or test generated the data y, given that our inference is about some aspect of what generated y, it should not be influenced by whether a coin was tossed to decide which of the two to perform.

If you only observed 10 samples, it would be misleading to report this average as your p-value:

"It would mean that an individual fortunate in obtaining the use of a precise instrument sacrifices some of that information in order, in effect, to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available this makes no sense. Once it is known whether E' or E'' has been run, the p-value assessment should be made conditional on the experiment actually run." (Cox and Mayo 2010)

The WCP is a normative epistemological claim about the appropriate manner of reaching an inference in the given context.

Appealing to the severity assessment: perhaps, if all you cared about were low error rates in some long run, defined in some way or other, you could average over experiments not performed; but low long-run error probabilities are necessary, not sufficient, for satisfying severity.

The severity assessment reports on how good a job the test did in uncovering a mistaken claim regarding some aspect of the experiment that actually generated the particular data x0.

The WCP is entirely within the frequentist philosophy. It does not lead to conditioning on the particular sample observed!

Here is where the Birnbaum result enters: his argument is supposed to show that it does. How can so innocent a principle as the WCP be claimed to force the error statistician to give up on error probability reports altogether?

SIX: (Frequentist) Error Statistics Violates the LP (once again, more formally)

Strong Likelihood Principle (LP): a universal conditional claim. If two data sets y' and y'' from experiments E' and E'' respectively have likelihood functions which are functions of the same parameter(s) μ and are proportional to each other, then y' and y'' should lead to identical inferential conclusions about μ.

This is to hold for any such pair of samples y', y''. Here "y'" is shorthand for "y' was observed in experiment E'." E' and E'' may have different probability models, but with the same unknown parameter μ.[ii]

Examples of LP violations: fixed vs. data-dependent stopping. E' and E'' might be binomial sampling with n fixed and negative binomial sampling, respectively (a small numerical sketch follows).
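A small numerical sketch of this LP pair may help; the numbers are my own for illustration (Python with numpy/scipy assumed), not the author's. Observing 3 successes in 12 trials gives proportional likelihoods whether n = 12 was fixed in advance (binomial, E') or sampling continued until the 3rd success (negative binomial, E''), yet the one-sided p-values for H0: θ = 0.5 differ:

```python
# A minimal sketch (hypothetical numbers): a binomial/negative-binomial LP pair.
import numpy as np
from scipy.special import comb
from scipy.stats import binom

theta = np.linspace(0.1, 0.9, 5)

# E': binomial, n = 12 fixed, observe 3 successes.
lik_binomial = comb(12, 3) * theta**3 * (1 - theta)**9
# E'': negative binomial, sample until the 3rd success, which lands on trial 12.
lik_negbinom = comb(11, 2) * theta**3 * (1 - theta)**9

print(lik_binomial / lik_negbinom)  # constant ratio (= 4.0): proportional likelihoods

# One-sided p-values for H0: theta = 0.5 vs H1: theta < 0.5 nonetheless differ:
p_binomial = binom.cdf(3, 12, 0.5)  # P(at most 3 successes in 12 trials)      ~ 0.073
p_negbinom = binom.cdf(2, 11, 0.5)  # P(12+ trials) = P(<= 2 successes in 11)  ~ 0.033
print(p_binomial, p_negbinom)
```

The constant likelihood ratio is what makes the two outcomes an LP pair; the unequal p-values are the LP violation.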
I will focus on a more extreme example that is very often alluded to in showing the error statistician is guilty of LP violations: fixed versus optional stopping.

E' might be iid sampling from a Normal distribution N(μ, σ²), σ known, with a fixed sample size n; E'' is the corresponding experiment that uses this stopping rule: keep sampling until H0 is rejected at the .05 level. That is, with Yi ~ N(μ, σ²) and a test of H0: μ = 0 vs. H1: μ ≠ 0, keep sampling until |Ȳ| ≥ 1.96σ/√n.

"The likelihood principle emphasized in Bayesian statistics implies, … that the rules governing when data collection stops are irrelevant to data interpretation." (Edwards, Lindman, and Savage 1963, 239)

This conflicts with error statistical theory:

"We see that in calculating [the posterior], our inference about μ, the only contribution of the data is through the likelihood function…. In particular, if we have two pieces of data y' and y'' with [proportional] likelihood functions … the inferences about μ from the two data sets should be the same. This is not usually true in the orthodox theory and its falsity in that theory is an example of its incoherence." (Lindley 1976, 36)

Frequentist "inference about μ" can take different forms, but since the argument is to be entirely general, and given the need for brevity here, it will be easiest to take a particular kind of inference, say forming a p-value.

As Lindley rightly claims, there is an LP violation in the optional stopping experiment: there is a difference in the corresponding p-values from E' and E'', written p' and p'' respectively.

While p' would be ~.05, p'' would be much larger, ~.3 (see the simulation sketch below). The error probability accumulates because of the optional stopping. Clearly p' is not equal to p'', so the two outcomes are not evidentially equivalent:

Infr_E'(y') is not equal to Infr_E''(y'')   [for an error statistician]

Here Infr_E(y) abbreviates the inference[2] based on outcome y from experiment E.

By contrast:

Infr_E'(y') is equal to Infr_E''(y'')   [for one who accepts the LP]

It would be more accurate to write something like "should be treated as equivalent" or "should not be treated as equivalent" evidentially, since these verdicts are based on one or another methodology or philosophy of inference (but I follow the more usual formulation).

[2] In the context of error statistical inference, the inference is based on the particular statistic and sampling distribution specified by E.

Suppose you observed y'' from our optional stopping experiment E'', which stopped at n = 100. Then

Infr_E'(y') is equal to Infr_E''(y'')   [for one who accepts the LP]

where y' comes from the same experiment but with n fixed at 100. Bayesians call this the Stopping Rule Principle (SRP):

"The SRP would imply, [in the Armitage example], that if the observation in [the case of optional stopping] happened to have n=100, then the evidentiary content of the data would be the same as if the data had arisen from the fixed sample size experiment." (Berger and Wolpert 1988, 76)

Some frequentists argue, correctly I think, that the optional stopping example alone is enough to refute the strong likelihood principle (Cox 1977, 54), since, with probability 1, the rule will stop with a "nominally" significant result even though μ = 0. It violates the principle that we should avoid misleading inferences with high or maximal probability (the weak repeated sampling principle). In our terminology, it permits an inference with minimal severity.

(The example can also be made out in terms of confidence intervals, where the rule ensures that μ = 0 is never in an interval, with probability 1. Berger and Wolpert grant that the frequentist probability that the intervals exclude 0, even when μ = 0 is true, is 1 (pp. 80-81).[3])

[3] See EGEK (Mayo 1996), p. 355, for discussion.
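A Monte Carlo sketch illustrates the accumulating error probability; it is again my own construction, with hypothetical settings (σ = 1, a look at every n up to 100). It estimates how often the rule "keep sampling until |Ȳ| ≥ 1.96σ/√n" reaches nominal .05 significance even though H0: μ = 0 is true:

```python
# A minimal sketch (hypothetical settings): actual type I error of
# optional stopping, truncated at n = 100, when H0: mu = 0 is true.
import numpy as np

rng = np.random.default_rng(0)
sigma, n_max, reps = 1.0, 100, 20_000

rejections = 0
for _ in range(reps):
    y = rng.normal(0.0, sigma, size=n_max)  # data generated with mu = 0 (H0 true)
    n = np.arange(1, n_max + 1)
    ybar = np.cumsum(y) / n                 # running mean after each observation
    if np.any(np.abs(ybar) >= 1.96 * sigma / np.sqrt(n)):
        rejections += 1                     # the rule stopped with "significance"

# Far above the nominal .05 of any single look (the text cites ~.3):
print(f"P(nominal rejection by n = {n_max} | H0) ~ {rejections / reps:.3f}")
```

The estimate lands well above .05, in line with p'' being far larger than the nominal level; letting n_max grow pushes the probability toward 1, the sampling to a foregone conclusion that the Armitage example dramatizes.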
SEVEN: NOW FOR THE BREAKTHROUGH

Birnbaum claims he can show that you, as a frequentist error statistician, must grant that observing y'' from the optional stopping experiment is equivalent to having fixed n = 100 at the start (i.e., to experiment E').

Reminder: the (strong) Likelihood Principle (LP) is a universal conditional claim: if two data sets y' and y'' from experiments E' and E'' respectively have likelihood functions which are functions of the same parameter(s) μ and are proportional to each other, then y' and y'' should lead to identical inferential conclusions about μ.

As with conditional proofs, we assume the antecedent and try to derive the consequent; or, equivalently, we show a contradiction results whenever the antecedent holds and the consequent does not. For the latter:

LP violation pairs. Start with any violation of the LP, that is, a case where the antecedent of the LP holds and the consequent does not, and show that you get a contradiction. Assume, then, that the pair of outcomes y' and y'', from E' and E'' respectively, represents a violation of the LP. We may call them LP pairs.

Step 1: Birnbaum describes a funny kind of "mixture" experiment based on an LP pair. You observed y'', say, from experiment E''. Having observed y'' from the optional stopping experiment (which stopped, say, at n = 100), I am to imagine it resulted from getting heads on the toss of a fair coin, where tails would have meant performing the fixed sample size experiment with n = 100 from the start. Next, erase the fact that y'' came from E'' and report (y', E'). Call the associated test statistic TBB:

The Birnbaum test statistic TBB:
Case 1: if you observe y'' (from E'') and y'' has an LP pair y' in E', just report (y', E').
Case 2: if your observed outcome does not have an LP pair, just report it as usual.

(Any outcome from the optional stopping experiment E'' has an LP pair in the corresponding fixed sample size experiment E'.) Only Case 1 results matter for the points of the proof we need to consider.

I said it was a funny kind of mixture; two things make it funny. First, it didn't happen: you only observed y'' from E''. Second, you are to report the outcome as y' from E' even though you actually observed y'' from E'' (and then report the mixture). We may call this Birnbaumizing the result you got: whenever you have observed a potential LP violation, "Birnbaumize" it as above.

If you observe y'' (from E'') and y'' has an LP pair in E', you just report y' (i.e., report (y', E')). So you would give this report whether you actually observed y' or y''.

We said our inference would be in the form of p-values. Now, to obtain the p-value we must use the defined sampling distribution of TBB: the convex combination. In reporting a p-value associated with y'' we are to report the average of p' and p'', namely (p' + p'')/2 (the 1/2 comes from the imagined fair coin). Having thus "Birnbaumized" the particular LP pair that you actually observed, it appears that you must treat y' as evidentially equivalent to its LP pair, y''.

The test statistic TBB is, technically, a sufficient statistic, but the rest of the argument overlooks that an error statistician must still take into account the sampling distribution at each step. At this step, it refers to the distribution of TBB. But it changes in the second step, and that is what dooms the "proof," as we will now see (a small sketch of the two competing sampling distributions follows).
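Here is a minimal sketch of the two sampling distributions the argument trades on; the helper names are hypothetical, my own and not Birnbaum's or the author's notation. Birnbaumization assigns both members of an LP pair the convex-combination p-value (premise 1), while conditioning returns the p-value of the experiment actually performed (premises 2a, 2b):

```python
# A minimal sketch (hypothetical helper names): the two sampling
# distributions in play, applied to the optional-stopping LP pair.
def birnbaumize(p_prime: float, p_dprime: float) -> float:
    """p-value under the sampling distribution of T_BB: the 50-50 convex combination."""
    return (p_prime + p_dprime) / 2

def condition(p_actual: float) -> float:
    """p-value under the sampling distribution of the experiment actually performed."""
    return p_actual

p_prime, p_dprime = 0.05, 0.30  # the LP pair's p-values from the text

# Premise 1: under Birnbaumization, both members of the pair get the same report.
avg = birnbaumize(p_prime, p_dprime)
assert avg == birnbaumize(p_dprime, p_prime)  # (p' + p'')/2 = .175 either way

# Premises 2(a), 2(b): conditioning uses a *different* sampling distribution,
# under which the reports are .05 and .30, which are unequal.
assert condition(p_prime) != condition(p_dprime)
print(avg, condition(p_prime), condition(p_dprime))
```

No single sampling distribution delivers all the premises at once, which is exactly the dilemma spelled out in the numbered argument below.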
0. Let y' and y'' (from E' and E'') be any LP violation pair, and say y'' from E'' has been observed, with ȳ = .196.

1. Premise 1 (Birnbaumization): inferences from y' and y'', using the sampling distribution of the convex combination, are equivalent:
Infr_E'(y') is equal to Infr_E''(y'')   [both are equal to (p' + p'')/2]

2. Premise 2(a): an inference from y', using (i.e., conditioning on) the sampling distribution of E' (the experiment that produced it), is p':
Infr_E'(y') equals p'
Premise 2(b): an inference from y'', using (i.e., conditioning on) the sampling distribution of E'' (the experiment that produced it), is p'':
Infr_E''(y'') equals p''

From (1), (2a), and (2b): Infr_E'(y') equals Infr_E''(y'').

Which is, or looks like, the LP! It would follow, of course, that p' equals p''. But from (0), y' and y'' form an LP violation, so p' is not equal to p'': p' was ~.05, p'' ~.3. Thus it would appear the frequentist is led into a contradiction.

The problem? There are different ways to show it, as always; here I allowed the premises to be true. In that case this is an invalid argument: we have all true premises and a false conclusion. I can consistently hold all the premises together with the denial of the conclusion:

1. The two outcomes get the same convex-combination p-value if I play the Birnbaumization game.
2. If I condition, the inferences from y'' and y' are p'' and p', respectively.
Denial of the conclusion: p' is not equal to p'' (.05 is not equal to .3).

No contradiction.

We can put the argument in a valid form, but then the premises can never both be true at the same time. (It is not even so easy to put it in valid form; see my paper for several attempts.)

Premise 1: Inferences from y' and y'' are evidentially equivalent: Infr_E'(y') is equal to Infr_E''(y'').
Premise 2(a): An inference from y' should use (i.e., condition on) the sampling distribution of E' (the experiment that produced it): Infr_E'(y') equals p'.
Premise 2(b): An inference from y'' should use (i.e., condition on) the sampling distribution of E'' (the experiment that produced it): Infr_E''(y'') equals p''.
(Usually the proofs give just the parts in bold.)
From (1), (2a), and (2b): Infr_E'(y') equals Infr_E''(y'').

Which is the LP! Contradicting the assumption that y' and y'' form an LP violation!

The problem now is that in order to infer the conclusion, the premises of the argument must be true, and it is impossible to have premises (1) and (2) true at the same time. Premise (1) is true only if we use the sampling distribution given by the convex combination (averaging over the LP pairs); this is the sampling distribution of TBB. Yet to draw inferences using this sampling distribution renders both (2a) and (2b) false. The truth of (2a) and (2b) requires "conditioning" on the experiment actually performed; or rather, it requires that we not "Birnbaumize" the experiment from which the observed LP pair is known to have actually come. See the pages in the handout from ERROR AND INFERENCE.

Although I have allowed premise (1) for the sake of argument, the very idea is extremely far-fetched and unmotivated. Pre-data, the frequentist would really need to consider all possible pairs that could be LP violations and average over them….

It is worth noting that Birnbaum himself rejected the LP (Birnbaum 1969, 128): "Thus it seems that the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of erroneous interpretations."

REFERENCES

Armitage, P. (1975). Sequential Medical Trials, 2nd ed. New York: John Wiley & Sons.

Birnbaum, A. (1962).
On the Foundations of Statistical Inference (with Discussion). Journal of the American Statistical Association 57: 269-326.

Birnbaum, A. (1969). Concepts of Statistical Evidence. In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, edited by S. Morgenbesser, P. Suppes, and M. White. New York: St. Martin's Press: 112-143.

Berger, J. O., and Wolpert, R. L. (1988). The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics.

Cox, D. R. (1977). The Role of Significance Tests (with Discussion). Scandinavian Journal of Statistics 4: 49-70.

Cox, D. R., and Mayo, D. (2010). Objectivity and Conditionality in Frequentist Inference. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos. Cambridge: Cambridge University Press: 276-304.

Edwards, W., Lindman, H., and Savage, L. (1963). Bayesian Statistical Inference for Psychological Research. Psychological Review 70: 193-242.

Jaynes, E. T. (1976). Common Sense as an Interface. In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, Volume 2, edited by W. L. Harper and C. A. Hooker. Dordrecht, The Netherlands: D. Reidel: 218-257.

Joshi, V. M. (1976). A Note on Birnbaum's Theory of the Likelihood Principle. Journal of the American Statistical Association 71: 345-346.

Joshi, V. M. (1990). Fallacy in the Proof of Birnbaum's Theorem. Journal of Statistical Planning and Inference 26: 111-112.

Lindley, D. V. (1976). Bayesian Statistics. In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, Volume 2, edited by W. L. Harper and C. A. Hooker. Dordrecht, The Netherlands: D. Reidel: 353-362.

Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: The University of Chicago Press (Series in Conceptual Foundations of Science).

Mayo, D. (2010). An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos. Cambridge: Cambridge University Press: 305-314.

Mayo, D., and Cox, D. R. (2011). Statistical Scientist Meets a Philosopher of Science: A Conversation with Sir David Cox. Rationality, Markets and Morals (RMM): Studies at the Intersection of Philosophy and Economics, edited by M. Albert, H. Kliemt, and B. Lahno. Frankfurt School Verlag. Volume 2: 103-114. http://www.rmm-journal.de/htdocs/st01.html

Mayo, D., and Spanos, A., eds. (2010). Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press.

Pratt, J. W., Raiffa, H., and Schlaifer, R. (1995). Introduction to Statistical Decision Theory. Cambridge, MA: The MIT Press.

Savage, L., ed. (1962a). The Foundations of Statistical Inference: A Discussion. London: Methuen & Co.

Savage, L. (1962b). Discussion on Birnbaum (1962). Journal of the American Statistical Association 57: 307-308.
NOTES

[i] In so-called behavioristic contexts, the concern is controlling errors in a long-run series, or long-run reliability; but in scientific contexts, we use error probabilities to quantify the capability of a given method to have discerned a flaw or error in some hypothesis, one which is correct or incorrect about some aspect of the data-generating phenomenon. Because these probabilistic properties of methods for discerning errors are error probabilities, I refer to an error-statistical approach.

[ii] We think this captures the generally agreed upon meaning of the LP, although statements may be found that seem stronger. For example, in Pratt, Raiffa, and Schlaifer (1995): "If, in a given situation, two random variables are observable, and if the value x of the first and the value y of the second give rise to the same likelihood function, then observing the value x of the first and observing the value y of the second are equivalent in the sense that they should give the same inference, analysis, conclusion, decision, action, or anything else." (Pratt, Raiffa, and Schlaifer 1995, 542; emphasis added)