Putting the Brakes on the Breakthrough
Deborah Mayo
ONE: A Conversation between Sir David Cox and D. Mayo (June, 2011)
Toward the end of this exchange, the issue of the Likelihood Principle (LP)¹ arose:
COX: It is sometimes claimed that there are logical inconsistencies in frequentist
theory, in particular surrounding the strong Likelihood Principle (LP). I know you
have written about this; what is your view at the moment?
MAYO: What contradiction?
COX: Well, that frequentist theory does not obey the strong LP.
MAYO: The fact that the frequentist rejects the strong LP is no contradiction.
COX: Of course, but the alleged contradiction is that from frequentist principles
(sufficiency, conditionality) you should accept the strong LP. The (argument for) the
strong LP has always seemed to me totally unconvincing, but the argument is still
considered one of the most powerful arguments against the frequentist theory.
MAYO: Do you think so?
COX: Yes, it’s a radical idea, if it were true.
¹ I will always mean the “strong” likelihood principle.
MAYO: You’re not asking me to discuss where Birnbaum goes wrong (are you)?
COX: Where did Birnbaum go wrong?
MAYO: I am not sure it can be talked through readily, even though in one sense it is
simple; so I relegate it to an appendix.
It turns out that the premises are inconsistent, so it is not surprising the result is an
inconsistency.
The argument is unsound: it is impossible for the premises to all be true at the same
time.
Alternatively, if one allows the premises to be true, the argument is not deductively
valid. You can take your pick.
Thus arose the challenge to sketch the bare bones of this complex business,
even though I must direct you to appropriate details elsewhere.
TWO: The Birnbaum result was heralded as a breakthrough in statistics! (Indeed, it
would undo the fundamental feature of error statistics, as will be explained.)
Savage:
Without any intent to speak with exaggeration it seems to me that this is really
a historic occasion. This paper is a landmark in statistics … I myself, like other
Bayesian statisticians, have been convinced of the truth of the likelihood principle
for a long time. Its consequences for statistics are very great.
….I can’t stop without saying once more that this paper is really momentous in
the history of statistics. It would be hard to point to even a handful of comparable
events. (Savage 1962).
…people will not long stop at that halfway house but will go forward and accept the
implications of personalistic probability…
All error statistical notions, p-values, significance levels,…all violate the likelihood
principle (ibid.)
The Birnbaum argument has long been treated, by Bayesians and likelihoodists
at least, as a great breakthrough, a landmark, and a momentous event; I have
no doubt that revealing the flaw in the alleged proof will not be greeted with
anything like the same recognition (Mayo 2010).
THREE: (Frequentist) Error Statistical Methods
Probability arises (in inference) to quantify how frequently methods are capable of
discriminating between alternative hypotheses and how reliably they detect errors.
These probabilistic properties of inference procedures are error frequencies or error
probabilities
Formally: the probabilities refer to the distribution of statistic T(x) (sampling
distribution)
behavioristic rationale: to control the rate of erroneous inferences (or
decisions):
inferential or testing rationale: or to control and appraise probativeness or
severity of tests, for a given inference (about some aspect of a data generating
procedure, as modeled)
The general idea of appraising rules probabilistically is very Popperian (so
should be familiar to philosophers of science)
In contrast to “probabilism,” on which inferring a hypothesis H is warranted only by
showing it is true or probably true, we may assign probabilities to rules for
testing (or estimating) H
Good fits between H and x are “too cheap to be worth having”; they only count if
they result from serious attempts to refute H
(I see error statistical methods as allowing us to make good on the Popperian
idea, although his tools did not)
Severity Principle (Weakest): Data x do not provide good evidence for hypothesis
H if x results from a test procedure with a very low probability or capacity of having
uncovered the falsity of H (even if H is incorrect).
Such a test we would say is insufficiently stringent or severe.
Formal error statistical tools may be regarded as providing systematic ways to
evaluate and promote this goal
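As a rough numerical sketch of how such an assessment might go for a one-sided normal test (the formula Φ((ȳ − μ₁)√n/σ) and the particular numbers are my illustrative assumptions, not computations from the text):

```python
from math import erf, sqrt

def severity_mu_greater(mu1, ybar, n, sigma=1.0):
    """Severity for the claim mu > mu1 given observed mean ybar from n iid
    N(mu, sigma^2) draws: the probability of a worse fit (a smaller observed
    mean) were mu only mu1, i.e. Phi((ybar - mu1) * sqrt(n) / sigma)."""
    z = (ybar - mu1) * sqrt(n) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

# With ybar = 0.4, n = 100, sigma = 1: the claim mu > 0.2 passes with
# high severity, while mu > 0.39 has barely been probed at all.
sev_low = severity_mu_greater(0.2, ybar=0.4, n=100)    # ~ .98
sev_high = severity_mu_greater(0.39, ybar=0.4, n=100)  # ~ .54
```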
FOUR: Error Statistical Methods Violate the LP (by considering outcomes
other than the one observed)
Critics of frequentist error statistics rightly accuse us of insisting on
considering outcomes other than the one observed, because that is what is needed
to assess probativeness
A test statistic or distance measure T(x) may be regarded as a measure of
fit; once we get its value we still want to know how often such a fit with H
would occur even if H is false, i.e., the sampling distribution of T(x)
Likelihood (likelihood ratios) yield measures of fit, but crucial information
is given by the distribution of that fit measure: if so good a fit (between x and H)
would very probably arise even if H were specifiably false, then the good fit is
poor evidence for H.ⁱ
Aspects of data and hypothesis generation can alter the probing capacities
of tests (e.g., double-counting, ad hoc adjustments, selection effects, hunting for
significance), and error probabilities pick this up
This immediately takes us to the core issue of the LP:
Those who do not accept the likelihood principle believe that the
probabilities of sequences that might have occurred, but did not, somehow
effect the import of the sequence that did occur (Edwards, Lindman, and
Savage 1963, 238)
The error statistician is “guilty as charged!”:
The question of how often a given situation would arise is utterly
irrelevant to the question how we should reason when it does arise. I
don’t know how many times this simple fact will have to be pointed out
before statisticians of ‘frequentist’ persuasions will take note of it.” (Jaynes
1976, 247)
What we wonder is how many times we will have to point out that to us,
reasoning from the result that arose is crucially dependent on how often it
would have arisen…..
Error statistical methods consider outcomes other than the one observed, but that
does not mean averaging over any and all experiments not even performed!
One of the most common criticisms of frequentist error statistics assumes they do
Cox had to construct a special principle to make this explicit
FIVE: Weak Conditionality (WCP): You should not get Credit (be
blamed) for something you don’t deserve
A mixture experiment: Toss a fair coin to determine whether to make 10 or 10,000
observations of Y, a normally distributed random variable with unknown mean μ.
For any given result y, one could report an overall p-value:
{p’(y) + p”(y)}/2,
the convex combination of the p-values averaged over the two sample sizes.
(WCP) Conditionality Principle (weak): If a mixture experiment (of the above
type) is performed, then if it is known which experiment produced the data,
inferences about μ are appropriately drawn in terms of the sampling behavior in the
experiment known to have been performed.
Once we know which tool or test generated the data y, given our inference is about
some aspect of what generated y, it should not be influenced by whether a coin was
tossed to decide which of the two to perform.
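A minimal sketch of the two reports, conditional versus convex combination (the observed mean .196, σ = 1, and a two-sided z-test are my illustrative assumptions, not specifics from the text):

```python
from math import erf, sqrt

def p_two_sided(ybar, n, sigma=1.0, mu0=0.0):
    """Two-sided p-value for H0: mu = mu0 from a normal sample mean."""
    z = abs(ybar - mu0) * sqrt(n) / sigma
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

ybar = 0.196                        # same observed mean in either branch
p_small = p_two_sided(ybar, 10)     # E': the imprecise tool (n = 10)
p_large = p_two_sided(ybar, 10000)  # E'': the precise tool (n = 10,000)

# Unconditional report: average over the coin toss, {p'(y) + p''(y)}/2
p_mix = (p_small + p_large) / 2
# WCP: report instead the p-value from the experiment actually performed
```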
If you only observed 10 samples, it would be misleading to report this average as
your p-value:
“It would mean that an individual fortunate in obtaining the use of a precise
instrument sacrifices some of that information in order, in effect, to rescue an
investigator who has been unfortunate enough to have the randomizer choose a far
less precise tool. From the perspective of interpreting the specific data that are
actually available this makes no sense. Once it is known whether E’ or E” has been
run, the p-value assessment should be made conditional on the experiment actually
run.” (Cox and Mayo 2010)
WCP is a normative epistemological claim about the appropriate manner of
reaching an inference in the given context.
Appealing to the severity assessment: Maybe if all you cared about was low error
rates in some long run, defined in some way or other, then you could average
over experiments not performed, but low long-run error probabilities are
necessary but not sufficient for satisfying severity.
The severity assessment reports on how good a job the test did in uncovering a
mistaken claim regarding some aspect of the experiment that actually
generated particular data x0.
The WCP is entirely within the frequentist philosophy.
It does not lead to conditioning on the particular sample observed!
Here’s where the Birnbaum result enters---his argument is supposed to show that it
does….
How can so innocent a principle as the WCP be claimed to force the error
statistician to give up on error probability reports altogether?
SIX: (Frequentist) Error Statistics Violates the LP—once again, more formally
Strong Likelihood Principle (LP).
It is a universal conditional claim:
If two data sets y’ and y” from experiments E’ and E” respectively, have
likelihood functions which are functions of the same parameter(s) µ and are
proportional to each other, then y’ and y” should lead to identical inferential
conclusions about µ.
For any two data sets y’, y” (i.e., whenever there is a pair of samples y’, y”):
y’ is shorthand for (y’ was observed in experiment E’)
E’ and E” may have different probability models but with the same unknown
parameter μ.ⁱⁱ
Examples of LP violations: Fixed vs. Data-Dependent Stopping
E’ and E” might be Binomial sampling with n fixed, and Negative Binomial
sampling, respectively.
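This first LP pair can be checked directly; in the sketch below (3 successes in 12 trials is my illustrative choice, not a figure from the text), the binomial likelihood with n fixed at 12 and the negative binomial likelihood with sampling stopped at the 3rd success differ only by a constant factor in θ:

```python
from math import comb

def lik_binomial(theta, k=3, n=12):
    # n fixed in advance; observe k successes
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def lik_neg_binomial(theta, k=3, n=12):
    # sample until the k-th success, which arrives on trial n
    return comb(n - 1, k - 1) * theta**k * (1 - theta)**(n - k)

# The ratio is constant in theta: the two likelihood functions are
# proportional, so the LP directs the same inference from either
# experiment, even though the error-statistical p-values differ.
ratios = [lik_binomial(t) / lik_neg_binomial(t) for t in (0.1, 0.3, 0.5, 0.7)]
```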
I will focus on a more extreme example that is very often alluded to in showing the
error statistician is guilty of LP violations: fixed versus optional stopping
E’ might be iid sampling from a Normal distribution N(μ, σ²), σ known, with a fixed
sample size n, and E” the corresponding experiment that uses this stopping rule:
Keep sampling until H0 is rejected at the .05 level
(Yi ~ N(μ, σ²), testing H0: μ = 0 vs. H1: μ ≠ 0;
i.e., keep sampling until |Ȳ| ≥ 1.96σ/√n).
The likelihood principle emphasized in Bayesian statistics implies, … that the
rules governing when data collection stops are irrelevant to data interpretation.
(Edwards, Lindman, Savage 1963, p. 239).
This conflicts with error statistical theory:
We see that in calculating [the posterior], our inference about μ, the only
contribution of the data is through the likelihood function…. In particular, if
we have two pieces of data y’ and y” with [proportional] likelihood function
…. the inferences about μ from the two data sets should be the same. This is
not usually true in the orthodox theory and its falsity in that theory is an
example of its incoherence. (Lindley 1976, p. 36).
Frequentist inference about μ can take different forms, but since the argument is to
be entirely general, and given the need for brevity here, it will be easiest to take a
particular kind of inference, say forming a p-value.
As Lindley rightly claims, there is an LP Violation in the Optional Stopping
Experiment: There is a difference in the corresponding p-values from E’ and E”,
written as p’ and p”, respectively.
While p’ would be ~.05, p” would be much larger, ~.3. The error probability
accumulates because of the optional stopping.
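The accumulation can be checked by simulation; the sketch below (σ = 1, a cap of 100 looks, and the fixed seed are my assumptions) estimates the probability, under H0, that the try-and-try-again rule reaches nominal .05 significance by n = 100:

```python
import random

def stops_significant(max_n=100, sigma=1.0):
    """Under H0 (mu = 0), draw Y_i ~ N(0, sigma^2) one at a time and test
    |Ybar| >= 1.96*sigma/sqrt(n) after each draw; True if the nominal
    significance threshold is ever crossed."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, sigma)
        if abs(total / n) >= 1.96 * sigma / n ** 0.5:
            return True
    return False

random.seed(1)
trials = 5000
rate = sum(stops_significant() for _ in range(trials)) / trials
# rate comes out far above the nominal .05
```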
Clearly p’ is not equal to p”, so the two outcomes are not evidentially equivalent
InfrE’(y’) is not equal to InfrE”(y”) [for an error statistician]
InfrE(y) abbreviates the inference² based on outcome y from experiment E
By contrast
InfrE’(y’) is equal to InfrE”(y”) [for one who accepts the LP]
It is more accurate to write this as something like “should be treated as equivalent",
“should not be treated as equivalent” evidentially; they are based on one or another
methodology or philosophy of inference (but I follow the more usual formulation)
² In the context of error statistical inference, this is based on the particular statistic and
sampling distribution specified by E.
Suppose you observed y” from our optional stopping experiment E” that stopped at
n = 100.
InfrE’(y’) is equal to InfrE”(y”) [for one who accepts the LP]
where y’ comes from the corresponding fixed sample size experiment with n = 100
Bayesians call this the Stopping Rule Principle (SRP).
The SRP would imply, [in the Armitage example], that if the observation in [the
case of optional stopping] happened to have n=100, then the evidentiary content
of the data would be the same as if the data had arisen from the fixed sample size
experiment (Berger and Wolpert 1988, 76).
Some frequentists argue, correctly I think, that the optional stopping example alone is
enough to refute the strong likelihood principle (Cox 1977, p. 54), since, with
probability 1, it will stop with a “nominally” significant result even though μ = 0.
It violates the principle that we should avoid misleading inferences with high or
maximal probability (weak repeated sampling principle).
In our terminology, it permits an inference with minimal severity
(The example can also be made out in terms of confidence intervals, where the rule
ensures that, with probability 1, 0 is never in the interval. Berger and Wolpert grant that
the frequentist probability that the interval excludes 0, even where 0 is true, is 1; pp.
80-81.)³
³ See EGEK, p. 355 for discussion.
SEVEN: NOW FOR THE BREAKTHROUGH
Birnbaum claims he can show that you, as a frequentist error statistician, must grant
that it is equivalent to having fixed n = 100 at the start (i.e., experiment E’)
Reminder:
The (strong) Likelihood Principle (LP) is a universal conditional claim:
If two data sets y’ and y” from experiments E’ and E” respectively, have likelihood
functions which are functions of the same parameter(s) µ
and are proportional to each other, then y’ and y” should lead to identical inferential
conclusions about µ
As with conditional proofs, we assume the antecedent and try to derive the
consequent, or equivalently, show a contradiction results whenever the antecedent
holds and the consequent does not.
For the latter:
LP Violation Pairs
Start with any violation of the LP, that is, a case where the antecedent of the LP
holds, and the consequent does not hold
Show you get a contradiction.
Assume then that the pair of outcomes y’ and y”, from E’ and E” respectively,
represent a violation of the LP. We may call them LP pairs.
Step 1:
Birnbaum will describe a funny kind of ‘mixture’ experiment based on an LP pair;
You observed y” say from experiment E”.
Having observed y” from the optional stopping (stopped say at n = 100) I am to
imagine it resulted from getting heads on the toss of a fair coin, where tails would
have meant performing the fixed sample size experiment with n = 100 from the start.
Next, erase the fact that y” came from E” and report (y’, E’)
Call this test statistic: TBB:
The Birnbaum test statistic TBB:
Case 1: If you observe y” (from E”) and y” has an LP pair in E’, just report (y’, E’)
Case 2: If your observed outcome does not have an LP pair, just report it as usual
(Any outcome from optional stopping E” has an LP pair in the corresponding fixed
sample size experiment E’)
Only case 1 results matter for the points of the proof we need to consider.
I said it was a funny kind of mixture; two things make it funny:
• First, it didn’t happen; you only observed y” from E”
• Second, you are to report the outcome as y’ from E’ even though you actually
observed y” from E” (and then report the mixture)
We may call it Birnbaumizing the result you got; whenever you have observed a
potential LP violation, “Birnbaumize” it as above
If you observe y” (from E”) and y” has an LP pair in E’, just report y’ (i.e., report
(y’, E’))
So you’d report this whether you actually observed y’ or if you got y”
We said our inference would be in the form of p-values.
Now to obtain the p-value we must use the defined sampling distribution of TBB, the convex combination:
In reporting a p-value associated with y” we are to report the average of p’ and p”:
(p’ + p”)/2.
(the ½ comes from the imagined fair coin)
Having thus “Birnbaumized” the particular LP pair that you actually observed, it
appears that you must treat y’ as evidentially equivalent to its LP pair, y”.
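The Birnbaumized report can be sketched as a small function (the tuple labels and the p-values .05 and .3 echo the text, but the function itself is my schematic, not Birnbaum's own notation):

```python
def birnbaumize(observed, p_pair):
    """T_BB report for an LP pair: the 'observed' member is deliberately
    ignored, since the report erases which experiment actually ran; either
    member yields the E' report and the convex-combination p-value
    (the 1/2 coming from the imagined fair coin)."""
    p_fixed, p_optional = p_pair        # p' from E', p'' from E''
    report = ("y'", "E'")               # same report for either member
    p_BB = (p_fixed + p_optional) / 2   # sampling distribution of T_BB
    return report, p_BB

# Premise 1: both members of the LP pair get identical Birnbaumized reports.
rep1, p1 = birnbaumize(("y'", "E'"), (0.05, 0.3))
rep2, p2 = birnbaumize(("y''", "E''"), (0.05, 0.3))
# Premise 2 would instead condition on the experiment actually run,
# reporting .05 or .3 (a different sampling distribution from T_BB's).
```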
The test statistic TBB is a sufficient statistic, technically, but the rest of the argument
overlooks that an error statistician still must take into account the sampling
distributions at each step.
At this step, it refers to the distribution of TBB.
But it changes in the second step, and that’s what dooms the ‘proof’, as we will now
see.
0. Let y’ and y” (from E’ and E”) be any LP violation pair, and say y” from E” has
been observed
(ȳ’ = .196)
1. Premise 1: Inferences from y’ and y”, using the sampling distribution of the convex
combination, are equivalent (Birnbaumization):
InfrE’(y’) is equal to InfrE”(y”) [both are equal to (p’ + p”)/2]
2. Premise 2(a): An inference from y’ using (i.e., conditioning on) the sampling
distribution of E’ (the experiment that produced it), is p’
InfrE’(y’) equals p’
Premise 2 (b): An inference from y” using (i.e., conditioning on) the sampling
distribution of E” (the experiment that produced it), is p”
InfrE”(y”) equals p”
From (1), (2a and b): InfrE’(y’) equals InfrE”(y”)
Which is, or looks like the LP!
It would follow of course that p’ equals p”!
But from (0), y’ and y” form a LP violation, so, p’ is not equal to p”.
p’ was ~.05, p” ~.3
Thus it would appear the frequentist is led into a contradiction.
The problem? There are different ways to show it, as always; here I allowed the
premises to be true.
In that case this is an invalid argument; we have all true premises and a false
conclusion.
I can consistently hold all the premises and the denial of the conclusion
1. The two outcomes get the same convex combo p-value if I play the
Birnbaumization game
2. That if I condition, the inferences from y” and y’ are p” and p’, respectively
Denial of conclusion: And p’ is not equal to p” (.05 is not equal to .3)
No contradiction.
We can put it in a valid form, but then the premises can never both be true at the
same time (though it’s not even so easy to put it in valid form; see my paper for
several attempts):
Premise 1: Inferences from y’and y” are evidentially equivalent:
InfrE’(y’) is equal to InfrE”(y”)
Premise 2(a): An inference from y’ should use (i.e., condition on) the sampling
distribution of E’ (the experiment that produced it)
InfrE’(y’) equals p’
Premise 2(b): An inference from y” should use (i.e., condition on) the sampling
distribution of E” (the experiment that produced it):
InfrE”(y”) equals p”
Usually the proofs just give the bold parts
From (1), (2a and b): InfrE’(y’) equals InfrE”(y”)
Which is the LP!
Contradicting the assumption that y’ and y” form an LP violation!
The problem now is that in order to infer the conclusion the premises of the
argument must be true, and it is impossible to have premises (1) and (2) true at the
same time:
Premise (1) is true only if we use the sampling distribution given by the convex
combinations (averaging over the LP pairs).
• This is the sampling distribution of TBB.
• Yet to draw inferences using this sampling distribution renders both (2a) and
(2b) false.
• The truth of (2a) and (2b) requires ‘conditioning’ on the experiment actually
performed, or rather, they require we not ‘Birnbaumize’ the experiment from
which the observed LP pair is known to have actually come!
See pages in handout from ERROR AND INFERENCE.
Although I have allowed premise (1) for the sake of argument, the very idea is
extremely far-fetched and unmotivated.
Pre-data, the frequentist would really need to consider all possible pairs that could
be LP violations and average over them….
It is worth noting that Birnbaum himself rejected the LP (Birnbaum 1969, 128):
“Thus it seems that the likelihood concept cannot be construed so as to allow useful
appraisal, and thereby possible control, of erroneous interpretations.”
REFERENCES
Armitage, P. (1975). Sequential Medical Trials, 2nd ed. New York: John Wiley &
Sons.
Birnbaum, A. (1962). On the Foundations of Statistical Inference (with discussion),
Journal of the American Statistical Association, 57: 269–326.
Birnbaum, A. (1969). Concepts of Statistical Evidence. In Philosophy, Science, and
Method: Essays in Honor of Ernest Nagel, edited by S. Morgenbesser, P.
Suppes, and M. White, New York: St. Martin’s Press: 112-143.
Berger, J. O., and Wolpert, R. L. (1988). The Likelihood Principle. Hayward, CA:
Institute of Mathematical Statistics.
Cox, D.R. (1977). “The Role of Significance Tests (with Discussion),”
Scandinavian Journal of Statistics, 4: 49–70.
Cox, D. R. and Mayo, D. (2010). "Objectivity and Conditionality in Frequentist
Inference" in Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability and the Objectivity and Rationality of Science, edited by
D. Mayo and A. Spanos, Cambridge: Cambridge University Press: 276-304.
Edwards, W., Lindman, H, and Savage, L. (1963). Bayesian Statistical Inference for
Psychological Research, Psychological Review, 70: 193-242.
Jaynes, E. T. (1976). Common Sense as an Interface. In Foundations of Probability
Theory, Statistical Inference and Statistical Theories of Science, Volume 2,
edited by W. L. Harper and C. A. Hooker, Dordrecht, The Netherlands: D.
Reidel: 218-257.
Joshi, V. M. (1976). “A Note on Birnbaum’s Theory of the Likelihood Principle.”
Journal of the American Statistical Association 71, 345-346.
Joshi, V. M. (1990). “Fallacy in the Proof of Birnbaum’s Theorem.” Journal of
Statistical Planning and Inference 26, 111-112.
Lindley, D. V. (1976). Bayesian Statistics. In Foundations of Probability Theory,
Statistical Inference and Statistical Theories of Science, Volume 2, edited by
W. L. Harper and C. A. Hooker, Dordrecht, The Netherlands: D. Reidel: 353-362.
Mayo, D. (1996). Error and the Growth of Experimental Knowledge. The
University of Chicago Press (Series in Conceptual Foundations of Science).
Mayo, D. (2010). "An Error in the Argument from Conditionality and Sufficiency
to the Likelihood Principle." In Error and Inference: Recent Exchanges on
Experimental Reasoning, Reliability and the Objectivity and Rationality of
Science, edited by D. Mayo and A. Spanos, Cambridge: Cambridge University
Press: 305-314.
Mayo, D. and D. R. Cox. (2011) “Statistical Scientist Meets a Philosopher of
Science: A Conversation with Sir David Cox.” Rationality, Markets and
Morals (RMM): Studies at the Intersection of Philosophy and
Economics. Edited by M. Albert, H. Kliemt and B. Lahno. An open access
journal published by the Frankfurt School Verlag. Volume 2 (2011): 103-114.
http://www.rmm-journal.de/htdocs/st01.html
Mayo D. and A. Spanos, eds. (2010). Error and Inference: Recent Exchanges on
Experimental Reasoning, Reliability and the Objectivity and Rationality of
Science, Cambridge: Cambridge University Press.
Pratt, J. W., H. Raiffa, and R. Schlaifer. (1995). Introduction to Statistical
Decision Theory. Cambridge, MA: The MIT Press.
Savage, L., ed. (1962a), The Foundations of Statistical Inference: A Discussion.
London: Methuen & Co.
Savage, L. (1962b), “‘Discussion on Birnbaum (1962),” Journal of the American
Statistical Association, 57: 307–8.
ⁱ In so-called behavioristic contexts, the concern is controlling errors in a long-run series, or long-run
reliability; but in scientific contexts, we use error probabilities to quantify the capability of a given
method to have discerned a flaw or error in some hypothesis (a claim that is correct or incorrect about some
aspect of the data generating phenomenon). Since the probabilities attach to methods for discerning errors,
they are called error probabilities, and I refer to an error-statistical approach.
ⁱⁱ We think this captures the generally agreed upon meaning of the LP, although statements may be found
that seem stronger. For example, in Pratt, Raiffa, and Schlaifer (1995):
If, in a given situation, two random variables are observable, and if the value x of the first and the value y
of the second give rise to the same likelihood function, then observing the value x of the first and
observing the value y of the second are equivalent in the sense that they should give the same inference,
analysis, conclusion, decision, action, or anything else. (Pratt, Raiffa, Schlaifer 1995, 542; emphasis
added)