Frequentism as a positivism: a three-tiered
interpretation of probability
Shivaram Lingamneni
July 29, 2013
Abstract
I explore an alternate clarification of the idea of frequency probability, called frequency judgment. I then distinguish three distinct senses
of probability — physical chance, frequency judgment, and subjective
credence — and propose that they have a hierarchical relationship.
Finally, I claim that this three-tiered view can dissolve various paradoxes associated with the interpretation of probability.
1 Introduction

1.1 Frequentism and its challenges
Frequentism means, more or less, that probabilities are ratios of successes to
trials. It originates with John Venn and is arguably the first philosophically
rigorous account of probability — that is to say, it is the first account of
probability to appear as an attempt to correct a philosophically inadequate
pre-theoretic view. As Alan Hájek has observed, however, it has fallen on
hard times. In part, this is because it competes with the Bayesian interpretation of probability, in which probabilities are subjective degrees of belief.
Bayesianism offers a seductive unifying picture, in which epistemology and
decision theory can both be grounded in a quantitatively precise account of
an agent’s attitudes and propensities. But frequentism’s philosophical difficulties are not simply due to its being outshone by a competing view. As
Hájek has shown, frequentism itself faces a variety of vexing challenges.
Hájek reconstructs frequentism as containing two distinct conceptions of
probability — finite frequentism, in which probabilities are actual real-world
ratios of successes to trials, and hypothetical frequentism, in which they are
limiting relative frequencies over an idealized hypothetical infinite sequence of
trials. In a series of two papers [1996, 2009], he shows that each conception
is affected by numerous difficulties: in fact, each paper gives 15 distinct
objections to one of the conceptions!
In order to motivate what follows, I’ll briefly summarize what I consider
the most pressing of Hájek’s objections against each characterization. Finite
frequentism is intuitively appealing because of its metaphysical parsimony;
probabilities can be “read off” from the actual history of real-world events,
without the need to posit any unobservable entities. But taken literally, it
clashes with many of our important intuitions about probability. In particular, it is a kind of operationalism about probability, and hence suffers from
similar problems to other operationalisms. If we consider probability to be
defined by real-world frequency, then we seemingly have no way to express the idea that an observed frequency might be aberrant, just as defining
temperature to be thermometer readings leaves us with no way to express
the idea that our thermometers may be inaccurate. This problem becomes
especially serious when we consider cases where the number of real-world
trials is very small — in particular, if there is only 1 trial, then the finite
frequency probability must be either 0 or 1, and if there have been no trials
yet, then it is undefined. Finite frequentism is in conflict with our intuitions
that actual trials constitute evidence about probability rather than its actual
substance.
Hypothetical frequentism answers this concern perfectly, but at far too
high a metaphysical cost. In particular, asserting the existence of an infinite
sequence of trials seems to involve an “abandonment of empiricism.” In the
real world, we cannot perform an infinite sequence of trials, so the meaning
ascribed to probabilities is evidently counterfactual. Even after granting this,
what kind of counterfactual are we dealing with? If we analyze it using a
possible-world semantics, in the style of Stalnaker or Lewis, we seemingly
require a possible world that (at the very least) violates the conservation of
mass-energy. Why should we believe that probabilities in this world have
anything to do with ours?
Finally, the following objection is commonly advanced against both conceptions of frequentism: frequentism entangles the probability of any individual event E with the question of what will happen to other, similar events.
We cannot make frequentist sense of the probability of E without assigning
it to some broader reference class of events, over which we will be able to
define a ratio of successes to trials. But at this point, P (E) will be a property
of the reference class, not of E itself. This objection is already troubling, but
it has even more teeth in cases when there are multiple possible reference
classes, each yielding a distinct value of P (E), or perhaps no reference class
at all. This is the so-called “reference class problem”, and it is another, crucial sense in which frequency notions of probability diverge from our ordinary
understanding of the word.
1.2 Where to?
I am a frequentist. What sort of frequentist am I? Of the two varieties
distinguished above, I am much more sympathetic to finite frequentism; the
metaphysical costs of infinite hypothetical sequences are too much for me
to bear. In fact, I think that finite frequentism, properly expounded, can
actually escape many of the criticisms Hájek levels at it — perhaps eight
out of fifteen. But I cannot deny the force of Hájek’s overall arguments,
and I think it inevitable that I must give some ground. Specifically, I think
an adequate analysis of probability must both seek a third way of defining
frequency probability and also acknowledge that not all probabilities are
frequency probabilities. Here are some of my desiderata for such an expanded
conception:
1. It should preserve core frequentist intuitions that relative frequency
is an essential component of probability. In particular, it should not
conflate probabilities that have an intuitively acceptable frequency interpretation (e.g., the probability that a U.S. quarter, when flipped,
will land heads) with those that do not (e.g., the probability referenced
in Pascal’s wager that God exists).
Indeed, the primary goal of this paper is to propose and defend a definition of frequency probability that is both reasonably rigorous and
free from paradox, in hopes that it will enable epistemological views in
which frequency probability has a privileged status.
2. It should not take a stance on the existence of physical chance (something which poses problems for both frequentist and Bayesian accounts
of probability). I think that a proper resolution of this question rests
on questions external to the philosophy of probability, in particular on
the philosophy of physics, and that consequently it is an advantage for
an account of probability to remain agnostic on the question.
3. It should not deny the validity of the Bayesian interpretation of probability outright. As Jaynes [1985] remarked, arguing in the reverse
direction:
I do not “disallow the possibility” of the frequency interpretation. Indeed, since that interpretation exists, it would
be rather hard for anyone to deny the possibility of it. I do,
however, deny the necessity of it.
Indeed, while I consider myself a frequentist, I affirm the value of
Bayesian probability, both its technical validity as a consistent interpretation of the laws of probability and as the correct solution to certain
epistemological problems such as the preface paradox. My skepticism is
confined to claims such as the following: all probabilities are Bayesian
probabilities, all knowledge is Bayesian credence, and all learning is
Bayesian conditionalization. I will say more about this later.
4. At the level of statistical practice, it should support a methodological
reconciliation between frequentist and Bayesian techniques. That is to
say, it should acknowledge that in practice both methods are effective
on different problems, independently of the philosophical debate. Kass
[2011] calls this viewpoint “statistical eclecticism” and Senn [2011] calls
it “statistical pragmatism”.
5. Thus, it is necessary for it to preserve the distinction between frequentist and Bayesian methods, that is to say, between methods that make
use only of probabilities that have a natural frequency interpretation
and those which make use of prior probabilities that do not. Otherwise,
frequentist and Bayesian methods are collapsed into a single group, in
which frequentist methods appear merely as oddly restricted Bayesian
methods.
Without further ado, I will introduce an account of probability that I
believe will fulfill all these criteria. The argument will necessarily detour
through many philosophical considerations related to probability. The reader
who is pressed for time should look at sections 2, 4, and 5.
1.3 Precedents for the view
The closest historical precedent I am aware of for my view is Carnap’s distinction [1945] between two senses of probability: Probability1 , which describes
credence or degree of confirmation, and Probability2, which describes long-run relative frequency over a sequence of trials. In particular, he makes the
following parenthetical remark about Probability2 :
I think that, in a sense, the statement 'c(h, e) = 2/3' itself
may be interpreted as stating such an estimate; it says the same
as: “The best estimate on the evidence e of the probability2 of
M2 with respect to M1 is 2/3.” If somebody should like to call
this a frequency interpretation of probability, I should have no
objection.
My view differs substantially from Carnap’s in almost all respects — in
particular, I will not make use of the notion of logical probability that he
advocated. Nevertheless, I will interpret this remark as Carnap’s blessing.
2 The theory
Three conceptually distinct interpretations of probability suffice to describe
all uses of probability. They are arranged in a tiered hierarchy as follows:
1. Physical chance, if it exists. This is the only objective and metaphysically real kind of probability.
2. Frequency judgments. Pending a more precise motivation and definition, the core idea is this: given an event E, a frequency judgment
for E is a subjective estimate of the proportion of times E will occur
over an arbitrarily large (but finite) sequence of repeated trials. This
is intended as a frequency interpretation of probability, i.e., one that
can replace finite and hypothetical frequentism.
3. Bayesian subjective probability in the sense of Ramsey and de Finetti.
Probabilities pass “downwards” along this hierarchy in the following sense:
1. If an agent knows a physical chance (and no other relevant information),
that agent is obliged to have a frequency judgment coinciding with the
physical chance.
2. If an agent has a frequency judgment (and no other relevant information), that agent is obliged to have a Bayesian subjective probability
coinciding with the frequency judgment.
Thus, as we pass down the hierarchy, the domain of applicability of the
interpretations strictly increases. In particular, the conjunction of the two
relations yields a large fragment of (possibly all of) Lewis’s Principal Principle.
3 The first tier: physical chance
Lewis [1994] defines chance as “objective single-case probability”, which does
an excellent job of explaining why chance is so vexing for both frequentists
and Bayesians. For one, a chance is a probability that we intuit as being
objectively real, which is at odds with radical Bayesian subjectivist accounts
in which all probabilities are agent-relative and have to do with dispositions
to act. Thus, it is typical for Bayesians to accept chances, when they exist,
as an additional constraint on belief beyond that of simple consistency, in the
form of Lewis’s Principal Principle. This principle has varying formulations,
but the rough idea is that if an agent knows the chance of an event E, and
they have no other relevant information, they should set their credence in E
to be the same as the chance.
But chance is also problematic for frequentists because of the intuition
that they exist in the single case — a chance seems no less real despite
only being instantiated once, or perhaps not at all. Lewis gives the memorable example of unobtainium, a radioactive heavy element that does not
occur in nature, but can only be produced in a laboratory. One of the isotopes, Unobtainium-366, will only ever be instantiated as two atoms. The other,
Unobtainium-369, will never be instantiated at all (perhaps due to budget
cuts). In the case of Unobtainium-366, we intuit that the true half-life of the
isotope (phrased equivalently in terms of probabilities, the objective chance
of decay within a particular fixed time period) may be something quite different from anything we might generalize from our two observed data points.
In the case of the heavier isotope, we have no data points at all to go on. So
there is a conflict with any frequentism that insists that probabilities are always synonymous with actual frequencies, or can always be straightforwardly
extrapolated from them.
But this is not yet the whole story about why chance is problematic.
There are two rather different senses in which physical chance appears in
accounts of probability. One is the existence of physical theories, for example the Copenhagen and objective collapse interpretations of quantum
mechanics, in which reality itself is nondeterministic and thus the existence
of chances is a physical and metaphysical fact about the universe. But the
other is when a physical phenomenon appears, on empirical grounds, to have
irreducibly probabilistic behavior. Radioactive decay is one example, but
another particularly intriguing case, appearing in Hoefer [2007] and Glynn
[2010], is Mendelian genetics, e.g., the probability that two carriers of a recessive gene will have a child in whom the gene is expressed.
Thus we encounter a dispute in the literature: is the existence of physical
chance compatible with a deterministic universe? One intuitive answer is
no: if the course of events is determined, then chance is annihilated and the
chance of any individual event E is 1 if it deterministically occurs and 0 if
it does not. This was the view of Popper and Lewis and it has continuing
defenders, in particular Schaffer [2007].
However, other authors defend the idea that a deterministic universe
could exhibit chance. For example, Lewis wanted chance to supervene (in
a Humean sense) on past, present, and future spatiotemporal events, rather
than existing as a distinct metaphysical property. He accomplished this via
the so-called “best-system analysis”, on which considerations such as symmetry or extrapolations from related systems can be chancemakers beyond mere
sequences of events. Although Lewis himself believed chance to be incompatible with determinism, nothing about such an analysis requires indeterminism
and it can support a compatibilist account of chance, as in Hoefer and Eagle
[2011]. Glynn also defends deterministic chance, but he is motivated instead
by the existence of probabilistic scientific laws, such as Mendelian genetics
or statistical mechanics, that would hold even in a deterministic universe.
Thus, he is essentially making an indispensability argument; if chance is essential to our understanding of the laws of Nature, then we are not justified
in denying its existence due to metaphysical qualms.
It follows that the question of whether chance exists is undecided. If you believe the Copenhagen interpretation of quantum mechanics, then measuring a quantum superposition such as √2/2 (|0⟩ + |1⟩) yields either 0 or 1, each with probability 1/2, and the outcome is not determined in any sense before
the measurement. This is then a source of objective randomness and fulfills
the criteria for physical chance. If you are undecided about quantum mechanics, but believe Glynn’s arguments about chances from laws, then there
is still an objective chance of whether two heterozygous parents will have a
homozygous child. But if you believe the de Broglie-Bohm interpretation of
quantum mechanics, in which reality is deterministic, and you also endorse
Schaffer’s denial of deterministic chance, then there are no nontrivial physical
chances.
My purpose in proposing physical chance as the “highest” interpretation
of probability is not to adjudicate the question of whether chance exists, and
if so, what exactly it is.1 Rather, I am offering people with different views of chance a blank check which they can fill in with their preferred conception. The proper interpretation of quantum mechanics is a question for physicists and philosophers of physics; whether Glynn's argument is correct seems to hinge, like other indispensability arguments, on deep questions about whether scientific practice justifies scientific realism. Separating chance from other notions of probability lets us separate these questions from the debate about what probability itself means.

1. In passing, I do have some sympathy towards the idea of deterministic chance, in particular for microphysical events. For example, measuring √2/2 (|0⟩ + |1⟩) produces an apparently random sequence of 0s and 1s, no matter what interpretation of quantum mechanics one favors, and there seems to be a fine case for such a phenomenon exhibiting chance. I become increasingly skeptical as this argument is extended upwards to macrophysical phenomena, such as genetics. I am also unimpressed with the best-system analysis as such, which strikes me as a confusion of metaphysics with epistemology. But this is a digression from my main argument.

4 The second tier: frequency judgments

My characterization of frequency probabilities will rest on two primitive notions. One is that of a reference class: a reference class is simply a description that picks out a class of events. In the typical case, a reference class will preferably satisfy some other criteria, for example Salmon's [1971] notion of homogeneity: that there is no additional criterion, or "place selection function", that picks out a subclass with substantially different properties.
However, my discussion here will not impose any such additional requirements. One of the strengths of probabilistic analysis is that it can be applied
to data that are not “genuinely random” in any meaningful sense — in an
extreme but instructive case, the output of a deterministic pseudorandom
number generator. If the analyst considers the data to defy a deterministic analysis, or just that they can benefit from a probabilistic one, that is
sufficient.
The second primitive notion is that of epistemically independent events;
this is a kind of pre-theoretic counterpart to the idea of mutual independence.
Events are epistemically independent when knowing the outcome of some
does not tell us anything useful about the outcome of any other.
This is a subjective notion relative to the agent’s knowledge and needs; in
particular it is not necessary that the events, should they have objective
chances, have probabilistically mutually independent chances, or that the
agent take into account all available evidence about how the events might be
related.
Definition 1. Given an event E and a reference class R for it, an agent A’s
frequency judgment for E is a real number p ∈ [0, 1], representing a subjective
estimate of the proportion of times E will occur over an arbitrarily large (but
finite) sequence of epistemically independent trials in the chosen reference
class R.
Having a frequency judgment of p for E is a sufficient condition to model
E as being drawn I.I.D. (independently and identically distributed) from the
Bernoulli distribution with parameter p. That is to say, in intuitive terms,
we can model E in the same way as we would model flips of a coin with
bias p. This is not to say that we model E as such a coin — this would be
a circularity, since we need the definition of frequency judgment to clarify
what it means for the coin to have long-run bias! Rather, each situation has
a natural representation as a Kolmogorov-consistent probabilistic model, and
the resulting models are in fact the same.
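As a side illustration (mine, not the author's formalism), the modeling claim above can be made concrete in a few lines of Python; the particular value p = 0.3 and the number of trials are arbitrary assumptions of the example.

```python
import random

# A minimal sketch: holding a frequency judgment of p for E licenses modeling
# repeated trials of E as i.i.d. Bernoulli(p) draws, whose observed proportion
# of successes tracks the judged proportion p over a large run.
random.seed(0)
p = 0.3          # hypothetical frequency judgment for some event E
trials = 100000
successes = sum(random.random() < p for _ in range(trials))
print(successes / trials)  # close to 0.3
```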
In order for estimates of this kind to make sense, we require a clear
conception of the reference class R supporting an arbitrarily large number
of trials. The motivation for this is clear: we can toss a coin an arbitrary
number of times to clarify the relative frequency of heads, but we cannot
repeat a one-off event such as the 2000 U.S. presidential election to examine
any probabilistic variability in its results. Looking back to our discussion
of chance, all the chance-like physical phenomena we discussed (quantum
measurements, radioactive decay, and Mendelian genetics) admit frequency
judgments, even if they are excluded by a specific account of chance. Even
the decay of Unobtainium-369, the element that will never be instantiated,
admits one because we have a clear and unambiguous conception of what
it would mean to synthesize its atoms and measure the incidence of decay.
Thus, the existence of this intermediate interpretation of probability — less
objective than physical chance, but more so than Bayesian credence — should
soften the blow of deciding that some chance-like phenomena do not genuinely
exhibit chance.
4.1 Invariance under averaging
There are some formal difficulties with the definition of frequency judgment.
What does it mean to have a non-integer estimate of the number of times
E will occur over an integer-long sequence of trials? And why, if frequency
judgments are estimates of proportions over finite sequences, is it possible for
them to take on irrational values?2 I think the natural resolutions of these
problems succeed, but it is not entirely obvious that they succeed honestly;
one might suspect that they are parasitic on a prior, unexplained concept of
probability or expected value. So I will give a brief argument to justify that
real-valued proportions are sensible as frequency judgments.
The intuition is this. Consider someone who can give integer-valued estimates of the number of successes over n trials, for arbitrary n. We ask him
for his estimate of the number of successes over a single trial, and he tells us
either 0 or 1. Now we ask him, “if you repeated that single trial 10 times,
then averaged the number of successes over the 10 repetitions, what would
you estimate the average to be?” Because epistemic independence implies
that there is no difference between a 10-trial block and 10 1-trial blocks, he
should give us his estimate of the number of successes over 10 trials, divided
by 10: this will be the first decimal digit of his real-valued frequency judgment. We can continue this process to elicit more digits, or we can simply
ask him to “tell us the averages first,” rather than bothering with the integer
estimates. Formally:
Definition 2. Given an event E and a reference class R for it, an agent A’s
frequency judgment scheme for E is a map f : N → R, such that f (n) is a
subjective estimate of the number of times E will occur over n epistemically
independent trials of R. Evidently, f (n) ∈ [0, n] for every n.
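To make the digit-elicitation idea above concrete, here is a small sketch of my own; the hidden value p and the oracle that rounds p·n to the nearest integer are assumptions of the illustration, not part of the definition.

```python
# An agent who only reports integer estimates of successes over n trials can
# still reveal a real-valued frequency judgment: ask about ever larger blocks
# and divide the reported estimate by the block size.
p = 0.3417                      # hypothetical real-valued frequency judgment

def integer_estimate(n):
    return round(p * n)         # integer-valued estimate of successes over n trials

for n in (1, 10, 100, 10000):
    print(n, integer_estimate(n) / n)   # 0.0, 0.3, 0.34, 0.3417
```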
So at this point, we are considering both frequency judgments in the
original sense, but also schemes that make integer predictions for every n.
2. This is Hájek's 14th criticism of finite frequentism. There I think it succeeds to some extent — unlike frequency judgments, there is an essential sense in which actual frequencies are rational numbers. Of course, one could argue for the use of real numbers there too, as an idealizing assumption that enables the use of continuous mathematics.
But now we impose another criterion: f should be invariant under averaging.
In other words, let us say that f estimates that if we do n trials, we will have
s successes. We should also estimate that if we do 2n trials and then divide
the number of successes by 2, we should get s. In other words, we should have f (2n)/2 = f (n). In general, for any a ∈ N, our estimate should be invariant under averaging over a repetitions of the trial, i.e., f (an)/a = f (n). But this implies that f should satisfy f (an) = af (n) for any a ∈ N. Now, fix some n and let p = f (n)/n; clearly p is a real number in [0, 1]. For any m ∈ N, nf (m) = f (mn) = mf (n) = mpn. Dividing by n, we get that f (m) = pm
for all m. We have shown that frequency judgment schemes that are invariant under averaging are necessarily frequency judgments, i.e., real-valued
proportions.
Mathematically speaking, this argument is trivial; its significance is that
we appealed only to a notion of averaging over arbitrary repetitions, without
any circular appeal to probability or expected value. Furthermore, I think
this argument yields two important clarifications of the idea of frequency
judgment:
1. The concept of invariance under averaging gives rise to a simple notion of “long-run relative frequency” without appealing to an infinite
sequence of trials. Thus the frequency judgments interpretation appropriates some of the benefits of hypothetical frequentism as analyzed by
Hájek, without having to carry any of its metaphysical baggage.
2. If f is invariant under averaging, then f (n) = nf (1). Thus, in some
sense f “views” every individual trial as contributing a fractional success f (1) ∈ [0, 1] to the total estimate of successes. This is what justifies
modeling events that admit a frequency judgment as I.I.D. Bernoulli
trials.
A concern remains: why is it sensible for p to take on irrational values?
The key is that the reals are Archimedean, i.e., for any two distinct reals r1, r2, we have |r1 − r2| > q for some positive rational q. It follows that over a sufficiently large
integer number of trials, any two distinct reals constitute distinguishable
frequency judgments; their estimates of the number of successes will vary by
at least one whole trial. For example, consider the irrational-valued frequency
judgment π/4 ≈ .785398. Is this judgment identifiable with any rational-valued
approximation of it, e.g., .785? It is not, because over 100000 trials, they
predict quite different things.
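A quick worked check (not from the paper) of the claim about distinguishability; the numbers below are just the arithmetic spelled out.

```python
import math

# The frequency judgments pi/4 and 0.785 agree to three decimal places, yet
# over 100000 trials they predict success counts that differ by about 40
# whole trials, so they are empirically distinguishable judgments.
n = 100000
for p in (math.pi / 4, 0.785):
    print(p, "predicts about", p * n, "successes over", n, "trials")
# pi/4 * 100000 is roughly 78539.8, while 0.785 * 100000 is 78500.0
```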
At this point, one might take issue with the idea that arbitrary-precision
real numbers are distinguishable in this way. Surely, at some point, the
number of trials required to make the distinction is so large that the heat
death of the universe will come first? I appreciate this concern, but I don’t
think it’s specific to probability — it seems akin to the idea that instead
of modeling time as real-valued quantities of seconds, we should model it as
integer multiples of the Planck time. There may be a bound on the resolution
of reality, but it is methodologically convenient to represent it as unbounded.
5 Characteristics of frequency judgments

5.1 Caveats
It is problematic to claim that frequency judgments are in fact a frequency
interpretation of probability, and I do not wish to paper over the difficulties.
This conception is a substantial retreat from the classical frequentism of
Reichenbach and von Mises. In particular:
1. A frequency judgment is not "made by the world"; it is not directly
derivable from any actual past history of trials (as in the case of finite
frequentism), the past and future history of the world (as in some
cases of Lewis’s supervenience account), or any objective or universal
conception of an idealized hypothetical sequence of trials (as in the
analogous case of hypothetical frequentism).
2. A frequency judgment is explicitly relative to both an agent, because it
is a subjective estimate, and to a reference class. These relativizations
may look like reluctant concessions to realism, but in my opinion they
are features, not bugs — they capture essential indeterminacies that
must be part of any positivist account of probability. I will say more
about both relativizations below.
3. A frequency judgment need not pertain to events that are truly “random” in any sense. Deterministic phenomena that are too difficult to
analyze with deterministic methods (such as the operation of a pseudorandom number generator), when analyzed probabilistically, can be
classed at this level of the hierarchy. Thus, von Mises’s analysis of
randomness by means of the notion of Kollektiv (an idealized infinite
random sequence with certain desirable mathematical properties) is not
relevant.
4. The notion of frequency judgment is intended as a conceptual analysis
of probability — it is an attempted elucidation of what is meant by
statements such as "the probability of flipping a U.S. quarter and getting heads is 1/2," or "the probability of a Carbon-14 atom decaying in 5715 years is 1/2." It does not follow from this that an agent's frequency
judgments are necessarily a completed totality and form a σ-algebra
obeying the Kolmogorov axioms.
A frequency judgment is not necessarily part of any global probability
distribution, even one relative to a particular agent; it is created by an
act of model-building and can be revised arbitrarily in ways that do
not correspond to conditional update.
5.2 Relativization to reference classes
Frequency judgments are explicitly relativized to reference classes. Does this
mean that they cannot be an analysis of probability simpliciter? Concerning
this question, I endorse the argument by Hájek [2007] that in fact, every
interpretation of probability is affected by a reference class problem, and thus
explicit relativization to reference classes is needed to dissolve an intrinsic
ambiguity.
I will briefly sketch Hájek’s argument as it applies to Bayesian subjective
probability. According to the most radical accounts of subjective credence,
there are no constraints on credence besides mere consistency. But intuitively,
such a view is unsatisfying because it does not enforce any kind of relationship between one’s beliefs and reality. Hájek gives the following memorable
example:
The epistemology is so spectacularly permissive that it sanctions
opinions that we would normally call ridiculous. For example,
you may assign probability 0.999 to George Bush turning into a
prairie dog, provided that you assign 0.001 to this not being the
case.
Thus it seems necessary to admit additional constraints on belief — for
example, Lewis’s Principal Principle, in which beliefs must coincide with
known chances, or Hacking’s Principle of Direct Probability, in which they
must coincide with observed relative frequencies. But external “testimony”
of this kind is, by its nature, subject to a reference class problem. Consider
the following case: John is 60 years old, a nonsmoker, and previously worked
with asbestos. We have statistics for the incidence of lung cancer in 60-year-old nonsmokers and 60-year-olds with asbestos exposure, but we have
no statistically significant data concerning the intersection of those groups.
What should our credence be that John will develop lung cancer? We might
pick the first rate, or the second, or try to interpolate between them, but
implicit in any of these decisions is a statement about what reference class
is to be preferred.
Hájek’s conclusion from this analysis is that we need new foundations for
probability; he considers the true primitive notion of probability to be conditional probability, where the assignment of the event to a reference class
is part of the proposition being conditioned on. That is to say, instead of
considering P (A), Hájek thinks we should be looking at P (A | A ∈ R), where
R is a reference class. I think that the frequency judgments interpretation,
in which the reference class is part of the definition of (unconditional) probability, is a more natural way of addressing this issue, and one that allows us
to retain our existing foundations. I discuss this question further in section
10.3.
5.3 Relativization to agents
The frequency judgments interpretation makes no reference to infinite sequences or possible worlds; it relies only on the conceivability of performing
additional representative trials. Thus, its closest relative in terms of metaphysical commitments is finite frequentism. But frequency judgments are
quite unlike finite frequencies in that they are agent-relative; two different
agents can have two different frequency judgments, even after they come to
agreement about a reference class. I will try to motivate this with a simple
case study. Consider the case of a coin that has been flipped 20 times and
come up heads 13 times. Is an agent constrained, on the basis of this data, to
have any particular estimate of the proportion of heads over a long sequence
of trials? Intuitively, the answer is no; a variety of beliefs about the coin’s
long-run behavior seem perfectly well justified on the basis of the data.
The ambiguities in estimation begin with the reference class problem.
One reading of finite frequentism is that we must assign P (H) to be 13/20, the ratio of actual successes to actual trials. This could be quite reasonable in some circumstances, e.g., if the coin seems notably atypical in some way; however, to say that finite frequentism requires this value is to do it an injustice. A finite frequentist might also say that the reference class provided by the sample is deficient because of its small size, and choose instead the reference class of all coinflips, yielding a P (H) of 1/2, or rather, negligibly distant from 1/2. But the spectrum of choices does not end there.
The maximum likelihood estimate of the probability of an event E is s/n, the ratio of successes to trials; this is a frequentist estimator in the sense that it does not involve the use of prior probabilities. As such, it coincides with the first reading of finite frequentism and estimates P (H) to be 13/20, but it would be a mistake to identify the two perspectives. Rather, the maximum likelihood estimate is the value of P (H) under which the observed data are most probable; this is not an ontological attribution of probability but explicitly an estimate. As such, it competes with Bayesian estimators such as the Laplace rule of succession, which begins with a uniform prior distribution over the coin's biases and conditions repeatedly on each observed flip of the coin. The resulting posterior distribution is the beta distribution β(s + 1, n − s + 1); to get the estimate of the posterior probability, we take its expected value, which is (s + 1)/(n + 2) = 14/22.
Since the rule of succession is derived from a uniform prior over the coin’s
biases, a different Bayesian might use a different prior. For example, using
a prior that clusters most of the probability mass around 1/2, such as β(n, n) for large n, will produce an estimate arbitrarily close to 1/2. But on a different note entirely, another frequentist might start with a null hypothesis that the coin is fair, i.e., P (H) = 1/2, then compute the p-value of the observed data to be 0.26 and accept the null hypothesis, retaining the estimate of 1/2.
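For concreteness, here is a minimal sketch (my own, in Python) of how these competing estimates can be computed from the same data of 13 heads in 20 flips; the exact binomial computation of the p-value is an implementation choice, not something the text prescribes.

```python
from math import comb

s, n = 13, 20                # 13 heads observed in 20 flips

mle = s / n                  # maximum likelihood estimate: 0.65
laplace = (s + 1) / (n + 2)  # Laplace rule of succession (uniform prior): 14/22

# Two-sided p-value for the null hypothesis P(H) = 1/2, computed directly
# from the binomial distribution.
def binom_pmf(k, n, p=0.5):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_value = 2 * sum(binom_pmf(k, n) for k in range(s, n + 1))

print(mle, laplace, p_value)  # 0.65, ~0.636, ~0.26
```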
None of these answers is prima facie unreasonable — even though they
differ considerably in methods and assumptions, they are all legitimate attempts to answer the question, “if this coin is flipped a large number of times,
what proportion of the flips will be heads?” I am therefore rejecting Carnap’s suggestion that there should in general be a best estimate of long-run
frequency from the data. We will have to live with a multiplicity of frequency
judgments, because room must be left for legitimate differences of opinion
on the basis of data.
5.4 Frequentism as a positivism
Given all this, why are frequency judgments still a frequency interpretation
of probability? I think they preserve the content of frequentism in two important senses. First, their definition depends essentially on the notion of
repeated trial. If there is no conception of a reference class of trials, then
there can be no frequency judgment. Thus, frequency judgments reflect the
intuition that there is no way to make frequentist sense of probability claims
about one-off events.
More crucially, even though frequency judgments are not objective, they
are directly falsifiable from empirical data. Consider the example in the previous section: on the basis of observing 13 heads over 20 trials, we considered
a range of different frequency judgments about P (H) to be valid. But no
matter what value we chose, we have a clear conception of how to further
clarify the question: we need to flip the coin more times and apply some
statistical test that can differentiate between the different judgments.
For example, consider the case of someone whose frequency judgment
for P (H) is 2/5. If we go on to flip the coin 1000 times and get 484 heads, then (using the normal approximation to the binomial) our observed result is 5.42 standard deviations from the mean of 400 heads predicted by their hypothesis, which yields a p-value on the order of 10⁻⁸. This is so highly improbable that we may consider the frequency judgment of 2/5 to have been
falsified. This is not to say that the much-maligned p-value test is the gold
standard for the falsification of frequency judgments; likelihood ratio tests
can be used to achieve the same results. If two agents can consense on a
reference class for E, they can settle whose frequency judgment for P (E) is
correct.
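As a hedged illustration (not part of the original argument), the test just described can be reproduced in a few lines; the normal approximation and the two-sided tail are my own choices here.

```python
import math

# A frequency judgment of 2/5 for heads, confronted with 484 heads in 1000 flips.
p0, n, observed = 0.4, 1000, 484
mean = p0 * n                          # 400 expected heads
sd = math.sqrt(n * p0 * (1 - p0))      # about 15.49
z = (observed - mean) / sd             # about 5.42 standard deviations
p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability, about 6e-8
print(z, p_value)                      # the judgment of 2/5 is falsified
```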
This explains the intuition that frequency probabilities are objective. If
there is a large, robust body of trials for an event E (such as coin flipping),
then any frequency judgment that is not extremely close to the observed
finite frequency is already falsified. Thus, for events such as “a flipped U.S.
quarter will land heads", our expected frequency judgment (in this case 1/2)
is very nearly objective.
How essential are repeated trials to this idea of probabilistic falsification?
Indeed, it is possible for a Bayesian probability for a one-off event to be
falsified, in the cases when that probability is very large or very small. For
example, if an agent makes the subjective probability assignment P (E) =
.00001, and then E in fact comes to pass, then the agent’s assignment has
been falsified in much the same sense as we discussed above. But if E is
one-off, a credence like P (E) = 0.5 that is far away from any acceptable
threshold of significance cannot be falsified. The event E will either occur
or fail to occur, but neither of these will be statistically significant. Such a
Bayesian credence lacks any empirical content.
In this sense, the definition of frequency judgment is an attempt to recover
the purely positivist content of frequentism. The metaphysical aspect of
frequentism, in which probabilities are inherently real and objective, has
been deferred to the level of chance. Inasmuch as Bayesian credences are
purely matters of personal opinion, without empirical content, they are also
deferred to another level.
5.5 Calibration
To remedy this, the literature on Bayesianism proposes the notion of calibration: a Bayesian agent is calibrated if 1/2 of the events he assigns credence 1/2 to come to pass, and so on.3 Calibration does seem to restore empirical content to single-case subjective probability assertions — intuitively, given a one-off event E, a subjective declaration that P (E) = 1/2 is more empirically
justified coming from an agent with a strong history of calibration than from
one without one. The problem is that calibration, as a norm on subjective
agents, represents a substantial compromise of the Bayesian view, so much so
that it cannot be taken to save the original notion of subjective probability
from these criticisms.
Firstly, as Seidenfeld [1985] observes, calibration is straightforwardly dependent on a notion of frequency probability, and what that notion is requires
explication. In what sense are we to interpret the statement that 1/2 of the events will come to pass? Seidenfeld considers finite-frequentist ("1/2 of these events have historically come to pass") and hypothetical-frequentist ("the long-run relative frequency of these events coming to pass is 1/2") readings of this claim and rejects them, for reasons akin to the difficulties Hájek sees with these interpretations in general.4

3. To solve the problem of sparseness, it is common to discretize or "bucket" the credences, e.g., by including also the events which were assigned credences in [.45, .55].
Can we make sense of calibration under the tiered interpretation? In fact,
an assertion of calibration has a straightforward interpretation as a frequency
judgment: the agent is taking the class of events she assigns subjective probability 1/2 to be a reference class, and then making a frequency judgment of 1/2 for that class. This is an empirical assertion, subject to confirmation or disconfirmation in the manner discussed in the previous section. However,
this notion of confirmation is a property not of the single case, but of the
class of predictions as a whole.
Secondly, just as calibration inherits the problems of definition that affect
frequency probability, it also inherits a reference class problem. For example,
van Fraassen [1983] gives the following surefire technique to achieve calibration: make 10 predictions, on any subject, with probability 1/6. Then, roll a fair die 1000 times, predicting an outcome of 1 each time with probability 1/6. At the end of this, you will (with high objective probability) be calibrated, in the sense that almost exactly 1/6 of your predictions with probability 1/6
will have come true. But clearly your ability to make calibrated predictions
about the die says nothing about your predictive ability in general — it is
unreasonable to place the original 10 predictions and the subsequent 1000 in
the same reference class.
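A simulation sketch of van Fraassen's trick may be useful here; it is my own illustration, and the assumption that the 10 arbitrary predictions can land anywhere from 0 to 10 successes is the whole point.

```python
import random

# 10 arbitrary predictions at credence 1/6 are swamped by 1000 predictions,
# each at credence 1/6, that a fair die will come up 1.
random.seed(0)
die_hits = sum(random.randint(1, 6) == 1 for _ in range(1000))

# However the 10 arbitrary predictions turn out, the overall success rate in
# the "credence 1/6" bucket stays close to 1/6.
for arbitrary_hits in (0, 10):
    rate = (die_hits + arbitrary_hits) / 1010
    print(arbitrary_hits, round(rate, 3))
```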
Both of these difficulties have a common theme: calibration, as a norm, entangles individual subjective probability assertions with the facts about a larger class of events. Thus it cannot be taken to provide empirical content for single-case probability assertions. And inasmuch as this empirical content does in fact exist, my claim is that it is captured exactly by the notion of frequency judgment: it is no more and no less than the ability to define an arbitrary reference class and make a relative frequency assertion about it.

4. Seidenfeld also cites a theorem by Dawid [1982], which asserts that according to a Bayesian agent's own subjective probability distribution, she will necessarily achieve long-run calibration with probability 1. This is a consequence of the Law of Large Numbers — compare the observation that if a coin is in fact fair, even after an initial sequence of 999 heads and 1 tail, the relative frequency of heads will still converge in the limit to 1/2 with probability 1. Dawid and Seidenfeld take this to mean that the idea of calibration is either trivialized or inexpressible under a strict Bayesian interpretation of probability. But see new work by Sherrilyn Roush for an account of Bayesianism in which calibration is a nontrivial norm on subjective probability.

6 The third tier: Bayesian probability

I will use the term "probabilism" to describe the following view:
1. Uncertain knowledge and belief can (at least some of the time) be modeled by probabilities (“credences”, “Bayesian personal probabilities”).
2. These credences can (at least some of the time) be measured by an
agent’s disposition to act or bet.
3. Credences ideally satisfy the Kolmogorov axioms of probabilistic consistency.
4. The desirability of this consistency is demonstrated by the Dutch Book
argument.
According to this definition, I consider myself a probabilist. It seems perverse to me to try and dispense entirely with the idea of real-valued credence
— at the very least, I really do have propensities to bet on a variety of uncertain events that have no frequency interpretation, and Bayesian subjective
probability can assist me in pricing those bets. Moreover, inasmuch as there
is any kind of precision to my uncertain knowledge and belief, I am more
sympathetic to classical probabilism as a representation of that uncertainty
than I am to other formal techniques in knowledge representation, for example the AGM axioms. And it seems to me that this system provides the
most natural resolution of various problems related to partial belief, such as
the preface paradox. Hence the three-tiered interpretation accords a place to
Bayesian credences, defined in the standard way according to the Bayesian
literature.
By contrast, I will use the term “Bayesian subjectivism” to denote the
following expansion of the view:
1. An agent has at all times credences for all uncertain propositions, representing implicit dispositions to act, and forming a completed σ-algebra
that is consistent according to the Kolmogorov axioms of probability.
2. All knowledge can be assimilated to this framework, and all learning
can be described as conditional update.
I disagree intensely with this view.5 As a frequentist, I am perpetually
surprised by the insistence of Bayesian authors that I have credences for
propositions I have never considered, that I should elevate my unfounded
hunches and gut instincts to the level of formalized belief, or that I should
apply the principle of indifference and believe completely uncertain propositions to degree 1/2. When confronted with dispositional or gambling analyses, which allege that my credence can be measured by my propensity to bet, my response is that there are many propositions on which I would simply refuse to bet, or deny that I have a precise indifference price for a bet. And indeed, the rationality of this response is being defended increasingly in the literature, under the heading of "imprecise" or "mushy" credence — see Elga [2010] or Joyce [2010] for arguments that there are situations in which precise credences are unobtainable or unjustifiable.

5. Binmore [2006], who has a similarly skeptical perspective, uses "Bayesian" to describe moderate views of the first type and "Bayesianite" for the second.
Nor is the difficulty of eliciting precise credences the only foundational
difficulty with the Bayesian view. The intuition that Bayesian subjectivism
as an account of all uncertain reasoning represents an inherently unfeasible
ideal, even at the aspirational level, is supported by both philosophical and
mathematical evidence. Examples include Garber’s observation [1983] that
taking the position literally implies the existence of a unified language for all
of science (the project the logical positivists failed to complete), or Paris’s
proof [1994] that testing probabilistic beliefs for consistency is NP-complete.
However, as in the case of physical chance, a variety of conceptions of
Bayesian probability are enabled by the three-tiered interpretation. In particular, if you are a traditional Bayesian, then you have Bayesian credences
for a very wide range of propositions. Some of your credences also happen to
be frequency judgments, and some of those in turn happen to be chances, but
these distinctions are not of central importance to you. But the three-tiered
view also enables a much more skeptical attitude to Bayesian probability,
one that is identifiable with the skepticism of traditional frequentism: credences that have frequency interpretations can take on definite values while
credences that have no such interpretation are unsharp or remain in a state
of suspended judgment.
7 The transfer principles
I claimed that probabilities from the first tier transfer to the second, and from
the second to the third, the conjunction of these constituting a fragment of
the Principal Principle. However, I suspect that no one will be especially
interested in contesting this aspect of my argument — Bayesians already
endorse the Principal Principle, and frequentists find it perfectly acceptable
to bet according to frequency ratios. So the purpose of my discussion will be
as much to clarify the underlying notions as to prove the principles.
Definition 3. Let E be an event. If an agent can assign E to a reference
class R, she knows a physical chance p for events in the class R, and she has
no other relevant information, she is obliged to have a frequency judgment of
p for E and R.
This is more or less trivial. If we know that a class of events exhibits
chance, then we can model sequences of those events as I.I.D. draws from
the relevant distribution.
Definition 4. Let E be an event. If an agent has a frequency judgment p for
E (by virtue of associating it with an unambiguous reference class R), and
no other relevant information, he is obliged to have a Bayesian subjective
probability of p for E.
The argument for this is as follows: let the agent consider how to buy
and sell bets for a sequence of n events in the reference class R, for n arbitrarily large. He estimates that a proportion p of these events will come true. Therefore, the fair price for the sequence of bets is pn; any higher and if he buys the bets at that price, he will lose money according to his estimate, any lower and he will lose money by selling them. But since R is epistemically homogeneous for the agent, and in particular he has no information that distinguishes E from the other events, each individual bet must have the same price. Thus, his fair price for a bet on E is pn/n = p.
How much of the Principal Principle have we recovered? We have it for
any event that has a chance and belongs to a reference class. This captures
most conventional uses of PP, for example the radioactive decay of atoms
(even Unobtainium). But we have seemingly failed to recover it in the case of
one-off events. For example, what if we have come to understand an inherently
unique macrophysical phenomenon as possessing a chance of p? We cannot
have a frequency judgment about it, so on the basis of the reasoning here we
are not constrained to have a credence of p in it.
This is a genuine problem and I cannot resolve it entirely here — a solution
would seemingly require a detailed analysis of the meaning of chance. As a
last resort I can simply defer to an existing justification of PP that doesn’t
go through frequency judgments. But here is a brief sketch in defense of the
full PP on the basis of the frequency judgments view. PP is inherently a
principle of epistemology, not metaphysics, because it describes a constraint
on credences (which are necessarily an epistemic notion). Therefore it is
appropriate to ask how we would actually come to know the value of this
one-off macrophysical chance — we couldn’t have learned it from observed
frequency data. The most natural answer seems to be that we would learn it
via a theoretical model in which the overall macrophysical chance supervened
on microphysical chances. And then this model would provide the basis for a
frequency interpretation of the chance: over the reference class of situations
satisfying the initial conditions of the model, the desired event would come to
pass in some proportion p of the situations. This doesn’t exhaust all possible
methods by which we could come to know p, but I hope it fills in a good
portion of the gap.
Finally, notice the qualifications in the second principle: the reference
class must be unambiguous, and there must be no other relevant information. The second of these requirements corresponds to the requirement of
admissibility commonly associated with the Principal Principle; if you have
information about an individual event that informs you about it beyond the
background chance of success or failure, then PP is not applicable. (A simple example: you are playing poker and your opponent is trying to complete
a flush. You know that the objective chance of this occurring is low, but
you have seen him exhibit a “tell”, for example, the widening of the eyes
in excitement. Your credence that he has a flush should increase to a value
higher than that dictated by the PP.) There is a sophisticated literature on
when exactly PP is admissible, and I have no particular stance on the issue.
Indeed, my view is that both qualifications are features and not bugs. When
admissibility is debatable or the reference class is ambiguous, there is no fact
of the matter about what should be believed.
8 Populations, direct inference, and the Principle of Indifference
White [2009] calls the second transfer principle “Frequency-Credence”. He
claims that it implies the generalized Principle of Indifference, i.e., the rule
that if you are faced with n mutually exclusive alternatives and have no information to distinguish them, you should assume a credence of 1/n for each
one. An especially revealing case is an individual proposition q concerning which you have no relevant information: since exactly one of {q, ¬q} is
true, the Principle of Indifference indicates that you should assign P (q) =
P (¬q) = 0.5. Such a principle is of course anathema to frequentists, since
it is applicable in cases when there is no possible frequency interpretation
of P (q). Thus, White’s purpose is to show that frequentist squeamishness
about the Principle of Indifference is incoherent. Here is his statement of
Frequency-Credence:
Definition 5. If (i) I know that a is an F , (ii) I know that freq(G | F ) = x
(the proportion of F s that are G), and (iii) I have no further evidence bearing
on whether a is a G, then P (a is a G) = x.
and here is his proof (≃ denoting epistemic indistinguishability):

Let F = {p1, p2, . . . , pn} be any set of disjoint and exhaustive possibilities such that p1 ≃ p2 ≃ . . . ≃ pn. Let G be the set of true propositions. For any pi, (i) I know that pi is an F; (ii) I know that freq(G | F) = 1/n (exactly one member of the partition {p1, p2, . . . , pn} is true); and (iii) I have no further evidence bearing on whether pi is G (I am ignorant concerning the pi, with no reason to suppose that one is true rather than another). Hence by FC, P (pi is a G) = 1/n, i.e., P (pi is true) = 1/n, so P (pi) = 1/n.
White challenges opponents of the Principle of Indifference to identify a
restriction of Frequency-Credence that disallows this proof. Fortunately, the
frequency judgments interpretation and the second transfer principle qualify
as just such a restriction. Moreover, the precise way in which they block the
conclusion reveals some interesting information.
Everything hangs on the following assertion in the proof: that freq(G | F) = 1/n. For White, this is just the observation that exactly one of the possibilities p1 . . . pn is true, i.e., it is the finite frequency of true propositions
among the available possibilities. But for the second transfer principle to
apply, this must constitute a genuine frequency judgment, and without a reference class and a conception of repeated trial, a frequency judgment cannot
exist. In particular, if the alternatives are q and ¬q for a single-case proposition q with no obvious notion of trial (“God exists”, “Chilperic I of France
reigned before Charibert I”), no frequency judgment will be supported, and
there is no obligation to set P (q) = P (¬q) = 0.5; rather it is perfectly reasonable to be in a state of suspended judgment, or to have an unsharp credence
interval.
There is a subtlety here because the principle of indifference can indeed be
a source of legitimate frequency judgments. If for some genuine reference class
of repeated trials, each trial has the same n mutually exclusive outcomes, then
it can be perfectly legitimate to estimate a priori the long-run frequency of
each one as 1/n.⁶ This estimation may not be justified or accurate, but that
doesn’t matter; as discussed previously, what matters is the possibility of
confirming or disconfirming the judgment from empirical data. But even
in this case, we do not recover the principle of indifference as an obligation,
merely as an option. There is no obligation to formulate frequency judgments
in the absence of evidence — dispositional betting arguments try to elicit
credences in this way, but obviously this doesn’t go through for frequency
judgments.
8.1 Direct inference
There is another subtlety: observed finite frequencies are not necessarily frequency judgments! Consider the following scenario, discussed by Levi [1977]
and Kyburg [1977]: of the 8.3 million Swedes, 90% of them are Protestants.
Petersen is a Swede. What should our credence be that Petersen is a Protestant? Intuitively, there seems to be a frequency probability that P (Petersen is
a Protestant) = 0.9. Arguments of this form — going from relative frequency
in a population to a credence — are called direct inferences or statistical syllogisms, and they are a significant aspect of our probabilistic reasoning. But
6 In a Bayesian framework, this would be an uninformative prior, or indifference prior, over the n alternatives.
if we try to phrase this as a frequency judgment, we encounter problems. The
Swedes are not a reference class of events, and there is no obvious notion of
repeated trial at work.
The situation seems analogous to the case of {q, ¬q}. The intuition that
we should have a credence of 0.9 is seemingly grounded in the idea that
Petersen is one of the 8.3 million Swedes, and we are indifferent as to which
one he is. But if we allow unrestricted reasoning of this kind, then it will apply
to the two propositions {q, ¬q} as well, and White’s challenge will succeed
after all — we will have conceded that making use of frequency probabilities
implies a generalized principle of indifference. Can we save the intuition that
P (Petersen is a Protestant) = 0.9 without conceding P (q) = P (¬q) = 0.5?
Here is a case that may clarify what the frequency judgments interpretation says about this kind of reasoning. You are a contestant on a game
show; a prize is behind exactly one of three closed doors, and you must
choose which one to open. What should your credence be that the prize is
behind the left door? Whatever this credence is, if it is to be associated with
a frequency judgment, it must be possible to clarify it with respect to the
long-run behavior of repeated trials. The natural conception of repeated trial
here is that we would play the game repeatedly and measure the proportion
of times that the prize is behind the left door. And it is not clear that any
particular frequency judgment is supported about this reference class of trials — we might imagine that the show host has a bias towards one of the
doors in particular. Considerations like this support a view in which your
credence that the prize is behind the left door is indeterminate or unsharp,
or in which you suspend judgment about the question. Contrast this with
the following claim: if you flip a fair coin with three sides and use the result to decide which door to choose, you have a frequency judgment of 1/3 that this procedure will yield the correct door, regardless of what the host does. In this case, a frequency judgment is fully supported, because the reference class is clear (flips of the coin) and its properties are unambiguous, and there is a convincing case that the second transfer principle obligates you to have a credence of 1/3.7
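To make the contrast concrete, here is a minimal simulation sketch (my own, not from the text; the host's bias numbers are invented for illustration): a fixed choice of door inherits whatever unknown bias the host has, while a choice made by a fair three-way randomization hits the prize about 1/3 of the time no matter what the host does.

    import random

    # Hypothetical host bias over (left, middle, right); the exact numbers are
    # invented for illustration and unknown to the contestant.
    HOST_BIAS = [0.1, 0.2, 0.7]

    def play(choose_door):
        prize = random.choices([0, 1, 2], weights=HOST_BIAS)[0]
        return choose_door() == prize

    N = 100_000
    always_left = sum(play(lambda: 0) for _ in range(N)) / N
    randomized = sum(play(lambda: random.randrange(3)) for _ in range(N)) / N
    print(always_left)  # tracks the host's unknown bias for the left door (about 0.1 here)
    print(randomized)   # about 1/3 regardless of the host's behavior

The second estimate is the frequency judgment at issue: it attaches to the randomizing procedure, not to the host.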
However, it seems natural that we should wish the credence of 1/3 to be available at least as an option for the rational agent, and to be able to make sense of this under the frequency judgments interpretation. I think this is possible via the following expedient: we construct a model of the show in which the host selects the prize door via a coin flip. Acknowledging that this model, like any model, may not be true, we can use it to support a frequency judgment of 1/3 for each door. Returning to our original problem,
we can adopt a model in which the process by which we encounter Swedes
7 This distinction is closely related to the Ellsberg paradox, and to the decision-theoretic and economic notions of ambiguity aversion and Knightian uncertainty.
is a chance process, analogous to a lottery in which we are equally likely to
draw any individual Swede. This model then supports a frequency judgment
of .9 for Protestants and a credence of .9 that Petersen is one.
This technique — modeling unknown processes as chance processes —
is the general idea of how direct inference is supported under the frequency
judgments interpretation. Does it, as White alleges, imply a generalized
principle of indifference? As discussed above, even when the technique is
applicable, it is not obligatory; the option of suspending judgment (or having
an unsharp credence) is left open. Moreover, the technique seems to get
at an important distinction between two kinds of indifference. It applies
straightforwardly to situations where one is indifferent between individuals
(prize doors, Swedes), but not to situations where one is indifferent between
propositions (which king reigned first). Indeed, to interpret the second kind of
indifference within our framework, we would seemingly have to talk about an
indifference between possible worlds, and of a chance process deciding which
one we live in. At this point we have regressed to the kind of reasoning decried
by C. S. Peirce, of imagining that “universes [are] as plenty as blackberries”
and we can “put a quantity of them in a bag, shake them well up, [and] draw
out a sample.” This kind of reasoning is not frequentist and therefore it is
appropriate that we cannot understand it on frequentist terms.
9 Hájek's objections to frequentism
As I understand the frequency judgments interpretation, it avoids the bulk
of Hájek’s objections simply by failing to be a frequentism in the classical
sense of the term. Let F.n denote his nth objection to finite frequentism,
and H.n his nth objection to hypothetical frequentism. It seems to me that
most of his objections are straightforwardly dismissed by one or more of the
following concessions:
1. Not constraining frequency probabilities to be actual finite frequencies.
This obviates objections F.2, F.5, F.6, F.8, and F.12-15.
2. Not considering frequency probabilities to be determined by hypothetical infinite sequences of trials. This obviates objections H.1-6, H.8-9,
and H.13-14.
3. Acknowledging the possible existence of physical chance. This answers
objections F.3, F.7, F.9, F.11, H.7, and H.10.
4. Acknowledging the legitimacy of Bayesian subjective probabilities. This
answers objections F.10 and H.10.
Of the remaining objections: Hájek himself has subsequently repudiated
F.1, which criticizes finite frequentism on the grounds that it admits a reference class problem. As discussed in section 5.2, Hájek now considers the
reference class problem to affect every interpretation of probability, and I
fully concur. I take H.11 (which concerns paradoxes associated with uncountable event spaces) to affect the Kolmogorov formalization of probability
itself rather than frequentism specifically. H.15, which says that frequency
interpretations cannot make sense of infinitesimal probabilities, I take to be
a feature and not a bug.
The two remaining objections, F.4 and H.12, have a common theme —
they say that frequentism cannot make sense of propensity probabilities.
This is a serious issue that the three-tiered interpretation does not entirely
address. In particular, here is Hájek’s thought experiment from H.12:
Consider a man repeatedly throwing darts at a dartboard,
who can either hit or miss the bull’s eye. As he practices, he
gets better; his probability of a hit increases: P (hit on (n + 1)th
trial) > P (hit on nth trial). Hence, the trials are not identically
distributed. [...] And he remembers his successes and is fairly
good at repeating them immediately afterwards: P (hit on (n +
1)th trial | hit on nth trial) > P (hit on (n + 1)th trial). Hence,
the trials are not independent.
Intuitively, all of these probability statements are meaningful, objective statements about the properties of the man (or of the dart-throwing process as
a whole). Yet by their nature, we have difficulty in understanding them
as statements about relative frequencies over sequences of independent and
identically distributed trials. Hájek is unimpressed with the reply that in
order to obtain a frequency interpretation of these probabilities, we should
“freeze the dart-thrower’s skill level before a given throw” and then consider
hypothetical repeated throws by the frozen player. On one level, this notion
of “freezing” involves an appeal to a nonphysical counterfactual. On another,
relative frequencies seem irrelevant to the intuition that the thrower has, before each throw, some single-case propensity to hit or miss the target. The
intuition here is analogous to the case of chance, except that there is no clear
way to interpret the dart-throwing system as subject to physical chance.
I can see no option for the three-tiered interpretation other than to resolutely
bite this bullet. That is to say, the three-tiered interpretation does not make
rigorous the idea of propensity probabilities that are not chances. In this I
am agreeing with von Mises, who held that we cannot make sense of such
single-case assertions as “the probability that John will die in the next 5 years
is 10%.” In defense of this refusal with respect to Hájek’s dart-thrower, I
can only say this: the only way we were able to formulate this model in the
first place was to observe the behavior of multiple dart-throwers, and thus
to reason about reference classes of darts players in specific situations (e.g.,
immediately after hitting the bull's eye). Furthermore, how would we confirm
the applicability of this model to any specific player? It seems that we would
do so via some sort of calibration test — and, as discussed in section 5.5,
calibration is always implicitly or explicitly dependent on some notion of
frequency probability.
10 Advantages of the tiered interpretation
10.1 Statistical pragmatism
Two things are needed for the tiered interpretation to serve as a foundation
for statistical pragmatism [Senn, 2011], i.e., for a worldview in which frequentist and Bayesian methods coexist. In order to admit the use of Bayesian
methods, it must acknowledge the existence of non-frequentist prior probabilities. But it must also formally distinguish the probabilities used by properly
frequentist methods from those used by Bayesian methods; otherwise, a frequentist method is simply a peculiarly defective Bayesian method. Thus,
there are two claims: firstly, that the probabilities referred to in frequentist
statistical methods are frequency judgments, and secondly, that Bayesian statistical methods are characterized by their use of probabilities that cannot
necessarily be interpreted as frequency judgments.
Although I am not an expert in statistics, I am confident in both of these
claims. A canonical example is classical significance testing. A p-value of .05 means we estimate that if the null hypothesis were true and we repeated the experiment, a proportion .05 of the experiments would exhibit results at least as extreme as the one observed. This is straightforwardly a frequency judgment. Other methods
utilizing test statistics, such as chi-squared testing, follow this pattern; the
test statistic is computed from the data and then an estimate is given for
the proportion of experiments (given the null hypothesis) that would exhibit
correspondingly extreme values of the statistic. With confidence intervals,
the frequency judgment attaches to the procedure of deriving the interval: a
90% confidence interval is associated with the estimate that if we repeatedly
sampled and computed the confidence interval, 90% of the resulting intervals
would contain the true value of the parameter.
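As an illustration of this reading, here is a simulation sketch of my own (the normal population and sample size are invented, not drawn from the paper): the long-run coverage of a standard 90% interval for a mean can be checked directly against repeated sampling.

    import random
    import statistics

    TRUE_MEAN, SD, N, Z90 = 5.0, 2.0, 50, 1.645  # invented population and sample size

    def interval_covers_truth():
        sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
        m = statistics.mean(sample)
        half_width = Z90 * statistics.stdev(sample) / N ** 0.5
        return m - half_width <= TRUE_MEAN <= m + half_width

    trials = 20_000
    print(sum(interval_covers_truth() for _ in range(trials)) / trials)  # roughly 0.90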
In contrast, Bayesian methods in general allow the use of probabilities
that have no frequency interpretation. For example, a prior probability for
a hypothesis will not have one in general; rather it will represent epistemic
uncertainty about the truth of the hypothesis. Of course, there are settings
in which the prior in a Bayesian method may be interpretable as a frequency
judgment. Consider someone with three coins, with biases 1/4, 1/2, 3/4, who draws
one of them at random from an urn, flips it 10 times, and observes 6 heads.
The agent can begin with a uniform prior distribution that assigns probability 1/3 to each coin, then use Bayes' rule to obtain posterior probabilities as to which coin he has. In this case, his prior is in fact a frequency judgment (“over a long sequence of urn drawings, each coin will be drawn 1/3 of the time”), and thus his posteriors are also frequency judgments (“over a long sequence of drawing coins from the urn and flipping them ten times, of the times I see 6 heads, ≈ 0.558 of them will be because I drew the 1/2-coin”).
But the method would be equally applicable if the prior reflected only the
agent’s subjective degrees of belief in which coin he had.
10.2 Cromwell's rule
A notorious problem for subjectivist Bayesianism is the difficulty associated
with assigning probabilities of 0 or 1. Let’s say you assign P (A) = 0. Then
for any B, P (A | B) = P (A ∩ B)/P (B) ≤ P (A)/P (B) = 0/P (B) = 0, so you can never
revise P (A) by conditioning on new information. The case for P (A) = 1 is
analogous, as is the situation when standard conditionalization is replaced
by Jeffrey conditionalization.
Thus, according to many interpretations, a strict Bayesian should never
assign a probability of 0 to an event, no matter how unlikely; Lindley calls
this requirement Cromwell’s rule. But frequency judgments are not affected
by this problem, because they can be revised arbitrarily. Perhaps the clearest
example is the case of estimating the bias of a coin, where we admit a third
event besides heads and tails: it is physically possible that the coin might
come to rest on its edge, or that the outcome of the flip might remain undetermined in some other way. A strict Bayesian is apparently committed to having prior probabilities for all of these events — and fixing P (H) = P (T ) = 0.5
entails a violation of Cromwell’s rule, since no probability mass is left over
for them.8 But under the frequency judgments interpretation, there is no
difficulty associated with revising a probability from zero to a nonzero value.
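A minimal sketch of the formal point (my own schematic example; the likelihood numbers are arbitrary): once an outcome carries prior probability exactly 0, Bayes' rule can never raise it, however strongly the evidence favors it.

    # Bayes' rule over a discrete partition: posterior is proportional to prior * likelihood.
    prior = {"heads": 0.5, "tails": 0.5, "edge": 0.0}

    def update(prior, likelihood):
        unnormalized = {h: prior[h] * likelihood[h] for h in prior}
        total = sum(unnormalized.values())
        return {h: mass / total for h, mass in unnormalized.items()}

    # Arbitrary likelihoods chosen to favor "edge" strongly:
    posterior = update(prior, {"heads": 0.01, "tails": 0.01, "edge": 0.98})
    print(posterior["edge"])  # 0.0: the zero survives any amount of updating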
Perhaps questions of this kind are artificial, unrelated to genuine concerns
of statistical practice? On the contrary, they seem to correspond to actual
methodological difficulties that arise when adopting a strictly Bayesian perspective. Gelman and Shalizi [2012] describe how a rigidly Bayesian outlook
can be harmful in statistical practice. Since “fundamentally, the Bayesian
agent is limited by the fact that its beliefs always remain within the support
of its prior [i.e., the hypotheses to which the prior assigns nonzero probability]”, it is difficult to make sense of processes like model checking or model
8 One standard technique for dealing with this is to leave a small amount of mass over for a “catch-all” hypothesis, which is a disjunction over all seen and unseen alternate hypotheses. See Fitelson and Thomason [2008] for an argument that this is false to scientific practice.
revision, in which a model can be judged inadequate on its own merits, even
before a suitable replacement has been found. They instead join Box and
others in advocating a picture where individual Bayesian models are subjected to a non-Bayesian process of validation and revision. Dawid, whose
calibration theorem suggests a similar difficulty with the Bayesian agent being able to recognize his or her own fallibility, is led also to an endorsement
of Box. The point is not that these statisticians are betraying Bayesianism;
it is that their pragmatic interpretation of Bayesian statistical methodology
bears little resemblance to the worldview of the formal epistemologist who
endorses Bayesian confirmation theory.
10.3 Foundations of conditional probability
Bayesian probability proves its worth in dissolving paradoxes associated with
partial belief. Yet it is affected by its own set of paradoxes. I believe that
the tiered interpretation, in its capacity as a relaxation of strict Bayesian
discipline, can dissolve some of these as well — most notably, those in which
Bayesian conditionalization is expected to subsume all probabilistic model-building.
Hájek [2007] gives the following paradox. An urn has 90 red balls and 10
white balls. Intuitively, P(Joe draws a white ball from the urn | Joe draws
a ball from the urn) = .1. But in the standard Kolmogorov interpretation
of probability, conditional probability is not a primitive notion but a derived
notion, so in order for this statement to be true, we must have P(Joe draws
a ball and it is white) / P(Joe draws a ball) = .1. But neither one of
these unconditional probabilities appears well-defined on the basis of our
assumptions. As Hájek asks, “Who is Joe anyway?”
Hájek’s solution is to suggest that conditional probability is the true primitive notion and that we should consider alternate (non-Kolmogorov) formulations of probability that elevate it to its rightful place as such. But this
seems to miss the mark. In particular, even though P(Joe draws a white ball
| Joe draws a ball) is well-defined, P(Joe draws a white ball | Bill flips a coin)
is not. Moreover, we can recover unconditional probability from conditional
probability, for example by conditioning on independent events (e.g., P(Joe
draws a white ball | a distant radium atom decays)) or on tautologies (e.g.,
P(Joe draws a white ball | p ∨ ¬p)). It seems that conditionalization is orthogonal to the true problem: when does a situation support a probabilistic
analysis?
Under the tiered interpretation of probability, this problem is confronted
directly and admits a natural resolution. The fact that Joe is drawing a
ball from the urn provides enough information to support a model and a
frequency judgment: it calls into existence a probabilistic model in which
we have an extremely simple event space: “Joe draws a white ball” or “Joe
draws a red ball”. In this model, the value from our intuition appears as
an unconditional probability: P(Joe draws a white ball) = .1. Saying this is
no more and no less than saying that if Joe repeatedly draws balls from the
urn with replacement, the natural estimate of the proportion of white balls
is .1. In general, the process of assigning an event E to a reference class and
then identifying P (E) with the frequency judgment for that class is a more
natural description of our probabilistic model-building than a strict Bayesian
conditioning view.
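The frequency judgment in question is just the long-run behavior of this simple model (a trivial simulation sketch, added here only for concreteness):

    import random

    urn = ["red"] * 90 + ["white"] * 10              # the urn described in the text
    draws = [random.choice(urn) for _ in range(100_000)]  # Joe draws with replacement
    print(draws.count("white") / len(draws))         # approximately 0.1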
Hájek’s other paradox in the article, that of conditioning on events of
probability zero, admits a similar resolution. Hájek has us consider a random
variable X uniformly distributed on [0, 1]. Intuitively, P (X = 1/4 | X = 1/4 ∨ X = 3/4) equals 1/2. But if we expand this using the standard definition of conditional probability, we get P (X = 1/4)/P (X = 1/4 ∨ X = 3/4) = 0/0, which is undefined.
Once again, the problem seems to be that we are taking an unnecessarily
narrow view of the model-building process. It is natural that we should
try to transform a continuous distribution into a discrete one by setting
P (X = a) = f (a), where f is the density function, and renormalizing —
this has a natural interpretation as the outcome of considering P (|X − a| < ε) for smaller and smaller values of ε. When applied to Hájek's uniform distribution, with a ranging over {1/4, 3/4}, this yields the expected answer P (X = 1/4) = P (X = 3/4) = 1/2. It should not be considered problematic that
this model transformation cannot be interpreted as a conditional update.
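To spell out the limiting step for Hájek's example (a short derivation added here, using nothing beyond the uniform distribution already assumed): for small ε > 0 and any a in the interior of [0, 1], P (|X − a| < ε) = 2ε, so

    lim_{ε→0} P (|X − 1/4| < ε) / [P (|X − 1/4| < ε) + P (|X − 3/4| < ε)] = lim_{ε→0} 2ε / (2ε + 2ε) = 1/2,

which is exactly the renormalized value P (X = 1/4) = 1/2 given above.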
10.4 Sleeping Beauty
The Sleeping Beauty Paradox, popularized by Elga [2000], goes as follows.
A fair coin, i.e., one that lands heads with an objective probability of 1/2, is flipped on Sunday, and then Beauty is put to sleep. If it lands heads, Beauty is awakened on Monday, interviewed, his memory is wiped and he is put back to sleep. If it lands tails, this is done once on Monday and once on Tuesday. Beauty has just awoken. What should his credence be that the coin landed heads? The “halfer” position is that since the coin is fair, P (H) must equal 1/2. But if the experiment is repeated many times, only 1/3 of Beauty's awakenings will be because the coin landed heads — hence the “thirder” position that P (H) = 1/3. Which of these is the correct credence?
Sleeping Beauty is a vexing problem for Bayesian epistemologists and has
generated a rich literature. But, as Halpern [2004] observed, the paradox
is immediately dissolved by a frequentist analysis: it is a pure instance of
reference class ambiguity. If Beauty analyzes his situation using the reference
class of all coinflips, then the probability of a head is 1/2. If he analyzes it instead using the reference class of all awakenings, the probability of a head is 1/3. Under the tiered interpretation, there are thus two possible frequency judgments, one with value 1/2 and one with value 1/3. But since the reference
class is ambiguous, neither one passes down to become a credence. For a
frequentist (or anyone who is free to suspend judgment about credences),
the problem is simply one of vagueness.
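The two reference classes can be exhibited directly (a simulation sketch of my own, not taken from Halpern or the text):

    import random

    tosses = heads_tosses = awakenings = heads_awakenings = 0
    for _ in range(100_000):
        heads = random.random() < 0.5        # the fair Sunday coin toss
        tosses += 1
        heads_tosses += heads
        n_awakenings = 1 if heads else 2     # one awakening on heads, two on tails
        awakenings += n_awakenings
        heads_awakenings += n_awakenings if heads else 0
    print(heads_tosses / tosses)             # about 1/2: frequency over tosses
    print(heads_awakenings / awakenings)     # about 1/3: frequency over awakenings

Both numbers are legitimate frequency judgments; the ambiguity lies only in which one should govern Beauty's credence.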
This seems unsatisfying. After the frequentist throws up his hands in
this way, how should he bet? As Halpern shows, the fact is that there
exist Dutch Books against both “halfer” and “thirder” agents, but they
are not true Dutch Books: they rely on the ability of the bookie to vary
the number of bets that are bought and sold according to the number of
awakenings. Therefore the ideal betting behavior is not fixed, but depends
on the capabilities of the adversary.
Beauty has genuine probabilistic knowledge about his situation: over the
long run, half of all fair coin tosses are heads, and a third of his awakenings
are because the coin landed heads. And he can, in fact, use this knowledge to
buy and sell bets on H. For example, Beauty can buy and sell bets on heads
on Sunday, and the fair price for those bets will be 1/2. And if Beauty has an assurance that the exact same bets on heads will be on offer every time he wakes up (perhaps they are sold from a tamper-proof vending machine in the laboratory), the fair price for those bets will be 1/3. What Beauty cannot
safely do is fix a single indifference price and then buy and sell bets at that
price, i.e., act in accordance with the traditional operational definition of
credence. Beauty can have probabilistic knowledge about H without having
a credence.9
10.5 White's coin puzzle
White [2009] is committed to the Principle of Indifference, in particular as
an alternative to the suspension of judgment about credences. His thought
experiment of the “coin puzzle” is intended to show that suspension of judgment is unsatisfactory. As with the previous discussion of White in section
8, the onus is on the frequentist to reply.
You haven’t a clue as to whether q. But you know that I know
whether q. I agree to write “q” on one side of a fair coin, and
“¬q” on the other, with whichever one is true going on the heads
side (I paint over the coin so that you can’t see which sides are
heads and tails). We toss the coin and observe that it happens
to land on “q”.
9 I believe that Beauty will be protected against a variety of adversaries by having an unsharp credence interval of [1/3, 1/2], but formulating and proving this is beyond the scope of this paper.
Let P denote your credence function before seeing the flip, and P ′ your
credence function afterwards. Let H denote the event that the coin lands
heads. White notes that the following statements are jointly inconsistent:
1. P (q) is indeterminate, i.e., before seeing the flip, you have no precise
credence that q. (One natural formalization of this is to say that P (q)
is interval-valued, e.g., P (q) = [0, 1]. This can be read as “my credence
in q is somewhere between 0 and 1.”)
2. P (H) = 1/2, i.e., before seeing the flip, you have a precise credence of 1/2 that the coin will land heads.
3. P ′ (q) = P ′ (H). This should be true because after seeing the flip, q is true if and only if the coin landed heads.
4. P (q) = P ′ (q). This should be true because seeing the flip provided no information about whether q is in fact true. (Note that this would be false for a biased coin.)
5. P (H) = P ′ (H). This should be true because seeing the flip provided no information about whether the coin landed heads. (Note that this would be false if you had meaningful information about q, in particular a sharp credence of anything other than 1/2.)
Put these together and we derive P (q) = P ′ (q) = P ′ (H) = P (H) = 1/2, contradicting claim 1. White's conclusion is to deny that claim 1 is rationally permissible — rather, we should begin with a sharp credence of P (q) = 1/2 via the Principle of Indifference. What should the proponent of unsharp credences do instead? Joyce [2010] moves to deny claim 5 and set P ′ (H) to equal P (q). Paradoxically, this causes a dilation of your credence in H — your P (H) was precisely 1/2 but your P ′ (H) has become unsharp or interval-valued. Seeing the coin land has apparently reduced your knowledge!
My response to the coin puzzle is to affirm Joyce's view and accept dilation, combined with the rule (maximin expected utility) given by Gärdenfors and Sahlin [1982] for betting on unsharp credences. According to this view, the correct action for an unsharp agent with credence interval P (q) = [0, 1] is as follows: before seeing the outcome of the flip, it is permissible to buy and sell bets on H for 0.5, to buy bets on q for prices ≤ 0, and to sell bets on q for prices ≥ 1. After the outcome of the flip has been revealed, your betting behavior for H should dilate to match your behavior for q. But, on my view, it is only your credences that dilate — your frequency judgment that P (H) = 1/2 is exactly the fragment of your knowledge that is not destroyed by seeing the q-side of the coin come up.
This is the “conservative betting” behavior that White discusses and rejects. His argument against it uses a scenario of long-run betting on repeated instances of the coin puzzle, with a series of coin flips heads i and a different unknown proposition pi each time:
On each toss you are offered a bet at 1:2 [i.e., for a price of 1/3] on heads i once you see the coin land pi or ¬pi. Since your credence in heads i is mushy at this point you turn down all such bets. Meanwhile Sarah is looking on but makes a point of covering her eyes when the coin is tossed. Since she doesn't learn whether the coin landed pi her credence in heads i remains sharply 1/2 and so takes every bet [...] Sure enough, she makes a killing.
This hinges on an ambiguity in how exactly the bets are being offered.
If you know for certain that the bets will be offered, i.e., if you have a
commitment from your bookmaker to sell the bets, then that is equivalent
to the bets being offered before the coin is tossed, and you are justified in
buying them. But if your bookmaker can choose whether or not to offer the
bet each time, you would be very ill-advised to buy them, since he can then
offer them exactly in the cases when he knows that ¬pi , and you will lose
your $1/3 every time! This is exactly the situation that unsharp credences are
intended to prevent: if you suspend judgment and refuse to bet, you can’t be
taken advantage of. And once the pi or ¬pi side of the coin has been revealed,
you can be taken advantage of by someone who knows the truth about pi ,
so you should stop buying and selling bets.10 But what has changed is your
betting behavior about H, not your knowledge about H. Your knowledge is
exactly your frequency judgment and it remains intact.
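The difference between the two readings of the scenario can be checked numerically. The following is a simulation sketch; modeling the adversary as withholding the offer exactly when the heads bet would win is my own way of making the scenario concrete.

    import random

    # The bet pays $1 if the coin landed heads and costs $1/3.
    def average_profit(adversarial_bookie, n=100_000):
        profit = 0.0
        for _ in range(n):
            heads = random.random() < 0.5
            offered = (not heads) if adversarial_bookie else True
            if offered:                  # the habitual buyer takes every offered bet
                profit += (1.0 if heads else 0.0) - 1 / 3
        return profit / n

    print(average_profit(adversarial_bookie=False))  # about +1/6 per toss: committed bookmaker
    print(average_profit(adversarial_bookie=True))   # about -1/6 per toss: each purchased bet loses 1/3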
The coin puzzle is a powerful illustration of the following fact: for an
unsharp agent, probabilistic knowledge and betting behavior can come apart.
Thus, it is only paradoxical under behaviorist interpretations of probability
in which they are held to be synonymous. My hope is that the three-tiered
interpretation distinguishes the two in a natural way, and in a way that
affirms the core intuitions of frequentism.
11 Acknowledgements
I am grateful to Sherri Roush, Alan Hájek, Roy Frostig, Jacob Steinhardt,
and Jason Auerbach for helpful discussions.
10 There is, however, no need to revoke or cancel any existing bets, as White alleges in a subsequent thought experiment.
References
Ken Binmore. Making decisions in large worlds, 2006. URL http://else.econ.ucl.ac.uk/papers/uploaded/266.pdf.
Rudolf Carnap. The two concepts of probability: The problem of probability.
Philosophy and Phenomenological Research, 5(4):513–532, 1945.
A. Philip Dawid. The well-calibrated Bayesian. Journal of the American
Statistical Association, 77(379):605–610, 1982.
Antony Eagle. Deterministic chance. Noûs, 45(2):269–299, 2011.
Adam Elga. Self-locating belief and the Sleeping Beauty problem. Analysis, 60(2):143–147, 2000.
Adam Elga. Subjective probabilities should be sharp. Philosophers' Imprint, 10(5), 2010. URL http://www.princeton.edu/~adame/papers/sharp/elga-subjective-probabilities-should-be-sharp.pdf.
Branden Fitelson and Neil Thomason. Bayesians sometimes cannot ignore
even very implausible theories (even ones that have not yet been thought
of). Australasian Journal of Logic, 6:25–36, 2008.
Daniel Garber. Old evidence and logical omniscience in Bayesian confirmation theory. In Testing Scientific Theories, volume X of Minnesota Studies
in the Philosophy of Science, pages 99–131. University of Minnesota Press,
1983.
Peter Gärdenfors and Nils-Eric Sahlin. Unreliable probabilities, risk taking,
and decision making. Synthese, 53(3):361–386, 1982.
Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice
of Bayesian statistics in the social sciences. In The Oxford Handbook of
Philosophy of Social Science. Oxford University Press, 2012.
Luke Glynn. Deterministic chance. British Journal for the Philosophy of
Science, 61(1):51–80, 2010.
Alan Hájek. The reference class problem is your problem too. Synthese, 156
(3):563–585, 2007.
Alan Hájek. “Mises redux” — redux: Fifteen arguments against finite frequentism. Erkenntnis, 45(2-3):209–27, 1996.
Alan Hájek. Fifteen arguments against hypothetical frequentism. Erkenntnis,
70(2):211–235, 2009.
Joseph Halpern. Sleeping Beauty reconsidered: Conditioning and reflection
in asynchronous systems. In Oxford Studies in Epistemology, volume 1,
pages 111–142. Oxford University Press, 2004.
Carl Hoefer. The third way on objective probability: A sceptic’s guide to
objective chance. Mind, 116(463):549–596, 2007.
E. T. Jaynes. Some random observations. Synthese, 63(1):115–138, 1985.
James M. Joyce. A defense of imprecise credences in inference and decision
making. Philosophical Perspectives, 24(1):281–323, 2010.
Robert E. Kass. Statistical inference: The big picture. Statistical Science, 26(1):1, 2011.
Henry E. Kyburg. Randomness and the right reference class. Journal of
Philosophy, 74(9):501–521, 1977.
Isaac Levi. Direct inference. Journal of Philosophy, 74(1):5–29, 1977.
David Lewis. Humean supervenience debugged. Mind, 103(412):473–490,
1994.
J. B. Paris. The Uncertain Reasoner’s Companion. Cambridge University
Press, Cambridge, UK, 1994.
Wesley C. Salmon. Statistical Explanation & Statistical Relevance. University
of Pittsburgh Press, 1971.
Jonathan Schaffer. Deterministic chance? British Journal for the Philosophy
of Science, 58(2):113–140, 2007.
Teddy Seidenfeld. Calibration, coherence, and scoring rules. Philosophy of
Science, 52(2):274–294, 1985.
Stephen Senn. You may believe you are a Bayesian but you are probably
wrong. Rationality, Markets and Morals, 2(42), 2011.
Bas van Fraassen. Calibration: A frequency justification for personal probability. In Physics, Philosophy, and Psychoanalysis. D. Reidel, 1983.
Roger White. Evidential symmetry and mushy credence. In Oxford Studies
in Epistemology. Oxford University Press, 2009.