Many Kinds of Confirmation*
Malcolm Forster
May 21, 2001
Example 1, Problem 1: Suppose that there are two processes by which a disease is contracted
(event D) depending on two genotypes in the population.
Type 1: This process occurs for half of the population. For this segment of the population, there is a 10% chance of developing the disease. There is a test for the disease such that 90% of the people in this segment who have the disease will test positive (event E), while the false positive rate is 10%: there is a 10% chance of testing positive when the disease is absent.
Type 2: For the other 50% of the population, the chance of developing the disease is higher, at 50%. As far as testing is concerned, the situation is the same: 90% of the people who have the disease will test positive (event E), while the false positive rate is 10%.
In terms of the framework developed in Forster and Kieseppä, the relative frequencies of the two segments of the population are denoted by p_1 = 50/100 and p_2 = 50/100, and the chance probabilities for the two cases are:

$$\lambda_1(D) = 10/100, \quad \lambda_1(E \mid D) = 90/100, \quad \lambda_1(E \mid \bar{D}) = 10/100,$$
$$\lambda_2(D) = 50/100, \quad \lambda_2(E \mid D) = 90/100, \quad \lambda_2(E \mid \bar{D}) = 10/100,$$
where D̄ means that the disease is not contracted. Suppose that a person randomly selected from the population tests positive for the disease (E). The problem is to decide what credence to place on the prediction of D or D̄. With this information, a standard procedure is to measure credence in terms of what Forster and Kieseppä call the population probabilities: namely, the prediction of D has greater credence than the prediction of D̄ just in case

$$\Pr(D \mid E) > \Pr(\bar{D} \mid E).$$

* My thanks go to Ellery Eells, Branden Fitelson, Dan Hausman, Ilkka Kieseppä, Stephen Leeds, Elliott Sober, and Elling Ulvestad for corrections and comments on previous drafts.
It is useful to calculate the conditional probability in two steps:
$$\Pr(D, E) = 27\%, \qquad \Pr(E) = 34\%, \qquad \Pr(D \mid E) \approx 79.4\%.$$
Therefore the standard answer in this example is that it is far more probable that the tested person has the disease than that they do not.
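The arithmetic can be checked directly. The following is a minimal sketch (in Python, with the chances above hard-coded; the variable names are ad hoc) of the two-step calculation:

```python
# Example 1: a population mixture of two chance setups (Type 1 and Type 2).
p = [0.5, 0.5]                   # relative frequencies p_1, p_2
lam_D = [0.10, 0.50]             # chance of the disease, lambda_i(D)
lam_E_given_D = [0.90, 0.90]     # true-positive rate, lambda_i(E | D)
lam_E_given_notD = [0.10, 0.10]  # false-positive rate, lambda_i(E | not-D)

# Population probabilities, obtained by averaging over the two types.
Pr_DE = sum(p[i] * lam_D[i] * lam_E_given_D[i] for i in range(2))
Pr_E = sum(p[i] * (lam_D[i] * lam_E_given_D[i]
                   + (1 - lam_D[i]) * lam_E_given_notD[i]) for i in range(2))

print(Pr_DE)         # 0.27  = Pr(D, E)
print(Pr_E)          # 0.34  = Pr(E)
print(Pr_DE / Pr_E)  # 0.794 = Pr(D | E)
```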
Given the information provided, this solution to the problem is correct. The question I wish to raise is why it is correct. A standard response is that the answer is obtained from the population probability distribution by conditionalizing on the information given, namely the proposition E. A feature of this rationale is that there is no mention of payoffs or utilities, and that conditionalization, or some postulate that entails conditionalization such as the maximum entropy principle,¹ is treated as a fundamental postulate.
I wish to put forward a quite different justification for the same answer, one which does specify a payoff and which does not assume conditionalization or the maximum entropy principle as a primitive postulate. The payoff is explained as follows. We want to find a joint distribution q_E(D, E) that maximizes the expected predictive accuracy (Forster and Sober 1994) in cases in which E occurs, where the predictive accuracy in a particular case is measured by the log-likelihood of q_E(D, E). The problem, as defined, already imposes the condition that E occurs, so only a distribution satisfying the constraint q_E(E) = 1 is a candidate for maximizing the log-likelihood, unless there is some other constraint that is incompatible with it (and I will prove that there is not). The constraint will fall out of the solution, and is not imposed upon it.
¹ See Williams (1980) for one such proof of conditionalization. In the present framework, one can prove that the probability distribution that satisfies the constraint that the probability of E is 1, and that has the smallest Kullback-Leibler distance from the unconditional population probability distribution, is the distribution obtained by conditionalizing on E (note that it satisfies the constraint, since Pr(E | E) = 1). The idea goes back at least as far as E. T. Jaynes (see Jaynes 1979 for back references).

The next step is to write down an expression for the expected log-likelihood of q_E(D, E). I divide this problem into two cases: one in which λ_1 is the true distribution and one in which λ_2 is. If λ_i is the true distribution, for i = 1, 2, then the expected log-likelihood of q_E(D, E) given that E is true is given by
$$\mathrm{pay}(q_E \mid \lambda_i, E) = \lambda_i(D \mid E)\,\log q_E(D, E) + \lambda_i(\bar{D} \mid E)\,\log q_E(\bar{D}, E). \tag{1}$$
Here I assume that the expected payoff is defined in terms of the conditional probabilities λ_i(D | E), and similarly for D̄. However, this is not to assume conditionalization in the previous sense. What I have assumed is justified by the meaning of conditional probabilities in probability theory. It suffices to make the case in terms of finite proportions, and to refer the reader to any textbook on probability theory for the generalization. Suppose that one randomly selects a marble from an urn containing 10 marbles, 4 of which have the property E; only one of those 4 has the property D. The assumption that the marble is selected randomly means that each marble has an equal probability of being selected. Then the probability that the marble has the property D, given that it has the property E, is 25%.
There are two ways of understanding the payoff described in (1). One way is in terms of predictive accuracy, as already explained. The other is in terms of how close q_E(D, E) is to λ_i(D | E). The payoff in (1) receives its greatest possible value when q_E(D, E) = λ_i(D | E), and decreases from this maximum as q_E(D, E) moves away from λ_i(D | E). This measure of discrepancy is the one introduced by Kullback and Leibler (1951).
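To illustrate this second reading, here is a small sketch (Python, using an illustrative value of 0.5 for λ_i(D | E); the grid search is purely for illustration) showing that the payoff (1) peaks where q_E(D, E) matches the true conditional chance and falls off on either side:

```python
import math

lam = 0.5  # illustrative value for lambda_i(D | E)

def pay(q):
    """Expected log-likelihood (1) when q_E(D, E) = q and q_E(not-D, E) = 1 - q."""
    return lam * math.log(q) + (1 - lam) * math.log(1 - q)

# The payoff is largest at q = lambda_i(D | E) and decreases on either side.
for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(q, round(pay(q), 4))

best = max((k / 1000 for k in range(1, 1000)), key=pay)
print(best)  # approximately 0.5
```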
Overall, the expected payoff is given by

$$\Pr(\lambda_1 \mid E)\,\mathrm{pay}(q_E \mid \lambda_1, E) + \Pr(\lambda_2 \mid E)\,\mathrm{pay}(q_E \mid \lambda_2, E), \tag{2}$$
where Pr(λ_i | E) is the expected proportion of cases in which λ_i is the true distribution, out of those for which E is true. To maximize this expected payoff we must reach a compromise between maximizing the two payoffs separately. When we substitute (1) into (2), we find that the expected payoff is
$$\mathrm{pay}(q_E \mid E) = \Pr(D \mid E)\,\log q_E(D, E) + \Pr(\bar{D} \mid E)\,\log q_E(\bar{D}, E). \tag{3}$$
Kullback and Leibler (1951) prove a theorem that shows that the solution to this maximization
problem is
$$q_E(D, E) = \Pr(D \mid E), \tag{4}$$
which is what I wanted to show.
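As a numerical sanity check on (4), here is a short sketch (Python, with the Example 1 figures; the grid search is only for illustration) that maximizes the expected payoff (2) over candidate values of q_E(D, E):

```python
import math

# Example 1 figures.
p = [0.5, 0.5]
lam_D = [0.10, 0.50]
tp, fp = 0.9, 0.1  # the test is the same for both types

lam_E = [d * tp + (1 - d) * fp for d in lam_D]          # lambda_i(E) = 0.18, 0.50
Pr_E = sum(p[i] * lam_E[i] for i in range(2))           # 0.34
Pr_lam_E = [p[i] * lam_E[i] / Pr_E for i in range(2)]   # Pr(lambda_i | E)
lam_D_E = [lam_D[i] * tp / lam_E[i] for i in range(2)]  # lambda_i(D | E) = 0.5, 0.9

def expected_pay(q):
    """Payoff (2) with q_E(D, E) = q and q_E(not-D, E) = 1 - q."""
    return sum(Pr_lam_E[i] * (lam_D_E[i] * math.log(q)
                              + (1 - lam_D_E[i]) * math.log(1 - q)) for i in range(2))

best = max((k / 10000 for k in range(1, 10000)), key=expected_pay)
print(best, 27 / 34)  # both approximately 0.794, i.e. q_E(D, E) = Pr(D | E)
```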
Again, there are two interpretations of this result. Under the first interpretation, the joint
probability distribution
$$q_E(D, E) = \Pr(D \mid E), \quad q_E(\bar{D}, E) = \Pr(\bar{D} \mid E), \quad \text{and} \quad q_E(D, \bar{E}) = 0 = q_E(\bar{D}, \bar{E}),$$
are the likelihoods of the predictively most accurate hypothesis, on average, for data generated in
the way described. In this sense, the hypothesis makes better probabilistic predictions than any hypothesis that assigns different probabilities. Does this mean that the hypothesis in
question makes the best possible predictions? That is clearly a different question. There is a
sense in which the hypothesis qE makes no predictions at all, in that it assigns only probabilities
to events. But suppose we used this probability to guess whether D is true according to whether q_E(D) > q_E(D̄). Could we expect such guesses to be right more of the time than those provided by any other hypothesis? Since Pr(D | E) is just the expected relative frequency with which D occurs in this situation, if we predict D on the basis of the inequality, then we will be right more often than not, on average. However, there are many other probability distributions that give exactly the same prediction. The distribution q_E is very special: it is the closest representation of the generating probability in the sense of (4). This is the second interpretation.
Some Bayesians may say that Pr(D | E) expresses the correct rational degree of belief in the
hypothesis that D, given the background knowledge and the evidence E. Notice that there is no
reference to utilities in this statement. The viewpoint of Forster and Kieseppä is importantly
different. The explicit reference to the payoff is important, because there are other payoffs that
may be considered.
For example, consider what happens when we replace log-likelihoods with likelihoods in (1), and define a payoff as follows:

$$\mathrm{pay}(q_E \mid \lambda_i, E) = \lambda_i(D \mid E)\, q_E(D, E) + \lambda_i(\bar{D} \mid E)\, q_E(\bar{D}, E). \tag{5}$$

When we substitute (5) into (2), we find that the expected payoff is
$$\mathrm{pay}(q_E \mid E) = \Pr(D \mid E)\, q_E(D, E) + \Pr(\bar{D} \mid E)\, q_E(\bar{D}, E). \tag{6}$$
The optimal solution for q_E is q_E(D, E) = 1 if Pr(D | E) > Pr(D̄ | E), and q_E(D̄, E) = 1 if Pr(D | E) < Pr(D̄ | E). If Pr(D | E) = Pr(D̄ | E), then it does not matter how the probability is distributed between q_E(D, E) and q_E(D̄, E), although the constraint q_E(E) = 1 must still be satisfied. Here we find that the direct prediction of D is valued above all else. It's not that this answer is wrong or irrational. It is the correct answer to a different problem.
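A quick comparison (a Python sketch, using the Example 1 value Pr(D | E) = 27/34) makes vivid why the payoff in (6) favors the extreme solution:

```python
Pr_D_E = 27 / 34  # Pr(D | E) in Example 1

def linear_pay(q):
    """Expected payoff (6) when q_E(D, E) = q and q_E(not-D, E) = 1 - q."""
    return Pr_D_E * q + (1 - Pr_D_E) * (1 - q)

print(linear_pay(1.0))     # 0.794: the extreme solution, all probability on D
print(linear_pay(Pr_D_E))  # 0.673: the probabilistic solution, Pr(D|E)^2 + Pr(not-D|E)^2
```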
It is interesting to note that in this case the optimal solution (q_E(D, E) = 1 if Pr(D | E) > Pr(D̄ | E)) is known so long as the probabilities Pr(D | E) and Pr(D̄ | E) are known. However, there is an intuition that there is a tremendous risk in adopting this solution if one is uncertain of the probabilities Pr(D | E) and Pr(D̄ | E). This same intuition says that, to avoid excessive risk, one would do better to adopt the non-optimal solution q_E(D, E) = Pr(D | E) and q_E(D̄, E) = Pr(D̄ | E). It is interesting to note that this solution has the same payoff as the mixed strategy of randomly guessing one of the extreme solutions according to the probabilities Pr(D | E) and Pr(D̄ | E). It might be argued that this highlights the point that these utilities take no account of risk: the two solutions are equivalent in terms of this payoff, yet the probabilistic solution is far less risky.
The payoff in (2) avoids this dilemma, since the probabilistic solution has a greater payoff than any extreme solution, including the one in which q_E(D, E) = 1 if Pr(D | E) > Pr(D̄ | E). One might say that the risk is factored directly into the payoffs. There is also a sense in which the payoff in (2) is more fundamental than the payoff in (5), since the solution of the problem based on (5) can be deduced from the solution of the problem based on (2), but not the other way around. I shall consider only the maximization of expected log-likelihoods from this point on.
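Both of these claims are easy to check numerically. A sketch (Python, again with Pr(D | E) = 27/34 from Example 1): under the log payoff the probabilistic solution strictly beats the extreme solution, which scores negative infinity because it assigns probability zero to an outcome of positive probability, while under the linear payoff it merely ties the mixed strategy of guessing an extreme solution at random:

```python
import math

Pr = 27 / 34  # Pr(D | E) in Example 1; 1 - Pr is Pr(not-D | E)

def log_pay(q):  # expected log-likelihood payoff, as in (3)
    return Pr * math.log(q) + (1 - Pr) * math.log(1 - q) if 0 < q < 1 else float("-inf")

def lin_pay(q):  # linear payoff, as in (6)
    return Pr * q + (1 - Pr) * (1 - q)

# Claim 1: under the log payoff, the probabilistic solution beats any extreme solution.
print(log_pay(Pr), log_pay(1.0))  # finite value vs -inf

# Claim 2: under the linear payoff, the probabilistic solution only ties the mixed
# strategy that picks an extreme solution at random with probabilities Pr and 1 - Pr.
mixed = Pr * lin_pay(1.0) + (1 - Pr) * lin_pay(0.0)
print(lin_pay(Pr), mixed)  # equal
```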
The shift in Pr(D | E) is often measured against the unconditional probability Pr(D). In order to see how Pr(D | E) arises out of Pr(D) in the present framework, it is best to express both of these in terms of the objective chances:

$$\Pr(D) = p_1\,\lambda_1(D) + p_2\,\lambda_2(D), \qquad \Pr(D \mid E) = \Pr(\lambda_1 \mid E)\,\lambda_1(D \mid E) + \Pr(\lambda_2 \mid E)\,\lambda_2(D \mid E).$$
Example 1, Problem 2: A different result follows from the kind of expected predictive accuracy considered by Forster and Kieseppä. The difference arises from a different goal. In the framework of Forster and Kieseppä the basic aim is to accurately represent the chance probability that produced D. In Example 1 the average chance lies between 10% and 50%, and so we may expect the predictively most accurate representation of the probability of D to give a probability between these values. If so, then the conditional probability Pr(D | E), which is about 79.4%, cannot be the solution to this problem. I will now show how this turns out.
Prior to the evidence E, the hypothesis with the greatest expected predictive accuracy was

$$q(D) = p_1\,\lambda_1(D) + p_2\,\lambda_2(D) = 30\%.$$
It is interesting to note that this expected chance is the same as Pr(D), so both problems seek to update the same probability. After the evidence E is taken into account,

$$q_E(D) = \bigl(p_1\,\lambda_1(E)\,\lambda_1(D) + p_2\,\lambda_2(E)\,\lambda_2(D)\bigr) \,/\, \Pr(E) = 39.4\%,$$

which raises the predictive probability of D, but not to a value over 50%.
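A minimal check of the before-and-after values (a Python sketch, Example 1 figures):

```python
# Example 1 figures.
p = [0.5, 0.5]
lam_D = [0.10, 0.50]
lam_E = [0.10 * 0.9 + 0.90 * 0.1,   # lambda_1(E) = 0.18
         0.50 * 0.9 + 0.50 * 0.1]   # lambda_2(E) = 0.50

q_D = sum(p[i] * lam_D[i] for i in range(2))
Pr_E = sum(p[i] * lam_E[i] for i in range(2))
qE_D = sum(p[i] * lam_E[i] * lam_D[i] for i in range(2)) / Pr_E

print(q_D)   # 0.30  before the evidence
print(qE_D)  # 0.394 after the evidence: still between 10% and 50%, unlike Pr(D | E) = 0.794
```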
At first it seems almost circular that qE ( D ) should be shifted towards λ2 ( D ) because of the
dependence of λ1 ( E ) and λ2 ( E ) on λ1 ( D ) and λ2 ( D ) . However, Forster and Kieseppä’s
framework is not concerned with discovering the values of λ_1(D) and λ_2(D). The problem is to identify which chance mechanism is in force in the token case in which E was observed, and hence what probability one should use to predict D, prior to testing, if a case of exactly the same type were to recur. For example, suppose the person in question has an identical twin who
cannot be tested. The twins are either both Type 1 or both Type 2. The test result on one of the
twins enables us to make more accurate predictions about the other twin.
The same does not apply to the Bayesian inference, which aims to determine whether the
individual tested has the disease. I believe that it has always been taken for granted that the two
problems are the same. One purpose of this note is to demonstrate that they are not.
Example 2: Again, suppose that there are two processes by which a disease is contracted (event
D) depending on two genotypes in the population.
Type 1: This process occurs for only 30% of the population. For this segment of the population, there is a 10% chance of developing the disease. There is a test (event E) that provides evidence that someone does not have the disease. The bad news is that the test is ineffective for people of this type, because there is a 10% chance of passing it (event E) whether they have the disease or not.
Type 2: For the other 70% of the population, the chance of developing the disease is 80%.
As far as testing is concerned, the news is much better. Only 10% of the people who have the
disease will pass the test (event E), while 90% of those who do not have the disease will pass the
test. The event E is negatively correlated with having the disease in this subpopulation.
Fortunately, they are also in the majority.
The relative frequencies of the two segments of the population are p_1 = 30/100 and p_2 = 70/100, and the chance probabilities for the two cases are:

$$\lambda_1(D) = 10/100, \quad \lambda_1(E \mid D) = 10/100, \quad \lambda_1(E \mid \bar{D}) = 10/100,$$
$$\lambda_2(D) = 80/100, \quad \lambda_2(E \mid D) = 10/100, \quad \lambda_2(E \mid \bar{D}) = 90/100,$$
where D̄ means that the disease is not contracted. Suppose that a person randomly selected from the population passes the test (E), and we wish to decide whether to predict D. With this information, a standard Bayesian will settle the question in terms of the population probabilities: namely, predict D just in case

$$\Pr(D \mid E) > \Pr(\bar{D} \mid E).$$

In this example,

$$\Pr(D, E) = 5.9\%, \qquad \Pr(E) = 21.2\%, \qquad \Pr(D \mid E) \approx 27.8\%.$$
Therefore the standard Bayesian answer is that it is more probable that the tested person (who has result E) does not have the disease. Moreover, the prior probability of having the disease is 59%, so this is a case in which passing the test clearly reduces the probability that the person has the disease.
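The corresponding check for Example 2 (a Python sketch with the chances above hard-coded):

```python
# Example 2 figures.
p = [0.3, 0.7]
lam_D = [0.10, 0.80]
lam_E_given_D = [0.10, 0.10]
lam_E_given_notD = [0.10, 0.90]

Pr_D = sum(p[i] * lam_D[i] for i in range(2))
Pr_DE = sum(p[i] * lam_D[i] * lam_E_given_D[i] for i in range(2))
Pr_E = sum(p[i] * (lam_D[i] * lam_E_given_D[i]
                   + (1 - lam_D[i]) * lam_E_given_notD[i]) for i in range(2))

print(Pr_D)          # 0.59  prior probability of the disease
print(Pr_DE, Pr_E)   # 0.059, 0.212
print(Pr_DE / Pr_E)  # 0.278 posterior Pr(D | E): passing the test lowers it
```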
But what should we predict for the person’s twin? Should we also tell her that her
probability of having the disease has been cut in half? The key point is that λ1 ( E ) = 10%
whereas λ2 ( E ) = 26% , so that the evidence E provides better support for the hypothesis that the
tested person is of Type 2. Type 2 individuals contract the disease at far higher rates than Type 1
individuals. So, the fact that one twin passed the test is good news for that twin, but bad news
for the other twin. Prior to the evidence E, the predictively most accurate hypothesis for either
twin was

$$q(D) = p_1\,\lambda_1(D) + p_2\,\lambda_2(D) = 59\%.$$
But now, taking E into account,

$$q_E(D) = \bigl(p_1\,\lambda_1(E)\,\lambda_1(D) + p_2\,\lambda_2(E)\,\lambda_2(D)\bigr) \,/\, \Pr(E) = 70\%.$$

Therefore, the probability of having the disease has increased dramatically for the untested twin.
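And a check of the twin calculation (Python, same Example 2 figures as above):

```python
# Example 2 figures.
p = [0.3, 0.7]
lam_D = [0.10, 0.80]
lam_E = [0.10 * 0.1 + 0.90 * 0.1,   # lambda_1(E) = 0.10
         0.80 * 0.1 + 0.20 * 0.9]   # lambda_2(E) = 0.26

Pr_E = sum(p[i] * lam_E[i] for i in range(2))
qE_D = sum(p[i] * lam_E[i] * lam_D[i] for i in range(2)) / Pr_E

print(lam_E)  # approximately [0.10, 0.26]: E favors Type 2, the high-risk genotype
print(qE_D)   # 0.70: the predictive probability of D for the untested twin goes up
```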
References

Forster, M. R. and I. A. Kieseppä (forthcoming): "Why Probabilistic Averaging Is All the Reduction You Need."

Forster, Malcolm R. and Elliott Sober (1994): "How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions." The British Journal for the Philosophy of Science 45: 1-35.

Jaynes, E. T. (1979): "Where Do We Stand on Maximum Entropy?", in Ralph D. Levine and Myron Tribus (eds.), The Maximum Entropy Formalism. Cambridge, Mass.: The MIT Press, pp. 15-118.

Kullback, S. and R. A. Leibler (1951): "On Information and Sufficiency." Annals of Mathematical Statistics 22: 79-86.

Williams, P. M. (1980): "Bayesian Conditionalization and the Principle of Minimum Information." The British Journal for the Philosophy of Science 31: 131-144.