Many Kinds of Confirmation*

Malcolm Forster
May 21, 2001

Example 1, Problem 1: Suppose that there are two processes by which a disease is contracted (event D), depending on two genotypes in the population.

Type 1: This process occurs for half of the population. For this segment of the population, there is a 10% chance of developing the disease. There is a test for the disease such that 90% of the people who have the disease will test positive (event E), while the false positive rate is 10%, which means that there is a 10% chance of testing positive for the disease when they do not have it.

Type 2: For the other 50% of the population, the chance of developing the disease is higher, at 50%. As far as testing is concerned, the situation is the same: 90% of the people who have the disease will test positive (event E), while the false positive rate is 10%.

In terms of the framework developed in Forster and Kieseppä, the relative frequencies of the two segments of the population are denoted by $p_1 = 50/100$, $p_2 = 50/100$, and the chance probabilities for the two cases are:

$$\lambda_1(D) = 10/100, \quad \lambda_1(E \mid D) = 90/100, \quad \lambda_1(E \mid \neg D) = 10/100,$$
$$\lambda_2(D) = 50/100, \quad \lambda_2(E \mid D) = 90/100, \quad \lambda_2(E \mid \neg D) = 10/100,$$

where $\neg D$ means that the disease is not contracted. Suppose that a person randomly selected from the population tests positive for the disease (E). The problem is to decide what credence to place on the prediction of $D$ or $\neg D$. With this information, a standard procedure is to measure credence in terms of what Forster and Kieseppä call the population probabilities: namely, the prediction of $D$ has greater credence than a prediction of $\neg D$ just in case

$$\Pr(D \mid E) > \Pr(\neg D \mid E).$$

* My thanks go to Ellery Eells, Branden Fitelson, Dan Hausman, Ilkka Kieseppä, Stephen Leeds, Elliott Sober, and Elling Ulvestad for corrections and comments on previous drafts.
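The population probabilities for this setup can be computed mechanically by averaging the two genotypes' chances, weighted by their population frequencies. A minimal sketch in Python (the variable names are my own, not the paper's notation):

```python
# Example 1: two genotypes, each with its own chance mechanism.
p = [0.5, 0.5]                    # relative frequencies p_1, p_2
lam_D = [0.10, 0.50]              # lambda_i(D): chance of contracting the disease
lam_E_given_D = [0.90, 0.90]      # lambda_i(E | D): true-positive rate
lam_E_given_notD = [0.10, 0.10]   # lambda_i(E | not-D): false-positive rate

# Population probabilities: average the chances over the two genotypes.
pr_D_and_E = sum(p[i] * lam_D[i] * lam_E_given_D[i] for i in range(2))
pr_notD_and_E = sum(p[i] * (1 - lam_D[i]) * lam_E_given_notD[i] for i in range(2))
pr_E = pr_D_and_E + pr_notD_and_E
pr_D_given_E = pr_D_and_E / pr_E

print(round(pr_D_and_E, 2))    # 0.27
print(round(pr_E, 2))          # 0.34
print(round(pr_D_given_E, 3))  # 0.794
```

These are exactly the quantities the standard (Bayesian) procedure conditionalizes on.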
It is useful to calculate the conditional probability in two steps: $\Pr(D, E) = 27\%$, $\Pr(E) = 34\%$, and $\Pr(D \mid E) \approx 79.4\%$. Therefore the standard answer in this example is that it is far more probable that the tested person has the disease than not.

Given the information provided, this solution to the problem is correct. The question I wish to raise is why it is correct. A standard response is that the answer is obtained from the population probability distribution by conditionalizing on the information given, namely the proposition E. A feature of this rationale is that there is no mention of payoffs or utility, and that conditionalization, or some postulate that entails conditionalization, such as the maximum entropy principle,[1] is treated as a fundamental postulate.

I wish to put forward a quite different justification for the same answer, one that does specify a payoff and does not assume conditionalization or the maximum entropy principle as a primitive postulate. The payoff is explained as follows. We want to find a joint distribution $q_E(D, E)$ that maximizes the expected predictive accuracy (Forster and Sober 1994) in cases in which E occurs, where the predictive accuracy in a particular case is measured by the log-likelihood of $q_E(D, E)$. The problem, as defined, already imposes the condition that E occurs, so that only a distribution satisfying the constraint $q_E(E) = 1$ is a candidate for the maximization of the log-likelihood, unless there is some other constraint that is incompatible with it (and I will prove that there is not). The constraint will fall out of the solution, and is not imposed upon it.

The next step is to write down an expression for the expected log-likelihood of $q_E(D, E)$. I divide this problem into two possible cases: one in which $\lambda_1$ is the true distribution, and one in which $\lambda_2$ is.

[1] See Williams (1980) for one such proof of conditionalization.
In the present framework, one can prove that the probability distribution satisfying the constraint that the probability of E is 1, and having the smallest Kullback-Leibler distance from the unconditional population probability distribution, is the distribution obtained by conditionalizing on E (note that it satisfies the constraint, since $\Pr(E \mid E) = 1$). The idea goes back at least as far as E. T. Jaynes (see Jaynes 1979 for back references).

If $\lambda_i$ is the true distribution, for $i = 1, 2$, then the expected log-likelihood of $q_E(D, E)$, given that E is true, is:

$$\mathrm{pay}(q_E \mid \lambda_i, E) = \lambda_i(D \mid E)\,\log q_E(D, E) + \lambda_i(\neg D \mid E)\,\log q_E(\neg D, E). \tag{1}$$

Here I assume that the expected payoff is defined in terms of the conditional probabilities $\lambda_i(D \mid E)$, and similarly for $\neg D$. However, this is not to assume conditionalization in the previous sense. What I have assumed is justified by the meaning of conditional probabilities in probability theory. It suffices to make the case in terms of finite proportions, and to refer the reader to any textbook on probability theory for the generalization. Suppose that one randomly selects a marble from an urn containing 10 marbles, 4 of which have the property E, and only one of those 4 has the property D. The assumption that the marble is selected randomly means that each has an equal probability of being selected. Then the probability that the marble has the property D, given that it has the property E, is 25%.

There are two ways of understanding the payoff described in (1). One is in terms of predictive accuracy, as already explained. The other is in terms of how close $q_E(D, E)$ is to $\lambda_i(D \mid E)$. The payoff in (1) receives its greatest possible value when $q_E(D, E) = \lambda_i(D \mid E)$, and it decreases from this maximum as $q_E(D, E)$ moves away from $\lambda_i(D \mid E)$. This measure of discrepancy is the one introduced by Kullback and Leibler (1951).
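The claim that the payoff in (1) peaks at $q_E(D, E) = \lambda_i(D \mid E)$, and that the shortfall from that peak is exactly the Kullback-Leibler discrepancy, can be checked numerically. A sketch, using an illustrative value of $\lambda_i(D \mid E)$ chosen by me:

```python
import math

def payoff(q, lam):
    # Expected log-likelihood of eq. (1), writing q for q_E(D, E)
    # and lam for lambda_i(D | E).
    return lam * math.log(q) + (1 - lam) * math.log(1 - q)

lam = 0.9  # an illustrative value for lambda_i(D | E)

# Scan candidate values of q: the payoff is largest where q matches lam.
qs = [i / 100 for i in range(1, 100)]
best_q = max(qs, key=lambda q: payoff(q, lam))
assert abs(best_q - lam) < 1e-12

# The drop from the maximum equals the Kullback-Leibler discrepancy.
q = 0.5
kl = lam * math.log(lam / q) + (1 - lam) * math.log((1 - lam) / (1 - q))
assert abs((payoff(lam, lam) - payoff(q, lam)) - kl) < 1e-12
```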
Overall, the expected payoff is given by

$$\Pr(\lambda_1 \mid E)\,\mathrm{pay}(q_E \mid \lambda_1, E) + \Pr(\lambda_2 \mid E)\,\mathrm{pay}(q_E \mid \lambda_2, E), \tag{2}$$

where $\Pr(\lambda_i \mid E)$ is the expected proportion of cases in which $\lambda_i$ is the true distribution out of those for which E is true. To maximize this expected payoff we must reach a compromise between maximizing the two payoffs separately. When we substitute (1) into (2), we find that the expected payoff is

$$\mathrm{pay}(q_E \mid E) = \Pr(D \mid E)\,\log q_E(D, E) + \Pr(\neg D \mid E)\,\log q_E(\neg D, E). \tag{3}$$

Kullback and Leibler (1951) prove a theorem showing that the solution to this maximization problem is

$$q_E(D, E) = \Pr(D \mid E), \tag{4}$$

which is what I wanted to show. Again, there are two interpretations of this result. Under the first interpretation, the joint probability distribution

$$q_E(D, E) = \Pr(D \mid E), \quad q_E(\neg D, E) = \Pr(\neg D \mid E), \quad q_E(D, \neg E) = 0 = q_E(\neg D, \neg E),$$

gives the likelihoods of the predictively most accurate hypothesis, on average, for data generated in the way described. In this sense, the hypothesis makes better probabilistic predictions than any hypothesis that assigns different probabilities. Does this mean that the hypothesis in question makes the best possible predictions? That is clearly a different question. There is a sense in which the hypothesis $q_E$ makes no predictions at all, in that it assigns only probabilities to events. But what if we used this probability to guess whether D is true, according to whether $q_E(D) > q_E(\neg D)$? Could we expect such guesses to be right more of the time than those provided by any other hypothesis? Since $\Pr(D \mid E)$ is just the expected relative frequency with which D occurs in this situation, if we predict D on the basis of the inequality, then we will be right more often than not, on average. However, there are many other probability distributions that give exactly the same prediction. The distribution $q_E$ is very special: it is the closest representation of the generating probability in the sense of (4).
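That the compromise demanded by (2) lands exactly on the population conditional probability can be confirmed by brute force with Example 1's numbers; a sketch (the grid resolution is my own choice):

```python
import math

pr_D_given_E = 27 / 34  # Pr(D | E) computed for Example 1

def expected_pay(q):
    # Eq. (3): the mixture-weighted expected log-likelihood,
    # writing q for q_E(D, E).
    return pr_D_given_E * math.log(q) + (1 - pr_D_given_E) * math.log(1 - q)

# Scan candidate values of q_E(D, E) on a fine grid.
qs = [i / 1000 for i in range(1, 1000)]
best = max(qs, key=expected_pay)
assert abs(best - pr_D_given_E) < 1e-3  # the maximum sits at Pr(D | E)
```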
This is the second interpretation. Some Bayesians may say that $\Pr(D \mid E)$ expresses the correct rational degree of belief in the hypothesis that D, given the background knowledge and the evidence E. Notice that there is no reference to utilities in this statement. The viewpoint of Forster and Kieseppä is importantly different. The explicit reference to the payoff is important, because there are other payoffs that may be considered. For example, consider what happens when we replace log-likelihoods with likelihoods in (1), and define a payoff as follows:

$$\mathrm{pay}(q_E \mid \lambda_i, E) = \lambda_i(D \mid E)\, q_E(D, E) + \lambda_i(\neg D \mid E)\, q_E(\neg D, E). \tag{5}$$

When we substitute (5) into (2), we find that the expected payoff is

$$\mathrm{pay}(q_E \mid E) = \Pr(D \mid E)\, q_E(D, E) + \Pr(\neg D \mid E)\, q_E(\neg D, E). \tag{6}$$

The optimal solution for $q_E$ is $q_E(D, E) = 1$ if $\Pr(D \mid E) > \Pr(\neg D \mid E)$, and $q_E(\neg D, E) = 1$ if $\Pr(D \mid E) < \Pr(\neg D \mid E)$. If $\Pr(D \mid E) = \Pr(\neg D \mid E)$, then it does not matter how the probability is distributed between $q_E(D, E)$ and $q_E(\neg D, E)$, although the constraint $q_E(E) = 1$ must still be satisfied. Here we find that the direct prediction of D is valued above all else. It is not that this answer is wrong or irrational; it is the correct answer to a different problem.

It is interesting to note that in this case the optimal solution ($q_E(D, E) = 1$ if $\Pr(D \mid E) > \Pr(\neg D \mid E)$) is known so long as the probabilities $\Pr(D \mid E)$ and $\Pr(\neg D \mid E)$ are known. However, there is an intuition that there is a tremendous risk in adopting this solution if one is uncertain of the probabilities $\Pr(D \mid E)$ and $\Pr(\neg D \mid E)$. This same intuition says that to avoid excessive risk, one would do better to adopt the non-optimal solution $q_E(D, E) = \Pr(D \mid E)$ and $q_E(\neg D, E) = \Pr(\neg D \mid E)$. It is interesting to note that this solution has the same payoff as the mixed strategy of randomly guessing one of the extreme solutions according to the probabilities $\Pr(D \mid E)$ and $\Pr(\neg D \mid E)$.
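Both observations about the likelihood payoff (that an extreme solution is optimal, and that the probabilistic solution ties the mixed guessing strategy) can be verified with Example 1's numbers; a sketch:

```python
pr = 27 / 34  # Pr(D | E) from Example 1

def linear_pay(q):
    # Eq. (6): likelihood in place of log-likelihood, with q = q_E(D, E).
    return pr * q + (1 - pr) * (1 - q)

extreme = linear_pay(1.0)       # put all probability on D
probabilistic = linear_pay(pr)  # q_E(D, E) = Pr(D | E)

# The extreme solution beats the probabilistic one under this payoff...
assert extreme > probabilistic

# ...but the probabilistic solution matches the mixed strategy that guesses
# "D" with probability Pr(D | E) and "not-D" otherwise.
mixed = pr * linear_pay(1.0) + (1 - pr) * linear_pay(0.0)
assert abs(probabilistic - mixed) < 1e-12
```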
It might be argued that this highlights the point that these utilities take no account of risk: the two solutions are equivalent in this payoff, yet the probabilistic solution is far less risky. The payoff in (2) avoids this dilemma, since the probabilistic solution has a greater payoff than any extreme solution, including the one in which $q_E(D, E) = 1$ if $\Pr(D \mid E) > \Pr(\neg D \mid E)$. One might say that the risk is factored directly into the payoffs. There is also a sense in which the payoff in (2) is more fundamental than the payoff in (5), since the solution of the problem based on (5) can be deduced from the solution to the problem based on (2), but not the other way around. I shall consider only the maximization of expected log-likelihoods from this point on.

The shift in $\Pr(D \mid E)$ is often measured against the unconditional probability $\Pr(D)$. In order to see how $\Pr(D \mid E)$ arises out of $\Pr(D)$ in the present framework, it is best to express both of these in terms of objective chances:

$$\Pr_0(D \mid E) = p_1 \lambda_1(D \mid E) + p_2 \lambda_2(D \mid E).$$

Example 1, Problem 2: A different result follows from the kind of expected predictive accuracy considered by Forster and Kieseppä. The difference arises from a different goal. In the framework of Forster and Kieseppä, the basic aim is to accurately represent the chance probability that produced D. In Example 1, the average chance is between 10% and 50%, and so we may expect to see that the predictively most accurate representation of the probability of D gives a probability between these values. If so, then the conditional probability $\Pr(D \mid E)$ cannot be the solution to this problem. I will now show how this turns out. Prior to the evidence E, the hypothesis with the greatest expected predictive accuracy was

$$q(D) = p_1 \lambda_1(D) + p_2 \lambda_2(D) = 30\%.$$

It is interesting to note that this expected chance is the same as $\Pr(D)$, so both problems seek to update the same probability.
After the evidence E is taken into account,

$$q_E(D) = \frac{p_1 \lambda_1(E)\, \lambda_1(D) + p_2 \lambda_2(E)\, \lambda_2(D)}{\Pr(E)} \approx 39.4\%,$$

which raises the predictive probability of D, but not to a value over 50%. At first it seems almost circular that $q_E(D)$ should be shifted towards $\lambda_2(D)$ because of the dependence of $\lambda_1(E)$ and $\lambda_2(E)$ on $\lambda_1(D)$ and $\lambda_2(D)$. However, Forster and Kieseppä's framework is not concerned with discovering the values of $\lambda_1(D)$ and $\lambda_2(D)$. The problem is to identify which chance mechanism is in force within the token case in which E was observed, so that, if a case of exactly the same type were to recur, we know what probability to use to predict D prior to testing. For example, suppose the person in question has an identical twin who cannot be tested. The twins are either both Type 1 or both Type 2. The test result on one of the twins enables us to make more accurate predictions about the other twin.

The same does not apply to the Bayesian inference, which aims to determine whether the individual tested has the disease. I believe that it has always been taken for granted that the two problems are the same. One purpose of this note is to demonstrate that they are not.

Example 2: Again, suppose that there are two processes by which a disease is contracted (event D), depending on two genotypes in the population.

Type 1: This process occurs for only 30% of the population. For this segment of the population, there is a 10% chance of developing the disease. There is a test (event E) that provides evidence that someone does not have the disease. The bad news is that it is ineffective for people of this type, because there is a 10% chance of passing the test (event E) whether they have the disease or not.

Type 2: For the other 70% of the population, the chance of developing the disease is 80%. As far as testing is concerned, the news is much better.
Only 10% of the people who have the disease will pass the test (event E), while 90% of those who do not have the disease will pass it. The event E is negatively correlated with having the disease in this subpopulation. Fortunately, they are also in the majority.

The relative frequencies of the two segments of the population are $p_1 = 30/100$, $p_2 = 70/100$, and the chance probabilities for the two cases are:

$$\lambda_1(D) = 10/100, \quad \lambda_1(E \mid D) = 10/100, \quad \lambda_1(E \mid \neg D) = 10/100,$$
$$\lambda_2(D) = 80/100, \quad \lambda_2(E \mid D) = 10/100, \quad \lambda_2(E \mid \neg D) = 90/100,$$

where $\neg D$ means that the disease is not contracted. Suppose that a person randomly selected from the population passes the test (event E), and we wish to decide whether to predict D. With this information, a standard Bayesian will settle the question in terms of the population probabilities: namely, predict D just in case

$$\Pr(D \mid E) > \Pr(\neg D \mid E).$$

In this example, $\Pr(D, E) = 5.9\%$, $\Pr(E) = 21.2\%$, and $\Pr(D \mid E) \approx 27.8\%$. Therefore the standard Bayesian answer is that it is more probable that the tested person (who has result E) does not have the disease. Moreover, the prior probability of having the disease is 59%, so this is a case in which passing the test clearly reduces the probability that the person has the disease.

But what should we predict for the person's twin? Should we also tell her that her probability of having the disease has been cut in half? The key point is that $\lambda_1(E) = 10\%$ whereas $\lambda_2(E) = 26\%$, so the evidence E provides better support for the hypothesis that the tested person is of Type 2. Type 2 individuals contract the disease at far higher rates than Type 1 individuals. So, the fact that one twin passed the test is good news for that twin, but bad news for the other twin. Prior to the evidence E, the predictively most accurate hypothesis for either twin was

$$q(D) = p_1 \lambda_1(D) + p_2 \lambda_2(D) = 59\%.$$
But now, taking E into account,

$$q_E(D) = \frac{p_1 \lambda_1(E)\, \lambda_1(D) + p_2 \lambda_2(E)\, \lambda_2(D)}{\Pr(E)} \approx 70\%.$$

Therefore, the probability of having the disease has increased dramatically for the untested twin.

References

Forster, M. R. and I. A. Kieseppä (forthcoming): "Why Probabilistic Averaging Is All the Reduction You Need."

Forster, Malcolm R. and Elliott Sober (1994): "How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions." The British Journal for the Philosophy of Science 45: 1-35.

Jaynes, E. T. (1979): "Where Do We Stand on Maximum Entropy?", in Ralph D. Levine and Myron Tribus (eds.), The Maximum Entropy Formalism, Cambridge, Mass.: The MIT Press, pp. 15-118.

Kullback, S. and R. A. Leibler (1951): "On Information and Sufficiency." Annals of Mathematical Statistics 22: 79-86.

Williams, P. M. (1980): "Bayesian Conditionalization and the Principle of Minimum Information." The British Journal for the Philosophy of Science 31: 131-144.
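Appendix: the contrast in Example 2, where the Bayesian update lowers the tested person's probability of disease while the chance update raises the untested twin's, can be checked end to end; a sketch in Python (variable names are mine):

```python
# Example 2: the Bayesian answer and the predictive-accuracy answer diverge.
p = [0.3, 0.7]
lam_D = [0.10, 0.80]              # lambda_i(D)
lam_E_given_D = [0.10, 0.10]      # lambda_i(E | D): chance of passing if diseased
lam_E_given_notD = [0.10, 0.90]   # lambda_i(E | not-D)

lam_E = [lam_D[i] * lam_E_given_D[i] + (1 - lam_D[i]) * lam_E_given_notD[i]
         for i in range(2)]
pr_E = sum(p[i] * lam_E[i] for i in range(2))
pr_D_and_E = sum(p[i] * lam_D[i] * lam_E_given_D[i] for i in range(2))

pr_D_given_E = pr_D_and_E / pr_E                                  # Bayesian update
q_E_D = sum(p[i] * lam_E[i] * lam_D[i] for i in range(2)) / pr_E  # chance update

print(round(pr_D_given_E, 3))  # ~0.278: tested twin's probability drops
print(round(q_E_D, 3))         # ~0.701: untested twin's probability rises
```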