Springer Series in Statistics
Advisors: D. Brillinger, S. Fienberg, J. Gani, J. Hartigan, K. Krickeberg

L. A. Goodman and W. H. Kruskal, Measures of Association for Cross Classifications. x, 146 pages, 1979.
J. O. Berger, Statistical Decision Theory: Foundations, Concepts, and Methods. xiv, 425 pages, 1980.
R. G. Miller, Jr., Simultaneous Statistical Inference, 2nd edition. xvi, 299 pages, 1981.
P. Bremaud, Point Processes and Queues: Martingale Dynamics. xviii, 354 pages, 1981.
E. Seneta, Non-Negative Matrices and Markov Chains. xv, 279 pages, 1981.
F. J. Anscombe, Computing in Statistical Science through APL. xvi, 426 pages, 1981.
J. W. Pratt and J. D. Gibbons, Concepts of Nonparametric Theory. xvi, 462 pages, 1981.
V. Vapnik, Estimation of Dependences Based on Empirical Data. xvi, 399 pages, 1982.
H. Heyer, Theory of Statistical Experiments. x, 289 pages, 1982.
L. Sachs, Applied Statistics: A Handbook of Techniques. xxviii, 706 pages, 1982.
M. R. Leadbetter, G. Lindgren and H. Rootzen, Extremes and Related Properties of Random Sequences and Processes. xii, 336 pages, 1983.
H. Kres, Statistical Tables for Multivariate Analysis. xxii, 504 pages, 1983.
J. A. Hartigan, Bayes Theory. xii, 145 pages, 1983.

J. A. Hartigan
Bayes Theory
Springer-Verlag New York Berlin Heidelberg Tokyo

J. A. Hartigan
Department of Statistics
Yale University
Box 2179 Yale Station
New Haven, CT 06520
U.S.A.

AMS Classification: 62A15

Library of Congress Cataloging in Publication Data
Hartigan, J. A.
Bayes theory.
(Springer series in statistics)
Includes bibliographies and index.
1. Mathematical statistics. I. Title. II. Series.
QA276.H392 1983 519.5 83-10591

With 4 figures.

© 1983 by Springer-Verlag New York Inc.
Softcover reprint of the hardcover 1st edition 1983
All rights reserved. No part of this book may be translated or reproduced in any form without written permission from Springer-Verlag, 175 Fifth Avenue, New York, New York 10010, U.S.A.
Typeset by Thomson Press (India) Limited, New Delhi, India.
9 8 7 6 5 4 3 2 1

ISBN-13: 978-1-4613-8244-7
e-ISBN-13: 978-1-4613-8242-3
DOI: 10.1007/978-1-4613-8242-3

To Jenny

Preface

This book is based on lectures given at Yale in 1971-1981 to students prepared with a course in measure-theoretic probability. It contains one technical innovation-probability distributions in which the total probability is infinite. Such improper distributions arise embarrassingly frequently in Bayes theory, especially in establishing correspondences between Bayesian and Fisherian techniques. Infinite probabilities create interesting complications in defining conditional probability and limit concepts.

The main results are theoretical, probabilistic conclusions derived from probabilistic assumptions. A useful theory requires rules for constructing and interpreting probabilities. Probabilities are computed from similarities, using a formalization of the idea that the future will probably be like the past. Probabilities are objectively derived from similarities, but similarities are subjective judgments of individuals. Of course the theorems remain true in any interpretation of probability that satisfies the formal axioms.

My colleague David Pollard helped a lot, especially with Chapter 13. Dan Barry read proof.

Contents

CHAPTER 1 Theories of Probability
1.0. Introduction
1.1. Logical Theories: Laplace
1.2. Logical Theories: Keynes and Jeffreys
1.3. Empirical Theories: Von Mises
1.4. Empirical Theories: Kolmogorov
1.5. Empirical Theories: Falsifiable Models
1.6. Subjective Theories: De Finetti
1.7. Subjective Theories: Good
1.8. All the Probabilities
1.9. Infinite Axioms
1.10. Probability and Similarity
1.11. References

CHAPTER 2 Axioms
2.0. Notation
2.1. Probability Axioms
2.2. Prespaces and Rings
2.3. Random Variables
2.4. Probable Bets
2.5. Comparative Probability
2.6. Problems
2.7. References

CHAPTER 3 Conditional Probability
3.0. Introduction
3.1. Axioms of Conditional Probability
3.2. Product Probabilities
3.3. Quotient Probabilities
3.4. Marginalization Paradoxes
3.5. Bayes Theorem
3.6. Binomial Conditional Probability
3.7. Problems
3.8. References

CHAPTER 4 Convergence
4.0. Introduction
4.1. Convergence Definitions
4.2. Mean Convergence of Conditional Probabilities
4.3. Almost Sure Convergence of Conditional Probabilities
4.4. Consistency of Posterior Distributions
4.5. Binomial Case
4.6. Exchangeable Sequences
4.7. Problems
4.8. References

CHAPTER 5 Making Probabilities
5.0. Introduction
5.1. Information
5.2. Maximal Learning Probabilities
5.3. Invariance
5.4. The Jeffreys Density
5.5. Similarity Probability
5.6. Problems
5.7. References

CHAPTER 6 Decision Theory
6.0. Introduction
6.1. Admissible Decisions
6.2. Conditional Bayes Decisions
6.3. Admissibility of Bayes Decisions
6.4. Variations on the Definition of Admissibility
6.5. Problems
6.6. References

CHAPTER 7 Uniformity Criteria for Selecting Decisions
7.0. Introduction
7.1. Bayes Estimates Are Biased or Exact
7.2. Unbiased Location Estimates
7.3. Unbiased Bayes Tests
7.4. Confidence Regions
7.5. One-Sided Confidence Intervals Are Not Unitary Bayes
7.6. Conditional Bets
7.7. Problems
7.8. References

CHAPTER 8 Exponential Families
8.0. Introduction
8.1. Examples of Exponential Families
8.2. Prior Distributions for the Exponential Family
8.3. Normal Location
8.4. Binomial
8.5. Poisson
8.6. Normal Location and Scale
8.7. Problems
8.8. References

CHAPTER 9 Many Normal Means
9.0. Introduction
9.1. Baranchik's Theorem
9.2. Bayes Estimates Beating the Straight Estimate
9.3. Shrinking towards the Mean
9.4. A Random Sample of Means
9.5. When Most of the Means Are Small
9.6. Multivariate Means
9.7. Regression
9.8. Many Means, Unknown Variance
9.9. Variance Components, One Way Analysis of Variance
9.10. Problems
9.11. References

CHAPTER 10 The Multinomial Distribution
10.0. Introduction
10.1. Dirichlet Priors
10.2. Admissibility of Maximum Likelihood, Multinomial Case
10.3. Inadmissibility of Maximum Likelihood, Poisson Case
10.4. Selection of Dirichlet Priors
10.5. Two Stage Poisson Models
10.6. Multinomials with Clusters
10.7. Multinomials with Similarities
10.8. Contingency Tables
10.9. Problems
10.10. References

CHAPTER 11 Asymptotic Normality of Posterior Distributions
11.0. Introduction
11.1. A Crude Demonstration of Asymptotic Normality
11.2. Regularity Conditions for Asymptotic Normality
11.3. Pointwise Asymptotic Normality
11.4. Asymptotic Normality of Martingale Sequences
11.5. Higher Order Approximations to Posterior Densities
11.6. Problems
11.7. References

CHAPTER 12 Robustness of Bayes Methods
12.0. Introduction
12.1. Intervals of Probabilities
12.2. Intervals of Means
12.3. Intervals of Risk
12.4. Posterior Variances
12.5. Intervals of Posterior Probabilities
12.6. Asymptotic Behavior of Posterior Intervals
12.7. Asymptotic Intervals under Asymptotic Normality
12.8. A More General Range of Probabilities
12.9. Problems
12.10. References

CHAPTER 13 Nonparametric Bayes Procedures
13.0. Introduction
13.1. The Dirichlet Process
13.2. The Dirichlet Process on (0, 1)
13.3. Bayes Theorem for a Dirichlet Process
13.4. The Empirical Process
13.5. Subsample Methods
13.6. The Tolerance Process
13.7. Problems
13.8. References

Author Index
Subject Index

CHAPTER 1
Theories of Probability

1.0. Introduction

A theory of probability will be taken to be an axiom system that probabilities must satisfy, together with rules for constructing and interpreting probabilities. A person using the theory will construct some probabilities according to the rules, compute other probabilities according to the axioms, and then interpret these probabilities according to the rules; if the interpretation is unreasonable, perhaps the original construction will be adjusted. To begin with, consider the simple finite axioms in which there are a number of elementary events just one of which must occur, events are unions of elementary events, and the probability of an event is the sum of the nonnegative probabilities of the elementary events contained in it.

There are three types of theory-logical, empirical and subjective. In logical theories, the probability of an event is the rational degree of belief in the event relative to some given evidence. In empirical theories, a probability is a factual statement about the world. In subjective theories, a probability is an individual degree of belief; these theories differ from logical theories in that different individuals are expected to have different probabilities for an event, even when their knowledge is the same.

1.1. Logical Theories: Laplace

The first logical theory is that of Laplace (1814), who defined the probability of an event to be the number of favorable cases divided by the total number of cases possible. Here cases are elementary events; it is necessary to identify equiprobable elementary events in order to apply Laplace's theory. In many
gambling problems, such as tossing a die or drawing from a shuffled deck of cards, we are willing to accept such equiprobability judgments because of the apparent physical indistinguishability of the elementary events-the particular face of the die to fall, or the particular card to be drawn. In other problems, such as the probability of it raining tomorrow, the equiprobable alternatives are not easily seen. Laplace, following Bernoulli (1713), used the principle of insufficient reason, which specifies that probabilities of two events will be equal if we have no reason to believe them different. An early user of this principle was Thomas Bayes (1763), who apologetically postulated that a binomial parameter p was uniformly distributed if nothing were known about it.

The principle of insufficient reason is now rejected because it sets rather too many probabilities equal. Having an unknown p uniformly distributed is different from having an unknown function of p uniformly distributed, yet we are equally ignorant of both. Even in the gambling case, we might set all combinations of throws of n dice to have equal probability, so that the next throw has probability 1/6 of giving an ace no matter what the results of previous throws. Yet the dice will always be a little biased, and we want the next throw to have higher probability of giving an ace if aces appeared with frequency greater than 1/6 in previous throws. Here, it is a consequence of the principle of insufficient reason that the long run frequency of aces will be 1/6, and this prediction may well be violated by the observed frequency. Of course any finite sequence will not offer a strict contradiction, but as a practical matter, if a thousand tosses yielded 1/3 aces, no gambler would be willing to continue paying off aces at 5 to 1. The principle of insufficient reason thus violates the skeptical principle that you can't be sure about the future.
1.2. Logical Theories: Keynes and Jeffreys

Keynes (1921) believed that probability was the rational belief in a proposition justified by knowledge of another proposition. It is not possible to give a numerical value to every such belief, but it is possible to compare some pairs of beliefs. He modified the principle of insufficient reason to a principle of indifference-two alternatives are equally probable if there is no relevant evidence relating to one alternative, unless there is corresponding evidence relating to the other. This still leaves a lot of room for judgment; for example, Keynes asserts that an urn containing n black and white balls in unknown proportion will produce each sequence of white and black balls with equal probability, so that for large n the proportion of white balls is very probably near 1/2. He discusses probabilities arising from analogy, but does not present methods for practical calculation of such probabilities. Keynes's theory does not succeed because it does not provide reasonable rules for computing probabilities, or even for making comparisons between probabilities.

Jeffreys (1939) has the same view of probability as Keynes, but is more constructive in presenting many types of prior distributions appropriate for different statistical problems. He presents an "invariant" prior distribution for a continuous parameter indexing a family of probability distributions, thus escaping one of the objections to the principle of insufficient reason. The invariant distribution is however inconsistent in another sense, in that it may generate conditional distributions that are not consistent with the global distribution. Jeffreys rejects it in certain standard cases. Many of the standard prior probabilities used today are due to Jeffreys, and he has given some general rules for constructing probabilities. He concedes (1939, p. 37) that there may not be an agreed upon probability in some cases, but argues (p.
406) that two people following the same rules should arrive at the same probabilities. However, the many rules stated frequently give contradictory results. The difficulty with Jeffreys's approach is that it is not possible to construct unique probabilities according to the stated rules; it is not possible to infer what Jeffreys means by probability by examining his constructive rules; it is not possible to interpret the results of a Jeffreys calculation.

1.3. Empirical Theories: Von Mises

Let x_1, x_2, ..., x_n, ... denote an infinite sequence of points in a set. Let f(A) be the limiting proportion of points lying in a set A, if that limit exists. Then f satisfies the axioms of finite probability. In frequency theories, probabilities correspond to frequencies in some (perhaps hypothetical) sequence of experiments. For example "the probability of an ace is 1/6" means that if the same die were tossed repeatedly under similar conditions the limiting frequency would be 1/6. Von Mises (1928/1964) declares that the objects under study are not single events but sequences of events. Empirically observed sequences are of course always finite. Some empirically observed sequences show approximate convergence of relative frequencies as the sample size increases, and approximate random order. Von Mises idealizes these properties in an infinite sequence or collective in which each elementary event has a limiting frequency that does not change when it is computed on any subsequence in a certain family. The requirement of invariance is supposed to represent the empirically observed impossibility of constructing a winning betting system. Nontrivial collectives satisfying invariance over all subsequences do not exist, but it is a consequence of the strong law of large numbers that collectives exist that are invariant over any specified countable set of subsequences.
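The strong-law consequence can be illustrated with a small simulation (my sketch, not the author's; the 1/6 "ace" chance and the two selection rules are arbitrary choices): for an independent sequence, the relative frequency computed on the whole sequence, on the even-numbered places, and on the places immediately following an ace all settle near the same limit, so no such betting system wins.

```python
import random

random.seed(1)
n = 200_000
# An i.i.d. stand-in for a collective: 1 marks an "ace" with chance 1/6.
bits = [1 if random.random() < 1/6 else 0 for _ in range(n)]

def freq(seq):
    return sum(seq) / len(seq)

full = freq(bits)                # whole sequence
evens = freq(bits[::2])          # place selection: every second throw
# "bet only after an ace": select the throw following each ace
after_ace = freq([bits[k + 1] for k in range(n - 1) if bits[k] == 1])

print(round(full, 3), round(evens, 3), round(after_ace, 3))
# all three hover near 1/6, about 0.167
```

Selection rules that look at the outcomes themselves (like betting after an ace) still leave the frequency unchanged here, which is exactly the invariance von Mises idealizes.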
Church (1940) suggests selecting subsequences using recursive functions, functions of integer variables for which an algorithm exists that will compute the value of the function for any values of the arguments in finite time on a finite computing machine. There are countably many recursive functions, so the collective exists, although of course it cannot be constructed. Further interesting mathematical developments are due to Kolmogorov (1965), who defines a finite sequence to be random if an algorithm required to compute it is sufficiently complex, in a certain sense; and to Martin-Löf (1966), who establishes the existence of finite and infinite random sequences that satisfy all statistical tests.

How is the von Mises theory to be applied? Presumably to those finite sequences whose empirical properties of convergent relative frequency and approximate randomness suggested the infinite sequence idealization. No rules are given by von Mises for recognizing such sequences, and indeed he criticizes the "erroneous practice of drawing statistical conclusions from short sequences of observations" (p. ix). However the Kolmogorov or Martin-Löf procedures could certainly be used to recognize such sequences.

How does frequency probability help us learn? Take a long finite "random" sequence of 0's and 1's. The frequency of 0's in the first half of the sequence will be close to the frequency of 0's in the second half of the sequence, so that if we know only the first half of the sequence we can predict approximately the frequency of 0's in the second half, provided that we assume the whole sequence is random. The prediction of future frequency is just a tautology based on the assumption of randomness for the whole sequence. It seems necessary to have a definition, or at least some rules, for deciding when a finite sequence is random to apply the von Mises theory.
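The half-to-half prediction can be sketched in a few lines (an illustration of mine; a fair coin generated by Python's random module stands in for a "random" sequence):

```python
import random

random.seed(2)
n = 100_000
seq = [random.randint(0, 1) for _ in range(n)]  # a long "random" 0-1 sequence

first = sum(seq[: n // 2]) / (n // 2)    # frequency of 1's in the first half
second = sum(seq[n // 2 :]) / (n // 2)   # frequency of 1's in the second half

# Knowing only the first half, predict the second-half frequency by the
# first-half frequency; for a long random sequence the two nearly agree.
print(round(first, 4), round(second, 4), round(abs(first - second), 4))
```

The agreement is guaranteed only because the generator makes the whole sequence random; for an arbitrary finite sequence the prediction carries no force, which is the tautology noted above.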
Given such a definition, it is possible to construct a logical probability distribution that will include the von Mises limiting frequencies: define the probability of the sequence x_1, x_2, ..., x_n as lim N_k(x)/N_k as k → ∞, where N_k(x) is the number of random sequences of length k beginning with x_1, x_2, ..., x_n and N_k is the number of random sequences of length k. In this way a probability is defined on events which are unions of finite sequences. A definition of randomness would not be acceptable unless

P[x_{n+1} = 1 | proportion of 1's in x_1, ..., x_n = p_n] - p_n → 0 as n → ∞,

that is, unless the conditional probability of a 1 at the next trial converged to the limiting frequency of 1's. True, definitions of randomness may vary, so that there is no unique solution-but the arbitrariness necessary to define finite randomness for applying frequency theory is the same arbitrariness which occurs in defining prior probabilities in the logical and subjective theories. Asymptotically all theories agree; von Mises discusses only the asymptotic case; to apply a frequency theory to finite sequences, it is necessary to make the same kind of assumptions as Jeffreys makes on prior probabilities.

1.4. Empirical Theories: Kolmogorov

Kolmogorov (1933) formalized probability as measure; he interpreted probability as follows.

(1) There is assumed a complex of conditions C which allows any number of repetitions.
(2) A set of elementary events can occur on establishment of conditions C.
(3) The event A occurs if the elementary event which occurs lies in A.
(4) Under certain conditions, we may assume that the event A is assigned a probability P(A) such that
(a) one can be practically certain that if the complex of conditions C is repeated a large number of times n, then if m be the number of occurrences of event A, the ratio m/n will differ very slightly from P(A).
(b) if P(A) is very small, one can be practically certain that when conditions C are realized only once, the event A would not occur at all.

The axioms of finite probability will follow for P(A), although the axiom of continuity will not. As frequentists must, Kolmogorov is struggling to use Bernoulli's limit theorem for a sequence of independent identically distributed random variables without mentioning the word probability. Thus "the complex of conditions C which allows any number of repetitions": how different must the conditions be between repetitions? Thus "practically certain" instead of "with high probability." Logical and subjective probabilists argue that a larger theory of probability is needed to make precise the rules of application of a frequency theory.

1.5. Empirical Theories: Falsifiable Models

Statisticians in general have followed Kolmogorov's prescription. They freely invent probability models, families of probability distributions that describe the results of an experiment. The models may be falsified by repeating the experiment often and noting that the observed results do not concur with the model; the falsification, using significance tests, is itself subject to uncertainty, which is described in terms of the original probability model. A direct interpretation of probability as frequency appears to need an informal extra theory of probability (matching the circularity in Laplace's equally possible cases), but the "falsifiable model" interpretation appears to avoid the circularity. We propose a probability model, and then reject it, or modify it, if the observed results seem improbable. We are using Kolmogorov's rule (4)(b) that "formally" improbable results are "practically" certain not to happen. If they do happen we doubt the formal probability.

The weaknesses in the model approach:

(1) The repetitions of the experiment are assumed to give independent, identically distributed results.
Otherwise laws of large numbers will not apply. But you can't test that independence without taking some other series of experiments, requiring other assumptions of independence, and requiring other tests. In practice the assumption of independence is usually untested (often producing very poor estimates of empirical frequencies; for example, in predicting how often a complex piece of equipment will break, it is dangerous to assume the various components will break independently). The assumption of independence in the model theory is the analogue of the principle of insufficient reason in logical theories. We assume it unless there is evidence to the contrary, and we rarely collect evidence.

(2) Some parts of the model, such as countable additivity or continuity of a probability density, are not falsifiable by any finite number of observations.

(3) Arbitrary decisions about significance tests must be made; you must decide on an ordering of the possible observations on their degree of denial of the model-perhaps this ordering requires subjective judgment depending on past knowledge.

1.6. Subjective Theories: De Finetti

De Finetti (1930/1937) declares that the degree of probability attributed by an individual to a given event is revealed by the conditions under which he would be disposed to bet on that event. If an individual must bet on all events A which are unions of elementary events, he must bet according to some probability P(A) defined by assigning non-negative probabilities to the elementary events, or else a Dutch book can be made against him-a combination of bets is possible in which he will lose no matter which elementary event occurs. (This is only a little bit like von Mises's principle of the impossibility of a gambling system.) De Finetti calls such a system of bets coherent.

In the subjectivist view, probabilities are associated with an individual.
Savage calls them "personal" probabilities; a person should be coherent, but any particular event may be assigned any probability without questioning from others. You cannot say that "my probability that it will rain this afternoon is .97" is wrong-it reports my willingness to bet at a certain rate. Bayes (1763) defines probability as "the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening." His probability describes how a person ought to bet, not how he does bet. It should be noted that the subjectivist theories insist that a person be coherent in his betting, so that they are not content to let a person bet how he pleases; psychological probability comes from the study of actual betting behavior, and indeed people are consistently incoherent (Wallsten (1974)).

There are numerous objections to the betting approach, some technical (is it feasible?), others philosophical (is it useful?).

(i) People don't wish to offer precise odds-Smith (1961) and others have suggested ranges of probabilities for each event; this is not a very serious objection.

(ii) A bet is a price, subject to market forces-depending on the other actors; Borel (1924) considers the case of a poker player, who by betting high, increases his probability of winning the pot. Can you say to him "your probability of winning the pot is the amount you are willing to bet to win the pot divided by the amount in the pot"? Suppose you are in a room full of knowledgeable meteorologists, and you declare the probability it will rain tomorrow is .95. They all rush at you waving money. Don't you modify the probability? We may not be willing to bet at all if we feel others know more. Why should the presence of others be allowed to affect our probability?
(iii) The utility of money is not linear-you may bet $1 to win $500 when the chance of winning is only 1/1000; the gain of $500 seems more than 500 times the loss of $1. Ramsey (1926) and Savage (1954) advance theories of rational decision making, choosing among a range of available actions, that produce both utilities and probabilities for which the optimal decision is always that decision which maximizes expected utility.

The philosophical objection is that I don't particularly care how you (opinionated and uninformed as you are) wish to bet. To which the subjectivists will answer that subjective judgments are necessary in forming conclusions from observations; let us be explicit about them (Good (1976, p. 143)). To which the empiricists will reply, let us separate the "good" empirically verifiable probabilities, the likelihoods, from the "bad" subjective probabilities which vary from person to person. (Cox and Hinkley (1974, p. 389): "For the initial stages ... the approach is ... inapplicable because it treats information derived from data as on exactly equal footing with probabilities derived from vague and unspecified sources.")

1.7. Subjective Theories: Good

Good (1950) takes a degree of belief in a proposition E given a proposition H and a state of mind of a person M, to be a primitive notion allowing no precise definition. Comparisons are made between degrees of belief; a set of comparisons is called a body of beliefs. A reasonable body of beliefs contains no contradictory comparisons. The usual axioms of probability are assumed to hold for a numerical
Good recommends a number of rules for computing probabilities, including for example the device of imaginary results: consider a number of probability assignments to a certain event; in combination with other fixed probability judgments, each will lead through the axioms to further probability judgments; base your original choice for probabilities on the palatability of the overall probabilities which ensue. If an event of very small probability occurs, he suggests that the body of beliefs be modified. Probability judgments can be sharpened by laying bets at suitable odds, but there is no attempt to define probability in terms of bets. Good (1976, p. 132) states that "since the degrees of belief, concerning events over which he has no control, of a person with ideally good judgment, should surely not depend on whether he uses his beliefs in any specific manner, it seems desirable to have justifications that do not mention preferences or utilities. But utilities necessarily come in whenever the beliefs are to be used in a practical problem involving action." Good takes an attitude, similar to the empirical model theorists, that a probability system proposed is subject to change if errors are discovered through significance testing. In standard probability theory, changes in probability due to data take place according to the rules of conditional probability; in the model theory, some data may invalidate the whole probability system and so force changes not according to the laws of probability. There is no contradiction in following this practice because we separate the formal theory from the rules for its application. 1.8. All the Probabilities An overview of the theories of probability may be taken from the stance of a subjective probabilist, since subjective probability includes all other theories. Let us begin with the assumption that an individual attaches to events numerical probabilities which satisfy the axioms of probability theory. 
If no rules for constructing and interpreting probabilities are given, the probabilities are inapplicable; for all we know the person might be using length or mass or dollars or some other measure instead of probability. Thus the theories of Laplace and Keynes are not practicable for lack of rules to construct probability. Jeffreys provides rules for many situations (although the rules are inconsistent and somewhat arbitrary). Good takes a belief to be a primitive notion; although he gives numerous rules for refining and correcting sets of probabilities, I believe that different persons might give different probabilities under Good's system, on the same knowledge, simply because they make different formalizations of the primitive notion of degree of belief. Such disagreements are accepted in a subjective theory, but it seems undesirable that they are caused by confusion about meanings of probability. For example, if you ask for the probability that it will rain tomorrow afternoon, one person might compute the relative frequency of rain on afternoons in the last month, another might compute the relative amount of today's rain that fell this afternoon; the axioms are satisfied. Are the differences in computation due to differences in beliefs about the world, or due to different interpretations of the word probability?

The obvious interpretation of a probability is as a betting ratio, the amount you bet over the amount you get. There are certainly some complications in this interpretation-if a probability is a price, it will be affected by the market in which the bet is made. But these difficulties are overcome by Savage's treatment of probability and utility in which an individual is asked to choose coherently between actions, and then must do so to maximize expected utility as measured by an implied personal probability and utility.
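The coherence requirement behind this betting interpretation can be sketched numerically (a minimal example of mine, with hypothetical rates): a bettor who quotes rates on "rain" and "no rain" that sum to more than one can be sold a combination of bets that loses no matter what happens, the Dutch book of de Finetti's argument.

```python
# Hypothetical quoted rates: the price per unit stake on a bet paying 1 if
# the event occurs. They are incoherent because they sum to more than 1.
p_rain, p_no_rain = 0.70, 0.50

# An opponent sells the bettor a unit bet on each event. Exactly one of the
# two events occurs, so the bets pay 1 in total, but the bettor paid 1.20.
nets = []
for rain in (True, False):
    payoff = (1 if rain else 0) + (0 if rain else 1)  # total payoff to bettor
    nets.append(payoff - (p_rain + p_no_rain))        # net gain either way

print(nets)  # a sure loss of about 0.20 whichever way it goes: a Dutch book
```

Rates summing to less than one are exploited symmetrically, by buying the bets from the bettor instead; only rates consistent with a probability assignment escape both traps.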
The betting interpretation arises naturally out of the foundations of probability theory as a guide to gamblers, and is not particularly attached to any theory of probability. A logical probabilist, like Bayes, will say that a probability is what you ought to bet. A frequentist will say that a bet is justified only if it would be profitable in the long run-Fisher's evaluation of estimation procedures rests on which would be more profitable in the long run. A subjectivist will say that the probability is the amount you are willing to bet, although he will require coherence among your bets. It is therefore possible to adopt the betting interpretation without being committed to a particular theory of probability.

As Good has said, the frequency theory is neither necessary nor sufficient. Not sufficient because it is applicable to a single type of data. Not necessary because it is neatly contained in logical or subjectivist theories, either through Bernoulli's celebrated law of large numbers which originally generated the frequency theory, or through de Finetti's celebrated convergence of conditional probabilities on exchangeable sequences, which makes it clear what probability judgments are necessary to justify a frequency theory. (A sequence x_1, x_2, ..., x_n, ... is exchangeable if its distribution is invariant under finite permutations of the indices, and then if the x_i have finite second moment, the expected value of x_{n+1} given x_1, ..., x_n and the sample mean (1/n) Σ x_i converge to the same limiting random variable.) Thus the frequency theory gives an approximate value to conditional expectation for data of this type: the sequence of repeated experiments must be judged exchangeable. The frequency theory does not assist with the practical problem of prediction from short sequences. Nor does it apply to other types of data.
For example, we might judge that the series is stationary rather than exchangeable: the assumption is weaker, but limit results still apply under certain conditions. The frequency theory would be practicable if data consisted of long sequences of exchangeable random variables (the judgment of exchangeability being made informally, outside the theory); but too many important problems are not of this type. The model theory of probability uses probability models that are "falsified" if they give very small probability to certain events. The only interpretation of probability required is that events of small probability are assumed "practically certain" not to occur. The advance over the frequency theory is that it is not necessary to explain what repeatable experiments are. The loss is that many probabilities must be assumed in order to compute the probabilities of the falsifying events, and so it is not clear which probabilities are false if one of the events occurs. The interpretation of small probabilities as practically zero is not adequate to give meaning to probability. Consider for example the model that a sample of n observations is independently sampled from the normal: one of the observations is 20 standard deviations from the rest; we might conclude that the real distribution is not normal, or that the sampled observations are not independent (for example, the first (n − 1) observations may be very highly correlated). Thus we cannot empirically test the normality unless we are sure of the independence; and assuming the independence is analogous to assuming exchangeability in de Finetti's theories.
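The "practically certain" reading of small probabilities can be put in numbers. The figures below are illustrative choices of mine: under a model of n independent standard normal observations, an observation 20 standard deviations out gets a probability so small that its occurrence would "falsify" the model; but, as the passage notes, only the joint assumption of normality and independence is tested.

```python
import math

# Illustrative figures (mine, not the text's): under n independent standard
# normal observations, the probability that some observation falls 20 standard
# deviations out is astronomically small, so its occurrence "falsifies" the
# model -- but only normality AND independence jointly have been tested.
n = 100
p_one = math.erfc(20 / math.sqrt(2))   # P(|Z| > 20) for a single observation
p_any = n * p_one                      # union bound over the n observations

print(p_any < 1e-80)                   # → True: "practically certain" not to occur
```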
Finally, the subjective theory of probability is objectionable because probabilities are mere personal opinions. One can give a little advice: the probabilities should cohere, and the set of probabilities should not combine to give unacceptable probabilities; but in the main the theory describes how ideally rational people act rather than recommends how they should act.

1.9. Infinite Axioms

Two questions arise when probabilities are defined on infinite numbers of events. These questions cannot be settled by reference to empirical facts, or by considering interpretations of probability, since in practice we do not deal with infinite numbers of events. Nevertheless it makes a considerable difference in the mathematics which choices are made. In Kolmogorov's axioms, the axiom of countable additivity is assumed. This makes it possible to determine many useful limiting probabilities that would be unavailable if only finite additivity were assumed, but at the cost of limiting the application of probability to a subset of the family of all subsets of the line. Philosophers are reluctant to accept the axiom, but mathematicians are keen to accept it; de Finetti and others have developed a theory of finitely additive probability which differs in exotic ways from the regular theories. He will say "consider the uniform distribution on the line, carried by the rationals"; distribution functions do not determine probability distributions on the line. Here, the axiom of countable additivity is accepted as a mathematical convenience. The second infinite axiom usually accepted is that the total probability should be one. This is inconvenient in Bayes theory because we frequently need uniform distributions on the line; under countable additivity these require that total probability be infinite. Allowing total probability to be infinite does not prevent interpretation in any of the standard theories. Suppose probability is defined on a suitable class of functions X.
Probability judgments may all be expressed in the form PX ≥ 0 for various X. In the frequency theory, given a sequence X_1, X_2, ..., X_n, ..., PX ≥ 0 means that Σ_{i=1}^n X_i ≥ 0 for all large n. In the betting theory, PX ≥ 0 means that you are willing (subjective) or ought (logical) to accept the bet X.

1.10. Probability and Similarity

I think there is probability about 0.05 that there will be a large scale nuclear war between the U.S. and the U.S.S.R. before 2000. By that I certainly don't mean that such nuclear exchanges will occur in about one in twenty of some hypothetical infinite sequence of universes. Nor do I mean that I am willing to bet on nuclear war at nineteen to one odds; I am willing to accept any wager that I don't have to pay off until after the bomb. (I trust the U.S.S.R. targeting committee to have put aside a little something for New Haven, and even if they haven't, bits of New York will soon arrive by air.) What then does the probability 0.05 mean? Put into an urn 100 balls differing only in that 5 are black and 95 are white. Shake well and draw a ball without looking. I mean that the probability of nuclear war is about the same as the probability of getting a black ball (or more precisely, say, war is more probable than drawing a black ball when the urn has 1 black and 99 white balls, and less probable than drawing a black ball when the urn has 10 black and 90 white balls). You might repeat this experiment many times and expect 5% black balls, and you might be willing to bet at 19 to 1 that a black ball will appear, although of course the decision to bet will depend on other things such as your fortune and ethics. To me, the probability .05 is meaningful for the 5 out of 100 balls indistinguishable except for color, without reference to repeating the experiment or willingness to bet. Why should you believe the assessment of .05? I need to offer you the data on which the probability calculation is based.
The superpowers could become engaged in a nuclear war in the following ways.

1. Surprise attack. Such an attack would seem irrational and suicidal; but war between nations has often seemed irrational and suicidal. For example, the Japanese attack on the United States in 1941 had a good chance of resulting in the destruction of the Japanese Empire, as the Japanese planners knew, but they preferred a chancy attack to what they saw as sure slow economic strangulation. Might not the U.S. or the U.S.S.R. attack for similar reasons? If such an attack occurs once in 1000 years, the chance of it occurring in the next twenty is .02 (I will concede that the figure might be off by a factor of 10).

2. Accidental initiation. Many commanders have the physical power to launch an attack, because the strategic systems emphasize fast response times to a surprise attack. Let us say 100 commanders, each commander so stable that he has only one chance in 100,000 of ordering a launch in a given year, and suppose that an isolated launch escalates to a full scale exchange with probability .2. In twenty years, a nuclear war occurs with probability .004.

3. Computer malfunction. Malfunctions causing false alerts have been reported at least twice in the U.S. press in the last twenty years. Let us assume that a really serious malfunction causing a launch is 100 times as rare. In the next twenty years we expect only .02 malfunctions.

4. Third party initiation. Several embattled nations have nuclear capability: Israel (.99 probability), South Africa (.40), India (.60), Pakistan (.40), Libya (.20). Of these Israel is the most threatened and the most dangerous. Who would be surprised by a preemptive nuclear attack by Israel on Libyan nuclear missile sites? Let us say the probability is .01, and the chance of escalation to the superpowers is .2. The overall probability is .002.

Summing the probabilities we get .046, say .05, which I admit may be off by a factor of 10.
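The arithmetic of the four components can be collected in a few lines; the figures are exactly those quoted above.

```python
# The four components of the assessment above, recomputed from the figures
# quoted in the text.
surprise    = 20 / 1000                         # once in 1000 years, over 20 years
accidental  = 100 * (1 / 100000) * 20 * 0.2     # 100 commanders, escalation chance .2
computer    = 2 / 100                           # serious malfunction 100 times as rare
third_party = 0.01 * 0.2                        # Israeli strike times escalation chance

total = surprise + accidental + computer + third_party
print(round(total, 3))                          # → 0.046, rounded to .05 in the text
```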
There is plenty of room for disagreement about the probabilities used in the calculations; and indeed I have committed an apparently circular argument characteristic of probability calculations: I am supposed to be showing how a probability is to be calculated, but I am basing the calculation on other probabilities. How are they to be justified? The component probabilities are empirical, based on occurrences of events similar to the one being assessed. An attack by the U.S. on the U.S.S.R. is analogous to the attack by Japan on the U.S. Dangerously deceptive computer malfunctions have already occurred. Of course the analogies are not very close, because the circumstances of the event considered are not very similar to the circumstances of the analogous events. The event of interest has been expressed as the disjoint union of intersections of "basic" events (I would like to call them atomic events, but the example inhibits me!). Denote a particular intersection by B_1 B_2 ⋯ B_n. The probability P(B_1 B_2 ⋯ B_n) = P(B_1) P(B_2 | B_1) ⋯ P(B_n | B_1 B_2 ⋯ B_{n−1}) is computed as a product of conditional probabilities. The conditional probability P(B_i | B_1 B_2 ⋯ B_{i−1}) is computed from the occurrence of events similar to B_i under conditions similar to B_1 B_2 ⋯ B_{i−1}. We will feel more or less secure in the probability assessment according to the degree of similarity of the past events and conditions to B_i and B_1 B_2 ⋯ B_{i−1}. The probability calculations have an objective, empirical part, in the explicit record of past events, but also a subjective, judgmental part, in the selection of "similar" past events. Separate judgments are necessary in expressing the event of interest in terms of basic events; we will attempt to use basic events for which a reasonable empirical record exists. The future is likely to be like the past. Probability must therefore be a function of the similarities between future and past events.
The similarities will be subjective, but given the similarities a formal objective method should be possible for computing probabilities.

1.11. References

Bayes, T. (1763), An essay towards solving a problem in the doctrine of chances, Phil. Trans. Roy. Soc. 53, 370-418; 54, 296-325; reprinted in Biometrika 45 (1958), 293-315.
Bernoulli, James (1713), Ars Conjectandi.
Borel, E. (1924), Apropos of a treatise on probability, Revue philosophique, reprinted in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. London: John Wiley, 1964, pp. 47-60.
Church, A. (1940), On the concept of a random sequence, Bull. Am. Math. Soc. 46, 130-135.
Cox, D. R. and Hinkley, D. V. (1974), Theoretical Statistics. London: Chapman and Hall.
De Finetti, B. (1937), Foresight: its logical laws, its subjective sources, reprinted in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. London: John Wiley, 1964, pp. 93-158.
Good, I. J. (1950), Probability and the Weighing of Evidence. London: Griffin.
Good, I. J. (1976), The Bayesian influence, or how to sweep subjectivism under the carpet, in Harper and Hooker (eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Dordrecht: Reidel.
Jeffreys, H. (1939), Theory of Probability. London: Oxford University Press.
Keynes, J. M. (1921), A Treatise on Probability. London: MacMillan.
Kolmogorov, A. N. (1950), Foundations of the Theory of Probability. New York: Chelsea. (The German original appeared in 1933.)
Kolmogorov, A. N. (1965), Three approaches to the quantitative definition of information, Problemy Peredachi Informatsii 1, 4-7.
Laplace, P. S. (1814), Essai philosophique sur les probabilités, English translation. New York: Dover.
Martin-Löf, P. (1966), The definition of random sequences, Information and Control 9, 602-619.
Ramsey, F. P. (1926), Truth and probability, reprinted in H. E. Kyburg and H. E.
Smokler (eds.), Studies in Subjective Probability. New York: John Wiley, 1964, pp. 61-92.
Savage, L. J. (1954), The Foundations of Statistics. New York: John Wiley.
Smith, C. A. B. (1961), Consistency in statistical inference and decision, J. Roy. Statist. Soc. B 23, 1-25.
von Mises, R. and Geiringer, H. (1964), The Mathematical Theory of Probability and Statistics. New York: Academic Press.
Wallsten, Thomas S. (1974), The psychological concept of subjective probability: a measurement theoretic view, in C. S. Staël von Holstein (ed.), The Concept of Probability in Psychological Experiments. Boston: Reidel, pp. 49-72.

CHAPTER 2
Axioms

2.0. Notation

The objects of probability will be bets X, Y, ... that have real-valued payoffs X(s), Y(s), ... according to the true state of nature s, where s may be any of the states in a set S. Following de Finetti, events will be identified with bets taking only the values 0 and 1. In particular, the notation {s satisfies certain conditions} will denote the event equal to 1 when s satisfies the conditions, and equal to 0 otherwise. For example, {X ≥ 5} denotes the event equal to 1 when s is such that X(s) ≥ 5, and equal to 0 otherwise. In general, algebraic symbols +, −, ≥, ∨, ∧ will be used rather than set theoretic symbols ∪, ∩, ⊂.

2.1. Probability Axioms

Let S denote a set of outcomes, and let X, Y, ... denote bets on S, real valued functions such that X(s), Y(s), ... are the payoffs on the bets when s occurs. A probability space 𝒳 is a set of bets such that

(1) X, Y ∈ 𝒳 ⇒ aX + bY ∈ 𝒳 for a, b real
(2) X ∈ 𝒳 ⇒ |X| ∈ 𝒳
(3) X ∈ 𝒳 ⇒ X ∧ 1 ∈ 𝒳
(4) |X_n| ≤ X_0 ∈ 𝒳, X_n → X ⇒ X ∈ 𝒳.

A probability P on 𝒳 is a real valued function on 𝒳 that is

LINEAR: P(aX + bY) = aPX + bPY for X, Y ∈ 𝒳 and a, b real
NON-NEGATIVE: P|X| ≥ 0 for X ∈ 𝒳
CONTINUOUS: |X_n| ≤ X_0 ∈ 𝒳, X_n → X ⇒ PX_n → PX.

A unitary probability P is defined on a probability space 𝒳 such that 1 ∈ 𝒳, and satisfies P1 = 1.
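For a finite S the axioms can be verified directly. The following is a minimal sketch of my own, not from the text: with three states and assumed non-negative weights, P given by a weighted sum is linear and non-negative on |X|, and nothing requires the weights to sum to 1, so P need not be unitary.

```python
# A minimal finite-state sketch (my construction, not the text's): with
# S = {0, 1, 2}, bets are real vectors and P is a weighted sum with assumed
# non-negative weights.  The weights need not sum to 1: nothing in the
# definition requires a unitary probability.
S = range(3)
w = [2.0, 0.5, 1.0]                             # assumed weights; P1 = 3.5, not 1

def P(X):                                       # P as a weighted sum over states
    return sum(w[s] * X[s] for s in S)

X, Y = [1.0, -2.0, 0.5], [0.0, 3.0, -1.0]
a, b = 2.0, -1.0

lin = P([a * X[s] + b * Y[s] for s in S]) == a * P(X) + b * P(Y)   # LINEAR
nonneg = P([abs(X[s]) for s in S]) >= 0                            # NON-NEGATIVE
meet = [min(X[s], 1.0) for s in S]                                 # X ∧ 1 is again a bet

print(lin, nonneg, P(meet))                     # → True True 1.5
```

On a finite S the continuity axiom is automatic, since convergence of bets is coordinatewise; the axiom only bites on infinite state spaces.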
A finitely additive probability P is defined on a linear space 𝒳 such that X ∈ 𝒳 ⇒ |X| ∈ 𝒳, 1 ∈ 𝒳, and P is linear and non-negative, but not necessarily continuous. A probability space 𝒳 is complete with respect to P if

(i) X_n ∈ 𝒳, P|X_n − X_m| → 0, X_n → X ⇒ X ∈ 𝒳;
(ii) Y ∈ 𝒳, PY = 0, 0 ≤ X ≤ Y ⇒ X ∈ 𝒳.

The standard definition of probability, set down in Kolmogorov (1933), requires that it be unitary. According to Keynes (1921, p. 155), it was Leibniz who first suggested representing certainty by 1. However, in Bayes theory it is convenient to have distributions with P1 = ∞, such as the uniform distributions over the line and over the integers. Jeffreys allows P1 = ∞, because his methods of generating prior distributions frequently produce such P, but in theoretical work with probabilities he usually assumes P1 = 1. Rényi (1970) handles infinite probabilities using families of conditional probabilities. But there is no formal theory to handle probabilities with P1 = ∞, which are therefore called improper. The measure theory for this case is well developed; see for example Dunford and Schwartz (1964). The betting theory interpretation of probability is straightforward: PX_1/PX_2 is the relative value of bet X_1 to bet X_2; you accept only bets X such that PX ≥ 0. It is true that you may effectively give value ∞ to the constant bet 1; those bets which you wish to compare are infinitely less valuable than 1. It is also possible to make a frequency interpretation of non-unitary probability. Consider for example the uniform distribution over the integers. This would be the limiting frequency probability of a sequence of integers, such as 1 12 123 1234 12345 123456 ..., in which each pair of integers occurs with the same limiting relative frequency. If it is insisted that continuity hold, then total probability is infinite. If it is insisted that total probability is 1, then continuity breaks down and the limiting frequency probability is finitely additive.
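The limiting-frequency reading of this example can be checked on a finite prefix of the sequence; the sequence is from the text, the counting is mine.

```python
# Counting a finite prefix of the sequence 1 12 123 1234 ... from the text:
# the integer i occurs once in every block of length at least i, so any two
# integers occur with the same limiting relative frequency, while the total
# probability (the sum over all integers) is infinite.
N = 2000
seq = [i for k in range(1, N + 1) for i in range(1, k + 1)]

count = {i: 0 for i in (1, 2, 7)}
for s in seq:
    if s in count:
        count[s] += 1

# integer i appears N - i + 1 times among the first N blocks
print(count[1], count[2], count[7])             # → 2000 1999 1994
```

The ratio of any two counts tends to 1 as N grows, which is the sense in which the distribution is uniform over the integers.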
It is not possible to justify either the continuity axiom or probabilities with P1 = ∞ by reference to actual experience, which is necessarily finite. Indeed, de Finetti rejects the continuity axiom on this and other grounds. But the continuity axiom equally cannot be denied by reference to experience, and it is mathematically convenient in permitting unique extension of P defined on some small set of functions to P defined on a larger set of interesting limit functions: we begin by assuming that intervals have probability proportional to their length, and end by stating that the rationals have probability zero. [In contrast, de Finetti (1972) can say: consider the uniform distribution on the real line carried by the rationals, or carried by the irrationals.] We need to invent methods to handle invented concepts such as the set of rationals; the main justification must be mathematical convenience; and the same reasoning applies to non-unitary probabilities: they must be mathematically convenient or they would not be so improperly ubiquitous (see them used by de Finetti, 1970, p. 237).

2.2. Prespaces and Rings

A prespace 𝒜 is a linear space such that X ∈ 𝒜 ⇒ |X| ∈ 𝒜, X ∧ 1 ∈ 𝒜. A limit space L is such that X_n ∈ L, X_0 ∈ L, |X_n| ≤ X_0, X_n → X implies X ∈ L. A probability space is both a prespace and a limit space.

Lemma. The smallest probability space including a prespace 𝒜 is the smallest limit space including 𝒜.

PROOF. Let L be the intersection of all limit spaces containing 𝒜. For each X, let L(X) be the set of functions Y such that d(X, Y): |X|, |Y|, X ∧ 1, Y ∧ 1, aX + bY all lie in L; L(X) is a limit space. If X ∈ 𝒜, then d(X, Y) ⊂ 𝒜 ⊂ L for Y in 𝒜; thus L(X) ⊃ 𝒜, so L(X) ⊃ L. If X ∈ L, then X ∈ L(Y) ⊃ L for each Y in 𝒜, so that Y ∈ L(X); thus L(X) ⊃ 𝒜, so L(X) ⊃ L. If X ∈ L, Y ∈ L, then d(X, Y) ⊂ L, so L is a prespace and therefore a probability space. □

A probability P on a prespace 𝒜 is linear, non-negative and continuous: X_n → 0, |X_n| ≤ X ∈ 𝒜 ⇒ PX_n → 0.

Theorem.
A probability P on a prespace 𝒜 may be uniquely extended to a probability P on a completed probability space including 𝒜.

PROOF. Let 𝒳 consist of functions X for which |X − a_n| ≤ Σ_{i=n}^∞ a_i′ with a_i′ ≥ 0, where a_n, a_i′ ∈ 𝒜 and Σ_{i=n}^∞ Pa_i′ → 0 as n → ∞. Say that the sequence a_n approximates X, and define PX = lim Pa_n. It follows from continuity that the definition is unique, and that P is unchanged for X in 𝒜. If a_n, b_n approximate X, Y, then aa_n + bb_n approximates aX + bY with P(aX + bY) = aPX + bPY; |a_n| approximates |X| with P|X| ≥ 0; and a_n ∧ 1 approximates X ∧ 1. Now suppose X_n ∈ 𝒳, X ∈ 𝒳, |X_n| ≤ X and X_n → Y. We will show that Y ∈ 𝒳 and PY = lim PX_n. First assume X_n ↑ Y. On a suitably chosen subsequence, P|X_{n+1} − X_n| < 2^{−n}. Also |X_{n+1} − X_n| ≤ Σ_{i=1}^∞ a_i^n where a_i^n ∈ 𝒜, a_i^n ≥ 0 and Σ_i Pa_i^n < 2^{−n+1}, since |X_{n+1} − X_n| ∈ 𝒳. Thus |Y − X_n| ≤ Σ_{i≥n} |X_{i+1} − X_i| ≤ Σ_{i≥n} Σ_j a_j^i, where Σ_{i≥n} Σ_j Pa_j^i ≤ 2^{−n+2}. Approximate X_n by a_n with P|X_n − a_n| < 2^{−n+2}. Then Y is approximated by the a_n, and PY = lim Pa_n = lim PX_n. The general result follows using sup_{M≤n≤N} X_n ↑ sup_{n≥M} X_n as N → ∞, and sup_{n≥N} X_n ↓ Y. If X_n ∈ 𝒳, X_n → X and P|X_n − X_m| → 0, a similar argument, first considering monotone convergence, shows that X ∈ 𝒳. If a_n approximates Y and 0 ≤ X ≤ Y, it approximates X, so 𝒳 is complete with respect to P. Suppose P′ is a probability on 𝒳 which agrees with P on 𝒜. Then P′|X − a_n| → 0 if a_n approximates X, so P′X = lim P′a_n = lim Pa_n = PX. Thus P is uniquely defined on 𝒳. □

A subset of S is a function on S taking the values 0 and 1. A family ℱ of subsets of S is a ring if A, B ∈ ℱ ⇒ A ∪ B, A − AB ∈ ℱ. A function P on ℱ is a probability if

(i) P(A + B) = PA + PB if A, B ∈ ℱ, AB = 0
(ii) PA ≥ 0 for A in ℱ
(iii) A_n → 0, A_n ≤ A ∈ ℱ ⇒ PA_n → 0.
(iii)′ A_n ↓ 0 ⇒ PA_n ↓ 0.

[Note that (iii) and (iii)′ are equivalent. Obviously (iii) ⇒ (iii)′. Suppose that (iii)′ holds, and A_n → 0, A_n ≤ A. Define B_n = A_n ∪ A_{n+1} ∪ ⋯ ∪ A_m, with m chosen so that 2^{−n} + PB_n > sup_{m′} P(A_n ∪ A_{n+1} ∪ ⋯ ∪ A_{m′}) ≥ PA_n.
Then

P[B_m − B_mB_n] = P[B_m ∪ B_n − B_n] < 2^{−n} for m > n,
P[B_mB_n − B_mB_nB_{n+1}] ≤ P[B_m − B_mB_{n+1}] ≤ 2^{−(n+1)} for m > n + 1,
P[B_m − ∩_{n≤i≤m} B_i] ≤ 2^{−n+1}.

Since ∩_{n≤i≤m} B_i ↓ 0 as m → ∞, P(∩_{n≤i≤m} B_i) ↓ 0, so lim PB_m ≤ 2^{−n+1} for every n. Since A_m ≤ B_m, lim PA_m ≤ 2^{−n+1} for every n, and PA_m → 0.]

If P is a probability on ℱ it may be uniquely extended to the prespace 𝒜 consisting of the elements Σ_{i=1}^n α_i A_i, where the α_i are real, by P(Σ_{i=1}^n α_i A_i) = Σ_{i=1}^n α_i PA_i. It is easily checked that P is well defined, linear and non-negative on 𝒜. The continuity condition is a little more difficult. Suppose a_n → 0, |a_n| ≤ a. Then |a_n| ≤ λA for some positive λ and some A ∈ ℱ, and |a_n| ≤ εA + {|a_n| ≥ ε}λA. Since {|a_n| ≥ ε}A → 0 and {|a_n| ≥ ε}A ≤ A, P{|a_n| ≥ ε}λA → 0. Thus lim P|a_n| ≤ εPA for every ε > 0, and P|a_n| → 0 as n → ∞. If P is a probability on ℱ it may, by Theorem 2.2, be extended uniquely to the smallest complete probability space 𝒳 including ℱ. It is customary to call P on 𝒳 an expectation or an integral, but we follow de Finetti in identifying sets and functions, probabilities and expectations. If P is a probability on 𝒳, then P defined on the ring ℱ of 0-1 functions in 𝒳 extends uniquely to a complete probability space 𝒳(ℱ) that includes 𝒳. See Loomis (1953). Thus specifying P on ℱ determines it on 𝒳; the function X is approximated by the step functions

Σ_k (k/K) {k/K ≤ X < (k+1)/K}, 1 ≤ |k| ≤ K·2^K,

and

PX = lim_{K→∞} Σ_k (k/K) P{k/K ≤ X < (k+1)/K}.

EXAMPLE. Let ℱ be the set of finite unions of half-open intervals A = ∪(a_j, b_j]. Define PA = Σ|b_j − a_j| if the intervals (a_j, b_j] are disjoint. To check that P is a probability, it is difficult only to prove (iii)′. Assume A_n ↓ 0. Let A_n = ∪_j (a_{jn}, b_{jn}], and if A_n ≤ A, let A be the interval [a, b]. The function A − A_n is a union of half-open intervals which converges to A. Define E = ∪_{j,n} (a_{jn} − ε/2^n, a_{jn} + ε/2^n). Then (A − A_n) ∪ E is an open set, and ∪_n (A − A_n) ∪ E includes [a, b].
From compactness, a finite number of the sets (A − A_n) ∪ E cover [a, b], and since A − A_n increases with n, for some n, (A − A_n) ∪ E ⊃ A. Since E has total length less than ε, A_n must have total length less than ε. Thus PA_n → 0. From the length of intervals on ℱ, we define a probability on a prespace of step functions on intervals; from the prespace we define probabilities on a probability space 𝒳 which includes, for example, all continuous functions vanishing outside a finite interval. This is Lebesgue measure.

2.3. Random Variables

Let P be a probability on 𝒴, a probability space on T, and let 𝒳 be a probability space on S. A random variable X is a function from T to S such that f(X) ∈ 𝒴 for each f in 𝒳. A probability P^X is induced on 𝒳 by P^X f = P[f(X)] for each f in 𝒳. The distribution of X is defined to be P^X. If S is the real line, and 𝒳 is the smallest probability space including the finite intervals, then from 2.2, P^X is determined by the values it gives finite intervals: P{a < X ≤ b} = G(b) − G(a). The distribution function G is right continuous and uniquely determined up to an additive constant. If sup_a P{a < X ≤ b} < ∞, set G(b) = sup_a P{a < X ≤ b}. If P^X is unitary, it will follow that lim_{a→−∞} G(a) = 0, lim_{a→∞} G(a) = 1.

2.4. Probable Bets

Let 𝒳 be a linear space of bets: X, Y ∈ 𝒳 ⇒ aX + bY ∈ 𝒳 for a, b real. Let 𝒫, the probable set, be a cone of bets: X, Y ∈ 𝒫 ⇒ aX + bY ∈ 𝒫 for a, b ≥ 0. A generalized probability P for 𝒫 is a linear functional on 𝒳 (a real valued function on 𝒳 with P(aX + bY) = aPX + bPY) such that PX ≥ 0 for X in 𝒫, and PX > 0 for some X in 𝒫. For Sections 2.4 and 2.5, P will be referred to as a probability.

Theorem. If 𝒫 ≠ 𝒳 and 𝒫 contains an internal point (a point X_0 such that for every X in 𝒳, X + kX_0 ∈ 𝒫 for some k), then a probability P exists for 𝒫. [Following Dunford and Schwartz (1964), p. 412.]
PROOF. Let N = 𝒫 ∩ (−𝒫) be the neutral set of bets, the bets X such that both X and −X are probable. If {𝒫_α} is a chain of probable sets with neutral set N, then ∪𝒫_α is a probable set with neutral set N. From Zorn's lemma, there is a maximal probable set 𝒫_0 containing 𝒫 and having neutral set N. Then 𝒳 = 𝒫_0 ∪ (−𝒫_0), for if X ∉ 𝒫_0 and X ∉ −𝒫_0, the set 𝒫_0(X) = {αX + Y, α ≥ 0, Y ∈ 𝒫_0} is a probable set with neutral set N, and 𝒫_0(X) strictly includes 𝒫_0. The internal point X_0 does not lie in −𝒫, for then X = X + kX_0 + k(−X_0) would lie in 𝒫 for each X, contradicting 𝒫 ≠ 𝒳. Also X_0 is an internal point for 𝒫_0. Define PX = sup{α | X − αX_0 ∈ 𝒫_0}; then −∞ < PX < ∞, since X + kX_0 ∈ 𝒫_0 and −X + k′X_0 ∈ 𝒫_0 for some k, k′. P is a linear functional because X − αX_0 ∈ 𝒫_0 or −𝒫_0 for every X, α; for X ∈ 𝒫 ⊂ 𝒫_0, PX ≥ 0; and PX_0 = 1. Thus P is a probability for 𝒫. □

It is necessary that 𝒫 ≠ 𝒳, for otherwise we cannot separate probable bets from others, and it is necessary to assume an internal point so that one of the probable bets will be comparable to all possible bets. It is usual to take the bets as real values received according to whichever state of nature occurs, but it is not necessary to do so. See Ramsey (1926) and Savage (1954). The space of bets and the probable set may be constructed from a preference ordering among a set of mixed actions as follows. Let 𝒜 be an arbitrary set (not necessarily countable) of actions a_1, a_2, ...; let 𝒜* be the mixed actions Σ_{i=1}^n p_i a_i (perhaps constructed by generating new actions by taking action a_i with chance p_i), where p_i ≥ 0, Σp_i = 1; and let ≥ be a preference between mixed actions such that for 0 ≤ α ≤ 1, a ≥ αa + (1 − α)a ≥ a, and a ≥ b, c ≥ d ⇒ αa + (1 − α)c ≥ αb + (1 − α)d. Construct the space 𝒳 of bets Σ_{i=1}^n x_i a_i where Σx_i = 0, and define the probable set 𝒫 to consist of the bets λ(a − b) where λ ≥ 0 and a ≥ b, a, b ∈ 𝒜*.
A probability P for 𝒫 will now satisfy Pa ≥ Pb whenever a ≥ b, and Pa > Pb for at least one pair a ≥ b. The condition that an internal point exists is equivalent to assuming a pair a_0 ≥ b_0 such that for each a, b, αa + (1 − α)a_0 ≥ αb + (1 − α)b_0 for some α, 0 ≤ α ≤ 1. On the other hand there is no harm in assuming that bets are real valued functions. Assume that 𝒫 ≠ 𝒳, and that an internal point exists. Then there exists a basis {X_α} of 𝒳, with X_α ∈ 𝒫, such that each X in 𝒳 is represented uniquely as Σ c_α X_α where only a finite number of the c_α are non-zero; X then corresponds to the real valued function f, f(α) = c_α. Note that X ∈ 𝒫 whenever f ≥ 0.

2.5. Comparative Probability

In comparative probability, all pairs of events in ℱ are compared by the relation ≤, "is no more probable than":

(i) φ ≤ A for A ∈ ℱ,
(ii) S ≤ φ is not true,
(iii) A ≤ B, C ≤ D implies A + C ≤ B + D if AC = BD = 0.

The statement A ≤ B may be interpreted as offering the bet: pay 1 unit if A occurs to receive 1 unit if B occurs. The family of bets Σ_{i=1}^n α_i(B_i − A_i) for B_i, A_i in ℱ forms a betting space; suppose the statements A_i ≤ B_i are construed as accepting all bets Σ_{i=1}^n α_i(B_i − A_i) for which α_i ≥ 0. It may be possible to make a book against the bets {A_i ≤ B_i}: find a linear combination Σ_{i=1}^n α_i(B_i − A_i) which is negative. Otherwise, for S finite, the set of combinations of ≤ bets is probable, and there exists a probability P on ℱ such that A ≤ B implies P(A) ≤ P(B). An example of such a "beatable" comparative probability is given by Kraft et al. (1959) for a set of 5 elements. See also Scott (1964), who connects "unbeatability" with the existence of a conforming numerical probability, and Fine (1973) for a general discussion and for continuity axioms. The above axioms of comparative probability are unsatisfactory because they may not generate a probable set of bets.
One solution to the problem is to prohibit negative combinations, which is just equivalent to requiring that a certain subset of a betting space, generated by pairs of events, is probable. An alternative approach, followed by Koopman (1940) and Savage (1954), supposes that S may be partitioned into sets of arbitrarily small probability; Koopman requires that for each n there exist a partition into n events of equal probability. Since all pairs of events are comparable, each event has a precise numerical probability determined by comparison with events in increasingly fine partitions, and this numerical probability satisfies the usual finitely additive axioms with P(S) = 1.

2.6. Problems

Exercises (E) are supposed to be easier than problems (P). Probability is used in the sense of Section 2.4.

E1. Let 𝒳 = ℝ², and define 𝒫 = {(x, y) | y + x ≥ 0, y + ½x ≥ 0}. Show that 𝒫 is a probable set, and find all probabilities P such that P(X) ≥ 0 for X in 𝒫.

P1. A bookie offers the following odds for various teams to win a basketball pennant. Knicks: 6/1; Bullets: 2/1; Braves: 2/1; Celtics: 1/1. Odds of 6/1 mean that he receives $1 if the Knicks lose and pays $6 if the Knicks win. Consider the space of bets 𝒳 = {(x_1, x_2, x_3, x_4)} in which the bookie receives x_i if the ith team wins. Show that any probable set including the specified bets will include all bets.

E2. Let 𝒳 = ℝ^k. Let P be a probability on 𝒳 with PX ≥ 0 for X ∈ 𝒳, X ≥ 0. Show that there exist p_1, ..., p_k, p_i ≥ 0, such that P(X) = Σ_{i=1}^k p_i X_i, where X_i denotes the ith co-ordinate of X.

E3. Let 𝒳 consist of linear combinations of bets {s | a < s ≤ b}, a < b. Let 𝒫 consist of non-negative combinations of bets (a, a + 2δ] − (a − δ, a]. Find a probability on 𝒳 which is positive for all nonzero bets in 𝒫.

E4. Let 𝒳 be the real sequences, and let 𝒫 consist of the sequences X with lim Σ_{i=1}^n X_i ≥ 0. Show that if a probability P on (𝒳, 𝒫) is such that X_0 = (1, 1, 1, ...
) has P(X_0) = 1, then the positive sequence X = (1, 1/2, 1/3, ..., 1/n, ...) has P(X) = 0.

E5. Let 𝒳 be the real sequences X = (X_1, X_2, ...) with finitely many non-zero elements, and let 𝒫 = {X | for some i, X_i > 0, X_{i+1} ≥ 0, X_{i+2} ≥ 0, ...} ∪ {0}. If P is a probability on (𝒳, 𝒫), show that P{i} = 0 or ∞ except for one {i}, where {i} is the bet equal to 1 at i and zero elsewhere.

P2. Let S be the real line, and ℱ the ring of unions of half open intervals {a < s ≤ b}, where −∞ ≤ a < b ≤ ∞. Define P((a, b]) = F(b) − F(a) where F is a non-decreasing right continuous function. Show that P is a probability on ℱ, in the sense of Section 2.1.

E6. Let 𝒳 be k-dimensional euclidean space, and let the probable set 𝒫 include all bets X = (X_1, ..., X_k) such that X_i ≥ 0, 1 ≤ i ≤ k. Show that if 𝒫 is not neutral, 𝒫 includes no bet which is uniformly negative.

E7. A bookmaker offers a number of bets X, Y, ... in k-dimensional euclidean space; the bet X = (X_1, ..., X_k) means he receives X_i if i occurs. Show that there is some mixture of the bets on which he always receives a negative payoff, or else there is a probability P which is non-negative for all the bets.

E8. Let 𝒳 be the set of real-valued sequences X = (x_1, x_2, ..., x_n, ...), let (p_1, p_2, ..., p_n, ...) be a fixed sequence, p_i ≥ 0, and let 𝒫 be the sequences X with lim Σ_{i=1}^n p_i x_i ≥ 0. Show that 𝒫 is a probable set, and specify the probability which gives value 1 to (1, 1, ..., 1, ...), and the probability which gives value 1 to (1, 0, ..., 0, ...). Show that the first probability is continuous if and only if Σp_i < ∞, and the second probability is bounded if and only if Σp_i < ∞.

E9. Replace the third axiom of comparative probability by (iii)′: if Σ_{i=1}^n (A_i − B_i) = Σ_{i=1}^m (A_i′ − B_i′), and all A_i ≤ B_i, then at least one A_i′ ≤ B_i′. Then the set of bets Σ_{i=1}^n α_i(B_i − A_i) where A_i ≤ B_i, α_i ≥ 0, is a probable set in the space of real-valued functions on S.

P3.
The axioms of comparative probability are satisfied by the subsets of S = {1, 2, 3, 4, 5} with

∅ < 2 < 3 < 4 < 23 < 24 < 1 < 12 < 34 < 5 < 234 < 13 < 14 < 25 < 123 < 35 < 124,

the remaining sets being ordered by complements (here 23 denotes the set {2, 3}, and so on). Show that no numerical probability conforms to the order. [Kraft et al., 1959.]

E10. Add a fineness axiom to the axioms of comparative probability: (iv) for each n, there exists {A_i} with Σ_{i=1}^n A_i = S, A_iA_j = 0, and A_i ≈ A_j (each pair equally probable) for each i, j. Then there is a unique probability with A ≤ B ⇔ P(A) ≤ P(B), A, B ∈ ℱ. [Koopman, 1940.]

P4. Let a finitely additive probability P be defined on the plane so that P(|x| + |y| > a) = 0 for a > 0; P[x < 0] = P[x = 0] = P[x > 0] = 1/3; P[y < 0] = P[y = 0] = P[y > 0] = 1/3; events determined by x are independent of events determined by y; and

(A) P[x + y = 0, x > 0, y < 0] = P[x + y = 0, x < 0, y > 0] = 1/9.

These conditions determine P uniquely. Show that a different P is determined if (A) is replaced by

P[x + y < 0, x > 0, y < 0] = P[x + y < 0, x < 0, y > 0] = 1/9,

demonstrating that the distribution of x + y is not determined from the distributions of x and y when x and y events are independent.

P5. Let X be a random variable from T, 𝒴 to S, 𝒳, where S is the real line and 𝒳 includes all finite intervals. Show that the P-completion of 𝒴 includes X if Σ_{k=0}^∞ P^X{|s| ≥ k} < ∞.

2.7. References

De Finetti, B. (1970), Theory of Probability, Vol. 1. London: John Wiley.
De Finetti, B. (1972), Theory of Probability, Vol. 2. London: John Wiley.
Dunford, N. and Schwartz, J. T. (1964), Linear Operators, Part 1. New York: John Wiley.
Fine, T. (1973), Theories of Probability, an Examination of Foundations. New York: Academic Press.
Jeffreys, H. (1939), Theory of Probability. London: Oxford University Press.
Keynes, J. M. (1921), A Treatise on Probability. New York: Harper.
Kolmogorov, A. N. (1950), Foundations of the Theory of Probability. New York: Chelsea.
Koopman, B. O. (1940), The bases of probability, Bull. Am. Math. Soc. 46, 763-774.
Kraft, C., Pratt, J. W.,
and Seidenberg, A. (1959), Intuitive probability on finite sets, Ann. Math. Statist. 30, 408–419.
Loomis, L. H. (1953), An Introduction to Abstract Harmonic Analysis. Princeton: Van Nostrand.
Renyi, A. (1970), Probability Theory. New York: American Elsevier.
Ramsey, F. P. (1926), Truth and probability, reprinted in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. New York: John Wiley, 1964, pp. 61–92.
Savage, L. J. (1954), The Foundations of Statistics. New York: John Wiley.
Scott, D. (1964), Measurement structures and linear inequalities, J. Math. Psych. 1, 233–247.

CHAPTER 3

Conditional Probability

3.0. Introduction

Kolmogorov's exquisite formalization of conditional probability in the unitary case (1933) does not readily generalize to non-unitary probabilities. Stone and Dawid (1972) show one type of difficulty with their marginalization paradoxes for improper priors. Consider the case of the uniform distribution over pairs of positive integers {i, j}. The desired conditional distribution of {i, j} given j = j_0 is uniform over i. Following Kolmogorov, the conditional distribution given j should combine with the marginal distribution to return the joint distribution: p(i, j) = p(i | j) p(j). But the event [{i, j}, j = j_0] is not given a probability, so the marginal probabilities p(j_0) are not determined by p(i, j), 1 ≤ i, j < ∞. Correspondingly, the uniform distribution over {i, j} given j = j_0 is equally well represented by p(i | j_0) = k(j_0) for any k(j_0). Thus although these conditional distributions are determined by the joint distribution, the marginal distribution is not. (This is the explanation of the marginalization paradoxes of Stone and Dawid.) It is assumed therefore that the joint distribution, the conditional distribution, and the marginal distribution are specified separately to follow the axioms of conditional probability. In particular the probabilities of the {i, j} and of [{i, j}, j = j_0] are separately specified.
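The indeterminacy can be written out in one line; the display below is an illustrative sketch in the notation above, not part of the original text.

```latex
% Uniform joint distribution on pairs of positive integers: p(i,j) = 1.
% The product rule requires only  p(i \mid j)\, p(j) = p(i,j) = 1,
% so for any positive function k, the choices
\[
  p(i \mid j) = k(j), \qquad p(j) = \frac{1}{k(j)},
\]
% reproduce the same joint distribution: the joint and conditional
% distributions leave the marginal p(j) completely undetermined.
```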
We are declaring that {i, j} has the same probability as {i′, j′}, and in addition that [{i, j}, j = j_0] has the same probability as [{i′, j′}, j = j_0].

3.1. Axioms of Conditional Probability

Let 𝒳, 𝒴, 𝒵, … be probability spaces of functions on S. The conditional probability on 𝒳 given 𝒴 is a function P from 𝒳 to 𝒴 that is

LINEAR: P(Y_1 X_1 + Y_2 X_2) = Y_1 P X_1 + Y_2 P X_2 for X_i ∈ 𝒳, Y_i ∈ 𝒴, Y_i X_i ∈ 𝒳
NON-NEGATIVE: P X ≥ 0 for X ≥ 0, X ∈ 𝒳
CONTINUOUS: P X_n → P X for |X_n| ≤ X_0 ∈ 𝒳, X_n → X
INVARIANT: if Y ∈ 𝒳 ∩ 𝒴, P Y = Y.

A family of conditional probabilities is assumed to satisfy the

PRODUCT RULE: if P^𝒳_𝒴, P^𝒴_𝒵, P^𝒳_𝒵 denote conditional probabilities from 𝒳 to 𝒴, 𝒴 to 𝒵 and 𝒳 to 𝒵 respectively, then P^𝒳_𝒵 = P^𝒴_𝒵 P^𝒳_𝒴.

The conditional probability P is interpreted as a probability on 𝒳 given the results of an experiment which determines the values of all functions in 𝒴. Each result of the experiment will give rise to possibly different values of functions in 𝒴, and possibly different probabilities. The conditional probability P determines these different probabilities for all possible results of the experiment. If P X ∈ 𝒳, then P X may be interpreted as a bet equivalent to X that has known value after the experiment is performed.

The above axioms generalize the axioms of probability. Let 𝒳_1 denote the probability space of constant functions on S, and let 1 denote the constant function. Then P is a probability on 𝒳 if and only if P^𝒳_{𝒳_1}: X → (P X)1 is a conditional probability on 𝒳 given 𝒳_1. Indeed P^𝒳_{𝒳_1} = P^𝒴_{𝒳_1} P^𝒳_𝒴 implies that P^𝒳_𝒴 is determined almost uniquely by P^𝒳_{𝒳_1} and P^𝒴_{𝒳_1}. [Suppose P^𝒳_𝒴 X could have values Y_1 or Y_2; then P^𝒴_{𝒳_1}[Y(Y_1 − Y_2)] = 0 for all Y ∈ 𝒴, so P^𝒴_{𝒳_1}|Y_1 − Y_2| = 0.]

Kolmogorov (1933) defines conditional probability in terms of probability; under certain regularity conditions, there exists a "conditional probability" that satisfies the above axioms except on a subset of S of probability zero.
Here we are following the more traditional scheme of axiomatizing conditional probability rather than defining it in terms of probability.

EXAMPLE 1: Toss a penny twice. Let 𝒳 be the bets {X_HH, X_HT, X_TH, X_TT}, where X_HH means the amount received if two heads occur, and similarly for the other three results. The result of the first toss of the experiment, heads or tails, determines the values of all bets in 𝒴 = {X | X_HH = X_HT, X_TH = X_TT}, bets of the form (X_H, X_H, X_T, X_T). Assuming that tails and heads have probability 1/2 given the result of the first toss, the conditional probability of X = (X_HH, X_HT, X_TH, X_TT) is P^𝒳_𝒴 X = (X_H, X_H, X_T, X_T), where X_H = (1/2)(X_HH + X_HT) and X_T = (1/2)(X_TH + X_TT). Here P^𝒳_𝒴 X is a bet equivalent to X that has known value, either X_H or X_T, once the first toss is known.

Suppose that head on the first toss has probability p, and tail has probability (1 − p). Then

P^𝒳_{𝒳_1} X = P^𝒴_{𝒳_1} P^𝒳_𝒴 X = P^𝒴_{𝒳_1}[X_H, X_H, X_T, X_T] = p X_H + (1 − p) X_T = (1/2)p X_HH + (1/2)p X_HT + (1/2)(1 − p) X_TH + (1/2)(1 − p) X_TT.

The probability on 𝒳 corresponds to giving probability (1/2)p, (1/2)p, (1/2)(1 − p), (1/2)(1 − p) to the four outcomes HH, HT, TH, TT. These probabilities have been developed from conditional probability using the product rule, but in the finite case we could just as well define conditional probability in terms of probability; a separate axiomatization of conditional probability is necessary only in the infinite case.

EXAMPLE 2: Uniform distribution on the square. Let 𝒳 denote the smallest probability space of functions including the continuous functions on the square; let X(u, v) denote the value of X at the point (u, v), 0 ≤ u, v ≤ 1. Let 𝒴 denote the set of functions in 𝒳 depending only on u: Y(u, v) = Y(u, 1) for all v. Define P^𝒳_𝒴 X = ∫ X(u, v) dv. Define P^𝒴_{𝒳_1} Y = ∫ Y(u, v) du. Here P^𝒳_𝒴 X is a bet equivalent to X that is a function of u alone. From the product axiom, P^𝒳_{𝒳_1} X = ∫(∫ X(u, v) dv) du = ∫∫ X(u, v) du dv.
Thus the probability on 𝒳 is just Lebesgue measure, in which the probability of a set is its area. Beginning with P^𝒳_{𝒳_1}, it is possible to construct the conditional probability to satisfy the product axiom almost uniquely: any other solution Q^𝒳_𝒴 satisfies P^𝒴_{𝒳_1}|(P^𝒳_𝒴 − Q^𝒳_𝒴)X| = 0. Tjur (1974) assures uniqueness by requiring continuity of the conditional probability, but then establishing existence is sometimes formidable.

EXAMPLE 3: Uniform distribution on the integers. Let 𝒳 be the space of sequences {x_1, x_2, …, x_n, …} with Σ|x_i| < ∞, and let 𝒴 be the space of sequences {α, β, α, β, …}. Define P^𝒳_𝒴 X = {α(X), β(X), α(X), β(X), …}, where α(X) = 2Σ_{i=1}^∞ x_{2i−1}, β(X) = 2Σ_{i=1}^∞ x_{2i}. Define P^𝒴_{𝒳_1} Y = (1/2)α + (1/2)β. Then P^𝒳_{𝒳_1} X = P^𝒴_{𝒳_1} P^𝒳_𝒴 X = Σ_{i=1}^∞ x_i. Here is a uniform distribution on the evens and the odds, according as Y = (0, 1, 0, 1, 0, 1, …) is one or zero. The conditional distribution is not unitary. The distribution on 𝒴 gives probability 1/2 to the evens, probability 1/2 to the odds; this distribution is unitary. The distribution on 𝒳 is uniform over the integers; it is non-unitary. Note that P^𝒳_𝒴 is not determined by saying that it is uniform given evens and uniform given odds; probabilities given evens must be compared explicitly with probabilities given odds (such comparisons are implicit when probabilities are unitary, since 1 has the same probability under all conditions).

Let the conditioning family 𝒵 be the set of sequences Z with z_{2i} = z_{2i−1}, i = 1, 2, …. Define

P^𝒳_𝒵 X = {(1/2)x_1 + (1/2)x_2, (1/2)x_1 + (1/2)x_2, (1/2)x_3 + (1/2)x_4, (1/2)x_3 + (1/2)x_4, …},  P^𝒵_{𝒳_1} Z = 2Σ_i z_{2i}.

Then again P^𝒳_{𝒳_1} X = Σ x_i. Here the conditional probability is uniform given 1 or 2, or given 3 or 4, or given 5 or 6, …. The conditional probability is unitary; the probability on 𝒵 is non-unitary. In this case, some elements of 𝒵 lie in 𝒳, and for these P^𝒳_𝒵 Z = Z, satisfying invariance.

EXAMPLE 4: Uniform distribution in the plane.
Let 𝒳 be the smallest probability space which includes the continuous functions vanishing outside some square in the plane, −∞ < u, v < ∞. Let 𝒴 be the probability space of such functions depending only on u. Define

P^𝒳_𝒴 X = ∫ X(u, v) dv,  P^𝒴_{𝒳_1} Y = ∫ Y(u, v) du.

Then P^𝒳_{𝒳_1} X = ∫(∫ X(u, v) dv) du = ∫∫ X(u, v) du dv, corresponding to the uniform distribution on the plane. Note that the conditional distribution is not determined by requiring that the distribution be uniform given each u; since the uniform distribution is non-unitary, it is possible to have a conditional distribution which is uniform given each u, a distribution on u which is not uniform, and a distribution over the whole plane which is uniform. The marginal distribution on 𝒴 is not determined by the distribution on 𝒳; the conditional distribution is determined only up to an arbitrary weighting factor depending on u. Given the distribution on 𝒳 and the distribution on 𝒴, the product axiom determines P^𝒳_𝒴 almost uniquely.

3.2. Product Probabilities

For arbitrary S and T define the function subscript S, denoted by W → W_S, by W_S(s)(t) = W(s, t); thus W_S is a function from S to the space of functions on T. Define the subscript T similarly.

Fubini's Theorem. Let P be a probability on 𝒳 on S, let Q be a probability on 𝒴 on T, let X × Y be the function on S × T: (s, t) → X(s)Y(t), and let 𝒳 × 𝒴 be the smallest probability space including all X × Y. Then P × Q W = P Q W_S = Q P W_T is the unique probability on 𝒳 × 𝒴 such that P × Q (X × Y) = P X Q Y. (Note that Q W_S is the function s → Q(W_S(s)).)

PROOF. Let 𝒲_0 be the set of functions W in 𝒳 × 𝒴 such that W_S(s) ∈ 𝒴 for each s ∈ S. Then 𝒲_0 includes all functions X × Y, and is a probability space, so 𝒲_0 = 𝒳 × 𝒴. Again let 𝒲_0 be the set of functions W in 𝒳 × 𝒴 such that Q W_S ∈ 𝒳. Then 𝒲_0 is linear, includes all functions X × Y, and is continuous, but it is not straightforward to show that W ∈ 𝒲_0 ⇒
|W|, W ∧ 1 ∈ 𝒲_0. Let 𝒜(𝒳) be the set of functions Σ_{i=1}^n a_i X_i, let 𝒜(𝒴) be the set of functions Σ_{i=1}^n a_i Y_i, and let 𝒜(𝒳, 𝒴) be the set of functions Σ a_i X_i × Y_i, where the X_i and Y_i are 0–1 functions on S and T. For each X ∈ 𝒳 there is a sequence X_n → X, |X_n| ≤ X, X_n ∈ 𝒜(𝒳). Thus if X_i ∈ 𝒳, Y_i ∈ 𝒴, then |X^1 × Y^1 + X^2 × Y^2| = lim |X^1_n × Y^1_n + X^2_n × Y^2_n|, where X^1_n × Y^1_n + X^2_n × Y^2_n ∈ 𝒜(𝒳, 𝒴) and is bounded by |X^1| ∨ |X^2| × |Y^1| ∨ |Y^2|. Since 𝒜(𝒳, 𝒴) ⊂ 𝒲_0, and 𝒲_0 is continuous, |X^1 × Y^1 + X^2 × Y^2| lies in 𝒲_0. By a similar argument, any finite sequence of operations involving linear combinations or absolute values or ∧1 on the functions X × Y will yield a function in 𝒲_0, so that the prespace including all X × Y is included in 𝒲_0. Since 𝒲_0 is continuous, by Lemma 2.2, 𝒲_0 = 𝒳 × 𝒴.

Define P × Q W = P Q W_S; note that P × Q (X × Y) = P(X Q Y) = P X Q Y. It is easy to show that P × Q is a probability. For example, continuity requires W_n → W, |W_n| ≤ W_0 ⇒ P × Q W_n → P × Q W. For each s, W_{nS}(s) → W_S(s), |W_{nS}(s)| ≤ W_{0S}(s), so Q W_{nS}(s) → Q W_S(s) and |Q W_{nS}| ≤ Q W_{0S}. Therefore P Q W_{nS} → P Q W_S as required. Also P × Q is the only probability on 𝒳 × 𝒴 such that P × Q (X × Y) = P X Q Y, for any W in 𝒳 × 𝒴 may be approximated by a sequence of functions of the form Σ_{n} a_n X_n × Y_n. By symmetry P × Q W = Q P W_T, and the theorem follows. □

3.3. Quotient Probabilities

Let 𝒳, 𝒵 be probability spaces on S, and let Y be a random variable from S, 𝒵 to T, 𝒴. The conditional probability of 𝒳 given Y is defined by (P^𝒳_Y X)(Y) = P^𝒳_{Y^{-1}(𝒴)} X. Thus for each X, P^𝒳_Y X is a function in 𝒴, such that (P^𝒳_Y X)(t) is a probability on 𝒳 for each t. The notation P(X | Y) means (P^𝒳_Y X)(Y), a function in Y^{-1}(𝒴). Suppose that X, Y and X × Y are random variables on U, 𝒰 to S, 𝒳; T, 𝒴; and S × T, 𝒳 × 𝒴, and that there exists a conditional probability on (X × Y)^{-1}(𝒳 × 𝒴) given Y^{-1}(𝒴). The conditional probability of X, Y given Y is (P^{X,Y}_Y W)(Y) = P^{(X×Y)^{-1}(𝒳×𝒴)}_{Y^{-1}(𝒴)} W(X, Y), W ∈ 𝒳 × 𝒴.
Thus for each W in 𝒳 × 𝒴, P^{X,Y}_Y W is a function in 𝒴, such that (P^{X,Y}_Y W)(t) is a probability on 𝒳 × 𝒴 for each t. The notation P[W(X, Y) | Y] means (P^{X,Y}_Y W)(Y) and is useful for indicating that X is summed over while Y is held fixed. The product rule becomes P^{X,Y} = P^Y P^{X,Y}_Y.

A quotient probability P^X_t is a probability on 𝒳 for each t such that g P^X f ∈ 𝒴 for each g ∈ 𝒴, f ∈ 𝒳 (where (g P^X f)(t) = g(t) P^X_t f). A conditional probability P^{X,Y}_Y is defined from a quotient probability by P^{X,Y}_Y (f g) = g P^X f for each g ∈ 𝒴, f ∈ 𝒳; this equation determines P^{X,Y}_Y on 𝒳 × 𝒴. A quotient probability is not a conditional probability because P^X f may not lie in 𝒴 for each f in 𝒳; it is convenient to use quotient probabilities to generate conditional probabilities because it is necessary to specify probabilities on 𝒳 for each t, rather than on 𝒳 × 𝒴 for each t. As before, P[f(X) | Y] means (P^X f)(Y).

The random variables X and Y are independent if P^{X,Y} f g = P^X f P^Y g for f ∈ 𝒳, g ∈ 𝒴, or equivalently, given that P^X_t is defined, if P^X_t = P^X. Similarly, the random variables {X_i} are independent if for any finite subset X_1, …, X_n, P^{X_1,…,X_n} Π f_i = Π P^{X_i} f_i, f_i ∈ 𝒳_i. The random variables X and Y are conditionally independent given Z if P^{X,Y}_Z f g = P^X_Z f P^Y_Z g, f ∈ 𝒳, g ∈ 𝒴, that is, if P^X_{Y,Z} = P^X_Z.

EXAMPLE. Let S be the set of positive integer pairs (i, j), i ≤ j. Let 𝒳 be the probability space of functions X with Σ_{i≤j} |X(i, j)| < ∞. Let 𝒵 be the probability space of real valued functions on S. Let Y be the function Y(i, j) = j from S, 𝒵 to T, 𝒴, where T denotes the positive integers and 𝒴 consists of functions g with Σ |g(j)| < ∞. A conditional probability P^𝒳_Y is given by (P^𝒳_Y X)(j) = Σ_{i≤j} X(i, j). For each j, (P^𝒳_Y X)(j) defines a probability on 𝒳. For each X, P^𝒳_Y X is a function in 𝒴. The function (P^𝒳_Y X)(Y) has value at (i, j): (P^𝒳_Y X)(Y)(i, j) = (P^𝒳_Y X)(j) = Σ_{i′≤j} X(i′, j), defining a conditional probability of 𝒳 given Y^{-1}(𝒴).
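A quick numerical sketch of this example (the function names and sample bets are my own illustration, not from the text):

```python
# Sketch of the example above: S is the set of pairs (i, j) with
# i <= j, Y(i, j) = j, and the conditional probability of a bet X
# given Y = j sums X over the pairs compatible with j:
#   (P X)(j) = sum_{i <= j} X(i, j).

def cond_prob(X, j):
    """(P X)(j) = sum of X(i, j) over 1 <= i <= j."""
    return sum(X(i, j) for i in range(1, j + 1))

# A bet paying 1 on the single pair (2, 3): its conditional value is
# 0 when j = 2 (the pair is incompatible) and 1 when j = 3.
X = lambda i, j: 1 if (i, j) == (2, 3) else 0
print(cond_prob(X, 2), cond_prob(X, 3))   # 0 1

# The bet paying i shows the dependence on j.
X2 = lambda i, j: i
print(cond_prob(X2, 1), cond_prob(X2, 3))  # 1 6
```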
Now let X(i, j) = i, and define 𝒳 and 𝒴 to be the spaces of functions f that take finitely many non-zero values. Then (P^{X,Y}_Y W)(j) = Σ_{i≤j} W(i, j), where W is any real valued function taking finitely many non-zero values, and P^{X,Y}_Y W(Y) = Σ_{i≤Y} W(i, Y), a function in Y^{-1}(𝒴). The quotient probability P^X_j f = Σ_{i≤j} f(i) defines a probability on 𝒳 for each j, such that {g(j) P^X_j f} ∈ 𝒴 for each g in 𝒴: P^{X,Y}_Y (f g)(j) = Σ_{i≤j} f(i) g(j) = g(j) P^X_j f. Here X and Y are not independent because P^X_j f varies with j.

3.4. Marginalization Paradoxes

In Stone and Dawid (1972), and in Dawid, Stone and Zidek (1973), a number of "marginalization paradoxes" are produced using improper priors. See also Sudderth (1980), where it is shown that no marginalization paradox arises with unitary finitely additive priors.

Consider Example 1 of Stone and Dawid (1972). Random variables X and Y are independent exponential given parameters θφ and φ, and θ, φ have density e^{−θ} with respect to Lebesgue measure on the positive quadrant. The joint density of X, Y, θ, φ is e^{−θ} θ φ² exp[−φ(θX + Y)], and the product rule P^{X,Y,θ,φ} = P^{θ,φ} P^{X,Y,θ,φ}_{θ,φ} is satisfied. The conditional density of θ, φ given X, Y is e^{−θ} θ φ² exp[−φ(θX + Y)]/f(X, Y), where f(X, Y) is the density of X, Y: ∫∫ e^{−θ} θ φ² exp[−φ(θX + Y)] dθ dφ. Again the product rule is satisfied. However the conditional density of θ given X, Z, where Z = Y/X, is e^{−θ} θ/((θ + Z)³ f(Z)), which does not depend on X; Stone and Dawid take this to imply that the conditional density of θ given Z is e^{−θ} θ/((θ + Z)³ f(Z)). Similarly the conditional density of Z given θ and φ is θ/(θ + Z)², independent of φ; Stone and Dawid take this to imply that the density of Z given θ is θ/(θ + Z)², which is inconsistent with the conditional density of θ given Z being e^{−θ} θ/((θ + Z)³ f(Z)).
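The claim that the θ-density given (X, Y) depends on the data only through Z = Y/X can be checked numerically; the sketch below is my own illustration, not part of the text (normalizing constants cancel in a ratio of unnormalized densities).

```python
import math

# Unnormalized density of theta given (X, Y) = (x, y) in Stone and
# Dawid's Example 1: integrating phi out of
#   e^{-theta} * theta * phi^2 * exp(-phi*(theta*x + y))
# contributes 2/(theta*x + y)^3, so the theta-density is proportional
# to e^{-theta} * theta / (theta*x + y)^3.
def theta_density_ratio(x, y, t1=1.0, t2=2.0):
    u = lambda t: math.exp(-t) * t / (t * x + y) ** 3
    return u(t1) / u(t2)  # normalization cancels in the ratio

# (1, 2) and (5, 10) share Z = Y/X = 2 and give the same theta-density.
r1 = theta_density_ratio(1.0, 2.0)
r2 = theta_density_ratio(5.0, 10.0)
print(abs(r1 - r2) < 1e-9)
```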
The paradox is caused by the implicit assumption that P^Z_θ = P^Z_{θ,φ} if P^Z_{θ,φ} is independent of φ; this assumption is valid if φ given θ is unitary, for then, letting f be a continuous function of two real variables vanishing outside some square,

P^{Z,φ}_θ[f] = P^φ_θ P^Z_{θ,φ}[f] = P^Z_{θ,φ} f (which is independent of φ).

Thus P^Z_θ f = P^Z_{θ,φ} f. However, if φ given θ is not unitary, P^φ_θ P^Z_{θ,φ} f is not defined, since P^φ_θ is not defined for non-zero functions constant over φ. Instead, P^{Z,φ}_θ[h(φ) f] = P^φ_θ h(φ) P^Z_{θ,φ} f. Thus the joint distribution of φ and Z given θ is a product distribution, but because φ given θ is not unitary it is not possible to determine the marginal distribution of Z given θ. (In the same way, if X, Y is uniformly distributed over the plane we cannot determine that X is uniformly distributed on the line.) In the example, take θ to have density e^{−θ} and φ given θ to be uniform. Then Z, φ given θ has density θ/(θ + Z)²; but Z given θ does not have density θ/(θ + Z)², because it is not valid to integrate over φ.

3.5. Bayes Theorem

A real valued function f on S is a density on 𝒳 if f X ∈ 𝒳 for X ∈ 𝒳. A probability space 𝒳 is σ-finite if there exists X_0 ∈ 𝒳, X_0 > 0; or equivalently, if there exist X_n ∈ 𝒳, X_n ↑ 1 as n → ∞.

Bayes Theorem. Let P be a probability on U, 𝒰. Let X, Y and X × Y be random variables from U, 𝒰 to S, 𝒳; T, 𝒴; and S × T, 𝒳 × 𝒴. Let f be a density on 𝒳 × 𝒴. Let 𝒳 × 𝒴 be σ-finite. Let f_T(t): s → f(s, t) ∈ 𝒳 for each t. Let f/P^X f_T : (s, t) → f(s, t)/P[f(X, t)] be a density on 𝒳 × 𝒴. Let P^Y_X g = Q^Y(g f_S) for some probability Q on 𝒴, each g ∈ 𝒴 [that is, (P^Y_X g)(s) = Q^{Y^{-1}(𝒴)}[g(Y) f(s, Y)]]. Then

P^Y g = Q^Y(g P^X f_T), g ∈ 𝒴,
P^X_Y h = P^X(h f_T)/P^X[f_T] as P^Y, each h ∈ 𝒳 [that is, (P^X_Y h)(t) = P[h(X) f(X, t)]/P f(X, t), except for a set T_0 of t values with P^Y T_0 = 0].

PROOF. Let h_n ∈ 𝒳, h_n ↑ 1, since 𝒳 is σ-finite.
For g ∈ 𝒴, g ≥ 0, h_n g ↑ g; since h_n g ∈ 𝒳 × 𝒴 and g ∈ 𝒴,

P^{X,Y} h_n g = P^{(X,Y)^{-1}(𝒳×𝒴)} h_n(X) g(Y) → P^{Y^{-1}(𝒴)} g(Y) = P^Y g
P^{X,Y} h_n g = P^X P^Y_X h_n g = P^X[h_n Q^Y(g f_S)] → P^X Q^Y(g f_S) = Q^Y P^X(g f_T) = Q^Y(g P^X f_T).

This shows P^Y g = Q^Y(g P^X f_T) for g ≥ 0, and general g follows easily.

Secondly, it is necessary to show that P^X_Y is a quotient probability; from the given definition, P^{X,Y}_Y[g h] = g P^X[h f_T]/P^X[f_T]. Then

P^Y P^{X,Y}_Y g h = P^Y(g P^X h f_T/P^X f_T) = Q^Y(g P^X h f_T), from the first part of the proof,
= P^X Q^Y(g h f_S) = P^X[h P^Y_X g] = P^{X,Y} g h.

Thus P^{X,Y}_Y as defined satisfies the product rule, and for any other conditional probability Q^{X,Y}_Y satisfying the product rule, P^Y|P^{X,Y}_Y W − Q^{X,Y}_Y W| = 0. Thus P^Y g|P^X_Y h − Q^X_Y h| = 0 for any quotient probability Q satisfying the product rule. Since 𝒴 is σ-finite, g may be chosen positive, and so P^Y|P^X_Y h − Q^X_Y h| = 0. □

In terms of densities, we have that Y given X has density f_S with respect to some probability Q^Y; under specified conditions on f and 𝒳, X given Y is a unitary probability with density f_T/P^X f_T with respect to P^X. In the usual terminology, f_S would be the likelihood of Y given X, P^X is the prior distribution of X, and f_T/P^X f_T is the posterior density of X given Y. Frequently P^X has some prior density p with respect to a probability R^X, and then the conditional probability of X given Y has posterior density f_T p/R^X(f_T p) with respect to R^X. Note that P^X and P^Y may not be unitary, but under the conditions of the theorem P^X_Y is unitary.

Renyi (1970) takes unitary conditional probabilities as the basic concept, expressing non-unitary probabilities such as Lebesgue measure by families of such conditional probabilities. It seems simpler to go the other way, to define unitary conditional probabilities from non-unitary probabilities; indeed we allow non-unitary conditional probabilities in general, though our Bayes theorem produces only unitary conditional probabilities.

3.6.
Binomial Conditional Probability

The binomial distribution is defined for n 0–1 random variables X_1, X_2, …, X_n given a parameter p, 0 ≤ p ≤ 1, by

P[X_i = x_i, i = 1, …, n | p] = p^{Σx_i}(1 − p)^{n−Σx_i}.

The random variables X_1, X_2, …, X_n are independent and identically distributed given p, with P(X_i | p) = p. In Bayesian analysis p (as well as the X_i) is taken to be a random variable on some underlying probability space 𝒳, and a probability P is assumed on 𝒳 such that the conditional distribution of X_1, …, X_n given p is binomial. If p is not unitary, the marginal probability of X_1, …, X_n, = P(p^{Σx_i}(1 − p)^{n−Σx_i}), is not defined for all x; thus conditional probability given the observations must be carefully handled. For example, if Pf = ∫[f(p)/p(1 − p)] dp, using Haldane's prior, then P[X_1 = 0] is not defined, and the conditional distribution of p given X_1 = 0 is not uniquely defined.

Assume that p^m(1 − p)^{m′} ∈ 𝒳 whenever m ≥ a, m′ ≥ b. Then define 𝒳^n_{a,b} to be the probability subspace of 𝒳 including all functions of the form

[X_i = x_i, i = 1, …, n′], n′ > n, x_{n′} = 1, Σx_j = a, n′ − Σx_j ≥ b;
[X_i = x_i, i = 1, …, n′], n′ > n, x_{n′} = 0, Σx_j ≥ a, n′ − Σx_j = b;
[X_i = x_i, i = 1, …, n], Σx_i ≥ a, n − Σx_i ≥ b.

Thus 𝒳^n_{a,b} corresponds to the shortest sequences of observations containing at least n observations, at least a successes, and at least b failures. The conditional probability of p given 𝒳^n_{a,b} is

P[f(p) | 𝒳^n_{a,b}] = P[f(p) p^{Σx_i}(1 − p)^{n′−Σx_i}]/P[p^{Σx_i}(1 − p)^{n′−Σx_i}].

It may be verified that P[P(f(p) | 𝒳^n_{a,b})] = P f(p).

Consider, for example, Haldane's prior; here a = 1, b = 1, since ∫[f(p)/p(1 − p)] dp exists whenever f(p) = p^m(1 − p)^{m′}, m ≥ 1, m′ ≥ 1. (Of course a and b could be positive fractions, but this does not change 𝒳^n_{a,b}.) Then 𝒳²_{1,1} is generated from the sequences 001, 0001, 00001, …, 110, 1110, 11110, …, 01, 10.
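Concretely, with Haldane's prior the conditional probability of p given a sequence with at least one success and one failure has kernel p^{Σx_i − 1}(1 − p)^{n − Σx_i − 1}, a Beta(Σx_i, n − Σx_i) distribution; the following sketch (my own numerical illustration, not from the text) integrates the kernel directly.

```python
# Posterior of p under Haldane's prior, density proportional to
# 1/(p(1-p)): for observations with at least one 1 and one 0 the
# posterior kernel is p^(s-1) * (1-p)^(n-s-1), s = sum of the x_i,
# i.e. a Beta(s, n - s) distribution.  Crude Riemann sums suffice.

def haldane_posterior_mean(xs, steps=20000):
    s, n = sum(xs), len(xs)
    assert 1 <= s <= n - 1, "need at least one success and one failure"
    num = den = 0.0
    for k in range(1, steps):
        p = k / steps
        w = p ** (s - 1) * (1 - p) ** (n - s - 1)  # posterior kernel
        num += p * w
        den += w
    return num / den

# For the data 0, 1, 1 the posterior is Beta(2, 1), with mean 2/3 = s/n.
print(abs(haldane_posterior_mean([0, 1, 1]) - 2 / 3) < 1e-3)
```

The posterior mean s/n is a feature of Haldane's prior; an all-0 or all-1 sequence trips the assertion, matching the text's warning that no conditional probability is then available.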
The conditional probability of p given 𝒳²_{1,1} is

P[f(p) | X] = ∫ f(p) p^{Σx_i − 1}(1 − p)^{n′−Σx_i − 1} dp / ∫ p^{Σx_i − 1}(1 − p)^{n′−Σx_i − 1} dp,

which is defined for each of the specified sequences since Σx_i ≥ 1, n′ − Σx_i ≥ 1. If a = 0, b = 0 we would have 𝒳²_{0,0} generated from 00, 01, 10, 11; and the conditional probability of f(p) averages to the probability of f(p) when weighted by the probabilities of 00, 01, 10, 11. In the case a = 1, b = 1 the sequences 00 and 11 do not have defined probabilities, so the average that validates conditional probabilities is not available: 00 is replaced by 001, 0001, …, and 11 is replaced by 110, 1110, …, and an average with valid marginal weights becomes available. In application, we will be able to give conditional probabilities whenever the data sequence has at least one 1 and at least one 0. If the data are of the form all 0's or all 1's, no conditional probability consistent with the axioms is available.

3.7. Problems

Problems (Q) are ones that I find very hard.

E1. Let N denote the positive integers, and let 𝒵 be the space of sequences {z_1, z_2, …, z_n, …} with Σ|z_i| < ∞. Let X and Y be random variables into (N, 𝒵), and let X × Y be a random variable into N × N, with P{X = i, Y = j} = p_{ij}. Define p_{·j} = Σ_i p_{ij}. Let δ_i(k) = {k = i}, δ_i ∈ 𝒵. Then P[δ_i(X) | Y = j] = p_{ij}/p_{·j}.

E2. Let ℝ be the real line, B the space of bounded Lebesgue integrable functions (obtained by completing the probability that values an interval by its length, and accepting bounded functions in that completion). Let X and Y be random variables into (ℝ, B), and let f(x, y) be such that g(y) f(x, y) ∈ B × B for each g ∈ B. Suppose P^{X,Y} W = ∫∫ W(x, y) f(x, y) dx dy. Then P^Y g = ∫∫ g(y) f(x, y) dx dy, and P^X_Y(h)(y) = ∫ h(x) f(x, y) dx / ∫ f(x, y) dx.

P1. Let X be a real valued random variable uniformly distributed, and let the conditional distribution of Y given X give probability 1/2 to X − 1/2 and probability 1/2 to X + 1/2.
Find the joint distribution of X × Y and the conditional distribution of X given Y. (Bayes theorem fails.)

P2. Say X ~ N(μ_0, σ_0²) if X is a real random variable having density exp[−(1/2)(μ_0 − X)²/σ_0²]/(σ_0√(2π)) with respect to Lebesgue measure. Suppose Θ ~ N(μ_0, σ_0²), X | Θ ~ N(Θ, σ²). Find the distribution of Θ given X. If, in addition, Y | X, Θ ~ N(X − Θ, σ²), find Θ | X, Y.

P3. The three observations 1, 3, 7, given θ, are from the normal family (1/√(2π)) exp[−(1/2)(x − θ)²] with probability 1/2, or from the family (1/2) exp[−|x − θ|] with probability 1/2. The prior distribution of θ is uniform. Find the posterior distribution of θ given the observations, and the posterior probability that the normal is the true distribution.

P4. Assume that X | θ ~ N(θ, 1), and that θ ~ N(θ_0, σ_0²). Usually θ_0 and σ_0² are assumed known, but suppose, as an afterthought, you decide that θ_0 ~ N(0, σ_1²). Find the posterior distribution of θ given X.

P5. Let X_1, …, X_n be a sample from the uniform (θ − 1/2, θ + 1/2) given θ. Let the prior probability of θ be uniform. Find the posterior probability of θ | X_1, …, X_n and compute the posterior mean and variance.

Q1. Suppose 𝒳 and 𝒴 are probability spaces on S, and that P: 𝒳 to 𝒴 satisfies the axioms of conditional probability, but that 𝒳 ⊃ 𝒴 is not assumed. When is it possible to extend P to 𝒵 including 𝒳 and 𝒴, so that P: 𝒵 to 𝒴 is a conditional probability?

Q2. If P is a conditional probability on 𝒳 to 𝒴, does there exist a complete conditional probability P′ on 𝒳′ to 𝒴 such that P′ coincides with P on 𝒳 ⊂ 𝒳′?

3.8. References

Dawid, A. P., Stone, M. and Zidek, J. V. (1973), Marginalization paradoxes in Bayesian and statistical inference, J. Roy. Stat. Soc. B 35, 189–223.
Kolmogorov, A. N. (1933), Foundations of the Theory of Probability. New York: Chelsea.
Renyi, A. (1970), Probability Theory. New York: American Elsevier.
Stone, M. and Dawid, A. P.
(1972), Un-Bayesian implications of improper Bayes inference in routine statistical problems, Biometrika 59, 369–373.
Sudderth, W. D. (1980), Finitely additive priors, coherence, and the marginalization paradox, J. Roy. Stat. Soc. B 42, 339–341.
Tjur, T. (1972), On the mathematical foundations of probability. Inst. of Math. Statist., University of Copenhagen.
——(1974), Conditional probability distribution. Inst. of Math. Statist., University of Copenhagen.

CHAPTER 4

Convergence

4.0. Introduction

Notions of convergence, as the amount of information increases, are necessary to check the consequences of probability assignments against empirical fact. For example, if we assume that a penny has probability 1/2 of coming down heads on each toss, and that the different tosses are independent, it follows that the limiting proportion of heads will be 1/2 almost surely. A standard method of evaluating statistical procedures is through their asymptotic properties; partly this is a matter of necessity because the asymptotic behavior is simple. For example, one of the desirable properties of a maximum likelihood estimate is that it converges, in a certain sense under certain regularity conditions, to the unknown parameter value. A famous theorem due to Doob (1949) handles consistency of Bayes procedures: if any estimate converges to the unknown parameter value in probability, then the posterior distribution concentrates on the unknown value as the data increase.

4.1. Convergence Definitions

Let 𝒳 be a probability space on S. A real valued function X on S is measurable if {X > a}, {X < −a} ∈ 𝒳 for each a > 0. Thus X is a random variable into ℝ, B_0, where B_0 is the smallest probability space containing the intervals (a, ∞), (−∞, −a) for each a > 0. The space of measurable functions is a probability space which includes 𝒳. In the following, X, X_1, …, X_n are assumed to be measurable functions
with respect to 𝒳 on S, and it is assumed that there is a probability P on 𝒳 on S.

X_1 = X_2 as P means P(X_1 ≠ X_2) = 0.
X_n → X as P means P{s | X_n(s) ↛ X(s)} = 0; equivalently, for each ε > 0, P{|X_n − X| > ε some n > N} → 0 as N → ∞.
X_n → X in P means P{|X_n − X| > ε} → 0 for each ε > 0.
X_n → X by P means P|X_n − X| → 0 as n → ∞.

If X_n → X by P or X_n → X as P, then X_n → X in P. If X ∈ 𝒳, X_n ∈ 𝒳, X_n → X in P, and sup_n P(|X_n|({|X_n| > A} + {|X_n| < 1/A})) → 0 as A → ∞, then X_n → X by P. The sup condition is a generalization of the notion of uniform integrability, necessary to handle non-unitary P. With some mess, it may be shown that it is sufficient to prove the result for X = 0:

lim P|X_n| ≤ sup_n P(|X_n|({|X_n| > A} + {|X_n| < 1/A})) + lim P[|X_n|{1/A ≤ |X_n| < A}].

Since lim P(|X_n|{1/A ≤ |X_n| < A}) ≤ A lim P{|X_n| ≥ 1/A} = 0 for all A, lim P|X_n| = 0.

X_n → X in D (in distribution) if P[f(X_n)] → P[f(X)] for each bounded continuous f that vanishes near zero. It is not necessary that the X_n and X be defined on the same probability space; the definition involves only the distributions of the X_n and X. If X_n → X in P then X_n → X in D. To show this, for each f there is a fixed ε_0 such that f(x) = 0 for |x| ≤ ε_0, and for each k, δ there is an ε < ε_0/2 depending on (k, δ) such that |f(x) − f(y)| < δ for |x| < k, |x − y| < ε. Thus

|f(x) − f(y)| ≤ 2 sup f ({|x| > k} + {|x − y| > ε}) + δ{|x| ≥ ε_0/2}
lim P|f(X) − f(X_n)| ≤ 2 sup f P{|X| > k} + δ P(|X| ≥ ε_0/2).

Choosing k large and δ small gives P|f(X) − f(X_n)| → 0, so X_n → X in D.

4.2. Mean Convergence of Conditional Probabilities

Theorem. Let 𝒳_n be a sequence of probability spaces contained in the probability space 𝒳, let P be a probability on 𝒳, and let P_n = P^𝒳_{𝒳_n} be the conditional probability of 𝒳 given 𝒳_n. Let X ∈ 𝒳.

(i) If X is mean-approximable by 𝒳_n (that is, P|X_n − X| → 0 for some X_n ∈ 𝒳_n), then X is mean-approximable by the sequence P_n X.
(ii) If X is square-approximable by 𝒳_n (that is, P(X_n − X)² → 0 for some X_n ∈ 𝒳_n), then X is square-approximable by the sequence P_n X.

PROOF. Since 𝒳_n ⊂ 𝒳, P_n X_n = X_n for each X_n in 𝒳_n, so P_n P_n = P_n.

P|X − P_n X| ≤ P|X − X_n| + P|X_n − P_n X|
 ≤ P|X − X_n| + P|P_n(X_n − X)|
 ≤ P|X − X_n| + P P_n|X_n − X| = 2P|X − X_n|.

Thus if X is mean-approximable by 𝒳_n, it is mean-approximable by P_n X.

P(X − X_n)² = P(X − P_n X)² + 2P[(X − P_n X)(P_n X − X_n)] + P(X_n − P_n X)²
 = P(X − P_n X)² + 0 + P(X_n − P_n X)².

Thus if X is square-approximable by 𝒳_n it is square-approximable by P_n X. □

EXAMPLE. Let P be a complete probability on 𝒰 on U. Let X, X_1, …, X_n be random variables on U, 𝒰. Suppose X_1, …, X_n, … are independent and identically distributed given θ; let P^{X_i}_θ = P^X_θ for i = 1, 2, …. Assume that X is a real valued random variable such that X, X² ∈ 𝒰. Set μ = P(X | θ). Since X_1, …, X_n are independent given θ,

P[((1/n)ΣX_i − μ)² | θ] = P[(X − μ)² | θ]/n,
P[(1/n)ΣX_i − μ]² = P[X − μ]²/n.

Note that μ² ∈ 𝒰 because μ²{|μ| ≤ k} ↑ μ² as k ↑ ∞, μ²{|μ| ≤ k} = |μ|(|μ| ∧ k) ∈ 𝒰, and Pμ²{|μ| ≤ k} ≤ PX² for all k. Thus μ is square-approximable by (1/n)ΣX_i, and hence by P[μ | X_1, X_2, …, X_n]. This is a Bayesian adaptation of the law of large numbers, in which the unknown population mean μ is increasingly well approximated by its best estimate from the sample X_1, …, X_n.

4.3. Almost Sure Convergence of Conditional Probabilities

Theorem. Let P be a probability on 𝒳, and let 𝒳_n be an increasing sequence of subspaces of 𝒳. Let P_n = P^𝒳_{𝒳_n}. Let X ∈ 𝒳. If X is mean- or square-approximable by {𝒳_n}, then P_n X → X as P.

PROOF. Let X_n = P_n X. Then P_{n−1} X_n = X_{n−1} by the product law, so X_n is
, IXil ~ e}, i = 1,2, '" , n. n IAi={ sup IXil~e}. AiX; = Ai(X; + 2(Xn - X)Xi + (Xn - P[Ai(X n - X)XJ = PPJAiX/X Ii - X/) XJ] = P[AiXiPi(X Ii - X,)] = 0 P(AiX;) ~ e2P Ai since IXil ~ s when Ai =1= 0 PX; ~ e2PIA i = e2p{ sup IXil ~ e} as required. For the second result, define Bi = {X 1 < e, X 2 < s, ... , Xi ~ s} XnBi = BlX i + Xn - X,) PXnBi = PB;Xi + PPiBi(X n - X,) = PB;Xi ~ePBi P(X,7) ~ P(X"IB) ~ eP(IBJ = eP{ sup Xi ~ e} - P(X';-) ~ eP{ inf X i ;£ - e} plx,,1 ~ eP{ suplXil ~ e}. Turning to the theorem, if X is square-approximable, from 4.2, P(Xn - X)2 ~ 0, and for a suitable subsequence IP(X nr - X)2 < 00. { sup IX i - Xjl ~ 4e} ;£ sup( { i,j?;;N nr?:N P{ supIXi-XJ~4e};£ i,j?:N I sup ",.-1 ~i~nr IX i - X"J ~ e} + {IX - XnJ ~ c}) [P(Xnr-xnr_y/e 2 +P(X-X"l/e 2 ] nr?,N -> 0 as N -> 00, using the lemma and Ip(X nr - xf < 00. lf X is mean approximable, the same argument holds with plX nr - Xl replacing P(Xnr - X)2 and PIX nr - Xnr_tl replacing P(Xnr - Xnr_Y. 0 38 4. Convergence 4.4. Consistency of Posterior Distributions Doob's Theorem. Let f!{n be an increasing sequence of subprobability spaces off!{· Let P be a probability on f!{, let P nbe conditional probabilities off!{ given f!{ n· Let () be a real valued random variable such that {a ;i: () ;i: b} E f!{ for a, b finite, and suppose that () is approximable by f!{ n : X n ~ () in P some X n in f!{ n· Let Cn(a, b) = Pn{a ;i: () ;i: b}. Then Cn(() - e, () + e) ~ 1 as P. PROOF. Doob (1949) proved a version of this theorem for P unitary. See also Schwartz (1965) and Berk(1970). Let a and b not be atoms of () :P{a = ()} = P{b = ()} = o. I{a;i: () ;i:b} {a ;i: X n ;i: b} I ;i: {I a - () I ;i: e} + {I b - () I ;i: e} + {I X n - () I ~ e}. Since P{ IXn - ()I ~ e} ~ 0 as n ~ co, and P{Ja - ()I ;i: e} + P{lb - ()I ;i: e} ~ 0 as e ~ 0, P{ a ;i: () ;i: b} - {a ;i: X n ;i: b} I~ O. From 4.3, Pn{a;i: ();i: b} ~ {a;i: ();i: b} as P. Cn(a, b)(s) ~ {a ;i: ()(s) ;i: b} except for SE A, PA = O. 
Applying this to a countable dense set of intervals with non-atomic endpoints, $C_n(\theta(s) - \varepsilon, \theta(s) + \varepsilon) \to 1$ except for $s \in A$, and $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P$. □

The condition of approximability and the conclusion of consistency may both be expressed conditionally on $\theta$: $\theta$ is approximable by $\mathscr{X}_n$ if $P_\theta[|X_n - \theta| > \varepsilon] \to 0$ except for a set of $\theta$-values of probability zero; and $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P$ implies that $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P_\theta$, except for a set of $\theta$-values of probability zero.

EXAMPLE. Let $X_1, X_2, \dots, X_n$ be a random sample from $N(\theta, 1)$; let $\theta$ be uniformly distributed on the line. Let $\mathscr{X}_n$ denote the probability space generated by $X_1, \dots, X_n$, and note that $|\bar{X}_n - \theta| \to 0$ in probability given $\theta$. The conditional distribution of $\theta$ given $\mathscr{X}_n$ is unitary. Therefore it concentrates on the true value $\theta$ as $n \to \infty$, except for a set of $\theta$-values of probability zero.

4.5. Binomial Case

Let $P$ be a probability on $\mathscr{X}$ on $S$. Let $p$, the binomial parameter, be a real valued random variable, $0 \le p \le 1$, defined on $\mathscr{X}$. Let $X_1, X_2, \dots, X_n$ be $n$ Bernoulli random variables, each taking the values 0 or 1; thus $X_1, X_2, \dots, X_n$ maps $S$ to the space of $n$-tuples $(x_1, x_2, \dots, x_n)$ where $x_i = 0$ or 1. A binomial distribution has $X_1, X_2, \dots, X_n$ independent and identically distributed given $p$:

$$P[X_i = x_i, i = 1, \dots, n \mid p] = p^{\sum x_i}(1 - p)^{n - \sum x_i}.$$

The posterior distribution of $p$ given $X_1, \dots, X_n$ is

$$P_nf = P[f(p) \mid X_1, \dots, X_n] = P\big[f(p)p^{\sum X_i}(1 - p)^{n - \sum X_i}\big]\big/P\big[p^{\sum X_i}(1 - p)^{n - \sum X_i}\big],$$

defined whenever $p^{\sum X_i}(1 - p)^{n - \sum X_i} \in \mathscr{X}$. (Note that the conditioning space $\mathscr{X}_n$ must contain observations of varying length if $P$ is not unitary.)

Theorem. Let $p, X_1, \dots, X_n, \dots$ be random variables on $\mathscr{X}$, and let $X_1, \dots, X_n, \dots$ be Bernoulli given $p$. Assume that $p^m(1 - p)^{m'} \in \mathscr{X}$ for $m \ge a$, $m' \ge b$. Say that $p_0$ is in the support of $P$ if $P[|p - p_0| < \varepsilon] > 0$ for each $\varepsilon > 0$. Then $P_n[|p - p_0| < \varepsilon] \to 1$ a.s. $P_{p_0}$ if and only if $p_0$ is in the support of $P$.

Note that $P_n[|p - p_0| < \varepsilon]$ is a function in $\mathscr{X}_n$. Let $R = \sum X_i$.
PROOF. If $p_0$ is not in the support of $P$, $P[|p - p_0| < \varepsilon] = 0$ for some $\varepsilon > 0$, and so $P_n[|p - p_0| < \varepsilon] = 0$ for all $n$. This establishes the "only if." Now suppose $p_0$ lies in the support of $P$. Assume $0 < p_0 < 1$. Let $f(p_0, p) = p_0\log p + (1 - p_0)\log(1 - p)$. Then $f(p_0, p)$, as a function of $p$, is continuous and has a unique maximum at $p = p_0$. Thus for each small $\delta > 0$, there exists $\Delta > 0$ with

$$f(p_0, p_1) > f(p_0, p_2) + \Delta \quad \text{whenever } |p_0 - p_1| < \delta,\ |p_0 - p_2| > 2\delta.$$

As $n \to \infty$, $R/n \to p_0$ a.s. $P_{p_0}$ from the strong law of large numbers. Thus

$$f(R/n, p_1) > f(R/n, p_2) + \Delta \quad \text{whenever } |p_0 - p_1| < \delta,\ |p_0 - p_2| > 2\delta,$$

for all large $n$, with probability approaching 1 as $n \to \infty$. (Note that the conditioning space may include observation sequences of length greater than $n$, but the inequality holds for all these sequences.)

$$P_n\{|p - p_0| \le \delta\}\big/P_n\{|p - p_0| \ge 2\delta\} = P\big[\{|p - p_0| \le \delta\}\exp[nf(R/n, p)]\big]\big/P\big[\{|p - p_0| \ge 2\delta\}\exp[nf(R/n, p)]\big] \ge e^{n\Delta}P\{|p - p_0| < \delta\}\big/P\{|p - p_0| > 2\delta\} \to \infty \text{ a.s. } P_{p_0}.$$

Thus $P_n\{|p - p_0| \ge 2\delta\} \to 0$ a.s. $P_{p_0}$, as required. □

Remark: The same result generalizes to multinomial distributions, but not to observations carried by a countable number of points, as shown by Freedman (1963).

4.6. Exchangeable Sequences

Let $P$ be a probability on a space $\mathscr{X}$. A sequence of random variables $X_1, X_2, \dots, X_n, \dots$ defined on $\mathscr{X}$ is said to be exchangeable if $X_1, X_2, \dots, X_n$ has the same distribution as $X_{\sigma 1}, X_{\sigma 2}, \dots, X_{\sigma n}$ for each $n$ and each permutation $\sigma$.

De Finetti's Theorem. Let $X_1, X_2, \dots, X_n, \dots$ be an exchangeable sequence of 0-1 random variables on a probability space $\mathscr{X}$ with unitary probability $P$. Then $X_1, X_2, \dots, X_n, \dots$ are conditionally independent and identically distributed given the random variable $p = \lim \sum_{i=1}^{2^n} X_i/2^n$ if the limit exists, $p = 0$ if the limit fails.

PROOF. Let $\bar{X}_n = (1/n)\sum_{i=1}^n X_i$. Then for $n \le m$,

$$P(\bar{X}_n - \bar{X}_m)^2 = [P(X_1^2) - P(X_1X_2)](m - n)/mn,$$

so that $\sum_r P|\bar{X}_{2^{r+1}} - \bar{X}_{2^r}| < \infty$ and

$$P\{\sup_{N \le r \le M} |\bar{X}_{2^{r+1}} - \bar{X}_{2^r}| > \varepsilon\} \le \sum_{r \ge N} P\{|\bar{X}_{2^{r+1}} - \bar{X}_{2^r}| > \varepsilon\} \to 0 \text{ as } N, M \to \infty.$$
Thus $p = \lim \bar{X}_{2^n}$ is defined except for a set $A$ of probability 0; define $p = 0$ on the set $A$. Define a conditional probability on $X_1, X_2, \dots, X_n, \dots$ given $p$ by $P[1|p] = 1$, $P[X_i|p] = p$, and let the $X_1, X_2, \dots, X_n, \dots$ be independent and identically distributed given $p$. The specified probability obviously satisfies the axioms of conditional probability except for possibly axiom 1 and the product axiom. For axiom 1 we need show that for Baire functions $h$, and for functions $g$ in the smallest probability space including $X_1, \dots, X_n, \dots$,

$$P[h(p)g|p] = h(p)P[g|p].$$

Take $h$ to be continuous and bounded, and note that the class of functions $g$ satisfying the identity is a limit space, so that we need only verify that $P[h(p)\prod X_i|p] = h(p)P[\prod X_i|p]$. (The probability space generated by $X_1, X_2, \dots$ is the smallest limit space including $X_{i_1}X_{i_2}\cdots X_{i_n}$ for each sequence $i_1, i_2, \dots, i_n$. If the axiom is satisfied for $X_1, X_2, \dots, X_n$, by symmetry it is satisfied for $X_{i_1}, X_{i_2}, \dots, X_{i_n}$.)

Define $\bar{X}_{2^r, i} = \sum_j X_j\{2^r(i - 1) < j \le 2^ri\}\big/[2^ri - 2^r(i - 1)]$. Then $\bar{X}_{2^r, i} \to p$ as $r \to \infty$ whenever $\bar{X}_{2^r} \to p$ as $r \to \infty$. If $\bar{X}_{2^r} \to p$,

$$P\big[h(p)\textstyle\prod_{i=1}^n X_i\big|p\big] = P\big[\lim h(\bar{X}_{2^r})\textstyle\prod X_i\big|p\big] = P\big[\lim h(\bar{X}_{2^r})\textstyle\prod_{i=1}^n \bar{X}_{2^r, i}\big|p\big] = h(p)p^n = h(p)P\big[\textstyle\prod X_i\big|p\big]$$

by the law of large numbers. If the limit $\bar{X}_{2^r}$ does not exist, then $p = 0$ and $P[h(p)\prod X_i|p] = 0$. The axiom is true for bounded and continuous $h$, and it is true on a limit space of functions, so it is true for Baire functions.

For the product axiom, proceeding as before requires that $PP(\prod X_i|p) = P(\prod X_i)$. Now

$$P\big(\textstyle\prod_{i=1}^n X_i\big) = P\big(\textstyle\prod_{i=1}^n \bar{X}_{2^r, i}\big) = P\big(\lim_r \textstyle\prod \bar{X}_{2^r, i}\big) = P(p^n) = P\big[P\big(\textstyle\prod X_i\big|p\big)\big].$$

Thus the product axiom holds, concluding the proof. □

Notes: This theorem, proved first by de Finetti, has philosophical importance in relating Bayes and frequentist theory. In the frequentist approach, probabilities are based on a sequence of observations $X_1, X_2, \dots, X_n, \dots$ that have limiting frequency $p$.
De Finetti argues that this form of evidence requires the judgement that the $X_1, X_2, \dots, X_n, \dots$ are exchangeable; then the limiting frequency $p$ exists except for a set of probability zero. Thus the frequentist probability statements may be derived from Bayes theory.

The generalization is straightforward for real valued random variables; in this case the conditioning random variables are the functions $\lim \sum_{i=1}^{2^r}\{X_i \le x\}/2^r$ for rational $x$. The "$\prod X_i$" for generating the probability space including $X_1, X_2, X_3, \dots$ are the functions $\prod_{i=1}^n \{X_i \le x_i\}$ where the $x_i$ are rational. See for example Loeve (1955), p. 365.

How can the theorem be generalized to non-unitary probabilities? Consider a special case, corresponding to the prior density proportional to $1/p(1 - p)$. Let $S$ be the set of all 0-1 sequences. Consider the smallest probability space $\mathscr{X}$ on $S$ that includes all functions $X$ depending on a finite number of elements of the sequence $s$, such that $X(s) = 0$ for $s = 0, 0, 0, \dots$ or $s = 1, 1, 1, \dots$. For example the finite sequences containing at least one 1 and one 0 will lie in this space. Let the sequence $x_1, x_2, \dots, x_n$ have probability $1\big/\big[(n - 1)\binom{n-2}{\sum x_i - 1}\big]$ for $0 < \sum x_i < n$. This assignment of probability satisfies the axioms. For example

$$1 = P(01) = P(011) + P(010).$$

The function $\bar{x}_n = (1/n)\sum x_i$ does not lie in the probability space, but the functions $\bar{x}_n - \bar{x}_m$ do, since they give value 0 to $s \equiv 0$ or $s \equiv 1$. Since $P(\bar{x}_n - \bar{x}_m)^2 = \tfrac{1}{2}P(x_1 - x_2)^2|n - m|/mn$, the sequence $\bar{x}_{2^n}$ converges except on a set of probability zero to a function, $p$ say. Let $\mathscr{Y}$ be the probability space generated by the polynomials $p^{k_1}(1 - p)^{k_2}$ where $k_1, k_2 > 0$. Define a conditional probability on $\mathscr{X}$ given $\mathscr{Y}$ by

$$P[x_1, x_2, \dots, x_n|\mathscr{Y}] = p^{\sum x_i}(1 - p)^{n - \sum x_i}.$$

This conditional probability satisfies the product axiom:

$$PP[x_1, x_2, \dots, x_n|\mathscr{Y}] = P\big[p^{\sum x_i}(1 - p)^{n - \sum x_i}\big] = P\big[\lim \bar{x}_{2^N}^{\sum x_i}(1 - \bar{x}_{2^N})^{n - \sum x_i}\big] = P(x_1, x_2, \dots, x_n).$$

And different sequences $x_1, x_2, \dots, x_n$ and $x_{n+1}, x_{n+2}, \dots, x_N$ are independent given $p$.
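The representation above ties the mixing variable $p$ to limiting block averages. The following simulation is my own numerical sketch, not part of the text: an exchangeable 0-1 sequence is generated by drawing $p$ from a uniform prior and then tossing independent coins with success probability $p$; the averages over the first $2^r$ terms settle on the drawn $p$.

```python
import random

# Sketch (illustrative, not from the text): generate an exchangeable 0-1
# sequence by mixing, then watch the block averages over the first 2^r
# terms converge to the mixing variable p.
rng = random.Random(0)

def exchangeable_sequence(n):
    p = rng.random()  # p drawn from a uniform prior
    return p, [1 if rng.random() < p else 0 for _ in range(n)]

p, xs = exchangeable_sequence(1 << 16)
averages = [sum(xs[:1 << r]) / (1 << r) for r in range(8, 17)]
print(p, averages[-1])
```

With $2^{16}$ tosses the final average differs from $p$ by a few parts in a thousand, in line with the variance bound $P(\bar{X}_n - p)^2 = O(1/n)$.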
4.7. Problems

E1. In the binomial case the prior $P$ is carried by the rationals in $(0, 1]$, with $P(r) = \log[1 - 2^{-N}]/[-N]$, where $N$ is the first integer for which $rN$ is an integer. If $p_0$ is the true value of $p$, show that the posterior distribution of $p$ concentrates on $p_0$ as $n \to \infty$.

E2. Two parameters $U, V$ are independent uniforms. A coin is tossed giving heads with probability $|U - V|$. Find the posterior distribution of $U, V$ given $r$ heads in $n$ tosses, and specify its behavior as $n \to \infty$.

P1. In the binomial case, assume $P\{p = p_a\} = t$. Find the asymptotic behavior of $P_n$ as the true probability $p_0$ ranges over $(0, 1)$. Generalize results to $P$ carried by a finite number of points.

P2. Let $p$ be a binomial parameter, and let $\theta = \tfrac{1}{2}p\{0 \le p \le \tfrac{1}{2}\} + (\tfrac{1}{2} + \tfrac{1}{2}p)\{p > \tfrac{1}{2}\}$. Let $P$ be the prior distribution on $\theta$ which is uniform over $(0, \tfrac{1}{4}) \cup (\tfrac{3}{4}, 1)$. Let $P_n$ be the posterior distribution of $\theta$ after $R$ successes in $n$ trials, and consider its distribution as $R$ varies given the true value $\theta_0 = \tfrac{1}{4}$. Show that $P_n$'s distribution converges to a distribution over the two point distributions carried by $\{\theta = \tfrac{1}{4}\}$ and $\{\theta = \tfrac{3}{4}\}$, with $P\{\theta = \tfrac{1}{4}\}$ uniformly distributed between 0 and 1.

P3. In the binomial case, show that for a particular $\varepsilon > 0$, it may happen that $P(p_0 - \varepsilon, p_0 + \varepsilon) > 0$ but $P_n(p_0 - \varepsilon, p_0 + \varepsilon) \to 0$ a.s. $P_{p_0}$.

P4. In the binomial case, let the probability $P$ be defined by

$$P(f) = \int_0^1 f\big/[p(1 - p)]\,dp.$$

Show that for each $p_0$, $0 \le p_0 \le 1$, $P_n(p_0 - \varepsilon, p_0 + \varepsilon) \to 1$ a.s. $P_{p_0}$.

P5. The observation $X$ is Poisson with parameter $\lambda$, where $P\{\lambda = 1 + 1/i\} = 2^{-i}$, $1 \le i < \infty$. Given $\lambda = 1$, specify the asymptotic behavior of the posterior distribution of $\lambda$.

P6. Let $X$ be 0-1 with probability given the parameter $p$, $0 \le p \le 1$,

$$P[X|p] = p + \tfrac{1}{4}\{p = \tfrac{1}{4}\} - \tfrac{1}{4}\{p = \tfrac{3}{4}\}.$$

Let the prior distribution of $p$ be uniform. Show that the posterior distribution given $n$ i.i.d. observations $X_1, X_2, \dots, X_n$ is not consistent for $p = \tfrac{1}{4}$ or $p = \tfrac{3}{4}$.

P7.
Let $P$ be a unitary probability on $\mathscr{X}$. Let $Y$ be a random variable, let $\mathscr{Y}$ be the probability space containing sets of form $\{Y \le a\}$, and let $\mathscr{Y}_n$ be the probability space generated by sets of form $\{Y \le k/2^n\}$, $-\infty < k < \infty$. Let $S_{nk} = \{k/2^n < Y \le (k + 1)/2^n\}$. Show that

$$P_n(X) = \sum_{P(S_{nk}) > 0} S_{nk}P(S_{nk}X)/P(S_{nk})$$

is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}_n$. Suppose that $P_\infty$ is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}$ such that $P_\infty P_n = P_\infty$. Then $P_n(X) \to P_\infty(X)$ in $P$. (In this way, the conditional probability of $\mathscr{X}$ given the random variable $Y$ is approximated by directly computed conditional probabilities given discrete random variables $Y_n$ which approximate $Y$.)

P8. A Markov chain, with a finite number of states, has probability $p_i$ for the initial state to be state $i$, and probability $p_{ij}$ for a transition from state $i$ to state $j$. The initial state and an infinite sequence of transitions is observed. Assume that the prior distribution for $\{p_i\}$ and $\{p_{ij}\}$ is uniform over the sets $0 \le p_i \le 1$, $\sum p_i = 1$; $0 \le p_{ij} \le 1$, $\sum_j p_{ij} = 1$. Specify the limiting posterior distribution of $\{p_i\}$ and $\{p_{ij}\}$.

P9. Let $P$ be a probability on $\mathscr{X}$ such that $X \in \mathscr{X} \Rightarrow X^2 \in \mathscr{X}$. Let $\mathscr{Y}$ be a probability subspace of $\mathscr{X}$ such that $PY^2 = 0 \Rightarrow Y = 0$, and suppose that $\mathscr{Y}$ is complete with respect to the metric $\rho(Y_1, Y_2) = P(Y_1 - Y_2)^2$: if $\rho(Y_n, Y_m) \to 0$, there exists $Y \in \mathscr{Y}$ such that $\rho(Y, Y_n) \to 0$. Define $P(X|\mathscr{Y}) = Y$ if $Y \in \mathscr{Y}$ minimizes $P(X - Y)^2$. Show that $P(X|\mathscr{Y})$ is uniquely defined, and is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}$. (Doob, 1953.)

P10. In the binomial case, if $P$ has an atom at $p_0$, then $P_n\{p_0\} \to 1$ a.s. $P$ if $p_0$ is true.

4.8. References

Berk, Robert H. (1970), Consistency a posteriori, Ann. Math. Statist. 41, 894-906.
Doob, J. L. (1949), Application of the theory of martingales, Colloques Internationaux du Centre National de la Recherche Scientifique, Paris, 23-27.
Doob, J. L. (1953), Stochastic Processes. New York: John Wiley.
Freedman, David (1963), On the asymptotic behaviour of Bayes estimates in the discrete case, Ann. Math. Statist. 34, 1386-1403.
Loeve, Michel (1955), Probability Theory. Princeton: Van Nostrand.
Schwartz, L. (1965), On Bayes procedures, Z. Wahr. 4, 10-26.

CHAPTER 5
Making Probabilities

5.0. Introduction

The essence of Bayes theory is giving probability values to bets. Methods of generating such probabilities are what separate the various theories. If probabilities are personal opinions, then they are determined by asking questions or observing which of a family of bets an individual accepts. There is a small theory for extracting personal probabilities, the elicitation of probabilities, for example Winkler (1967). To discover a person's probabilities for the disjoint events $A_i$, for example, you offer to pay $\log x_i$ if $A_i$ occurs, where $x_i$ is the person's stated probability for the event $A_i$. If the person's "true" probabilities are $p_i$, his expected gain is $\sum p_i\log x_i$, which is maximized when $x_i = p_i$. This elicitation function is not entirely satisfactory, since he may over-estimate the probability of unlikely events to avoid large losses.

There are a number of objective methods of generating probabilities that are sophisticated versions of the principle of insufficient reason: they attempt to give probabilities that correspond to no information, in the hope that any information may be incorporated using Bayes theorem. It is necessary in every case to assume some prespecified probabilities on which to build the "indifferent" probabilities. We have rested the tortoise four square upon the elephant; but what does the elephant stand on? A method of basing probabilities on similarity judgments is proposed.

5.1. Information

If $f$ is a density with respect to $\mathscr{X}$, and $PX = Q(fX)$ each $X$ in $\mathscr{X}$, write $f = dP/dQ$ and $P = Q \cdot f$. The information of $P$ with respect to $Q$ is

$$I(P, Q) = Q\big((dP/dQ)\log(dP/dQ)\big),$$
defined whenever $(dP/dQ)\log(dP/dQ) \in \mathscr{X}$. If $P$ is a discrete distribution over the integers, and $Q$ gives value $\log 2$ to each integer, $-I(P, Q)$ coincides with Shannon's (1948) definition of entropy, which may be interpreted as the average number of bits per observation required to send over a channel a stream of observations from $P$ encoded into binary digits. See Good (1966) for a statistical interpretation. If $P$ and $Q$ are both unitary probabilities, $I(P, Q)$ is defined by Kullback and Leibler (1951). In this case $I(P, Q) \ge 0$, and $I(P, Q) = 0$ when $P = Q$, so $I(P, Q)$ may be interpreted as a measure of distance between $P$ and $Q$. For a given $Q$, it is sometimes useful to find the minimal information $P$ satisfying various constraints; this is put forward as the principle of maximum entropy by Jaynes (1957) and Kullback (1959), but there is no reason to think that such a probability is correct for betting: it is merely the probability closest to $Q$ in a certain way. Of course the minimal $P$ always depends on the underlying probability $Q$. See also Christensen (1981).

Theorem. If there exists a density $f = \exp(\sum_{i=1}^n \lambda_iX_i)$ such that $Q(fX_i) = a_i$, $i = 1, \dots, n$, then $P$ with $dP/dQ = f$ is uniquely of minimal information among all $P$ satisfying the constraints $P(X_i) = a_i$, $i = 1, \dots, n$.

PROOF. $g\log f \le g\log g + f - g$, with equality only if $f = g$. If a density $g$ satisfies the constraints, $Q(gX_i) = a_i$,

$$Q(g\log f) = Q\big[g\textstyle\sum\lambda_iX_i\big] = Q\big[f\textstyle\sum\lambda_iX_i\big] = Q(f\log f).$$

Thus $Q(f\log f) = Q(g\log f) \le Q(g\log g)$, with equality only if $f = g$ a.s. $Q$. Therefore $P$ with $f = dP/dQ$ is of minimal information among all $P$ satisfying the constraints, and no other minimal $P$ exists. □

Note: Since $I(P, Q)$ is convex as a function of $P$, it may be shown that the minimal information $P$ is unique if it exists.
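The exponential-tilt form in the theorem can be checked numerically. The sketch below is my own illustration, with $Q$ and the constraint chosen arbitrarily: $Q$ is uniform on $\{0, \dots, 10\}$ and the single constraint is $P(X) = 3$; the tilting constant $\lambda$ is found by bisection, and a perturbation of $P$ that preserves the constraint is seen to increase $I(P, Q)$.

```python
import math

# Sketch of the minimal-information theorem (illustrative choices of Q and
# constraint).  The minimizer under P(X) = 3 is the exponential tilt
# f = exp(lam*x)/Q(exp(lam*X)).
xs = list(range(11))
q = [1 / 11] * 11

def tilt(lam):
    w = [math.exp(lam * x) / 11 for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(x * pi for x, pi in zip(xs, p))

def info(p):
    # I(P, Q) = Q(f log f) = sum p log(p/q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

lo, hi = -5.0, 5.0
for _ in range(100):          # bisection on lam: mean(tilt(lam)) is increasing
    mid = (lo + hi) / 2
    if mean(tilt(mid)) < 3:
        lo = mid
    else:
        hi = mid
p = tilt((lo + hi) / 2)

# perturb within the constraint set: weights (1, -2, 1) on atoms {0, 1, 2}
# keep both the total mass and the mean fixed
eps = 1e-3
g = p[:]
g[0] += eps
g[1] -= 2 * eps
g[2] += eps
print(info(p), info(g))
```

The information of the perturbed density exceeds that of the tilt, consistent with uniqueness of the minimum and with the convexity noted above.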
Note: It may happen that no minimal information $P$ exists, but that there exists $P_0 = Q \cdot f_0$ such that for each $A$ with $0 < P_0(A) < \infty$, $P_A = Q(f_0A)/P_0(A)$ is the minimal probability $P$ under the additional constraints $PA = 1$, $P(X) = 0$ if $XA = 0$. The $P_A$'s are conditional probabilities given $A$ corresponding to $P_0$, and $P_0$ may reasonably be taken to be the minimal $P$. For example, if $Q$ is uniform on the integers, the optimal $P$ under the constraints that all but a finite number of probabilities are zero and $P1 = 1$ has all non-zero probabilities equal. Thus the overall minimal $P$ should be taken to be uniform on the integers.

5.2. Maximal Learning Probabilities

Let $P$ be a probability on $\mathscr{X}$, and let $X$, $Y$ and $X \times Y$ be random variables defined on $\mathscr{X}$. Suppose that a quotient probability $P_X^Y$ exists, satisfying $P^{X,Y} = P^XP_X^Y$. If $P^{X,Y}$ has density $f$ with respect to $P^X \times P^Y$ (that is, $P^{X,Y}W = P^XP^Y[fW]$ for $W \in \mathscr{S} \times \mathscr{T}$), define the information between $X$ and $Y$ to be

$$P^{X,Y}[\log f] = P^{X,Y}\big[\log\big(dP^{X,Y}/d(P^X \times P^Y)\big)\big].$$

This measure is zero if $X$ and $Y$ are independent ($P_X^Y = P^Y$), and it is non-negative if $P$ is unitary. Define the conditional information of $Y$ given $X$ by $I(Y|X) = I(P_X^Y, P^Y)$; then $P^X[I(Y|X)]$ is the information between $X$ and $Y$. See Lindley (1956).

If $P^X$ is $\sigma$-finite, and $P_X^Y$ has density $h$ with respect to a probability $Q^Y$, $h = dP_X^Y/dQ^Y$, then $P^Y$ has density $P^Xh$ with respect to $Q^Y$, and

$$P\big[\log dP^{X,Y}/d(P^X \times P^Y)\big] = P^X[I(P_X^Y, Q^Y)] - I[P^Y, Q^Y],$$

which may be interpreted as the probable change of information about $Y$ with respect to $Q$, due to learning $X$. If $P_X^Y$ is known, $P^{X,Y}$ is determined by $P^X$; Good (1969), Zellner (1977) and Bernardo (1979) have suggested determining $P^X$ to be a "maximal learning probability," maximizing $P^{X,Y}\log(dP^{X,Y}/d(P^X \times P^Y))$; thus learning $Y$ will maximally increase the information about $X$.

Theorem. Let $P^{X,Y}$ be unitary, with marginal probabilities $P^X$, $P^Y$ and quotient probability $P_X^Y$.
For $P_X^Y$ fixed, if there exists $P^X$ with

(1) $P^X\{I(P_X^Y, P^Y) = c\} = 1$,
(2) $I(P_X^Y, P^Y) \le c$,

then the maximal information between $X$ and $Y$ is $c$, and $Q^X$ produces maximal information if and only if it satisfies (1), (2) and $P^Y = Q^Y$.

PROOF.

$$Q^X\big[I(P_X^Y, Q^Y) - I(P_X^Y, P^Y)\big] = Q^XP_X^Y\big[\log(dP^Y/dQ^Y)\big] = Q^Y\log(dP^Y/dQ^Y) \le 0,$$

with equality only if $P^Y = Q^Y$. Thus $Q^XI[P_X^Y, Q^Y] \le c = P^X[I(P_X^Y, P^Y)]$, with equality only if $P^Y = Q^Y$ and if (1) and (2) are satisfied by $Q$. □

EXAMPLE. Let $Y = 0$ or 1, let $X$ be a binomial parameter $0 \le X \le 1$, so that $P_XY = X$, $P^Y = PX$.

$$I[P_X^Y, P^Y] = (1 - X)\log\frac{1 - X}{1 - PX} + X\log\frac{X}{PX}.$$

If $P^X\{0\} = P^X\{1\} = 1/2$, then $I(P_X^Y, P^Y) = \ln 2$ at $X = 0, 1$ and

$$I[P_X^Y, P^Y] = (1 - X)\log(1 - X) + X\log X + \ln 2 \le \ln 2 \quad \text{for all } X.$$

Thus $P^X$ is a maximal learning probability, and it is unique. For two binomial observations, the maximal learning probability is $P^X\{0\} = 15/34$, $P^X\{1/2\} = 4/34$, $P^X\{1\} = 15/34$; empirical evidence suggests that the maximal learning probability is carried by $(n + 1)$ atoms for $n$ observations and converges weakly to the Jeffreys distribution, with $\sin^{-1}\sqrt{X}$ uniform.
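The one-observation example can be checked directly. The sketch below is my own numerical illustration of the formulas above: with $P^X$ putting mass $1/2$ at 0 and 1, so that $P^Y = PX = 1/2$, the conditional information attains its maximum $\ln 2$ only at the endpoints.

```python
import math

# Sketch of the one-observation example: conditional information
# I(X) = (1-X) log((1-X)/(1-PY)) + X log(X/PY) with PY = PX = 1/2.
PY = 0.5

def cond_info(x):
    def term(a, b):
        # a log(a/b), with the usual convention 0 log 0 = 0
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(1 - x, 1 - PY) + term(x, PY)

ln2 = math.log(2)
vals = [cond_info(i / 100) for i in range(101)]
print(max(vals), ln2)
```

On the grid the maximum is $\ln 2$, attained at $X = 0$ and $X = 1$, and the information vanishes at $X = 1/2$, where learning $Y$ teaches nothing about $X$.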
If pY is relatively invariant under r, and is relatively invariant under a, r, it follows from the product rule that pX. Y is relatively invariant under a, r-that is paX. tY = kp x , Y. Conversely, if pY is relatively invariant under rand pX. Y is relatively invariant under a, r, and if is defined, then is relatively invariant under a, r as pY. More precisely, for each fefl', gel!!!, pY(lp;f - kP;; fig) = O. Invariances are used to generate prior probabilities as follows. Suppose that X is an observation, Y is an unknown parameter. A model specifies the quotient probability and transformations a and r are found such that is a, r-invariant. It is now assumed that the same invariance applies to the posterior distribution of Y given X; this will occur if the prior distribution is r-invariant. Thus each invariance found on the model induces a constraint on the prior; conceivably, we might find so many invariances on the model that no prior satisfies them all! Here, we are arguing by analogy that observing a X given r Y is similar to observing X given Y, since the probability models are the same-therefore conclusions about rY given aX should correspond to conclusions about Y given X. See Hartigan (1964), Stone (1970), and also Fraser (1968) for a non-Bayesian theory of inference using invariances. P; P; P; P; P;, EXAMPLE. Let X and Y be real valued random variables with (P;f)(t) = Jf(s)h(s - t)ds some density h. Equivalently, X - Y has density h given Y with respect to lebesgue measure. Let a X = X + c, r Y = Y + c. 48 5. Making Probabilities Then paXf = pXrY fa = pX(fa)T rY Y 1 for any a, T . P;;f(t) = P:Ua)(t - c) = JI(s + c)h(s - t + c)ds = P:fU). P:, Thus a, L is an invariant transformation for and so pY is required to be invariant under T. Thus pY has density P, pet) = e)", with respect to lebesgue measure. The posterior density for P: is eA'f(s - t)/ Se)·'f(s - t)dt with respect to lebesgue measure-and pi is T, a-invariant. 5.4. 
The Jeffreys Density

Let a distance $\rho$ be a non-negative function on $S \times S$, and assume $\{s \mid \rho(t, s) \le c\}$ lies in $\mathscr{S}$ on $S$. A local $\rho$-probability on $\mathscr{S}$ is a probability $P$ such that

$$\lim_{c \downarrow 0} P\{\rho(t_1, s) \le c\}\big/P\{\rho(t_2, s) \le c\} = 1$$

for each $t_1, t_2$ in $S$. Such a probability gives approximately equal value to small spheres.

Jeffreys (1946) considered a number of measures of distance between two probabilities. Write $f = dP/dQ$ if $P$ has density $f$ with respect to $Q$.

Information distance: $I(P, Q) = P(\log dP/dQ)$.
Root or Hellinger distance: $r(P, Q) = Q[(dP/dQ)^{1/2} - 1]^2$.
Absolute distance: $a(P, Q) = Q|dP/dQ - 1| = \sup_{|X| \le 1}(PX - QX)$.

Since $(u^{1/2} - 1)^2 \le |u - 1|$, $r(P, Q) \le a(P, Q)$. Using Schwartz's inequality, for $P$ and $Q$ unitary, $a(P, Q) \le 2r(P, Q)^{1/2}$. See Pitman (1979) for applications of $r(P, Q)$ to non-Bayesian inference.

Lemma. Let $\{P_n\}$ and $P$ be unitary. Then $r(P_n, P) \to 0$ as $n \to \infty$ if and only if $dP_n/dP \to 1$ in $P$.

PROOF. Let $f_n = dP_n/dP$. Then $r(P_n, P) \to 0 \Rightarrow P[f_n^{1/2} - 1]^2 \to 0 \Rightarrow f_n^{1/2} \to 1$ in $P \Rightarrow f_n \to 1$ in $P$. Conversely,

$f_n \to 1$ in $P \Rightarrow f_n^{1/2} \to 1$ in $P$
$\Rightarrow |f_n^{1/2} - 1|\{|f_n^{1/2} - 1| < \varepsilon\} \to 0$ in $P$
$\Rightarrow P|f_n^{1/2} - 1|\{|f_n^{1/2} - 1| < \varepsilon\} \to 0$
$\Rightarrow Pf_n^{1/2}\{|f_n^{1/2} - 1| < \varepsilon\} \to 1$ since $P\{|f_n^{1/2} - 1| < \varepsilon\} \to 1$,

$$P(f_n^{1/2} - 1)^2 = 2 - 2Pf_n^{1/2} \le 2 - 2Pf_n^{1/2}\{|f_n^{1/2} - 1| < \varepsilon\} \to 0. \qquad \square$$

Theorem. Let $\mathscr{P}$ be a family of unitary probabilities $P_t$ on $\mathscr{X}$, indexed by $T$, a compact subset of $R^p$ such that the interior of $T$ is dense in $T$. Assume

(1) $P_t$ has density $f_t$ with respect to some probability $\mu$ on $\mathscr{X}$.
(2) $f_t = f_s$ a.s. $\mu \Rightarrow s = t$.
(3) for all $t$, there exists a vector derivative $(\partial/\partial t)f_t^{1/2}$ such that

$$h(s, t) = \big[f_s^{1/2} - f_t^{1/2} - (s - t)'(\partial/\partial t)f_t^{1/2}\big]\big/|s - t| \to 0 \text{ as } |s - t| \to 0,$$

where $|s - t|$ is euclidean distance.
(4) For $|s - t| < \delta_t$, $\delta_t > 0$, $h(s, t) < Z_t$ where $\mu(Z_t^2) < \infty$.

The probability $J$ that has density with respect to lebesgue measure on $T$ equal to the determinant

$$j(t) = \big|\mu\big[(\partial/\partial t)f_t^{1/2}\big((\partial/\partial t)f_t^{1/2}\big)'\big]\big|^{1/2}$$

is a local $\rho$-probability, where $\rho$ is the Hellinger distance, provided $j(t)$ is continuous and non-zero in $T$.
PROOF. Fix $t$;

$$r(P_s, P_t) = \mu(f_s^{1/2} - f_t^{1/2})^2.$$

As $r(P_s, P_t) \to 0$, $f_s/f_t \to 1$ in $P_t$; let $s_0$ be a limit point of the sequence of $s$ values (by compactness, $s_0$ exists). From (3), $f_s \to f_{s_0}$. Therefore $f_{s_0}/f_t = 1$ in $P_t$, which implies $s_0 = t$ from (2). Thus $r(P_s, P_t) \to 0$ if and only if $s \to t$, so that the set of $s$ values with $r(P_s, P_t) < \varepsilon$ may be found in a neighbourhood of $t$. From (3),

$$r(P_s, P_t) = (s - t)'\mu\big[(\partial/\partial t)f_t^{1/2}\big((\partial/\partial t)f_t^{1/2}\big)'\big](s - t) + o(|s - t|^2).$$

The sphere $r(P_s, P_t) \le \varepsilon$ corresponds to an approximate ellipsoid in $T$, $(s - t)'I_t(s - t) \le \varepsilon$, of volume $K\varepsilon^{p/2}|I_t|^{-1/2}$ where $j(t) = |I_t|^{1/2}$. The probability of $r(P_s, P_t) \le \varepsilon$, with the specified density $j(t)$, is $K\varepsilon^{p/2}[1 + o(1)]$ since $j$ is continuous and positive. The density $j(t)$ thus generates a local $\rho$-probability. □

The Jeffreys density was put forward simultaneously by Jeffreys (1946) and by Perks (1947). Perks considered confidence regions for $t$ which may be constructed from a sequence of independent observations each distributed as $P_t$. The confidence region in the neighbourhood of $t$ has volume asymptotically proportional to $j_t^{-1}$, under certain regularity conditions similar to those in the theorem. Thus if $t_0$ is the true value of $t$, we will have a confidence region closely concentrated near $t_0$ if $j_t$ is large; Perks places density $j_t$ on $t$ to represent this expectation. A more explicit confidence justification is given by Welch and Peers (1966), for the case where $T$ is the real line: after $n$ observations from $\mathscr{X}$, the conditional probability of $t$ given the $n$ observations is $P_n$; choose $t_{n,\alpha}$ so that $P_n(t < t_{n,\alpha}) = \alpha$; under regularity conditions the confidence size of the interval estimate $\{t < t_{n,\alpha}\}$ is $P_t(t < t_{n,\alpha}) = \alpha + O(n^{-1/2})$, but for Jeffreys' density $P_t(t < t_{n,\alpha}) = \alpha + O(n^{-1})$. Thus the Jeffreys density gives Bayes one-sided intervals which are more nearly confidence intervals than the intervals for any other prior.
It should be noted that the same justification does not hold for two-sided intervals, Hartigan (1966).

The Jeffreys density is also obtained from maximal learning probabilities (Bernardo, 1979): suppose that $n$ independent observations with probability $P_t$ generate the space $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$, and let $\mathscr{Y}$ be the Baire functions on $T$. As $n \to \infty$, the information between $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$ and $\mathscr{Y}$, denoted $I_n$, satisfies, under regularity conditions,

$$I_n - \tfrac{1}{2}\log n \to -I(P^{\mathscr{Y}}, J) + K.$$

Thus the maximal learning probability for the asymptotic information between $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$ and $\mathscr{Y}$ is the Jeffreys probability $J$.

The Jeffreys probability is induced on the indexing set $T$ by the family of probabilities $\mathscr{P} = \{P_t, t \in T\}$; it will provide the same probability on $\mathscr{P}$ regardless of the particular set $T$ used to index $\mathscr{P}$. The topology on $T$ is induced by the Hellinger distance on $\mathscr{P}$. The probability on $T$ is unchanged if Jeffreys' probability is computed using a number of observations from $\mathscr{X}$ rather than a single observation. These properties are also possessed by the family of densities $p_a(t)$ with

$$\frac{\partial}{\partial t}\log p_a(t) = \Big\{P\Big(\frac{\partial}{\partial t}\log f\,\frac{\partial^2}{\partial t^2}\log f\Big) + aP\Big(\Big(\frac{\partial}{\partial t}\log f\Big)^3\Big)\Big\}\Big/P\Big(\Big(\frac{\partial}{\partial t}\log f\Big)^2\Big)$$

for $t$ one dimensional, Hartigan (1965). This family gives the Jeffreys probability when $a = 1/2$, and often generates commonly accepted prior densities with suitable choice of $a$. Perhaps there is an interpretation in differential geometry.

If a subset of $\mathscr{P}$ is considered, $\mathscr{P}' = \{P_t, t \in T'\}$, where $T'$ is a compact set in $R^k$ whose interior is dense in $T'$, then the Jeffreys probability on $T'$ may be obtained by conditioning the probability on $T$ to $T'$, provided $J(T') > 0$. If however the indexing set $T$ is partitioned into a family $\{T_\alpha\}$ of indexing sets of lower dimensionality, the Jeffreys probabilities on each of the $T_\alpha$ might not be conditional probabilities from Jeffreys' probability on $T$; the Jeffreys construction is not consistent with the combination of conditional probabilities.

5.5.
Similarity Probability

Let $X$, $Y$ and $X \times Y$ be random variables from some probability space into $(S, \mathscr{S})$, $(T, \mathscr{T})$ and $(S \times T, \mathscr{S} \times \mathscr{T})$. If $P^{X,Y}$ has density $l$ with respect to $P^X \times P^Y$ (that is, $P^{X,Y}W = P^XP^Y[lW]$ for $W$ in $\mathscr{S} \times \mathscr{T}$), call $l$, a real valued function on $S \times T$, the likeness or similarity between $S$ and $T$. The random variable $Y$ describes a number of possible outcomes in the past, the random variable $X$ describes outcomes in the future, and $l$ specifies similarities between pairs of these outcomes. We propose that $l$ be specified subjectively to correspond to perceived similarities, and that the probabilities $P^{X,Y}$, $P^X$ and $P^Y$ be determined from $l$ (Hartigan, 1971). In the notation of 5.2, $l = dP_Y^X/dP^X$, the density of the quotient probability $P_Y^X$ relative to the marginal probability $P^X$. If $X$ and $Y$ are both discrete,

$$l(x, y) = P(X = x, Y = y)\big/[P(X = x)P(Y = y)].$$

If $P^X$ and $P^Y$ have densities with respect to some probabilities $Q^X$ and $Q^Y$,

$$l(x, y) = p^{X,Y}(x, y)\big/[p^X(x)p^Y(y)],$$

where $p^{X,Y}$, $p^X$ and $p^Y$ are densities of $P^{X,Y}$, $P^X$ and $P^Y$.

EXAMPLE 1: Selecting from a deck of cards. A deck of cards is composed of 52 cardboard rectangles of apparently identical dimensions, one side of the rectangles being distinguished by different markings, the other side marked the same for all cards. The deck is shuffled with the uniform side showing. What is the probability that the top card is the ace of spades?

Let $X$ denote the top card. Let $Y$ denote past knowledge about this deck of cards, observations of the shuffling process, and any other information. For $x$ one of the expected 52 cards, and for $y$ past knowledge that does not refer to a particular card, take $l(x, y)$ to be constant. Thus

$$P(X = x, Y = y) = cP(X = x)P(Y = y), \qquad P(X = x|Y = y) = cP(X = x).$$

Now consider the event that the top card is either $x$, one of the 52, or a card with a picture of a rabbit (the children have been at the cards again). Call this event $\{X = x \text{ or } R\}$.
Then

$$P[X = x \text{ or } R|Y = y] = c'P[X = x \text{ or } R], \quad c' \ne c,$$
$$P[X = R|Y = y] = c'P[X = R] + (c' - c)P[X = x].$$

Thus $P(X = x|Y = y)$ and $P(X = x)$ are the same for all $x$. I do not feel happy about the rabbit, but some event of different similarity is necessary to show that all probabilities are equal. Note that $P[X = x|Y = y]$ is the same for all $x$ only if the knowledge $y$ contains nothing to distinguish the cards.

This looks like the principle of insufficient reason, but it is not subject to partitioning paradoxes. Consider for example the rotatable cards: 21 cards that look different when rotated through 180°. Distinguish between the two versions of these cards when selecting the top card, so that there are 73 possible results. There are now a number of different similarities.

$l(x, y)$: rotatable $x$ to a typical $y$
$l(z, y)$: non-rotatable $z$ to a typical $y$
$l(x \text{ or } R, y)$: rotatable $x$ or rabbit to a typical $y$
$l(z \text{ or } R, y)$: non-rotatable $z$ or rabbit to a typical $y$
$l(x' \text{ or } x, y)$: either version of a rotatable $x$ to a typical $y$.

Assume that $l(x' \text{ or } x, y) = l(z, y)$ but that the other three similarities are different from each other and from $l(z, y)$. Then $P(x|y) = P(x'|y)$ and $P(x \text{ or } x'|y) = P(z|y)$, so that the probabilities are 1/52 for non-rotatable cards and 1/104 for rotatable cards.

EXAMPLE 2: Uniform on the integers. It is not possible to present realistic examples of infinite sample spaces in a bounded universe, but such sample spaces have proved to be mathematically convenient. Who would give up Poisson and normal distributions? Let $X$ be a random variable taking integer values. Let $Y$ be past knowledge. Suppose $y$ is such that no integer for $X$ is preferred, and take $l(x, y)$ to be the same for all $x$. Again we need some outside event $E$ such that $l(x \text{ or } E, y)$ is the same for all $x$, and different to $l(x, y)$. Then $P[X = x|Y]$ is the same for all $x$, and the distribution on the integers is uniform.
The uniform distribution on the line may be handled similarly; it will require that $l(\{x_1 \le X \le x_2\}, y)$ depend only on the length of the interval $\{x_1 \le X \le x_2\}$. Nothing much is happening, just the transfer of equal similarity perceptions to equal probability statements.

EXAMPLE 3: A sequence of coin tosses. Let $X_1, X_2, X_3, \dots$ denote the sequence of heads or tails in tossing a coin. Let $Y$ be past knowledge about this and other coins and other things. If $x = x_1, x_2, \dots, x_n$ is a particular sequence of $n$ tosses, let $x' = x_{\sigma 1}, x_{\sigma 2}, \dots, x_{\sigma n}$ denote a permutation of $x$. Suppose $l(x, y) = l(x', y)$ for all permutations $x'$, and suppose

$$l(x_1, y) \ne l(x_1 \text{ or } x_2, y) \text{ for } x_2 \text{ not a permutation of } x_1, \qquad l(x_1 \text{ or } x_2, y) = l(x_1' \text{ or } x_2', y).$$

Then $P(x|y) = P(x'|y)$ and the sequence $X_1, X_2, \dots$ is exchangeable. The probability distribution for $X_1, X_2, \dots$ is then independent Bernoulli given $p = \lim(\sum_1^n X_i/n)$, which exists almost surely. Frequency theory assumes no more than this; thus the small probability assumptions of frequency theory may be derived from equal similarities of permuted sequences to given knowledge.

To have a full probability model for a sequence of coin tosses, it is necessary to specify in addition the prior distribution of $p$, that is to specify the similarities of various $p$ values to given knowledge. Assume that only a finite number of $p$ values are possible, say $p_1, p_2, \dots, p_N$. Then

$$\frac{P(p_i|y)}{P(p_j|y)} = \Big[\frac{l(p_i \text{ or } p_j, y)}{l(p_j, y)} - 1\Big]\Big/\Big[1 - \frac{l(p_i \text{ or } p_j, y)}{l(p_i, y)}\Big].$$

More generally,

$$P(p \in A) = P_y(p \in A)\big/l(p \in A, y) = P_y\big[\{p \in A\}/l(p, y)\big].$$

If the distribution of $p$ given $y$ has a density $f_y(p)$ with respect to lebesgue measure, differentiation of this formula gives

$$\frac{d}{dp_0}\Big[\int_0^{p_0} f_y(p)\,dp\Big/l(p \le p_0, y)\Big] = \frac{f_y(p_0)}{l(p_0, y)}.$$

For example, if $l(p \le p_0, y) = p_0$ and $l(p_0, y) = 2p_0$ for $0 \le p_0 \le 1/2$, then $f_y(p_0) = cp_0$ for $0 \le p_0 \le 1/2$. We might, by symmetrical similarity judgments, require that $f_y(p)$ be symmetrical about $p = 1/2$.
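The posterior-ratio identity above can be checked on made-up numbers. In the sketch below, my own illustration with arbitrary prior and posterior masses, the similarities $l(p_i, y)$ and $l(p_i \text{ or } p_j, y)$ are computed from assumed masses, and the displayed formula recovers $P(p_i|y)/P(p_j|y)$.

```python
# Sketch of the similarity identity (illustrative masses only):
# l(p_i, y) = a_i/b_i and l(p_i or p_j, y) = (a_i + a_j)/(b_i + b_j)
# recover the posterior ratio a_i/a_j.
b = [0.2, 0.3, 0.5]   # prior masses P(p_i), assumed for illustration
a = [0.1, 0.6, 0.3]   # posterior masses P(p_i | y), assumed for illustration

def l(i):
    return a[i] / b[i]

def l_or(i, j):
    return (a[i] + a[j]) / (b[i] + b[j])

def ratio(i, j):
    L = l_or(i, j)
    return (L / l(j) - 1) / (1 - L / l(i))

print(ratio(0, 1), a[0] / a[1])
```

The formula is valid whenever $l(p_i, y) \ne l(p_j, y)$, so that the denominators do not vanish; that holds for the masses chosen here.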
I might be charged with replacing mystical methods of determining priors by mystical methods of specifying similarities. I am not proposing formal methods of determining similarities. They are subjective judgments relating expected events to past knowledge; they may come only as comparative judgments (this P value is more similar than that); even such comparative judgments may usefully constrain the conditional distribution given y.

5.6. Problems

E1. Find the minimum information probability with respect to lebesgue measure on the plane, with means, variances and covariance fixed.

E2. Let X > 0 have minimum information with respect to lebesgue measure subject to P(X) = 1, and let Y > 0 have minimum information with respect to lebesgue measure subject to P(Y²) = 1. Show that X and Y² have different distributions, illustrating lack of invariance of minimum information probabilities.

E3. Let X and Y be two integer variables with fixed marginal distributions. Find the minimum information joint distribution of X and Y with respect to uniform probability on pairs of integers (i, j), −∞ < i < ∞, −∞ < j < ∞.

P1. A person ranks N candy bars, which have delectability coefficients d₁, d₂, …, d_N such that the ith bar is preferred to the jth bar with probability dᵢ/(dᵢ + dⱼ). Find the minimum information probability for the complete ranking of the candy bars, with the probabilities that i is ranked above j as given, with respect to uniform probability over permutations of the candy bars.

P2. Let X₁, …, Xₙ be n discrete variables with known pairwise distributions. Find the minimum information probability for the joint distribution of X₁, …, Xₙ with respect to counting measure on atoms.

Q1. Show that a minimal information P with respect to Q may exist, satisfying n constraints PXᵢ = aᵢ, but not satisfying dP/dQ = exp(λ₀ + Σᵢ₌₁ⁿ λᵢXᵢ) for any λ₀, λᵢ.

P3. Let 𝒴, 𝒳₁, …, 𝒳ₙ, … be probability subspaces of 𝒳 such that 𝒳ₙ ↑.
Assume that a conditional probability Pₙ: 𝒳 → 𝒳ₙ exists for each n with PₙPₙ = Pₙ. Assume I(𝒴 | 𝒳ₙ) ∈ 𝒳ₙ. Then I(𝒴 | 𝒳ₙ) is a sub-martingale, that is,

 Pₙ₋₁[I(𝒴 | 𝒳ₙ)] ≥ I(𝒴 | 𝒳ₙ₋₁).

(Roughly translated, we expect to learn something by knowing 𝒳ₙ.)

P4. Let P_μ be a normal distribution on 𝒳 with mean μ, variance 1. Find the invariant transformations (σ, τ) for the quotient probabilities P_μ, and show that the only probability on μ which is τ-invariant for every τ is lebesgue measure.

P5. Let P_V be a normal distribution on 𝒳 with mean 0, variance V. Find the invariant transformations (σ, τ) for P_V, and find measures on V which are τ-invariant (more than one!).

P6. Let P_V be a bivariate normal distribution on 𝒳, with means 0 and covariance matrix V. Find the invariant transformations (σ, τ) for P_V, and find measures on V which are τ-invariant.

E4. In the binomial case, find that function of p which is uniformly distributed according to the Jeffreys density.

E5. Compute the Jeffreys density in the bivariate normal case with unknown means and covariances.

P7. A contingency table has 1000 cells with cell probabilities p₁, …, p₁₀₀₀, with Σᵢ₌₁¹⁰⁰⁰ pᵢ = 1. Show that the Jeffreys density implies that the number of cells with pᵢ > 1/1000 is approximately N(160, 134). Suppose that examination of the contingency table produced 12 empirical frequencies greater than 1/1000. Would you use the Jeffreys density for constructing estimates of pᵢ?

P8. An observation comes from the normal mixture, N(θ₁, 1) with probability ½ and N(θ₂, 1) with probability ½. Find the Jeffreys probability for (θ₁, θ₂).

P9. Observe the toss of a coin, with unknown success probability p, until r successes appear. Find the Jeffreys density for p. Now observe n tosses of the coin, and find the Jeffreys density for p. You observe a man toss a coin 50 times, getting 20 successes, and he asks you, as consulting Bayesian, to compute the posterior density of p.
In an effort to be impartial, you do so with the Jeffreys density for p for 50 tosses of a coin. He then confides in you that he stopped tossing when 20 successes were reached. Do you change the posterior density?

P10. Let P_θ have density (1 + θ′x)/4π with respect to uniform probability on the three dimensional sphere, for each x, θ in the sphere. Show that the Jeffreys probability is uniform over the sphere. If θ is constrained to lie in a great circle through the poles, show that the Jeffreys probability is uniform over the great circle. Show that the constrained Jeffreys probability is not a conditional probability for the unconstrained Jeffreys probability. (Similarly it is not possible to have joint probabilities and conditional probabilities which are rotation invariant.)

5.7. References

Bernardo, J. M. (1979), Reference posterior distributions for Bayesian inference (with discussion), J. Roy. Statist. Soc. B 41, 113-147.
Christensen, Ronald (1981), Entropy Minimax Sourcebook, Vol. I: General Description. Lincoln, Massachusetts: Entropy Limited.
Fraser, D. A. S. (1968), The Structure of Inference. New York: John Wiley.
Good, I. J. (1966), A derivation of the probabilistic explanation of information, J. Roy. Statist. Soc. B 28, 578-581.
Good, I. J. (1969), What is the use of a distribution?, in Krishnaiah (ed.), Multivariate Analysis Vol. II, 183-203. New York: Academic Press.
Hartigan, J. A. (1964), Invariant prior distributions, Ann. Math. Statist. 35, 836-845.
Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math. Statist. 36, 1137-1152.
Hartigan, J. A. (1966), Note on the confidence-prior of Welch and Peers, J. Roy. Statist. Soc. B 28, 55-56.
Hartigan, J. A. (1971), Similarity and probability, in V. P. Godambe and D. A. Sprott (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston.
Jaynes, E. T. (1957), Information theory and statistical mechanics, Phys. Rev. 106, 620-630.
Jeffreys, H.
(1946), An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London A 186, 453-461.
Kullback, S. (1959), Information Theory and Statistics. New York: Wiley.
Kullback, S. and Leibler, R. A. (1951), On information and sufficiency, Ann. Math. Statist. 22, 79-86.
Lindley, D. V. (1956), On a measure of the information provided by an experiment, Ann. Math. Statist. 27, 986-1005.
Perks, W. (1947), Some observations on inverse probability, including a new indifference rule, J. Inst. Actuaries 73, 285-334.
Pitman, E. J. G. (1979), Some Basic Theory for Statistical Inference. London: Chapman and Hall.
Shannon, C. E. (1948), A mathematical theory of communication, Bell System Tech. J. 27, 379-423.
Stone, M. (1970), Necessary and sufficient conditions for convergence in probability to invariant posterior distributions, Ann. Math. Statist. 41, 1939-1953.
Welch, B. L. and Peers, H. W. (1963), On formulae for confidence points based on integrals of weighted likelihoods, J. Roy. Statist. Soc. B 25, 318-329.
Winkler, R. L. (1967), The assessment of prior distributions in Bayesian analysis, J. Am. Stat. Assoc. 62, 776-800.
Zellner, A. (1977), Maximal data information prior distributions, in A. Aykac and C. Brumat (eds.), New Developments in the Applications of Bayesian Methods, 211-232. Amsterdam: North Holland.

CHAPTER 6

Decision Theory

6.0. Introduction

Fisher (1922) compared two estimators by considering their distributions given an unknown parameter of interest. For example, in estimating a normal distribution mean, the sample mean is unbiased with variance 2/π times the variance of the sample median, for all values of the distribution mean, so it is to be preferred to the sample median. Of course, it may be difficult in general to decide between the two families of distributions.
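Fisher's mean-versus-median comparison can be checked by simulation; in the sketch below the sample size, seed and replication count are arbitrary choices, and the 2/π ratio holds exactly only in the large-sample limit:

```python
import math
import random
import statistics

# Monte Carlo comparison of the sample mean and sample median as
# estimators of a normal mean: var(mean)/var(median) is close to 2/pi.
random.seed(1)
n, reps = 101, 20000
means, medians = [], []
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(sum(x) / n)
    medians.append(statistics.median(x))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
ratio = var_mean / var_median
print(ratio, 2 / math.pi)  # ratio approximately 2/pi ~ 0.637
```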
Neyman and Pearson (1933) proposed evaluation of a test statistic by considering the probability of rejection of the null hypothesis under various values of the parameter of interest. Wald (1939) proposed a general theory to cover both of these cases, in which a general decision function (of the data) is evaluated by its average loss for each value of the parameter. Wald suggested minimax techniques for selecting decisions, which arose out of von Neumann's theory of games: we play a game against nature (the unknown parameter value) so that our loss will not be too severe if nature chooses the worst parameter value. Ramsey, de Finetti and Savage use similar ideas from the theory of games in showing that a coherent betting strategy requires a probability distribution on the set of bets. There is no technical or conceptual difference between coherent betting and admissible decision making. If we decide to use one decision function rather than another, we are accepting the bet corresponding to the difference in losses for the two decision functions.

6.1. Admissible Decisions

It is necessary to choose one of a set of decisions D. The consequences of the decisions are determined by the outcome s in a set of possible outcomes S. For decision d and outcome s, there is a loss L(d, s). Since there is no reason to differentiate between two decisions which have the same loss for each value of s, one may regard the decision set D as a set of real valued functions on S: d(s) is the loss incurred by decision d if outcome s occurs. Let d₁ ≤ d₂ mean d₁(s) ≤ d₂(s) for s ∈ S. Say that d is admissible if it is minimal in D, that is, d′ ≤ d, d′ ∈ D ⇒ d′ = d. A complete class C is a subset of D such that d′ ∈ D − C implies d ≤ d′ for some d in C. A complete class is minimal if it contains no proper complete class. It is easy to show that if a minimal complete class exists, it is the set of admissible decision functions.
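The definitions above can be illustrated on a toy decision set (the loss vectors below are hypothetical numbers, not from the text): with a two-point outcome set, each decision is a pair of losses, and admissibility is componentwise minimality.

```python
# Toy illustration of admissibility: decisions are loss vectors over a
# two-point outcome set S = {s1, s2}; d is admissible (minimal) if no
# d' in D has d' <= d componentwise with d' != d.
D = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5), (1.0, 3.5)]

def admissible(d, decisions):
    for d2 in decisions:
        if d2 != d and all(a <= b for a, b in zip(d2, d)):
            return False
    return True

adm = [d for d in D if admissible(d, D)]
print(adm)  # the three mutually incomparable decisions survive
```

Here (2.5, 2.5) is beaten by (2.0, 2.0) and (1.0, 3.5) by (1.0, 3.0); the admissible set is exactly the minimal complete class for this finite D.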
However, no admissible decision functions may exist; consider S = {1}, D = {d | −∞ < d(s) < ∞}; every decision is inadmissible, and no minimal complete class exists. See Wald (1939, 1950) for the first theory, and Ferguson (1967) and Berger (1980) for expository texts.

If P is a probability on a probability space 𝒳 on S, such that D ⊂ 𝒳, a decision d₀ is a Bayes decision if P(d₀) ≤ P(d), d ∈ D. If a Bayes decision is unique it must be admissible (for otherwise there exists d′ ≤ d₀, d′ ≠ d₀, which implies P(d′) ≤ P(d₀), so that d′ is also a Bayes decision, contradicting uniqueness). We will say that d₀ is P-Bayes. If P is a finitely additive probability on 𝒳, a linear space of functions including D, then d₀ is P-Bayes if P(d − d₀) ≥ 0, d ∈ D.

Theorem. Let P be a finitely additive probability on 𝒳 including D. If d₀ is unique P-Bayes, then d₀ is admissible. If d₀ is admissible and D is a convex space of bounded functions, then d₀ is P-Bayes with respect to a finitely additive probability P on a space 𝒳 including D [Heath and Sudderth (1978)].

PROOF. If d₀ is P-Bayes, then P(d − d₀) ≥ 0, d ∈ D. If d ≤ d₀, then P(d − d₀) ≤ 0, so P(d − d₀) = 0 and d must also be P-Bayes. Since d₀ is unique, d = d₀ and so d₀ is admissible.

If d₀ is admissible, let 𝒳 be the bounded real valued functions, and define 𝒫 = {X | d₀ + aX ≥ d, some a > 0, some d ∈ D}. Then X₁, X₂ ∈ 𝒫 ⇒ aX₁ + bX₂ ∈ 𝒫 for a ≥ 0, b ≥ 0, by convexity of D. From Theorem 2.1, it is possible to extend 𝒫 to 𝒫* ⊃ 𝒫, such that 𝒫* ∩ −𝒫* = 𝒫 ∩ −𝒫 and 𝒫* ∪ −𝒫* = 𝒳. Note that 𝒫 includes all X ≥ 0, and excludes all X < 0. Set PX = sup{α | X − αd₀ ∈ 𝒫*}. Then −∞ < PX < ∞ because X is bounded, and P is an additive functional on 𝒳 with PX ≥ 0 for X ∈ 𝒫; in particular P is non-negative, that is, PX ≥ 0 for X ≥ 0. Since d₀ + (d − d₀) ≥ d, d − d₀ ∈ 𝒫 and P(d − d₀) ≥ 0, all d ∈ D. Thus d₀ is P-Bayes as required. □

Note.
The idea of this theorem is that accepting d₀ over all other decisions d is accepting the bets d − d₀, all d ∈ D. You will surely lose money (that is, d₀ is inadmissible) unless there is a finitely additive probability P such that P(d − d₀) ≥ 0, all d ∈ D.

Apparently we should be satisfied with finitely additive probability; however finitely additive probabilities do not discriminate well between decisions, so that a unique P-Bayes decision is unusual; many inadmissible decisions may also be optimal for a given functional P. Consider for example estimation of a normal mean. The data is x, the unknown mean is θ, and the decision is a function of x, say δ. In the loss framework, the decision will be represented as a real valued function of θ,

 d(θ) = ∫(δ(x) − θ)² exp[−½(x − θ)²]dx/√(2π).

Consider D to be composed of decisions d: estimate θ by δᵢ(x) with probability αᵢ, where δᵢ(x) − x → 0 as |x| → ∞, i = 1, …, n. For all such decisions d(θ) → 1 as |θ| → ∞. Uniform finitely additive probability on θ gives value lim_{|θ|→∞} X(θ) to X when this limit exists. Thus all the decisions proposed are finitely additive Bayes with respect to the uniform distribution; they are not all admissible, demonstrating that optimality by a finitely additive probability is rather too easy. See Heath and Sudderth (1972, 1978).

6.2. Conditional Bayes Decisions

Let X be a random variable into S, 𝒳 and let Y be a random variable into T, 𝒴, and assume that X × Y is a random variable into S × T, 𝒳 × 𝒴. Assume there is a quotient probability P^Y_X on Y given X, and denote the value of P^Y_X at s by P^Y_X(s); this defines a probability on 𝒴. Similarly assume a quotient probability P^X_Y. By the product rule, PZ = P^X P^Y_X Z = P^Y P^X_Y Z for each Z in 𝒳 × 𝒴.

Now suppose a decision d in D is to be taken using an observation t in T. The loss, if d is taken when the parameter s is true, is L(d, s).
A family Δ of decision functions δ: T → D is constructed satisfying (s, t) → L(δ(t), s) ∈ 𝒳 × 𝒴 for each δ ∈ Δ. The risk associated with δ is r(δ, s) = P^Y_X(s)[L(δ, s)].

Theorem. Suppose that, for each t, δ₀(t) is P^X_Y(t)-Bayes. If δ₀ ∈ Δ, then δ₀ is P^X-Bayes. Conversely, if δ′₀ is P^X-Bayes, and δ₀(t) is P^X_Y(t)-Bayes for each t, then δ′₀(t) is P^X_Y(t)-Bayes a.s. P^Y.

PROOF. Since δ₀(t) is P^X_Y(t)-Bayes,

 P^X_Y(t)[L(δ₀(t), ·) − L(d, ·)] ≤ 0, all d;
 P^X_Y(t)[L(δ₀(t), ·) − L(δ(t), ·)] ≤ 0, all δ ∈ Δ.

Since P^X P^Y_X = P^Y P^X_Y,

 P^X P^Y_X[L(δ₀, ·) − L(δ, ·)] ≤ 0, all δ ∈ Δ;
 P^X[r(δ₀, ·) − r(δ, ·)] ≤ 0, all δ ∈ Δ.

Thus δ₀ is P^X-Bayes. Conversely, if δ′₀ is also P^X-Bayes,

 P^Y P^X_Y(t)[L(δ₀(t), ·) − L(δ′₀(t), ·)] = 0,

and since the integrand is non-positive for each t,

 P^X_Y(t)[L(δ₀(t), ·) − L(δ′₀(t), ·)] = 0 a.s. P^Y. □

Note. The theorem makes it practicable to find Bayes decision functions, since it is easier to search over the smaller space of decisions D to obtain a conditional Bayes decision than to search over the larger space of decision functions Δ.

EXAMPLE. Let S, T be the real line, and let P^Y_X(s) denote the normal distribution with mean s and variance 1. Let 𝒳 and 𝒴 be the Baire functions on S and T. Take P^X to be normal with mean 0 and variance σ². Then P^X_Y(t) is normal with mean σ²t/(1 + σ²) and variance σ²/(1 + σ²). Let D be the real line, L(d, s) = (d − s)², and let Δ be the set of Baire functions from T to D. Then

 r(δ, s) = ∫(δ(t) − s)² exp[−½(t − s)²]dt/√(2π).

For a particular t, the P^X_Y(t)-Bayes decision δ₀(t) minimizes

 ∫(d − s)² exp[−½(s − σ²t/(1 + σ²))²(1 + σ²)/σ²]ds,

so δ₀(t) = σ²t/(1 + σ²). Thus the P^X-Bayes decision function is δ₀(t) = σ²t/(1 + σ²). It may happen that δ₀ is conditionally Bayes (that is, δ₀(t) is P^X_Y(t)-Bayes for each t) but not Bayes because δ₀ ∉ Δ. In the present example, if P^X is uniform, δ₀(t) = t is the conditional Bayes decision but it has risk r(δ₀, s) = 1 which is not integrable, so it is not the Bayes decision; in certain cases, conditional Bayes decisions are even inadmissible (see Chapter 9 on many means).
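The Example can be checked by simulation; the sketch below takes σ² = 1 and an arbitrary seed and replication count, and compares the conditional Bayes rule with the naive rule δ(t) = t:

```python
import random

# Monte Carlo check: with prior P^X = N(0, sigma^2) and Y | s ~ N(s, 1),
# the conditional Bayes decision under squared error is the posterior mean
# delta_0(t) = sigma^2 t/(1 + sigma^2), with Bayes risk sigma^2/(1 + sigma^2);
# the rule delta(t) = t has Bayes risk 1.
random.seed(3)
sigma2, reps = 1.0, 50000
se_bayes = se_naive = 0.0
for _ in range(reps):
    s = random.gauss(0.0, sigma2 ** 0.5)  # parameter drawn from the prior
    t = random.gauss(s, 1.0)              # observation given s
    d0 = sigma2 * t / (1.0 + sigma2)      # posterior mean
    se_bayes += (d0 - s) ** 2
    se_naive += (t - s) ** 2
risk_bayes = se_bayes / reps
risk_naive = se_naive / reps
print(risk_bayes, risk_naive)  # near 0.5 and 1.0 for sigma^2 = 1
```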
It should not be thought that the possible inadmissibility of conditionally Bayes decisions is caused by P^X not being unitary; in the present example, if P^X = N(0, σ²) and L(d, s) = exp(½s²/σ²)(d − s)², the conditional Bayes estimate is δ₀(t) = t, but it is not the Bayes estimate. In the same way, the inadmissible estimate for many normal means is conditionally Bayes with respect to a unitary prior distribution. See also Chapter 7 on conditional bets.

6.3. Admissibility of Bayes Decisions

If a Bayes decision d₀ is unique it is admissible [for any decision beating it would also have to be the Bayes decision]. If the decisions d in D are continuous in some topology on S, and the finitely additive probability P is supported by S (that is, Pf > 0 if f is continuous, non-negative, and not identically zero), then any decision which is P-Bayes is admissible. [If d₀ is P-optimal and d′ ≤ d₀, then P(d₀) ≥ P(d′) ⇒ P(d₀ − d′) = 0. Thus d′ = d₀ since P is carried by S.]

Let P be a probability on 𝒳 on S. Say that P is supported by S if for each continuous X in 𝒳, X ≠ 0, X ≥ 0 implies PX > 0. Say that P is Xₙ-σ-finite if some sequence Xₙ in 𝒳 has Xₙ ↑ 1. Say that a decision d₀ is Xₙ-limit Bayes if sup_{d∈D} P[Xₙ(d₀ − d)] → 0 as n → ∞.

Theorem. If P is supported by S and is Xₙ-σ-finite for some continuous Xₙ, and if D consists of continuous functions, and if d₀ is Xₙ-limit Bayes, then d₀ is admissible.

PROOF. If d′ ≤ d₀ and d′ ≠ d₀, then Xₙ(d₀ − d′) ≥ 0 and Xₙ(d₀ − d′) ≠ 0 for n large enough, so P[Xₙ(d₀ − d′)] > 0 for n large. Since Xₙ(d₀ − d′) ↑ (d₀ − d′), P[Xₙ(d₀ − d′)] → 0 is impossible. Thus d₀ is admissible. □

Note. If P is carried by S, if D consists of continuous functions in 𝒳, and if d₀ is P-Bayes, then d₀ is admissible. The present theorem applies to decisions d₀ which may not lie in 𝒳.

EXAMPLE. Let x be an observation from N(θ, 1), and suppose that θ is to be estimated with squared error loss function.
The theorem will be used to show that x is admissible, being Xₙ-limit Bayes with respect to uniform probability μ on θ. The estimate δ generates the decision

 d(θ) = ∫(δ(x) − θ)² exp[−½(x − θ)²]dx/√(2π).

The decisions d are θ-continuous. The measure μ is σ-finite with respect to fₙ(θ) = exp(−θ²/n) ↑ 1 as n → ∞, and

 inf_δ μ(fₙd) = inf_δ ∫∫(δ(x) − θ)² exp[−½(x − θ)² − θ²/n]dθdx/√(2π)
  = ∫∫ inf_δ (δ(x) − θ)² exp[−½(x − θ)² − θ²/n]dθdx/√(2π)
  = (n/(n + 2))^{3/2} ∫ exp[−x²/(n + 2)]dx
  = √π n^{3/2}/(2 + n).

Also μ(fₙ) = √(πn), so that

 sup_δ μ[fₙ(1 − d)] = √(πn) − √π n^{3/2}/(2 + n) = 2√(πn)/(n + 2) → 0.

Since the estimate δ₀(x) = x generates the decision d ≡ 1, δ₀ is continuous σ-finite Bayes as required. (Here the decision d ≡ 1 does not lie in 𝒳.)

EXAMPLE 2. Let r be a binomial observation, with P_p{r} = C(n, r)p^r(1 − p)^{n−r}, 0 ≤ r ≤ n. Suppose that p is to be estimated with squared error loss; the estimator δ corresponds to

 d(p) = Σ_{r=0}^{n} (δ(r) − p)² C(n, r) p^r(1 − p)^{n−r}.

For the measure μ, μ(f) = ∫₀¹ f(p)[p(1 − p)]^{−1}dp, take the functions f_a(p) = [p(1 − p)]^a ↑ 1 as a ↓ 0. Then

 inf_δ μ(f_a d) = inf_δ Σ_{r=0}^{n} ∫₀¹ (δ(r) − p)² C(n, r) p^{r+a−1}(1 − p)^{n−r+a−1}dp
  = Σ_{r=0}^{n} C(n, r) [Γ(r + a)Γ(n − r + a)/Γ(n + 2a)] (r + a)(n − r + a)/[(n + 2a)²(n + 2a + 1)]
  → 1/n as a → 0,

the infimum being attained at δ(r) = (r + a)/(n + 2a). For δ₀(r) = r/n, d₀(p) = p(1 − p)/n, and

 μ(f_a d₀) = ∫₀¹ p^a(1 − p)^a dp/n = Γ²(a + 1)/[Γ(2a + 2)n] → 1/n as a → 0.

Thus sup_δ μ[f_a(d₀ − d)] → 0 as a → 0. Also μ is carried by the interval [0, 1]. Therefore d₀ is admissible.

6.4. Variations on the Definition of Admissibility

A decision d is beaten by d′ at θ if d(θ) > d′(θ). We say
 d is somewhere beaten by d′ if d′(θ) ≤ d(θ) all θ, d′(θ) < d(θ) some θ;
 d is everywhere beaten by d′ if d′(θ) < d(θ) all θ;
 d is uniformly beaten by d′ if inf[d(θ) − d′(θ)] > 0.
Then d is admissible in a set of decisions D if d is not somewhere beaten by any d′ in D.
Say that d is weakly admissible if it is not everywhere beaten by any d′ in D, and that d is very weakly admissible if it is not uniformly beaten by any d′ in D.

The sense of admissibility appropriate for finitely additive probabilities is very weak. If d₀ is a finitely additive Bayes decision with respect to P, then d₀ is very weakly admissible [Heath and Sudderth (1978)]. Conversely, the argument of Theorem 6.1 shows that if d₀ is very weakly admissible, and D is convex with sup_s d(s) < ∞ for each d in D, then d₀ is finitely additive Bayes with respect to some P.

The sense of admissibility appropriate for probabilities is weak. If d₀ is a Bayes decision with respect to P, then d₀ is weakly admissible. However, converse results are more complicated than in the finitely additive case. See for example Farrell (1968).

If D consists of continuous functions and S is compact, then a finitely additive P on the space of continuous functions on S is countably additive (since if a decreasing sequence of functions converges to zero it converges uniformly to zero). Thus if d₀ is weakly admissible, it is P-Bayes with respect to a unitary probability P. More generally, if D consists of continuous functions zero outside compact subsets of S, a weakly admissible d₀ is P-Bayes with respect to a unitary probability P. [If d₀ is carried by S′, consider decisions and probabilities restricted to S′.]

6.5. Problems

E1. Let the sample space S be finite. Let D be a set of decisions on S (real valued functions on S). Let P give positive probability to each non-zero non-negative X on S. Show that a P-Bayes decision is admissible.

E2. For decisions D on a finite S, show that a P-Bayes decision may fail to exist, and that it might not be admissible if it does exist.

P1. Let X be a Poisson observation with P(X = x) = λ^x e^{−λ}/x!. Show that X is an admissible estimate of λ with squared error loss.

P2. Let X be binomial with P(X = x) = C(n, x)p^x(1 − p)^{n−x}, 0 ≤ x ≤ n.
Consider estimates δ of p using squared error loss. Show that the estimate δ(x) = x/n is weakly admissible.

6.6. References

Berger, James O. (1980), Statistical Decision Theory. New York: Springer-Verlag.
Farrell, R. (1968), Towards a theory of generalized Bayes tests, Ann. Math. Statist. 39, 1-22.
Ferguson, T. S. (1967), Mathematical Statistics, a Decision Theoretic Approach. New York: Academic Press.
Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Phil. Trans. Roy. Soc. A 222, 309-368.
Heath, D. C. and Sudderth, W. D. (1972), On a theorem of de Finetti, odds making, and game theory, Ann. Math. Statist. 43, 2072-2077.
Heath, D. C. and Sudderth, W. D. (1978), On finitely additive priors, coherence, and extended admissibility, Ann. Statist. 6, 333-345.
Neyman, J. and Pearson, E. S. (1933), On the problem of the most efficient tests of statistical hypotheses, Phil. Trans. Roy. Soc. A 231, 289-337.
Neyman, J. and Pearson, E. S. (1933), The testing of statistical hypotheses in relation to probabilities a priori, Proc. Camb. Phil. Soc. 29, 492-510.
Wald, A. (1939), Contributions to the theory of statistical estimation and testing hypotheses, Ann. Math. Statist. 10, 299-326.
Wald, A. (1950), Statistical Decision Functions. New York: John Wiley.

CHAPTER 7

Uniformity Criteria for Selecting Decisions

7.0. Introduction

The set of admissible decision functions in a particular problem is usually so large that further criteria must be introduced to guide selection of a decision function. Many such criteria require that the unknown parameter values be treated "uniformly" in some way; decision procedures are required to be invariant or unbiased or minimax or to have confidence properties.
Since selection of a decision function, from a Bayesian point of view, is selection of a probability distribution on the parameter values according to which the decision function is optimal, these criteria may be viewed as methods of selecting indifference probability distributions on the parameter values. The general conclusion is that the various uniformity criteria are satisfied by no unitary Bayes decision procedures, establishing the necessity for considering non-unitary probabilities.

7.1. Bayes Estimates Are Biased or Exact

Let P be a probability on 𝒳, and let θ ∈ 𝒳 be such that a conditional probability P[X | θ] exists satisfying PX = P[P(X | θ)] for all X. Let 𝒴 be a subspace of 𝒳. An estimate Y in 𝒴 of θ is unbiased if P[Y | θ] = θ and exact if P[Y ≠ θ] = 0. A Bayes estimate of θ in 𝒴 with respect to squared error loss is a Y such that P[Y − θ]² is a minimum over P[Y* − θ]² with Y* ∈ 𝒴, (Y* − θ)² ∈ 𝒳.

Theorem. An unbiased Bayes estimate is exact.

PROOF. Let Y be an unbiased Bayes estimate. Define X^A = (−A) ∨ X ∧ A for each A ≥ 0. Note |X^A| ≤ A, (X^A)² ∈ 𝒳. Setting Y* = Y + εY^A,

 P[Y* − θ]² = P[Y − θ]² + 2εP[Y^A(Y − θ)] + ε²P(Y^A)².

Since P[Y − θ]² ≤ P[Y* − θ]², taking ε small and of sign P[Y^A(θ − Y)], P[Y^A(θ − Y)] = 0. If Y is unbiased,

 P[Y | θ] = θ, P[Y − θ | θ] = 0, P[(Y − θ)θ^A | θ] = 0, P[(Y − θ)θ^A] = 0.

Thus if Y is Bayes and unbiased, P[(Y − θ)(θ^A − Y^A)] = 0. Since x − y ≥ x^A − y^A ≥ 0 whenever x ≥ y, (θ − Y)(θ^A − Y^A) ≥ (θ^A − Y^A)², so P[θ^A − Y^A]² ≤ 0, which implies P[θ^A ≠ Y^A] = 0 all A, and P[θ ≠ Y] = 0, so Y is exact. □

Note. It may happen that a posterior mean is unbiased. For example, let Y given θ be distributed as N(θ, 1) and let θ be uniform on the line; the posterior mean of θ given Y is Y, and P[[θ − Y]² | Y] ≤ P[[θ − D(Y)]² | Y] for all borel functions D. Also Y is unbiased, P[Y | θ] = θ. However, Y is not the Bayes estimate of θ in the class of functions D(Y), because none of the functions [D(Y) − θ]² are integrable.

7.2.
Unbiased Location Estimates

Let X₁, X₂, …, Xₙ and θ be real valued random variables on 𝒳. Suppose X₁, …, Xₙ are independent and identically distributed given θ, and that X₁ − θ given θ has a distribution which does not depend on θ; assume that this distribution has density f with respect to lebesgue measure. An invariant estimator δ of θ satisfies δ(X + a) = δ(X) + a.

Theorem. Suppose that X₁ has finite second moment given θ. The Pitman estimator, the posterior mean of θ given X corresponding to a uniform prior probability on θ, is unbiased and has minimum mean square error given θ of all invariant estimators.

PROOF. Consider first the case of one observation. Any invariant estimator is of the form δ(X) = X + a and has mean square error var X + [P_θδ(X) − θ]², so the Pitman estimate δ₀ will be optimal if it is unbiased. Now

 δ₀ = ∫θf(x − θ)dθ / ∫f(x − θ)dθ = x − ∫uf(u)du,
 P₀δ₀ = ∫xf(x)dx − ∫uf(u)du = 0.

Thus the Pitman estimator is unbiased and optimal.

For n observations, consider the behavior of an invariant estimator δ(X) and the Pitman estimator δ₀(X) conditional on X₂ − X₁, X₃ − X₁, …, Xₙ − X₁. The conditional density of X₁ is Πf(Xᵢ − θ)/∫Πf(Xᵢ − θ)dθ; the conditional Pitman estimate corresponding to this density is just δ₀(X). Also δ(X) is an invariant estimator of θ, considered as a function of X₁ with Xᵢ − X₁ fixed, i = 2, …, n. Thus the conditional risk of δ₀ is no greater than that of δ, and hence the unconditional risk of δ₀ is optimal. Similarly, since δ₀ is conditionally unbiased for θ, it is unconditionally unbiased. □

Note: The Pitman estimator is not the Bayes estimator corresponding to a uniform prior, because it has constant risk which is not integrable. Stein (1959) shows that the Pitman estimator is admissible whenever X₁ has finite third moment given θ, and Brown and Fox (1974) have shown admissibility under weak conditions.

7.3.
Unbiased Bayes Tests

In testing, a decision is made whether a parameter θ lies in a set H₀ or in a set H₁, H₀H₁ = 0, H₀ + H₁ = 1. Thus d takes the values H₀ or H₁ and has loss

 L(d, θ) = α{d = H₁}H₀ + β{d = H₀}H₁.

The loss is α if you mistakenly decide θ ∈ H₁ and β if you mistakenly decide θ ∈ H₀. A decision function based on a random variable Y is a function δ(Y) taking values H₀ or H₁. The decision function is unbiased if P_θL(δ(Y), θ) ≤ P_θL(δ(Y), θ′) for all θ, θ′, which is equivalent to

 P_θ[δ = H₁] ≤ β/(α + β) ≤ P_{θ′}[δ = H₁] for θ ∈ H₀, θ′ ∈ H₁.

Since α and β are usually arbitrary, the more usual definition of unbiasedness requires P_θ[δ = H₁] ≤ P_{θ′}[δ = H₁] for all θ ∈ H₀, θ′ ∈ H₁.

Suppose now that Y and θ are random variables on 𝒳, a probability P is defined on 𝒳, and a quotient probability P^Y_θ exists such that P^θP^Y_θ = P^YP^θ_Y. The conditional Bayes decision δ(Y) minimizes

 P_Y[α{δ(Y) = H₁}H₀ + β{δ(Y) = H₀}H₁],

which requires

 P_Y[θ ∈ H₁] ≤ α/(α + β) ≤ P_{Y′}[θ ∈ H₁] for δ(Y) = H₀, δ(Y′) = H₁.

Compare the form of the Bayes decision with the unbiasedness requirement. The conditional Bayes decision δ is the Bayes decision if L(δ(Y), θ) ∈ 𝒳. The Bayes decision is saying the obvious: you decide θ ∈ H₀ if the conditional probability of H₀ is large, and you decide θ ∈ H₁ if the conditional probability of H₀ is small.

To test θ = θ₀ against θ ≠ θ₀, assume that P^Y_θ has density f_θ(Y) with respect to some probability Q. The posterior distribution of θ given Y is given by

 (P^θ_Y g)(t) = P[g(θ)f_θ(t)]/P[f_θ(t)]

and the conditional Bayes decision is: δ(t) = 1 if P{θ = θ₀}f_{θ₀}(t)/P[f_θ(t)] ≤ β/(α + β). For a given prior P on θ, consider the mixture P_a, P_aX = aX(θ₀) + PX. The conditional Bayes decision is δ(t) = 1 if f_{θ₀}(t)/P[f_θ(t)] ≤ k, where k depends on a, α, β. The atom at {θ₀} affects only k. A test of this form will be called a P-Bayes test for θ = θ₀ against θ ≠ θ₀.

Theorem. Let θ be a real valued random variable.
Let Y be a random variable such that P^Y_θ has density f_θ with respect to a probability Q. Assume that f_θ(t) is θ-differentiable for each t, and sup_{θ∈I}|df_θ/dθ| is Q-integrable for each finite interval I of θ values. Let the prior probability for θ be P^θ. The test δ(t) = 1 if f_{θ₀}(t)/P[f_θ(t)] ≤ k is unbiased for every θ₀, k if and only if f_{θ₀}(t)/P[f_θ(t)] has the same distribution for every θ₀, letting t have the distribution of Y.

PROOF. Let h(t) = P^θf_θ(t), g(θ, t) = f_θ(t)/h(t). Unbiasedness requires

 P^Y_θ{g(θ, ·) ≤ k} ≤ P^Y_{θ′}{g(θ, ·) ≤ k},
 Q[{g(θ, ·) ≤ k}(f_θ − f_{θ′})] ≤ 0,

and, since the left side vanishes at θ′ = θ,

 Q[(d/dθ)f_θ {g(θ, ·) ≤ k}] = 0,

where the differentiation is justified because sup_{θ∈I}|df_θ/dθ| is Q-integrable. Since f_θ = hg(θ, ·) and h does not depend on θ, this holds for every k, so

 Q[h(dg/dθ)φ(g(θ, ·))] = 0 for φ bounded continuous,
 (d/dθ)Q[hφ(g(θ, ·))] = 0 for φ twice differentiable.

But P(φ[g(θ, ·)]) = Q[hφ(g(θ, ·))] for θ fixed. Thus g(θ, Y) must have the same distribution for all θ. The converse follows by running the steps of the proof in reverse. □

Note. Unbiased Bayes tests for unitary P rarely exist, but the above condition is met for some other P. For example, if Y ~ N(θ, 1) and θ is uniform, the Bayes test is: accept θ = θ₀ if |Y − θ₀| ≤ k, and the test statistic |Y − θ₀| has the same distribution for all θ₀ since the distribution of Y is uniform.

7.4. Confidence Regions

Let Y and θ be random variables, and let P^Y_θ be the quotient probability on Y given θ. Suppose we wish to select a set of likely θ-values; a decision d will be a set of θ-values, and a decision function δ(Y) selects such a set for each Y value. A set selection function δ is a confidence procedure if P^Y_θ[θ ∈ δ] = α₀ for all θ. This requirement is analogous to invariance or unbiasedness in that all θ-values are given the same treatment.
Consider a family of testing decisions d(θ), where d(θ₀) = 1 decides θ = θ₀ and d(θ₀) = 0 decides θ ≠ θ₀. The set {d(θ) = 1} is selected by d, giving a correspondence between families of tests and set selection decisions. The loss (analogous to testing loss) for d is

 L(d, θ₀, θ) = α{d(θ₀) ≠ 1}{θ = θ₀} + β{d(θ₀) = 1}{θ ≠ θ₀},
 P^Y_θ[L(δ, θ₀, θ)] = αP^Y_θ[θ₀ ∉ δ]{θ = θ₀} + βP^Y_θ[θ₀ ∈ δ]{θ ≠ θ₀}.

Thus we want a large probability P^Y_θ[θ ∈ δ] and a small probability P^Y_θ[θ₀ ∈ δ] with θ ≠ θ₀. The standard decision theory is not applicable because of the appearance of both θ and θ₀ in the loss function; it is necessary to consider a prior distribution over θ and θ₀ to discover admissible set selection procedures δ. For a given prior P^{θ,θ₀}, the Bayes set δ given Y minimizes

 αP^{θ,θ₀}_Y{δ(θ₀) ≠ 1}{θ = θ₀} + βP^{θ,θ₀}_Y{δ(θ₀) = 1}{θ ≠ θ₀},

which requires δ(θ₀) = 1 if P^{θ,θ₀}_Y{θ = θ₀} ≥ β/(α + β). For a given prior P^θ on θ, the conditional prior P^θ_{θ₀} suggested by testing is P^θ_{θ₀}X = aX(θ₀) + P^θX, and then the set selection procedure is

 δ(θ₀) = 1 if f_{θ₀}(Y)/P^θ[f_θ(Y)] ≥ K,

where Y has density f_θ given θ. Regions of this form are called Bayes high density regions; see for example Box and Tiao (1973) and Hartigan (1966). From Theorem 5.2, it follows that unitary Bayes high density regions are confidence regions for all K only if the probability P is a maximal learning probability; in the many cases where maximal learning probabilities do not exist, confidence regions cannot be unitary Bayes high density regions. However, confidence regions are often Bayes high density regions corresponding to non-unitary prior measures. Hartigan (1966) shows that high density regions are asymptotically closest to confidence regions for the Jeffreys density.

7.5. One-Sided Confidence Intervals Are Not Unitary Bayes

Let Y be a random variable on 𝒳, and let θ be a real valued random variable on 𝒳.
A one-sided confidence interval [−∞, θ̄(Y)] is such that P_θ[θ ≤ θ̄(Y)] = α for all θ. A one-sided Bayes interval [−∞, θ̄(Y)] is such that P_Y[θ ≤ θ̄(Y)] = α for all Y.

Theorem. A one-sided unitary Bayes interval of size α, 0 < α < 1, is not a confidence interval.

PROOF. Let P_Y[θ ≤ θ̄(Y)] = α. Then P[θ ≤ θ̄(Y)] = α. For each fixed θ₀, P[θ ≤ θ̄(Y) | θ ≤ θ₀] > α if P[θ̄(Y) ≥ θ₀ | θ ≤ θ₀] > 0. If P[θ̄(Y) ≥ θ₀ | θ ≤ θ₀] = 0 for all θ₀, then P[θ̄(Y) ≤ θ] = 1, which contradicts 0 < α < 1. Thus P[θ ≤ θ̄(Y) | θ ≤ θ₀] > α for some θ₀. If θ ≤ θ̄(Y) is a confidence interval, P_θ[θ ≤ θ̄(Y)] = α for all θ, so P_θ[θ ≤ θ̄(Y) | θ ≤ θ₀] = α and P[θ ≤ θ̄(Y) | θ ≤ θ₀] = α, which is a contradiction. □

7.6. Conditional Bets

Let Y and θ be random variables on 𝒳. A bet Z(Y, θ) is conditionally probable given Y if Z(Y, θ) ∈ 𝒳 for each Y and P_Y Z(Y, θ) ≥ 0. A bet Z(Y, θ) is conditional given Y if P_θ[Z(Y, θ)f(Y)] < 0 for all θ holds for no f ≥ 0 such that Z(Y, θ)f(Y) ∈ 𝒳 for all θ. If Y and θ take finitely many values, the two conditions are equivalent: for any matrix X_ij there exists no p_j ≥ 0 such that Σ_j X_ij p_j < 0 for all i if and only if there exists α_i ≥ 0 (α ≠ 0) such that Σ_i α_i X_ij ≥ 0 for all j. (Equivalently, a convex set disjoint from the negative quadrant is separated from the negative quadrant by a hyperplane.)

In general, the two senses of conditionality are not equivalent. For example, suppose that Y and θ take values on the integers 1, 2, .... Define

P_θ{Y = i} Z(i, θ) = [−{θ ≤ i} + {θ > i}]/i².

Then Σ_i P_θ{Y = i} Z(i, θ) g(i) = −Σ_{θ≤i} g(i)/i² + Σ_{θ>i} g(i)/i². This quantity is defined only if Σ g(i)/i² converges, and thus it cannot be negative for every θ when g ≥ 0, and Z(i, θ) is a conditional bet. For any probability P^θ,

Σ_θ P^θ P_θ{Y = i} Z(i, θ) = [−P(θ ≤ i) + P(θ > i)]/i²

is necessarily negative for some i, so Z cannot be conditionally probable.

Theorem.
Let Y and θ be real valued random variables, and suppose that (−∞, Y) is a confidence interval of size α for θ, and that 0 < P_θ(Y ≤ a) < 1 for all a. Then Z(Y, θ) = {θ ≤ Y} − α is not a conditional bet given Y.

PROOF. For θ ≤ a,

P_θ(Z(Y, θ){Y ≤ a}) = P_θ{θ ≤ Y ≤ a} − αP_θ{Y ≤ a} = (1 − α)P_θ{Y ≤ a} − P_θ{Y < θ} = (1 − α)[P_θ{Y ≤ a} − 1] < 0.

For θ > a,

P_θ(Z(Y, θ){Y ≤ a}) = −αP_θ(Y ≤ a) < 0.

Thus Z(Y, θ) is not a conditional bet. □

Note. See Olshen (1973) for references and an application to confidence ellipsoids. From 7.5, we know that one-sided confidence intervals are not conditionally probable with respect to a unitary probability, but they may be conditionally probable with respect to a non-unitary probability. If the definition of a conditional bet is weakened, so that Z(Y, θ) is weakly conditional given Y provided P_θ[Z(Y, θ)f(Y)] < 0 for no f such that Z(Y, θ)f(Y) ∈ 𝒳 for all θ, AND Z(Y, θ)f(Y) ∈ 𝒳 taking Y and θ random, then if Z(Y, θ) is conditionally probable it is weakly conditional given Y. Thus for example if Y ~ N(θ, 1) where θ is uniform, θ < Y + 1.64 is a 95% confidence interval, conditionally probable, and weakly conditional given Y, but not conditional given Y. The bets ({θ < Y + 1.64} − .95){Y ≤ 0} have negative conditional probability given θ, but are not integrable overall. Freedman and Purves (1969) and Dawid and Stone (1972) show, under regularity conditions, that the notions of conditionally probable and conditional bet coincide if and only if the distributions P_Y and P_θ are constructed according to Bayes theorem.

7.7. Problems

E1. Let t, 0 ≤ t ≤ n, be an observation from the binomial distribution with P_p{t} = (n choose t) p^t(1 − p)^(n−t). Show that the posterior mean for p, corresponding to any prior unitary probability P, is biased.

P1. Show that the Bayes estimate θ̂ corresponding to the loss function L(d, θ) = |d − θ| is the median of the posterior probability of θ given Y.
Does there exist a nonatomic posterior probability for which iJ is median unbiased, that is Po[O < iJ] = Pa[O;;:; iJ] = t all O? QI. Let Y\, Y2 ' ••• Y. denote independent observations from f(O, Y), and let P have density g(O) with respect to lebesgue measure on the line. Under suitable regularity conditions, when 00 is true, show the posterior mean p[OI Y\, ... , Y.] = 00 + ~[ - 0: 0 logg + Poo(fd2 + f 3 )/PaJ2 JI PaJ2 +O(n- 312 ) where fi = [Oi/OOi log f]o=oo' Then show that g is asymptotically unbiased if g(O) = -Po[0:210gf(0, Y)J (the square! of the Jeffreys density). Hartigan (1965). E2. Let Y,O;;:; Y < 2n, have density f(O, Y) = t[1 + cos(Y - 0)], where 0;;:; 0 ;;:;2n. Let the prior probability be uniform over 0 ;;:; 0 ;;:; 2n. Show that the Bayes high density region is a confidence region. Q2. As the number of observations Y\ ' ... , Y. becomes infinite, under suitable regularity conditions, find an asymptotic expression for the confidence size of the Bayes high density region with respect to P, Pao{nf(Oo' Y)lPa[nf(O, Y)] > k}. P2. Let the decision d be an ordering of the parameter values 0, with L(d, 0) = l{d(') > d(O)} -l{d(') < d(O)} for some loss measure /. Show that the Bayes decision given the observation Yorders 0 according to dP y/d/. P3. Let the decision d be an interval { - 00, c} on the line. Let 0 be on the line, and set L(d, 0) = - {O;;:; c} + K(c - 0)+. Show that the Bayes decision corresponding to a probability P with density g, satisfies c K S g(O)dO = g(c). P4. Consider the 95% confidence interval for 0 based on one sample from N(O, 1), {O;;:; Y + 1.64}. Bet $95 to win $5 that 0;;:; Y + 1.64 whenever Y ~ 0, and bet $5 to win $95 that 0 > Y + 1.64 whenever Y < O. Find your probable gain as a function of o. P5. In the normal location case, show the 95% confidence interval for 0, (Y - 1.96, Y + 1.96), is not a conditional bet. (Bet an amount proportional to e Y that 0 lies outside the interval.) P6. Let x l ' x 2 , ... ,x. 
be a sample from N{Jl, 0'2), .Iet i be the mean of x l ' ... , x. and let s be the standard deviation. Show that the confidence interval for J.J, i - ks, i + ks) can be beaten by betting that the interval contains p. if s > 1, and that it doesn't contain p. when s;;:; 1. (Buehler and Fedderson (1963». 71 7.8. References P7. If y, Xl' ... , x. are sampled from N{Jl, 1), show that the 95% tolerance interval for y, {y < x+ 1.64[1 + (l/n)]112} may be beaten by betting differently according to the value of x. E3. The decision d chooses one of two parameter values sl' S2 and L(d, s) = Is - dj. Show that the Bayes decision, given an observation t with density f(s, t), for any prior probability which has P{SI} > 0, P{S2} > 0, is SI if f(sl' t)Jf(S2' t) > c, d = S2 if f(sl' t)If(S2' t) ~ c. d= P8. XI' x 2 are observations from N{Jl, (12). A test for J.l = 0 against J.l +- 0 is similar ifthe probability of deciding J.l +- 0 when J.l = 0 is independent of (12. Are any Bayes tests similar? P9. If X and Yare random variables with the same distribution, show that P(X - Y > a) ~ P( IX I > tal. Let P8 be a family of probability distributions with positive densities f(O, Y), 0 ~ 0 ~ 00, satisfying the conditions of theorem 7.3, such that f(8, Y)J f(Oo' y) --+ 0 as 0 --+ 00 for each Y. Show that no unbiased unitary Bayes test exists. P1O. Let X be an observation with density f(x - 0) with respect to lebesgue measure, where f(u) = (2/lul- l)f(2 -Iui){ lui ~ 2}. Let g be a prior density with respect to lebesgue measure, g(O) = {[20] = 2[0]} where [0] is the largest integer ~ O. Show that the posterior mean with respect to g is unbiased for O. [The uniform distribution is not the only unbiased distribution in location problems]. 7.8. References Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis. Reading: Addison-Wesley. Buehler, R. J. and Fedderson, A. P. (1963), Note on conditional property of student's t, Ann. Math. Statist. 34,1098-1100. Brown, L. D. 
and Fox, M. (1974), Admissibility in statistical problems involving a location or scale parameter, Ann. of Statistics 2, 807-814. Dawid, A. P. and Stone, M. (1972), Expectation consistency of inverse probability distributions. Biometrika 59, 486-489. Freedman, D. and Purves, R. A. (1969), Bayes method for bookies, Ann. Math. Statist. 40,1177-1186. Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math. Statist. 36, 1137-1154. Hartigan, J. A. (1966), Estimation by ranking parameters, J. Roy. Statist. Soc. B 28,32-44. Oishen, R. A. (1973), The conditional level of the F-test, J. Am. Stat. Ass. 68, 692-698. Pitman, E. J. C. (1939), Location and scale parameters, Biometrika 30,391-421. Stein, C. (1959), The admissibility of Pitman's estimator ofa single location parameter, Am. Math. Statist. 30, 970-979. CHAPTER 8 Exponential Families 8.0. Introduction Let /-1 be a probability on qy, and choose a unitary probability P on qy to minimize the information P(log(dP/d/-1)) subject to PYi = ci' i = 1, ... , k. The optimal probability P has density -dP = d/-1 exp [ I k i= 1 a. Y.(t) ! +b J. ! Such a P is said to be exponential with respect to /-1, for the functions Y and the parameters a, denoted E[/-1, Y, a]. The further parameter b is determined as a function ofa by PI = 1. An exponential family {Ps' SES} consists of E[/-1, Y, s], SES where S is a subset of k-dimensional Euclidean space. The set of all values S with /-1 [ exp s'Y] < 00 is convex because exp is convex. Exponential families are attractive for statistical analyses because they remain exponential under repeated sampling and under formation of posterior distributions. If X is distributed as E[/-1, Y, SJ, then the random sample XI'X 2"",X n is distributed as E[/-1n,IY(X),s]. If S has prior probability P, and X is distributed as E[/-1, Y, sJ, the posterior probability of s given Xl' X 2' ... , X n is P x = E[P, [sl' S2' ... , Sk' Je(s)], [I Y (X), ... 
, I Yk(X), n]] 1 where Je(s) = -logPsexp[s'Y(t)]. Thus the posterior probabilities, for all data X, belong to the same exponential family! (This occurs for all prior probabilitiesP; there is NO special family of"conjugate" prior distributions!) 72 73 8.2. Prior Distributions for the Exponential Family 8.1. Examples of Exponential Families (i) BERNOULLI T = {a, I}, for t = 0, I, Let s = logp/(l- p), P s = exp[t 10gp + (l - t) log(l - p)] = exp [ts + A(S)], ),(s) = - log (1 For n observations, this becomes the binomial. (ii) POISSON (iii) EXPONENTIAL ° !1{l} P p = pr(l - p)1 -r = 1. ~ p ~ 1. +e S ). T = {a, 1, ... }, !1{t} = l/t!. p;. {t} = ),t e - A/t! = exp [t log ), - ),]Il{ t}. PJt}=exp[ts-e S ] withs=logA. T = [0, (0), !1 uniform on T. fACt) = e-t/A/Je. s=-I/Je. fs(t) = exp[ts + loge - s)], (iv) NORMAL LOCATION T = ( - 00, 00), !1 unit normal. feet) (v) NORMAL SCALE = exp[et - te2], T = ( - 00, 00), f)t)= s = e. !1 uniform I /1.:exp[-tt 2/0"2] O"y 2n fs(t) = exp [st 2 + t log [ - s/n]], s= - 1/20"2. [N ote. The A(S) for the posterior density is the second expression in the exponential argument.] [Note. Here parameters have been transformed to demonstrate reduction to exponential form; in computations, it is usually better to leave the parameters untransformed.] 8.2. Prior Distributions for the Exponential Family If P s = E[I/, Y, sJ, the Jeffreys density is [var Y]1i2; only the binomial among the standard families has the Jeffreys density unitary. Since Y is of minimum variance among unbiased estimators of Ps Y, it is of interest to discover the prior distribution such that the posterior mean is Y. This prior distribution is necessarily not unitary. Theorem. Suppose Ps = E[I/, Y, s], SI < S < S2 and that,foragiven y, exp(Ys)/ -> as s -> SI or s -> 8 2 , The posterior mean, given Y, of Ps Y is Y if the prior probability ofs is uniform over (SI ' 8 2 ), 1/[exp( Y s) ] ° 74 8. Exponential Families PROOF. P5 Y = J.l(Yexp(Ys»IJ.l(exp(Ys» a = as log J.l[ exp (Ys)]. 
5, J(PsY - Y)[exp(Ys)IJ.l(exp Ys)]ds a 51 J - as exp{Ys -log[J.l(exp(Ys»]}ds S2 = 51 o =0. In the binomial case tin is the minimum variance unbiased estimate of p. If s = log(pll - p) is uniform over - 00, 00, then tin is the posterior mean of p, provided 0 < t < n; in the case t = 0, n the posterior mean is not defined (the limiting conditions of the theorem break down). EXAMPLE. Name Binomial (~)pt(l _ Poisson Exponential Normal Location Normal Scale Parameter Family p)n-t p s=log-I-p ),!e-).It! s= log), e-t/).j). s = -Ill _1_e9t- (1/2)9' y'bt I --exp( -ft 2I( 2) qy'bt The Jeffreys Density (for s) lis s=8 s = -1/2u 2 lis 8.3. Normal Location Assume Xl"'" X n' () are random variables such that Xl"'" X n given o are independent, P;' = N«(), 1). (i) The posterior distribution. If () has prior probability P, the posterior probability P X1 ••••• Xn has ~ensity exp [ - tn(X - ()2]1 P exp [ - tn(X - ()2] with respect to P, where X denotes the mean of Xl' ... , X n' The posterior probability is defined for n large enough provided p[exp( - A()2)] < CI) for some A. Say that such a P is docile. (ii) Asymptotics. If ()o is in the support of a docile P, then P x eventually concentrates on () 0 as P80' (For each e > 0, P80 {p x(i() - ()oi > e) +O} = 0.) If a docile P has a density wrt Lebesgue measure that is continuous and positive at ()o' the posterior distribution is asymptotically normal N(X, lin) 75 8.3. Normal Location given eo' (That is, _ px(e~x+z/Jn)-+ z J 1 ,r;:cexp(-!u2)du -roy 2n [inPeJ) If a docile P has a density p wrt Lebesgue measure that is continuously differentiable and positive at eo' the posterior distribution is asymptotically N(X + [(%eo) logp]/n, lin) given eo' (That is, In[ px[ e ~ X + [o~o logp JI n + zJn J- Jro$exp ( -!u2)du J-+o [in PeJ) Thus the principal effect asymptotically of a smooth prior density is a shift in location of the mean of the posterior distribution. (iii) The uniform prior. 'The uniform prior is Lebesgue measure on the line. 
It is docile; it is the Jeffreys density, the only density invariant under location and sign changes, and the only density for which the corresponding location estimate is unbiased. The posterior distribution is P_X = N(X, 1/n). The posterior mean X is mean-square admissible, unbiased, and of minimum variance among unbiased estimators. The high density regions {θ | n(θ − X)² < k} of posterior probability α are confidence regions of confidence size α.

To test θ < 0 against θ ≥ 0, the Bayes decision accepts θ < 0 if P_X(θ < 0) > k, which is equivalent to X < c, the uniformly most powerful test of θ < 0 against θ ≥ 0. If X₀ is observed, it is customary to report the tail probability P_{θ=0}(X > X₀) in testing θ < 0 against θ ≥ 0; this is the same as P_{X₀}(θ < 0), the posterior probability of the null hypothesis. To test θ = 0 against θ ≠ 0, the Bayes test is of form: accept θ = 0 if |X| < c, which is the most powerful unbiased test. If X₀ is observed, it is customary to report the tail probability P_{θ=0}(|X| > |X₀|), which is the same as P_{X₀}(|X₀ − θ| > |X₀|), the posterior probability that the true mean is farther away from the observed mean than 0 is.

(iv) Normal priors. If P = N(θ₀, σ₀²), then P_X = N([θ₀/σ₀² + nX]/(1/σ₀² + n), (1/σ₀² + n)⁻¹); the posterior is of the same form as the prior. (This is true for any prior; see 8.0.) The formulae for means and variances may be remembered by the following scheme:

PRIOR: θ ~ N(θ₀, σ₀²)
OBSERVATION: X ~ N(θ, 1/n)

Act as if θ₀ is an observation on θ; combine it with the observation X, weighting inversely by variances:

1/var_X(θ) = 1/var_θ(X) + 1/var(θ)
P_X(θ)/var_X(θ) = X/var_θ(X) + θ₀/var(θ).

It may happen that the prior N(θ₀, σ₀²) and the observed X contradict each other. Note that X − θ₀ ~ N(0, σ₀² + 1/n). Thus if (X − θ₀)²/(σ₀² + 1/n) is very large, we might decide to revise the observation X, its distribution given θ, or the prior for θ. The contradiction will not arise if σ₀² is very large.
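The inverse-variance weighting scheme for normal priors is short enough to state as code; a small sketch (the function and variable names are mine, not from the text):

```python
def posterior_normal(theta0, var0, xbar, n):
    """Combine the prior N(theta0, var0) with an observed mean
    xbar ~ N(theta, 1/n).  Precisions (inverse variances) add, and
    the posterior mean weights theta0 and xbar inversely by variance."""
    prec = 1.0 / var0 + n            # 1/var_post = 1/var0 + n
    mean = (theta0 / var0 + n * xbar) / prec
    return mean, 1.0 / prec

# One observation, unit prior variance: prior and data share equally.
print(posterior_normal(0.0, 1.0, 1.0, 1))   # (0.5, 0.5)
```

As var0 grows large the answer approaches N(x̄, 1/n), the posterior under the uniform prior.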
The posterior distribution P_X approaches the posterior distribution N(X, 1/n) corresponding to the uniform prior as σ₀ → ∞, and so this family of priors is useful in showing admissibility of classical statistical procedures corresponding to the uniform.

(v) Two stage normal priors. Consider a family of normal priors P_λ = N[θ(λ), σ²(λ)]. Given λ, the posterior distribution P_{λ,X} is normal with parameters given in (iv). Suppose that λ itself has a prior distribution Q; then the prior distribution on θ is Q(P_λ), a mixture of normal priors. The posterior distribution corresponding to Q(P_λ) is also a mixture of normals Q_X(P_{λ,X}), where Q_X is the posterior distribution for λ for the prior Q and for the observation X ~ N[θ(λ), σ²(λ) + 1/n]. [Given λ, θ is distributed as N[θ(λ), σ²(λ)] and X is distributed as N(θ, 1/n); ignoring θ, X ~ N(θ(λ), σ²(λ) + 1/n).] Thus values of λ for which [X − θ(λ)]²/[σ²(λ) + 1/n] is large will be downweighted, and contradiction between the prior mean θ(λ) and the observed X is prevented.

8.4. Binomial

The number of successes t, 0 ≤ t ≤ n, has probability (n choose t) p^t(1 − p)^(n−t). The prior P is docile if p^A(1 − p)^B is integrable for some A, B.

(i) The posterior density with respect to the docile prior P is p^t(1 − p)^(n−t) / P[p^t(1 − p)^(n−t)]. The posterior density concentrates with probability 1 on p₀ if p₀ is true and p₀ lies in the support of P. If P has a density that is continuous and positive at p₀, then P_t is asymptotically normal N[t/n, p₀(1 − p₀)/n].

(ii) Beta priors. If P is Be(α, β), having density p^(α−1)(1 − p)^(β−1)/B(α, β) with respect to Lebesgue measure, then P_t is Be(α + t, β + n − t). Note that if P = Be(α, β), then Pp = α/(α + β), var p = αβ/[(α + β)²(α + β + 1)]. The Jeffreys density is Be(1/2, 1/2). The "unbiased" prior P = Be(0, 0) has posterior mean P_t p = t/n for 0 < t < n; the posterior distribution is not defined for t = 0, t = n.
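The beta-prior update for the binomial can be sketched directly; the helper below and the numeric example (7 successes in 10 trials, Jeffreys prior Be(1/2, 1/2)) are my own illustration:

```python
def beta_binomial_update(alpha, beta, t, n):
    """Posterior Be(alpha + t, beta + n - t) for a Be(alpha, beta)
    prior after t successes in n trials; returns the posterior
    parameters and the posterior mean a/(a + b)."""
    a, b = alpha + t, beta + n - t
    return a, b, a / (a + b)

# Jeffreys prior Be(1/2, 1/2): posterior Be(7.5, 3.5), mean 7.5/11.
print(beta_binomial_update(0.5, 0.5, 7, 10))
```

Letting alpha = beta tend to 0 drives the posterior mean to t/n, the "unbiased" limit, while the posterior itself is undefined at t = 0 or t = n.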
The admissibility of t/n may be demonstrated by considering it as a limit of the Bayes posterior means (t + C()/(n + 2C() corresponding to priors Be (C(, C() as C( ---> O. (iii) Confidence properties of beta priors. The discreteness of t makes it impossible to achieve unbiasedness of two sided tests p = Po against p =1= Po' or to find set selection procedures that have the confidence property. Welch and Peers (1963) show that the Jeffreys density generates Bayes one sided intervals which are most nearly confidence intervals; but their proof is invalid for discrete observations. 77 8.4. Binomial For a particular prior, consider the one sided Bayes intervals [0, pJ, such that P, {p ~ Pt} = a, ~ t ~ n. The confidence properties of such intervals are determined by the function P /p ~ P,); this function is discontinuous in general at (Po' PI' ... , pJ For the prior Be(O, I), P/p ~ P,) ~ a all P > ane;! for the prior Be(l, 0), P /p ~ p) ~ a all p < 1 (Thatcher, 1964). These might be viewed as "liberal" and "conservative" confidence regions for p. The inequalities follow from the identity: ° ° ° ° which relates binomial and beta (confidence and Bayes) probabilities. For Be(l, 0), Pp(P ~ ptlLa as pi Pk ' < k ~ 11. For Be(O, 1), Pp(P ~ p,)ia as plp k , ~ k < n. Let P,,). be such that Pt(p ~ p,) = a for the prior Be p, 1 - ).], ~ ~ 1. Then Pt,). is increasing in A. [The prior is equivalent to observing). successes and 1 -). failures; the larger )., the more the posterior for a particular t is shifted to the right.] Thus for each prior Be [)., I - ).], ~ ~ 1, lim P /p ~ p,) ~ a, lim P p(p ~ P,,;) ~ a. The confidence values cross the ° ), ° ). piPr,A. plpt} correct probability a at each of the points of discontinuity Pt,A' In Figure 1, n = 10, a = 0.9 and the upper and lower bounding confidence curves for Be(O, 1) and Be(1, 0) are given, together with the intermediate curve for Be(I/2, 1/2), the Jeffreys density. 
Note that P~(p ~ Pt) ---> 1 as P ---> 0, P~(p ~ p,) ---> as P ---> 1 for the Jeffreys density, so that it can never give confidence values uniformly near a. It does give confidence values which are closer on the average to the correct a than the bounding priors Be(O, 1) and Be(1, 0). ° I.O~----~--.---~~--~~----~----------~--~--~~ 0.9 0.8 0.6 0.4 ........ Density CC 1/ P Density CC II(I-p) Density cc Binomial Parameter Figure 1. I [p( l-p)r'2 78 8. Exponential Families An arbitrary interval selection procedure specifies an interval {p ~ p,} for each of t = 0, 1, ... , n. Its confidence properties are given by the function P {p ~ PI}, which is discontinuous at each of the points Po < Pi < ... < Pn. The overall error of the procedure might be assessed by sup IP p(P ~ PI) - ° ° e<p<l-e oc I; it is necessary to bound P away from and 1, because lies in every interval and 1 usually lies in no intervals, so that P o(P ~ PI) = 1, Pi (P ~ PI) = 0. The maximum error is achieved at the points of discontinuity; it will be minimized by ensuring that Wim Pp(P ~ PI) + lim P p(P ~ PI)) = oc at each ptps P~Ps point of discontinuity Ps ' In this case, the asymptotic error at Ps is (l/2fo) exp( - !Z;)·(l/JPs(l- Ps)).(1/";;;) where Za is such that P(Z ~ Za) = oc for a normally distributed Z. (This result is obtained by equating binomial and beta tail areas and then using Edgeworth expansions for the beta distribution, involving the first three moments.) For a Bayes procedure, the asymptotic error is (l/fo) exp (- tZ;Hl/JPs(l - Ps (l/";;;)sup(ILlI,ILI-1j), where the prior density h satisfies LI= [(%p)log(h(P))p(l- p)]p=ps' This error is minimized for all Ps ' precisely when h = j, the Jeffreys density, and in this case the interval selection procedure is close as possible to being a confidence procedure. (The error sup IP/p ~ PI) - oc I is 0 (n -1/2) for every prior, but is smallest for the ». e<p< 1-e Jeffreys prior. If Po' Pi' ... 
'P n are the upper bounds of intervals taken to ensure t[lim Pp(P ~ PI) + lim P/p ~ PI)] = oc for each Ps ' and if P: denote pt Ps P~Ps the Bayes upper bounds, then P: = Ps +0 (l/n) for any Bayes procedure, and P: = Ps + o(l/n) for the Jeffreys density.) In the table below, the intervals {p ~ PI} are specified corresponding to the three priors Be(O, 1), Be(1, 0), Be(!, t), and also for a confidence procedure minimizing maximum error. EXAMPLE. For n = 10, oc = .90, the intervals for various methods: Po PI P2 P3 P4 Ps P6 P7 Ps P9 Plo Be (0, 1) Be(1,0) Be(t,t) Confidence .0 .2057 .3368 .4496 .5517 .6458 .7327 .8124 .8842 .9955 .9895 .2057 .3368 .4496 .5517 .6458 .7327 .8124 .8842 .9455 .9895 1.0000 .1236 .2744 .3948 .5018 .5997 .6901 .7735 .8494 .9164 .9704 .9998 .1487 .2981 .4063 .5118 .6090 .6990 .7823 .8584 .9257 .9799 1.0000 79 8.6. Normal Location and Scale 8.5. Poisson The number of occurrences t, 0 ~t < 00, has probability PA{t} = A.te-A/t!. The prior P is docile if AKe- Ais integrable some K. (i) The posterior density. with respect to the prior probability P is Ate- A/ P(),te-A). The posterior density concentrates on A.o with probability 1, if 1..0 is true and lies in the support of P. If P has a continuous positive density at A. o' then Pt is asymptotically normal N(A o' Ao/n). (ii) Gamma priors. The prior G(m, a) has density amAm-le-aA/r(m), the gamma density. The posterior given t is G(m + t, a + 1). The Jeffreys density is G(t, 0), not unitary. The "unbiased" prior is G(O, 0), which has posterior mean t; the posterior distribution is not defined for t = O. (iii) Confidence properties of gamma priors. Similar considerations to those for the binomial apply. Ex~ct confidence intervals are not possible because of the discreteness of the Poisson. For a particular prior P, let 0 ~ A ~ At be the IX-probability interval, Pt(O ~ A ~ A.t) = IX. The confidence function P(A ~ At) will be discontinuous at A. o' A!, .... For the prior G(O, 0), P A(A. 
~ At) ~ IX all t, and for the prior G(l,O), P A(A ~ A.t ) ~ IX all t. These results follow from the equivalence between Poisson and gamma tails: co A.t co xto-1 L -e- A= J e-Xdx. oCto-I)! t=tot! As in the binomial case, Jeffreys' density gives intervals which are closest to confidence intervals in that !(lim P A(A. ~ At) + lim P /A ~ At}} ATA. AlA. is closest to IX at every As' 8.6. Normal Location and Scale Suppose Xl' X 2 , .. ·, Xn are from N(f.l, then Xl' ... , X n is E [ vn , 0'2), (~~? ). (_~~;;2 on Rn. The prior P is docile if exp [ - (A A,B>O. where J1 and )] 0'2 are unknown; where vn is Lebesgue measure + BJ12)/O' 2 ] is integrable for some (i) General priors. For the prior P, the posterior Pt has density O'- n exp[ - tLX?/O' 2 + LX i J1/O' 2 - tnJ1 2 /O' 2 ]k(X l ' ... , Xn) with respect to P. If (J1 0 ' o'~) lies in the support of P, and J1 0 ' o'~ is the true value, then the posterior distribution concentrates on (J1 0 ' o'~) with probability 1. If P has a positive continuous density at (J1 0 ' o'~), then the posterior density is asymptotically normal. 80 8. Exponential Families (ii) I nvariance generated priors. A prior with density [with respect to Lebesgue measure on {t, u 2J u- A exp( - tB/u 2 + C{t/u 2 - tD{t2/U 2 + K) is called an in variance generated prior IG(A, B, C, D). After the observations Xl' X 2' ... , X n' the posterior is IG(A + n, B + LX?, G + LXi' D + n). Priors of the form IG(A, 0, 0, 0) are improper, invariant priors under the transformations Xi --+ a + bX i , {t --+ a + b{t, u --+ Ib Iu; by considering posterior distributions obtained from invariant priors, for various types of data, we obtain the distributions IG(A, B, C, D) where B:?; 0, D is an integer, C 2 ~BD. The Jeffreys density is IG(2, 0, 0, 0). The density IG(5, 0,.0, 0), for parameters (l/n)P ll.cr2(LX) = {t and (l/n)P Il ,cr2(LX?) = u 2 + {t2, has posterior means (l/n)LX i and (l/n)IX?; this corresponds to ({t/u 2, - 1/2( 2) being uniform in the plane, see 8.2. 
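The Poisson–gamma tail identity of Section 8.5, Σ_{t ≥ t₀} λ^t e^{−λ}/t! = ∫₀^λ x^{t₀−1} e^{−x}/(t₀−1)! dx, can be checked numerically. A sketch, with λ = 4.5 and t₀ = 3 as arbitrary test values and a plain trapezoid rule for the integral:

```python
import math

def poisson_tail(lam, t0, terms=100):
    """P(T >= t0) for T ~ Poisson(lam), by direct summation."""
    return sum(lam ** t * math.exp(-lam) / math.factorial(t)
               for t in range(t0, t0 + terms))

def gamma_cdf(lam, t0, steps=20_000):
    """P(X <= lam) for X ~ Gamma(t0, 1): trapezoid integration of
    x^(t0-1) e^(-x) / (t0-1)! over [0, lam]."""
    f = lambda x: x ** (t0 - 1) * math.exp(-x) / math.factorial(t0 - 1)
    h = lam / steps
    return h * (0.5 * (f(0.0) + f(lam))
                + sum(f(i * h) for i in range(1, steps)))

print(abs(poisson_tail(4.5, 3) - gamma_cdf(4.5, 3)) < 1e-6)   # True
```

This is the equivalence between binomial/Poisson tails and beta/gamma probabilities on which the confidence statements of 8.4 and 8.5 rest.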
(iii) Marginal distributions of {t and u 2. The marginal density of {t corresponding to IG(A, B, C, D) is K1(tB - C{t + tD{t2)-(A-l)/2, which is a student distribution with A - 2 degrees offreedom. (The conditional density of {t given u 2 is normal.) The marginal density of 1/u 2 = u corresponding to IG(A, B, c, D) is K2dA-2)/2-lexp[ -t(B-D 2/C)uJ, which is a gamma distribution with A - 2 degrees of freedom. (iv) The "confidence" prior IG(1, 0, 0, 0). For this prior, the posterior is IG(n + 1, LX?, LXi' n), and the marginal densities of {t and u are: In(X - {t)/s '" Tn-I (n - l)s2/u 2 '" X;-l where X = l/nIXi' S2 = L(X i - X)2/(n - 1), Tn-I denotes a student distribution on (n - 1) degrees offreedom, 1 denotes a chi-square distribution on (n - 1) degrees of freedom. Since the same distributions hold when {t and u 2 are fixed, and X and s are random, Bayes intervals and regions have a confidence interpretation. For example, the high density region of {t, {{tIJnIX - {tl ~ sTn- 1,a} has posterior probability (L given Xl''''' X n' but also probability (L given {t, u 2. (Here P(ITn_ll~Tn_l)=(L.) Or, the one-sided Bayes interval for u 2, {u 2 (n - l)s2 /u 2 ~ X;- 1, a} has posterior probability (L of containing u 2, but also probability (L given u 2. (Here P(X; _ 1 ~ X; _ 1, a) = (L.) (It should be noted that the high posterior density region for {t and u 2 is not a confidence region for this prior; the Jeffreys prior IA(2, 0, 0, 0) gives a high posterior density region x;_ 1 (s/u)n exp [ - t(n - l)s2/u 2 - tn(X - {t)2/U2J :?; c which is also a confidence region.) The Bayes test for {t = {to against {t =1= {to is: accept {t = {to if IX - {tol < cs. The tail area p[1 Tn-II:?; Jnl X - {to l/sJ is the Bayes posterior probability 81 8.6. Normal Location and Scale p x[ III - Xl ~ IlLo - Xl], the probability that IL is further from the observed X than lLo' (v) Unbetworthiness of confidence intervalfor IL. 
The intervaLjnl1L - Xl ~ Tn_ 1 as, where P(ITn_11 < T n_ 1 a) = OC is not betworthy. If s < I, bet 1 - oc to receive 1 if.jn11L - Xl > Tn_l:as.1f s ~ I, bet oc to receive 1 ifJnl1L - Xl ~ Tn_l,as. The strategy is to bet that IL does not lie in the interval when s is small, and to bet that IL does lie in the interval when s is large. Since P[Jnlx -ILl ~ Tn_l,asl s, IL, oj increases strictly with s, and averages oc over all s, p[Jnlx -ILl ~ Tn_I,asl s < I] ~ oc < P[ JnIX -ILl ~ Tn_I,asls > 1]. The above bet will always have positive expectation, no matter what the value of IL, u. However, as u 2 --+ 0 or 00, the net gain from the bet will be arbitrarily close to zero. More generally, bet ock(s) to receive k(s) if JnIIL - Xl ~ Tn_I,as, where k(s) may be negative. Whenever the function k(s) is strictly increasing in s, the net gain from the bet is P",Jk(s) [ {In[1L - X] ~ Tn-I,A - oc]] > O. For k(s) = sA., the net gain is u)K(n, oc, l) where K(n, oc, l) has the same sign as l. Thus the bet s2IK(n, oc, 2) + S-2 1K(n, oc, - 2) has gain u 2 + u- 2 ~ 2. It is thus possible to devise bets whose net gain is greater than 2 for all u 2 • See Brown (1967). (vi) The Behrens-Fisher problem. Suppose that Xl' ... , Xn are a sample from N( 1L 1' u:), and Y l , ... , Ym are a sample from N(1L 2, u~). Taking the "confidence" prior density (with respect to Lebesgue measure v on ILl' 1L2' u l ' ( 2 ) u~ IU;- I, and letting X = ~IXp si = I(X i - X)2/(n - I), Y= ~ IYp s: = L<Y i - y)2/(m - 1), the posterior distributions of ILl and 1L2 are independently ILl - X + sxTn-lIJn, 1L2 - Y +SyTm_llrm· Then ILl - 1L2 is the convolution of two student distributions. In order to test ILl = 1L2 against ILl =1= 1L2' Behrens and Fisher propose the test which rejects ILl = 1L2 if P xA IILI - 1L2 - (X - Y) I > IX - YI] is sufficiently small. The Bayes test would reject ILl = 1L2 if the posterior density of ILl -1L2 at 0 is 82 8. 
Exponential Families sufficiently small-that is, if J I [ m (Xi - V)2 J-<m+1)/2[ n J-<n+1)/2 I (¥i + V)2 dv i= 1 i= 1 is small enough. Both tests have probability of rejection, given III = 1l2' that depends on (11/(12' 8.7. Problems E1. Suppose XI' ... , X.is a random sample from N(O, 1), and thatthe prior distribution P has P{ - I} = P{ I} = t. If 0 = 0, what is the asymptotic behavior of the posterior distribution? E2. If X is an observation from N(O, I), show that for every unitary prior P, PPe[O - Px(O) Y < 1. PI. If X is an observation from N(O, 1), show that aX is an admissible estimate of 0 using the loss function L(d, 0) = (d - 0)2, for 0 ~ a ~ 1. P2. If X is Poisson with parameter A, and the prior on A is gamma, G(m, a), find the Bayes estimate of A with loss L(d, A) = (d/A - Ajd)2. Q1. In a test of 10 questions, a child gets t questions correct where t is binomial with parameter p. Over many children, the parameter p has prior distribution P. The observed number of successes over many children is: NUMBER CORRECT NUMBER OF CHILDREN 0 66 240 2 3 4 540 960 2450 5 3016 6 7 8 2520 2520 2970 9 10 2640 1716 Estimate P. P3. If an observation t has probability P s = E[Jl, Y, s], show that Bayes tests of s ~ So against s > So are ofform: decide s ~ So if Y ~ Yo' P4. If XI' ... , Xn are observations from N(O, I), and for the uniform prior on 0, find the conditional distribution of X k + l' ... , Xn given XI' X 2' ... , X k • E3. For a normal sample X l ' ... , Xn from N(Jl, (2), with the prior 10(1,0,0,0), find the posterior mode of J1. and q2, and the posterior means of J1. and q2, based on the posterior density of J1. and q2. P5. For a normal sample X l ' ... , X n from N(J1., (2), with prior 10(1,0,0,0) find the Bayes estimator of q2 using loss function L(d, (2) = (d - q2j2, and compare its risk function with those of maximum likelihood and unbiased estimates of q2. 8.8. References 83 8.8. References Brown, L. 
(1967), The conditional level of the t-test, Annals of Mathematical Statistics 38, 1068-1071. Thatcher, A. R. (1964), Relationships between Bayesian and confidence limits for prediction, J. R. Statist. Soc. B 26, 176-210. WeIch, B. L. and Peers, H. W. (1963), On formulae for confidence points based on intervals of weighted likelihoods, J. Roy. Statist. Soc. B 25,318-329. CHAPTER 9 Many Normal Means 9.0. Introduction Given X, suppose pi' = N(X i , 1), i = 1,2, ... , n, and the Y; are independent. The straight estimate Y; of X; is least squares, maximum likelihood, of minimum variance among unbiased estimators, posterior means with respect to the Jeffreys density (the XI' ... , Xn are uniform) but for all these virtues inadmissible with loss function 2:7= 1 (d; - Xl for n > 2, Stein (1956). 9.1. Baranchik's Theorem Lemma. If Y ~ N(X, 1), X ~ 0, and iffis integrable 00 P[J(y2)] = 2: PkP[J(X~k+ I)J k=O where X~k+ 1 denotes a variable with the chi-square distribution, and Pk = exp( - ±X 2 H±X 2 Nk! are Poisson probabilities with expectation ±X2. PROOF. The first result, that a non-central chi-square is a mixture of central chi-squares with Poisson mixing probabilities, should have a nice probabilistic proof, but I don't know one. 84 85 9.1. Baranchik's Theorem (i) p[J(y2)] = J':"",f(y2) exp[ -1{X - y)2]dYlfo = SI(y2)exp( -ty2)exp[XY]dYexp[ -tx2)lfo 00 Xkyk = J':"rof(y2) k~O k! exp[ - ty2]dYexp[ - t X2 ]/fo· {Note X~k+ 1 = y2 = u has density: Uk- 1/2 exp( - tuHW+ 1/21 f'(k + t).} X2ky2k ,exp[ -tY2]dYexp[ -tx2]/fo k=O 2k. co P[J(y2)]=2J~f(y2) L (tX2)k(ty2)k - t)(k - t) ... t _ 2Joof(y 2)" - L.. k !(k 0 ·exp[ - tyZ]dYexp[ - tX2]/fo (1. y2)k-I/2 dy2 =LJ~f(y2)Pk 2f'(k+t) exp[ - t yZ ]-2-[sincef'(t)=Jn] = IPkP[J(X~k+l)J. 00 =2J~f(y2)~ X2k+ly2k+2 (2k+I)! ·exp[ -ty2]dYexp[ -tx2]/fo ro = X L PkP[J(X~k + 3)] after some algebra. D k=O Theorem (Baranchik (1970)). Let Yi be independent N(X p 1), i = 1, ... , n. 
Let $S = \sum_{i=1}^n Y_i^2$, and let $f$ be a non-decreasing non-negative function with $f < 2(n - 2)$. Then
$$P\Big[\sum\big(Y_i[1 - f(S)/S] - X_i\big)^2\Big] < P\Big[\sum(Y_i - X_i)^2\Big]$$
for every $X_1, X_2, \ldots, X_n$, if $n > 2$.

Since $S = \sum_{i=1}^n Y_i^2$ is invariant under rotations of the $Y_i$, $\sum(Y_i[1 - f(S)/S] - X_i)^2$ has the same distribution if $Y$ and $X$ undergo the same rotation. It is sufficient therefore to consider $X_1 \ge 0$, $X_i = 0$ for every $i \ge 2$.

PROOF. Let $g(S) = 1 - f(S)/S$. Then $S = Y_1^2 + Z$ where $Z \sim \chi^2_{n-1}$ independent of $Y_1$, and
$$P\sum\big(Y_ig(S) - X_i\big)^2 = P\Big[\sum Y_i^2g^2(S) - 2X_1Y_1g(S) + X_1^2\Big].$$
From the lemma,
$$P[Sg^2(S)] = P^ZP^{Y_1}\big[(Y_1^2 + Z)g^2(Y_1^2 + Z)\big] = P^Z\sum p_kP\big[(\chi^2_{2k+1} + Z)g^2(\chi^2_{2k+1} + Z)\big] = \sum p_kP\big[\chi^2_{2k+n}\,g^2(\chi^2_{2k+n})\big],$$
with $p_k = \exp(-\tfrac12 X_1^2)(\tfrac12 X_1^2)^k/k!$, and
$$P[Y_1g(S)] = X_1\sum p_kP\big[g(\chi^2_{2k+n+2})\big].$$
Using $X_1^2p_k = 2(k+1)p_{k+1}$, so that $X_1^2\sum p_kP[g(\chi^2_{2k+n+2})] = \sum p_k\,2k\,P[g(\chi^2_{2k+n})]$ and $X_1^2 = \sum p_k\,2k$,
$$P\sum\big(Y_ig(S) - X_i\big)^2 = \sum p_kP\big[\chi^2_{2k+n}g^2(\chi^2_{2k+n}) - 4k\,g(\chi^2_{2k+n}) + 2k\big],$$
$$P\sum\big(Y_ig(S) - X_i\big)^2 - P\sum(Y_i - X_i)^2 = \sum p_kP\big[\chi^2_{2k+n}g^2(\chi^2_{2k+n}) - 4k\,g(\chi^2_{2k+n}) + 2k - n\big].$$
This expression is to be shown $< 0$. Set $g(S) = 1 - f(S)/S$ and note that
$$P\big[f(\chi^2_{2k+n})/\chi^2_{2k+n}\big] \le P\big[f(\chi^2_{2k+n})\big]\,P\big[1/\chi^2_{2k+n}\big]$$
because $f$ is non-decreasing, so $f(\chi^2)$ and $1/\chi^2$ are negatively correlated. Then
$$P\big[\chi^2_{2k+n}g^2(\chi^2_{2k+n}) - 4k\,g(\chi^2_{2k+n}) + 2k - n\big] = Pf(\chi^2_{2k+n})\big[{-2} + f/\chi^2_{2k+n} + 4k/\chi^2_{2k+n}\big] \le Pf(\chi^2_{2k+n})\Big[{-2} + \frac{2(n-2) + 4k}{n + 2k - 2}\Big] \le 0,$$
since $f < 2(n - 2)$ and $P[1/\chi^2_{2k+n}] = 1/(n + 2k - 2)$; the sum over $k$ is strictly negative. Thus $Yg(S)$ beats $Y$. □

Note. $Yg(S)$ shrinks the estimate towards 0, but the same result holds if it is shrunk towards any other point $Z$ by $Z + (Y - Z)g(S_Z)$, where $S_Z = \sum(Y_i - Z_i)^2$; or shrunk towards $\bar Y$, with $f < 2(n - 3)$.

9.2. Bayes Estimates Beating the Straight Estimate

Theorem. Suppose $Y_i \sim N(X_i, \sigma^2)$, $i = 1, 2, \ldots, n$, $\sigma^2$ known. Let the prior distribution for $X_i$ be $X_i \sim N(0, \sigma_0^2)$ independently given $\sigma_0^2$, where $\sigma_0^2$ has a density $g$ such that $\log g$ is concave in $\log(\sigma^2 + \sigma_0^2)$ and $(\sigma^2 + \sigma_0^2)^{1-\alpha/2}g$ is increasing for some $\alpha$. The posterior mean of $X_i$ given $Y$ has smaller mean square error as an estimate of $X_i$ than $Y_i$ for every choice of $X_1, \ldots, X_n$, whenever $n \ge 4 - \alpha$.
PROOF. First fixing $\sigma_0^2$,
$$P[X_i|Y] = Y_i\Big/\Big(1 + \frac{\sigma^2}{\sigma_0^2}\Big), \qquad Y_i \sim N(0, \sigma^2 + \sigma_0^2) \text{ independently.}$$
The posterior density of $\sigma_0^2$ given $Y_1, \ldots, Y_n$ is
$$g[\sigma_0^2|Y] \propto [\sigma^2 + \sigma_0^2]^{-n/2}\exp\big[-\tfrac12\textstyle\sum Y_i^2/(\sigma^2 + \sigma_0^2)\big]\,g(\sigma^2 + \sigma_0^2).$$
Letting $S = \sum Y_i^2$, $V = S/(\sigma^2 + \sigma_0^2)$,
$$P[V|Y] = \int V^{n/2-1}\exp(-\tfrac12 V)g(S/V)\,dV\Big/\int V^{n/2-2}\exp(-\tfrac12 V)g(S/V)\,dV = P\big[\chi^2_{n-2}\,g(S/\chi^2_{n-2})\big]\big/P\big[g(S/\chi^2_{n-2})\big].$$
Now
$$P[X_i|Y] = Y_iP\Big[\Big(1 + \frac{\sigma^2}{\sigma_0^2}\Big)^{-1}\Big|Y\Big] = Y_i\big[1 - P(V|Y)\sigma^2/S\big].$$
From Theorem 9.1, the estimate $P[X_i|Y]$ will beat $Y_i$ if $P(V|S) = P(V|Y)$ is a non-negative, non-decreasing function of $S$ such that $P(V|S) < 2(n - 2)$. It is obviously non-negative. Let $k(V) = V^{n/2-2}\exp(-\tfrac12 V)$. For $S > S'$,
$$P[V|S]/P[V|S'] = \iint Vk(V)g(S/V)\,k(U)g(S'/U)\,dV\,dU\Big/\iint Uk(U)g(S'/U)\,k(V)g(S/V)\,dV\,dU \ge 1$$
if
$$\iint (V - U)k(V)k(U)\big[g(S/V)g(S'/U) - g(S/U)g(S'/V)\big]\,dU\,dV \ge 0.$$
Since $\log g$ is concave in $\log(\sigma^2 + \sigma_0^2)$, $g(S/V)g(S'/U) - g(S/U)g(S'/V) \ge 0$ for $S \ge S'$, $V \ge U$. Thus $P[V|S]$ is increasing in $S$. Since $(\sigma^2 + \sigma_0^2)^{1-\alpha/2}g = h$ is increasing,
$$P[V|S] = \int V^{(n-\alpha)/2}\exp(-\tfrac12 V)h(S/V)\,dV\Big/\int V^{(n-\alpha)/2-1}\exp(-\tfrac12 V)h(S/V)\,dV$$
$$= P\big[\chi^2_{n-\alpha}\,h(S/\chi^2_{n-\alpha})\big]\big/P\big[h(S/\chi^2_{n-\alpha})\big] \le P(\chi^2_{n-\alpha}) = n - \alpha \le 2(n - 2) \quad\text{if } n \ge 4 - \alpha.$$
Thus $P[V|S]$ satisfies the conditions of Theorem 9.1 and the theorem is proved. □

Note. Priors of the above type will be unitary only if $\alpha < 0$, so that for a unitary Bayes estimate $n \ge 5$ is required to beat the straight estimate. For $\alpha < 2$, the loss $\sum(X_i - P[X_i|Y])^2$ is integrable, so the posterior mean is Bayes and hence admissible; a Bayes estimate may thus be obtained for $n \ge 3$. Strawderman (1971) considers the densities $g(\sigma_0^2) \propto (\sigma_0^2 + \sigma^2)^{\alpha/2-1}$; then $V$ given $S$ is distributed as $\chi^2_{n-\alpha}$ constrained not to exceed $S/\sigma^2$. The particular choice $\alpha = 0$ is suggested by Jeffreys (1961). James and Stein (1961) showed that the estimates $Y_i\big(1 - (n - 2)\sigma^2/\sum Y_j^2\big)$ beat $Y_i$ whenever $n > 2$; these estimates are not admissible. This estimate may be justified by noting that $P\big[(n - 2)/\sum Y_j^2\big] = 1/(\sigma^2 + \sigma_0^2)$ under the conditions of the theorem, so that the shrinking factor is estimated unbiasedly.
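The unbiasedness claim behind the James-Stein factor is easy to check by simulation. A minimal sketch, not from the text; $n$, $\sigma^2$, and $\sigma_0^2$ are arbitrary illustration values:

```python
import random

random.seed(0)

n, sigma2, s0sq = 8, 1.0, 3.0    # illustration values only
reps = 50_000

# Under the conditions of the theorem the Y_j are N(0, sigma^2 + sigma_0^2),
# so sum Y_j^2 is (sigma^2 + sigma_0^2) * chi^2_n and
# P[(n - 2)/sum Y_j^2] = 1/(sigma^2 + sigma_0^2) exactly.
acc = 0.0
sd = (sigma2 + s0sq) ** 0.5
for _ in range(reps):
    acc += (n - 2) / sum(random.gauss(0, sd) ** 2 for _ in range(n))

print(acc / reps)    # close to 1/(sigma^2 + sigma_0^2) = 0.25
```

The Monte Carlo average settles near $1/(\sigma^2 + \sigma_0^2)$ because $\sum Y_j^2/(\sigma^2 + \sigma_0^2) \sim \chi^2_n$ and $P[1/\chi^2_n] = 1/(n - 2)$.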
The Bayes estimates, in contrast, shrink rather less when $S$ is small than when it is large; for large $S$, the shrinking factor $P[V|S]$ will be close to $(n - \alpha)$; for small $S$, it will be close to zero. If $(n - \alpha)$ is even,
$$P[V|S] = (n - \alpha)\,P[Z > (n - \alpha)/2]\big/P[Z \ge (n - \alpha)/2]$$
where $Z$ is Poisson with expectation $\tfrac12 S/\sigma^2$. For example, for $n = 3$, $\alpha = 1$,
$$P[V|S] = 2\big[1 - e^{-S/2\sigma^2}(1 + \tfrac12 S/\sigma^2)\big]\big/\big[1 - e^{-S/2\sigma^2}\big],$$
and the estimate is $Y_i\big[1/(1 - \exp(-\tfrac12 S/\sigma^2)) - 2\sigma^2/S\big]$. Let $\sigma = 1$, and consider the sample $Y_1 = 1.2$, $Y_2 = -0.6$, $Y_3 = 0.8$. Then $S = 2.44$, the shrinking factor is .6, and the new estimates are .72, $-$.36, .48.

9.3. Shrinking towards the Mean

Lindley and Smith (1972) use the prior $X_i \sim N(\theta_0, \sigma_0^2)$, independently for $i = 1, 2, \ldots, n$ given $\theta_0$, and $\theta_0 \sim N(0, \tau^2)$. For the moment $\sigma_0^2$ and $\tau^2$ will be assumed known. Then
$$P[X_i|Y, \theta_0] = \big[Y_i/\sigma^2 + \theta_0/\sigma_0^2\big]\big/\big(1/\sigma^2 + 1/\sigma_0^2\big)$$
and $Y_i \sim N[\theta_0, \sigma^2 + \sigma_0^2]$ independently for $i = 1, 2, \ldots, n$. If $\tau = \infty$,
$$P[X_i|Y, \sigma_0^2] = \bar Y + (Y_i - \bar Y)\Big(1 - \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\Big).$$
If $\sigma_0^2$ has prior density $(\sigma^2 + \sigma_0^2)^{\alpha/2-1}$,
$$P\Big[\sum(Y_i - \bar Y)^2/(\sigma^2 + \sigma_0^2)\,\Big|\,Y\Big] = P\Big[\chi^2_{n-\alpha-1}\,\Big|\,\chi^2_{n-\alpha-1} < \sum(Y_i - \bar Y)^2/\sigma^2\Big]$$
and the estimate beats $Y_i$ if $n \ge 5 - \alpha$, from Theorem 9.1. Shrinking towards the mean rather than towards an arbitrary constant loses a degree of freedom, but doesn't change the basic arguments.

9.4. A Random Sample of Means

Suppose $Y_i \sim N(X_i, 1)$ independently, and the $X_i$ are a random sample from some prior $P_0$. The $Y_i$ are then a random sample from the density $g$,
$$g(y) = P_0\big[\exp(-\tfrac12(X - y)^2)/\sqrt{2\pi}\big].$$
Assuming first that $g$ is known, the posterior mean of $X_i$ given $Y_i$ is
$$P_0\big[X\exp(-\tfrac12(X - Y_i)^2)\big]\big/P_0\big[\exp(-\tfrac12(X - Y_i)^2)\big] = Y_i + (d/dY_i)\log g.$$
If $g$ is not known, it is necessary to place a prior distribution on it so that the posterior expectation of the "correction" $(d/dY)\log g$ may be computed.
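The identity between the posterior mean and the corrected observation can be checked numerically. A sketch, not from the text, taking $P_0$ normal so that $g$ and its log derivative have closed forms; the values of $\sigma_0^2$ and $y$ are arbitrary:

```python
import math

s0sq, y = 2.0, 1.3      # illustration values only

def phi(z, var):
    return math.exp(-0.5 * z * z / var) / math.sqrt(2 * math.pi * var)

# Posterior mean P_0[X exp(-(X - y)^2/2)] / P_0[exp(-(X - y)^2/2)] by
# quadrature, taking P_0 = N(0, s0sq).
h = 0.001
xs = [i * h for i in range(-20000, 20001)]
num = sum(x * phi(x, s0sq) * phi(x - y, 1.0) * h for x in xs)
den = sum(phi(x, s0sq) * phi(x - y, 1.0) * h for x in xs)

# Here g = N(0, 1 + s0sq), so the correction is (d/dy) log g = -y/(1 + s0sq).
corrected = y - y / (1 + s0sq)
assert abs(num / den - corrected) < 1e-6
```

Both routes give the familiar normal-prior shrinkage $y\,\sigma_0^2/(1 + \sigma_0^2)$; the point of the correction form is that it needs only $g$, not $P_0$.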
An "empirical Bayes" approach permits estimation of $g$ by any method, not necessarily a Bayesian method. It is known that $Y_1, \ldots, Y_n$ form a random sample from $g$, so a density estimation technique might be used to estimate $(d/dY)\log g$. For example, if $\log g$ has a continuous first derivative at $y$,
$$P\big[Y - y\,\big|\,|Y - y| < \varepsilon\big]\big/P\big[(Y - y)^2\,\big|\,|Y - y| < \varepsilon\big] \to \frac{d}{dy}\log g \quad\text{as } \varepsilon \to 0.$$
Thus $(d/dy)\log g$ may be estimated by
$$\sum_{|Y_i - y| < \varepsilon}(Y_i - y)\Big/\sum_{|Y_i - y| < \varepsilon}(Y_i - y)^2$$
for sensibly selected $\varepsilon$. (For $\varepsilon$ large enough to include all data values, the estimate will be similar to the James-Stein estimate of 9.2.) The estimate $Y_i$ will be replaced by an estimate closer to the mean of those observations near $Y_i$.

9.5. When Most of the Means Are Small

In 9.2 and 9.3, $g$ is normal with mean 0 and unknown variance, and a prior distribution is placed on the variance. In many regression and analysis of variance problems, most of the means $X_i$ are very close to zero, but a few are quite large. Such a situation is not well represented by a normal $g$, because it is not sufficiently long tailed. One alternative is to assume that $X_i$ comes from a distribution $p\delta_0 + (1 - p)N(0, \sigma_0^2)$ where $\delta_0\{0\} = 1$. Then $Y_i$ is a random sample from $pN(0, 1) + (1 - p)N(0, \sigma_0^2 + 1)$,
$$g(y) = \frac{1}{\sqrt{2\pi}}\Big\{p\exp(-\tfrac12 y^2) + \frac{1 - p}{\sqrt{1 + \sigma_0^2}}\exp\big(-\tfrac12 y^2/(1 + \sigma_0^2)\big)\Big\},$$
$$\frac{d}{dy}\log g(y) = -y\;\frac{p\exp(-\tfrac12 y^2) + \dfrac{1 - p}{(1 + \sigma_0^2)^{3/2}}\exp\big(-\tfrac12 y^2/(1 + \sigma_0^2)\big)}{p\exp(-\tfrac12 y^2) + \dfrac{1 - p}{(1 + \sigma_0^2)^{1/2}}\exp\big(-\tfrac12 y^2/(1 + \sigma_0^2)\big)}.$$
If $y$ is small, the adjustment is close to $-y$; if $y$ is large it is close to $-y/(1 + \sigma_0^2)$; in this way the small observed values $Y_i$ are moved very close to zero, but the large observed values $Y_i$ are relatively unchanged. In practice $p$ and $\sigma_0^2$ must be estimated from the $Y_i$. A Bayesian approach requires computation of the posterior mean of $(d/dy)\log g(y)$, but no prior on $p$ and $\sigma_0$ is known which permits explicit computation.
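The two regimes of the adjustment are visible directly; a minimal sketch, not from the text, evaluating $(d/dy)\log g$ for the two-component prior with illustrative values of $p$ and $\sigma_0^2$:

```python
import math

p, s0sq = 0.9, 25.0     # illustration values only

def dlog_g(y):
    """(d/dy) log g for the prior p*delta_0 + (1 - p)*N(0, s0sq)."""
    a = p * math.exp(-0.5 * y * y)
    b = (1 - p) * math.exp(-0.5 * y * y / (1 + s0sq))
    num = a + b * (1 + s0sq) ** -1.5
    den = a + b * (1 + s0sq) ** -0.5
    return -y * num / den

for y in (0.5, 8.0):
    print(y, y + dlog_g(y))   # small y pulled near 0; large y barely moved
```

For $y = 0.5$ the adjusted value is nearly zero; for $y = 8$ it is close to $y\,\sigma_0^2/(1 + \sigma_0^2)$, the ordinary normal-prior shrinkage.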
It is straightforward computationally to estimate $p$ and $\sigma_0^2$ to maximize the likelihood of the observations, but explicit expressions are not available, and it is not known whether the resulting estimates of the $X_i$ beat the straight estimates.

A standard approach to the problem of many small means is to carry out a significance test on each mean separately, and to set to zero all those means which do not exceed some significance level. Here, the estimate would be $\hat X_i = Y_i\{|Y_i| \ge c\}$, where $c$ is the cutoff point in the significance test. Then
$$\sum P(Y_i - X_i)^2 - \sum P(\hat X_i - X_i)^2 = \sum P\big(\{|Y_i| < c\}(Y_i^2 - 2Y_iX_i)\big) = \sum P\big(\{|Z_i + X_i| < c\}(Z_i^2 - X_i^2)\big),$$
where $Z_i \sim N(0, 1)$. If $|X| > 1$, $P\big(\{|Z + X| < c\}(Z^2 - X^2)\big) < 0$ for every choice of $c$. Thus there is no way to choose $c$ so that the estimates $\hat X_i$ have uniformly smaller mean square error than $Y_i$; it does not help to allow $c$ to depend on the $Y_i$. Yet there is practical value in setting many small means to be exactly zero if there is no evidence of significant departure from zero.

Suppose the loss function is $L(d, s) = \{d \ne s\} + k(d - s)^2$. Let $P_0$ be a unitary probability on $S$ which has an atom $P_0\{s_0\}$ only at $s_0$. Then the probable loss for $d$ is
$$P_0\{d \ne S\} + kP_0(d - S)^2 = 1 - \{d = s_0\}P_0\{s_0\} + k(d - P_0S)^2 + kP_0(S - P_0S)^2.$$
The Bayes decision is $d = s_0$ if $P_0\{s_0\} > k(s_0 - P_0S)^2$, and $d = P_0S$ otherwise.

If $Y_i \sim N(X_i, 1)$ independently, where the $X_i$ are sampled from $p\delta_0 + (1 - p)N(0, \sigma_0^2)$, then the posterior distribution of $X_i$ given $Y_i$ is
$$p_{Y_i}\,\delta_0 + (1 - p_{Y_i})\,N\Big[\frac{Y_i}{1 + 1/\sigma_0^2},\ \frac{1}{1 + 1/\sigma_0^2}\Big],$$
where
$$p_{Y_i} = \Big\{1 + \frac{1 - p}{p}(1 + \sigma_0^2)^{-1/2}\exp\Big(\tfrac12 Y_i^2\,\frac{\sigma_0^2}{1 + \sigma_0^2}\Big)\Big\}^{-1}$$
is the posterior probability that $Y_i$ came from the $\delta_0$ component. The Bayes estimate will be $\hat X_i = 0$ if
$$k\Big[(1 - p_{Y_i})\,\frac{Y_i}{1 + 1/\sigma_0^2}\Big]^2 < p_{Y_i},$$
and $\hat X_i = (1 - p_{Y_i})\,Y_i/(1 + 1/\sigma_0^2)$ otherwise.

9.6. Multivariate Means

Let $Y \sim N(X, \Sigma)$, $X \sim N(0, k\Sigma_0)$, where $Y$ and $X$ are $n$-dimensional vectors, $\Sigma$ and $\Sigma_0$ are known covariance matrices, and $k$ is unknown.
By a linear transformation applied to $Y$ and $X$, this case may be reduced to $Y_i \sim N(X_i, v_i^2)$, $X_i \sim N(0, v_0^2)$, where $v_0^2$ is unknown and the distributions are independent for different $i$. See Efron and Morris (1973). Given $v_0^2$,
$$P(X_i|Y) = Y_i\big(1 - v_i^2/(v_0^2 + v_i^2)\big).$$
A Bayes procedure for a prior density $f(v_0^2)$ on $v_0^2$ would use
$$P\Big[\frac{1}{v_0^2 + v_i^2}\Big|Y\Big] = \frac{\int (v_0^2 + v_i^2)^{-1}\prod_j(v_0^2 + v_j^2)^{-1/2}\exp\big(-\tfrac12\sum_jY_j^2/(v_0^2 + v_j^2)\big)\,f\,dv_0^2}{\int \prod_j(v_0^2 + v_j^2)^{-1/2}\exp\big(-\tfrac12\sum_jY_j^2/(v_0^2 + v_j^2)\big)\,f\,dv_0^2},$$
but no magical $f$ exists that permits a simple explicit computation. As a practical matter, taking a uniform discrete prior on $v_0^2$, over a generous range in 100 steps, should give a reasonable Bayes estimate of $1/(v_0^2 + v_i^2)$. [For a continuous $f$, the above integrals will have to be approximated as if the prior were discrete, anyway.]

A simple alternative to a Bayes procedure uses $P[Y_i^2|v_0^2] = v_0^2 + v_i^2$, $P[\sum Y_i^2|v_0^2] = nv_0^2 + \sum v_i^2$, so that $v_0^2$ is estimated unbiasedly by $\sum(Y_i^2 - v_i^2)/n$. This estimate is sometimes embarrassed by being negative, and may not lead to a good estimate of $1/(v_0^2 + v_i^2)$. A slightly better non-Bayesian method is maximum likelihood, which finds $v_0^2$ to maximize $-\sum\log(v_i^2 + v_0^2) - \sum Y_i^2/(v_i^2 + v_0^2)$. The maximum value occurs at 0, or at a solution of the equation $\sum(Y_i^2 - v_i^2 - v_0^2)/(v_i^2 + v_0^2)^2 = 0$; thus $v_0^2$ is a weighted average of the $Y_i^2 - v_i^2$ with weights inversely proportional to the variances of the $Y_i^2$ given $v_0^2$. However the solution may not be unique, and checking a spectrum of $v_0^2$ values is about as difficult as doing a Bayes approximate integration.

The above procedures are not known to be uniformly better than the straight estimates $Y_i$, which have sum of squared error loss $\sum v_i^2$. Given $v_0^2$, the loss of the Bayes estimate is
$$\sum\Big(\frac{v_i^2v_0^4}{(v_0^2 + v_i^2)^2} + \frac{v_i^4X_i^2}{(v_0^2 + v_i^2)^2}\Big),$$
which is less than $\sum v_i^2$ if
$$\sum\frac{v_i^4X_i^2}{(v_0^2 + v_i^2)^2} < \sum\frac{v_i^4(2v_0^2 + v_i^2)}{(v_0^2 + v_i^2)^2}.$$
This condition is analogous to one given in 9.3; it will always be satisfied for $v_0^2$ large enough. This suggests that an estimate beating $Y_i$ might be obtained by overestimating $v_0^2$.
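The discrete-prior suggestion is easy to carry out. A minimal sketch, not from the text; the data, the known variances, and the grid are all arbitrary illustration values:

```python
import math

v2 = [1.0, 1.0, 2.0, 0.5, 1.5]           # known sampling variances v_i^2
Y = [0.9, -1.4, 2.2, 0.3, -0.7]          # observations (illustration only)

grid = [g * 0.1 for g in range(1, 101)]  # uniform discrete prior on v_0^2

def log_lik(v0sq):
    return sum(-0.5 * math.log(v0sq + v) - 0.5 * y * y / (v0sq + v)
               for y, v in zip(Y, v2))

w = [math.exp(log_lik(g)) for g in grid]
W = sum(w)

# Posterior expectation of 1/(v_0^2 + v_i^2), then the shrunken estimates
est = [y * (1 - v * sum(wg / (g + v) for wg, g in zip(w, grid)) / W)
       for y, v in zip(Y, v2)]
print([round(e, 2) for e in est])
```

Each estimate is shrunk towards zero by a factor strictly between 0 and 1, since the posterior expectation of $1/(v_0^2 + v_i^2)$ is always below $1/v_i^2$.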
Of course, if loss is measured by $P\big[\sum(\hat X_i - X_i)^2/v_i^2\,\big|\,X\big]$, the problem may be transformed to one in which all the $v_i^2$ are equal, and Stein's estimate and unitary Bayes estimates exist beating $Y_i$. Brown (1966) shows the estimate $Y_i$ to be inadmissible for a large class of loss functions; better estimators are given by Brandwein and Strawderman (1978).

9.7. Regression

Suppose $Y|X \sim N(AX, \sigma^2I_n)$, $X \sim N(0, \sigma_0^2I_p)$, where $Y$ is $n \times 1$, $A$ is $n \times p$, $X$ is $p \times 1$, and $I_n$ denotes an $n \times n$ unit matrix. Then
$$X|Y \sim N\big[(A'A/\sigma^2 + I/\sigma_0^2)^{-1}A'Y/\sigma^2,\ (A'A/\sigma^2 + I/\sigma_0^2)^{-1}\big].$$
The estimate $P(X|Y)$ of $X$ is often advocated for purely computational reasons, to guard against singularity or near-singularity of $A'A$, Hoerl and Kennard (1970). As in 9.6, it is difficult to estimate $\sigma_0^2$ by a simple Bayes procedure. It is tempting to use the unbiased estimate
$$\hat\sigma_0^2 = \big[Y'A(A'A)^{-1}A'Y - p\sigma^2\big]/\mathrm{trace}(A'A),$$
but this is dangerous because it might be negative. The maximum likelihood estimate for $\sigma_0^2$, assuming $\sigma^2$ known, minimizes
$$\log\big|\sigma_0^2AA' + \sigma^2I\big| + Y'(\sigma_0^2AA' + \sigma^2I)^{-1}Y.$$
This looks nasty, but when $AA'$ is diagonalized, it reduces to the likelihood expression in 9.6. See Lindley and Smith (1972).

9.8. Many Means, Unknown Variance

Let $Y_i|X_i \sim N(X_i, \sigma^2)$, $X_i \sim N(0, \sigma_0^2)$, $i = 1, 2, \ldots, n$, independently for each $i$, and suppose there is an independent estimate of variance $S$, with $S \sim \sigma^2\chi_k^2$. Such a situation arises in regression problems. Given $\sigma^2$ and $\sigma_0^2$,
$$P[X_i|Y] = Y_i\Big[1 - \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\Big].$$
The density of $S$, $\sum Y_i^2$ given $\sigma^2$ and $\sigma_0^2$ is proportional to
$$S^{k/2-1}(\sigma^2)^{-k/2}\exp(-\tfrac12 S/\sigma^2)\,\big(\textstyle\sum Y_i^2\big)^{n/2-1}(\sigma^2 + \sigma_0^2)^{-n/2}\exp\big[-\tfrac12\textstyle\sum Y_i^2/(\sigma^2 + \sigma_0^2)\big].$$
We may estimate $\sigma^2$ and $\sigma_0^2$ unbiasedly by solving $S = k\sigma^2$, $\sum Y_i^2 = n(\sigma^2 + \sigma_0^2)$, but it is better to estimate the coefficient $\sigma^2/(\sigma^2 + \sigma_0^2)$ unbiasedly by $\dfrac{n-2}{k+2}\,S\big/\sum Y_i^2$; the estimator $Y_i\Big[1 - \dfrac{n-2}{k+2}\,S\big/\sum Y_i^2\Big]$ beats $Y_i$, from Baranchik (1971).
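The improvement is easy to check by simulation. A quick Monte Carlo sketch, not from the text; $n$, $k$, and the true means are arbitrary (the result holds for every choice):

```python
import random

random.seed(1)

n, k, sigma = 10, 8, 1.0
X = [2.0] * n                        # arbitrary true means
reps = 20_000
risk_straight = risk_shrunk = 0.0

for _ in range(reps):
    Y = [x + random.gauss(0, sigma) for x in X]
    S = sigma**2 * sum(random.gauss(0, 1)**2 for _ in range(k))  # S ~ sigma^2 chi^2_k
    c = 1 - (n - 2) / (k + 2) * S / sum(y * y for y in Y)
    risk_straight += sum((y - x)**2 for y, x in zip(Y, X))
    risk_shrunk += sum((c * y - x)**2 for y, x in zip(Y, X))

print(risk_shrunk < risk_straight)   # True
```

The averaged squared error of the shrunken estimates falls below that of the straight estimates, here by roughly ten percent.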
Even so the estimator can occasionally give foolish results, with the coefficient of $Y_i$ negative. A maximum likelihood procedure gives the same results as the unbiased method $S = k\sigma^2$, $\sum Y_i^2 = n(\sigma^2 + \sigma_0^2)$, except when $\sigma_0^2$ is estimated negative; in that case $\sigma_0^2$ is estimated to be zero, and $\sigma^2$ is estimated by $(S + \sum Y_i^2)/(n + k)$.

For the prior density $(\sigma^2 + \sigma_0^2)^{\alpha/2-1}(\sigma^2)^{\beta/2-1}$, from 9.2, $\sigma^{-2} \sim \chi^2_{k-\beta}/S$ and $(\sigma^2 + \sigma_0^2)^{-1} \sim \chi^2_{n-\alpha}/\sum Y_i^2$, where the $\chi^2_{k-\beta}$, $\chi^2_{n-\alpha}$ are sampled from independent chi-squares but accepted only if $\sigma^{-2} \ge (\sigma^2 + \sigma_0^2)^{-1}$. Thus $\sigma^2/(\sigma^2 + \sigma_0^2) \sim (S/\sum Y_i^2)\,\chi^2_{n-\alpha}/\chi^2_{k-\beta}$ constrained not to exceed 1, and
$$P\big[\sigma^2/(\sigma^2 + \sigma_0^2)\,\big|\,Y\big] = (S/\textstyle\sum Y_i^2)\,P\big[\chi^2_{n-\alpha}/\chi^2_{k-\beta}\,\big|\,\chi^2_{n-\alpha}/\chi^2_{k-\beta} \le \textstyle\sum Y_i^2/S\big].$$
The computation is an incomplete beta integral. From Baranchik (1971), the estimator $Y_i\big[1 - (S/\sum Y_i^2)\,r(\sum Y_i^2/S)\big]$ beats $Y_i$ if $r$ is non-decreasing and $r \le 2(n - 2)/(k + 2)$. Here $r$ is obviously non-decreasing, and $r \le P[\chi^2_{n-\alpha}/\chi^2_{k-\beta}] = (n - \alpha)/(k - \beta - 2)$. Thus the posterior mean beats $Y_i$ if $(n - \alpha)/(k - \beta - 2) \le 2(n - 2)/(k + 2)$. (These estimates are not Bayes because the loss is not integrable.) For example, when $\alpha = 0$, $\beta = 0$, the condition requires $n/(k - 2) \le 2(n - 2)/(k + 2)$, which fails for small $n$ and $k$; when $\alpha = 2$, $\beta = -4$, it is satisfied for $n \ge 3$ and every $k$.

9.9. Variance Components, One Way Analysis of Variance

Suppose that a number of normal samples estimate the means $X_1, \ldots, X_n$; for the $j$th sample, $Y_{ij} \sim N(X_j, \sigma^2)$, $i = 1, \ldots, m$. The $X_j$'s are assumed to be sampled from $N(X, \sigma_0^2)$. Finally $X \sim N(0, \sigma_1^2)$. Since $\bar Y_j \sim N[X_j, \sigma^2/m]$, this is essentially the same situation considered in 9.3. Given $\sigma_1^2$, $\sigma^2$, $\sigma_0^2$ there will be posterior mean estimates of the $X_j$. In practice, it is necessary to estimate the "variance components" $\sigma_1^2$, $\sigma^2$, $\sigma_0^2$ somehow, and they are of interest in themselves, to indicate how important between group effects (represented by $\sigma_0^2$) and within group effects (represented by $\sigma^2$) are. Here
$$\sum_j\sum_i(Y_{ij} - \bar Y_j)^2 \sim \sigma^2\chi^2_{n(m-1)}, \qquad \sum_j(\bar Y_j - \bar Y)^2 \sim (\sigma^2/m + \sigma_0^2)\chi^2_{n-1}$$
independently. (The distributions are not so simple if there are unequal numbers in the different samples.) Unbiased estimates of $\sigma^2$, $\sigma_0^2$ and $\sigma_1^2$ may be easily constructed from linear combinations of the sums of squares in $Y$, but the estimates are inadmissible because they may be negative. Maximum likelihood gives the same estimates if the solutions to the equations are positive. For a prior uniform in $\log\sigma^2$, $\log(\sigma^2/m + \sigma_0^2)$ and $\log(\sigma^2/mn + \sigma_0^2/n + \sigma_1^2)$, the posterior distribution of $\sigma^2$, $\sigma^2/m + \sigma_0^2$, $\sigma^2/mn + \sigma_0^2/n + \sigma_1^2$ is that of
$$\sum\sum(Y_{ij} - \bar Y_j)^2/\chi^2_{n(m-1)}, \qquad \sum(\bar Y_j - \bar Y)^2/\chi^2_{n-1}, \qquad \bar Y^2/\chi^2_1,$$
where the chi-squares are taken independently, but accepted only if the appropriate inequalities hold between the three variables. Computation of posterior means would require formidable numerical integrations in three dimensions. Similar considerations arise in estimating variance components for more complicated analysis of variance models.

9.10. Problems

P1. A surveyor, poor but honest, measures the three angles $(\theta_1, \theta_2, \theta_3)$ of a triangle with independent errors $N(0, 1)$. The measured angles are $\theta_1 = 63°$, $\theta_2 = 31°$, $\theta_3 = 92°$. For a suitable prior on $\theta$, find the posterior distribution of each of $\theta_1, \theta_2, \theta_3$ given the data. [The true values should add to 180°.]

P2. In football, the scoring difference between team $i$ and team $j$ is distributed as $N[\mu_i - \mu_j, \sigma^2]$. The prior distributions at the beginning of a season are, independently,

Yale: $\mu_1 \sim N(0, \sigma^2)$
Harvard: $\mu_2 \sim N(0, \sigma^2)$
Princeton: $\mu_3 \sim N(0, \sigma^2)$
Dartmouth: $\mu_4 \sim N(6, \sigma^2)$

Game scores are: Harvard 13, Princeton 6; Princeton 27, Dartmouth 20; Princeton 21, Yale 3. Compute the probability that Harvard will beat Yale, given the observed scores.

P3. For $Y_i \sim N(X_i, 1)$ independent, $X_i \sim N(X_0, \sigma_0^2)$ independent, assume $g(X_0, \sigma_0^2) = 1/(\sigma_0^2 + 1)$. Find the posterior mean of $X_i$ given $Y_1, \ldots, Y_n$.

P4.
Votes for the Democratic candidate for President:

South: 21, 29, 30, 27, 21
Central: 43, 47, 42
New England: 61, 62, 65

Construct a model II analysis of variance, and estimate the variance components, using unbiased estimates and Bayes estimates.

P5. Show that the following estimate in the Stein problem, $Y_i \sim N(\theta_i, 1)$, is Bayes and beats $Y_i$; show that the multiplier is never negative.

9.11. References

Baranchik, A. J. (1970), A family of minimax estimators of the mean of a multivariate normal distribution, Ann. Math. Statist. 41, 642-645.
Brandwein, A. R. and Strawderman, W. E. (1978), Minimax estimation of location parameters for spherically symmetric unimodal distributions under quadratic loss, Annals of Statistics 6, 377-416.
Brown, L. D. (1966), On the admissibility of invariant estimators of one or more location parameters, Ann. Math. Statist. 37, 1083-1136.
Efron, B. and Morris, C. (1973), Stein's estimation rule and its competitors: an empirical Bayes approach, J. Am. Stat. Ass. 68, 117-130.
Hoerl, A. E. and Kennard, R. W. (1970), Ridge regression: biased estimation for non-orthogonal problems, Technometrics 12, 69-82.
James, W. and Stein, C. (1961), Estimation with quadratic loss, Proc. Fourth Berkeley Symposium, University of California Press, 1, 361-379.
Jeffreys, H. (1961), Theory of Probability, Cambridge University Press, Cambridge.
Lindley, D. V. and Smith, A. F. M. (1972), Bayes estimates for the linear model, J. Roy. Stat. Soc. B 34, 1-41.
Stein, C. (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal population, Proc. Third Berkeley Symposium 1, 197-206.
Strawderman, W. (1971), Proper Bayes minimax estimators of the multivariate normal mean, Ann. Math. Statist. 42, 385-388.

CHAPTER 10
The Multinomial Distribution

10.0. Introduction

A discrete random variable $X$ takes values $i = 1, 2, \ldots, k$ with probabilities $\{p_i, i = 1, 2, \ldots, k\}$. A sample of size $n$ from $X$ gives the value $X = i$, $n_i$ times. The multivariate distribution $\{n_i, i = 1, \ldots$
, $k\}$ is multinomial with parameters $n$, $\{p_i, i = 1, \ldots, k\}$. It is ubiquitous in problems dealing with discrete data. The values $1, 2, \ldots, k$ are called categories or cells. If $(n_1, n_2, \ldots, n_k)$ is multinomial $n$, $\{p_i\}$, then $(n_1 + n_2, n_3, \ldots, n_k)$ is multinomial $n$, $(p_1 + p_2, p_3, \ldots, p_k)$; and $n_1, n_2, \ldots, n_j$ given $\sum_{i=1}^j n_i$ is multinomial $\sum_{i=1}^j n_i$, $\{p_i/\sum_{i'=1}^j p_{i'}, i = 1, \ldots, j\}$. The multinomial is obtained from $k$ independent Poissons $n_i$ with expectations $\lambda_i$: the distribution of $n_1, \ldots, n_k$ given $n = \sum n_i$ is multinomial with parameters $n$, $\{\lambda_i/\sum_{j=1}^k\lambda_j, i = 1, \ldots, k\}$. This fact is very convenient in formulating models and handling computations, because the Poisson $n_i$ are independent.

In general, the interesting problems in asymptotics and decision theory arise when some of the $p_i$ are small. For example, the usual maximum likelihood estimates of $p_i$ are inadmissible under the normalized loss $\sum(d_i - p_i)^2/p_i$, $i = 1, \ldots, k$ (compare 10.3). Standard families of prior distributions exist for the multinomial, but they don't work too well for many parameter problems. It is necessary to incorporate expected similarities between the $p_i$'s into the prior for many parameter problems.

10.1. Dirichlet Priors

The multinomial $X$ taking values $i = 1, 2, \ldots, k$ with probabilities $p_i$,
$$p(X) = \prod p_i^{\{X=i\}} = \exp\Big[\sum_{i=1}^{k-1}\{X = i\}\log(p_i/p_k) + \log p_k\Big],$$
is exponential $E[\mu, T, \theta]$, where $T$ is the vector $\{X = i\}$, $i = 1, \ldots, k - 1$, $\theta$ is the vector $\log(p_i/p_k)$, $i = 1, \ldots, k - 1$, and $\mu$ is counting measure on $1, 2, \ldots, k$. Because of the asymmetry of this parameterization, it is often convenient to think of the multinomial as $E\{\mu, [\{X = i\}, i = 1, \ldots, k], [\log p_i, i = 1, \ldots, k]\}$, where the parameters $\log p_i$ are constrained to lie in a $(k - 1)$-dimensional subset ($\sum p_i = 1$) of $R^k$. For $n$ observations, with $n_i = \sum_{j=1}^n\{X_j = i\}$,
$$p[X_1, \ldots, X_n] = \prod p_i^{n_i}.$$
If $\{p_i\}$ is uniformly distributed over the simplex $p_i \ge 0$, $\sum p_i = 1$, the posterior density given $n_1, \ldots$
, $n_k$ is $\prod p_i^{n_i}\,\big((n + k - 1)!/\prod n_i!\big)$. More generally, the Dirichlet density on $\{p_i\}$, with respect to the uniform $\mu$ over $p_i \ge 0$, $\sum p_i = 1$, is $d_\alpha(p) = \Gamma(\sum\alpha_i)\prod p_i^{\alpha_i-1}/\prod\Gamma(\alpha_i)$; the Dirichlet probability is $D_\alpha = E[\mu, \{\log p_i\}, \{\alpha_i - 1\}]$. The Dirichlet generalizes the beta to many dimensions. If $U_i$ are independent gamma with densities $\propto u_i^{\alpha_i-1}\exp(-u_i)$, then $\{U_i/\sum_{j=1}^kU_j\}$ is Dirichlet $D_\alpha$ (similarly to the multinomial being independent Poissons $n_1, \ldots, n_k$ conditioned by $\sum n_i = n$). If the prior density is $d_\alpha$, then the posterior given $n_1, \ldots, n_k$ is $d_{\alpha+n}$.
$$D_\alpha[p] = \alpha\big/\textstyle\sum\alpha_i, \qquad D_\alpha[pp'] = \big(\alpha\alpha' + \mathrm{diag}[\alpha_1, \ldots, \alpha_k]\big)\Big/\textstyle\sum\alpha_i\big(\sum\alpha_i + 1\big).$$
If $(p_1, \ldots, p_k)$ is Dirichlet $D_\alpha$, then $p_1, p_2, \ldots, p_r, 1 - \sum_{i=1}^rp_i$ is $D_{\alpha_1, \ldots, \alpha_r, \sum_{i>r}\alpha_i}$, and $p_1, \ldots, p_r$ given $p_{r+1}, \ldots, p_k$ is $\big(1 - \sum_{i>r}p_i\big)D_{\alpha_1, \ldots, \alpha_r}$.

10.2. Admissibility of Maximum Likelihood, Multinomial Case

The maximum likelihood estimates of $p_i$ are $n_i/n$; these are posterior means for the non-unitary prior density $1/\prod p_i$. They are the only estimates that do not depend on the fineness of subdivision of the multinomial cells, Johnson (1932). If there are very many $p_i$, all probability estimates will be $0/n$ or $1/n$, which is unsatisfactory.

Theorem (Johnson (1971)). The maximum likelihood estimator $\hat p_i = n_i/n$ is an admissible estimator of $p_i$ with loss function $L(d, p) = \sum_{i=1}^k(p_i - d_i)^2$.

PROOF. The technique of 6.3 would approximate $\hat p_i$ by Bayes estimates for densities $\prod p_i^{\alpha-1}$, with $\alpha \to 0$, but this is not effective for $k > 2$ because more than one of the $n_i$ may be zero, and this causes irretrievable degeneracy in the posterior densities when the corresponding $p_i$ are zero. The essence of Johnson's proof is careful handling of the cases where some $n_i$ are zero.

Consider first $k = 2$ and suppose $\delta$ has risk nowhere greater than the risk of $n/n$:
$$r(\delta, p) = \sum_{n_1+n_2=n}\binom{n}{n_1}\big[(\delta_1 - p_1)^2 + (\delta_2 - p_2)^2\big]p_1^{n_1}p_2^{n_2}.$$
At $p_1 = 0$ or $p_1 = 1$, $\delta$ must have zero risk, since $n/n$ has zero risk for $p_1p_2 = 0$.
Thus $\delta_1(0, n) = 0$, $\delta_2(0, n) = 1$, and so $\delta$ and $n/n$ agree when $n_1 = 0$ or $n_1 = n$. Then
$$r(\delta, p) - r(n/n, p) = \sum_{0<n_1<n}\sum_{i=1}^2\binom{n}{n_1}\big[(\delta_i - p_i)^2 - (n_i/n - p_i)^2\big]p_1^{n_1}p_2^{n_2},$$
$$\int\big[r(\delta, p) - r(n/n, p)\big]\frac{dp_1}{p_1p_2} = \sum_{0<n_1<n}\sum_{i=1}^2\binom{n}{n_1}\int\big[(\delta_i - p_i)^2 - (n_i/n - p_i)^2\big]p_1^{n_1-1}p_2^{n_2-1}\,dp_1 \ge 0,$$
since $n/n$ is Bayes on the prior $1/p_1p_2$, with equality only if $\delta_i = n_i/n$, $i = 1, 2$. Thus if $\delta$ has risk no greater than $n/n$, it equals $n/n$; thus $n/n$ is admissible. Note that integration is possible after multiplying by $1/p_1p_2$ because the cases $n_1 = 0$ and $n_1 = n$ have been eliminated.

Consider next $k = 3$, and suppose that $\delta$ has risk no greater than that of $n/n$. Letting $p_1 = 0$,
$$r(\delta, p) = \sum_{n_2+n_3=n}\binom{n}{n_2}\sum_{i=1}^3\big(\delta_i(0, n_2, n_3) - p_i\big)^2\prod_{i=2,3}p_i^{n_i} \ge \sum_{n_2+n_3=n}\binom{n}{n_2}\sum_{i=2}^3\big(\delta_i(0, n_2, n_3) - p_i\big)^2\prod_{i=2,3}p_i^{n_i}.$$
Since $r(\delta, p) \le r(n/n, p)$, the right side is not greater than the corresponding risk of $(0, n_2/n, n_3/n)$, which implies $\delta_i(0, n_2, n_3) = n_i/n$, $i = 1, 2, 3$, by admissibility in the case $k = 2$. [Note that $\delta_1(0, n_2, n_3) = 0$, since otherwise, for $p_1 = 0$, $r(\delta, p) > r(n/n, p)$.] Similarly, $\delta$ agrees with $n/n$ whenever $n_1$ or $n_2$ or $n_3$ is 0. Then
$$\int\big[r(\delta, p) - r(n/n, p)\big]\frac{dp_1\,dp_2}{p_1p_2p_3} = \sum_{n_i>0}\sum_{i=1}^3\binom{n}{n_1\,n_2\,n_3}\int\big[(\delta_i - p_i)^2 - (n_i/n - p_i)^2\big]\prod p_i^{n_i-1}\,dp_1\,dp_2 \ge 0,$$
since $n/n$ is Bayes for the density $1/\prod p_i$, with equality only if $\delta_i = n_i/n$. The integration is justified because $n_i > 0$. Thus if $r(\delta, p) \le r(n/n, p)$, then $\delta = n/n$, so $n/n$ is admissible. General $k$ is handled by induction: if one of the $p$'s is set to zero, the decision procedure with the corresponding $n_i$ zero must coincide with maximum likelihood; so the difference between the two procedures need only be assessed over $n_i > 0$; and integrating with respect to $1/\prod p_i$ shows that $\delta$ cannot beat $n/n$. [Decisions of the form: take $\delta_j$ with probability $d_j$, have risk exceeding that of $\sum d_j\delta_j$, so they can't beat $n/n$ either.] □

10.3. Inadmissibility of Maximum Likelihood, Poisson Case

If the $n_i$ are independent Poissons with expectations $\lambda_i$, then $n_1, \ldots, n_k$ given $\sum n_i = n$ are multinomial with parameters $\lambda_i/\sum\lambda_j$. For this reason, it is often convenient to formulate multinomial models using independent Poissons. A convenient family of prior densities for the Poisson $P_\lambda$, $P_\lambda\{n\} = e^{-\lambda}\lambda^n/n!$, is the gamma $G(m, a)$ with density $a^m\lambda^{m-1}e^{-a\lambda}/\Gamma(m)$; given an observation $n$, the posterior density is $G(m + n, a + 1)$.

Theorem (Clevenson and Zidek (1975)). Let $n_i$ be independent Poisson with expectations $\lambda_i$, $i = 1, 2, \ldots, k$. Then, for $k \ge 2$,
$$P\Big[\sum(a_nn_i - \lambda_i)^2/\lambda_i\Big] < P\Big[\sum(n_i - \lambda_i)^2/\lambda_i\Big], \qquad a_n = \frac{n}{n + k - 1},\quad n = \sum n_i,$$
for all $\lambda_i$. Thus $n_i$ is inadmissible as an estimate of $\lambda_i$.

PROOF. Given $n = \sum_{i=1}^kn_i$, the $n_i$ are multinomial with parameters $p_i = \lambda_i/\sum\lambda_j$, so
$$P\Big[\sum(an_i - \lambda_i)^2/\lambda_i\,\Big|\,n\Big] = \sum\big[a^2np_i(1 - p_i)/\lambda_i + (anp_i - \lambda_i)^2/\lambda_i\big] = a^2n(k - 1)\big/\textstyle\sum\lambda_i + \big(an - \textstyle\sum\lambda_i\big)^2\big/\textstyle\sum\lambda_i.$$
The value of $a$ which minimizes this expression is $\sum\lambda_i/(n + k - 1)$, which is estimated by $a_n = n/(n + k - 1)$. Set $\Lambda = \sum\lambda_i$. Then
$$P\Big[\sum(a_nn_i - \lambda_i)^2/\lambda_i\,\Big|\,n\Big] = a_n^2\big(n(k - 1) + n^2\big)/\Lambda - 2a_nn + \Lambda = \frac{n^3}{(n + k - 1)\Lambda} - \frac{2n^2}{n + k - 1} + \Lambda.$$
Dividing out,
$$\frac{n^3 - 2n^2\Lambda}{n + k - 1} = n^2 - (2\Lambda + k - 1)n + (k - 1)(2\Lambda + k - 1) - \frac{(k - 1)^2(2\Lambda + k - 1)}{n + k - 1},$$
so, taking expectations over $n$, which is Poisson with mean $\Lambda$,
$$P\big[(n^3 - 2n^2\Lambda)/(n + k - 1)\big] < \Lambda^2 + \Lambda + (2\Lambda + k - 1)(k - 1 - \Lambda) - (k - 1)^2(2\Lambda + k - 1)/(\Lambda + k - 1),$$
Following the binomial case, it is useful to consider the family of prior densities p~i- 1, IX j > 0, IlXj = 1 which gives ranges of estimates unaffected by amalgamation or subdivision of cells; these are analogous to confidence priors in the binomial case. Another possibility is to estimate the Dirichlet prior; assume the prior np~-I; then P[n j ] = n/k, P[n;J = n/k + (n - I)(IX(IX + I)n/klX(klX + 1», P(Inn = n + n(n - I)(IX + I)/(klX + 1) Thus IX is estimated by solving In~ = n + n(n - I)(IX + 1)/(klX + 1), or equivalently, (1/(k - 1)I(nj - n/k)2 = n(klX + n)/(klX + 1)k. There may be no non-negative solution IX if nj has small enough variance; set IX = (1/(k - l»I(nj - n/k)2 ~ n/k. 00 if A more satisfactory (and more difficult) procedure due to Good (1965) selects IX to maximize the likelihood P(nllX) = r(klX)r(n+ 1) nr(nj+IX). r(1X)kr(n + IXk) r(n j + 1) Good (1975) shows that P(nllX) is maximized by IX = 00 when the chisquare goodness of fit statistic, X = (k/n) I(nj - n/kf ~ k - 1. He suggests using G = sup [2 log P(n IIX)/P(no IIX) ] 1/2 as a test statistic for deciding '" n = n/k. In Good (1967) it is asserted that G2 is distributed Pj = I/k, where Oj as X~ given that G > 0, asymptotically as n ~ 00 for k fixed. However, asymptotically, the expansions for gamma functions (Abramowitz and Stegun, 1964, p. 257) show I G(IX) = log P(nllX)/P(n o IX) ~ t(k - 1) 10g(lXk/(n + IXk» + tXn/(n + IXk) which is maximized by IX = 00 if X ~ k - 1, and IX = (n - I/k)/(X - k + 1) if X > k + 1, which is the same as the estimate based on the first two moments. Thus sup G(IX) ~ {f(X - k + 1) -t 10g[1 + (X - k + 1)/(k - I/n)]} + which is a monotone function of X; its asymptotic distribution is determined from the asymptotic distribution of X, which is X~-I; and Good's test 101 10.6. MuItinomials with Clusters statistic G is just a monotone function of X asymptotically, without the asymptotic behavior stated in Good (1967). 
Of course, a complete Bayesian notes that $P(p_i|\alpha) = (n_i + \alpha)/(n + k\alpha)$; specifies a prior density for $\alpha$; and computes $P[p_i|n] = P[(n_i + \alpha)/(n + k\alpha)\,|\,n]$, averaging over the posterior density of $\alpha$ given $n$. The likelihood $P(n|\alpha)$ is messy enough to suggest that no simple closed form expression will be available. Good (1967) uses a log-Cauchy distribution for $\alpha$.

10.5. Two Stage Poisson Models

Suppose the $n_i$ are Poisson $\lambda_i$, and the $\lambda_i$ are drawn from some distribution $P_0$, as in 9.4. The $n_i$ are sampled from the discrete distribution with density $p_0$,
$$p_0(n) = P_0\big[\lambda^ne^{-\lambda}/n!\big].$$
The posterior mean of $\lambda_i$ given $n_i$ is $P_0[\lambda^{n_i+1}e^{-\lambda}/n_i!]/P_0[\lambda^{n_i}e^{-\lambda}/n_i!]$, which equals $(n_i + 1)p_0(n_i + 1)/p_0(n_i)$. Thus we can compute posterior means (and variances and other moments) if we know $p_0$. If $p_0$ is not completely known, as good Bayesians we would need a prior distribution for it; the whole data set $n_1, \ldots, n_k$ would determine a posterior distribution for $p_0$, and the estimate
$$P[\lambda_i|n] = (n_i + 1)P\big[p_0(n_i + 1)/p_0(n_i)\,\big|\,n\big].$$
For many Poissons $n_i$, we might estimate $p_0(n)$ by $\#[n_i = n]/k$, the proportion of observed counts equal to $n$, which is the maximum likelihood estimate; however this does not take advantage of the smoothness induced by $p_0(n) = P_0[\lambda^ne^{-\lambda}/n!]$. See Robbins (1956).

A special case is $P_0$ gamma with density $a^\alpha\lambda^{\alpha-1}e^{-a\lambda}/\Gamma(\alpha)$. Then
$$p_0(n) = (1 + a)^{-(\alpha+n)}a^\alpha\binom{\alpha + n - 1}{n},$$
the negative binomial; and $\alpha$, $a$ may be estimated from the observed $n_i$ by maximum likelihood or by the moments $Pn = \alpha/a$, $P(n - Pn)^2 = \alpha(1 + a)/a^2$. Also $P[\lambda_i|n] = P[(n_i + \alpha)/(1 + a)\,|\,n]$; if only we could think of a nice prior distribution $g$ of $\alpha$, $a$, the posterior density
$$g\prod_{i=1}^k(1 + a)^{-(\alpha+n_i)}a^\alpha\binom{\alpha + n_i - 1}{n_i}$$
could be used to obtain a Bayes estimate.

10.6. Multinomials with Clusters

In previous sections, all cells have been treated symmetrically, but it will frequently happen that some groups of probabilities $p_i$ will be expected to be similar. One possibility is that the cells are grouped in clusters $C_1, C_2, \ldots, C_J$, and then the prior density might be taken to be $\prod_j\big(\sum_{i\in C_j}p_i\big)^{\alpha_j-1}$;
One possibility is that the cells are grouped in clusters C l' C 2 , ••• , Cjand then the prior density might betaken to be (LiEclJi- 1 ; n 102 10. The Multinomial Distribution this is as if we had made previous observations in which cx j individuals occurred in the cluster CJ.. If the clusters are hierarchical, so that C.I and C j overlap only if Ci c: C j or C j c: Ci , this model may be reformulated as a number of Dirichlet priors on conditional probabilities, and probability estimates may be simply computed. Suppose for example the multinomial is a 2 x 3 contingency table: Let C 1 =(11, 12, 13), nil n l2 n l3 n 21 n 22 n 23 Cij={ij}. The prior density may be transformed to a density on the marginal probabilities PI. = Pl1 + PI2 + P13 , P2. = 1 - PI.' and conditional probabilities Pili = Pi/Pi.' Il2-ln Pi? ""-IdPII dPI2dP13 dP21 dP22 (PI. )",-1 P2. =p",+I+I("lj-l)p"2+1+r<"2j-l)np"u-Idp dp dp dp dp I 2 jli 1. 111 211 112 212 (Note that only five parameters appear in the differential element, since LPij = 1 .) The new density is P'" + I +r<"'j-I) pill + I +I("lj- I)np~;{-I. n(Lijec,)'i}Xk- 1 = (PI I 1. C 2 =(21,22,23), + P12 + P13 )''' -1(P21 + P22 + P23 )"2- 1np:y-I 2. JII The advantage of this formulation is that the marginal and conditional probabilities are independent, and so it is easy to do posterior computations. For example P[Pij! nJ = P[P i.! nJP[Pili! n J. See Good (1965) for other methods of generating priors for contingency tables. 10.7. Multinomials with Similarities It may happen that the cells of a multinomial are ordered in such a way that neighboring probabilities are likely to be close. The prior density ensures that neighboring probabilities are not too different. Pioneering work in this area occurs in Good and Gaskins (1971, 1980) studying density functions. For multinomial probabilities, Leonard (1973) presents the following prior density. 
Let [log p_i] be multivariate normal, subject to the constraint Σ p_i = 1; neighboring p_i's are required to be highly correlated. A similar prior density is considered by Simonoff (1980), the density exp[−λ Σ_{i=1}^{k−1} log²(p_i/p_{i+1})]; the penalty function Σ log²(p_i/p_{i+1}) ensures that p_i and p_{i+1} must be close.

In order to avoid the pesky dependence Σ p_i = 1, let us assume the n_i are Poisson with expectations λ_i, and that the log λ_i form a normal autoregressive process with lag one,

log λ_{i+1} − μ = ρ(log λ_i − μ) + σ ε_{i+1},  ε_i independent N(0, 1).

A simple limiting case, with ρ = 1, has log λ_1 uniform and the log(λ_{i+1}/λ_i) independent N(0, σ²). The posterior density with respect to Lebesgue measure on {log λ_i} is

∝ exp[Σ n_i log λ_i − ½ Σ log²(λ_{i+1}/λ_i)/σ² − Σ λ_i].

It is difficult to compute posterior means, but the posterior mode is easier to compute: the function to be maximized is called a penalized likelihood function by Good and Gaskins (1971), with penalty function Σ log²(λ_{i+1}/λ_i) requiring neighboring λ's to be close. It is not feasible to estimate σ² in the obvious way, by maximizing the posterior density, because the inaccessible constant of proportionality includes σ. The modal value of u_i = log λ_i satisfies

n_i + (u_{i−1} + u_{i+1} − 2u_i)/σ² − e^{u_i} = 0.

Concavity of Σ n_i u_i − Σ(u_{i+1} − u_i)²/2σ² − Σ e^{u_i} guarantees the existence and uniqueness of a modal value. The solution may be found by a Newton-Raphson technique. Simonoff (1980) shows that for large k, with the n_i moderate, the estimates λ̂_i are weighted averages of the n_j for j near i, giving asymptotic behavior similar to kernel estimates. These techniques are related to spline fitting methods used in regression and density estimation; see for example Wahba's remarks in the discussion of Stone (1977). An alternative prior on the log λ_i is exp[−λ Σ |log(λ_{i+1}/λ_i)|], which specifies the absolute differences to be exponentially distributed.
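Before turning to this absolute-value prior, the Newton-Raphson computation of the posterior mode under the normal prior can be sketched as follows (our illustration, not the book's; the function name, data, and σ² value are ours):

```python
import numpy as np

def posterior_mode(n, sigma2, iters=50):
    """Newton-Raphson for the modal u_i = log(lambda_i) of 10.7, maximizing
    sum n_i u_i - sum (u_{i+1} - u_i)^2 / (2 sigma^2) - sum exp(u_i)."""
    n = np.asarray(n, float)
    k = len(n)
    # Path Laplacian: the quadratic penalty is u' L u / (2 sigma^2).
    L = np.zeros((k, k))
    for i in range(k - 1):
        L[i, i] += 1.0; L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0; L[i + 1, i] -= 1.0
    u = np.log(n + 0.5)                              # crude starting value
    for _ in range(iters):
        grad = n - L @ u / sigma2 - np.exp(u)        # stationarity: grad = 0
        hess = -L / sigma2 - np.diag(np.exp(u))      # negative definite, so
        u = u - np.linalg.solve(hess, grad)          # the mode is unique
    return np.exp(u)                                 # modal lambda estimates

lam_hat = posterior_mode([3, 5, 4, 9, 8, 7, 2, 1], sigma2=0.5)
```

Small σ² forces the λ̂_i toward a common value, while large σ² returns λ̂_i ≈ n_i; each λ̂_i behaves like a weighted average of nearby counts, as Simonoff's result suggests.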
The modal u_i = log λ_i maximizes Σ n_i u_i − λ Σ |u_{i+1} − u_i| − Σ e^{u_i}; thus each e^{u_i} equals n_i − 2λ, n_i + 2λ, e^{u_{i+1}}, or e^{u_{i−1}}. The solution may be described by a number of intervals (I_r, J_r) such that u_i is constant for I_r ≤ i ≤ J_r. If u_{I_r−1} < u_{I_r} < u_{J_r+1}, then

(J_r − I_r + 1) e^{u_{I_r}} = Σ_{I_r ≤ i ≤ J_r} n_i;

this is equivalent to amalgamating the cells i, I_r ≤ i ≤ J_r. Search for the optimal intervals requires techniques similar to Barlow et al. (1972). This method clusters the cells which have similar n_i, and is clearer in its action than the normal prior considered previously. Its asymptotic properties are unknown.

10.8. Contingency Tables

The entries in a contingency table may be regarded as multinomial or Poisson; the special structure of the contingency table requires special priors for the parameters. Good (1965) has many useful ideas for such priors. See also Leonard (1975). For a two way contingency table with entries n_ij, 1 ≤ i ≤ I, 1 ≤ j ≤ J, and probabilities {p_ij}, we often expect independence p_ij = p_i. p_.j, where p_i. = Σ_j p_ij, p_.j = Σ_i p_ij. Good considers putting a prior density on the parameters [p_ij/(p_i. p_.j)], which has the effect of moving all parameter estimates p̂_ij towards independence.

For large tables with ordered rows and columns, the prior density in the Poisson model,

exp[−λ Σ log²(λ_ij λ_{i+1,j+1}/(λ_{i,j+1} λ_{i+1,j}))],

with respect to Lebesgue measure on the log λ_ij, encourages each 2 × 2 table of neighboring cells to be nearly independent. The modal posterior estimates of λ are then approximately weighted averages of counts in nearby cells. With prior density

exp(−λ Σ |log(λ_ij λ_{i+1,j+1}/(λ_{i,j+1} λ_{i+1,j}))|),

the posterior mode requires blocks of neighboring 2 × 2 tables to be independent, and so breaks the contingency table into a number of (unbalanced) subtables within which independence is achieved. Computations with both these techniques are formidable.

10.9. Problems

P1.
For all n_i large, find an approximate expression for the Dirichlet parameter α maximizing the likelihood

(Γ(kα)/Γ(α)^k)(Γ(n + 1)/Γ(n + kα)) ∏ Γ(n_i + α)/Γ(n_i + 1).

P2. Let λ_i be independent gamma variables with density a(aλ)^{α−1} exp(−aλ)/Γ(α). Let n_i be independent Poisson with expectations λ_i. Show that {λ_i/Σ_j λ_j} given {n_i} has the same posterior distribution as {p_i} in the multinomial model with Dirichlet prior density ∝ ∏ p_i^{α−1}.

P3. In a binomial model with n = 10, compute the mean square error of the Bayes estimators corresponding to beta prior densities [p(1 − p)]^{α−1} for α = −1, 0, ½, 1, 10, and sketch the risks as a function of p. [Hand computation will suffice.] Obtain the distribution of r (the number of successes) given α, and estimate α given r. If P_0[α = ½] = P_0[α = 1] = ½, find P[p | r].

E1. If p can take only the values k/N, 0 ≤ k ≤ N, show that the proportion of successes in n trials in the binomial model is inadmissible as an estimate of p with squared error loss, when n > N.

P4. On visiting a new cafeteria, a distinguished statistician took five cubes of sugar for his coffee. On each wrapper was pictured a bird; of the first four, the third was a cardinal but the other three were swallows; what bird is likely to appear on the fifth wrapper? (See Good (1965).)

E2. In a week, books were borrowed from a library by persons in the following categories:

First year students       6
Second year students     10
Third year students       7
Fourth year students      5
Statistics faculty        3
Undergraduates            2
Other graduate students   8
Other faculty             1
Other persons             3

Estimate the probability that the next book is borrowed by a person in each of the above categories.

P5. For a 2 × 2 table, find a prior distribution on the probabilities p_11, p_12, p_21, p_22 so that the Bayes test for independence is Fisher's test, rejecting independence if the first observation n_11 is too large or too small given n_1., n_.1.

E3.
Is the estimate p̂ = 0 admissible as an estimate of p in binomial problems, with mean squared error loss?

Q1. Do a two stage analysis of the multinomial model analogous to the two stage Poisson model, 10.5.

P6. In the binomial, is the maximum likelihood estimate p̂ admissible with loss (p − p̂)²/p(1 − p), or with loss p log(p/p̂) + (1 − p) log[(1 − p)/(1 − p̂)]? Assume 0 < p < 1.

P7. Johnson (1971). In the binomial problem, given r successes in n trials, admissible estimates of p are of the form

p̂ = 0,  r ≤ L;
p̂ = P_0[p^{r−L}(1 − p)^{U−r−1}]/P_0[p^{r−L−1}(1 − p)^{U−r−1}],  L < r < U;
p̂ = 1,  r ≥ U;

where −1 ≤ L < U ≤ n + 1, and P_0 is not carried by {0, 1}.

P8. (Clevenson and Zidek, 1975.) For p independent Poissons n_i with means λ_i, show that

δ_i(n) = (1 − (β + p − 1)/(Σ_j n_j + β + p − 1)) n_i

beats n as an estimate of λ, using loss function Σ(δ_i − λ_i)²/λ_i, for 1 ≤ β ≤ p − 1.

P9. In the multinomial, show that {n_i/n} is inadmissible for {p_i}, with squared error loss, if the parameter values satisfy p_i ≥ ε, i = 1, 2, …, k.

10.10. References

Abramowitz, M. and Stegun, I. A. (1964), Handbook of Mathematical Functions. U.S. Department of Commerce.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972), Statistical Inference under Order Restrictions. New York: John Wiley.
Clevenson, M. L. and Zidek, J. V. (1975), Simultaneous estimation of the means of independent Poisson laws, J. Am. Stat. Ass. 70, 698-705.
Good, I. J. (1965), The Estimation of Probabilities. Cambridge, Mass.: M.I.T. Press.
--(1967), A Bayesian significance test for multinomial distributions, J. Roy. Statist. Soc. B 29, 399-431.
--(1975), The Bayes factor against equiprobability of a multinomial population using a symmetric Dirichlet prior, Annals of Statistics 3, 246-250.
--and Gaskins, R. (1971), Nonparametric roughness penalties for probability densities, Biometrika 58, 255-277.
--(1980), Density estimation and bump hunting by the penalized likelihood method exemplified by scattering and meteorite data, J. Am. Stat. Ass. 75, 42-73.
Johnson, B. M.
(1971), On the admissible estimators for certain fixed sample binomial problems, Annals of Math. Statistics 42, 1579-1587.
Johnson, W. E. (1932), Appendix to probability: deductive and inductive problems, Mind 41, 421-423.
Leonard, T. (1973), A Bayesian method for histograms, Biometrika 60, 297-308.
--(1975), Bayesian estimation methods for two-way contingency tables, J. Roy. Stat. Soc. B 37, 23-37.
Perks, W. (1947), Some observations on inverse probability including a new indifference rule, J. Inst. Actuaries 73, 285-312.
Robbins, H. E. (1956), An empirical Bayes approach to statistics, Proc. III Berkeley Symposium, 157-163.
Simonoff, J. S. (1980), A penalty function approach to smoothing large sparse contingency tables, Ph.D. Thesis, Yale University.
Stone, C. J. (1977), Consistent non-parametric regression, Annals of Statistics 5, 595-645.

CHAPTER 11
Asymptotic Normality of Posterior Distributions

11.0. Introduction

Suppose X_1, …, X_n are independent observations from P_θ, θ ∈ R. Suppose that P_θ has density f_θ(x) with respect to some measure ν. The maximum likelihood estimate of θ (or the value of θ that maximizes the density of the posterior probability relative to the prior probability), maximizing ∏_{i=1}^n f_θ(X_i), is denoted by θ̂_n. As n → ∞, Fisher established that θ̂_n is asymptotically normal with mean θ_0 and variance (nI(θ_0))^{−1}, where θ_0 is the true value of θ, and I(θ_0) is Fisher's information {−(d²/dθ²) P_{θ_0}[log f_θ(X)]}_{θ=θ_0}. The asymptotic normality requires a tedious list of regularity conditions, first promulgated by Wald. Under almost the same conditions, with the additional requirement that the prior density be positive and continuous in a neighborhood of θ_0, the posterior distribution of θ given X_1, …, X_n is asymptotically normal with mean θ̂_n and variance [nI(θ̂_n)]^{−1}.
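This normal approximation to the posterior is easy to check numerically. The sketch below is ours, not the book's: for Bernoulli observations with a Beta(2, 2) prior, the exact Beta posterior density is compared with the normal density having mean θ̂_n and variance [nI(θ̂_n)]^{−1} = θ̂_n(1 − θ̂_n)/n, since I(θ) = 1/θ(1 − θ) for the Bernoulli. The sample size and success count are illustrative.

```python
import math

n, r = 400, 130                        # n Bernoulli trials, r successes
a, b = 2 + r, 2 + (n - r)              # Beta(2, 2) prior -> Beta(a, b) posterior
theta_hat = r / n                      # maximum likelihood estimate
var = theta_hat * (1 - theta_hat) / n  # (n I(theta_hat))^{-1}

def beta_pdf(x):
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_b)

def normal_pdf(x):
    return math.exp(-(x - theta_hat) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for x in (theta_hat - 0.03, theta_hat, theta_hat + 0.03):
    print(round(beta_pdf(x) / normal_pdf(x), 3))   # ratios close to 1
```

The O(n^{−1}) mean shift due to the prior shows up as a slight asymmetry of the ratios on either side of θ̂_n.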
In the same way that the posterior distribution is consistent for θ_0 under very general conditions, it may be shown to be normal under very general conditions; however these elegant general conditions are often more difficult to verify than the longer maximum likelihood list. The prior density does not affect the asymptotic distribution of θ in the terms of O(1) or O(n^{−1/2}); it does shift the mean of the asymptotic distribution by a term O(n^{−1}). Similar results hold for k-dimensional parameter spaces.

11.1. A Crude Demonstration of Asymptotic Normality

Let P_0 denote the prior probability. The posterior distribution P^X is given by

P^X Y = P_0[∏ f_θ(X_i) Y(θ)] / P_0[∏ f_θ(X_i)].

For θ near θ̂_n,

Σ log f_θ(X_i) = Σ log f_{θ̂_n}(X_i) + ½(θ − θ̂_n)² Σ (d²/dθ̂_n²) log f_{θ̂_n}(X_i) + small.

Assume that X_1, X_2, …, X_n are drawn from Q, not necessarily a member of the family P_θ, θ ∈ R; if Q = P_{θ_0}, say θ_0 is the true value of θ. Then

Σ log f_θ(X_i) = Σ log f_{θ̂_n}(X_i) + ½(θ − θ̂_n)² n (d²/dθ̂_n²) Q(log f_{θ̂_n}) + small.

Let P_0 have density p_0 with respect to Lebesgue measure μ. Then log p_0(θ) = log p_0(θ̂_n) + small, for θ near θ̂_n. And P^X has density p_X with respect to Lebesgue measure,

p_X(θ) ≈ c(X) exp[½(θ − θ̂_n)² n (d²/dθ̂_n²) Q(log f_{θ̂_n})],

so that θ has the asymptotic density of a normal distribution with mean θ̂_n and variance (nI(θ̂_n))^{−1}. It is necessary to produce regularity conditions which will validate the omission of the various "small" terms.

11.2. Regularity Conditions for Asymptotic Normality

See also Walker (1969).

Theorem.
(i) Observations X_1, X_2, …, X_n are drawn from the unitary probability Q on 𝒴.
(ii) It is contemplated that X_1, …, X_n might be drawn from P_θ, some θ ∈ R. It is assumed that P_θ has density f_θ(X) with respect to a measure ν on 𝒴.
(iii) The function Q(log f_θ) has a unique maximum at θ = θ_0. (If Q = P_{θ′}, necessarily θ_0 = θ′.)
(iv) The prior P_0 has a density p_0 with p_0(θ_0) > 0, p_0 continuous at θ_0.
(v) P^X[|θ − θ_0| > ε] → 0 as Q, for each ε > 0.
(vi) In a neighborhood of θ_0, the derivatives (d/dθ) log f_θ and (d²/dθ²) log f_θ exist and are continuous in θ, uniformly in X.
(vii) Q[(d/dθ_0) log f_{θ_0}]² < ∞, Q[(d²/dθ_0²) log f_{θ_0}] < 0.
(viii) Let θ̂_n be the maximum likelihood estimate for θ near θ_0 (necessarily unique as n → ∞). Let

φ_n = (θ − θ̂_n)[−Σ (d²/dθ²) log f_θ(X_i)]^{1/2}|_{θ=θ̂_n}.

Then the posterior density of φ_n with respect to Lebesgue measure satisfies

sup_{|φ_n| ≤ K} |p_n(φ_n) / [(2π)^{−1/2} exp(−½φ_n²)] − 1| → 0 as Q,

for each K > 0.

PROOF. (1) First it will be shown that Σ log f_θ(X_i) is maximized by a unique θ̂_n in a small neighborhood of θ_0, as Q, as n → ∞, with (d/dθ̂_n) Σ log f_{θ̂_n}(X_i) = 0. Since (d/dθ) log f_θ is continuous in θ uniformly in X, (d/dθ)Q(log f_θ) = Q((d/dθ) log f_θ). Since Q(log f_θ) has a unique maximum at θ = θ_0, (d/dθ_0)Q(log f_{θ_0}) = 0. Let Q[(d²/dθ_0²) log f_{θ_0}(X_i)] = −Δ_0. Then

(1/n) Σ (d²/dθ²) log f_θ(X_i) = (1/n) Σ (d²/dθ_0²) log f_{θ_0}(X_i) + Δ,

where |Δ| < Δ_0/2 whenever |θ − θ_0| < δ, by uniform continuity, (vi). Since

(1/n) Σ (d²/dθ_0²) log f_{θ_0}(X_i) → Q[(d²/dθ_0²) log f_{θ_0}(X)] as Q,

(1/n) Σ (d²/dθ²) log f_θ(X_i) < −Δ_0/2 whenever |θ − θ_0| < δ, for all large n as Q. Thus Σ log f_θ(X_i) has at most one maximizing value in |θ − θ_0| < δ as Q. Also

(1/n) Σ (d/dθ) log f_θ(X_i) = (1/n) Σ (d/dθ_0) log f_{θ_0}(X_i) + (θ − θ_0)(1/n) Σ (d²/dθ*²) log f_{θ*}(X_i),

where |θ − θ*|, |θ_0 − θ*| ≤ |θ − θ_0|. As n → ∞, (1/n) Σ (d/dθ_0) log f_{θ_0}(X_i) → 0 as Q. Thus

(1/n) Σ (d/dθ) log f_θ(X_i) < −(θ − θ_0)Δ_0/2 for |θ − θ_0| < δ, θ > θ_0, as Q,
(1/n) Σ (d/dθ) log f_θ(X_i) > −(θ − θ_0)Δ_0/2 for |θ − θ_0| < δ, θ < θ_0, as Q.

Thus Σ (d/dθ) log f_θ(X_i) has a zero in |θ − θ_0| < δ as Q as n → ∞, and the zero is unique because (1/n) Σ (d²/dθ²) log f_θ(X_i) < −Δ_0/2 in |θ − θ_0| < δ as Q; the zero at θ = θ̂_n maximizes Σ log f_θ(X_i); since θ̂_n lies in |θ − θ_0| < δ for n large enough, for each choice of δ, θ̂_n → θ_0 as Q.
(2) θ̂_n is asymptotically normal with mean θ_0 and variance σ²/n, where σ^{−2} = −Q[(d²/dθ_0²) log f_{θ_0}(X)]. For

0 = (d/dθ̂_n) Σ log f_{θ̂_n}(X_i) = (d/dθ_0) Σ log f_{θ_0}(X_i) + (θ̂_n − θ_0)(d²/dθ_0²) Σ log f_{θ_0}(X_i) + nε(θ̂_n, X),

where ε(θ̂_n, X) → 0 as Q, from (vi). Now (1/n)(d/dθ_0) Σ log f_{θ_0}(X_i) is asymptotically normal N[0, (1/n)Q((d/dθ_0) log f_{θ_0})²], and (1/n)(d²/dθ_0²) Σ log f_{θ_0}(X_i) → Q[(d²/dθ_0²) log f_{θ_0}(X)] as Q. Thus θ̂_n − θ_0 is asymptotically normal with mean 0 and variance σ²/n.

(3) To conclude,

log p_X(θ)/p_X(θ̂_n) = log p_0(θ)/p_0(θ̂_n) + Σ log f_θ(X_i) − Σ log f_{θ̂_n}(X_i)
= ε(θ, θ̂_n) + ½(θ − θ̂_n)²(Σ (d²/dθ̂_n²) log f_{θ̂_n}(X_i) + nε(θ, X)),

where, by (iv) and (vi), ε(θ, θ̂_n) → 0 as Q and ε(θ, X) → 0 as Q, uniformly over |θ − θ̂_n| ≤ δ_n. From (vii), (1/n) Σ (d²/dθ̂_n²) log f_{θ̂_n}(X_i) → −Δ, and φ_n is a linear transformation of θ, with φ_n(θ̂_n) = 0. Thus p_X(θ)/p_X(θ̂_n) = p_n(φ_n)/p_n(0), and

(A) log p_n(φ_n)/p_n(0) + ½φ_n² → 0 as Q, uniformly over |φ_n| ≤ K;
(B) also log p_n(φ_n)/p_n(0) < −¼(θ − θ̂_n)²nΔ for all |θ − θ̂_n| ≤ δ_n, as Q, as n → ∞;
(C) finally, from (v), P^X(|θ − θ_0| > δ_n) → 0 as Q for some δ_n → 0.

It is necessary to combine facts (A), (B), (C) to determine p_n(0). From (C),

∫_{|θ−θ_0| ≤ δ_n} p_n(φ_n) dφ_n → 1.

From (B),

∫_{|θ−θ_0| ≤ δ_n, |θ−θ̂_n| > K/√n} p_n(φ_n) dφ_n ≤ p_n(0) ∫_{|θ−θ̂_n| > K/√n} exp[−¼(θ − θ̂_n)²nΔ] dφ_n = p_n(0) C_n.

From (A),

p_n^{−1}(0) ∫_{|θ−θ̂_n| < K/√n} p_n(φ_n) dφ_n → lim ∫_{|θ−θ̂_n| < K/√n} exp(−½φ_n²) dφ_n = C′_n.

Here C_n → 0 and C′_n → √(2π) as K → ∞, n → ∞, so p_n(0)√(2π) → 1. Thus p_n(φ_n)/[exp(−½φ_n²)/√(2π)] − 1 → 0 uniformly over |φ_n| ≤ K, as required. □

Notes: The conditions of the theorem look forbidding, but they are merely those conditions which permit neglect of "small" terms in the Taylor series expansion. It is not necessary that the X_1, …, X_n be sampled from a member of the family P_θ, but it is necessary that a unique member P_{θ_0} be "closest" to Q in maximizing Q[log dP_θ/dQ]; if there are θ_0, θ_0′ such that P_{θ_0} and P_{θ_0′} are both closest, then the limiting posterior distribution should be bimodal with modes near θ_0 and θ_0′.
The regularity conditions on log f_θ near θ_0 are far stronger than is necessary. It is necessary that the prior density be positive at θ_0, and that it be continuous at θ_0: the conclusion of the theorem requires p_X(θ_n)/p_X(θ_0) → 1 as √n(θ_n − θ_0) → 0, n → ∞; given continuity of (d/dθ) log f_θ(X), this requires p_0(θ_n)/p_0(θ_0) → 1 as √n(θ_n − θ_0) → 0, which requires continuity of p_0. It is necessary that the posterior distribution concentrate at θ_0; maximum likelihood conditions for convergence of θ̂_n to θ_0 might be given, governing the behavior of the likelihood outside neighborhoods of θ_0, but it may be easier to check convergence of the posterior distribution directly.

11.3. Pointwise Asymptotic Normality

Theorem.
(i) Let X_1, X_2, …, X_n be sampled from a unitary Q on 𝒴.
(ii) Let P_θ, θ ∈ R, be a family of probabilities on 𝒴 with densities f_θ with respect to some measure ν on 𝒴.
(iii) Let θ = θ_0 be a local maximum of Q(log f_θ).
(iv) Let a prior probability P_0 have density p_0 positive and continuous at θ_0.
(v) Let (d/dθ_0) log f_{θ_0} exist as Q, and Q[(1/δ) log(f_{θ_0+δ}/f_{θ_0}) − (d/dθ_0) log f_{θ_0}]² → 0 as δ → 0.
(vi) Q[(d/dθ_0) log f_{θ_0}]² < ∞; (d²/dθ_0²)Q(log f_{θ_0}) = −1/v < 0.

Then the posterior density p_X of θ is pointwise asymptotically normal in the neighborhood of θ_0, with mean θ_0 + (v/n) Σ (d/dθ_0) log f_{θ_0}(X_i) and variance v/n; that is,

p_X(θ_0 + ζ√(v/n))/p_X(θ_0) − φ[ζ − √(v/n) Σ (d/dθ_0) log f_{θ_0}(X_i)] / φ[√(v/n) Σ (d/dθ_0) log f_{θ_0}(X_i)] → 0

in Q-probability for each ζ, where φ(u) = exp(−½u²)/√(2π).

PROOF. Let

h_δ(X) = (1/δ) log[f_{θ_0+δ}(X)/f_{θ_0}(X)],
h_0(X) = (d/dθ_0) log f_{θ_0}(X) = lim_{δ→0} h_δ(X), defined as Q,

and let δ_n = ζ√(v/n). Then

p_X(θ_0 + δ_n)/p_X(θ_0) = [p_0(θ_0 + δ_n)/p_0(θ_0)] × ∏ [f_{θ_0+δ_n}(X_i)/f_{θ_0}(X_i)],
log p_X(θ_0 + δ_n)/p_X(θ_0) = log p_0(θ_0 + δ_n)/p_0(θ_0) + Σ δ_n h_{δ_n}(X_i).

Let L_n(ζ) = log p_X(θ_0 + δ_n)/p_X(θ_0). From (iv),

L_n(ζ) − Σ δ_n h_{δ_n}(X_i) → 0 as n → ∞.

Also, Σ δ_n[h_{δ_n}(X_i) − h_0(X_i)] has mean nδ_n Q(h_{δ_n} − h_0) and variance nδ_n² Q(h_{δ_n} − h_0)² → 0 as n → ∞, by (v).
From (iii) and (v), (d/dθ_0)Q(log f_{θ_0}) = Q((d/dθ_0) log f_{θ_0}) = Q(h_0) = 0. From (vi),

Q(log f_{θ_0+δ_n}) = Q(log f_{θ_0}) + δ_n(d/dθ_0)Q(log f_{θ_0}) + ½δ_n²[−1/v + ε_{δ_n}],

where ε_{δ_n} → 0 as n → ∞. Thus nδ_n Q(h_{δ_n} − h_0) + ½nδ_n²/v → 0 as n → ∞. Therefore

(*) L_n(ζ) − δ_n Σ h_0(X_i) + ½nδ_n²/v → 0 in Q-probability, that is, L_n(ζ) − ζ√(v/n) Σ h_0(X_i) + ½ζ² → 0 in Q-probability.

Hence

log{p_X(θ_0 + ζ√(v/n))/p_X(θ_0)} − log{φ[ζ − √(v/n) Σ h_0(X_i)]/φ[√(v/n) Σ h_0(X_i)]} → 0 in Q-probability,

and

p_X(θ_0 + ζ√(v/n))/p_X(θ_0) − φ[ζ − √(v/n) Σ h_0(X_i)]/φ[√(v/n) Σ h_0(X_i)] → 0 in Q-probability,

as required. □

Notes: The condition (iii) is weaker than the corresponding condition (iii) of Theorem 11.2; also there is no condition corresponding to 11.2(v), which requires the posterior distribution to concentrate on θ_0. Thus the posterior density may be asymptotically normal in the neighborhood of θ_0 without concentrating there! Condition 11.3(v) is much weaker than 11.2(vi); thus it is only possible to prove pointwise convergence of the posterior density, rather than uniform convergence. The core of the proof is showing that the log posterior density is parabolic near the "optimal" θ_0, as in equation (*). Note that the asymptotic variance involves (d²/dθ_0²)Q(log f_θ) rather than Q((d²/dθ_0²) log f_θ); the second derivatives of log f_θ may not exist for many X and θ, but the second derivatives of Q(log f_θ), averaging out X, may well exist. See Ibragimov and Khas'minskii (1973) and LeCam (1970) for some related results in maximum likelihood asymptotics.

11.4. Asymptotic Normality of Martingale Sequences

Theorem. Let 𝒳_0 ⊂ 𝒳_1 ⊂ … ⊂ 𝒳_n ⊂ … ⊂ 𝒳 be probability spaces, and let P_0 be a unitary probability on 𝒳. Let P_i be probabilities on 𝒳 to 𝒳_i with P_i = P_iP_j, i ≤ j. Let X be an element of 𝒳_∞, where 𝒳_∞ is the minimal complete probability space with a probability P_∞ equal to P_0 on 𝒳_n, each n.
Define

u_n = P_nX − P_{n−1}X,  s_n² = P_n(X − P_nX)².

Assume
(i) 0 < s_n² < ∞, all n, as P_0;
(ii) Σ_{j=n+1}^∞ P_{j−1}(u_j²)/s_n² → 1 as n → ∞, as P_0;
(iii) sup_{j>n} P_{j−1}(u_j²)/s_n² → 0 as n → ∞, as P_0;
(iv) Σ_{j=n+1}^∞ P_{j−1}|u_j|³/s_n³ → 0 as n → ∞, as P_0.

Then P_n f[(X − P_nX)/s_n] → (2π)^{−1/2} ∫ f(u) e^{−u²/2} du, as P_0, for each bounded continuous function f (so that X is asymptotically normal with mean P_nX and variance s_n² given 𝒳_n).

PROOF. The proof parallels the usual method for proving a central limit theorem for sums. Here X − P_nX = Σ_{j=n+1}^∞ u_j = lim_{N→∞} Σ_{j=n+1}^N u_j; the quantities {u_j} play the role of the summands in the central limit theorem; they are not independent, but u_k is P_0-uncorrelated with f(u_j) for k > j, f measurable. Let f_n(u) = exp(itu/s_n). Then

P_n[f_n(u_{n+1} + u_{n+2}) / (P_n f_n(u_{n+1}) P_{n+1} f_n(u_{n+2}))] = P_n[f_n(u_{n+1}) / P_n f_n(u_{n+1})] = 1.

By induction,

P_n[f_n(Σ_{i=n+1}^N u_i) / ∏_{i=n+1}^N P_{i−1} f_n(u_i)] = 1

(this makes the characteristic function of P_NX − P_nX nearly the same as a product). From Theorem 4.2, P_n|P_NX − X| → 0 as N → ∞. Also |f_n(x) − f_n(y)| ≤ |x − y|, |f_n(x)| ≤ 1, and |f_n^{−1}(x)| ≤ 1. Therefore

|P_n([f_n(Σ_{i=n+1}^N u_i) − f_n(X − P_nX)] / ∏_{i=n+1}^N P_{i−1} f_n(u_i))| ≤ P_n|P_NX − X| → 0 as N → ∞.

Next,

P_{i−1} f_n(u_i) = 1 + (it/s_n)P_{i−1}u_i − ½t² P_{i−1}(u_i²)/s_n² + v|t|³ P_{i−1}|u_i|³/s_n³, with |v| ≤ 1,

and, since P_{i−1}u_i = 0,

Σ_{i=n+1}^N log P_{i−1} f_n(u_i) = −½t² Σ_{i=n+1}^N P_{i−1}(u_i²)/s_n² + v|t|³ Σ_{i=n+1}^N P_{i−1}|u_i|³/s_n³ + e_n,

where e_n → 0 as n → ∞ by (iii). [Using the facts x − ½x² < log(1 + x) < x, Σx_i − sup|x_i| Σ|x_i| < Σ log(1 + x_i) < Σx_i.] Thus Σ_{i=n+1}^∞ log P_{i−1} f_n(u_i) exists by (ii) and (iv), and approaches −½t² as n → ∞, as P_0. Since Σ_{i=N+1}^∞ log P_{i−1} f_n(u_i) → 0 as N → ∞,

P_n[f_n(X − P_nX) / ∏_{i=n+1}^∞ P_{i−1} f_n(u_i)] = lim_{N→∞} P_n[f_n(X − P_nX) Z_N / ∏_{i=n+1}^N P_{i−1} f_n(u_i)] = 1, where Z_N → 1 as N → ∞.
Thus ∏_{i=n+1}^∞ P_{i−1} f_n(u_i) → e^{−½t²}, and the characteristic function of (X − P_nX)/s_n approaches the characteristic function of the unit normal, as P_0. Since any continuous function that is zero outside a compact set can be uniformly approximated on the set by a finite sum Σ a_j exp(it_j u), the same result holds for such continuous functions. Extension to arbitrary bounded continuous functions is straightforward. □

Notes: These results are very free of regularity conditions, and especially of independence conditions. The conditions on the increments in posterior means u_n = P_nX − P_{n−1}X might be difficult to verify. Condition (iv) might be weakened to the Lindeberg-like condition Σ_{j=n+1}^∞ P_{j−1}[|u_j|²{|u_j| > ε}]/s_n² → 0. See Hall and Heyde (1980), and Brown (1971).

EXAMPLE. Suppose 𝒳_n is generated by x_1, …, x_n, a sample from the Bernoulli distribution P[x] = p^{x=1}(1 − p)^{x=0} given p, where p has some prior distribution P_0. See Awad (1978), p. 53. Letting r = Σ x_i,

P_n(p) = f(r, n) = P_0[p^{r+1}(1 − p)^{n−r}] / P_0[p^r(1 − p)^{n−r}],
u_{n+1} = f(r + x_{n+1}, n + 1) − f(r, n).

Thus u_{n+1} given 𝒳_n has value f(r + 1, n + 1) − f(r, n) with probability f(r, n), and f(r, n + 1) − f(r, n) with probability 1 − f(r, n). P_n(u_{n+1}) = 0, so

f(r, n + 1)[1 − f(r, n)] + f(r + 1, n + 1)f(r, n) = f(r, n),
s_n² = P_n(p − P_np)² = f(r + 1, n + 1)f(r, n) − f(r, n)²,
P_n(u_{n+1}²) = s_n⁴ / [f(r, n)(1 − f(r, n))],
P_n|u_{n+1}|³ = s_n⁶ [f^{−2}(r, n) + (1 − f(r, n))^{−2}].

Note that f(r, n) = P_np → p as n → ∞, as P_0. Assume that ns_n² → p(1 − p), as P_0. Then

Σ_{j=n+1}^∞ P_{j−1}(u_j²)/s_n² − n Σ_{j=n+1}^∞ [p(1 − p)/(j − 1)²]/p(1 − p) → 0.

Since n Σ_{j=n+1}^∞ 1/(j − 1)² → 1 as n → ∞, (ii) is satisfied. Also

sup_{j>n} P_{j−1}(u_j²)/s_n² ≤ (1 + ε)/n → 0 as n → ∞, satisfying (iii),

and

Σ_{j=n+1}^∞ P_{j−1}|u_j|³/s_n³ ≈ (n^{3/2}/[p(1 − p)]^{3/2}) Σ_{j=n+1}^∞ ([p(1 − p)]³/(j − 1)³)[1/p² + 1/(1 − p)²] → 0, satisfying (iv).

Thus the posterior distribution of p given x_1, …
, x_n is normal whenever ns_n² → p(1 − p) as P_0, that is, whenever the posterior variance converges to the asymptotic variance of the maximum likelihood estimator.

11.5. Higher Order Approximations to Posterior Densities

Let's just blast away with Taylor series expansions and leave the regularity conditions till later. See Johnson (1967) and Hartigan (1965).

(i) Assume X_1, …, X_n are a sample from P_θ having density f_θ with respect to some measure ν on 𝒴. Let θ be a real valued random variable. Let h_r(X) = [(d/dθ)^r log f_θ(X)]_{θ=θ_0}, and g_r = Q[h_r(X)] with respect to some measure Q on 𝒴.
(ii) Assume Q(log f_θ) is maximal at θ = θ_0.
(iii) Assume the prior P_0 on θ has density p_0, and the posterior P^X has density p_X. Then

log p_X(θ) = log p_X(θ_0) + log p_0(θ)/p_0(θ_0) + Σ_{i=1}^n log[f_θ(X_i)/f_{θ_0}(X_i)],

log[p_X(θ)/p_X(θ_0)] = (θ − θ_0)(d/dθ_0) log p_0(θ_0) + (θ − θ_0) Σ h_1(X_i) + ½(θ − θ_0)² Σ h_2(X_i) + (1/6)(θ − θ_0)³ Σ h_3(X_i) + o(θ − θ_0) + o[n(θ − θ_0)³].

(iv) This expansion is justified by requiring the first three derivatives (d/dθ)^r log f_θ(X) to be continuous in a neighborhood of θ_0, uniformly in X, and by requiring the derivative (d/dθ) log p(θ) to be continuous in a neighborhood of θ_0. In order to ensure that large deviations |θ − θ_0| have negligible probability, assume P^X[|θ − θ_0| > n^{−1/2+ε}] n^k → 0 for every k > 0. The later terms are negligible if Qh_2 < 0. Then

p_X(θ) = c(X) exp{½ Σ h_2(X_i)[θ − θ_0 + (Σ h_1(X_i) + (d/dθ_0) log p_0)/Σ h_2(X_i)]²}
× {1 + (1/6)(θ − θ_0)³ Σ h_3(X_i) + o(θ − θ_0) + o[n(θ − θ_0)³]}.

Here the term (1/6)(θ − θ_0)³ Σ h_3(X_i) causes an O(n^{−1/2}) skewness departure from normality. The only effect of the prior is in shifting the mean by −(d/dθ_0) log p_0 / Σ h_2(X_i). The first three moments determine the asymptotic distribution:

P_n θ = θ_0 − Σh_1/Σh_2 − {(d/dθ_0) log p_0 + ½[(Σh_1)² − Σh_2] Σh_3/(Σh_2)²}/Σh_2 + O(n^{−2}),
P_n(θ − P_nθ)² = (Σh_1 Σh_3/Σh_2 − Σh_2)^{−1} + O(n^{−2}),
P_n(θ − P_nθ)³ = −Σh_3/(Σh_2)³ + O(n^{−3}).

11.6. Problems

E1.
Show that the binomial model satisfies conditions 11.2, when the prior density is continuous and positive at p_0, 0 < p_0 < 1, and p_0 is assumed true.

E2. In the binomial case, if the prior distribution has an atom at p_0, show that the conditions of 11.4 are not satisfied.

E3. Let f[X_1, X_2, …, X_n] be the marginal density of the observations, and let p(θ) be the prior density. Show, under conditions 11.2, that

f(X_1, X_2, …, X_n)/∏ f(X_i | θ̂_n) → p(θ_0)√(2π) {nQ[−(d²/dθ_0²) log f_{θ_0}]}^{−1/2}.

P1. Observations X_i are N(μ, 1) and μ has prior density uniform on |μ| ≤ 1. Give the asymptotic behavior of the posterior distribution as the true μ_0 ranges from −∞ to ∞.

E4. Under the conditions 11.2, when θ_0 is true, show that P(θ < θ_0 | X_n) is asymptotically uniformly distributed.

E5. Under the conditions 11.2, the posterior distribution of log[∏ f(X_i | θ)/∏ f(X_i | θ̂_n)] is asymptotically −½χ_1². [Bayes intervals for θ thus coincide approximately with maximum likelihood intervals.]

E6. X_1, …, X_n are uniform over [θ − ½, θ + ½] and θ is uniform over −∞ to ∞. Give the asymptotic behavior of the posterior density when θ = 0.

E7.
The observation X, Y is bivariate normal, means a(O), b(O), identity covariance matrix, a(8) = 0 for 8 ~ 0, a(O) = 8 for 8 ~ 0, and b(8) = a( - 0). Find the asymptotic posterior distribution of 8, when the true value of the means is (1, 1). Assume a uniform prior distribution for O. P6. In the binomial model, find nondegenerate prior distributions for p for which n var(p IXn) -> O. P7. For X l ' ... , X. from N{J1,0"2), prior 11- N(11 0 ' O"~) verify the conditions of the martingale central limit theorem. Ql. Let X l ' ... , X. be from the normal mixture pN{J11' O"~) + (1 - P)N(J12' O"~) where p has uniform prior, III and J1 2 are independently N(O, 1), O"~ fixed. What is the asymptotic posterior distribution of p, J1 1 ,J12 for various true values of p, J1 1 and J1 2 ? Q2. Let.'!ll c.'!l2 c ... c .'!In''' be increasing, OE.'!l 00' and suppose Z.E.'!l., Z. -> 0 has the property that (Z. - 8)/0".(8) -> N(O, 1) in distribution given O. Show that (8 - Z.l/O".(Z.) -> N(O, 1) in distribution given Z., provided 0 has continuous positive density on the line. (Note: Z. may not have a convergent density.) [Here 0".(8) is the standard deviation of 0 given.'!l. and O".(Z.) is the standard deviation of Z. given .'!l•. ] P8. LetX o' X l ' X 2' ... be observations from an autoregressive process X, = IXX'_I +~, where the ~t are i.i.d. normal. Assume IX is uniform on (-1,1). Find the asymptotic behavior of the posterior distribution of IX given X 0' X I' X 2' ... , X.' P9. Let X l ' ... , X. be a sample from the density exp(O - Xl, X;;; 8. Let 0 have a prior density which is continuous and positive at 0 = O. Find the asymptotic distribution of 0 given X I' ... , X. if X I' ... , X. are sampled from the uniform on (0, 1). 118 II. Asymptotic Normality of Posterior Distributions 11.7. References Awad, A. M. (1978), A martingale approach to the asymptotic normality of posterior distributions, Ph.D. Thesis, Yale University. Brown, B. M. 
(1971), The martingale central limit theorem, Ann. Math. Statist. 42, 59-66.
Hall, P. and Heyde, C. C. (1980), Martingale Limit Theory and Its Applications. New York: Academic Press.
Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math. Statist. 36, 1137-1154.
Ibragimov, I. A. and Khas'minskii, R. Z. (1973), Asymptotic behaviour of some statistical estimators II. Limit theorems for the a posteriori density and Bayes estimators, Theor. Probability Appl. 18, 76-91.
--(1975), Local asymptotic normality for non-identically distributed observations, Theor. Probability Appl. 20, 246-260.
Johnson, R. A. (1967), An asymptotic expansion for posterior distributions, Ann. Math. Statist. 38, 1899-1907.
LeCam, L. (1958), Les propriétés asymptotiques des solutions de Bayes, Publ. Inst. Statist. Univ. Paris 7, 17-35.
--(1970), On the assumptions used to prove asymptotic normality of maximum likelihood estimates, Ann. Math. Statist. 41, 802-828.
Walker, A. M. (1969), Asymptotic behavior of posterior distributions, J. Roy. Stat. Soc. B 31, 80-88.

CHAPTER 12
Robustness of Bayes Methods

12.0. Introduction

A statistical procedure is robust if its behavior is not very sensitive to the assumptions which justify it. In classical statistics these are assumptions about a probability model {P_θ, θ ∈ Θ} for the observations in 𝒴, and about a loss function L connecting the decision and the unknown parameter value. In Bayesian statistics, there is in addition an assumed prior distribution. Bayesian techniques have been used by Box and Tiao (1973) and others to study classical robustness questions, such as the choice of a good estimate of a location parameter for "near-normal" distributions; they imbed the normal in a family with one more parameter, and then use standard Bayesian techniques to determine the posterior distribution of the location parameter. The usual robustness studies allow for a much larger neighborhood of distributions, however.
In studying Bayesian robustness, we wish to evaluate the effect on the posterior distribution and on Bayesian decisions of the various components of the probability model: (i) the likelihood component {P_θ, θ ∈ Θ}, and (ii) the prior component P_0. Since the loss function is chosen by the decision maker, it seems plausible to concentrate on the probability parts of the model. Here we consider mainly the prior component P_0, using the techniques of DeRobertis and Hartigan (1981).

12.1. Intervals of Probabilities

Let Q_1 and Q_2 be probabilities on 𝒳. Define Q_1 ≤ Q_2 if Q_1X ≤ Q_2X whenever X ≥ 0. The interval of probabilities (L, U) is the set of probabilities Q with L ≤ Q ≤ U. The probability L will be called the lower probability, and the probability U will be called the upper probability. If L has density l, and U has density u, with respect to ν, then (L, U) consists of the probabilities with density q, l ≤ q ≤ u.

Theorem. Inf{Q(Y)/Q(X) | L ≤ Q ≤ U} is the unique solution λ of

U(Y − λX)⁻ + L(Y − λX)⁺ = 0,

provided UX⁻ + LX⁺ > 0. Note that X⁺ = X{X(s) ≥ 0}, X⁻ = X{X(s) ≤ 0}.

PROOF. Since Q(Y) ≥ UY⁻ + LY⁺ for L ≤ Q ≤ U, and Q_0Z = U[{Y ≤ 0}Z] + L[{Y ≥ 0}Z] satisfies L ≤ Q_0 ≤ U,

inf{Q(Y) | L ≤ Q ≤ U} = Q_0Y = UY⁻ + LY⁺.

Now inf Q(Y)/Q(X) ≥ λ if and only if inf[Q(Y) − λQ(X)] ≥ 0, since Q(X) ≥ 0 for L ≤ Q ≤ U. Thus inf Q(Y)/Q(X) ≥ λ if and only if U(Y − λX)⁻ + L(Y − λX)⁺ ≥ 0. Also

(d/dλ){U(Y − λX)⁻ + L(Y − λX)⁺} = U(−X{Y ≤ λX}) + L(−X{Y ≥ λX}) ≤ −UX⁻ − LX⁺ < 0

(since Q_0Z = U{Y ≤ λX}Z + L{Y ≥ λX}Z lies in (L, U)). Thus U(Y − λX)⁻ + L(Y − λX)⁺ is strictly decreasing in λ, and is zero at λ = inf Q(Y)/Q(X), as required. □

12.2. Intervals of Means

Theorem. Let L = N(0, 1), U = kL. For Q ∈ (L, U), the mean of Q is QX/Q1, where X(s) = s. Then QX/Q1 has range [−γ(k), γ(k)], where γ(k) satisfies

kγ = (k − 1)[φ(γ) + γΦ(γ)],

with φ(x) = exp(−½x²)/√(2π) and Φ(x) = ∫_{−∞}^x φ(u) du.

PROOF.
Proof. From 12.1, inf QX/Q1 is the solution λ of U(X − λ)⁻ + L(X − λ)⁺ = 0. That is,

    ∫_{x≤λ} (x − λ) kφ(x) dx + ∫_{x≥λ} (x − λ) φ(x) dx = 0,

that is,

    −k[φ(λ) + λΦ(λ)] + φ(λ) − λ[1 − Φ(λ)] = 0,
    −λ = (k − 1)[φ(λ) + λΦ(λ)],
    −kλ = (k − 1)[φ(−λ) + (−λ)Φ(−λ)].

Thus −y(k) = inf QX/Q1. Similarly y(k) = sup QX/Q1. □

    k      1     1.25   1.50   1.75   2      2.5    3      4      5      6      7      8      9      10
    y(k)   0     .089   .162   .223   .276   .364   .436   .549   .636   .707   .766   .817   .862   .901

Thus quite substantial changes in the probability Q do not affect the mean too much. Similar Bayes estimates will arise from a wide range of priors.

12.3. Intervals of Risk

Theorem. Suppose that the risk r(d, θ) is the loss in making decision d when θ is true. Let the Bayes risk be B(Q) = inf_d Q[r(d, θ)]/Q(1), the probable loss when the best decision is taken. Assume 0 < L(1) ≤ U(1) < ∞.

inf{B(Q) | L ≤ Q ≤ U} is the unique solution λ₁ of

    β₁(λ) = inf_d [U(r(d, θ) − λ)⁻ + L(r(d, θ) − λ)⁺] = 0.

sup{B(Q) | L ≤ Q ≤ U} is no greater than the unique solution λ₂ of

    β₂(λ) = inf_d [L(r(d, θ) − λ)⁻ + U(r(d, θ) − λ)⁺] = 0.

Proof. It is straightforward to show that β₁(λ) and β₂(λ) are continuous and strictly decreasing, with unique zeroes λ₁ and λ₂.

    inf{B(Q) | L ≤ Q ≤ U}
      = sup{λ | B(Q) ≥ λ for all Q such that L ≤ Q ≤ U}
      = sup{λ | Q[r(d, θ)] ≥ λQ(1) for all d and Q, L ≤ Q ≤ U}
      = sup{λ | Q[r(d, θ) − λ] ≥ 0 for all d and Q, L ≤ Q ≤ U}
      = sup{λ | inf_d (U[r(d, θ) − λ]⁻ + L[r(d, θ) − λ]⁺) ≥ 0}
      = sup{λ | β₁(λ) ≥ 0} = λ₁.

    sup{B(Q) | L ≤ Q ≤ U}
      = inf{λ | B(Q) < λ for all Q, L ≤ Q ≤ U}
      = inf{λ | Q[r(d, θ)] < λQ(1) for some d, for each Q, L ≤ Q ≤ U}
      ≤ inf{λ | sup_Q Q[r(d, θ) − λ] < 0 for some d}
      = inf{λ | L[r(d, θ) − λ]⁻ + U[r(d, θ) − λ]⁺ < 0 for some d}
      = inf{λ | β₂(λ) < 0} = λ₂. □

12.4. Posterior Variances

For the normal location problem, take L = N(0, 1), U = kL. For r(d, θ) = (d − θ)², B(Q) is the variance of θ. It is bounded by the solutions λ₁ and λ₂ of

    inf_d L{k[(d − θ)² − λ₁]⁻ + [(d − θ)² − λ₁]⁺} = 0,
    inf_d L{[(d − θ)² − λ₂]⁻ + k[(d − θ)² − λ₂]⁺} = 0.
By symmetry of U about 0, the optimal solution is d = 0 for both equations. (Consider the probability with density kφ on {(d − θ)² < λ₁} and φ on {(d − θ)² > λ₁}; the mean value of such a probability lies between d and 0 for k > 1, and this implies that the only solution to the first equation is d = 0. For the second equation, k[(d − θ)² − λ₂]⁺ + [(d − θ)² − λ₂]⁻ is convex in d, so its minimizing value is unique. By symmetry, if d is a minimum, so is −d. Therefore d = 0.)

The solutions λ₁ and λ₂ are posterior variances of elements of (L, U), so the bounds of Theorem 12.3 are sharp.

    λ₁ = A₁², where A₁ is the solution of  Aφ(A) + (A² − 1)[Φ(A) − (k − 2)/(2k − 2)] = 0;
    λ₂ = A₂², where A₂ is the solution of  Aφ(A) + (A² − 1)[Φ(A) − (2k − 1)/(2k − 2)] = 0.

    k      1.25   1.50   1.75   2      2.5    3      4      5      6      7      8      9      10
    A₁     .947   .904   .870   .840   .792   .754   .697   .654   .621   .592   .574   .552   .535
    A₂    1.055  1.100  1.140  1.174  1.233  1.282  1.360  1.421  1.472  1.515  1.552  1.585  1.615

Thus, again, a very large change in the probability density causes a relatively minor change in posterior variance. [A factor of 10 for the density gives a factor of 2 for the variance.]

12.5. Intervals of Posterior Probabilities

Lemma. L ≤ cQ ≤ U for some c, if and only if QX/QY ≤ UX/LY for each X ≥ 0, Y ≥ 0.

Proof. The "only if" is obvious. If QX·LY ≤ UX·QY for all X, Y ≥ 0, then

    c₁ = sup_{X≥0} QX/UX ≤ c₂ = inf_{Y≥0} QY/LY,

so c₂LX ≤ QX ≤ c₂UX, and L ≤ (1/c₂)Q ≤ U, as required. □

Theorem. Let X, Y be random variables satisfying the conditions of Bayes' theorem (3.4):

(i) X, Y and X × Y are random variables from U, 𝒵 to S, 𝒳; T, 𝒴; and S × T, 𝒳 × 𝒴;
(ii) f is a density on 𝒳 × 𝒴;
(iii) 𝒳 and 𝒴 are σ-finite;
(iv) f_T(t): s → f(s, t) ∈ 𝒳 for each t;
(v) P_X^Y g = R^Y(g f_s) for some probability R on 𝒴;
(vi) for each Q^X with L^X ≤ Q^X ≤ U^X, f/Q^X(f_T) is a density on 𝒳 × 𝒴.

Then the quotient probability Q_Y^X corresponding to the prior probability Q^X satisfies, for some k(Y),

    L_Y^X ≤ k(Y) Q_Y^X ≤ U_Y^X · (U^X f_T / L^X f_T).

Proof.
By Bayes' theorem, Q_Y^X h = Q^X(h f_T)/Q^X(f_T), so

    Q_Y^X h₁ / Q_Y^X h₂ = Q^X(h₁ f_T)/Q^X(h₂ f_T) ≤ U^X(h₁ f_T)/L^X(h₂ f_T).

From the lemma, the result follows. □

12.6. Asymptotic Behavior of Posterior Intervals

Theorem.
(i) Let X, Y₁, Y₂, ..., Yₙ, ... be random variables from 𝒵 to 𝒳, 𝒴₁, ..., 𝒴ₙ, ..., and assume that Yₙ⁻¹(𝒴ₙ) is increasing in n.
(ii) Let the quotient probability P_X^{Yₙ}[g] = R^{Yₙ}(g fⁿ_X) for some fⁿ which is a density with respect to 𝒳 × 𝒴ₙ on S × Tₙ, and some probability R on 𝒴ₙ. Assume that these quotient probabilities agree with a conditional probability P_X defined on the smallest probability space including all Yₙ⁻¹(𝒴ₙ): P_X gₙ(Yₙ) = P_X^{Yₙ} gₙ for gₙ ∈ 𝒴ₙ.
(iii) Assume that P_X^{Yₙ} is unitary.
(iv) Let L^X, U^X be unitary probabilities such that fⁿ/L^X(fⁿ_T) is a density with respect to 𝒳 × 𝒴ₙ, and L^X g = U^X[lg] where U^X{l = 0} = 0.
(v) Assume that g₀ ∈ 𝒳 and l ∈ 𝒳 are 𝒴ₙ-approximable in U-probability; that is, there are kₙ, kₙ' ∈ 𝒴ₙ with U^X P_X^{Yₙ}|g₀ − kₙ| → 0 and U^X P_X^{Yₙ}|l − kₙ'| → 0.

Then

    sup_{L^X ≤ Q^X ≤ U^X} |Q_{Yₙ}^X g₀ − g₀(X)| → 0 almost surely.

Proof. Assume l(X) > 0 without loss of generality. Let A be the set of values of X such that, for all rational λ,

    U_{Yₙ}^X {l[g₀ − λ]⁺ + [g₀ − λ]⁻} → l(X)[g₀(X) − λ]⁺ + [g₀(X) − λ]⁻  a.s. P_X.

By Doob's theorem (Doob, 1949), U^X(Aᶜ) = 0. For a fixed value of X in A, suppose g₀(X) > α, α rational. Then l(X)[g₀(X) − α]⁺ + [g₀(X) − α]⁻ > 0, so U_{Yₙ}^X {l[g₀ − α]⁺ + [g₀ − α]⁻} > 0 for large n, so sup_{L≤Q≤U} Q_{Yₙ}^X g₀ > α for all large n, a.s. P_X. Similarly, if g₀(X) < β for β rational, inf_{L≤Q≤U} Q_{Yₙ}^X g₀ < β for all large n, a.s. P_X. Since these results hold for all rational α and β, sup_{L≤Q≤U} |Q_{Yₙ}^X g₀ − g₀(X)| → 0 a.s. P_X, except for a set of X values of U-probability zero. □

Note: A similar theorem is proved in DeRobertis and Hartigan (1981) with L, U σ-finite.

12.7. Asymptotic Intervals under Asymptotic Normality

Theorem. Let X, Y₁, ..., Yₙ, ...
be random variables satisfying the conditions of Theorem 12.6, and, in addition, assume that g₀(X) is asymptotically conditionally normal under U:

    U_{Yₙ}^X c[(g₀ − μₙ)/σₙ] → ∫ c(u) exp(−u²/2) du/√(2π)

for each bounded continuous c, where μₙ = U_{Yₙ}^X g₀ and σₙ² = U_{Yₙ}^X g₀² − μₙ². Then, as U,

    σₙ⁻¹[sup_{L≤Q≤U} Q_{Yₙ}^X g₀ − (μₙ + σₙ y(kₙ))] → 0,
    σₙ⁻¹[inf_{L≤Q≤U} Q_{Yₙ}^X g₀ − (μₙ − σₙ y(kₙ))] → 0,

where y(k) is the solution of ky = (k − 1)[φ(y) + yΦ(y)], and kₙ = 1/U_{Yₙ}^X l(X).

Proof. By 12.6, kₙ = 1/U_{Yₙ}^X l → 1/l(X) as U. By asymptotic normality of g₀(X), for each λ,

    U_{Yₙ}^X {l[(g₀ − μₙ)/σₙ − λ]⁺ + [(g₀ − μₙ)/σₙ − λ]⁻} → ∫ {l(X)(u − λ)⁺ + (u − λ)⁻} φ(u) du  as U.

Let A be the set of X values for which l(X) > 0 and the above convergence occurs for all rational λ. If y[1/l(X)] > α, then ∫{l(X)(u − α)⁺ + (u − α)⁻}φ(u) du > 0, so U_{Yₙ}^X {l[(g₀ − μₙ)/σₙ − α]⁺ + [(g₀ − μₙ)/σₙ − α]⁻} > 0 for all large n, so sup_{L≤Q≤U} Q_{Yₙ}^X [(g₀ − μₙ)/σₙ] > α for all large n. Also y(kₙ) → y[1/l(X)] as U. Thus σₙ⁻¹[sup Q_{Yₙ}^X g₀ − (μₙ + σₙ y(kₙ))] → 0, and similarly for the infimum. □

Note: This theorem permits a close approximation to the interval of posterior means Q_{Yₙ}^X g₀(X), computed by assuming that g₀(X) has upper probability N(μₙ, σₙ²) and lower probability N(μₙ, σₙ²)·U_{Yₙ}^X[l(X)].

12.8. A More General Range of Probabilities

If L ≤ Q ≤ U, and the measures have densities l, q, u, then l ≤ q ≤ u and q(s₁)/q(s₂) ≤ u(s₁)/l(s₂). This formulation has the advantage of permitting Q to be a unitary probability. A difficulty in the present interval of probabilities is that q may be dramatically discontinuous.

More generally, let q(s₁)/q(s₂) ≤ u(s₁, s₂) define Q ∈ R. The function u(s₁, s₂) might be such that u(s₁, s₂) → 1 as s₁ → s₂. Necessarily u(s₁, s₂) ≤ u(s₁, s₃)u(s₃, s₂). The posterior density satisfies q_t(s₁)/q_t(s₂) ≤ [f_{s₁}(t)/f_{s₂}(t)] u(s₁, s₂), so posterior densities are handled in the same framework. It is sometimes convenient to use log[q(s₁)/q(s₂)] ≤ ρ(s₁, s₂).
Then ρ(s₁, s₂) ≤ ρ(s₁, s₃) + ρ(s₃, s₂). (Note that ρ may be negative, so it is not a metric.)

For conditional densities f_s(t) it seems desirable to constrain movements in f_{s₁}(t) and f_{s₂}(t) where s₁ and s₂ are close. This suggests the baroque

    [f_{s₁}(t₁)/f_{s₂}(t₁)] / [f_{s₁}(t₂)/f_{s₂}(t₂)] ≤ u(s₁, s₂, t₁, t₂).

Again posterior densities obey a bound of the same type. Maybe u(s₁, s₂, t₁, t₂) = u(s₁, s₂)v(t₁, t₂) would be viable, but it doesn't force u(s₁, s₂, t₁, t₂) = 1 if s₁ = s₂ or t₁ = t₂.

It is necessary to decide if QX ≥ 0 for all Q ∈ R. Let Aᶜ = {s | X(s) ≥ 0}, A = {s | X(s) < 0}. Then q(s) = inf_{s'∈Aᶜ} q(s')u(s, s') for s ∈ A, and q(s') = sup_{s∈A} q(s)/u(s, s') for s' ∈ Aᶜ, in a solution which minimizes QX. But I can't see any simple way to characterize X with QX ≥ 0, and such a characterization really is necessary to use the range.

12.9. Problems

E1. A prior distribution for the binomial parameter p is such that no interval of length l has more than twice the probability of any other interval of length l, for all l. Show Pp ≥ √2 − 1.

E2. Let X₁, ..., Xₙ be n observations from N(θ, 1). Let the prior for θ be Q, L ≤ Q ≤ U, where L is Lebesgue measure and U = 2L. Show that the posterior mean lies in the interval X̄ ± .276/√n.

P1. Suppose 7 successes are observed in 10 binomial trials. Let the prior for p lie between U = uniform (0, 1) and 2U. Find the range of the posterior mean. [Hint: use the binomial cumulative distribution.]

P2. A prior distribution for the binomial parameter p lies between U(0, 1) and 2U(0, 1). Find the range of the variance of p.

P3. Consider densities of form f_d, k ≥ 1. Find the value of d for which the density f_d has minimum variance.

P4. Suppose r successes are observed in n binomial trials. Let the prior for p lie between U(0, 1) and 2U(0, 1). Find an asymptotic expression for the interval of posterior means.

P5. Let f = (1/√(2π)) exp[−x²/2 + ε(x)] where |ε(x)| ≤ 1.
Find bounds for the posterior density of θ given X₁, ..., Xₙ, where X₁, ..., Xₙ is a sample from f(x − θ), and θ has uniform prior density.

12.10. References

Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis. Reading: Addison-Wesley.
DeRobertis, L. and Hartigan, J. A. (1981), Bayesian inference using intervals of measures, Annals of Statistics 9, 235-244.
Doob, J. L. (1949), Applications of the theory of martingales, Colloques Internationaux du Centre National de la Recherche Scientifique, Paris, 22-28.

CHAPTER 13
Nonparametric Bayes Procedures

13.0. Introduction

Whereas Bayes procedures require detailed probability models for observations and parameters, nonparametric procedures work with a minimum of probabilistic assumptions. It is therefore of interest to examine nonparametric problems from a Bayesian point of view.

Usually nonparametric procedures apply to samples of observations from an unknown distribution function F. Inferences are made which are true for all continuous F. For example, if X₍₁₎, X₍₂₎, ..., X₍ₙ₎ denote the order statistics of the sample, [X₍ₖ₎, X₍ₖ₊₁₎] is a confidence interval for the population median of size C(n, k)2⁻ⁿ, provided the true F is continuous.

We must give some sort of family of distributions over distribution functions F which can be used as priors and posteriors in a Bayesian approach. Ferguson (1973) suggests the Dirichlet process, which for a general observation space 𝒴 gives a distribution over probabilities P on 𝒴 such that P(B₁), ..., P(Bₖ) is Dirichlet whenever {Bⱼ} is a partition of the sample space. No unitary prior is known to reproduce nonparametric confidence procedures; worse, no prior of any sort is known that reproduces such confidence procedures. However some confidence procedures correspond to families of conditional probabilities, and Lane and Sudderth (1978) have used finitely additive probabilities to generate confidence procedures.

13.1.
The Dirichlet Process

Let 𝒴 on T be a probability space, let 𝒫 be the set of unitary probabilities P on 𝒴, and let 𝒳 denote the smallest probability space on 𝒫 such that X: P → PY lies in 𝒳 for all Y in 𝒴. A Dirichlet process D_α on 𝒳, corresponding to a bounded measure α on T (α(T) < ∞), is such that PB₁, PB₂, ..., PBₖ is distributed as a Dirichlet D_{α(B₁), α(B₂), ..., α(Bₖ)} for each partition B₁, B₂, ..., Bₖ of T. Proofs that a Dirichlet process exists are given in Ferguson (1973) and Blackwell and MacQueen (1973).

Following Blackwell and MacQueen, a sequence of random variables Y₁, Y₂, Y₃, ... taking values in 𝒴, T is a Pólya sequence with parameter α if

    P[f(Y₁)] = α(f)/α(T), for f ∈ 𝒴,
    Pᵢf = P[f(Yᵢ₊₁) | Y₁, Y₂, ..., Yᵢ] = [α(f) + Σ_{j≤i} f(Yⱼ)]/[α(T) + i].

Given Y₁, Y₂, ..., Yᵢ, the distribution of Yᵢ₊₁ is a mixture, in the proportion of i to α(T), of the empirical distribution based on Y₁, Y₂, ..., Yᵢ and the distribution α/α(T). As i approaches ∞, the empirical component predominates and the limiting distribution of Yᵢ₊₁ given Y₁, Y₂, ..., Yᵢ is the limiting frequency distribution of Y₁, Y₂, ..., Yᵢ. This limiting distribution, when it exists, will be taken to be a realization P of the Dirichlet process. The different limiting distributions P, for different sequences Y₁, Y₂, ..., Yₙ, ..., give a distribution of probabilities P that satisfy the definition of the Dirichlet process.

Theorem (Blackwell-MacQueen). Let 𝒴 on T be separable (there exists a sequence of 0-1 functions A₁, A₂, ..., Aₙ, ... such that 𝒴 is the smallest probability space including A₁, A₂, ..., Aₙ, ...). Let {Yᵢ} be a Pólya sequence with parameter α. For each Y₁, Y₂, ..., Yᵢ, ..., define

    P*f = lim_{n→∞} Σ f(Yᵢ)/n  when the limit exists for all f in 𝒴,
    P*f = αf/αT  when the limit does not exist for all f in 𝒴.
Then P* is distributed as a Dirichlet process D_α on 𝒳, and the conditional distribution of Y₁, Y₂, ..., Yᵢ, ... given P* is such that the Yᵢ are independent, each with distribution P*.

Proof. If Y₁, Y₂, Y₃, ... is a Pólya sequence, the Ionescu Tulcea theorem (Neveu, 1965, p. 162) states that a probability P exists on the product space T × T × ..., 𝒴 × 𝒴 × ... such that P is consistent with each of the conditional probabilities Pᵢ.

The separability of 𝒴 guarantees that functions of the form {Y₁ = Y₂} lie in 𝒴 × 𝒴 provided 𝒴 includes all singleton functions {t}, t ∈ T. [{Y₁ ≠ Y₂} = sup_i |{Y₁ ∈ Aᵢ} − {Y₂ ∈ Aᵢ}| ∈ 𝒴 × 𝒴; otherwise there exist y₁ ≠ y₂ such that y₁, y₂ lie both inside or both outside of every Aᵢ, and the 𝒴 generated by the Aᵢ consists of functions f with f(y₁) = f(y₂); thus the singleton function {y₁} is excluded from 𝒴.] If 𝒴 does not include all singletons, consider the space 𝒴*, T*, where T* consists of the equivalence classes B_t, t ∈ T; t' ∈ B_t if and only if {t ∈ Aᵢ} = {t' ∈ Aᵢ} for all Aᵢ; and 𝒴* consists of the functions f*(B_t) = f(t), f ∈ 𝒴. Note that f* is well defined since f(t) = f(t') whenever t' ∈ B_t. Now 𝒴* is separable and includes all singleton functions, and the theorem may be proved for 𝒴*. The Dirichlet process P* defined on 𝒴* × 𝒴* × ... has the desired properties. It will therefore be assumed that 𝒴 includes singletons.

It will be shown first that P*f = lim Σ f(Yᵢ)/n exists except for sequences {Yᵢ} in a set of probability zero. Let f₁(t) = {Y₁ = t}. Then f₁(Yₙ) is an exchangeable sequence given Y₁, so from de Finetti's theorem, 4.5, Σ_{i=2}^n f₁(Yᵢ)/n converges to say P₁ with probability 1. Next let f₂(t) = {Y₁ = t or Y₂ = t}. Then {f₂(Yₙ), n > 2} is an exchangeable sequence given Y₁, Y₂, and so Σ_{i=3}^n f₂(Yᵢ)/n converges to say P₂ with probability 1. Similarly, if fₖ(t) = ∪_{i≤k} {Yᵢ = t}, then Σ fₖ(Yᵢ)/n converges to say Pₖ with probability 1, for all k, 1 ≤ k ≤ ∞. Now
    P[fₖ(Yₙ) | Y₁, Y₂, ..., Yₖ] = [k + αfₖ]/[k + αT] → 1 as k → ∞,
    P[Σ fₖ(Yᵢ)/(n − k) | Y₁, Y₂, ..., Yₖ] = P[fₖ(Yₙ) | Y₁, ..., Yₖ] → 1 as k → ∞,
    P[Σ fₖ(Yᵢ)/n > 1 − ε | Y₁, ..., Yₖ] → 1 as k → ∞, each ε > 0,
    P[Pₖ > 1 − ε | Y₁, ..., Yₖ] → 1 as k → ∞.

A probability P exists on 𝒴 × 𝒴 × 𝒴 × ... consistent with these conditional probabilities, and from separability, the functions fₖ(Yᵢ) lie in 𝒴 × 𝒴 × 𝒴 × .... Thus with probability 1, all the limits Σ fₖ(Yᵢ)/n exist and

    lim_{k→∞} lim_{n→∞} Σ fₖ(Yᵢ)/n = 1.

This guarantees that P*f = lim Σ f(Yᵢ)/n exists for all f in 𝒴; P* is a discrete distribution carried by {Yᵢ}. To show this,

    Σ_{i=1}^n f(Yᵢ)/n = f(Y₁)[Σ f₁(Yᵢ)/n] + f(Y₂)[Σ f₂(Yᵢ) − Σ f₁(Yᵢ)]/n + ... + f(Yₖ)[Σ fₖ(Yᵢ) − Σ fₖ₋₁(Yᵢ)]/n + A[n − Σ fₖ(Yᵢ)]/n, where |A| ≤ sup|f|,

    lim_{n→∞} |Σ f(Yᵢ)/n − (f(Y₁)P₁ + ... + f(Yₖ)(Pₖ − Pₖ₋₁))| ≤ sup|f| (1 − Pₖ).

Thus all the limits Σ f(Yᵢ)/n exist if the Pᵢ exist and lim_k Pₖ = 1. It is straightforward to show that P*f = Σ f(Yᵢ)(Pᵢ − Pᵢ₋₁), P₀ = 0, defines a probability on 𝒴, for each sequence Y₁, Y₂, ..., Yᵢ, ... where the limits exist. Since P*f = αf/αT defines a probability when the limits don't exist, P* always takes values in 𝒫.

To show that P* is distributed as D_α, it is necessary to show that P*B₁, ..., P*Bₖ is distributed as Dirichlet D_{α(B₁), ..., α(Bₖ)} for each partition B₁, ..., Bₖ of T. Define Zᵢ = j when Yᵢ ∈ Bⱼ. Then Zᵢ is a random variable taking k discrete values, and it may be shown that Zᵢ is a Pólya sequence with parameter α*, α*{j} = α(Bⱼ). Since P*Bⱼ = lim Σ {Zᵢ = j}/n, the problem is reduced to showing that P* is distributed as D_α when T is finite. If T = {1, 2,
..., k}, let P* = (p₁, p₂, ..., pₖ), Σ pᵢ = 1, and note from de Finetti's theorem that the Yᵢ are independent multinomial given P*, with P[Yᵢ = j | P*] = pⱼ. Then

    P[p₁^{γ₁} p₂^{γ₂} ··· pₖ^{γₖ}] = P[first γ₁ Y's = 1, next γ₂ Y's = 2, ..., last γₖ Y's = k]
      = [α(1)/α(T)]·[(α(1) + 1)/(α(T) + 1)] ··· [(α(1) + γ₁ − 1)/(α(T) + γ₁ − 1)]
        × [α(2)/(α(T) + γ₁)]·[(α(2) + 1)/(α(T) + γ₁ + 1)] ··· .

The expression on the right is the (γ₁, γ₂, ..., γₖ)th moment of a Dirichlet distribution with parameters α(1), α(2), ..., α(k), Σα(i) = α(T). Since the Dirichlet distribution is characterized by its moments, the result follows.

It remains to be shown that the Yᵢ are independent given P* with distribution P*. It is necessary to check that P[Π fᵢ(Yᵢ) | P*] = Π P*[fᵢ(Yᵢ)] obeys the product law: P[Π P*fᵢ(Yᵢ)] = P[Π fᵢ(Yᵢ)]. If the fᵢ are each members of the partition B₁, B₂, ..., Bₖ this follows from de Finetti's theorem for the case 𝒴 finite. More general fᵢ may be approximated by linear combinations of these simple fᵢ. □

13.2. The Dirichlet Process on (0, 1)

(1) Let α be uniform. F(x) has expectation x; thus F(x) is beta with density proportional to F^{x−1}(1 − F)^{−x}.

We could generate a single random F from D_α as follows:

(a) select F(1/2) from Be(1/2, 1/2);
(b) select F(1/4)/F(1/2) from Be(1/4, 1/4), and [F(3/4) − F(1/2)]/[1 − F(1/2)] from Be(1/4, 1/4);
(c) select [F((2k + 1)/2ⁿ) − F(k/2ⁿ⁻¹)]/[F((k + 1)/2ⁿ⁻¹) − F(k/2ⁿ⁻¹)] from Be(1/2ⁿ, 1/2ⁿ);
(d) continue forever ... after you're finished:

[Figure: a realization of F.]

Note that F will be quite bumpy, because the relative changes in F will be near 0 or 1.

(2) α gives weight 1 to each of 1/4, 1/2, 3/4, 1 and is zero elsewhere. F(x) is Be[α(0, x], α(x, 1]]. Thus F(x) = 0 for x < 1/4, and F(x) = 1 for x = 1; F(x) changes value only at x = 1/4, 1/2, 3/4, 1 and has atoms ΔF(1/4), ΔF(1/2), ΔF(3/4), ΔF(1) which are Dirichlet D₁,₁,₁,₁.
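The construction (a)-(d) can be truncated at a finite dyadic level to simulate such an F. A minimal sketch, assuming α uniform on (0, 1); the function name and the truncation level are my own:

```python
import random

def random_F(m, seed=None):
    """Generate F at the dyadic points j/2^m for a Dirichlet process with
    uniform alpha on (0, 1), by recursive beta splitting."""
    rng = random.Random(seed)
    F = {0.0: 0.0, 1.0: 1.0}
    for level in range(1, m + 1):
        step = 1.0 / 2 ** level
        for j in range(1, 2 ** level, 2):   # new dyadic midpoints at this level
            left, right = (j - 1) * step, (j + 1) * step
            # the mass F(right) - F(left) is split by a Be(1/2^level, 1/2^level)
            b = rng.betavariate(step, step)
            F[j * step] = F[left] + b * (F[right] - F[left])
    return [F[j / 2 ** m] for j in range(2 ** m + 1)]

F = random_F(6, seed=1)   # values of F at j/64, j = 0, ..., 64
```

Because the beta parameters shrink at each level, the sampled splits pile up near 0 or 1, which is exactly the bumpiness remarked on above.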
[Figure: another realization of F.]

13.3. Bayes Theorem for a Dirichlet Process

Theorem. Let α be a finite measure on 𝒴, and let D_α be a prior distribution on 𝒫, the family of probabilities P on 𝒴. Let t be an observation from T according to P. The posterior distribution of P given t is D_{α+δₜ}, where δₜY = Y(t).

Proof. The joint probability Q on 𝒫 × 𝒴 is defined by

    Q[Z(P) × Y] = Q^𝒫(Z(P) Q_P^𝒴 Y) = D_α(Z(P) PY).

The marginal distribution on 𝒫 is D_α; the marginal distribution on 𝒴 is αY/α1. If B₁, ..., Bₖ is a partition of T, then P(YBᵢ)/PBᵢ is independent of PB₁, ..., PBₖ when P is a Dirichlet process. (Since P(YBᵢ)/PBᵢ is the limit of a linear combination of {PBᵢⱼ/PBᵢ} for disjoint sets Bᵢⱼ, ∪ⱼ Bᵢⱼ = Bᵢ, and the {PBᵢⱼ/PBᵢ} are Dirichlet {α(Bᵢⱼ)/αBᵢ} independent of PBᵢ.) Thus if Z(P) is of form Z(PB₁, ..., PBₖ) = Zₖ say,

    D_α[Z(PB₁, ..., PBₖ) PY] = Σⱼ D_α[Z(PB₁, ..., PBₖ) P(YBⱼ)]
      = Σⱼ D_α[Z(PB₁, ..., PBₖ) PBⱼ] α(YBⱼ)/αBⱼ.

If t ∈ Bᵢ, then

    D_{α+δₜ}Zₖ = D_{αB₁, ..., αBᵢ+1, ..., αBₖ}Zₖ = D_α(Zₖ PBᵢ)/D_α(PBᵢ).

Thus

    Q D_{α+δₜ}(Zₖ × Y) = Q(Σᵢ Y 1_{Bᵢ} D_α(Zₖ PBᵢ)/D_α PBᵢ) = Σᵢ α(YBᵢ) D_α(Zₖ PBᵢ)/αBᵢ.

Thus Q D_{α+δₜ}(Zₖ × Y) = Q(Zₖ × Y) whenever Z = Zₖ depends only on {PBᵢ}. Taking limits, the result holds for all Z, and so D_{α+δₜ} is the posterior distribution of P given t. □

13.4. The Empirical Process

The limiting case of the Dirichlet D_α occurs when α ≡ 0; this corresponds to the prior density 1/p(1 − p) for a binomial parameter p, which is not unitary. It is very difficult to imagine generating P from D₀. The conditional distributions of P and tₙ₊₁ given t₁, ..., tₙ are nice and simple:

(i) Pₙ Y = P[Y(Tₙ₊₁) | t₁, ..., tₙ] = (1/n) Σ Y(tᵢ), the empirical distribution over {tᵢ};
(ii) P | t₁, ..., tₙ is Dirichlet D_{Σδ_{tᵢ}}; thus [P(t₁), ..., P(tₙ)] is Dirichlet D₁,...,₁ and P is a discrete distribution carried by the observed sample points t₁, ..., tₙ. See Hartigan (1971).
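Sampling from the posterior in (ii) reduces to drawing Dirichlet(1, ..., 1) weights on the observed points; a standard way is to normalize independent exponentials. A sketch (the function name is mine):

```python
import random

def posterior_mean_draws(data, ndraws, seed=None):
    """Draws of the mean of P, where P | data is Dirichlet D_{1,...,1} on the
    observed points. Dirichlet(1,...,1) weights are obtained by normalizing
    independent standard exponentials."""
    rng = random.Random(seed)
    out = []
    for _ in range(ndraws):
        e = [rng.expovariate(1.0) for _ in data]
        s = sum(e)
        out.append(sum(w / s * t for w, t in zip(e, data)))
    return out

draws = posterior_mean_draws([5, 7, 10, 11, 15], 2000, seed=0)
```

The draws center on the sample mean, with variance close to Σ(tᵢ − t̄)²/n(n + 1), the exact posterior variance of the mean under D₁,...,₁.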
We are in the embarrassing position of declaring tₙ₊₁ to be surely equal to one of the previous sample points; carried back, this would imply tₙ = t₁ with probability 1, which is dull. We can pretend to be surprised at each new observation tₙ which is not equal to t₁, ..., tₙ₋₁; after all, events of probability zero do occur. But our credibility may be weakened by our always insisting that the next observation is just one of the previous ones, and our always being surprised!

To do error analysis of a parameter Y(P) estimated by Y(Pₙ), we compute the distribution of YP where P ~ D_{Σδ_{tᵢ}}. For example, if Y(Pₙ) = (1/n)Σtᵢ = t̄, then Y(P) = ∫t dF is distributed as Σpᵢtᵢ where p ~ D₁,...,₁; P(Σpᵢtᵢ) = t̄, var(Σpᵢtᵢ) = Σ(tᵢ − t̄)²/n(n + 1). This procedure gives approximate error behavior for any statistic based on the empirical distribution; it works best when, as here for the mean, the excessive discreteness is smoothed out by the statistic.

13.5. Subsample Methods

Consider the following competitors of the empirical process for generating posterior distributions of a functional Y(P) estimated by Y(Pₙ):

(i) Subsamples: Select a random subsample t_{i₁}, ..., t_{iᵣ} of t₁, ..., tₙ, where the ith observation lies in the subsample with probability 1/2; regard k random subsample values Y(Pₙ) as a random sample from the posterior distribution of Y(P). See Hartigan (1969).

(ii) Jackknife: Divide the sample into disjoint groups of size k (randomly, say). Define the jth pseudo-value by

    Ỹⱼ = (n/k) Y(t₁, ..., tₙ) − (n/k − 1) Y(t₁, ..., tₙ less jth group).

Act as if Y(P) is a location parameter and {Ỹⱼ} is a sample from N(Y(P), σ²). See Tukey (1958).

For example, let n = 50 and suppose Y(P) denotes the correlation of a bivariate distribution. For the empirical process, sample p₁, ..., p₅₀ from D₁,...,₁ and recompute the correlation on the data values weighted by pᵢ; obtain 3 such values. For subsamples, select 3 random subsamples each of size roughly 25.
Do jackknifing with group size 25; if r₁ and r₂ are the correlations on the two groups, the pseudo-values are 2r − r₁ and 2r − r₂; the values 2r − r₁, 2r − r₂, 2r − (r₁ + r₂)/2 are regarded as a random sample from the posterior distribution. Each of the techniques gives 3 values, which divide the line into four intervals; in 100 repetitions, the true correlations lay in the four intervals as follows:

                         Bivariate normal, ρ = .95      Mixture of normals
    Expected              25   25   25   25              25   25   25   25
    Empirical process     31   25   22   22              37   28   19   16
    Subsamples            31   23   28   20              33   23   27   17
    Jackknife             28   29   21   21              28   27   25   21

That's a bit nasty! By what accident could the humble ad hoc jackknife beat such delightful Bayesian trickery? Hartigan (1975) shows that the asymptotic inclusion probabilities are correct for the various techniques.

13.6. The Tolerance Process

If t₁, ..., tₙ form a sample from a continuous distribution function F, and if t₍₁₎, ..., t₍ₙ₎ denote the order statistics, then {t₍ₖ₋₁₎ < tₙ₊₁ < t₍ₖ₎} is a tolerance interval for tₙ₊₁ of size 1/(n + 1); that is, P[t₍ₖ₋₁₎ < tₙ₊₁ < t₍ₖ₎] = 1/(n + 1), averaging over all t₁, ..., tₙ, tₙ₊₁. After all, why should tₙ₊₁ be in any particular place in the ordered sample of t₁, t₂, ..., tₙ, tₙ₊₁?

The tolerance process defines tₙ₊₁ given t₁, ..., tₙ to be such that

    P[t₍ₖ₋₁₎ < tₙ₊₁ < t₍ₖ₎ | t₁, ..., tₙ] = 1/(n + 1).

More detailed probability statements are made as evidence accumulates. The joint distribution of tₙ₊₁, tₙ₊₂ given t₁, ..., tₙ may be computed by combining tₙ₊₁ | t₁, ..., tₙ with tₙ₊₂ | t₁, ..., tₙ, tₙ₊₁; more generally tₙ₊₁, tₙ₊₂, ... | t₁, ..., tₙ has a certain joint distribution. Obviously, PA = lim (1/n) Σ {tᵢ ∈ A}, so the distribution of P may be obtained from the distribution of tₙ₊₁, tₙ₊₂, ...; the distribution of P is just that F(t₍₁₎), F(t₍₂₎) − F(t₍₁₎), ..., F(t₍ₙ₎) − F(t₍ₙ₋₁₎), 1 − F(t₍ₙ₎) is D₁,...,₁, or F(t₁), ..., F(tₙ) is a random sample from U(0, 1).
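Since F(t₁), ..., F(tₙ) behave as a uniform sample and F(median) = 1/2, the event {t₍ₖ₎ < median < t₍ₖ₊₁₎} is just the event that exactly k observations fall below 1/2, with probability C(n, k)2⁻ⁿ. A quick Monte Carlo check (my own sketch, not from the text):

```python
import math, random

def median_between_prob(n, k, reps=200_000, seed=0):
    """Monte Carlo estimate of P[t_(k) < median < t_(k+1)] for a sample of n
    from a continuous F; by uniformity of F(t_i), take the median to be 1/2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        below = sum(rng.random() < 0.5 for _ in range(n))
        if below == k:      # exactly k of the n observations below the median
            hits += 1
    return hits / reps

n, k = 5, 2
exact = math.comb(n, k) * 2 ** -n   # C(5, 2)/32 = 10/32
est = median_between_prob(n, k)
```

The estimate agrees with the exact binomial probability, which is the coverage formula used for the median in the next paragraph.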
P[t₍ₖ₎ < median < t₍ₖ₊₁₎ | t₁, ..., tₙ] = C(n, k)2⁻ⁿ reproduces non-parametric confidence intervals for the median. Here the probability space on which the distribution of P is defined changes as evidence accumulates; Hill (1968) shows that no unitary probability on P and Y₁, ..., Yₙ will reproduce these conditional probabilities, but Lane and Sudderth (1978) show that a finitely additive probability P exists which produces these conditional probabilities.

13.7. Problems

E1. For observations 5, 7, 10, 11, 15 compute 50% confidence intervals for the median using the empirical process, subsamples, jackknife with group size 1, and the tolerance process.

E2. If Xₙ₊₁ | X₁, ..., Xₙ is such that P(X₍ₖ₎ ≤ Xₙ₊₁ ≤ X₍ₖ₊₁₎ | X₁, ..., Xₙ) = 1/(n + 1), find Xₙ₊₁, Xₙ₊₂ | X₁, ..., Xₙ.

P1. For p ~ D₁,...,₁, show that Σpᵢxᵢ has skewness of opposite sign to that of μ − x̄. Thus if the x's are positively skew, Σpᵢxᵢ tends to be less than x̄, but μ tends to be greater than x̄.

P2. Let t = Σaᵢxᵢ where aᵢ = [(zᵢ − (1/n)Σzⱼ)/Y] + (1/n), the zᵢ are independent N(0, 1), and Y is independently distributed as [nχ²ₙ₋₁]^{1/2}. If {xᵢ} is a sample from N(μ, σ²), and the prior density for (μ, σ²) is 1/σ², show that t | x₁, ..., xₙ and μ | x₁, ..., xₙ have the same distribution.

P3. Let X₁, ..., Xₙ be independent and symmetrically distributed about θ. Let Y₁, ..., Y₂ⁿ₋₁ denote the ordered means of the 2ⁿ − 1 non-empty subsets of X₁, ..., Xₙ. Show that P_θ(Yₖ < θ < Yₖ₊₁) = 2⁻ⁿ, 1 ≤ k ≤ 2ⁿ − 1.

13.8. References

Blackwell, David and MacQueen, James B. (1973), Ferguson distributions via Polya urn schemes, Annals of Statistics 1, 353-355.
Ferguson, T. S. (1973), A Bayesian analysis of some non-parametric problems, Annals of Statistics 1, 209-230.
Hartigan, J. A. (1969), Use of subsample values as typical values, J. Am. Stat. Ass. 64, 1303-1317.
--(1971), Error analysis by replaced samples, J. Roy. Statist. Soc. B 33, 98-110.
--(1975), Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values, Annals of Statistics 3, 573-580.
Hill, Bruce M. (1968), Posterior distributions of percentiles: Bayes' theorem for sampling from a population, J. Am. Stat. Ass. 63, 677-691.
Lane, David A. and Sudderth, William D. (1978), Diffuse models for sampling and predictive inference, Annals of Statistics 6, 1318-1336.
Neveu, J. (1965), Mathematical Foundations of the Calculus of Probability, San Francisco: Holden-Day.
Tukey, J. W. (1958), Bias and confidence in not-quite large samples, Ann. Math. Statist. 29, 614.
39,59,60 R Random variables ix, 18 Randonmess 4 Range of probabilities xii, 125 Rational belief 2 Recursive functions 3 Regression xi, 89, 92, 103 Relatively invariant 47 Rings ix, 16-18 Robustness 119 - 126 T Tail probability 75 Tolerance process xii, 134 Tortoise 44 Two stage normal priors 76 u Unbetworthiness 81 Unbiased, Bayes tests xi,65-66 location estimates xi, 63-65, 91-94 Uniform distribution 10, 15,75,84 on the integers 25 on the plane 26 on the square 25 Uniform integrability, generalized 35 Uniformity criteria x,63-71 Unitary probability 10, 15,25,26,30, TI,~,~,~,~-@,n,n,~, 108, 113, 123 v Variance 73,74,75 Variance, components xi,93-94 145 Subject Index x Xn-limit Bayes y 59,60 Yale 94