Springer Series in Statistics
Advisors:
D. Brillinger, S. Fienberg, J. Gani,
J. Hartigan, K. Krickeberg
Springer Series in Statistics
L. A. Goodman and W. H. Kruskal, Measures of Association for Cross Classifications.
x, 146 pages, 1979.
J. O. Berger, Statistical Decision Theory: Foundations, Concepts, and Methods. xiv,
425 pages, 1980.
R. G. Miller, Jr., Simultaneous Statistical Inference, 2nd edition. xvi, 299 pages, 1981.
P. Bremaud, Point Processes and Queues: Martingale Dynamics. xviii, 354 pages,
1981.
E. Seneta, Non-Negative Matrices and Markov Chains. xv, 279 pages, 1981.
F. J. Anscombe, Computing in Statistical Science through APL. xvi, 426 pages, 1981.
J. W. Pratt and J. D. Gibbons, Concepts of Nonparametric Theory. xvi, 462 pages,
1981.
V. Vapnik. Estimation of Dependences based on Empirical Data. xvi, 399 pages, 1982.
H. Heyer, Theory of Statistical Experiments. x, 289 pages, 1982.
L. Sachs, Applied Statistics: A Handbook of Techniques. xxviii, 706 pages, 1982.
M. R. Leadbetter, G. Lindgren and H. Rootzen, Extremes and Related Properties of
Random Sequences and Processes. xii, 336 pages, 1983.
H. Kres, Statistical Tables for Multivariate Analysis. xxii, 504 pages, 1983.
J. A. Hartigan, Bayes Theory. xii, 145 pages, 1983.
J. A. Hartigan
Bayes Theory
Springer-Verlag
New York Berlin Heidelberg Tokyo
J. A. Hartigan
Department of Statistics
Yale University
Box 2179 Yale Station
New Haven, CT 06520
U.S.A.
AMS Classification: 62A15
Library of Congress Cataloging in Publication Data
Hartigan, J. A.
Bayes theory.
(Springer series in statistics)
Includes bibliographies and index.
1. Mathematical statistics. I. Title. II. Series.
QA276.H392 1983
519.5
83-10591
With 4 figures.
© 1983 by Springer-Verlag New York Inc.
Softcover reprint of the hardcover 1st edition 1983
All rights reserved. No part of this book may be translated or reproduced in any
form without written permission from Springer-Verlag, 175 Fifth Avenue,
New York, New York 10010, U.S.A.
Typeset by Thomson Press (India) Limited, New Delhi, India.
9 8 7 6 5 4 3 2 1
ISBN-13: 978-1-4613-8244-7
e-ISBN-13: 978-1-4613-8242-3
DOI: 10.1007/978-1-4613-8242-3
To Jenny
Preface
This book is based on lectures given at Yale in 1971-1981 to students
prepared with a course in measure-theoretic probability.
It contains one technical innovation-probability distributions in which
the total probability is infinite. Such improper distributions arise embarrassingly frequently in Bayes theory, especially in establishing correspondences
between Bayesian and Fisherian techniques. Infinite probabilities create
interesting complications in defining conditional probability and limit
concepts.
The main results are theoretical, probabilistic conclusions derived from
probabilistic assumptions. A useful theory requires rules for constructing
and interpreting probabilities. Probabilities are computed from similarities,
using a formalization of the idea that the future will probably be like the past.
Probabilities are objectively derived from similarities, but similarities are
subjective judgments of individuals.
Of course the theorems remain true in any interpretation of probability
that satisfies the formal axioms.
My colleague David Pollard helped a lot, especially with Chapter 13.
Dan Barry read proof.
Contents

CHAPTER 1
Theories of Probability 1
1.0. Introduction 1
1.1. Logical Theories: Laplace 1
1.2. Logical Theories: Keynes and Jeffreys 2
1.3. Empirical Theories: Von Mises 3
1.4. Empirical Theories: Kolmogorov 5
1.5. Empirical Theories: Falsifiable Models 5
1.6. Subjective Theories: De Finetti 6
1.7. Subjective Theories: Good 7
1.8. All the Probabilities 8
1.9. Infinite Axioms 10
1.10. Probability and Similarity 11
1.11. References 13

CHAPTER 2
Axioms 14
2.0. Notation 14
2.1. Probability Axioms 14
2.2. Prespaces and Rings 16
2.3. Random Variables 18
2.4. Probable Bets 18
2.5. Comparative Probability 20
2.6. Problems 20
2.7. References 22

CHAPTER 3
Conditional Probability 23
3.0. Introduction 23
3.1. Axioms of Conditional Probability 24
3.2. Product Probabilities 26
3.3. Quotient Probabilities 27
3.4. Marginalization Paradoxes 28
3.5. Bayes Theorem 29
3.6. Binomial Conditional Probability 31
3.7. Problems 32
3.8. References 33

CHAPTER 4
Convergence 34
4.0. Introduction 34
4.1. Convergence Definitions 34
4.2. Mean Convergence of Conditional Probabilities 35
4.3. Almost Sure Convergence of Conditional Probabilities 36
4.4. Consistency of Posterior Distributions 38
4.5. Binomial Case 38
4.6. Exchangeable Sequences 40
4.7. Problems 42
4.8. References 43

CHAPTER 5
Making Probabilities 44
5.0. Introduction 44
5.1. Information 44
5.2. Maximal Learning Probabilities 45
5.3. Invariance 47
5.4. The Jeffreys Density 48
5.5. Similarity Probability 50
5.6. Problems 53
5.7. References 55

CHAPTER 6
Decision Theory 56
6.0. Introduction 56
6.1. Admissible Decisions 56
6.2. Conditional Bayes Decisions 58
6.3. Admissibility of Bayes Decisions 59
6.4. Variations on the Definition of Admissibility 61
6.5. Problems 62
6.6. References 62

CHAPTER 7
Uniformity Criteria for Selecting Decisions 63
7.0. Introduction 63
7.1. Bayes Estimates Are Biased or Exact 63
7.2. Unbiased Location Estimates 64
7.3. Unbiased Bayes Tests 65
7.4. Confidence Regions 67
7.5. One-Sided Confidence Intervals Are Not Unitary Bayes 68
7.6. Conditional Bets 68
7.7. Problems 69
7.8. References 71

CHAPTER 8
Exponential Families 72
8.0. Introduction 72
8.1. Examples of Exponential Families 73
8.2. Prior Distributions for the Exponential Family 73
8.3. Normal Location 74
8.4. Binomial 76
8.5. Poisson 79
8.6. Normal Location and Scale 79
8.7. Problems 82
8.8. References 83

CHAPTER 9
Many Normal Means 84
9.0. Introduction 84
9.1. Baranchik's Theorem 84
9.2. Bayes Estimates Beating the Straight Estimate 86
9.3. Shrinking towards the Mean 88
9.4. A Random Sample of Means 89
9.5. When Most of the Means Are Small 89
9.6. Multivariate Means 91
9.7. Regression 92
9.8. Many Means, Unknown Variance 92
9.9. Variance Components, One Way Analysis of Variance 93
9.10. Problems 94
9.11. References 95

CHAPTER 10
The Multinomial Distribution 96
10.0. Introduction 96
10.1. Dirichlet Priors 96
10.2. Admissibility of Maximum Likelihood, Multinomial Case 97
10.3. Inadmissibility of Maximum Likelihood, Poisson Case 99
10.4. Selection of Dirichlet Priors 100
10.5. Two Stage Poisson Models 101
10.6. Multinomials with Clusters 101
10.7. Multinomials with Similarities 102
10.8. Contingency Tables 103
10.9. Problems 104
10.10. References 105

CHAPTER 11
Asymptotic Normality of Posterior Distributions 107
11.0. Introduction 107
11.1. A Crude Demonstration of Asymptotic Normality 108
11.2. Regularity Conditions for Asymptotic Normality 108
11.3. Pointwise Asymptotic Normality 111
11.4. Asymptotic Normality of Martingale Sequences 113
11.5. Higher Order Approximations to Posterior Densities 115
11.6. Problems 116
11.7. References 118

CHAPTER 12
Robustness of Bayes Methods 119
12.0. Introduction 119
12.1. Intervals of Probabilities 120
12.2. Intervals of Means 120
12.3. Intervals of Risk 121
12.4. Posterior Variances 122
12.5. Intervals of Posterior Probabilities 122
12.6. Asymptotic Behavior of Posterior Intervals 123
12.7. Asymptotic Intervals under Asymptotic Normality 124
12.8. A More General Range of Probabilities 125
12.9. Problems 126
12.10. References 126

CHAPTER 13
Nonparametric Bayes Procedures 127
13.0. Introduction 127
13.1. The Dirichlet Process 127
13.2. The Dirichlet Process on (0, 1) 130
13.3. Bayes Theorem for a Dirichlet Process 131
13.4. The Empirical Process 132
13.5. Subsample Methods 133
13.6. The Tolerance Process 134
13.7. Problems 134
13.8. References 135

Author Index 137
Subject Index 141
CHAPTER 1
Theories of Probability
1.0. Introduction
A theory of probability will be taken to be an axiom system that probabilities
must satisfy, together with rules for constructing and interpreting probabilities. A person using the theory will construct some probabilities
according to the rules, compute other probabilities according to the axioms,
and then interpret these probabilities according to the rules; if the interpretation is unreasonable perhaps the original construction will be adjusted.
To begin with, consider the simple finite axioms in which there are a
number of elementary events just one of which must occur, events are unions
of elementary events, and the probability of an event is the sum of the nonnegative probabilities of the elementary events contained in it.
There are three types of theory-logical, empirical and subjective. In logical
theories, the probability of an event is the rational degree of belief in the
event relative to some given evidence. In empirical theories, a probability is a
factual statement about the world. In subjective theories, a probability is an
individual degree of belief; these theories differ from logical theories in that
different individuals are expected to have different probabilities for an event,
even when their knowledge is the same.
1.1. Logical Theories: Laplace
The first logical theory is that of Laplace (1814), who defined the probability
of an event to be the number of favorable cases divided by the total number
of cases possible. Here cases are elementary events; it is necessary to identify
equiprobable elementary events in order to apply Laplace's theory. In many
gambling problems, such as tossing a die or drawing from a shuffled deck
of cards, we are willing to accept such equiprobability judgments because
of the apparent physical indistinguishability of the elementary events-the
particular face of the die to fall, or the particular card to be drawn. In other
problems, such as the probability of it raining tomorrow, the equiprobable
alternatives are not easily seen. Laplace, following Bernoulli (1713), used
the principle of insufficient reason, which specifies that probabilities of two
events will be equal if we have no reason to believe them different. An early
user of this principle was Thomas Bayes (1763), who apologetically postulated
that a binomial parameter p was uniformly distributed if nothing were known
about it.
The principle of insufficient reason is now rejected because it sets rather
too many probabilities equal. Having an unknown p uniformly distributed
is different from having an unknown √p uniformly distributed, yet we are
equally ignorant of both. Even in the gambling case, we might set all combinations of throws of n dice to have equal probability, so that the next throw
has probability 1/6 of giving an ace no matter what the results of previous
throws. Yet the dice will always be a little biased, and we want the next throw
to have higher probability of giving an ace if aces appeared with frequency
greater than 1/6 in previous throws.
Here, it is a consequence of the principle of insufficient reason that the
long run frequency of aces will be 1/6, and this prediction may well be violated
by the observed frequency. Of course any finite sequence will not offer a
strict contradiction, but as a practical matter, if a thousand tosses yielded
1/3 aces, no gambler would be willing to continue paying off aces at 5 to
1. The principle of insufficient reason thus violates the skeptical principle
that you can't be sure about the future.
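The contrast between the two readings of ignorance can be sketched in a few lines of Python (an added illustration, not from the text): under the insufficient-reason assignment over all throw sequences, the predictive probability of an ace never moves, while under Bayes's uniform prior on the unknown chance of an ace the prediction follows the observed frequency via Laplace's rule of succession.

```python
from fractions import Fraction

def insufficient_reason_pred(aces, n):
    # All 6^n sequences of n throws equally probable: the next throw
    # is an ace with probability 1/6 whatever the history.
    return Fraction(1, 6)

def laplace_pred(aces, n):
    # Bayes's postulate: a uniform prior on the unknown chance of an
    # ace gives the rule of succession P(ace | data) = (aces + 1)/(n + 2).
    return Fraction(aces + 1, n + 2)

# A thousand throws of a slightly crooked die yielding 1/3 aces:
print(insufficient_reason_pred(333, 1000))  # 1/6 -- the evidence is ignored
print(laplace_pred(333, 1000))              # 1/3 -- tracks the observed rate
```

The exact rational arithmetic makes the point sharply: after a thousand throws the rule of succession essentially reproduces the empirical frequency, which is what the gambler in the text demands.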
1.2. Logical Theories: Keynes and Jeffreys
Keynes (1921) believed that probability was the rational belief in a proposition justified by knowledge of another proposition. It is not possible to give
a numerical value to every such belief, but it is possible to compare some
pairs of beliefs. He modified the principle of insufficient reason to a principle
of indifference-two alternatives are equally probable if there is no relevant
evidence relating to one alternative, unless there is corresponding evidence
relating to the other. This still leaves a lot of room for judgment; for example,
Keynes asserts that an urn containing n black and white balls in unknown
proportion will produce each sequence of white and black balls with equal
probability, so that for large n the proportion of white balls is very probably
near 1/2. He discusses probabilities arising from analogy, but does not
present methods for practical calculation of such probabilities. Keynes's
theory does not succeed because it does not provide reasonable rules for
computing probabilities, or even for making comparisons between probabilities.
Jeffreys (1939) has the same view of probability as Keynes, but is more
constructive in presenting many types of prior distributions appropriate
for different statistical problems. He presents an "invariant" prior distribution for a continuous parameter indexing a family of probability distributions, thus escaping one of the objections to the principle of insufficient
reason. The invariant distribution is however inconsistent in another sense,
in that it may generate conditional distributions that are not consistent
with the global distribution. Jeffreys rejects it in certain standard cases.
Many of the standard prior probabilities used today are due to Jeffreys,
and he has given some general rules for constructing probabilities. He
concedes (1939, p. 37) that there may not be an agreed upon probability
in some cases, but argues (p. 406) that two people following the same rules
should arrive at the same probabilities. However, the many rules stated
frequently give contradictory results.
The difficulty with Jeffreys's approach is that it is not possible to construct
unique probabilities according to the stated rules; it is not possible to infer
what Jeffreys means by probability by examining his constructive rules;
it is not possible to interpret the results of a Jeffreys calculation.
1.3. Empirical Theories: Von Mises
Let x1, x2, ..., xn, ... denote an infinite sequence of points in a set. Let f(A)
be the limiting proportion of points lying in a set A, if that limit exists. Then
f satisfies the axioms of finite probability. In frequency theories, probabilities
correspond to frequencies in some (perhaps hypothetical) sequence of experiments. For example "the probability of an ace is 1/6" means that if the same
die were tossed repeatedly under similar conditions the limiting frequency
would be 1/6.
Von Mises (1928/1964) declares that the objects under study are not single
events but sequences of events. Empirically observed sequences are of course
always finite. Some empirically observed sequences show approximate
convergence of relative frequencies as the sample size increases, and approximate random order. Von Mises idealizes these properties in an infinite
sequence or collective in which each elementary event has limiting frequency
that does not change when it is computed on any subsequence in a certain
family. The requirement of invariance is supposed to represent the impossibility (empirically observed) of constructing a winning betting system.
Nontrivial collectives satisfying invariance over all subsequences do not exist, but it is a consequence of the strong law of large numbers that
collectives exist that are invariant over any specified countable set of subsequences. Church (1940) suggests selecting subsequences using recursive
functions, functions of integer variables for which an algorithm exists that
will compute the value of the function for any values of the arguments in
finite time on a finite computing machine. There are countably many recursive functions so the collective exists, although of course, it cannot be
constructed. Further interesting mathematical developments are due to
Kolmogorov (1965) who defines a finite sequence to be random if an algorithm
required to compute it is sufficiently complex, in a certain sense; and to
Martin-Löf (1966) who establishes the existence of finite and infinite random
sequences that satisfy all statistical tests.
How is the von Mises theory to be applied? Presumably to those finite
sequences whose empirical properties of convergent relative frequency and
approximate randomness suggested the infinite sequence idealization. No
rules are given by von Mises for recognizing such sequences and indeed
he criticizes the "erroneous practice of drawing statistical conclusions from
short sequences of observations" (p. ix). However the Kolmogorov or
Martin-Löf procedures could certainly be used to recognize such sequences.
How does frequency probability help us learn? Take a long finite "random"
sequence of 0's and 1's. The frequency of 0's in the first half of the sequence
will be close to the frequency of 0's in the second half of the sequence, so that
if we know only the first half of the sequence we can predict approximately
the frequency of 0's in the second half, provided that we assume the whole
sequence is random. The prediction of future frequency is just a tautology
based on the assumption of randomness for the whole sequence.
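The tautology can be checked numerically. The sketch below (an added illustration; the sequence length and seed are arbitrary choices) generates a long pseudo-random 0-1 sequence and compares the frequency of 1's in its two halves.

```python
import random

random.seed(0)
seq = [random.randint(0, 1) for _ in range(100_000)]  # a long "random" 0-1 sequence

half = len(seq) // 2
f1 = sum(seq[:half]) / half   # frequency of 1's in the first half
f2 = sum(seq[half:]) / half   # frequency of 1's in the second half

# If the whole sequence is random, the two frequencies nearly agree,
# so knowing the first half "predicts" the frequency in the second.
print(abs(f1 - f2))  # a small number, well under 0.02
```

Of course this only restates the point in the text: the agreement is guaranteed by the randomness of the whole sequence, which is exactly the assumption being made.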
It seems necessary to have a definition, or at least some rules, for deciding
when a finite sequence is random in order to apply the von Mises theory. Given such
a definition, it is possible to construct a logical probability distribution that
will include the von Mises limiting frequencies: define the probability of
the sequence x1, x2, ..., xn as lim_k N_k(x)/N_k, where N_k(x) is the number of
random sequences of length k beginning with x1, x2, ..., xn and N_k is the
number of random sequences of length k. In this way a probability is defined
on events which are unions of finite sequences. A definition of randomness
would not be acceptable unless P[x_{n+1} = 1 | proportion of 1's in x1, ...,
xn = p_n] - p_n → 0 as n → ∞, that is, unless the conditional probability of
a 1 at the next trial converged to the limiting frequency of 1's.
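To make the N_k(x)/N_k construction concrete, here is a toy computation at a finite k (an added sketch; the randomness test used is a crude stand-in, far weaker than the Kolmogorov or Martin-Löf definitions):

```python
from itertools import product

def is_random(seq):
    # Toy stand-in for a randomness test: the proportion of 1's must
    # lie within 0.2 of 1/2.  Any serious definition is far more demanding.
    return abs(sum(seq) / len(seq) - 0.5) <= 0.2

def prefix_prob(prefix, k):
    # N_k(x)/N_k: the fraction of "random" length-k 0-1 sequences
    # that begin with the given prefix.
    seqs = [s for s in product((0, 1), repeat=k) if is_random(s)]
    ext = [s for s in seqs if s[:len(prefix)] == prefix]
    return len(ext) / len(seqs)

# Finite additivity holds: the probabilities of the two one-step
# extensions of a prefix add up to the probability of the prefix.
p = prefix_prob((1, 0), 12)
assert abs(prefix_prob((1, 0, 0), 12) + prefix_prob((1, 0, 1), 12) - p) < 1e-12
```

Whatever test is substituted for `is_random`, the same counting defines a finitely additive probability on unions of finite sequences, which is the point of the construction.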
True, definitions of randomness may vary, so that there is no unique solution-but the arbitrariness necessary to define finite randomness for applying
frequency theory is the same arbitrariness which occurs in defining prior
probabilities in the logical and subjective theories.
Asymptotically all theories agree; von Mises discusses only the asymptotic
case; to apply a frequency theory to finite sequences, it is necessary to make
the same kind of assumptions as Jeffreys makes on prior probabilities.
1.4. Empirical Theories: Kolmogorov
Kolmogorov (1933) formalized probability as measure: he interpreted
probability as follows.
(1) There is assumed a complex of conditions C which allows any number
of repetitions.
(2) A set of elementary events can occur on establishment of conditions C.
(3) The event A occurs if the elementary event which occurs lies in A.
(4) Under certain conditions, we may assume that the event A is assigned
a probability P(A) such that
(a) one can be practically certain that if the complex of conditions C
is repeated a large number of times n, then if m is the number of
occurrences of event A, the ratio m/n will differ very slightly from
P(A).
(b) if P(A) is very small one can be practically certain that when conditions
C are realized only once, the event A would not occur at all.
The axioms of finite probability will follow for P(A), although the axiom
of continuity will not.
As frequentists must, Kolmogorov is struggling to use Bernoulli's limit
theorem for a sequence of independent identically distributed random
variables without mentioning the word probability. Thus "the complex of
conditions C which allows any number of repetitions" - how different must
the conditions be between repetitions? Thus "practically certain" instead
of "with high probability." Logical and subjective probabilists argue that
a larger theory of probability is needed to make precise the rules of application of a frequency theory.
1.5. Empirical Theories: Falsifiable Models
Statisticians in general have followed Kolmogorov's prescription. They
freely invent probability models, families of probability distributions that
describe the results of an experiment. The models may be falsified by repeating the experiment often and noting that the observed results do not concur
with the model; the falsification, using significance tests, is itself subject to
uncertainty, which is described in terms of the original probability model.
A direct interpretation of probability as frequency appears to need an
informal extra theory of probability (matching the circularity in Laplace's
equally possible cases), but the "falsifiable model" interpretation appears
to avoid the circularity. We propose a probability model, and then reject
it, or modify it, if the observed results seem improbable. We are using
Kolmogorov's rule (4) (b) that "formally" improbable results are "practically"
certain not to happen. If they do happen we doubt the formal probability.
The weaknesses in the model approach:
(1) The repetitions of the experiment are assumed to give independent,
identically distributed results. Otherwise laws of large numbers will not
apply. But you can't test that independence without taking some other
series of experiments, requiring other assumptions of independence, and
requiring other tests. In practice the assumption of independence is usually
untested (often producing very poor estimates of empirical frequencies;
for example, in predicting how often a complex piece of equipment will
break, it is dangerous to assume the various components will break independently). The assumption of independence in the model theory is the
analogue of the principle of insufficient reason in logical theories. We assume
it unless there is evidence to the contrary, and we rarely collect evidence.
(2) Some parts of the model, such as countable additivity or continuity
of a probability density, are not falsifiable by any finite number of observations.
(3) Arbitrary decisions about significance tests must be made; you must
decide on an ordering of the possible observations on their degree of denial
of the model-perhaps this ordering requires subjective judgment depending
on past knowledge.
1.6. Subjective Theories: De Finetti
De Finetti (1930/1937) declares that the degree of probability attributed
by an individual to a given event is revealed by the conditions under which
he would be disposed to bet on that event. If an individual must bet on all
events A which are unions of elementary events, he must bet according to
some probability P(A) defined by assigning non-negative probabilities to
the elementary events, or else a dutch book can be made against him-a
combination of bets is possible in which he will lose no matter which elementary event occurs. (This is only a little bit like von Mises's principle of the
impossibility of a gambling system.) De Finetti calls such a system of bets
coherent.
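The dutch book can be exhibited in a minimal numeric sketch (an added illustration, not from the text): a bettor who prices unit bets on the elementary events of a partition at rates summing to more than one loses the same amount whichever event occurs.

```python
def sure_loss(probs):
    # probs[i] is the rate at which the bettor will buy a unit bet on
    # the i-th elementary event of a partition.  Buying all of them
    # costs sum(probs); exactly one event occurs and pays 1, so the
    # bettor's net loss is sum(probs) - 1 no matter what happens.
    return sum(probs) - 1

print(sure_loss([0.6, 0.6]))  # incoherent rates: a guaranteed loss near 0.2
print(sure_loss([0.6, 0.4]))  # coherent rates: no sure loss
```

If the rates sum to less than one, the opponent takes the other side of every bet and the same argument applies; only rates derived from a probability assignment over the elementary events escape a sure loss, which is de Finetti's coherence condition.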
In the subjectivist view, probabilities are associated with an individual.
Savage calls them "personal" probabilities; a person should be coherent,
but any particular event may be assigned any probability without questioning
from others. You cannot say that "my probability that it will rain this
afternoon is .97" is wrong-it reports my willingness to bet at a certain rate.
Bayes (1763) defines probability as "the ratio between the value at which an
expectation depending on the happening of the event ought to be computed,
and the value of the thing expected upon its happening." His probability
describes how a person ought to bet, not how he does bet. It should be noted
that the subjectivist theories insist that a person be coherent in his betting,
so that they are not content to let a person bet how he pleases; psychological
probability comes from the study of actual betting behavior, and indeed
people are consistently incoherent (Wallsten (1974)).
There are numerous objections to the betting approach, some technical
(is it feasible?), others philosophical (is it useful?).
(i) People don't wish to offer precise odds-Smith (1961) and others have
suggested ranges of probabilities for each event; this is not a very serious
objection.
(ii) A bet is a price, subject to market forces-depending on the other
actors; Borel (1924) considers the case of a poker player, who by betting high,
increases his probability of winning the pot. Can you say to him "your
probability of winning the pot is the amount you are willing to bet to win
the pot divided by the amount in the pot."
Suppose you are in a room full of knowledgeable meteorologists, and
you declare the probability it will rain tomorrow is .95. They all rush at
you waving money. Don't you modify the probability? We may not be willing
to bet at all if we feel others know more. Why should the presence of others
be allowed to affect our probability?
(iii) The utility of money is not linear-You may bet $1 to win $500 when
the chance of winning is only 1/1000; the gain of $500 seems more than 500
times the loss of $1. Ramsey (1926) and Savage (1954) advance theories of
rational decision making, choosing among a range of available actions,
that produce both utilities and probabilities for which the optimal decision
is always that decision which maximizes expected utility.
The philosophical objection is that I don't particularly care how you
(opinionated and uninformed as you are) wish to bet. To which the subjectivists will answer that subjective judgments are necessary in forming
conclusions from observations; let us be explicit about them (Good (1976,
p. 143)). To which the empiricists will reply, let us separate the "good"
empirically verifiable probabilities, the likelihoods, from the "bad" subjective
probabilities which vary from person to person. (Cox and Hinkley (1974,
p. 389) "For the initial stages ... the approach is ... inapplicable because it
treats information derived from data as on exactly equal footing with probabilities derived from vague and unspecified sources.")
1.7. Subjective Theories: Good
Good (1950) takes a degree of belief in a proposition E given a proposition
H and a state of mind of a person M, to be a primitive notion allowing no
precise definition. Comparisons are made between degrees of belief; a set
of comparisons is called a body of beliefs. A reasonable body of beliefs contains
no contradictory comparisons.
The usual axioms of probability are assumed to hold for a numerical
probability which has the same orderings as a body of beliefs. Good recommends a number of rules for computing probabilities, including for example
the device of imaginary results: consider a number of probability assignments
to a certain event; in combination with other fixed probability judgments,
each will lead through the axioms to further probability judgments; base
your original choice for probabilities on the palatability of the overall
probabilities which ensue. If an event of very small probability occurs, he
suggests that the body of beliefs be modified.
Probability judgments can be sharpened by laying bets at suitable odds,
but there is no attempt to define probability in terms of bets. Good (1976,
p. 132) states that "since the degrees of belief, concerning events over which
he has no control, of a person with ideally good judgment, should surely not
depend on whether he uses his beliefs in any specific manner, it seems desirable to have justifications that do not mention preferences or utilities. But
utilities necessarily come in whenever the beliefs are to be used in a practical
problem involving action."
Good takes an attitude, similar to the empirical model theorists, that
a probability system proposed is subject to change if errors are discovered
through significance testing. In standard probability theory, changes in
probability due to data take place according to the rules of conditional
probability; in the model theory, some data may invalidate the whole probability system and so force changes not according to the laws of probability.
There is no contradiction in following this practice because we separate
the formal theory from the rules for its application.
1.8. All the Probabilities
An overview of the theories of probability may be taken from the stance of
a subjective probabilist, since subjective probability includes all other
theories. Let us begin with the assumption that an individual attaches to
events numerical probabilities which satisfy the axioms of probability
theory.
If no rules for constructing and interpreting probabilities are given,
the probabilities are inapplicable-for all we know the person might be
using length or mass or dollars or some other measure instead of probability. Thus the theories of Laplace and Keynes are not practicable for lack of
rules to construct probability. Jeffreys provides rules for many situations
(although the rules are inconsistent and somewhat arbitrary). Good takes
a belief to be a primitive notion; although he gives numerous rules for
refining and correcting sets of probabilities, I believe that different persons
might give different probabilities under Good's system, on the same knowledge, simply because they make different formalizations of the primitive
notion of degree of belief. Such disagreements are accepted in a subjective
theory, but it seems undesirable that they are caused by confusion about
meanings of probability. For example if you ask for the probability that it
will rain tomorrow afternoon, one person might compute the relative
frequency of rain on afternoons in the last month, another might compute
the relative amount of today's rain that fell this afternoon; the axioms are
satisfied. Are the differences in computation due to differences in beliefs
about the world, or due to different interpretations of the word probability?
The obvious interpretation of a probability is as a betting ratio, the amount
you bet over the amount you get. There are certainly some complications
in this interpretation-if a probability is a price, it will be affected by the
market in which the bet is made. But these difficulties are overcome by
Savage's treatment of probability and utility in which an individual is asked
to choose coherently between actions, and then must do so to maximize
expected utility as measured by an implied personal probability and utility.
The betting interpretation arises naturally out of the foundations of probability theory as a guide to gamblers, and is not particularly attached to
any theory of probability. A logical probabilist, like Bayes, will say that a
probability is what you ought to bet. A frequentist will say that a bet is
justified only if it would be profitable in the long run-Fisher's evaluation
of estimation procedures rests on which would be more profitable in the long
run. A subjectivist will say that the probability is the amount you are willing
to bet, although he will require coherence among your bets. It is therefore
possible to adopt the betting interpretation without being committed to
a particular theory of probability.
As Good has said, the frequency theory is neither necessary nor sufficient.
Not sufficient because it is applicable only to a single type of data. Not necessary
because it is neatly contained in logical or subjectivist theories, either through
Bernoulli's celebrated law of large numbers which originally generated the
frequency theory, or through de Finetti's celebrated convergence of conditional probabilities on exchangeable sequences, which makes it clear what
probability judgments are necessary to justify a frequency theory. (A sequence
X_1, X_2, ..., X_n, ... is exchangeable if its distribution is invariant under finite
permutations of the indices, and then if the X_i have finite second moment,
the expected value of X_{n+1} given x_1, ..., x_n and (1/n)Σ_{i=1}^n x_i converge to the
same limiting random variable.) Thus the frequency theory gives an approximate value to conditional expectation for data of this type: the sequence of
repeated experiments must be judged exchangeable.
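De Finetti's convergence can be illustrated numerically. The sketch below is my own illustration, not from the text: it assumes a beta-Bernoulli exchangeable sequence, for which the predictive probability under a uniform mixing distribution is Laplace's rule (k + 1)/(n + 2), and checks that this agrees with the sample mean for a long sequence.

```python
import random

# Illustration (not from the text): a beta-Bernoulli exchangeable sequence.
# Under a uniform Beta(1,1) mixing distribution, the predictive probability
# P(X_{n+1} = 1 | x_1, ..., x_n) is (k + 1)/(n + 2), Laplace's rule, where k
# is the number of ones seen; it converges to the sample mean k/n, which is
# de Finetti's point about frequency judgments and exchangeability.
def predictive(k, n):
    # posterior mean of the mixing parameter after k ones in n trials
    return (k + 1) / (n + 2)

random.seed(0)
theta = random.random()  # draw the mixing parameter once
xs = [1 if random.random() < theta else 0 for _ in range(10000)]
k, n = sum(xs), len(xs)
print(abs(predictive(k, n) - k / n) < 1e-3)  # predictive ~ sample mean
```

The gap between the predictive value and the sample mean is at most 1/(n + 2) whatever the data, so the agreement here does not depend on the particular draw.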
The frequency theory does not assist with the practical problem of prediction from short sequences. Nor does it apply to other types of data. For
example we might judge that the series is stationary rather than exchangeable:
the assumption is weaker but limit results still apply under certain conditions.
The frequency theory would be practicable if data consisted of long sequences
of exchangeable random variables (the judgment of exchangeability being
made informally, outside the theory); but too many important problems
are not of this type.
The model theory of probability uses probability models that are "falsified"
if they give very small probability to certain events. The only interpretation
of probability required is that events of small probability are assumed "practically certain" not to occur. The advance over the frequency theory is that it
is not necessary to explain what repeatable experiments are. The loss is that
many probabilities must be assumed in order to compute the probabilities
of the falsifying events, and so it is not clear which probabilities are false if
one of the events occurs. The interpretation of small probabilities as practically
zero is not adequate to give meaning to probability. Consider for example
the model that a sample of n observations is independently sampled from
the normal: one of the observations is 20 standard deviations from the rest;
we might conclude that the real distribution is not normal or that the sampled
observations are not independent (for example the first (n - 1) observations
may be very highly correlated). Thus we cannot empirically test the normality
unless we are sure of the independence; and assuming the independence is
analogous to assuming exchangeability in de Finetti's theories.
Finally the subjective theory of probability is objectionable because
probabilities are mere personal opinions: one can give a little advice; the
probabilities should cohere, the set of probabilities should not combine to
give unacceptable probabilities; but in the main the theory describes how
ideally rational people act rather than recommending how they should act.
1.9. Infinite Axioms
Two questions arise when probabilities are defined on infinite numbers
of events. These questions cannot be settled by reference to empirical facts,
or by considering interpretations of probability, since in practice we do not
deal with infinite numbers of events. Nevertheless it makes a considerable
difference in the mathematics which choices are made.
In Kolmogorov's axioms, the axiom of countable additivity is assumed.
This makes it possible to determine many useful limiting probabilities that
would be unavailable if only finite additivity were assumed, but at the cost
of limiting the application of probability to a subset of the family of all subsets
of the line. Philosophers are reluctant to accept the axiom, but mathematicians are keen to accept it; de Finetti and others have developed a theory
of finitely additive probability which differs in exotic ways from the regular
theories-he will say "consider the uniform distribution on the line, carried
by the rationals"; distribution functions do not determine probability distributions on the line. Here, the axiom of countable additivity is accepted
as a mathematical convenience.
The second infinite axiom usually accepted is that the total probability
should be one. This is inconvenient in Bayes theory because we frequently
need uniform distributions on the line; countable additivity requires that
total probability be infinite.
Allowing total probability to be infinite does not prevent interpretation
in any of the standard theories. Suppose probability is defined on a suitable
class of functions 𝒳. Probability judgments may all be expressed in the form
PX ≥ 0 for various X. In the frequency theory, given a sequence X_1, X_2, ...,
X_n, ..., PX ≥ 0 means that Σ_{i=1}^n X_i ≥ 0 for all large n. In the betting theory,
PX ≥ 0 means that you are willing (subjective) or ought (logical) to accept
the bet X.
1.10. Probability and Similarity
I think there is probability about 0.05 that there will be a large scale nuclear
war between the U.S. and the U.S.S.R. before 2000. By that I certainly don't
mean that such nuclear exchanges will occur in about one in twenty of some
hypothetical infinite sequence of universes. Nor do I mean that I am willing
to bet on nuclear war at nineteen to one odds: I am willing to accept any
wager that I don't have to pay off until after the bomb. (I trust the U.S.S.R.
targeting committee to have put aside a little something for New Haven,
and even if they haven't, bits of New York will soon arrive by air.)
What then does the probability 0.05 mean? Put into an urn 100 balls
differing only in that 5 are black and 95 are white. Shake well and draw a
ball without looking. I mean that the probability of nuclear war is about the
same as the probability of getting a black ball (or more precisely, say, war
is more probable than drawing a black ball when the urn has 1 black and
99 white balls, and less probable than drawing a black ball when the urn
has 10 black and 90 white balls.) You might repeat this experiment many
times and expect 5% black balls, and you might be willing to bet at 19 to
1 that a black ball will appear, although of course the decision to bet will
depend on other things such as your fortune and ethics. To me, the probability .05 is meaningful for the 5 out of 100 balls indistinguishable except for
color, without reference to repeating the experiment or willingness to bet.
Why should you believe the assessment of .05? I need to offer you the data
on which the probability calculation is based. The superpowers could become
engaged in a nuclear war in the following ways.
1. Surprise attack. Such an attack would seem irrational and suicidal;
but war between nations has often seemed irrational and suicidal. For example the Japanese attack on the United States in 1941 had a good chance
of resulting in the destruction of the Japanese Empire, as the Japanese
planners knew, but they preferred chancy attack to what they saw as sure
slow economic strangulation. Might not the U.S. or the U.S.S.R. attack
for similar reasons? If such an attack occurs once in 1000 years, the chance
of it occurring in the next twenty years is .02. (I will concede that the figure
might be off by a factor of 10.)
2. Accidental initiation. Many commanders have the physical power to
launch an attack, because the strategic systems emphasize fast response
times to a surprise attack. Let us say 100 commanders, each so stable that he
has only one chance in 100,000 of ordering a launch in a given year, and that
an isolated launch escalates to a full scale exchange with probability .2. In
twenty years, a nuclear war occurs with probability .004.
3. Computer malfunction. Malfunctions causing false alerts have been
reported at least twice in the U.S. press in the last twenty years. Let us assume
that a really serious malfunction causing a launch is 100 times as rare. In
the next twenty years we expect only .02 malfunctions.
4. Third party initiation. Several embattled nations have nuclear capability: Israel (.99 probability), South Africa (.40), India (.60), Pakistan
(.40), Libya (.20). Of these Israel is the most threatened and the most dangerous. Who would be surprised by a preemptive nuclear attack by Israel on
Libyan nuclear missile sites? Let us say the probability is .01, and the chance
of escalation to the superpowers is .2. The overall probability is .002.
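The arithmetic of the four routes can be checked directly; the sketch below merely reproduces the text's figures.

```python
# Arithmetic check of the four components (the figures are the author's
# rough assessments, conceded to be off by a factor of 10).
surprise = 20 / 1000                      # once-in-1000-years event, 20 years
accident = 100 * (1 / 100000) * 20 * 0.2  # commanders x rate x years x escalation
computer = 0.02                           # serious malfunction over 20 years
third_party = 0.01 * 0.2                  # third-party strike x escalation
total = surprise + accident + computer + third_party
print(round(total, 3))  # .046, rounded in the text to .05
```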
Summing the probabilities we get .046, say .05, which I admit may be
off by a factor of 10. There is plenty of room for disagreement about the
probabilities used in the calculations; and indeed I have committed an apparently circular argument characteristic of probability calculations: I am
supposed to be showing how a probability is to be calculated, but I am basing
the calculation on other probabilities. How are they to be justified?
The component probabilities are empirical, based on occurrences of
similar events to the one being assessed. An attack by the U.S. on the U.S.S.R.
is analogous to the attack by Japan on the U.S. Dangerously deceptive
computer malfunctions have already occurred. Of course the analogies
are not very close, because the circumstances of the event considered are not
very similar to the circumstances of the analogous events.
The event of interest has been expressed as the disjoint union of the intersections of "basic" events (I would like to call them atomic events but the
example inhibits me!). Denote a particular intersection as B_1B_2 ⋯ B_n.
The probability P(B_1B_2 ⋯ B_n) = P(B_1)P(B_2 | B_1) ⋯ P(B_n | B_1B_2 ⋯ B_{n-1}) is
computed as a product of conditional probabilities. The conditional probability P(B_i | B_1B_2 ⋯ B_{i-1}) is computed from the occurrence of events similar
to B_i under conditions similar to B_1B_2 ⋯ B_{i-1}. We will feel more or less
secure in the probability assessment according to the degree of similarity
of the past events and conditions to B_i and B_1B_2 ⋯ B_{i-1}. The probability
calculations have an objective, empirical part, in the explicit record of past
events, but also a subjective judgmental part, in the selection of "similar"
past events. Separate judgments are necessary in expressing the event of
interest in terms of basic events-we will attempt to use basic events for
which a reasonable empirical record exists.
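The product of conditional probabilities described above can be sketched in code; the numbers below are hypothetical, chosen only to mirror route 4 of the calculation.

```python
# Sketch of the product rule used above: P(B1 B2 ... Bn) computed as
# P(B1) * P(B2 | B1) * ... * P(Bn | B1 B2 ... Bn-1).  Inputs are hypothetical.
def prob_intersection(conditionals):
    """conditionals[i] is P(B_{i+1} | B_1 ... B_i); returns P(B_1 ... B_n)."""
    p = 1.0
    for c in conditionals:
        p *= c
    return p

# e.g. P(B1) = .01 and P(B2 | B1) = .2 give P(B1 B2) = .002, as in route 4
print(prob_intersection([0.01, 0.2]))
```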
The future is likely to be like the past. Probability must therefore be a
function of the similarities between future and past events. The similarities
will be subjective, but given the similarities a formal objective method should
be possible for computing probabilities.
1.11. References
Bayes, T. (1763), An essay towards solving a problem in the doctrine of chances, Phil.
Trans. Roy. Soc. 53, 370-418, 54, 296-325, reprinted in Biometrika 45 (1958),
293-315.
Bernoulli, James (1713), Ars Conjectandi.
Borel, E. (1924), Apropos of a treatise on probability, Revue philosophique, reprinted
in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability.
London: John Wiley, 1964, pp. 47-60.
Church, A. (1940), On the concept of a random sequence, Bull. Am. Math. Soc. 46,
130-135.
Cox, D. R. and Hinkley, D. V. (1974), Theoretical Statistics. London: Chapman and
Hall.
De Finetti, B. (1937), Foresight: Its logical laws, in subjective sources, reprinted in
H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. London:
John Wiley, 1964, pp. 93-158.
Good, I. J. (1950), Probability and the Weighing of Evidence. London: Griffin.
Good, I. J. (1976), The Bayesian influence, or how to sweep subjectivism under the
carpet, in Harper and Hooker (eds.), Foundations of Probability Theory, Statistical
Inference, and Statistical Theory of Science. Dordrecht: Reidel.
Jeffreys, H. (1939), Theory of Probability. London: Oxford University Press.
Keynes, J. M. (1921), A Treatise on Probability. London: MacMillan.
Kolmogorov, A. N. (1950). Foundations of the Theory of Probability. New York:
Chelsea. (The German original appeared in 1933.)
Kolmogorov, A. N. (1965), Three approaches to the quantitative definition of information, Problemy Peredaci Informacii 1, 4-7.
Laplace, P. S. (1814), Essai philosophique sur les probabilites, English translation.
New York: Dover.
Martin-Löf, P. (1966), The definition of random sequences, Information and Control
9, 602-619.
Ramsey, F. (1926), Truth and probability, reprinted in H. E. Kyburg and H. E. Smokler
(eds.), Studies in Subjective Probability. New York: John Wiley, 1964, pp. 61-92.
Savage, L. J. (1954), The Foundations of Statistics. New York: John Wiley.
Smith, C. A. B. (1961), Consistency in statistical inference and decision, J. Roy. Statist.
Soc. B 23, 1-25.
von Mises, R. and Geiringer, H. (1964), The Mathematical Theory of Probability and
Statistics. New York: Academic Press.
Wallsten, Thomas S. (1974), The psychological concept of subjective probability:
a measurement theoretic view, in C. S. Stael von Holstein (ed.), The Concept of
Probability in Psychological Experiments. Boston: Reidel, pp. 49-72.
CHAPTER 2
Axioms
2.0. Notation
The objects of probability will be bets X, Y, ... that have real-valued payoffs
X(s), Y(s), ... according to the true state of nature s, where s may be any of the
states in a set S.
Following de Finetti, events will be identified with bets taking only the
values 0 and 1. In particular, the notation {s satisfies certain conditions} will
denote the event equal to 1 when s satisfies the conditions, and equal to 0
otherwise. For example {X ≥ 5} denotes the event equal to 1 when s is such
that X(s) ≥ 5, and equal to 0 otherwise.
In general the algebraic symbols +, −, ≤, ∨, ∧ will be used rather than the set
theoretic symbols ∪, ∩, ⊂.
2.1. Probability Axioms
Let S denote a set of outcomes, let X, Y, ... denote bets on S, real valued
functions such that X(s), Y(s), ... are the payoffs on the bets when s occurs.
A probability space 𝒳 is a set of bets such that

(1) X, Y ∈ 𝒳 ⇒ aX + bY ∈ 𝒳 for a, b real
(2) X ∈ 𝒳 ⇒ |X| ∈ 𝒳
(3) X ∈ 𝒳 ⇒ X ∧ 1 ∈ 𝒳
(4) |X_n| ≤ X_0 ∈ 𝒳, X_n → X ⇒ X ∈ 𝒳.

A probability P on 𝒳 is a real valued function on 𝒳 that is

LINEAR: P(aX + bY) = aPX + bPY for X, Y ∈ 𝒳 and a, b real
NON-NEGATIVE: P|X| ≥ 0 for X ∈ 𝒳
CONTINUOUS: |X_n| ≤ X_0 ∈ 𝒳, X_n → X ⇒ PX_n → PX.
A unitary probability P is defined on a probability space 𝒳 such that 1 ∈ 𝒳,
and satisfies P1 = 1. A finitely additive probability P is defined on a linear
space 𝒳 such that X ∈ 𝒳 ⇒ |X| ∈ 𝒳, 1 ∈ 𝒳, and P is linear and non-negative, but
not necessarily continuous.

A probability space 𝒳 is complete with respect to P if

(i) X_n ∈ 𝒳, P|X_n − X_m| → 0, X_n → X ⇒ X ∈ 𝒳,
(ii) Y ∈ 𝒳, PY = 0, 0 ≤ X ≤ Y ⇒ X ∈ 𝒳.
The standard definition of probability, set down in Kolmogorov (1933),
requires that it be unitary. According to Keynes (1921, p. 155), it was Leibniz
who first suggested representing certainty by 1. However, in Bayes theory
it is convenient to have distributions which have P1 = ∞, such as the uniform
distributions over the line and the integers. Jeffreys allows P1 = ∞, because his
methods of generating prior distributions frequently produce such P, but
in theoretical work with probabilities, he usually assumes P1 = 1. Renyi
(1970) handles infinite probabilities using families of conditional probabilities.
But there is no formal theory to handle probabilities with P1 = ∞, which are
therefore called improper. The measure theory for this case is well developed;
see for example, Dunford and Schwartz (1964).
The betting theory interpretation of probability is straightforward;
PX_1/PX_2 is the relative value of bet X_1 to bet X_2; you accept only bets X
such that PX ≥ 0. It is true that you may effectively give value ∞ to the constant bet 1; those bets which you wish to compare are infinitely less valuable
than 1.
It is also possible to make a frequency interpretation of non-unitary
probability. Consider for example the uniform distribution over the integers.
This would be the limiting frequency probability of a sequence of integers,
such as
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6 ...
in which each pair of integers occurred with the same limiting relative
frequency. If it is insisted that continuity hold, then total probability is
infinite. If it is insisted that total probability is 1, then continuity breaks down
and the limiting frequency probability is finitely additive.
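The limiting frequencies in this sequence can be checked numerically; the sketch below counts occurrences in the first blocks.

```python
# Sketch of the limiting-frequency reading: in the sequence 1; 1 2; 1 2 3; ...
# any two fixed integers occur with count ratio tending to 1, while each
# individual relative frequency tends to 0, so total probability is infinite.
def counts(n_blocks, a, b):
    seq = [i for m in range(1, n_blocks + 1) for i in range(1, m + 1)]
    return seq.count(a), seq.count(b), len(seq)

ca, cb, n = counts(2000, 3, 7)
print(abs(ca / cb - 1) < 0.01)  # the two integers are equally frequent in the limit
print(ca / n < 0.001)           # each integer individually has frequency -> 0
```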
It is not possible to justify either the continuity axiom or probabilities
with PI = 00 by reference to actual experience, which is necessarily finite.
Indeed de Finetti rejects the continuity axiom on this and other grounds.
But the continuity axiom equally cannot be denied by reference to experience,
and it is mathematically convenient in permitting unique extension of P
defined on some small set of functions to P defined on a larger set of interesting limit functions: we begin by assuming that intervals have probability
proportional to their length, and end by stating that the rationals have
probability zero. [In contrast, de Finetti (1972) can say: consider the uniform
distribution on the real line carried by the rationals, or carried by the irrationals.] We need to invent methods to handle invented concepts such as the
set of rationals; the main justification must be mathematical convenience;
and the same reasoning applies to non-unitary probabilities - they must be
mathematically convenient or they would not be so improperly ubiquitous
(see them used by de Finetti, 1970, p. 237).
2.2. Prespaces and Rings
A prespace 𝒜 is a linear space such that X ∈ 𝒜 ⇒ |X| ∈ 𝒜, X ∧ 1 ∈ 𝒜. A limit
space L is such that X_n ∈ L, X_0 ∈ L, |X_n| ≤ X_0, X_n → X implies X ∈ L. A
probability space is both a prespace and a limit space.

Lemma. The smallest probability space including a prespace 𝒜 is the smallest
limit space including 𝒜.

PROOF. Let L be the intersection of the limit spaces containing 𝒜. For each X,
let L(X) be the set of functions Y such that 𝒜(X, Y): |X|, |Y|, X ∧ 1, Y ∧ 1,
aX + bY all lie in L; L(X) is a limit space. If X ∈ 𝒜, then 𝒜(X, Y) ⊂ 𝒜 ⊂ L for
Y in 𝒜, so L(X) ⊃ 𝒜 and hence L(X) ⊃ L. If X ∈ L, then X ∈ L(Y) ⊃ L for each Y in 𝒜,
so again L(X) ⊃ 𝒜 and L(X) ⊃ L. If X ∈ L, Y ∈ L, then 𝒜(X, Y) ⊂ L, so L is a
prespace and therefore a probability space. □

A probability P on a prespace 𝒜 is linear, non-negative and continuous:
X_n → 0, |X_n| ≤ X ∈ 𝒜 ⇒ PX_n → 0.
Theorem. A probability P on a prespace 𝒜 may be uniquely extended to a
probability P on a completed probability space including 𝒜.

PROOF. Let 𝒳 consist of functions X for which |X − a_n| ≤ Σ_{i=1}^∞ a_i^n with
a_i^n ≥ 0 and Σ_{i=1}^∞ Pa_i^n → 0 as n → ∞, where a_n, a_i^n ∈ 𝒜. Say that the
sequence a_n approximates X, and define PX = lim Pa_n. It follows from continuity
that the definition is unique, which implies that P is unchanged for X in 𝒜. If
a_n, b_n approximate X, Y, then aa_n + bb_n approximates aX + bY with
P(aX + bY) = aPX + bPY; |a_n| approximates |X| with P|X| ≥ 0; and a_n ∧ 1
approximates X ∧ 1.

Now suppose X_n ∈ 𝒳, X ∈ 𝒳, |X_n| ≤ X and X_n → Y. We will show that
Y ∈ 𝒳 and PY = lim PX_n. First assume X_n ↑ Y. Then P|X_{n+1} − X_n| < 2^{−n}
on a suitably chosen subsequence. Also |X_{n+1} − X_n| ≤ Σ_i a_i^n where a_i^n ∈ 𝒜,
a_i^n ≥ 0 and Σ_i Pa_i^n < 2^{−n+1}, since |X_{n+1} − X_n| ∈ 𝒳. Thus |Y − X_n| ≤
Σ_{i≥n} |X_{i+1} − X_i| ≤ Σ_{i≥n} Σ_j a_j^i where Σ_{i≥n} Σ_j Pa_j^i ≤ 2^{−n+2}. Approximate X_n
by a_n where P|X_n − a_n| < 2^{−n+2}. Then Y is approximated by a_n, and PY =
lim Pa_n = lim PX_n. The general result follows using

sup_{M≤n≤N} X_n ↑ sup_{n≥M} X_n as N → ∞, and sup_{n≥M} X_n ↓ Y as M → ∞.

If X_n ∈ 𝒳, X_n → X and P|X_n − X_m| → 0, a similar argument, first considering
monotone convergence, shows that X ∈ 𝒳. If a_n approximates Y, it approximates X,
0 ≤ X ≤ Y, so 𝒳 is complete with respect to P.

Suppose P' is a probability on 𝒳 which agrees with P on 𝒜. Then
P'|X − a_n| → 0 if a_n approximates X, so P'X = lim P'a_n = lim Pa_n = PX.
Thus P is uniquely defined on 𝒳. □
A subset of S is a function on S taking the values 0 and 1.
A family ℱ of subsets of S is a ring if A, B ∈ ℱ ⇒ A ∪ B, A − AB ∈ ℱ.
A function P on ℱ is a probability if

(i) P(A + B) = PA + PB if A, B ∈ ℱ, AB = 0
(ii) PA ≥ 0 for A in ℱ
(iii) A_n → 0, A_n ≤ A ∈ ℱ ⇒ PA_n → 0
(iii)' A_n ↓ 0 ⇒ PA_n ↓ 0.

[Note that (iii) and (iii)' are equivalent. Obviously (iii) ⇒ (iii)'. Suppose that
(iii)' holds, and A_n → 0, A_n ≤ A. Define B_n = A_n ∪ A_{n+1} ∪ ⋯ ∪ A_m, where
m is chosen so that 2^{−n} + PB_n > sup_m P(A_n ∪ A_{n+1} ∪ ⋯ ∪ A_m). Then

P[B_m − B_mB_n] = P[B_m ∪ B_n − B_n] < 2^{−n} for m > n,
P[B_mB_n − B_mB_nB_{n+1}] ≤ P[B_m − B_mB_{n+1}] ≤ 2^{−(n+1)} for m > n + 1,
P[B_m − ∩_{n≤i≤m} B_i] ≤ 2^{−n+1}.

Since ∩_{n≤i≤m} B_i ↓ 0 as m → ∞, P(∩_{n≤i≤m} B_i) ↓ 0, so lim PB_m ≤ 2^{−n+1} for every n.
Since A_m ≤ B_m, lim PA_m ≤ 2^{−n+1} for every n, so PA_m → 0.]
If P is a probability on ℱ it may be uniquely extended to the prespace 𝒜
consisting of elements Σ_{i=1}^n α_iA_i, where α_i is real, by P(Σ_{i=1}^n α_iA_i) =
Σ_{i=1}^n α_iPA_i. It is easily checked that P is well defined, linear and non-negative
on 𝒜. The continuity condition is a little more difficult; suppose a_n → 0,
|a_n| ≤ a. Then |a_n| ≤ λA for some positive λ and some A ∈ ℱ, and

|a_n| ≤ εA + {|a_n| ≥ ε}λA.

Since {|a_n| ≥ ε}A → 0 and {|a_n| ≥ ε}A ≤ A, P{|a_n| ≥ ε}λA → 0. Thus
lim P|a_n| ≤ εPA for every ε > 0, and P|a_n| → 0 as n → ∞.
If P is a probability on ℱ it may, by Theorem 2.2, be extended uniquely
to the smallest complete probability space 𝒳 including ℱ. It is customary
to call P on 𝒳 an expectation or an integral, but we follow de Finetti in
identifying sets and functions, probabilities and expectations.

If P is a probability on 𝒳, then P defined on the ring ℱ of 0-1 functions in
𝒳 extends uniquely to a complete probability space 𝒳(ℱ) that includes 𝒳.
See Loomis (1953). Thus specifying P on ℱ determines it on 𝒳; the function
X is approximated by the step functions

Σ_{1≤|k|≤K2^K} (k/K){k/K ≤ X < (k+1)/K}

and

PX = lim_{K→∞} Σ_k (k/K) P{k/K ≤ X < (k+1)/K}.
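The step-function approximation can be illustrated with a toy discrete probability; all the numbers below are hypothetical, and a discrete P stands in for a general probability.

```python
# Sketch of the step-function approximation of PX from the probabilities of
# the 0-1 events {k/K <= X < (k+1)/K}.  P is a toy discrete probability on
# four states; all numbers are hypothetical.
states = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}  # P on atoms
X = {"s1": -1.3, "s2": 0.4, "s3": 2.7, "s4": 0.9}      # a bounded bet

def PX_exact():
    return sum(p * X[s] for s, p in states.items())

def PX_step(K):
    # sum over k of (k/K) * P{k/K <= X < (k+1)/K}
    total = 0.0
    for k in range(-10 * K, 10 * K):
        pk = sum(p for s, p in states.items() if k / K <= X[s] < (k + 1) / K)
        total += (k / K) * pk
    return total

print(abs(PX_exact() - PX_step(1000)) < 1e-2)  # step approximation error <= 1/K
```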
EXAMPLE. Let:F be the set offinite unions of half-open intervals A = U(a p bJ.
Define P A = I Ibj - a j I if the intervals (a p bJ are disjoint. To check that
P is a probability, it is difficult only to prove (iii)'. Assume An 1o.
Let An = U7= 1 (a jn , bjJ, and if An < A, let A be the interval [a, b]. The
function A - An is a union of half open intervals which converges to A.
Define E = U I:!(a jn - (e/2 n), a jn + (e/2 n)). Then (A - An) U E is an open set,
n
,
and U(A - An) U E includes [a, b]. From compactness, a finite number of
(A - An)u E cover [a, b], and since A - An i, for some n, (A - An)u E => A.
Since E has total length less than e, An must have total length less than e.
Thus PAn ~ o.
From length of intervals on :F, we define a probability on a prespace of step
functions on intervals; from the prespace we define probabilities on a probability space f![ which includes, for example, all continuous functions zero
outside a finite interval. This is lebesgue measure.
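The length probability of the Example can be sketched in code, reducing a union of half-open intervals to disjoint intervals before summing lengths:

```python
# Sketch of the Example: P on finite unions of half-open intervals (a, b],
# computed as total length after reducing the union to disjoint intervals.
def length(intervals):
    """Total length of a finite union of half-open intervals (a, b]."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            # overlaps or abuts the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return sum(b - a for a, b in merged)

print(length([(0, 1), (2, 3.5)]))  # disjoint intervals: lengths add
print(length([(0, 2), (1, 3)]))    # overlapping: the union is (0, 3]
```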
2.3. Random Variables
Let P be a probability on 𝒳, a probability space on T, and let 𝒴 be a probability
space on S. A random variable X is a function from T to S such that f(X) ∈ 𝒳
for each f in 𝒴. A probability P^𝒴 is induced on 𝒴 by

P^𝒴f = P[f(X)] for each f in 𝒴.

The distribution of X is defined to be P^𝒴, also denoted by P^X.

If S is the real line, and 𝒴 is the smallest probability space including finite
intervals, then from 2.2, P^X is determined by the values it gives finite intervals,
P{a < X ≤ b} = G(b) − G(a). The distribution function G is right continuous
and uniquely determined up to an additive constant. If sup_a P{a < X ≤ 0} < ∞,
set G(b) = sup_a P{a < X ≤ b}. If P^X is unitary, it will follow that
lim_{a→−∞} G(a) = 0 and lim_{a→∞} G(a) = 1.
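The induced distribution can be illustrated with a toy discrete probability; the states and numbers below are hypothetical.

```python
# Sketch of the induced distribution: P^Y f = P[f(X)], the expectation of f
# composed with X.  Toy discrete probability on T; numbers are hypothetical.
P = {"t1": 0.5, "t2": 0.3, "t3": 0.2}  # probability on states of T
X = {"t1": 1.0, "t2": 1.0, "t3": 4.0}  # X maps T into the real line S

def induced(f):
    """P^Y f = P[f(X)]."""
    return sum(p * f(X[t]) for t, p in P.items())

# G(b) - G(a) = P{a < X <= b}: take f the indicator of the interval (a, b]
mass = induced(lambda s: 1.0 if 0 < s <= 2 else 0.0)
print(mass)  # the mass carried into (0, 2] by states t1 and t2
```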
2.4. Probable Bets
Let 𝒳 be a linear space of bets: X, Y ∈ 𝒳 ⇒ aX + bY ∈ 𝒳 for a, b real.
Let 𝒫, the probable set, be a cone of bets: X, Y ∈ 𝒫 ⇒ aX + bY ∈ 𝒫 for
a, b ≥ 0.

A generalized probability P for 𝒫 is a linear functional on 𝒳 (a real valued
function on 𝒳 with P(aX + bY) = aPX + bPY) such that PX ≥ 0 for X in
𝒫 and PX > 0 for some X in 𝒫. For Sections 2.4 and 2.5, P will be referred to as a
probability.
Theorem. If 𝒫 ≠ 𝒳 and 𝒫 contains an internal point (a point X_0 such that
for every X in 𝒳, X + kX_0 ∈ 𝒫 for some k), then a probability P exists for 𝒫.
[Following Dunford and Schwartz (1964), p. 412.]

PROOF. Let N = 𝒫 ∩ (−𝒫) be the neutral set of bets, the bets X such that both X and
−X are probable. If {𝒫_α} is a chain of probable sets with neutral sets N,
then ∪𝒫_α is a probable set with neutral set N. From Zorn's lemma, there is a
maximal probable set 𝒫_0 containing 𝒫 and having neutral set N. Then
𝒳 = 𝒫_0 ∪ (−𝒫_0), for if X ∉ 𝒫_0, X ∉ −𝒫_0, the set 𝒫_0(X) = {αX + Y, α ≥ 0,
Y ∈ 𝒫_0} is a probable set with neutral set N, and 𝒫_0(X) strictly includes 𝒫_0.
The internal point X_0 does not lie in −𝒫, for then X = (X + kX_0) + k(−X_0)
would lie in 𝒫 for each X, contradicting 𝒫 ≠ 𝒳. Also X_0 is an internal point
for 𝒫_0. Define PX = sup{α | X − αX_0 ∈ 𝒫_0}; then −∞ < PX < ∞ since
X + kX_0 ∈ 𝒫_0 and −X + k'X_0 ∈ 𝒫_0 for some k, k'; P is a linear functional because
X − αX_0 ∈ 𝒫_0 or −𝒫_0 for every X, α; for X ∈ 𝒫 ⊂ 𝒫_0, PX ≥ 0; and PX_0 = 1.
Thus P is a probability for 𝒫. □
It is necessary that 𝒫 ≠ 𝒳, for otherwise we cannot separate probable
bets from others, and it is necessary to assume an internal point so that one
of the probable bets will be comparable to all possible bets.

It is usual to take the bets as real values received according to whichever
state of nature occurs, but it is not necessary to do so. See Ramsey (1926) and
Savage (1954). The space of bets and probable set may be constructed from a
preference ordering among a set of mixed actions as follows. Let 𝒜 be an
arbitrary set (not necessarily countable) of actions a_1, a_2, ...; let 𝒜* be the
mixed actions Σ_{i=1}^n p_ia_i (perhaps constructed by generating new actions by
taking action a_i with chance p_i) where p_i ≥ 0, Σp_i = 1; and let ≳ be a preference
between mixed actions such that for 0 ≤ α ≤ 1, a ≳ b implies a ≳ αa + (1 − α)b ≳ b,
and a ≳ b, c ≳ d imply αa + (1 − α)c ≳ αb + (1 − α)d. Construct the space 𝒳 of
bets Σ_{i=1}^n x_ia_i where Σx_i = 0, and define the probable set 𝒫 to consist of bets
λ(a − b) where λ ≥ 0 and a ≳ b, a, b ∈ 𝒜*. A probability P for 𝒫 will now
satisfy Pa ≥ Pb whenever a ≳ b, and Pa > Pb for at least one pair a ≳ b.
The condition that an internal point exists is equivalent to assuming a pair
a_0 ≳ b_0 such that for each a, b, αa + (1 − α)a_0 ≳ αb + (1 − α)b_0 for some α,
0 ≤ α ≤ 1.

On the other hand there is no harm in assuming that bets are real valued
functions. Assume that 𝒫 ≠ 𝒳, and that an internal point exists. Then there
exists a basis {X_α} of 𝒳, X_α ∈ 𝒫, such that each X in 𝒳 is represented uniquely
by Σc_αX_α where only a finite number of the c_α are non-zero, and so X corresponds
to the real valued function f, f(α) = c_α. Note that X ∈ 𝒫 whenever f ≥ 0.
2.5. Comparative Probability
In comparative probability, all pairs of events in ℱ are compared by the
relation ≤, "is no more probable than":

(i) ∅ ≤ A for A ∈ ℱ,
(ii) S ≤ ∅ is not true,
(iii) A ≤ B, C ≤ D implies A + C ≤ B + D if AC = BD = 0.

The statement A ≤ B may be interpreted as offering the bet: pay 1 unit if A
occurs to receive 1 unit if B occurs. The family of bets Σ_{i=1}^n α_i(B_i − A_i) for
B_i, A_i in ℱ forms a betting space; suppose the statements A_i ≤ B_i are construed
as accepting all bets Σ_{i=1}^n α_i(B_i − A_i) for which α_i ≥ 0. It may be
possible to make a book against the bets {A_i ≤ B_i}: find a linear combination
Σ_{i=1}^n α_i(B_i − A_i) which is negative. Otherwise, for S finite, the set of combinations
of ≤ bets is probable, and there exists a probability P on ℱ such that
A ≤ B implies P(A) ≤ P(B). An example of such a "beatable" comparative
probability is given by Kraft, et al. (1959) for a set of 5 elements. See also Scott
(1964), who connects "unbeatability" with the existence of a conforming
numerical probability, and Fine (1973) for a general discussion and for
continuity axioms.
The above axioms of comparative probability are unsatisfactory because
they may not generate a probable set of bets. One solution to the problem
is to prohibit negative combinations, which is just equivalent to requiring
that a certain subset of a betting space, generated by pairs of events, is
probable. An alternative approach, followed by Koopman (1940) and Savage
(1954), supposes that S may be partitioned into sets of arbitrarily small
probability; Koopman requires that for each n there exist a partition into n
events of equal probability. Since all pairs of events are comparable, each
event has a precise numerical probability determined by comparison with
events in increasingly fine partitions, and this numerical probability satisfies
the usual finitely additive axioms with P(S) = 1.
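Koopman's scheme can be sketched in code: comparisons against n-fold equal partitions pin an event's numerical probability down to within 1/n. The comparison oracle below is simulated by a hidden value, purely for illustration; in practice it would be a judged comparison, say against draws from a well-shaken urn.

```python
# Sketch of determining a numerical probability by comparisons against
# n-fold equal partitions (Koopman's fineness condition).  The comparison
# oracle is simulated by a hidden true value, purely for illustration.
def numerical_probability(more_probable_than, n):
    """Largest k/n such that the event is more probable than k of n equal cells."""
    k = 0
    while k + 1 <= n and more_probable_than((k + 1) / n):
        k += 1
    return k / n  # the event lies between k/n and (k+1)/n

hidden = 0.371
oracle = lambda q: hidden > q  # stands in for the comparative judgment
approx = numerical_probability(oracle, 1000)
print(abs(approx - hidden) <= 1 / 1000)  # pinned down to within 1/n
```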
2.6. Problems
Exercises (E) are supposed to be easier than problems (P).
Probability is used in the sense of Section 2.4.
E1. Let 𝒳 = ℝ², define 𝒫 = {(x, y) | y + x ≥ 0, y + x/2 ≥ 0}. Show that 𝒫 is a probable
set, and find all probabilities P such that P(X) ≥ 0 for X in 𝒫.
PI. A bookie offers the following odds for various teams to win a basketball pennant.
Knicks: 6/1
Bullets: 2/1
Braves: 2/1
Celtics: 1/1
Odds of6/l means that he receives $1 if the Knicks lose and pays $6 if the Knicks
win. Consider the space of bets :i£{ (Xl' X 2 , X 3 , x 4 )} in which the bookie receives
2.6. Problems
21
if the ith team wins. Show that any probable set including the specified bets will
include all bets.
Xi
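A sketch for P1 (my own illustration, not part of the exercise): staking 1/(o + 1) on each offered bet gives the bookie a uniformly positive payoff, since the implied "probabilities" 1/7 + 1/3 + 1/3 + 1/2 = 55/42 exceed 1; the negative of this combination is then a uniformly negative bet.

```python
# Sketch for P1: the bookie's payoff vectors.  A bet at odds o/1 on team i
# pays the bookie -o if team i wins and +1 in each other state.  Staking
# 1/(o + 1) on every offered bet gives the bookie payoff 55/42 - 1 = 13/42
# in every state, which is why a probable set containing the offered bets
# must swallow every bet.
odds = {"Knicks": 6, "Bullets": 2, "Braves": 2, "Celtics": 1}
teams = list(odds)

def bookie_payoff(stakes):
    """Bookie's payoff in each state for the given stakes on each bet."""
    return [sum(w * (-odds[t] if t == s else 1) for t, w in stakes.items())
            for s in teams]

stakes = {t: 1 / (o + 1) for t, o in odds.items()}
print(all(x > 0 for x in bookie_payoff(stakes)))  # uniformly positive
```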
E2. Let 𝒳 = ℝ^k. Let P be a probability on 𝒳 with PX ≥ 0 for X ∈ 𝒳, X ≥ 0. Show that
there exist p_1, ..., p_k, p_i ≥ 0, such that P(X) = Σ_{i=1}^k p_iX_i, where X_i denotes the
ith co-ordinate of X.
E3. Let 𝒳 consist of linear combinations of bets {s | a < s ≤ b}, a < b. Let 𝒫 consist of
non-negative combinations of bets (a, a + 2δ] − (a − δ, a]. Find a probability on
𝒳 which is positive for all nonzero bets in 𝒫.
E4. Let 𝒳 be the real sequences, and let 𝒫 consist of sequences X with lim_n Σ_{i=1}^n X_i ≥ 0.
Show that if a probability P on (𝒳, 𝒫) is such that X_0 = (1, 1, 1, ...) has P(X_0) = 1,
the positive sequence X = (1, 1/2, 1/3, ..., 1/n, ...) has P(X) = 0.
E5. Let 𝒳 be the real sequences X = (X_1, X_2, ...) with finitely many non-zero
elements, and let 𝒫 = {X | for some i, X_i > 0, X_{i+1} ≥ 0, X_{i+2} ≥ 0, ...} ∪ {0}. If P
is a probability on (𝒳, 𝒫), show that P{i} = 0 for all except one {i}, where {i} is
the bet equal to 1 at i and zero elsewhere.
P2. Let S be the real line, ℱ be the ring of finite unions of half open intervals (a < s ≤ b),
where −∞ ≤ a < b ≤ ∞. Define P((a, b]) = F(b) − F(a) where F is a non-decreasing
right continuous function. Show that P is a probability on ℱ, in the sense of
Section 2.1.
E6. Let 𝒳 be k-dimensional euclidean space, and let the probable set 𝒫 include all
bets X = (X_1, ..., X_k) such that X_i ≥ 0, 1 ≤ i ≤ k. Show that if 𝒫 is not neutral,
𝒫 includes no bet which is uniformly negative.
E7. A bookmaker offers a number of bets X, Y, ... in k-dimensional euclidean space;
the bet X = (X_1, ..., X_k) means he receives X_i if i occurs. Show that there is some
mixture of the bets on which he always receives a negative payoff, or else there is a
probability P which is non-negative for all bets.
E8. Let 𝒳 be the set of real-valued sequences X = (x_1, x_2, ..., x_n, ...), let (p_1, p_2, ...,
p_n, ...) be a fixed sequence, p_i ≥ 0, and let 𝒫 be the sequences X with lim_n Σ_{i=1}^n
p_ix_i ≥ 0. Show that 𝒫 is a probable set, and specify the probability which gives
value 1 to (1, 1, ..., 1, ...) and the probability which gives value 1 to (1, 0, ..., 0, ...).
Show that the first probability is continuous if and only if Σp_i < ∞, and the
second probability is bounded if and only if Σp_i < ∞.
E9. Replace the third axiom of comparative probability by
(iii)' if Σ_{i=1}^n (A_i − B_i) = Σ_{i=1}^m (A_i' − B_i'), and all A_i ≤ B_i, then at least one A_i' ≤ B_i'.
Then the set of bets Σ_{i=1}^n α_i(B_i − A_i) where A_i ≤ B_i, α_i ≥ 0, is a probable set in the
space of real-valued functions on S.
P3. The axioms of comparative probability are satisfied by subsets of S = (1, 2, 3, 4, 5)
with ∅ < 2 < 3 < 4 < 23 < 24 < 1 < 12 < 34 < 5 < 234 < 13 < 14 < 25 < 123 < 35 < 124,
the remaining sets being ordered by complements. Show that no numerical
probability conforms to the order. [Kraft, et al., 1959.]
E10. Add a fineness axiom to the axioms of comparative probability:
(iv) for each n, there exists {A_i} with Σ_{i=1}^n A_i = S, A_iA_j = 0, A_i ≤ A_j each i, j.
Then there is a unique probability with A ≤ B ⇒ P(A) ≤ P(B), A, B ∈ ℱ. [Koopman,
1940.]
P4. Let a finitely additive probability P be defined on the plane so that
P(|x| + |y| > a) = 0 for a > 0, P[x < 0] = P[x = 0] = P[x > 0] = 1/3, P[y < 0] =
P[y = 0] = P[y > 0] = 1/3, events determined by x are independent of y, and
(A) P[x + y = 0, x > 0, y < 0] = P[x + y = 0, x < 0, y > 0] = 1/9. These conditions
determine P uniquely. Show that a different P is determined if (A) is replaced by
P[x + y < 0, x > 0, y < 0] = P[x + y < 0, x < 0, y > 0] = 1/9, demonstrating that
the distribution of x + y is not determined from the distributions of x and y when
x and y events are independent.
P5. Let X be a random variable from U, $\mathscr Z$ to S, $\mathscr X$ where S is the real line and $\mathscr X$ includes all finite intervals. Show that the P-completion of $\mathscr Z$ includes X if $\sum_{k=0}^\infty P^X\{|s| \ge k\} < \infty$.
2.7. References
De Finetti, B. (1970), Theory of Probability, Vol. 1. John Wiley: London.
De Finetti, B. (1972), Theory of Probability, Vol. 2. John Wiley: London.
Dunford, N. and Schwartz, J. T. (1964), Linear Operators, Part 1. John Wiley:
New York.
Fine, T. (1973), Theories of Probability, an Examination of Foundations. New York: Academic Press.
Jeffreys, H. (1939), The Theory of Probability. London: Oxford University Press.
Keynes, J. M. (1921), A Treatise on Probability. New York: Harper.
Kolmogorov, A. N. (1950), Foundations of the Theory of Probability. New York:
Chelsea.
Koopman, B. O. (1940), The bases of probability, Bull. Am. Math. Soc. 46, 763-774.
Kraft, C., Pratt, J. and Seidenberg, A. (1959), Intuitive probability on finite sets, Ann. Math. Statist. 30, 408-419.
Loomis, L. H. (1953), An Introduction to Abstract Harmonic Analysis. Princeton:
Van Nostrand.
Renyi, A. (1970), Probability Theory. New York: American Elsevier.
Ramsey, F. P. (1926), Truth and probability, reprinted in H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. New York: John Wiley, 1964, pp. 61-92.
Savage, L. J. (1954), The Foundations of Statistics. New York: John Wiley.
Scott, D. (1964), Measurement structures and linear inequalities, J. Math. Psych.
1, 233-247.
CHAPTER 3
Conditional Probability
3.0. Introduction
Kolmogorov's exquisite formalization of conditional probability in the
unitary case (1933) does not readily generalize to non-unitary probabilities.
Stone and Dawid (1972) show one type of difficulty with their marginalization
paradoxes for improper priors.
Consider the case of the uniform distribution over pairs of positive integers $\{i, j\}$. The desired conditional distribution of $\{i, j\}$ given $j = j_0$ is uniform over i. Following Kolmogorov, the conditional distribution given j should combine with the marginal distribution to return the joint distribution. But the event $[\{i, j\}, j = j_0]$ is not given a probability, so the marginal probabilities $p(j_0)$ are not determined by $p(i, j)$, $1 \le i, j < \infty$. Correspondingly, the uniform distribution over $\{i, j\}$ given $j = j_0$ is equally well represented by $p(i|j_0) = k(j_0)$ for any $k(j_0)$. Thus although these conditional distributions are determined by the joint distribution, the marginal distribution is not. (This
It is assumed therefore that the joint distribution, the conditional distribution, and the marginal distribution are specified separately to follow the axioms of conditional probability. In particular the probabilities of the $\{i, j\}$ and of $[\{i, j\}, j = j_0]$ are separately specified. We are declaring that $\{i, j\}$ has the same probability as $\{i', j'\}$, and in addition that $[\{i, j\}, j = j_0]$ has the same probability as $[\{i', j'\}, j' = j_0']$.
3.1. Axioms of Conditional Probability
Let $\mathscr X, \mathscr Y, \mathscr Z, \dots$ be probability spaces of functions on S. The conditional probability on $\mathscr X$ given $\mathscr Y$ is a function P from $\mathscr X$ to $\mathscr Y$ that is

LINEAR: $P(Y_1X_1 + Y_2X_2) = Y_1PX_1 + Y_2PX_2$ for $X_i \in \mathscr X$, $Y_i \in \mathscr Y$, $Y_iX_i \in \mathscr X$
NON-NEGATIVE: $P|X| \ge 0$
CONTINUOUS: $PX_n \to PX$ for $|X_n| \le X_0 \in \mathscr X$, $X_n \to X$
INVARIANT: if $Y \in \mathscr X \cap \mathscr Y$, $PY = Y$

A family of conditional probabilities is assumed to satisfy the

PRODUCT RULE: if $P^{\mathscr X}_{\mathscr Y}$, $P^{\mathscr Y}_{\mathscr Z}$, $P^{\mathscr X}_{\mathscr Z}$ denote conditional probabilities from $\mathscr X$ to $\mathscr Y$, $\mathscr Y$ to $\mathscr Z$ and $\mathscr X$ to $\mathscr Z$ respectively, then $P^{\mathscr X}_{\mathscr Z} = P^{\mathscr Y}_{\mathscr Z} P^{\mathscr X}_{\mathscr Y}$.
The conditional probability P is determined as a probability on $\mathscr X$ given the results of an experiment which determines the values of all functions in $\mathscr Y$. Each result of the experiment will give rise to possibly different values of functions in $\mathscr Y$, and possibly different probabilities. The conditional probability P determines these different probabilities for all possible results of the experiment. If $PX \in \mathscr X$, then PX may be interpreted as a bet equivalent to X that has known value after the experiment is performed.
The above axioms generalize the axioms of probability. Let $\mathscr X_1$ denote the probability space of constant functions on S, and let 1 denote the constant function. Then P is a probability on $\mathscr X$ if and only if $P^{\mathscr X}_{\mathscr X_1}: X \to (PX)1$ is a conditional probability on $\mathscr X$ given $\mathscr X_1$. Indeed $P^{\mathscr X}_{\mathscr X_1} = P^{\mathscr Y}_{\mathscr X_1}P^{\mathscr X}_{\mathscr Y}$ implies that $P^{\mathscr X}_{\mathscr Y}$ is determined almost uniquely by $P^{\mathscr X}_{\mathscr X_1}$ and $P^{\mathscr Y}_{\mathscr X_1}$. [Suppose $P^{\mathscr X}_{\mathscr Y}X$ could have values $Y_1$ or $Y_2$; then $P^{\mathscr Y}_{\mathscr X_1}[Y(Y_1 - Y_2)] = 0$ for all $Y \in \mathscr Y$, so $P^{\mathscr Y}_{\mathscr X_1}|Y_1 - Y_2| = 0$.] Kolmogorov (1933) defines conditional probability in terms of probability; under certain regularity conditions, there exists a "conditional probability" that satisfies the above axioms except on a subset of S of probability zero. Here we are following the more traditional scheme of axiomatizing conditional probability rather than defining it in terms of probability.
EXAMPLE 1: Toss a penny twice. Let $\mathscr X$ be the bets $\{X_{HH}, X_{HT}, X_{TH}, X_{TT}\}$ where $X_{HH}$ means the amount received if two heads occur, and similarly for the other three results. The result of the first toss of the experiment, heads or tails, determines the values of all bets in $\mathscr Y = \{X \mid X_{HH} = X_{HT},\ X_{TH} = X_{TT}\}$, bets of form $(X_H, X_H, X_T, X_T)$. Assuming that tails and heads have probability 1/2 given the result of the first toss, the conditional probability of $X = (X_{HH}, X_{HT}, X_{TH}, X_{TT})$ is $P^{\mathscr X}_{\mathscr Y}X = (X_H, X_H, X_T, X_T)$ where $X_H = \tfrac12(X_{HH} + X_{HT})$ and $X_T = \tfrac12(X_{TH} + X_{TT})$. Here $P^{\mathscr X}_{\mathscr Y}X$ is a bet equivalent to X that has known value, either $X_H$ or $X_T$, once the first toss is known.
Suppose that head on the first toss has probability p, and tail has probability $(1 - p)$. Then
$$P^{\mathscr X}_{\mathscr X_1}X = P^{\mathscr Y}_{\mathscr X_1}P^{\mathscr X}_{\mathscr Y}X = P^{\mathscr Y}_{\mathscr X_1}[X_H, X_H, X_T, X_T] = pX_H + (1 - p)X_T = \tfrac12 pX_{HH} + \tfrac12 pX_{HT} + \tfrac12(1 - p)X_{TH} + \tfrac12(1 - p)X_{TT}.$$
The probability on $\mathscr X$ corresponds to giving probability $\tfrac12 p$, $\tfrac12 p$, $\tfrac12(1 - p)$, $\tfrac12(1 - p)$ to the four outcomes HH, HT, TH, TT. These probabilities have been developed from conditional probability using the product rule, but in the finite case we could just as well define conditional probability in terms of probability; a separate axiomatization of conditional probability is necessary only in the infinite case.
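In the finite case the product rule can be checked directly. Below is a small sketch (my own encoding, not the book's): a bet is a 4-tuple over the outcomes (HH, HT, TH, TT), and an assumed head probability p plays the role of $P^{\mathscr Y}_{\mathscr X_1}$.

```python
# A sketch of Example 1 in finite form (not from the book): bets are
# 4-vectors indexed by (HH, HT, TH, TT); p is an assumed head probability.

def cond_prob_first_toss(x):
    """P_Y^X: replace a bet by an equivalent bet known after the first toss."""
    xhh, xht, xth, xtt = x
    xh = (xhh + xht) / 2          # value if first toss is heads
    xt = (xth + xtt) / 2          # value if first toss is tails
    return (xh, xh, xt, xt)

def prob_on_y(y, p):
    """P_{X1}^Y: average a first-toss bet, heads with probability p."""
    return p * y[0] + (1 - p) * y[2]

def prob_on_x(x, p):
    """Direct probability on X: weights p/2, p/2, (1-p)/2, (1-p)/2."""
    w = (p / 2, p / 2, (1 - p) / 2, (1 - p) / 2)
    return sum(wi * xi for wi, xi in zip(w, x))

x = (4.0, 0.0, -2.0, 6.0)
p = 0.3
# product rule: P_{X1}^X = P_{X1}^Y P_Y^X
assert abs(prob_on_y(cond_prob_first_toss(x), p) - prob_on_x(x, p)) < 1e-12
```

The assertion is exactly the product rule $P^{\mathscr X}_{\mathscr X_1} = P^{\mathscr Y}_{\mathscr X_1}P^{\mathscr X}_{\mathscr Y}$ for this two-toss space.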
EXAMPLE 2: Uniform distribution on the square. Let $\mathscr X$ denote the smallest probability space of functions including the continuous functions on the square; let $X(u, v)$ denote the value of X at the point (u, v), $0 \le u, v \le 1$. Let $\mathscr Y$ denote the set of functions in $\mathscr X$ depending only on u: $Y(u, v) = Y(u, 1)$ for all v.
Define $P^{\mathscr X}_{\mathscr Y}X = \int X(u, v)\,dv$.
Define $P^{\mathscr Y}_{\mathscr X_1}Y = \int Y(u, v)\,du$.
Here $P^{\mathscr X}_{\mathscr Y}X$ is a bet equivalent to X that is a function of u alone. From the product axiom, $P^{\mathscr X}_{\mathscr X_1}X = \int(\int X(u, v)\,dv)\,du = \iint X(u, v)\,du\,dv$. Thus the probability on $\mathscr X$ is just Lebesgue measure, in which the probability of a set is its area.
Beginning with $P^{\mathscr X}_{\mathscr X_1}$, it is possible to construct the conditional probability $P^{\mathscr X}_{\mathscr Y}$ to satisfy the product axiom almost uniquely: any other solution $Q^{\mathscr X}_{\mathscr Y}$ satisfies $P^{\mathscr Y}_{\mathscr X_1}|(P^{\mathscr X}_{\mathscr Y} - Q^{\mathscr X}_{\mathscr Y})X| = 0$. Tjur (1974) assures uniqueness by requiring continuity of the conditional probability, but then establishing existence is sometimes formidable.
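Numerically, the product axiom on the square is just the equality of an iterated Riemann sum with the double sum; the following sketch (assumed midpoint-rule quadrature and a made-up bet $X(u, v) = u^2 + 3uv$, not from the book) illustrates it.

```python
# A numeric sketch: for X(u, v) on the unit square, averaging over v and
# then over u matches the double integral, illustrating the product axiom
# P^X_{X1} = P^Y_{X1} P^X_Y. The integrand is an arbitrary assumed bet.

def x_func(u, v):
    return u * u + 3 * u * v

n = 400
h = 1.0 / n
pts = [(i + 0.5) * h for i in range(n)]   # midpoint rule

def p_xy(u):                               # P^X_Y X: integrate out v
    return sum(x_func(u, v) for v in pts) * h

iterated = sum(p_xy(u) for u in pts) * h   # then integrate out u
double = sum(x_func(u, v) for u in pts for v in pts) * h * h

assert abs(iterated - double) < 1e-9       # same Riemann sum, regrouped
# exact value: the double integral of u^2 + 3uv over the unit square is 13/12
assert abs(double - 13 / 12) < 1e-3
```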
EXAMPLE 3: Uniform distribution on the integers. Let $\mathscr X$ be the space of sequences $\{x_1, x_2, \dots, x_n, \dots\}$ with $\sum|x_i| < \infty$, let $\mathscr Y$ be the space of sequences $\{\alpha, \beta, \alpha, \beta, \dots\}$.
Define $P^{\mathscr X}_{\mathscr Y}X = \{\alpha(X), \beta(X), \alpha(X), \beta(X), \dots\}$ where $\alpha(X) = 2\sum_{i=1}^\infty x_{2i-1}$, $\beta(X) = 2\sum_{i=1}^\infty x_{2i}$.
Define $P^{\mathscr Y}_{\mathscr X_1}Y = \tfrac12\alpha + \tfrac12\beta$.
Then $P^{\mathscr X}_{\mathscr X_1}X = P^{\mathscr Y}_{\mathscr X_1}P^{\mathscr X}_{\mathscr Y}X = \sum_{i=1}^\infty x_i$.
Here $P^{\mathscr X}_{\mathscr Y}$ is a uniform distribution on the evens and the odds, according as $Y = (0, 1, 0, 1, 0, 1, \dots)$ is one or zero. The conditional distribution is not unitary. The distribution on $\mathscr Y$ gives probability 1/2 to the evens, probability 1/2 to the odds; this distribution is unitary. The distribution on $\mathscr X$ is uniform over the integers; it is non-unitary.
Note that $P^{\mathscr X}_{\mathscr Y}$ is not determined by saying that it is uniform given evens and uniform given odds; probabilities given even must be compared explicitly
with probabilities given odds (such comparisons are implicit when probabilities are unitary, since 1 has the same probability under all conditions).
Let the conditioning family $\mathscr Z$ be the space of sequences X with $x_{2i} = x_{2i-1}$, $i = 1, 2, \dots$. Define
$$P^{\mathscr X}_{\mathscr Z}X = \{\tfrac12 x_1 + \tfrac12 x_2,\ \tfrac12 x_1 + \tfrac12 x_2,\ \tfrac12 x_3 + \tfrac12 x_4,\ \tfrac12 x_3 + \tfrac12 x_4, \dots\}$$
$$P^{\mathscr Z}_{\mathscr X_1}Z = 2\sum_i z_{2i}.$$
Then again $P^{\mathscr X}_{\mathscr X_1}X = \sum x_i$.
Here the conditional probability is uniform given 1 or 2, or given 3 or 4, or given 5 or 6, .... The conditional probability is unitary; the probability on $\mathscr Z$ is non-unitary. In this case, some elements of $\mathscr Z$ lie in $\mathscr X$ and for these $P^{\mathscr X}_{\mathscr Z}Z = Z$, satisfying invariance.
EXAMPLE 4: Uniform distribution in the plane. Let $\mathscr X$ be the smallest probability space which includes the continuous functions zero outside some square in the plane, $-\infty \le u, v \le \infty$. Let $\mathscr Y$ be the probability space of such functions depending only on u.
Define $P^{\mathscr X}_{\mathscr Y}X = \int X(u, v)\,dv$ and $P^{\mathscr Y}_{\mathscr X_1}Y = \int Y(u, v)\,du$.
Then
$$P^{\mathscr X}_{\mathscr X_1}X = \int\Big(\int X(u, v)\,dv\Big)du = \iint X(u, v)\,du\,dv,$$
corresponding to the uniform distribution on the plane.
Note that the conditional distribution is not determined by requiring that the distribution be uniform given each u; since the uniform distribution is non-unitary, it is possible to have a conditional distribution which is uniform given each u, a distribution on u which is not uniform, and a distribution over the whole plane which is uniform. The marginal distribution on $\mathscr Y$ is not determined by the distribution on $\mathscr X$; the conditional distribution is determined only up to an arbitrary weighting factor depending on u. Given the distribution on $\mathscr X$ and the distribution on $\mathscr Y$, the product axiom determines $P^{\mathscr X}_{\mathscr Y}$ almost uniquely.
3.2. Product Probabilities
For arbitrary Sand T define the function subscript S, denoted by s' by
s(s)(t) = (s, t). Thus s is a function from S to the space of functions from T to
S x T. Define T similarly.
Fubini's Theorem. Let P be a probability on !!£ on S, let Q be a probability on
qy on T, let X x Y be the function on S x T: (s, t) ~ Xes) y(t), and let !!£ x qy
be the smallest probability space including all X x Y.
Then P x QW = PQWs = QPWT is the unique probability on !!£ x qy such
that P x QX x Y = PX QY. (Note that QWs is thefunctions~ QWs(s).)
PROOF. Let $\mathscr W_0$ be the set of functions W in $\mathscr X \times \mathscr Y$ such that $W_s \in \mathscr Y$ for each $s \in S$. Then $\mathscr W_0$ includes all functions $X \times Y$, and is a probability space, so $\mathscr W_0 = \mathscr X \times \mathscr Y$.
Again let $\mathscr W_0$ be the set of functions W in $\mathscr X \times \mathscr Y$ such that $QW_S \in \mathscr X$. Then $\mathscr W_0$ is linear, includes all functions $X \times Y$ and is continuous, but it is not straightforward to show that $W \in \mathscr W_0 \Rightarrow |W|, W \wedge 1 \in \mathscr W_0$. Let $\mathscr A(\mathscr X)$ be the set of functions $\sum_{i=1}^n a_iX_i$, let $\mathscr A(\mathscr Y)$ be the set of functions $\sum_{i=1}^n a_iY_i$, and let $\mathscr A(\mathscr X, \mathscr Y)$ be the set of functions $\sum a_iX_i \times Y_i$, where the $X_i$ and $Y_i$ are 0-1 functions on S and T. For each $X \in \mathscr X$ there is a sequence $X_n \to X$, $|X_n| \le |X|$, $X_n \in \mathscr A(\mathscr X)$. Thus if $X^1, X^2 \in \mathscr X$, $Y^1, Y^2 \in \mathscr Y$, then $|X^1 \times Y^1 + X^2 \times Y^2| = \lim |X^1_n \times Y^1_n + X^2_n \times Y^2_n|$, where $X^1_n \times Y^1_n + X^2_n \times Y^2_n \in \mathscr A(\mathscr X, \mathscr Y)$ and is bounded by $(|X^1| \vee |X^2|) \times (|Y^1| \vee |Y^2|)$. Since $\mathscr A(\mathscr X, \mathscr Y) \subset \mathscr W_0$, and $\mathscr W_0$ is continuous, $|X^1 \times Y^1 + X^2 \times Y^2|$ lies in $\mathscr W_0$. By a similar argument, any finite sequence of operations involving linear combinations or absolute values or $\wedge 1$ on the functions $X \times Y$ will yield a function in $\mathscr W_0$, so that the prespace including all $X \times Y$ is included in $\mathscr W_0$. Since $\mathscr W_0$ is continuous, by Lemma 2.2, $\mathscr W_0 = \mathscr X \times \mathscr Y$.
Define $P \times Q\,W = PQW_S$; note that $PQ(X \times Y)_S = P(X\,QY) = PX\,QY$.
It is easy to show that $P \times Q$ is a probability. For example, continuity requires $W_n \to W$, $|W_n| \le W_0 \Rightarrow P \times Q\,W_n \to P \times Q\,W$. For each s, $W_{ns}(s) \to W_s(s)$ and $|W_{ns}(s)| \le W_{0s}(s)$, so $QW_{ns} \to QW_s$ and $|QW_{ns}| \le QW_{0s}$. Therefore $PQW_{nS} \to PQW_S$ as required.
Also $P \times Q$ is the only probability on $\mathscr X \times \mathscr Y$ such that $P \times Q\,(X \times Y) = PX\,QY$; for any W in $\mathscr X \times \mathscr Y$ may be approximated by a sequence of functions of form $\sum_n a_nX_n \times Y_n$. By symmetry $P \times Q\,W = QPW_T$, and the theorem follows. □
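A finite discrete analogue makes the two iterated evaluations concrete; in this sketch (an assumed weight-vector encoding, with weights deliberately not summing to one, mirroring possibly non-unitary P and Q) $PQW_S = QPW_T$ and $P \times Q(X \times Y) = PX\,QY$:

```python
# A discrete sketch of Fubini's theorem on assumed finite S and T: P and Q
# are weight vectors (not necessarily summing to 1), W is a matrix W(s, t).

p = [0.2, 0.5, 1.3]              # weights on S = {0, 1, 2}
q = [0.7, 0.4]                   # weights on T = {0, 1}
w = [[1.0, -2.0], [0.0, 3.0], [2.5, 1.0]]   # W(s, t)

# PQW_S: apply Q to each row, then P
pq = sum(p[s] * sum(q[t] * w[s][t] for t in range(2)) for s in range(3))
# QPW_T: apply P to each column, then Q
qp = sum(q[t] * sum(p[s] * w[s][t] for s in range(3)) for t in range(2))
assert abs(pq - qp) < 1e-12

# on product functions X x Y, P x Q(X x Y) = PX * QY
x = [1.0, 2.0, -1.0]
y = [0.5, 4.0]
pxq = sum(p[s] * q[t] * x[s] * y[t] for s in range(3) for t in range(2))
px = sum(p[s] * x[s] for s in range(3))
qy = sum(q[t] * y[t] for t in range(2))
assert abs(pxq - px * qy) < 1e-12
```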
3.3. Quotient Probabilities
Let $\mathscr X$, $\mathscr Z$ be probability spaces on S and let Y be a random variable from S, $\mathscr Z$ to T, $\mathscr Y$. The conditional probability of $\mathscr X$ given Y is defined by $(P^{\mathscr X}_YX)(Y) = P^{\mathscr X}_{Y^{-1}(\mathscr Y)}X$. Thus for each X, $P^{\mathscr X}_YX$ is a function in $\mathscr Y$, such that $(P^{\mathscr X}_YX)(t)$ is a probability on $\mathscr X$ for each t. The notation $P(X|Y)$ means $(P^{\mathscr X}_YX)(Y)$, a function in $Y^{-1}(\mathscr Y)$.
Suppose that X, Y and $X \times Y$ are random variables on U, $\mathscr Z$ to S, $\mathscr X$, to T, $\mathscr Y$, and to $S \times T$, $\mathscr X \times \mathscr Y$, and that there exists a conditional probability on $(X \times Y)^{-1}(\mathscr X \times \mathscr Y)$ given $Y^{-1}(\mathscr Y)$. The conditional probability of X, Y given Y is $(P^{X,Y}_YW)(Y) = P^{(X \times Y)^{-1}(\mathscr X \times \mathscr Y)}_{Y^{-1}(\mathscr Y)}W(X, Y)$, $W \in \mathscr X \times \mathscr Y$. Thus for each W in $\mathscr X \times \mathscr Y$, $P^{X,Y}_YW$ is a function in $\mathscr Y$, such that $(P^{X,Y}_YW)(t)$ is a probability on $\mathscr X \times \mathscr Y$ for each t. The notation $P[W(X, Y)|Y]$ means $(P^{X,Y}_YW)(Y)$ and is useful for indicating that X is summed over while Y is held fixed. The product rule becomes $P^{X,Y} = P^YP^{X,Y}_Y$.
A quotient probability $P^X_t$ is a probability on $\mathscr X$ for each t such that $gP^Xf: t \to g(t)P^X_tf$ lies in $\mathscr Y$ for each $g \in \mathscr Y$, $f \in \mathscr X$. A conditional probability $P^{X,Y}_Y$ is defined from a quotient probability by $P^{X,Y}_Yfg = gP^Xf$ for each $g \in \mathscr Y$, $f \in \mathscr X$; this equation determines $P^{X,Y}_Y$ on $\mathscr X \times \mathscr Y$. A quotient probability is not a conditional probability because $P^Xf$ may not lie in $\mathscr Y$ for each f in $\mathscr X$; it is convenient to use quotient probabilities to generate conditional probabilities because it is necessary to specify probabilities on $\mathscr X$ for each t, rather than on $\mathscr X \times \mathscr Y$ for each t. As before, $P[f(X)|Y]$ means $(P^Xf)(Y)$.
The random variables X and Y are independent if $P^{X,Y}fg = P^Xf\,P^Yg$ for $f \in \mathscr X$, $g \in \mathscr Y$; or equivalently, given that $P^X_Y$ is defined, if $P^X_Y = P^X$. Similarly, the random variables $\{X_i\}$ are independent if for any finite subset $X_1, \dots, X_n$, $P^{X_1,\dots,X_n}\prod f_i = \prod P^{X_i}f_i$, $f_i \in \mathscr X_i$.
The random variables X and Y are conditionally independent given Z if $P^{X,Y}_Zfg = P^X_Zf\,P^Y_Zg$, $f \in \mathscr X$, $g \in \mathscr Y$; that is, if $P^X_{Y,Z} = P^X_Z$.
EXAMPLE. Let S be the set of positive integer pairs (i, j), $i \ge j$. Let $\mathscr X$ be the probability space of functions X, $\sum_{i \ge j}|X(i, j)| < \infty$. Let $\mathscr Z$ be the probability space of real valued functions on S. Let Y be the function $Y(i, j) = j$ from S, $\mathscr Z$ to T, $\mathscr Y$ where T denotes the positive integers and $\mathscr Y$ consists of functions g with $\sum|g(j)| < \infty$.
A conditional probability $P^{\mathscr X}_Y$ is
$$(P^{\mathscr X}_YX)(j) = \sum_{i \ge j} X(i, j).$$
For each j, $(P^{\mathscr X}_YX)(j)$ defines a probability on $\mathscr X$. For each X, $P^{\mathscr X}_YX$ is a function in $\mathscr Y$. The function $(P^{\mathscr X}_YX)(Y)$ has value at (i, j): $(P^{\mathscr X}_YX)(Y)(i, j) = (P^{\mathscr X}_YX)(j) = \sum_{i' \ge j} X(i', j)$, defining a conditional probability of $\mathscr X$ given $Y^{-1}(\mathscr Y)$.
Now let $X(i, j) = i$, and define $\mathscr X$ and $\mathscr Y$ to be the spaces of functions f that take finitely many non-zero values.
Then $P^{X,Y}_YW(j) = \sum_{i \ge j} W(i, j)$ where W is any real valued function finitely non-zero.
$$P^{X,Y}_Y(W)(Y) = P[W(X, Y)|Y] = \sum_{i \ge Y} W(i, Y),\ \text{a function in } Y^{-1}(\mathscr Y).$$
The quotient probability $P^X_jf = \sum_{i \ge j} f(i)$ defines a probability on $\mathscr X$ for each j, such that $\{g(j)P^X_jf\} \in \mathscr Y$ for each g in $\mathscr Y$.
$$P^{X,Y}_Yfg = g(j)\sum_{i \ge j} f(i) = gP^Xf.$$
Here X and Y are not independent because $P^X_jf$ varies with j.
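The quotient probability of the example can be sketched on a truncated range of pairs; the cutoff N below is an assumption for computability only.

```python
# A truncated sketch of the example: the quotient probability
# P_j^X f = sum_{i >= j} f(i), and the induced conditional probability
# satisfying P_Y^{X,Y} fg = g P^X f. N is an assumed cutoff.

N = 50

def quotient(f, j):              # P_j^X f = sum over i >= j of f(i)
    return sum(f(i) for i in range(j, N + 1))

def cond(f, g, j):               # (P_Y^{X,Y} fg)(j) = g(j) * P_j^X f
    return g(j) * quotient(f, j)

f = lambda i: 1.0 if i in (3, 5) else 0.0   # finitely non-zero
g = lambda j: 2.0 if j == 4 else 0.0

# at j = 4 only i = 5 survives the constraint i >= j
assert quotient(f, 4) == 1.0
assert cond(f, g, 4) == 2.0
# P_j^X f varies with j, so X and Y are not independent
assert quotient(f, 1) != quotient(f, 4)
```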
3.4. Marginalization Paradoxes
In Stone and Dawid (1972), and in Dawid, Stone and Zidek (1973) a number
of "marginalization paradoxes" are produced using improper priors. See
also Sudderth (1980) where it is shown that no marginalization paradox
arises with unitary finitely additive priors.
Consider example 1 of Stone and Dawid (1972). Random variables X and Y are independent exponential given parameters $\theta\phi$ and $\phi$, and $\theta, \phi$ have density $e^{-\theta}$ with respect to Lebesgue measure on the positive quadrant. The joint density of X, Y, $\theta$, $\phi$ is $e^{-\theta}\theta\phi^2\exp[-\phi(\theta X + Y)]$, and the product rule $P^{X,Y,\theta,\phi} = P^{\theta,\phi}P^{X,Y,\theta,\phi}_{\theta,\phi}$ is satisfied.
The conditional density of $\theta, \phi$ given X, Y is $e^{-\theta}\theta\phi^2\exp[-\phi(\theta X + Y)]/f(X, Y)$ where $f(X, Y)$ is the density of X, Y:
$$f(X, Y) = \iint e^{-\theta}\theta\phi^2\exp[-\phi(\theta X + Y)]\,d\theta\,d\phi.$$
Again the product rule is satisfied.
However the conditional density of $\theta$ given X, z, where $z = Y/X$, is $e^{-\theta}\theta/[(\theta + z)^3f(z)]$, which does not depend on X; Stone and Dawid take this to imply that the conditional density of $\theta$ given z is $e^{-\theta}\theta/[(\theta + z)^3f(z)]$. Similarly the conditional density of z given $\theta$ and $\phi$ is $\theta/(\theta + z)^2$, independent of $\phi$; Stone and Dawid take this to imply the density of z given $\theta$ is $\theta/(\theta + z)^2$, which is inconsistent with the conditional density of $\theta$ given z being $e^{-\theta}\theta/[(\theta + z)^3f(z)]$.
The paradox is caused by the implicit assumption that $P^{z,\theta}_\theta = P^{z,\theta,\phi}_{\theta,\phi}$ whenever $P^{z,\theta,\phi}_{\theta,\phi}$ is independent of $\phi$; this assumption is valid if $\theta, \phi$ given $\theta$ is unitary, for then, letting f be a continuous function of two real variables, vanishing outside some square,
$$P^{z,\theta,\phi}_\theta[f] = P^{\theta,\phi}_\theta P^{z,\theta,\phi}_{\theta,\phi}[f] = P^{z,\theta,\phi}_{\theta,\phi}f \quad (\text{which is independent of } \phi).$$
Thus $P^{z,\theta}_\theta f = P^{z,\theta,\phi}_{\theta,\phi}f$.
However, if $\theta, \phi$ given $\theta$ is not unitary, $P^{\theta,\phi}_\theta P^{z,\theta,\phi}_{\theta,\phi}f$ is not defined, since $P^{\theta,\phi}_\theta$ is not defined for non-zero functions constant over $\phi$. Instead,
$$P^{z,\theta,\phi}_\theta[h(\phi)f] = P^{\theta,\phi}_\theta h(\phi)P^{z,\theta,\phi}_{\theta,\phi}f.$$
Thus the joint distribution of $\phi$ and z given $\theta$ is a product distribution, but because $\phi$ given $\theta$ is not unitary it is not possible to determine the marginal distribution of z given $\theta$. (In the same way, if X, Y is uniformly distributed over the plane we cannot determine that X is uniformly distributed on the line.)
In the example, take $\theta$ to have density $e^{-\theta}$ and $\phi$ given $\theta$ to be uniform. Then z, $\phi$ given $\theta$ has density $\theta/(\theta + z)^2$; but z given $\theta$ does not have density $\theta/(\theta + z)^2$ because it is not valid to integrate over $\phi$.
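The density $\theta/(\theta + z)^2$ of z given $\theta$ and $\phi$, and its independence of $\phi$, can be checked by quadrature; the following sketch (assumed midpoint rule, not from the book) integrates $f_X(x)f_Y(zx)\,x$ over x for $X \sim \mathrm{Exp}(\theta\phi)$, $Y \sim \mathrm{Exp}(\phi)$.

```python
# A numeric check (assumed quadrature) that the density of z = Y/X given
# theta and phi is theta/(theta + z)^2, the same for different phi:
# the density of Z at z is the integral of f_X(x) * f_Y(z x) * x over x.

import math

def density_z(z, theta, phi, n=100000, upper=40.0):
    h = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h                      # midpoint rule
        fx = theta * phi * math.exp(-theta * phi * x)
        fy = phi * math.exp(-phi * z * x)
        total += fx * fy * x * h
    return total

theta, z = 2.0, 1.0
target = theta / (theta + z) ** 2              # = 2/9
assert abs(density_z(z, theta, 1.0) - target) < 1e-4
assert abs(density_z(z, theta, 3.0) - target) < 1e-4   # independent of phi
```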
3.5. Bayes Theorem
A real valued function f on S is a density on $\mathscr X$ if $fX \in \mathscr X$ for $X \in \mathscr X$. A probability space $\mathscr X$ is $\sigma$-finite if there exists $X_0 \in \mathscr X$, $X_0 > 0$; or equivalently, if there exists $X_n \in \mathscr X$, $X_n \uparrow 1$ as $n \to \infty$.
Bayes Theorem.
Let P be a probability on U, $\mathscr Z$.
Let X, Y and $X \times Y$ be random variables from U, $\mathscr Z$ to S, $\mathscr X$, to T, $\mathscr Y$, and to $S \times T$, $\mathscr X \times \mathscr Y$.
Let f be a density on $\mathscr X \times \mathscr Y$.
Let $\mathscr X \times \mathscr Y$ be $\sigma$-finite.
Let $f_T(t): s \to f(s, t)$ lie in $\mathscr X$ for each t.
Let $f/P^Xf_T: (s, t) \to f(s, t)/P[f(X, t)]$ be a density on $\mathscr X \times \mathscr Y$.
Let $P^Y_Xg = Q^Y(gf_S)$ for some Q on $\mathscr Y$, each $g \in \mathscr Y$
[that is, $(P^Y_Xg)(s) = Q^{Y^{-1}(\mathscr Y)}[g(Y)f(s, Y)]$].
Then
$$P^Yg = Q^Y(gP^Xf_T),\quad g \in \mathscr Y;$$
$$P^X_Yh = P^X(hf_T)/P^X[f_T] \text{ as } P^Y,\quad \text{each } h \in \mathscr X$$
[that is, $(P^X_Yh)(t) = P[h(X)f(X, t)]/Pf(X, t)$, except for a set $T_0$ of t values with $P^YT_0 = 0$].
PROOF. Let $h_n \in \mathscr X$, $h_n \uparrow 1$, since $\mathscr X$ is $\sigma$-finite. For $g \in \mathscr Y$, $g \ge 0$, $h_ng \uparrow g$; since $h_ng \in \mathscr X \times \mathscr Y$ and $g \in \mathscr Y$,
$$P^{X,Y}h_ng = P^{(X \times Y)^{-1}(\mathscr X \times \mathscr Y)}h_n(X)g(Y) \uparrow P^{Y^{-1}(\mathscr Y)}g(Y) = P^Yg$$
$$P^{X,Y}h_ng = P^XP^Y_Xh_ng = P^X[h_nQ^Y(gf_S)]$$
$$P^Yg = P^XQ^Y(gf_S) = Q^YP^X(gf_T) = Q^Y(gP^Xf_T).$$
This shows $P^Yg = Q^Y(gP^Xf_T)$ for $g \ge 0$, and general g follows easily.
Secondly, it is necessary to show that $P^X_Y$ is a quotient probability; from the given definition,
$$P^{X,Y}_Y[gh] = gP^X[hf_T]/P^X[f_T].$$
Then
$$P^YP^{X,Y}_Ygh = P^Y(gP^Xhf_T/P^Xf_T) = Q^Y(gP^Xhf_T) \quad \text{from the first part of the proof,}$$
$$= P^XQ^Y(ghf_S) = P^X[hP^Y_Xg] = P^{X,Y}gh \quad \text{from the definition of } P^Y_X.$$
Thus $P^{X,Y}_Y$ as defined satisfies the product rule, and for any other conditional probability $Q^{X,Y}_Y$ satisfying the product rule,
$$P^Y|P^{X,Y}_YW - Q^{X,Y}_YW| = 0.$$
Thus $P^Yg|P^X_Yh - Q^X_Yh| = 0$ for any quotient probability Q satisfying the product rule. Since $\mathscr Y$ is $\sigma$-finite, g may be chosen positive, and so $P^Y|P^X_Yh - Q^X_Yh| = 0$. □
In terms of densities, we have that Y given X has density $f_S$ with respect to some probability $Q^Y$; under specified conditions on f and $\mathscr X$, X given Y is a unitary probability with density $f_T/P^Xf_T$ with respect to $P^X$. In the usual terminology, $f_S$ would be the likelihood of Y given X, $P^X$ is the prior distribution of X, and $f_T/P^Xf_T$ is the posterior density of X given Y. Frequently $P^X$ has some prior density p with respect to a probability $R^X$, and then the conditional probability of X given Y has posterior density $f_Tp/R^X(f_Tp)$ with respect to $R^X$.
Note that $P^X$ and $P^Y$ may not be unitary, but under the conditions of the theorem $P^X_Y$ is unitary. Renyi (1970) takes unitary conditional probabilities as the basic concept, expressing non-unitary probabilities such as Lebesgue measure by families of such conditional probabilities. It seems simpler to go the other way, to define unitary conditional probabilities from non-unitary probabilities; indeed we allow non-unitary conditional probabilities in general, though our Bayes theorem produces only unitary conditional probabilities.
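On a finite space the theorem reduces to the familiar computation posterior ∝ prior × likelihood, and the resulting conditional probability is unitary as the theorem asserts. A minimal sketch (the state names and numbers are illustrative assumptions, not the book's notation):

```python
# A minimal discrete sketch of the theorem's conclusion: prior P^X on a
# finite set, likelihood f(s, t); the posterior of X given Y = t has
# density f_T / P^X f_T with respect to P^X, i.e.
# posterior(s) is proportional to prior(s) * f(s, t).

prior = {"s1": 0.5, "s2": 0.3, "s3": 0.2}                      # P^X
lik = {("s1", "t"): 0.1, ("s2", "t"): 0.4, ("s3", "t"): 0.4}   # f(s, t)

marg = sum(prior[s] * lik[(s, "t")] for s in prior)            # P^X f_T at t
posterior = {s: prior[s] * lik[(s, "t")] / marg for s in prior}

assert abs(sum(posterior.values()) - 1.0) < 1e-12   # unitary posterior
assert abs(posterior["s2"] - 0.12 / 0.25) < 1e-12
```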
3.6. Binomial Conditional Probability
The binomial distribution is defined for n 0-1 random variables $X_1, X_2, \dots, X_n$ given a parameter p, $0 \le p \le 1$, by
$$P[X_i = x_i,\ i = 1, \dots, n \mid p] = p^{\sum x_i}(1 - p)^{n - \sum x_i}.$$
The random variables $X_1, X_2, \dots, X_n$ are independent and identically distributed given p, with $P(X_i|p) = p$.
In Bayesian analysis p (as well as the $X_i$) is taken to be a random variable on some underlying probability space $\mathscr X$, and a probability P is assumed on $\mathscr X$ such that the conditional distribution of $X_1, \dots, X_n$ given p is binomial. If p is not unitary, the marginal probability of $X_1, \dots, X_n$, $= P(p^{\sum x_i}(1 - p)^{n - \sum x_i})$, is not defined for all x; thus conditional probability given the observations must be carefully handled. For example, if $Pf = \int[f(p)/p(1 - p)]\,dp$ using Haldane's prior, then $P[X_1 = 0]$ is not defined and the conditional distribution of p given $X_1 = 0$ is not uniquely defined.
Assume that $p^m(1 - p)^{m'} \in \mathscr X$ whenever $m \ge a$, $m' \ge b$. Then define $\mathscr X^n_{a,b}$ to be the probability subspace of $\mathscr X$ including all functions
$$[X_i = x_i,\ i = 1, \dots, n'],\quad n' > n,\ x_{n'} = 1,\ \textstyle\sum x_i = a,\ n' - \sum x_i \ge b$$
$$[X_i = x_i,\ i = 1, \dots, n'],\quad n' > n,\ x_{n'} = 0,\ \textstyle\sum x_i \ge a,\ n' - \sum x_i = b$$
$$[X_i = x_i,\ i = 1, \dots, n],\quad \textstyle\sum x_i \ge a,\ n - \sum x_i \ge b.$$
Thus $\mathscr X^n_{a,b}$ corresponds to the shortest sequences of observations containing at least n observations, at least a successes, and at least b failures. The conditional probability of p given $\mathscr X^n_{a,b}$ is
$$P[f(p) \mid \mathscr X^n_{a,b}] = \frac{P[f(p)p^{\sum x_i}(1 - p)^{n' - \sum x_i}]}{P[p^{\sum x_i}(1 - p)^{n' - \sum x_i}]}.$$
It may be verified that $P[P(f(p) \mid \mathscr X^n_{a,b})] = Pf(p)$.
Consider, for example, Haldane's prior; here a = 1, b = 1 since $\int[f(p)/p(1 - p)]\,dp$ exists whenever $f(p) = p^m(1 - p)^{m'}$, $m \ge 1$, $m' \ge 1$. (Of course a and b could be positive fractions but this does not change $\mathscr X^n_{a,b}$.) Then $\mathscr X^2_{1,1}$ is generated from the sequences
001, 0001, 00001, ...
110, 1110, 11110, ...
01, 10.
The conditional probability of p given $\mathscr X^2_{1,1}$ is
$$P[f(p) \mid \mathscr X] = \frac{\int f(p)p^{\sum x_i - 1}(1 - p)^{n' - \sum x_i - 1}\,dp}{\int p^{\sum x_i - 1}(1 - p)^{n' - \sum x_i - 1}\,dp},$$
which is defined for each of the specified sequences since $\sum x_i \ge 1$, $n' - \sum x_i \ge 1$.
If a = 0, b = 0 we would have $\mathscr X^2_{0,0}$ generated from 00, 01, 10, 11; and the conditional probability of f(p) averages to the probability of f(p) when weighted by the probabilities of 00, 01, 10, 11. In the case a = 1, b = 1 the sequences 00 and 11 do not have defined probabilities, so the average that validates conditional probabilities is not available: 00 is replaced by 001, 0001, ..., and 11 is replaced by 110, 1110, ..., and an average with valid marginal weights becomes available.
In application, we will be able to give conditional probabilities whenever the data sequence has at least one 1 and at least one 0. If the data are all 0's or all 1's, no conditional probability consistent with the axioms is available.
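For a data sequence with $s \ge 1$ successes and $f = n' - s \ge 1$ failures, the displayed ratio of integrals is the mean of a Beta(s, f) distribution, so the posterior mean of p is $s/n'$. A sketch (assumed use of the Gamma function to evaluate the Beta integrals):

```python
# A sketch of the Haldane-prior posterior: for data with s >= 1 successes
# and f >= 1 failures, the posterior of p is Beta(s, f), with mean s/(s+f).

from math import gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def haldane_posterior_mean(data):
    s = sum(data)
    f = len(data) - s
    assert s >= 1 and f >= 1, "all 0's or all 1's: no conditional probability"
    # posterior mean = Beta(s+1, f) / Beta(s, f) = s / (s + f)
    return beta_fn(s + 1, f) / beta_fn(s, f)

assert abs(haldane_posterior_mean([0, 0, 1]) - 1 / 3) < 1e-12
assert abs(haldane_posterior_mean([1, 1, 1, 0]) - 3 / 4) < 1e-12
```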
3.7. Problems
Problems (Q) are ones that I find very hard.
E1. Let N denote the positive integers, and let $\mathscr Z$ be the space of sequences $\{z_1, z_2, \dots, z_n, \dots\}$ with $\sum|z_i| < \infty$. Let X and Y be random variables into (N, $\mathscr Z$), and let $X \times Y$ be a random variable into $N \times N$, with $P\{X = i, Y = j\} = p_{ij}$. Define $p_{\cdot j} = \sum_i p_{ij}$. Let $\delta_k(i) = \{i = k\}$, $k \in N$.
Then $P^Y\delta_j = p_{\cdot j}$, and $(P^X_Y\delta_k)(j) = p_{kj}/p_{\cdot j}$.
E2. Let $\mathbb R$ be the real line, B the space of bounded Lebesgue integrable functions (obtained by completing the probability that values an interval by its length, and accepting bounded functions in that completion). Let X and Y be random variables into ($\mathbb R$, B) and let f(x, y) be such that $g(y)f(x, y) \in B \times B$ for each $g \in B$.
Suppose
$$P^{X,Y}W = \iint W(x, y)f(x, y)\,dx\,dy.$$
Then
$$P^Yg = \iint g(y)f(x, y)\,dx\,dy,$$
and
$$P^X_Y(h)(y) = \int h(x)f(x, y)\,dx \Big/ \int f(x, y)\,dx.$$
P1. Let X be a real valued random variable uniformly distributed, and let the conditional distribution of Y given X give probability $\tfrac12$ to $X - \tfrac12$ and probability $\tfrac12$ to $X + \tfrac12$. Find the joint distribution of $X \times Y$ and the conditional distribution of X given Y. (Bayes theorem fails.)
P2. Say $X \sim N(\mu_0, \sigma_0^2)$ if X is a real random variable having density $\exp[-\tfrac12(\mu_0 - x)^2/\sigma_0^2]/(\sigma_0\sqrt{2\pi})$ with respect to Lebesgue measure. Suppose $\theta \sim N(\mu_0, \sigma_0^2)$, $X|\theta \sim N(\theta, \sigma^2)$. Find the distribution of $\theta$ given X.
If, in addition, $Y|X, \theta \sim N(X - \theta, \sigma^2)$, find $\theta|X, Y$.
P3. The three observations 1, 3, 7, given $\theta$, are from a normal family $(1/\sqrt{2\pi})\exp[-\tfrac12(x - \theta)^2]$ with probability $\tfrac12$, or from the family $\tfrac12\exp[-|x - \theta|]$ with probability $\tfrac12$. The prior distribution of $\theta$ is uniform. Find the posterior distribution of $\theta$ given the observations, and the posterior probability that the normal is the true distribution.
P4. Assume that $X|\theta \sim N(\theta, 1)$, and that $\theta \sim N(\theta_0, \sigma_0^2)$. Usually $\theta_0$ and $\sigma_0^2$ are assumed known, but suppose, as an afterthought, you decide that $\theta_0 \sim N(0, \sigma_1^2)$. Find the posterior distribution of $\theta$ given X.
P5. Let $X_1, \dots, X_n$ be a sample from the uniform $(\theta - \tfrac12, \theta + \tfrac12)$ given $\theta$. Let the prior probability of $\theta$ be uniform. Find the posterior probability of $\theta|X_1, \dots, X_n$ and compute the posterior mean and variance.
Q1. Suppose $\mathscr X$ and $\mathscr Y$ are probability spaces on S, and that P: $\mathscr X$ to $\mathscr Y$ satisfies the axioms of conditional probability, but that $\mathscr X \supset \mathscr Y$ is not assumed. When is it possible to extend P to $\mathscr Z$ including $\mathscr X$ and $\mathscr Y$, so that P: $\mathscr Z$ to $\mathscr Y$ is a conditional probability?
Q2. If P is a conditional probability on $\mathscr X$ to $\mathscr Y$, does there exist a complete conditional probability P′ on $\mathscr X'$ to $\mathscr Y$ such that P′ coincides with P on $\mathscr X \subset \mathscr X'$?
3.8. References
Dawid, A. P., Stone, M. and Zidek, J. V. (1973), Marginalization paradoxes in Bayesian
and statistical inference, J. Roy. Stat. Soc. B 35, 189-223.
Kolmogorov, A. N. (1933), Foundations of the Theory of Probability. New York:
Chelsea.
Renyi, A. (1970), Probability Theory. New York: American Elsevier.
Stone, M. and Dawid, A. P. (1972), Un-Bayesian implications of improper Bayes inference in routine statistical problems, Biometrika 59, 369-373.
Sudderth, W. D. (1980), Finitely additive priors, coherence, and the marginalization paradox, J. Roy. Stat. Soc. B 42, 339-341.
Tjur, T. (1972), On the mathematical foundations of probability. Inst. of Math. Statist., University of Copenhagen.
Tjur, T. (1974), Conditional probability distributions. Inst. of Math. Statist., University of Copenhagen.
CHAPTER 4
Convergence
4.0. Introduction
Notions of convergence, as the amount of information increases, are
necessary to check the consequences of probability assignments in empirical
fact. For example, if we assume that a penny has probability 1/2 of coming
down heads on each toss, and that the different tosses are independent, it
follows that the limiting proportion of heads will be 1/2 almost surely.
A standard method of evaluating statistical procedures is through their asymptotic properties; partly this is a matter of necessity because the asymptotic behavior is simple. For example, one of the desirable properties of a maximum likelihood estimate is that it converges, in a certain sense under certain regularity conditions, to the unknown parameter value. A famous theorem due to Doob (1949) handles consistency of Bayes procedures: if any estimate converges to the unknown parameter value in probability, then the posterior distribution concentrates on the unknown value as the data increase.
4.1. Convergence Definitions
Let $\mathscr X$ be a probability space on S. A real valued function X on S is measurable if $\{X > a\}, \{X < -a\} \in \mathscr X$ for each $a > 0$. Thus X is a random variable into R, $\mathscr B_0$ where $\mathscr B_0$ is the smallest probability space containing intervals $(a, \infty)$, $(-\infty, -a)$ for each $a > 0$.
The space of measurable functions is a probability space which includes $\mathscr X$. In the following X, $X_1, \dots, X_n, \dots$ are assumed to be measurable functions
with respect to $\mathscr X$ on S, and it is assumed that there is a probability P on $\mathscr X$ on S.
$X_1 = X_2$ as P means $P(X_1 \ne X_2) = 0$.
$X_n \to X$ as P means $P\{s \mid X_n(s) \not\to X(s)\} = 0$; equivalently, for each $\varepsilon > 0$, $P\{|X_n - X| > \varepsilon \text{ some } n > N\} \to 0$ as $N \to \infty$.
$X_n \to X$ in P means $P\{|X_n - X| > \varepsilon\} \to 0$ for each $\varepsilon > 0$.
$X_n \to X$ by P means $P|X_n - X| \to 0$ as $n \to \infty$.
If $X_n \to X$ by P or $X_n \to X$ as P, then $X_n \to X$ in P.
If $X \in \mathscr X$, $X_n \in \mathscr X$, $X_n \to X$ in P, and $\sup_n P(|X_n|(\{|X_n| > A\} + \{|X_n| < 1/A\})) \to 0$ as $A \to \infty$, then $X_n \to X$ by P. The sup condition is a generalization of the notion of uniform integrability, necessary to handle non-unitary P. With some mess, it may be shown that it is sufficient to prove the result for $X = 0$:
$$\overline\lim P|X_n| \le \sup_n P\Big(|X_n|\Big(\{|X_n| > A\} + \Big\{|X_n| < \frac1A\Big\}\Big)\Big) + \overline\lim P\Big[|X_n|\Big\{\frac1A \le |X_n| < A\Big\}\Big].$$
Since $\overline\lim P(|X_n|\{1/A \le |X_n| < A\}) \le A\,\overline\lim P\{|X_n| \ge 1/A\} = 0$ for all A, $\overline\lim P|X_n| = 0$.
$X_n \to X$ in D (in distribution) if $P[f(X_n)] \to P[f(X)]$ for each bounded continuous f that vanishes near zero. It is not necessary that the $X_n$ and X be defined on the same probability space; the definition involves only the distributions of the $X_n$ and X.
If $X_n \to X$ in P then $X_n \to X$ in D. To show this, for each f there is a fixed $\varepsilon_0$ such that $f(x) = 0$ for $|x| \le \varepsilon_0$, and for each k, $\delta$ there is an $\varepsilon < \varepsilon_0/2$ depending on $(k, \delta)$ such that $|f(x) - f(y)| < \delta$ for $|x| < k$, $|x - y| < \varepsilon$. Thus
$$|f(x) - f(y)| \le 2\sup f\,(\{|x| > k\} + \{|x - y| > \varepsilon\}) + \delta\{|x| \ge \varepsilon_0/2\}$$
$$\overline\lim P|f(X) - f(X_n)| \le 2\sup f\,P\{|X| > k\} + \delta P(|X| \ge \varepsilon_0/2).$$
Choosing k large and $\delta$ small gives $P|f(X) - f(X_n)| \to 0$, so $X_n \to X$ in D.
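The gap between convergence in P and convergence by P can be made concrete with a standard example (an assumption of this sketch, not from the book: P is unitary, uniform on [0, 1], and $X_n = n$ on $[0, 1/n)$, zero elsewhere):

```python
# X_n = n on [0, 1/n), else 0, under uniform P on [0, 1].
# Then P{|X_n| > eps} = 1/n -> 0, so X_n -> 0 in P,
# while P|X_n| = n * (1/n) = 1 for all n, so X_n does not -> 0 by P;
# the uniform-integrability-type sup condition fails here.

def prob_exceeds(n, eps):
    # P{|X_n| > eps}: the set [0, 1/n) once n > eps
    return 1.0 / n if n > eps else 1.0

def mean_abs(n):
    # P|X_n| = n * length([0, 1/n)) = 1 for every n
    return n * (1.0 / n)

assert prob_exceeds(10**6, 1.0) == 1e-06
assert all(abs(mean_abs(n) - 1.0) < 1e-12 for n in [2, 10, 1000])
```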
4.2. Mean Convergence of Conditional Probabilities
Theorem. Let f!(n be a sequence of probability spaces contained in the probability space f!(, let P be a probability on f!(, and let Pn = P~n be the conditional
probability of f!( given f!( •. Let X E f!(.
(i) If X is mean-approximable by f!l n (that is P IX n - X 1-+ 0 for X nE f!l n)'
then X is mean-approximable by the sequence PnX.
(ii) If X is square-approximable by f!ln (that is P(Xn - X)2 -+0 for XnEf!(n)'
then X is square-approximable by the sequence PnX.
PROOF. Since $\mathscr X_n \subset \mathscr X$, $P_nX_n = X_n$ for each $X_n$ in $\mathscr X_n$, so $P_nP_n = P_n$.
$$P_n|X - P_nX| \le P_n|X - X_n| + P_n|X_n - P_nX| = P_n|X - X_n| + P_n|P_n(X_n - X)| \le P_n|X - X_n| + P_nP_n|X_n - X| = 2P_n|X - X_n|.$$
Thus if X is mean-approximable by $\mathscr X_n$, it is mean-approximable by $P_nX$.
$$P_n(X - X_n)^2 = P_n(X - P_nX)^2 + 2P_n[(X - P_nX)(P_nX - X_n)] + P_n(X_n - P_nX)^2 = P_n(X - P_nX)^2 + 0 + P_n(X_n - P_nX)^2.$$
Thus if X is square-approximable by $\mathscr X_n$, it is square-approximable by $P_nX$. □
EXAMPLE. Let P be a complete probability on $\mathscr U$ on U. Let X, $X_1, \dots, X_n$ be random variables on U, $\mathscr U$. Suppose $X_1, \dots, X_n, \dots$ are independent and identically distributed given $\theta$; let $P^{X_i}_\theta = P^X_\theta$ for $i = 1, 2, \dots$.
Assume that X is a real valued random variable such that X, $X^2 \in \mathscr U$. Set $\mu = P(X|\theta)$.
Since $X_1, \dots, X_n$ are independent given $\theta$,
$$P\Big[\Big(\frac1n\sum X_i - \mu\Big)^2 \,\Big|\, \theta\Big] = P[(X - \mu)^2 \mid \theta]/n$$
$$P\Big[\frac1n\sum X_i - \mu\Big]^2 = P[X - \mu]^2/n.$$
Note that $\mu^2 \in \mathscr U$ because $|\mu|(|\mu| \wedge k) \uparrow \mu^2$ as $k \uparrow \infty$, $|\mu|(|\mu| \wedge k) \in \mathscr U$, and $P[|\mu|(|\mu| \wedge k)] \le PX^2$ for all k.
Thus $\mu$ is square-approximable by $\frac1n\sum X_i$ and hence by $P[\mu \mid X_1, X_2, \dots, X_n]$. This is a Bayesian adaptation of the law of large numbers, in which the unknown population mean $\mu$ is increasingly well approximated by its best estimate from the sample $X_1, \dots, X_n$.
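The example can be simulated under assumed specifics (Bernoulli observations, a uniform prior on $\theta$, and the Python stdlib RNG): then $\mu = \theta$, the posterior mean is $(s + 1)/(n + 2)$, and its mean square error under the joint distribution shrinks as n grows.

```python
# A simulation sketch (assumed model, not from the book): theta ~ uniform,
# X_i | theta i.i.d. Bernoulli(theta), mu = P(X | theta) = theta, and
# P[mu | X_1..X_n] = (s + 1)/(n + 2). The mean square error
# P(P_n mu - mu)^2 shrinks as n grows, as the example asserts.

import random

def mse_of_posterior_mean(n, reps=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        theta = rng.random()                       # uniform prior draw
        s = sum(rng.random() < theta for _ in range(n))
        est = (s + 1) / (n + 2)                    # posterior mean of theta
        total += (est - theta) ** 2
    return total / reps

assert mse_of_posterior_mean(500) < mse_of_posterior_mean(5)
```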
4.3. Almost Sure Convergence of Conditional Probabilities

Theorem. Let P be a probability on $\mathscr X$, and let $\mathscr X_n$ be an increasing sequence of subspaces of $\mathscr X$. Let $P_n = P^{\mathscr X}_{\mathscr X_n}$. Let $X \in \mathscr X$. If X is mean- or square-approximable by $\{\mathscr X_n\}$, then
$$P_nX \to X \text{ as } P.$$
PROOF. Let $X_n = P_nX$. Then $P_{n-1}X_n = X_{n-1}$ by the product law, so $X_n$ is
37
4.3. Almost Sure Convergence of Conditional Probabilities
a martingale when P is unitary; in this case the theorem is well known,
Doob (1953).
Lemma. $P\{\sup_{1\le i\le n}|X_i| \ge \varepsilon\} \le PX_n^2/\varepsilon^2$, and $P\{\sup_{1\le i\le n}|X_i| \ge \varepsilon\} \le P|X_n|/\varepsilon$.

PROOF. Let $A_i = \{|X_1| < \varepsilon, |X_2| < \varepsilon, \dots, |X_i| \ge \varepsilon\}$, $i = 1, 2, \dots, n$. Then

$\sum A_i = \{\sup_{1\le i\le n}|X_i| \ge \varepsilon\}$
$A_iX_n^2 = A_i(X_i^2 + 2(X_n - X_i)X_i + (X_n - X_i)^2)$
$P[A_i(X_n - X_i)X_i] = PP_i[A_iX_i(X_n - X_i)] = P[A_iX_iP_i(X_n - X_i)] = 0$
$P(A_iX_n^2) \ge \varepsilon^2 PA_i$ since $|X_i| \ge \varepsilon$ when $A_i \ne 0$
$PX_n^2 \ge \varepsilon^2 P\sum A_i = \varepsilon^2 P\{\sup_{1\le i\le n}|X_i| \ge \varepsilon\}$ as required.

For the second result, define $B_i = \{X_1 < \varepsilon, X_2 < \varepsilon, \dots, X_i \ge \varepsilon\}$:

$X_nB_i = B_i(X_i + (X_n - X_i))$
$PX_nB_i = PB_iX_i + PP_iB_i(X_n - X_i) = PB_iX_i \ge \varepsilon PB_i$
$P(X_n^+) \ge P(X_n\sum B_i) \ge \varepsilon P(\sum B_i) = \varepsilon P\{\sup X_i \ge \varepsilon\}$
$P(X_n^-) \ge \varepsilon P\{\inf X_i \le -\varepsilon\}$ similarly, so
$P|X_n| \ge \varepsilon P\{\sup|X_i| \ge \varepsilon\}$.
Turning to the theorem, if $X$ is square-approximable, from 4.2, $P(X_n - X)^2 \to 0$, and for a suitable subsequence $\sum_r P(X_{n_r} - X)^2 < \infty$.

$\{\sup_{i,j\ge N}|X_i - X_j| \ge 4\varepsilon\} \le \sum_{n_r\ge N}\big(\{\sup_{n_{r-1}\le i\le n_r}|X_i - X_{n_{r-1}}| \ge \varepsilon\} + \{|X - X_{n_r}| \ge \varepsilon\}\big)$

$P\{\sup_{i,j\ge N}|X_i - X_j| \ge 4\varepsilon\} \le \sum_{n_r\ge N}\big[P(X_{n_r} - X_{n_{r-1}})^2/\varepsilon^2 + P(X - X_{n_r})^2/\varepsilon^2\big] \to 0$ as $N \to \infty$,

using the lemma and $\sum_r P(X_{n_r} - X)^2 < \infty$.

If $X$ is mean-approximable, the same argument holds with $P|X_{n_r} - X|$ replacing $P(X_{n_r} - X)^2$ and $P|X_{n_r} - X_{n_{r-1}}|$ replacing $P(X_{n_r} - X_{n_{r-1}})^2$. $\square$
4.4. Consistency of Posterior Distributions
Doob's Theorem. Let $\mathscr{X}_n$ be an increasing sequence of subprobability spaces of $\mathscr{X}$. Let $P$ be a probability on $\mathscr{X}$, let $P_n$ be conditional probabilities of $\mathscr{X}$ given $\mathscr{X}_n$. Let $\theta$ be a real valued random variable such that $\{a \le \theta \le b\} \in \mathscr{X}$ for $a, b$ finite, and suppose that $\theta$ is approximable by $\mathscr{X}_n$: $X_n \to \theta$ in $P$, some $X_n$ in $\mathscr{X}_n$. Let $C_n(a, b) = P_n\{a \le \theta \le b\}$. Then $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P$.
PROOF. Doob (1949) proved a version of this theorem for $P$ unitary. See also Schwartz (1965) and Berk (1970).

Let $a$ and $b$ not be atoms of $\theta$: $P\{a = \theta\} = P\{b = \theta\} = 0$. Since

$|\{a \le \theta \le b\} - \{a \le X_n \le b\}| \le \{|a - \theta| \le \varepsilon\} + \{|b - \theta| \le \varepsilon\} + \{|X_n - \theta| \ge \varepsilon\}$,

$P\{|X_n - \theta| \ge \varepsilon\} \to 0$ as $n \to \infty$, and $P\{|a - \theta| \le \varepsilon\} + P\{|b - \theta| \le \varepsilon\} \to 0$ as $\varepsilon \to 0$, so $P|\{a \le \theta \le b\} - \{a \le X_n \le b\}| \to 0$.

From 4.3, $P_n\{a \le \theta \le b\} \to \{a \le \theta \le b\}$ a.s. $P$. Thus

$C_n(a, b)(s) \to \{a \le \theta(s) \le b\}$ except for $s \in A$, $PA = 0$,
$C_n(\theta(s) - \varepsilon, \theta(s) + \varepsilon) \to 1$ except for $s \in A$,
$C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P$. $\square$
The condition of approximability and the conclusion of consistency may both be expressed conditionally on $\theta$: $\theta$ is approximable by $\mathscr{X}_n$ if $P_\theta[|X_n - \theta| > \varepsilon] \to 0$ except for a set of $\theta$-values of probability zero; and $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P$ implies that $C_n(\theta - \varepsilon, \theta + \varepsilon) \to 1$ a.s. $P_\theta$, except for a set of $\theta$-values of probability zero.
EXAMPLE. Let $X_1, X_2, \dots, X_n$ be a random sample from $N(\theta, 1)$; let $\theta$ be uniformly distributed on the line. Let $\mathscr{X}_n$ denote the probability space generated by $X_1, \dots, X_n$, and note that $|\bar X_n - \theta| \to 0$ in probability given $\theta$. The conditional distribution of $\theta$ given $\mathscr{X}_n$ is unitary. Therefore it concentrates on the true value $\theta$ as $n \to \infty$, except for a set of $\theta$-values of probability zero.
4.5. Binomial Case
Let P be a probability on f!{ on S.
Let P, the binomial parameter, be a real valued random variable, 0 ;i: p ;i: 1,
defined on f!{.
Let X I' X 2' ... , X n be n Bernoulli random variables, each taking the
values 0 or 1; thus X I' X 2' ... , X n maps S to the space of n-tuples
(XI' X 2 ' ••• , xn) where Xi = 0 or 1.
A binomial distribution has X I' X 2'
... ,
X n independent and identically
39
4.5. Binomial Case
distributed given p:
P[X; =
Xi'
i = 1, ... , nip] = p~:Xi(l - p)"-IXi.
The posterior distribution of $p$ given $X_1, \dots, X_n$ is

$P_nf = P[f(p) \mid X_1, \dots, X_n] = P[f(p)p^{\sum X_i}(1 - p)^{n - \sum X_i}]\big/P[p^{\sum X_i}(1 - p)^{n - \sum X_i}]$,

defined whenever $p^{\sum X_i}(1 - p)^{n - \sum X_i} \in \mathscr{X}$. (Note that the conditioning space $\mathscr{X}_n$ must contain observations of varying length if $P$ is not unitary.)
Theorem. Let $p, X_1, \dots, X_n, \dots$ be random variables on $\mathscr{X}$, and let $X_1, \dots, X_n, \dots$ be Bernoulli given $p$. Assume that $p^m(1 - p)^{m'} \in \mathscr{X}$ for $m \ge a$, $m' \ge b$. Say that $p_0$ is in the support of $P$ if $P[|p - p_0| < \varepsilon] > 0$ for each $\varepsilon > 0$. Then $P_n[|p - p_0| < \varepsilon] \to 1$ a.s. $P_{p_0}$ if and only if $p_0$ is in the support of $P$.

PROOF. Note that $P_n[|p - p_0| < \varepsilon]$ is a function in $\mathscr{X}_n$. Let $R = \sum X_i$. If $p_0$ is not in the support of $P$, $P[|p - p_0| < \varepsilon] = 0$ for some $\varepsilon > 0$, and so $P_n[|p - p_0| < \varepsilon] = 0$ for all $n$. This establishes the "only if."

Now suppose $p_0$ lies in the support of $P$. Assume $0 < p_0 < 1$.
Let $f(p_0, p) = p_0\log p + (1 - p_0)\log(1 - p)$.
Then $f(p_0, p)$, as a function of $p$, is continuous and has a unique maximum at $p = p_0$. Thus for each small $\delta > 0$, there exists $\Delta > 0$ with
$f(p_0, p_1) > f(p_0, p_2) + \Delta$ whenever $|p_0 - p_1| < \delta$, $|p_0 - p_2| > 2\delta$.

As $n \to \infty$, $R/n \to p_0$ a.s. $P_{p_0}$ from the strong law of large numbers. Thus

$f(R/n, p_1) > f(R/n, p_2) + \Delta$ whenever $|p_0 - p_1| < \delta$, $|p_0 - p_2| > 2\delta$,

for all large $n$, with probability approaching 1 as $n \to \infty$. (Note that the conditioning space may include observation sequences of length greater than $n$, but the inequality holds for all these sequences.)
$P_n\{|p - p_0| \le \delta\}\big/P_n\{|p - p_0| \ge 2\delta\}$
$= P\big[\{|p - p_0| \le \delta\}\exp(nf(R/n, p))\big]\big/P\big[\{|p - p_0| \ge 2\delta\}\exp(nf(R/n, p))\big]$
$\ge e^{n\Delta}P\{|p - p_0| < \delta\}\big/P\{|p - p_0| > 2\delta\} \to \infty$ a.s. $P_{p_0}$.

Thus $P_n\{|p - p_0| \ge 2\delta\} \to 0$ a.s. $P_{p_0}$ as required. $\square$
Remark: The same result generalizes to multinomial distributions, but not to observations carried by a countable number of points, as shown by Freedman (1963).
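The theorem can be illustrated numerically. In the following sketch (not from the text) the prior is carried by four atoms including the true value $p_0 = 0.3$, so $p_0$ is in the support of $P$, and the posterior atom at $p_0$ should approach 1; log-likelihoods are used to avoid underflow.

```python
import math, random

# Sketch: a discrete prior carried by a grid that includes the true value
# p0 = 0.3, so p0 is in the support of P.  The posterior P_n{p = p0}
# should tend to 1 under P_{p0} (Section 4.5 theorem, atom case).
# Likelihoods p^r (1-p)^(n-r) underflow for large n, so we work in logs.

def posterior_at(prior, r, n):
    loglik = {p: math.log(w) + r * math.log(p) + (n - r) * math.log(1 - p)
              for p, w in prior.items()}
    m = max(loglik.values())
    w = {p: math.exp(v - m) for p, v in loglik.items()}   # stabilized weights
    total = sum(w.values())
    return {p: v / total for p, v in w.items()}

rng = random.Random(7)
p0 = 0.3
prior = {0.1: 0.25, 0.3: 0.25, 0.5: 0.25, 0.9: 0.25}
n = 2000
r = sum(1 for _ in range(n) if rng.random() < p0)   # R = number of successes
post = posterior_at(prior, r, n)
print(post[p0])   # close to 1 for large n
```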
4.6. Exchangeable Sequences
Let $P$ be a probability on a space $\mathscr{X}$. A sequence of random variables $X_1, X_2, \dots, X_n, \dots$ defined on $\mathscr{X}$ is said to be exchangeable if $X_1, X_2, \dots, X_n$ has the same distribution as $X_{\sigma 1}, X_{\sigma 2}, \dots, X_{\sigma n}$ for each $n$ and each permutation $\sigma$.
De Finetti's Theorem. Let $X_1, X_2, \dots, X_n, \dots$ be an exchangeable sequence of 0–1 random variables on a probability space $\mathscr{X}$ with unitary probability $P$. Then $X_1, X_2, \dots, X_n, \dots$ are conditionally independent and identically distributed given the random variable $p = \lim \sum_{i=1}^{2^n}X_i/2^n$ if the limit exists, $p = 0$ if the limit fails.
PROOF. Let $\bar X_n = (1/n)\sum_{i=1}^n X_i$. Then

$P(\bar X_n - \bar X_m)^2 = [P(X_1^2) - P(X_1X_2)](m - n)/mn$ for $n \le m$,
$\sum_r P|\bar X_{2^{r+1}} - \bar X_{2^r}| < \infty$,
$P\{|\bar X_{2^M} - \bar X_{2^N}| > \varepsilon\} \le P\big\{\sum_{r\ge N}|\bar X_{2^{r+1}} - \bar X_{2^r}| > \varepsilon\big\} \to 0$ as $N, M \to \infty$.
Thus $p = \lim \sum_{i=1}^{2^n}X_i/2^n$ is defined except for a set $A$ of probability 0; define $p = 0$ on the set $A$. Define a conditional probability on $X_1, X_2, \dots, X_n, \dots$ given $p$ by $P[1|p] = 1$, $P[X_i|p] = p$, and let the $X_1, X_2, \dots, X_n, \dots$ be independent and identically distributed given $p$. The specified probability obviously satisfies the axioms of conditional probability except for possibly axiom 1 and the product axiom.

For axiom 1 we need to show that for Baire functions $h$, and for functions $g$ in the smallest probability space including $X_1, \dots, X_n$, $P[h(p)g|p] = h(p)P[g|p]$.

Take $h$ to be continuous and bounded, and note that the class of functions $g$ satisfying the identity is a limit space, so that we need only verify that $P[h(p)\prod X_i|p] = h(p)P[\prod X_i|p]$. (The probability space generated by $X_1, X_2, \dots, X_n$ is the smallest limit space including $X_{i_1}, X_{i_2}, \dots, X_{i_n}$ for each sequence $i_1, i_2, \dots, i_n$. If the axiom is satisfied for $X_1, X_2, \dots, X_n$, by symmetry it is satisfied for $X_{i_1}, X_{i_2}, \dots, X_{i_n}$.) Define $\bar X_{2^r,i} = \sum\{X_j : 2^r(i-1) < j \le 2^ri\}/[2^ri - 2^r(i-1)]$. Then $\bar X_{2^r,i} \to p$ as $r \to \infty$, whenever $\bar X_{2^r} \to p$ as $r \to \infty$. If $\bar X_{2^r} \to p$,

$P[h(p)\prod X_i|p] = P[\lim h(\bar X_{2^r})\prod X_i|p]$
$= P[\lim h(\bar X_{2^r})\prod \bar X_{2^r,i}|p]$
$= h(p)p^n$ by the law of large numbers
$= h(p)P[\prod X_i|p]$.

If the limit of $\bar X_{2^r}$ does not exist, then $p = 0$ and $P[h(p)\prod X_i|p] = 0$. The axiom is true for bounded and continuous $h$, and it is true on a limit space of functions, so it is true for Baire functions.
For the product axiom, proceeding as before requires that

$PP(\textstyle\prod X_i|p) = P(\prod X_i)$.

Now

$P(\textstyle\prod X_i) = P(\prod \bar X_{2^r,i}) = P(\lim \prod \bar X_{2^r,i}) = P(p^n) = P[P(\prod X_i|p)]$.

Thus the product axiom holds, concluding the proof. $\square$
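De Finetti's representation can be illustrated by simulation. This sketch (not from the text) generates an exchangeable 0–1 sequence by first drawing a mixing value $p$ from a uniform law, then tossing i.i.d. Bernoulli($p$), and recovers $p$ as the limiting frequency along the subsequence $2, 4, 8, \dots$:

```python
import random

# Sketch illustrating de Finetti's theorem: an exchangeable 0-1 sequence
# arises by drawing p from a mixing law (uniform here, an arbitrary choice)
# and then tossing i.i.d. Bernoulli(p).  The frequencies along the
# subsequence of lengths 2^r converge to the mixing value p, recovering the
# conditioning variable of the theorem.

rng = random.Random(42)
p = rng.random()                                   # mixing value
xs = [1 if rng.random() < p else 0 for _ in range(2 ** 17)]

freqs = [sum(xs[:2 ** r]) / 2 ** r for r in range(1, 18)]
print(p, freqs[-1])                                # freqs[-1] is close to p
```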
Notes: This theorem, proved first by de Finetti, has philosophical importance in relating Bayes and frequentist theory. In the frequentist approach, probabilities are based on a sequence of observations $X_1, X_2, \dots, X_n, \dots$ that have limiting frequency $p$. De Finetti argues that this form of evidence requires the judgement that the $X_1, X_2, \dots, X_n, \dots$ are exchangeable; then the limiting frequency $p$ exists except for a set of probability zero. Thus the frequentist probability statements may be derived from Bayes theory.
The generalization is straightforward for real valued random variables; in this case the conditioning random variables are the functions $\lim \sum_{i=1}^{2^r}\{X_i \le x\}/2^r$ for rational $x$. The "$\prod X_i$" for generating the probability space including $X_1, X_2, X_3, \dots$ are the functions $\prod_{i=1}^n\{X_i \le x_i\}$ where the $x_i$ are rational. See for example Loève (1955), p. 365.
How can the theorem be generalized to non-unitary probabilities? Consider a special case, corresponding to the prior density proportional to $1/p(1 - p)$.

Let $S$ be the set of all 0–1 sequences. Consider the smallest probability space $\mathscr{X}$ on $S$ that includes all functions $X$ depending on a finite number of elements of the sequence $s$, such that $X(s) = 0$ for $s = 0, 0, 0, \dots$ or $s = 1, 1, 1, \dots$. For example the finite sequences containing at least one 1 and one 0 will lie in this space.

Let the sequence $x_1, x_2, \dots, x_n$ have probability $(r - 1)!\,(n - r - 1)!/(n - 1)!$, where $r = \sum x_i$, for $0 < \sum x_i < n$. This assignment of probability satisfies the axioms. For example $P(01) = P(011) + P(010)$.
The function $\bar x_n = (1/n)\sum x_i$ does not lie in the probability space, but the functions $\bar x_n - \bar x_m$ do, since they give value 0 to $s \equiv 0$ or $s \equiv 1$. Since $P(\bar x_n - \bar x_m)^2 \le P(x_1 - x_2)^2|n - m|/mn$, the sequence $\bar x_{2^n}$ converges except on a set of probability zero to a function, $p$ say. Let $\mathscr{Y}$ be the probability space generated by the polynomials $p^{k_1}(1 - p)^{k_2}$ where $k_1, k_2 > 0$. Define a conditional probability on $\mathscr{X}$ given $\mathscr{Y}$ by $P[x_1, x_2, \dots, x_n \mid \mathscr{Y}] = p^{\sum x_i}(1 - p)^{n - \sum x_i}$.
This conditional probability satisfies the product axiom:

$PP[x_1, x_2, \dots, x_n \mid \mathscr{Y}] = P[p^{\sum x_i}(1 - p)^{n - \sum x_i}]$
$= P[\lim \bar x_{2^N}^{\sum x_i}(1 - \bar x_{2^N})^{n - \sum x_i}]$
$= P(x_1, x_2, \dots, x_n)$.
And different sequences $x_1, x_2, \dots, x_n$ and $x_{n+1}, x_{n+2}, \dots, x_N$ are independent given $p$.
4.7. Problems
E1. In the binomial case the prior $P$ is carried by the rationals in $(0, 1]$, with $P(r) = -\log(1 - 2^{-N})/N$, where $N$ is the first integer for which $rN$ is an integer. If $p_0$ is the true value of $p$, show that the posterior distribution of $p$ concentrates on $p_0$ as $n \to \infty$.
E2. Two parameters $U, V$ are independent uniforms. A coin is tossed giving heads with probability $|U - V|$. Find the posterior distribution of $U, V$ given $r$ heads in $n$ tosses, and specify its behavior as $n \to \infty$.
P1. In the binomial case, assume $P\{p = p_a\} = \frac{1}{2}$. Find the asymptotic behavior of $P_n$ as the true probability $p_0$ ranges over $(0, 1)$. Generalize results to $P$ carried by a finite number of points.
P2. Let $p$ be a binomial parameter, and let $\theta = \frac{1}{2}p\{0 \le p \le \frac{1}{2}\} + (\frac{1}{2} + \frac{1}{2}p)\{p > \frac{1}{2}\}$. Let $P$ be the prior distribution on $\theta$ which is uniform over $(0, \frac{1}{4}) \cup (\frac{3}{4}, 1)$. Let $P_n$ be the posterior distribution of $\theta$ after $R$ successes in $n$ trials, and consider its distribution as $R$ varies given the true value $\theta_0 = \frac{1}{4}$. Show that $P_n$'s distribution converges to a distribution over the two point distributions carried by $\{\theta = \frac{1}{4}\}$ and $\{\theta = \frac{3}{4}\}$, with $P\{\theta = \frac{1}{4}\}$ uniformly distributed between 0 and 1.
P3. In the binomial case, show that for a particular $\varepsilon > 0$, it may happen that $P(p_0 - \varepsilon, p_0 + \varepsilon) > 0$ but $P_n(p_0 - \varepsilon, p_0 + \varepsilon) \to 0$ a.s. $P_{p_0}$.
P4. In the binomial case, let the probability $P$ be defined by

$P(f) = \int_0^1 f/[p(1 - p)]\,dp$.

Show that for each $p_0$, $0 \le p_0 \le 1$, $P_n(p_0 - \varepsilon, p_0 + \varepsilon) \to 1$ a.s. $P_{p_0}$.
P5. The observation $X$ is Poisson with parameter $\lambda$, where $P\{\lambda = 1 + 1/i\} = 2^{-i}$, $1 \le i < \infty$. Given $\lambda = 1$, specify the asymptotic behavior of the posterior distribution of $\lambda$.
P6. Let $X$ be 0–1 with probability given the parameter $p$, $0 \le p \le 1$,

$P[X|p] = p + \tfrac{1}{4}\{p = \tfrac{1}{4}\} - \tfrac{1}{4}\{p = \tfrac{3}{4}\}$.

Let the prior distribution of $p$ be uniform. Show that the posterior distribution given $n$ i.i.d. observations $X_1, X_2, \dots, X_n$ is not consistent for $p = \tfrac{1}{4}$ or $p = \tfrac{3}{4}$.
P7. Let $P$ be a unitary probability on $\mathscr{X}$. Let $Y$ be a random variable, let $\mathscr{Y}$ be the probability space containing sets of form $\{Y \le a\}$, and let $\mathscr{Y}_n$ be the probability space generated by sets of form $\{Y \le k/2^n\}$, $-\infty < k < \infty$. Let

$S_{nk} = \big\{\tfrac{k}{2^n} < Y \le \tfrac{k+1}{2^n}\big\}$,
$P_n(X) = \sum_{P(S_{nk})>0} S_{nk}P(S_{nk}X)/P(S_{nk})$.

Show that $P_n$ is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}_n$. Suppose that $P_\infty$ is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}$ such that $P_\infty P_n = P_\infty$. Then $P_n(X) \to P_\infty(X)$ in $P$. (In this way, the conditional probability of $\mathscr{X}$ given the random variable $Y$ is approximated by directly computed conditional probabilities given discrete random variables $Y_n$ which approximate $Y$.)
P8. A Markov chain, with a finite number of states, has probability $p_i$ for the initial state to be state $i$, and probability $p_{ij}$ for a transition from state $i$ to state $j$. The initial state and an infinite sequence of transitions is observed. Assume that the prior distribution for $\{p_i\}$ and $\{p_{ij}\}$ is uniform over the sets $0 \le p_i \le 1$, $\sum p_i = 1$; $0 \le p_{ij} \le 1$, $\sum_j p_{ij} = 1$. Specify the limiting posterior distribution of $\{p_i\}$ and $\{p_{ij}\}$.
P9. Let $P$ be a probability on $\mathscr{X}$ such that $X \in \mathscr{X} \Rightarrow X^2 \in \mathscr{X}$. Let $\mathscr{Y}$ be a probability subspace of $\mathscr{X}$ such that $PY^2 = 0 \Rightarrow Y = 0$, and suppose that $\mathscr{Y}$ is complete with respect to the metric $\rho(Y_1, Y_2) = P(Y_1 - Y_2)^2$: if $\rho(Y_n, Y_m) \to 0$, there exists $Y \in \mathscr{Y}$ such that $\rho(Y, Y_n) \to 0$. Define $P(X|\mathscr{Y}) = Y$ if $Y \in \mathscr{Y}$ minimizes $P(X - Y)^2$. Show that $P(X|\mathscr{Y})$ is uniquely defined, and is a conditional probability on $\mathscr{X}$ to $\mathscr{Y}$. (Doob, 1953.)
P10. In the binomial case, if $P$ has an atom at $p_0$, then $P_n\{p_0\} \to 1$ a.s. $P$ if $p_0$ is true.
4.8. References
Berk, Robert H. (1970), Consistency a posteriori, Ann. of Math. Statist. 41, 894-906.
Doob, J. L. (1949), Application of the theory of martingales, Colloques Internationaux du Centre National de la Recherche Scientifique, Paris, 23-27.
Doob, J. L. (1953), Stochastic Processes. New York: John Wiley.
Freedman, David (1963), On the asymptotic behaviour of Bayes estimates in the
discrete case, Ann. of Math. Statist. 34, 1386-1403.
Loève, Michel (1955), Probability Theory. Princeton: Van Nostrand.
Schwartz, L. (1965), On Bayes procedures, Z. Wahr. 4, 10-26.
CHAPTER 5
Making Probabilities
5.0. Introduction
The essence of Bayes theory is giving probability values to bets. Methods
of generating such probabilities are what separate the various theories.
If probabilities are personal opinions, then they are determined by asking
questions or observing which of a family of bets an individual accepts. There
is a small theory for extracting personal probabilities, the elicitation of
probabilities, for example Winkler (1967). To discover a person's probabilities for the disjoint events $A_i$, for example, you offer to pay $\log x_i$ if $A_i$ occurs, where $x_i$ is the person's stated probability for the event $A_i$. If the person's "true" probabilities are $p_i$, his expected gain is $\sum p_i\log x_i$, which is maximized when $x_i = p_i$. This elicitation function is not entirely satisfactory, since he may over-estimate the probability of unlikely events to avoid large losses.
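The properness of the logarithmic elicitation rule can be verified numerically. This sketch (not from the text) compares the expected gain $\sum p_i\log x_i$ at $x = p$ with the gain at randomly chosen alternative statements $x$:

```python
import math, random

# Sketch: the logarithmic elicitation rule is proper.  If the subject's
# true probabilities are (p_1,...,p_k), the expected gain sum p_i*log(x_i)
# over stated probabilities (x_i) summing to 1 is maximized at x = p.
# We check this against random alternative statements.

def expected_gain(p, x):
    return sum(pi * math.log(xi) for pi, xi in zip(p, x))

rng = random.Random(3)
p = [0.5, 0.3, 0.2]
best = expected_gain(p, p)          # gain from reporting the truth
for _ in range(1000):
    raw = [rng.random() for _ in p]
    x = [r / sum(raw) for r in raw]  # a random stated distribution
    assert expected_gain(p, x) <= best + 1e-12
print(best)
```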
There are a number of objective methods of generating probabilities that are sophisticated versions of the principle of insufficient reason; they attempt to give probabilities that correspond to no information, in the hope that any information may be incorporated using Bayes theorem. It is necessary in every case to assume some prespecified probabilities on which to build the "indifferent" probabilities.
We have rested the tortoise four square upon the elephant; but what
does the elephant stand on? A method of basing probabilities on similarity
judgments is proposed.
5.1. Information
If $f$ is a density with respect to $\mathscr{X}$, and $PX = Q(fX)$ each $X$ in $\mathscr{X}$, write $f = dP/dQ$ and $P = Q\cdot f$.

The information of $P$ with respect to $Q$ is $I(P, Q) = Q((dP/dQ)\log(dP/dQ))$,
defined whenever $(dP/dQ)\log(dP/dQ) \in \mathscr{X}$. If $P$ is a discrete distribution over the integers, and $Q$ gives value $\log 2$ to each integer, $-I(P, Q)$ coincides with Shannon's (1948) definition of entropy, which may be interpreted as the average number of bits per observation required to send over a channel a stream of observations from $P$ encoded into binary digits. See Good (1966) for a statistical interpretation. If $P$ and $Q$ are both unitary probabilities, $I(P, Q)$ is defined by Kullback and Leibler (1951). In this case $I(P, Q) \ge 0$, and $I(P, Q) = 0$ when $P = Q$, so $I(P, Q)$ may be interpreted as a measure of distance between $P$ and $Q$.
For a given $Q$, it is sometimes useful to find the minimal information $P$ satisfying various constraints; this is put forward as the principle of maximum entropy by Jaynes (1957) and Kullback (1959), but there is no reason to think that such a probability is correct for betting; it is merely the probability closest to $Q$ in a certain way. Of course the minimal $P$ always depends on the underlying probability $Q$. See also Christensen (1981).
Theorem. If there exists a density $f = \exp(\sum_{i=1}^n\lambda_iX_i)$ such that $Q(fX_i) = a_i$, $i = 1, \dots, n$, then $P$ with $dP/dQ = f$ is uniquely of minimal information among all $P$ satisfying the constraints $P(X_i) = a_i$, $i = 1, \dots, n$.

PROOF. $g\log f \le g\log g + f - g$, with equality only if $f = g$.
If a density $g$ satisfies the constraints, $Q(gX_i) = a_i$,

$Q(g\log f) = Q[g\textstyle\sum\lambda_iX_i] = Q[f\textstyle\sum\lambda_iX_i] = Q(f\log f)$.

Thus $Q(f\log f) = Q(g\log f) \le Q(g\log g)$, with equality only if $f = g$ as $Q$. Therefore $P$ with $f = dP/dQ$ is of minimal information among all $P$ satisfying the constraints, and no other minimal $P$ exists. $\square$
Note: Since $I(P, Q)$ is convex as a function of $P$, it may be shown that the minimal information $P$ is unique if it exists.

Note: It may happen that no minimal information $P$ exists, but that there exists $P_0 = Q\cdot f_0$ such that for each $A$ with $0 < P_0(A) < \infty$, $P_AX = Q(f_0AX)/P_0(A)$ is the minimal probability $P$ under the additional constraints $PA = 1$, $P(X) = 0$ if $XA = 0$. The $P_A$'s are conditional probabilities given $A$ corresponding to $P_0$, and $P_0$ may reasonably be taken to be the minimal $P$. For example, if $Q$ is uniform on the integers, the optimal $P$ under the constraints that all but a finite number of probabilities are zero and $P1 = 1$ has all non-zero probabilities equal. Thus the overall minimal $P$ should be taken to be uniform on the integers.
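The theorem can be illustrated with the classic die example, an assumption of this sketch rather than an example from the text: among distributions on $\{1, \dots, 6\}$ with mean 4.5, the minimal information $P$ relative to uniform $Q$ has density $f = \exp(\lambda x)$, with $\lambda$ fixed by the moment constraint; bisection suffices to find it.

```python
import math

# Sketch of the theorem: among distributions on {1,...,6} with mean 4.5,
# the minimal-information P relative to uniform Q has exponential-family
# density f = exp(lambda * x) (normalized).  The mean is increasing in
# lambda, so bisection on the moment constraint finds lambda.

def mean_for(lam):
    w = [math.exp(lam * k) for k in range(1, 7)]
    z = sum(w)
    return sum(k * wk for k, wk in zip(range(1, 7), w)) / z

lo, hi = 0.0, 5.0              # mean_for(0) = 3.5 < 4.5 < mean_for(5)
for _ in range(80):
    mid = (lo + hi) / 2
    if mean_for(mid) < 4.5:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
probs = [math.exp(lam * k) for k in range(1, 7)]
probs = [p / sum(probs) for p in probs]
print(lam, probs)              # exponentially tilted toward the high faces
```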
5.2. Maximal Learning Probabilities
Let $P$ be a probability on $\mathscr{X}$, and let $X$, $Y$ and $X \times Y$ be random variables defined on $\mathscr{X}$. Suppose that a quotient probability $P^Y_X$ exists, satisfying

$P^{X,Y} = P^XP^Y_X$.

If $P^{X,Y}$ has density $f$ with respect to $P^X \times P^Y$ (that is,

$P^{X,Y}W = P^XP^Y[fW]$ for $W \in \mathscr{X} \times \mathscr{Y}$),

define the information between $X$ and $Y$ to be $P^{X,Y}[\log f] = P^{X,Y}[\log(dP^{X,Y}/d(P^X \times P^Y))]$. This measure is zero if $X$ and $Y$ are independent ($P^Y_X = P^Y$), and it is non-negative if $P$ is unitary. Define the conditional information of $Y$ given $X$ by $I(Y|X) = I(P^Y_X, P^Y)$; then $P^X[I(Y|X)]$ is the information between $X$ and $Y$. See Lindley (1956).
If $P^X$ is $\sigma$-finite, and $P^Y_X$ has density $h$ with respect to a probability $Q^Y$, $h = dP^Y_X/dQ^Y$, then $P^Y$ has density $P^Xh$ with respect to $Q^Y$, and

$P[\log(dP^{X,Y}/d(P^X \times P^Y))] = P^X[I(P^Y_X, Q^Y)] - I[P^Y, Q^Y]$,

which may be interpreted as the probable change of information about $Y$ with respect to $Q$, due to learning $X$.

If $P^Y_X$ is known, $P^{X,Y}$ is determined by $P^X$; Good (1969), Zellner (1977) and Bernardo (1979) have suggested determining $P^X$ to be a "maximal learning probability," maximizing $P^{X,Y}\log(dP^{X,Y}/d(P^X \times P^Y))$; thus learning $Y$ will maximally increase the information about $X$.
Theorem. Let $P^{X,Y}$ be unitary, with marginal probabilities $P^X$, $P^Y$ and quotient probability $P^Y_X$. For $P^Y_X$ fixed, if there exists $P^X$ with

(1) $P^X\{I(P^Y_X, P^Y) = c\} = 1$,
(2) $I(P^Y_X, P^Y) \le c$ for all values of $X$,

then the maximal information between $X$ and $Y$ is $c$, and $Q^X$ produces maximal information if and only if it satisfies (1), (2) and $P^Y = Q^Y$.

PROOF.

$Q^X[I(P^Y_X, Q^Y) - I(P^Y_X, P^Y)] = Q^XP^Y_X[\log(dP^Y/dQ^Y)]$
$= Q^Y\log(dP^Y/dQ^Y) \le 0$, with equality only if $P^Y = Q^Y$.

Thus $Q^XI[P^Y_X, Q^Y] \le c = P^X[I(P^Y_X, P^Y)]$, with equality only if $P^Y = Q^Y$ and (1) and (2) are satisfied by $Q$. $\square$
EXAMPLE. Let $Y = 0$ or 1, let $X$ be a binomial parameter $0 \le X \le 1$, so that $P^Y_XY = X$, $PY = PX$.

$I[P^Y_X, P^Y] = (1 - X)\log\dfrac{1 - X}{1 - PX} + X\log\dfrac{X}{PX}$.

If $P^X\{0\} = P^X\{1\} = 1/2$, then $I(P^Y_X, P^Y) = \ln 2$ at $X = 0, 1$, and $I[P^Y_X, P^Y] = (1 - X)\log(1 - X) + X\log X + \ln 2 \le \ln 2$ for all $X$. Thus $P^X$ is a maximal learning probability, and it is unique.

For two binomial observations, the maximal learning probability is $P^X\{0\} = 15/34$, $P^X\{1/2\} = 4/34$, $P^X\{1\} = 15/34$; empirical evidence suggests that the maximal learning probability is carried by $(n + 1)$ atoms for $n$ observations and converges weakly to the Jeffreys distribution with $\sin^{-1}\sqrt{X}$ uniform.
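The one-observation example can be checked numerically. This sketch evaluates $I(P^Y_X, P^Y)$ for the prior placing mass $1/2$ at $X = 0$ and $X = 1$ (so $PY = 1/2$), confirming the value $\ln 2$ at the atoms and the bound $\ln 2$ elsewhere:

```python
import math

# Sketch checking the example: with P^X{0} = P^X{1} = 1/2, PY = 1/2, the
# conditional information
#   I(X) = (1-X) log((1-X)/(1-PY)) + X log(X/PY)
# equals log 2 at X = 0, 1 and is <= log 2 for all X in [0, 1].

def info(x, py=0.5):
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)   # 0 log 0 = 0
    return term(1 - x, 1 - py) + term(x, py)

assert abs(info(0.0) - math.log(2)) < 1e-12
assert abs(info(1.0) - math.log(2)) < 1e-12
assert all(info(k / 100) <= math.log(2) + 1e-12 for k in range(101))
print("maximal information:", math.log(2))
```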
5.3. Invariance
Let $X$ be a random variable from $U, \mathscr{U}$ to $S, \mathscr{S}$. Let $\sigma$ be a 1-1 transformation of $S$ onto itself such that $f\sigma \in \mathscr{S}$ each $f \in \mathscr{S}$. A probability $P^X$ is relatively invariant under $\sigma$, or $\sigma$-invariant, if

$P^X = kP^{\sigma X}$ some $k$, that is, $Pf(X) = kPf(\sigma X)$, each $f$ in $\mathscr{S}$.

(If $P^X$ is unitary, $k = 1$; note that $P^{\sigma X}f = P^Xf\sigma$ for any $\sigma$.)

For example, let $X = X_1, X_2, \dots, X_{10}$ denote the results, in heads or tails, of tossing a penny ten times; and suppose we have no reason to differentiate between the tosses: each order is equally likely. Let $\sigma$ be a permutation of 10 numbers; $\sigma X = X_{\sigma 1}, X_{\sigma 2}, \dots, X_{\sigma 10}$. Then $X$ and $\sigma X$ have the same distribution.
A quotient probability $P^X_Y$ is relatively invariant under $\sigma, \tau$ if $\sigma$ is a 1-1 transformation of $S, \mathscr{S}$, $\tau$ is a 1-1 transformation of $T, \mathscr{Y}$, and

$P^{\sigma X}_{\tau Y} = kP^X_Y$ some $k$.

(Note that $P^X_{\tau Y}f = (P^X_Yf)\tau^{-1}$ for any 1-1 $\tau$.) If $P^Y$ is relatively invariant under $\tau$, and $P^X_Y$ is relatively invariant under $\sigma, \tau$, it follows from the product rule that $P^{X,Y}$ is relatively invariant under $\sigma, \tau$; that is, $P^{\sigma X, \tau Y} = kP^{X,Y}$. Conversely, if $P^Y$ is relatively invariant under $\tau$ and $P^{X,Y}$ is relatively invariant under $\sigma, \tau$, and if $P^X_Y$ is defined, then $P^X_Y$ is relatively invariant under $\sigma, \tau$ a.s. $P^Y$. More precisely, for each $f \in \mathscr{S}$, $g \in \mathscr{Y}$, $P^Y(|P^X_Yf - kP^{\sigma X}_{\tau Y}f|g) = 0$.
Invariances are used to generate prior probabilities as follows. Suppose that $X$ is an observation, $Y$ is an unknown parameter. A model specifies the quotient probability $P^X_Y$, and transformations $\sigma$ and $\tau$ are found such that $P^X_Y$ is $\sigma, \tau$-invariant. It is now assumed that the same invariance applies to the posterior distribution of $Y$ given $X$; this will occur if the prior distribution is $\tau$-invariant. Thus each invariance found on the model induces a constraint on the prior; conceivably, we might find so many invariances on the model that no prior satisfies them all! Here, we are arguing by analogy that observing $\sigma X$ given $\tau Y$ is similar to observing $X$ given $Y$, since the probability models are the same; therefore conclusions about $\tau Y$ given $\sigma X$ should correspond to conclusions about $Y$ given $X$. See Hartigan (1964), Stone (1970), and also Fraser (1968) for a non-Bayesian theory of inference using invariances.
EXAMPLE. Let $X$ and $Y$ be real valued random variables with $(P^X_Yf)(t) = \int f(s)h(s - t)\,ds$ for some density $h$. Equivalently, $X - Y$ has density $h$ given $Y$ with respect to Lebesgue measure. Let $\sigma X = X + c$, $\tau Y = Y + c$.

Then

$P^{\sigma X}_{\tau Y}f = P^X_{\tau Y}f\sigma$ for any $\sigma, \tau$,
$P^{\sigma X}_{\tau Y}f(t) = P^X_Y(f\sigma)(t - c) = \int f(s + c)h(s - t + c)\,ds = P^X_Yf(t)$.

Thus $\sigma, \tau$ is an invariant transformation for $P^X_Y$, and so $P^Y$ is required to be invariant under $\tau$. Thus $P^Y$ has density $\rho$, $\rho(t) = e^{\lambda t}$, with respect to Lebesgue measure. The posterior density for $P^Y_X$ is $e^{\lambda t}h(s - t)/\int e^{\lambda t}h(s - t)\,dt$ with respect to Lebesgue measure, and $P^Y_X$ is $\tau, \sigma$-invariant.
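The invariance in this example can be seen numerically. This sketch assumes $\lambda = 0$, the flat member of the invariant family, and a double-exponential error density $h$ chosen only for illustration; it checks that shifting the observation by $c$ shifts the posterior of the location parameter by $c$:

```python
import math

# Sketch (assumptions: lambda = 0, i.e. the flat member of the invariant
# family e^{lambda t}; h is a double-exponential density, an arbitrary
# illustrative choice).  For a location family the posterior density of t
# given X = s under the flat prior is h(s - t) / integral h(s - t) dt, so
# shifting the observation by c shifts the posterior by c.

def posterior(s, h, grid):
    w = [h(s - t) for t in grid]
    z = sum(w)
    return [wi / z for wi in w]

h = lambda u: math.exp(-abs(u))              # double-exponential error density
grid = [i * 0.01 for i in range(-1000, 1001)]  # t from -10 to 10, step 0.01
p0 = posterior(0.0, h, grid)
p1 = posterior(1.0, h, grid)                 # observation shifted by c = 1
# shifting by c = 1 moves the posterior 100 grid steps to the right
print(p0[500], p1[600])                      # equal up to tiny edge effects
```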
5.4. The Jeffreys Density
Let a distance $\rho$ be a non-negative function on $S \times S$, and assume $\{s : \rho(t, s) \le \varepsilon\}$ lies in $\mathscr{X}$ on $S$. A local $\rho$-probability on $\mathscr{X}$ is a probability $P$ such that

$\lim_{\varepsilon\downarrow 0} P\{\rho(t_1, s) \le \varepsilon\}/P\{\rho(t_2, s) \le \varepsilon\} = 1$ for each $t_1, t_2$ in $S$.

Such a probability gives approximately equal value to small spheres.
Jeffreys (1946) considered a number of measures of distance between two probabilities. Write $f = dP/dQ$ if $P$ has density $f$ with respect to $Q$.

Information distance: $I(P, Q) = P\big(\log\frac{dP}{dQ}\big)$.

Root or Hellinger distance: $r(P, Q) = Q\big[\big(\frac{dP}{dQ}\big)^{1/2} - 1\big]^2$.

Absolute distance: $a(P, Q) = Q\big|\frac{dP}{dQ} - 1\big| = \sup_{|X|\le 1}(PX - QX)$.

Since $(u^{1/2} - 1)^2 \le |u - 1|$, $r(P, Q) \le a(P, Q)$. Using Schwartz's inequality, for $P$ and $Q$ unitary, $a(P, Q) \le 2r(P, Q)^{1/2}$. See Pitman (1979) for applications of $r(P, Q)$ to non-Bayesian inference.
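The two inequalities relating the distances can be spot-checked for unitary probabilities on a finite set. This sketch (not from the text) draws random $P$ and $Q$ and verifies $r(P, Q) \le a(P, Q) \le 2r(P, Q)^{1/2}$:

```python
import math, random

# Sketch verifying the stated inequalities for unitary P and Q on a finite
# set: r(P,Q) <= a(P,Q) <= 2*sqrt(r(P,Q)), with Q as dominating measure.

rng = random.Random(11)

def rand_prob(k):
    w = [rng.random() for _ in range(k)]
    return [x / sum(w) for x in w]

def distances(P, Q):
    r = sum(q * (math.sqrt(p / q) - 1) ** 2 for p, q in zip(P, Q))  # Hellinger
    a = sum(abs(p - q) for p, q in zip(P, Q))                       # absolute
    return r, a

for _ in range(200):
    r, a = distances(rand_prob(5), rand_prob(5))
    assert r <= a + 1e-12
    assert a <= 2 * math.sqrt(r) + 1e-12
print("inequalities hold")
```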
Lemma. Let $\{P_n\}$ and $P$ be unitary. Then $r(P_n, P) \to 0$ as $n \to \infty$ if and only if $dP_n/dP \to 1$ in $P$.

PROOF. Let $f_n = dP_n/dP$.
Then $r(P_n, P) \to 0 \Rightarrow P[f_n^{1/2} - 1]^2 \to 0 \Rightarrow f_n^{1/2} \to 1$ in $P \Rightarrow f_n \to 1$ in $P$.
Conversely,

$f_n \to 1$ in $P \Rightarrow f_n^{1/2} \to 1$ in $P \Rightarrow |f_n^{1/2} - 1|\{|f_n^{1/2} - 1| < \varepsilon\} \to 0$ in $P$
$\Rightarrow P|f_n^{1/2} - 1|\{|f_n^{1/2} - 1| < \varepsilon\} \to 0$
$\Rightarrow Pf_n^{1/2}\{|f_n^{1/2} - 1| < \varepsilon\} \to 1$ since $P\{|f_n^{1/2} - 1| < \varepsilon\} \to 1$,
$P(f_n^{1/2} - 1)^2 = 2 - 2Pf_n^{1/2} \le 2 - 2Pf_n^{1/2}\{|f_n^{1/2} - 1| < \varepsilon\} \to 0$. $\square$
Theorem. Let $\mathscr{P}$ be a family of unitary probabilities $P_t$ on $\mathscr{X}$, indexed by $T$, a compact subset of $R^p$ such that the interior of $T$ is dense in $T$. Assume

(1) $P_t$ has density $f_t$ with respect to some probability $\mu$ on $\mathscr{X}$.
(2) $f_t = f_s$ as $\mu \Rightarrow s = t$.
(3) For all $t$, there exists a vector derivative $(\partial/\partial t)f_t^{1/2}$ such that $h(s, t) = [f_s^{1/2} - f_t^{1/2} - (s - t)'(\partial/\partial t)f_t^{1/2}]/|s - t|$ satisfies $\mu h^2(s, t) \to 0$ as $|s - t| \to 0$, where $|s - t|$ is euclidean distance.
(4) For $|s - t| < \delta_t$, some $\delta_t > 0$, $h(s, t) < Z_t$ where $\mu(Z_t^2) < \infty$.

The probability $J$ that has density with respect to Lebesgue measure on $T$ equal to the determinant $j(t) = |\mu[(\partial/\partial t)f_t^{1/2}((\partial/\partial t)f_t^{1/2})']|^{1/2}$ is a local $\rho$-probability, where $\rho$ is the Hellinger distance, provided $j(t)$ is continuous and non-zero in $T$.
PROOF. Fix $t$; $r(P_s, P_t) = \mu(f_s^{1/2} - f_t^{1/2})^2$.

As $r(P_s, P_t) \to 0$, $f_s/f_t \to 1$ in $P_t$; let $s_0$ be a limit point of the sequence of $s$ values (by compactness, $s_0$ exists). From (3), $f_s \to f_{s_0}$. Therefore $f_{s_0}/f_t = 1$ in $P_t$, which implies $s_0 = t$ from (2).

Thus $r(P_s, P_t) \to 0$ if and only if $s \to t$, so that the set of $s$ values with $r(P_s, P_t) < \varepsilon$ may be found in a neighbourhood of $t$.

From (3), $r(P_s, P_t) = (s - t)'\,\mu[(\partial/\partial t)f_t^{1/2}((\partial/\partial t)f_t^{1/2})']\,(s - t) + o(|s - t|^2)$. The sphere $r(P_s, P_t) \le \varepsilon$ corresponds to an approximate ellipsoid in $T$, $(s - t)'I_t(s - t) \le \varepsilon$, of volume $K\varepsilon^{p/2}|I_t|^{-1/2}$ where $j(t) = |I_t|^{1/2}$. The probability of $r(P_s, P_t) \le \varepsilon$, with the specified density $j(t)$, is $K\varepsilon^{p/2}[1 + o(1)]$ since $j$ is continuous and positive. The density $j(t)$ thus generates a local $\rho$-probability. $\square$
The Jeffreys density was put forward simultaneously by Jeffreys (1946) and by Perks (1947). Perks considered confidence regions for $t$ which may be constructed from a sequence of independent observations each distributed as $P_t$. The confidence region in the neighbourhood of $t$ has volume asymptotically proportional to $j_t^{-1}$, under certain regularity conditions similar to those in the theorem. Thus if $t_0$ is the true value of $t$, we will have a confidence region closely concentrated near $t_0$ if $j_t$ is large; Perks places density $j_t$ on $t$ to represent this expectation. A more explicit confidence justification is given by Welch and Peers (1966), for the case where $T$ is the real line: after $n$ observations from $\mathscr{X}$, the conditional probability of $t$ given $n$ observations is $P_n$; choose $t_{n,\alpha}$ so that $P_n(t < t_{n,\alpha}) = \alpha$; under regularity conditions the confidence size of the interval estimate $\{t < t_{n,\alpha}\}$ is $P_t(t < t_{n,\alpha}) = \alpha + O(n^{-1/2})$, but for Jeffreys' density $P_t(t < t_{n,\alpha}) = \alpha + O(n^{-1})$. Thus the Jeffreys density gives Bayes one-sided intervals which are more nearly confidence intervals than the intervals for any other prior. It should be noted that the same justification does not hold for two-sided intervals, Hartigan (1966).
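For a single Bernoulli observation the Jeffreys density can be computed directly. In this sketch (not from the text), $\mu$ is counting measure on $\{0, 1\}$, $f_t(1) = t$, $f_t(0) = 1 - t$, and $j(t)$ reduces to a constant multiple of $[t(1-t)]^{-1/2}$, the Beta(1/2, 1/2) density under which $\sin^{-1}\sqrt{t}$ is uniform:

```python
import math

# Sketch: for one Bernoulli observation with parameter t, the Jeffreys
# density j(t) = |mu[((d/dt) f^{1/2})((d/dt) f^{1/2})']|^{1/2} is computed
# with mu = counting measure on {0, 1}, f_t(1) = t, f_t(0) = 1 - t.
# It reduces to (1/2) * [t(1-t)]^{-1/2}, i.e. the Beta(1/2, 1/2) shape.

def j(t):
    d1 = 0.5 / math.sqrt(t)             # d/dt sqrt(f_t(1)) = d/dt sqrt(t)
    d0 = -0.5 / math.sqrt(1 - t)        # d/dt sqrt(f_t(0)) = d/dt sqrt(1-t)
    return math.sqrt(d1 * d1 + d0 * d0)

for t in (0.1, 0.3, 0.5, 0.9):
    ratio = j(t) * math.sqrt(t * (1 - t))
    print(t, ratio)                     # the ratio is the constant 1/2
```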
The Jeffreys density is also obtained from maximum learning probabilities (Bernardo (1979)): suppose that $n$ independent observations with probability $P_t$ generate the space $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$, and let $\mathscr{Y}$ be the Baire functions on $T$. As $n \to \infty$, the information between $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$ and $\mathscr{Y}$, denoted by $I_n$, satisfies, under regularity conditions,

$I_n - \tfrac{1}{2}\log n \to -I(P^{\mathscr{Y}}, J) + K$.

Thus the maximal learning probability for the asymptotic information between $\mathscr{X}_1 \times \cdots \times \mathscr{X}_n$ and $\mathscr{Y}$ is the Jeffreys probability $J$.
The Jeffreys probability is induced on the indexing set $T$ by the family of probabilities $\mathscr{P} = \{P_t, t \in T\}$; it will provide the same probability on $\mathscr{P}$ regardless of the particular set $T$ used to index $\mathscr{P}$. The topology on $T$ is induced by the Hellinger distance on $\mathscr{P}$. The probability on $T$ is unchanged if Jeffreys' probability is computed using a number of observations from $\mathscr{X}$ rather than a single observation. These properties are also possessed by the family of densities $p_a(t)$:

$\dfrac{d}{dt}\log p_a(t) = \Big\{P\Big(\dfrac{\partial}{\partial t}\log f\cdot\dfrac{\partial^2}{\partial t^2}\log f\Big) + aP\Big(\dfrac{\partial}{\partial t}\log f\Big)^3\Big\}\Big/P\Big(\dfrac{\partial}{\partial t}\log f\Big)^2$
for $t$ one dimensional, Hartigan (1965). This family gives the Jeffreys probability when $a = 1/2$, and often generates commonly accepted prior densities with suitable choice of $a$. Perhaps there is an interpretation in differential geometry.
If a subset of $\mathscr{P}$ is considered, $\mathscr{P}' = \{P_t, t \in T'\}$, where $T'$ is a compact set in $R^k$ whose interior is dense in $T'$, then the Jeffreys probability on $T'$ may be obtained by conditioning the probability on $T$ to $T'$, provided $J(T') > 0$. If however the indexing set $T$ is partitioned into a family $\{T_\alpha\}$ of indexing sets of lower dimensionality, the Jeffreys probabilities on each of the $T_\alpha$ might not be conditional probabilities from Jeffreys' probability on $T$; the Jeffreys construction is not consistent with the combination of conditional probabilities.
5.5. Similarity Probability
Let $X$, $Y$ and $X \times Y$ be random variables from some probability space into $(S, \mathscr{X})$, $(T, \mathscr{Y})$ and $(S \times T, \mathscr{X} \times \mathscr{Y})$. If $P^{X,Y}$ has density $l$ with respect to $P^X \times P^Y$ (that is, $P^{X,Y}W = P^XP^Y[lW]$ for $W$ in $\mathscr{X} \times \mathscr{Y}$), call $l$, a real valued function on $S \times T$, the likeness or similarity between $S$ and $T$.

The random variable $Y$ describes a number of possible outcomes in the past, the random variable $X$ describes outcomes in the future, and $l$ specifies similarities between pairs of these outcomes. We propose that $l$ be specified subjectively to correspond to perceived similarities, and that the probabilities $P^{X,Y}$, $P^X$ and $P^Y$ be determined from $l$ (Hartigan, 1971). In the notation of 5.2, $l = dP^X_Y/dP^X$, the density of the quotient probability $P^X_Y$ relative to the marginal probability $P^X$. If $X$ and $Y$ are both discrete,

$l(x, y) = P(X = x, Y = y)/P(X = x)P(Y = y)$.

If $P^X$ and $P^Y$ have densities with respect to some probabilities $Q^X$ and $Q^Y$,

$l(x, y) = p^{X,Y}(x, y)/p^X(x)p^Y(y)$,

where $p^{X,Y}$, $p^X$ and $p^Y$ are densities of $P^{X,Y}$, $P^X$ and $P^Y$.
EXAMPLE 1: Selecting from a deck of cards. A deck of cards is composed of 52 cardboard rectangles of apparently identical dimensions, one side of the rectangles being distinguished by different markings, the other side marked the same for all cards. The deck is shuffled with the uniform side showing. What is the probability that the top card is the ace of spades?

Let $X$ denote the top card. Let $Y$ denote past knowledge about this deck of cards, observations of the shuffling process, and any other information. For $x$ one of the expected 52 cards, and for $y$ past knowledge that does not refer to a particular card, take $l(x, y)$ to be constant. Thus

$P(X = x, Y = y) = cP(X = x)P(Y = y)$
$P(X = x \mid Y = y) = cP(X = x)$.

Now consider the event that the top card is either $x$, one of the 52, or a card with a picture of a rabbit (the children have been at the cards again). Call this event $\{X = x \text{ or } R\}$. Then

$P[X = x \text{ or } R \mid Y = y] = c'P[X = x \text{ or } R]$, with $c' \ne c$,
$P[X = R \mid Y = y] = c'P[X = R] + (c' - c)P[X = x]$.

Thus $P(X = x \mid Y = y)$ and $P(X = x)$ are the same for all $x$. I do not feel happy about the rabbit, but some event of different similarity is necessary to show that all probabilities are equal. Note that $P[X = x \mid Y = y]$ is the same for all $x$ only if the knowledge $y$ contains nothing to distinguish the cards.
This looks like the principle of insufficient reason, but it is not subject to
partitioning paradoxes. Consider for example the rotatable cards: 21 cards
that look different when rotated through 180°. Distinguish between the
two versions of these cards when selecting the top card, so that there are 73
possible results. There are now a number of different similarities.
    l(x, y): rotatable x to a typical y
    l(z, y): non-rotatable z to a typical y
    l(x or R, y): rotatable x or rabbit to a typical y
    l(z or R, y): non-rotatable z or rabbit to a typical y
    l(x′ or x, y): either version of a rotatable x to a typical y.
5. Making Probabilities
Assume that l(x′ or x, y) = l(z, y) but that the other three similarities are
different from each other and from l(z, y). Then P(x | y) = P(x′ | y) and
P(x or x′ | y) = P(z | y), so that the probabilities are 1/52 for non-rotatable
cards and 1/104 for rotatable cards.
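A quick numerical check, not in the text, that this allocation is a proper distribution over the 73 possible results:

```python
from fractions import Fraction

# Sketch: 31 non-rotatable cards at 1/52 each, and 2 versions of each of
# the 21 rotatable cards at 1/104 each (73 outcomes in all) sum to 1.

non_rotatable = 52 - 21         # 31 cards, probability 1/52 each
rotatable_versions = 2 * 21     # 42 outcomes, probability 1/104 each
assert non_rotatable + rotatable_versions == 73

total = non_rotatable * Fraction(1, 52) + rotatable_versions * Fraction(1, 104)
assert total == 1
print(total)  # 1
```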
EXAMPLE 2: Uniform on the integers. It is not possible to present realistic
examples of infinite sample spaces in a bounded universe, but such sample
spaces have proved to be mathematically convenient. Who would give up
Poisson and normal distributions?
Let X be a random variable taking integer values.
Let Y be past knowledge.
Suppose y is such that no integer for X is preferred, and take l(x, y) to be
the same for all x. Again we need some outside event E such that l(x or E, y)
is the same for all x, and different from l(x, y). Then P[X = x | Y] is the same
for all x, and the distribution on the integers is uniform. The uniform
distribution on the line may be handled similarly; it will require that
l({x₁ ≤ X ≤ x₂}, y) depend only on the length of the interval {x₁ ≤ X ≤ x₂}.
Nothing much is happening, just the transfer of equal similarity perceptions
to equal probability statements.
EXAMPLE 3: A sequence of coin tosses. Let X₁, X₂, X₃, … denote the
sequence of heads or tails in tossing a coin. Let Y be past knowledge about
this and other coins and other things. If x = x₁, x₂, …, xₙ is a particular
sequence of n tosses, let x′ = x_{u1}, x_{u2}, …, x_{un} denote a permutation of x.
Suppose l(x, y) = l(x′, y) for all permutations x′, and suppose

    l(x₁, y) ≠ l(x₁ or x₂, y) for x₂ not a permutation of x₁,
    l(x₁ or x₂, y) = l(x′₁ or x′₂, y).

Then P(x | y) = P(x′ | y) and the sequence X₁, X₂, … is exchangeable.
The probability distribution for X₁, X₂, … is then independent Bernoulli
given P = lim(Σ Xᵢ/n), which exists almost surely. Frequency theory assumes
no more than this; thus the small probability assumptions of frequency
theory may be derived from equal similarities of permuted sequences to
given knowledge. To have a full probability model for a sequence of coin
tosses, it is necessary to specify in addition the prior distribution of P, that
is to specify the similarities of various P values to given knowledge.
Assume that only a finite number of P values are possible, say p₁, p₂, …, p_N.
Then

    P(pᵢ | y)/P(pⱼ | y) = [1 − l(pᵢ or pⱼ, y)/l(pⱼ, y)] / [l(pᵢ or pⱼ, y)/l(pᵢ, y) − 1].

More generally,

    P_y(P ∈ A)/l(P ∈ A, y) = P_y[{P ∈ A}/l(P, y)].
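The displayed odds formula can be verified numerically. The sketch below is not from the text and uses assumed prior and posterior values; with l(A, y) = P(A | y)/P(A), the formula recovers the posterior odds exactly.

```python
# Sketch (hypothetical numbers): verify
#   P(p1|y)/P(p2|y) = [1 - l(p1 or p2, y)/l(p2, y)]
#                     / [l(p1 or p2, y)/l(p1, y) - 1]
# where l(A, y) = P(A|y)/P(A), against a direct computation.

prior = {"p1": 0.2, "p2": 0.3}   # assumed prior probabilities (rest elsewhere)
post  = {"p1": 0.4, "p2": 0.2}   # assumed posterior probabilities given y

l1  = post["p1"] / prior["p1"]                                  # l(p1, y)
l2  = post["p2"] / prior["p2"]                                  # l(p2, y)
l12 = (post["p1"] + post["p2"]) / (prior["p1"] + prior["p2"])   # l(p1 or p2, y)

odds_from_l = (1 - l12 / l2) / (l12 / l1 - 1)
odds_direct = post["p1"] / post["p2"]
assert abs(odds_from_l - odds_direct) < 1e-9
print(odds_from_l)  # ≈ 2.0
```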
If the distribution of P given y has a density f_y(p) with respect to Lebesgue
measure, differentiation of this formula gives

    f_y(p₀)/l(P ≤ p₀, y) + (∫₀^{p₀} f_y(p)dp)(d/dp₀)[1/l(P ≤ p₀, y)] = f_y(p₀)/l(p₀, y).

For example, if

    l(P ≤ p₀, y) = p₀   for 0 ≤ p₀ ≤ 1/2,
    l(p₀, y) = 2p₀      for 0 ≤ p₀ ≤ 1/2,

then

    f_y(p₀) = cp₀   for 0 ≤ p₀ ≤ 1/2.

We might, by symmetrical similarity judgments, require that f_y(p) be
symmetrical about p = 1/2.
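The example's density f_y(p₀) = cp₀ can be checked against the differentiated formula directly. A sketch, not from the text; the function names are ours, and the derivative is taken numerically:

```python
# Sketch: check that f_y(p0) = c*p0 satisfies
#   f_y(p0)/l(P<=p0,y) + (int_0^{p0} f_y) * d/dp0 [1/l(P<=p0,y)]
#       = f_y(p0)/l(p0,y)
# when l(P <= p0, y) = p0 and l(p0, y) = 2*p0.

c = 3.0  # arbitrary constant

def f(p):      return c * p              # candidate density f_y
def F(p):      return c * p * p / 2.0    # int_0^p f_y
def l_set(p):  return p                  # l(P <= p0, y)
def l_pt(p):   return 2.0 * p            # l(p0, y)

h = 1e-6
for p0 in (0.1, 0.25, 0.4):
    d_inv_l = (1.0 / l_set(p0 + h) - 1.0 / l_set(p0 - h)) / (2 * h)
    lhs = f(p0) / l_set(p0) + F(p0) * d_inv_l
    rhs = f(p0) / l_pt(p0)
    assert abs(lhs - rhs) < 1e-4
print("formula satisfied")
```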
I might be charged with replacing mystical methods of determining
priors by mystical methods of specifying similarities. I am not proposing
formal methods of determining similarities. They are subjective judgments
relating expected events to past knowledge; they may come only as comparative
judgments (this P value is more similar than that); even such comparative
judgments may usefully constrain the conditional distribution given y.
5.6. Problems
E1. Find the minimum information probability with respect to Lebesgue measure
on the plane, with means, variances and covariance fixed.
E2. Let X > 0 have minimum information with respect to Lebesgue measure subject
to P(X) = 1, and let Y > 0 have minimum information with respect to Lebesgue
measure subject to P(Y²) = 1. Show that X and Y² have different distributions,
illustrating lack of invariance of minimum information probabilities.
E3. Let X and Y be two integer variables with fixed marginal distributions. Find the
minimum information joint distribution of X and Y with respect to uniform
probability on pairs of integers (i, j), −∞ < i < ∞, −∞ < j < ∞.
P1. A person ranks N candy bars, which have delectability coefficients d₁, d₂, …, d_N
such that the ith bar is preferred to the jth bar with probability dᵢ/(dᵢ + dⱼ). Find
the minimum information probability for the complete ranking of the candy
bars, with the probabilities that i is ranked above j as given, with respect to uniform
probability over permutations of the candy bars.
P2. Let X₁, …, Xₙ be n discrete variables with known pairwise distributions. Find
the minimum information probability for the joint distribution of X₁, …, Xₙ with
respect to counting measure on atoms.
Q1. Show that a minimal information P with respect to Q may exist, satisfying n
constraints PXᵢ = aᵢ, but not satisfying

    dP/dQ = exp(λ₀ + Σᵢ₌₁ⁿ λᵢXᵢ)

for any λ₀, λᵢ.
P3. Let 𝒴, 𝒳₁, …, 𝒳ₙ, … be probability subspaces of 𝒳 such that 𝒳ₙ ↑. Assume that
a conditional probability Pₙ: 𝒳 → 𝒳ₙ exists for each n with PₙPₙ₊₁ = Pₙ. Assume
I(𝒴 | 𝒳ₙ) ∈ 𝒳ₙ. Then I(𝒴 | 𝒳ₙ) is a sub-martingale, that is

    Pₙ₋₁[I(𝒴 | 𝒳ₙ)] ≥ I[𝒴 | 𝒳ₙ₋₁].

(Roughly translated, we expect to learn something by knowing 𝒳ₙ.)
P4. Let P^X_μ be a normal distribution on 𝒳 with mean μ, variance 1. Find the invariant
transformations (σ, τ) for the quotient probabilities P^X_μ, and show that the only
probability on μ which is τ-invariant for every τ is Lebesgue measure.

P5. Let P^X_V be a normal distribution on 𝒳 with mean 0, variance V. Find the invariant
transformations (σ, τ) for P^X_V, and find measures on V which are τ-invariant
(more than one!).

P6. Let P^X_V be a bivariate normal distribution on 𝒳, with means 0 and covariance
matrix V. Find the invariant transformations (σ, τ) for P^X_V, and find measures on
V which are τ-invariant.
E4. In the binomial case, find that function of p which is uniformly distributed
according to the Jeffreys density.
E5. Compute the Jeffreys density in the bivariate normal case with unknown means
and covariances.
P7. A contingency table has 1000 cells with cell probabilities p₁, …, p₁₀₀₀, with
Σᵢ₌₁¹⁰⁰⁰ pᵢ = 1. Show that the Jeffreys density implies that the number of cells with
pᵢ > 1/1000 is approximately N(160, 134). Suppose that examination of the
contingency table produced 12 empirical frequencies greater than 1/1000. Would
you use the Jeffreys density for constructing estimates of pᵢ?
P8. An observation comes from the normal mixture, N(θ₁, 1) with probability 1/2 and
N(θ₂, 1) with probability 1/2. Find the Jeffreys probability for (θ₁, θ₂).
P9. Observe the toss of a coin, with unknown success probability p, until r successes
appear. Find the Jeffreys density for p. Now observe n tosses of the coin, and find
the Jeffreys density for p. You observe a man toss a coin 50 times, getting 20
successes, and he asks you, as consulting Bayesian, to compute the posterior density
of p. In an effort to be impartial, you do so with the Jeffreys density for p for 50
tosses of a coin. He then confides in you that he stopped tossing when 20 successes
were reached. Do you change the posterior density?
P10. Let P_θ have density (1 + θ′x)/4π with respect to uniform probability on the three
dimensional sphere, for each x, θ in the sphere. Show that the Jeffreys probability
is uniform over the sphere. If θ is constrained to lie in a great circle through the
poles, show that the Jeffreys probability is uniform over the great circle. Show that the
constrained Jeffreys probability is not a conditional probability for the unconstrained
Jeffreys probability. (Similarly it is not possible to have joint probabilities
and conditional probabilities which are rotation invariant.)
5.7. References
Bernardo, J. M. (1979), Reference posterior distributions for Bayesian inference
(with discussion), J. Roy. Statist. Soc. B 41, 113-147.
Christensen, Ronald (1981), Entropy Minimax Sourcebook, Vol. I: General Description.
Lincoln, Massachusetts: Entropy Limited.
Fraser, D. A. S. (1968), The Structure of Inference. New York: John Wiley.
Good, I. J. (1966), A derivation of the probabilistic explanation of information,
J. Roy. Statist. Soc. B 28,578-581.
Good, I. J. (1969), What is the use of a distribution?, in Krishnaiah (ed.), Multivariate
Analysis Vol. II, 183-203. New York: Academic Press.
Hartigan, J. A. (1964), Invariant prior distributions, Ann. Math. Statist. 35, 836-845.
Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math.
Statist. 36, 1137-1152.
Hartigan, J. A. (1966), Note on the confidence-prior of Welch and Peers, J. Roy.
Statist. Soc. B 28, 55-56.
Hartigan, J. A. (1971), Similarity and probability, in V. P. Godambe and D. A. Sprott,
(eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston.
Jaynes, E. T. (1957), Information theory and statistical mechanics, Phys. Rev. 106,
620-630.
Jeffreys, H. (1946), An invariant form for the prior probability in estimation problems,
Proc. R. Soc. London A 186, 453-461.
Kullback, S. (1959), Information Theory and Statistics. New York: Wiley.
Kullback, S. and Leibler, R. A. (1951), On information and sufficiency, Ann. Math.
Statist. 22, 79-86.
Lindley, D. V. (1956), On a measure of the information provided by an experiment,
Ann. Math. Statist. 27, 986-1005.
Perks, W. (1947), Some observations on inverse probability; including a new indifference
rule, J. Inst. Actuaries 73, 285-334.
Pitman, E. J. G. (1979), Some Basic Theory for Statistical Inference. London: Chapman
and Hall.
Shannon, C. E. (1948), A mathematical theory of communication, Bell System Tech. J.
27, 379-423.
Stone, M. (1970), Necessary and sufficient conditions for convergence in probability
to invariant posterior distributions, Ann. Math. Statist. 41,1939-1953.
Zellner, A. (1977), Maximal data information prior distributions, in A. Aykac and
C. Brumat, (eds.), New Developments in the Applications of Bayesian Methods,
p. 211-232. Amsterdam: North Holland.
Welch, B. L. and Peers, H. W. (1963), On formulae for confidence points based on
integrals of weighted likelihoods, J. Roy. Statist. Soc. B 25, 318-329.
Winkler, R. L. (1967), The assessment of prior distributions in Bayesian analysis,
J. Am. Stat. Assoc. 62, 776-800.
CHAPTER 6
Decision Theory
6.0. Introduction
Fisher (1922) compared two estimators by considering their distributions
given an unknown parameter of interest. For example, in estimating a normal
distribution mean the sample mean is unbiased with variance 2/π times
the variance of the sample median, for all values of the distribution mean,
so it is to be preferred to the sample median. Of course, it may be difficult
in general to decide between the two families of distributions.
Neyman and Pearson (1933) proposed evaluation of a test statistic by
considering the probability of rejection of the null hypothesis under various
values of the parameter of interest.
Wald (1939) proposed a general theory to cover both of these cases, in
which a general decision function (of the data) is evaluated by its average
loss for each value of the parameter. Wald suggested minimax techniques
for selecting decisions, that arose out of von Neumann's theory of games:
we play a game against nature (the unknown parameter value) so that our
loss will not be too severe if nature chooses the worst parameter value.
Ramsey, de Finetti and Savage use similar ideas from the theory of games
in showing that a coherent betting strategy requires a probability distribution
on the set of bets. There is no technical or conceptual difference between
coherent betting and admissible decision making. If we decide to use one
decision function rather than another, we are accepting the bet corresponding
to the difference in losses for the two decision functions.
6.1. Admissible Decisions
It is necessary to choose one of a set of decisions D. The consequences of the
decisions are determined by the outcome s in a set of possible outcomes S.
For decision d, and outcome s, there is a loss L(d, s). Since there is no reason
to differentiate between two decisions which have the same loss for each
value of s, one may regard the decision set D as a set of real valued functions
on S: d(s) is the loss incurred by decision d if outcome s occurs.
Let d₁ ≤ d₂ mean d₁(s) ≤ d₂(s) for s ∈ S. Say that d is admissible if it is minimal
in D, that is d′ ≤ d, d′ ∈ D ⇒ d′ = d. A complete class C is a subset of D such that
d′ ∈ D − C implies d ≤ d′ for some d in C. A complete class is minimal if it
contains no proper complete class. It is easy to show that if a minimal
complete class exists, it is the set of admissible decision functions. However,
no admissible decision functions may exist; consider S = {1}, D = {d | −∞
< d(s) < ∞}; every decision is inadmissible, and no minimal complete class
exists. See Wald (1939, 1950) for the first theory, and Ferguson (1967) and
Berger (1980) for expository texts.
If P is a probability on a probability space 𝒳 on S, such that D ⊂ 𝒳, a
decision d₀ is a Bayes decision if P(d₀) ≤ P(d), d ∈ D. If a Bayes decision is
unique it must be admissible (for otherwise there exists d′ ≤ d₀, d′ ≠ d₀, which
implies P(d′) ≤ P(d₀), so that d′ is also a Bayes decision, contradicting
uniqueness). We will say that d₀ is P-Bayes.

If P is a finitely additive probability on 𝒳, a linear space of functions
including D, then d₀ is P-Bayes if P(d − d₀) ≥ 0, d ∈ D.
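For a finite outcome set these definitions are easy to compute with. The sketch below is illustrative only (the decision set and its losses are invented); it lists the admissible decisions by pairwise comparison and checks that a decision uniquely Bayes for a strictly positive P is among them.

```python
# Sketch: for finite S, find the admissible decisions in a hypothetical
# decision set D (loss vectors d(s)), and check that a uniquely P-Bayes
# decision for strictly positive P is admissible.

S = [0, 1]
D = {                      # assumed losses (d(s) for s in S)
    "d1": (1.0, 3.0),
    "d2": (2.0, 2.0),
    "d3": (3.0, 1.0),
    "d4": (2.5, 2.5),      # beaten by d2
}

def beats(a, b):  # a <= b everywhere and a < b somewhere
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

admissible = {k for k, v in D.items()
              if not any(beats(w, v) for w in D.values())}
assert admissible == {"d1", "d2", "d3"}

P = (0.6, 0.4)  # strictly positive probability on S
bayes = min(D, key=lambda k: sum(p * x for p, x in zip(P, D[k])))
assert bayes in admissible
print(admissible, bayes)
```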
Theorem. Let P be a finitely additive probability on 𝒳 including D. If d₀
is unique P-Bayes, then d₀ is admissible. If d₀ is admissible and D is a convex
space of bounded functions then d₀ is P-Bayes with respect to a finitely additive
probability P on a space 𝒳 including D [Heath and Sudderth (1978)].
PROOF. If d₀ is P-Bayes, then P(d − d₀) ≥ 0, d ∈ D. If d ≤ d₀, then P(d − d₀) ≤ 0,
so d must be P-Bayes. Since d₀ is unique, d = d₀ and so d₀ is admissible.

If d₀ is admissible, let 𝒳 be the bounded real valued functions, and define

    𝒫 = {X | d₀ + aX ≥ d, some a > 0, some d ∈ D}.

Then X₁, X₂ ∈ 𝒫 ⇒ aX₁ + bX₂ ∈ 𝒫 for a ≥ 0, b ≥ 0, by convexity of D. From
Theorem 2.1, it is possible to extend 𝒫 to 𝒫* ⊃ 𝒫, such that 𝒫* ∩ −𝒫* =
𝒫 ∩ −𝒫 and 𝒫* ∪ −𝒫* = 𝒳. Note that 𝒫 includes all X ≥ 0, and excludes
all X < 0.

Set PX = sup{α | X − αd₀ ∈ 𝒫*}. Then −∞ < PX < ∞ because X is bounded,
and P is an additive functional on 𝒳 with PX ≥ 0 for X ∈ 𝒫; in particular
P is non-negative, that is, PX ≥ 0 for X ≥ 0.

Since d₀ + (d − d₀) ≥ d, d − d₀ ∈ 𝒫 and P(d − d₀) ≥ 0 all d ∈ D. Thus d₀ is
P-Bayes as required. □
Note. The idea of this theorem is that accepting d₀ over all other decisions
d is accepting the bets d − d₀, all d ∈ D. You will surely lose money (that is,
d₀ is inadmissible) unless there is a finitely additive probability P such that
P(d − d₀) ≥ 0, all d ∈ D.

Apparently we should be satisfied with finitely additive probability; however
finitely additive probabilities do not discriminate well between decisions,
so that a unique P-Bayes decision is unusual; many inadmissible
decisions may also be optimal for a given functional P.
Consider for example estimation of a normal mean. The data is x, the
unknown mean θ, and the decision is a function of x, say δ. In the loss
framework, the decision will be represented as a real valued function of θ,

    d(θ) = ∫(δ(x) − θ)² exp[−½(x − θ)²]dx/√(2π).

Consider D to be composed of decisions d: estimate θ by δᵢ(x) with probability αᵢ,
where δᵢ(x) − x → 0 as |x| → ∞, i = 1, …, n. For all such decisions d(θ) → 1
as |θ| → ∞. Uniform finitely additive probability on θ gives value lim_{|θ|→∞} X(θ)
to X when this limit exists. Thus all the decisions proposed are finitely additive
Bayes with respect to the uniform distribution; they are not all admissible,
demonstrating that optimality by a finitely additive probability is rather too
easy. See Heath and Sudderth (1972, 1978).
6.2. Conditional Bayes Decisions
Let X be a random variable into S, 𝒳 and let Y be a random variable into
T, 𝒴, and assume that X × Y is a random variable into S × T, 𝒳 × 𝒴. Assume
there is a quotient probability P^Y_X on Y given X, and denote the value of P^Y_X
at s by P^Y_X(s); this defines a probability on 𝒴. Similarly assume a quotient
probability P^X_Y. By the product rule PZ = P^XP^Y_XZ = P^YP^X_YZ, each Z in 𝒳 × 𝒴.

Now suppose a decision d in D is to be taken using an observation t in T.
The loss, if d is taken when the parameter s is true, is L(d, s). A family Δ of
decision functions δ: T → D is constructed satisfying (s, t) → L(δ(t), s)
∈ 𝒳 × 𝒴 for each δ ∈ Δ. The loss associated with δ is the risk r(δ, s) =
P^Y_X(s)[L(δ, s)].
Theorem. Suppose that, for each t, δ₀(t) is P^X_Y(t)-Bayes. If δ₀ ∈ Δ, then δ₀ is
P^X-Bayes. Conversely if δ′₀ is P^X-Bayes, and δ₀(t) is P^X_Y(t)-Bayes, then δ′₀(t)
is P^X_Y(t)-Bayes as P^Y.

PROOF. Since

    P^X_Y(t)[L(δ₀(t), ·) − L(d, ·)] ≤ 0   all d,
    P^X_Y(t)[L(δ₀(t), ·) − L(δ(t), ·)] ≤ 0   all δ ∈ Δ.

Since P^XP^Y_X = P^YP^X_Y,

    P^XP^Y_X[L(δ₀, ·) − L(δ, ·)] ≤ 0   all δ ∈ Δ,
    P^X[r(δ₀, ·) − r(δ, ·)] ≤ 0   all δ ∈ Δ.

Thus δ₀ is P^X-Bayes.
Conversely if δ′₀ is also P^X-Bayes,

    P^YP^X_Y(t)[L(δ₀(t), ·) − L(δ′₀(t), ·)] ≥ 0
    P^X_Y(t)[L(δ₀(t), ·) − L(δ′₀(t), ·)] = 0   as P^Y. □
Note. The theorem makes it practicable to find Bayes decision functions,
since it is easier to search over the smaller space of decisions D to obtain a
conditional Bayes decision, than to search over the larger space of decision
functions Δ.
EXAMPLE. Let S, T be the real line, let P^Y_X(s) denote the normal distribution
with mean s and variance 1. Let 𝒳 and 𝒴 be the Baire functions on S and T.
Take P^X to be normal with mean 0 and variance σ². Then P^X_Y(t) is normal with
mean σ²t/(1 + σ²) and variance σ²/(1 + σ²).

Let D be the real line, L(d, s) = (d − s)², and let Δ be the Baire functions
from T to D. Then r(δ, s) = ∫(δ(t) − s)² exp[−½(t − s)²]dt/√(2π). For a
particular t, the P^X_Y(t)-Bayes decision δ₀(t) minimizes

    ∫(d − s)² exp[−((1 + σ²)/2σ²)(s − σ²t/(1 + σ²))²]ds,   δ₀(t) = σ²t/(1 + σ²).

Thus the P^X-Bayes decision function is δ₀(t) = σ²t/(1 + σ²).
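A Monte Carlo sketch of this example (not in the text): with s ~ N(0, σ²) and t | s ~ N(s, 1), the posterior-mean decision σ²t/(1 + σ²) should have average squared loss near σ²/(1 + σ²), smaller than that of δ(t) = t.

```python
import random

# Sketch: Monte Carlo check of the normal-normal example. Prior
# s ~ N(0, sigma^2), observation t | s ~ N(s, 1). The conditional Bayes
# decision is delta0(t) = sigma^2 t / (1 + sigma^2).

random.seed(1)
sigma2 = 4.0
n = 200_000
loss_bayes = loss_t = 0.0
for _ in range(n):
    s = random.gauss(0.0, sigma2 ** 0.5)
    t = random.gauss(s, 1.0)
    d0 = sigma2 * t / (1 + sigma2)
    loss_bayes += (d0 - s) ** 2
    loss_t += (t - s) ** 2

# Bayes risk is sigma^2/(1 + sigma^2) = 0.8; delta(t) = t has risk 1.
print(loss_bayes / n, loss_t / n)
assert loss_bayes / n < loss_t / n
```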
It may happen that δ₀ is conditionally Bayes (that is, δ₀(t) is P^X_Y(t)-Bayes
for each t), but not Bayes because δ₀ ∉ Δ. In the present example, if P^X is
uniform, δ₀(t) = t is the conditional Bayes decision but it has risk r(δ₀, s) = 1
which is not integrable, so it is not the Bayes decision; in certain cases,
conditional Bayes decisions are even inadmissible (see Chapter 9 on many
means). It should not be thought that the possible inadmissibility of
conditionally Bayes decisions is caused by P^X not being unitary; in the
present example, if P^X = N(0, σ²), and L(d, s) = exp(½s²/σ²)(d − s)², the
conditional Bayes estimate is δ₀(t) = t, but it is not the Bayes estimate. In
the same way, the inadmissible estimate for many normal means is conditionally
Bayes with respect to a unitary prior distribution. See also Chapter 7
on conditional bets.
6.3. Admissibility of Bayes Decisions
If a Bayes decision d₀ is unique it is admissible [for any decision beating it
would also have to be the Bayes decision].

If the decisions d in D are continuous in some topology on S, and the
finitely additive probability P is supported by S (that is, Pf > 0 if f is
continuous, non-negative, and not identically zero), then any decision which is
P-Bayes is admissible.
[If d₀ is P-optimal and d′ ≤ d₀, then P(d₀) ≤ P(d′) ⇒ P(d₀ − d′) = 0. Thus
d′ = d₀ since P is carried by S.]

Let P be a probability on 𝒳 on S. Say that P is supported by S if for each
continuous X in 𝒳, X ≠ 0, X ≥ 0 implies PX > 0. Say that P is Xₙ-σ-finite
if some sequence Xₙ in 𝒳 has Xₙ ↑ 1. Say that a decision d₀ is Xₙ-limit Bayes
if sup_{d∈D} P[Xₙ(d₀ − d)] → 0 as n → ∞.
Theorem. If P is supported by S and is Xₙ-σ-finite for some continuous Xₙ, and
if D consists of continuous functions, and if d₀ is Xₙ-limit Bayes, then d₀ is
admissible.

PROOF. If d′ ≤ d₀ and d′ ≠ d₀, then Xₙ(d₀ − d′) ≥ 0 and Xₙ(d₀ − d′) ≠ 0 for n
large enough, so P[Xₙ(d₀ − d′)] > 0 for n large. Since Xₙ(d₀ − d′) ↑ (d₀ − d′),
P[Xₙ(d₀ − d′)] → 0 is impossible. Thus d₀ is admissible. □
Note. If P is carried by S, if D consists of continuous functions in 𝒳, and if
d₀ is P-Bayes, then d₀ is admissible. The present theorem applies to decisions
d₀ which may not lie in 𝒳.
EXAMPLE. Let x be an observation from N(θ, 1), and suppose that θ is to be
estimated with squared error loss function. The theorem will be used to show
that x is admissible, being Xₙ-limit Bayes with respect to uniform probability
μ on θ.

The estimate δ generates the decision

    d(θ) = ∫(δ(x) − θ)² exp[−½(x − θ)²]dx/√(2π).

The decisions d are θ-continuous. The measure μ is σ-finite with respect to

    fₙ(θ) = exp(−θ²/n) ↑ 1 as n → ∞.

    inf_δ μ(fₙd) = inf_δ ∫∫(δ(x) − θ)² exp[−½(x − θ)² − θ²/n]dxdθ/√(2π)
                 = ∫ inf_δ ∫(δ − θ)² exp[−½(x − θ)² − θ²/n]dθdx/√(2π)
                 = ∫ exp[−x²/(n + 2)]dx/(1 + 2/n)^{3/2}
                 = √π n^{3/2}/(2 + n).

Since μ(fₙ) = √(πn),

    sup_δ μ[fₙ(1 − d)] = √(πn)[1 − n/(n + 2)] = 2√(πn)/(n + 2) → 0.

Since the estimate δ₀(x) = x generates the decision d ≡ 1, δ₀ is Xₙ-limit
Bayes as required. (Here the decision d ≡ 1 does not lie in 𝒳.)
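The rates in this example can be checked numerically. A sketch, not part of the text:

```python
import math

# Sketch: with f_n(theta) = exp(-theta^2/n) and mu Lebesgue measure,
#   mu(f_n) = sqrt(pi n),
#   inf_d mu(f_n d) = sqrt(pi) n^{3/2} / (n + 2),
# so the gap sup_d mu[f_n (1 - d)] = 2 sqrt(pi n) / (n + 2) -> 0.

for n in (10, 100, 1000):
    mu_fn = math.sqrt(math.pi * n)
    inf_mu_fnd = math.sqrt(math.pi) * n ** 1.5 / (n + 2)
    gap = mu_fn - inf_mu_fnd
    assert abs(gap - 2 * math.sqrt(math.pi * n) / (n + 2)) < 1e-9
    print(n, gap)  # gap shrinks as n grows
```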
EXAMPLE 2. Let r be a binomial observation, with P_p{r} = C(n, r)p^r(1 − p)^{n−r},
0 ≤ r ≤ n. Suppose that p is to be estimated with squared error loss function;
the estimator δ corresponds to

    d(p) = Σ_{r=0}^{n} (δ(r) − p)² C(n, r)p^r(1 − p)^{n−r}.

For the measure μ, μ(f) = ∫₀¹ f(p)[p(1 − p)]^{−1}dp, take the functions

    f_a(p) = [p(1 − p)]^a ↑ 1 as a ↓ 0.

    inf_δ μ(df_a) = inf_δ ∫₀¹ Σ_{r=0}^{n} (δ(r) − p)² C(n, r)p^{r+a−1}(1 − p)^{n−r+a−1}dp
      = Σ_{r=0}^{n} C(n, r)[Γ(r + a)Γ(n − r + a)/Γ(n + 2a)](r + a)(n − r + a)/[(n + 2a)²(n + 2a + 1)]
      = Σ_{r=0}^{n} C(n, r)Γ(r + a + 1)Γ(n − r + a + 1)/[Γ(n + 2a + 2)(n + 2a)] → 1/n as a → 0.

For δ₀ = r/n,

    μ(d₀f_a) = ∫₀¹ [p(1 − p)/n]p^{a−1}(1 − p)^{a−1}dp
             = ∫₀¹ p^a(1 − p)^a dp/n = Γ²(a + 1)/[Γ(2a + 2)n] → 1/n as a → 0.

Thus sup_d μ[f_a(d₀ − d)] → 0 as a → 0. Also μ is carried by the interval [0, 1].
Therefore d₀ is admissible.
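The two gamma-function limits can be checked numerically. A sketch, not from the text:

```python
from math import gamma, comb

# Sketch: check the two limits in the binomial example. As a -> 0,
#   inf_d mu(d f_a) = sum_r C(n,r) Gamma(r+a+1) Gamma(n-r+a+1)
#                     / [Gamma(n+2a+2) (n+2a)]          -> 1/n,
#   mu(d0 f_a)      = Gamma(a+1)^2 / [Gamma(2a+2) n]    -> 1/n.

n = 7
for a in (0.1, 0.01, 0.001):
    inf_term = sum(comb(n, r) * gamma(r + a + 1) * gamma(n - r + a + 1)
                   for r in range(n + 1)) / (gamma(n + 2 * a + 2) * (n + 2 * a))
    d0_term = gamma(a + 1) ** 2 / (gamma(2 * a + 2) * n)
    print(a, inf_term, d0_term)  # both approach 1/n = 1/7

# hence sup_d mu[f_a (d0 - d)] -> 0 as a -> 0
```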
6.4. Variations on the Definition of Admissibility
A decision d is beaten by d′ at θ if d(θ) > d′(θ). We say

    d is somewhere beaten by d′ if d′(θ) ≤ d(θ) all θ, d′(θ) < d(θ) some θ;
    d is everywhere beaten by d′ if d′(θ) < d(θ) all θ;
    d is uniformly beaten by d′ if inf[d(θ) − d′(θ)] > 0.

Then d is admissible in a set of decisions D if d is not somewhere beaten by
any d′ in D. Say that d is weakly admissible if it is not everywhere beaten by
any d′ in D, and that d is very weakly admissible if it is not uniformly beaten
by any d′ in D.
The sense of admissibility appropriate for finitely additive probabilities is
very weak. If d₀ is a finitely additive Bayes decision with respect to P, then d₀
is very weakly admissible [Heath and Sudderth (1978)]. Conversely, the
argument of Theorem 6.1 shows that if d₀ is very weakly admissible, and D
is convex with sup_s d(s) < ∞ each d in D, then d₀ is finitely additive Bayes
with respect to some P.
The sense of admissibility appropriate for probabilities is weak. If d₀ is a
Bayes decision with respect to P, then d₀ is weakly admissible. However,
converse results are more complicated than in the finitely additive case.
See for example Farrell (1968).
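The three senses of "beaten" are easy to test computationally. The sketch below is illustrative (risk functions and the parameter grid are invented); note that the infimum in "uniformly beaten" is taken here over a bounded grid only, whereas on the whole line the gap in this example has infimum zero.

```python
# Sketch: somewhere / everywhere / uniformly beaten, checked on a grid
# of parameter values theta (a hypothetical, bounded grid).

thetas = [i / 10 for i in range(-20, 21)]

def somewhere_beaten(d, d2):
    vals = [(d(t), d2(t)) for t in thetas]
    return all(b <= a for a, b in vals) and any(b < a for a, b in vals)

def everywhere_beaten(d, d2):
    return all(d2(t) < d(t) for t in thetas)

def uniformly_beaten(d, d2):
    return min(d(t) - d2(t) for t in thetas) > 0  # inf over the grid only

d  = lambda t: 1.0                        # constant risk 1
d2 = lambda t: 1.0 - 0.5 / (1 + t * t)    # strictly smaller; gap -> 0 as |t| grows

assert somewhere_beaten(d, d2) and everywhere_beaten(d, d2)
# on the full line the gap 0.5/(1+t^2) has infimum 0, so d would not be
# uniformly beaten; on this bounded grid the minimum gap is still positive:
print(min(d(t) - d2(t) for t in thetas))  # 0.1, attained at t = ±2
```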
If D consists of continuous functions and S is compact, then a finitely
additive P on the space of continuous functions on S is countably additive
(since if a decreasing sequence of functions converges to zero it converges
uniformly to zero). Thus if d₀ is weakly admissible, it is P-Bayes with respect
to a unitary probability P. More generally if D consists of continuous
functions zero outside compact subsets of S, a weakly admissible d₀ is
P-Bayes with respect to a unitary probability P. [If d₀ is carried by S′,
consider decisions and probabilities restricted to S′.]
6.5. Problems
E1. Let the sample space S be finite. Let D be a set of decisions on S (real valued functions
on S). Let P give positive probability to each non-zero non-negative X on S. Show
that a P-Bayes decision is admissible.
E2. For decisions D on a finite S, show that no P-Bayes decision may exist, and that it
might not be admissible if it does exist.
P1. Let X be a Poisson observation with P(X = x) = λ^x e^(−λ)/x!. Show that X is an
admissible estimate of λ with squared error loss.
P2. Let X be binomial with P(X = x) = C(n, x)p^x(1 − p)^{n−x}, 0 ≤ x ≤ n. Consider
estimates δ of p using squared error loss. Show that the estimate δ(x) = x/n is
weakly admissible.
6.6. References
Berger, James O. (1980), Statistical Decision Theory. New York: Springer-Verlag.
Farrell, R. (1968), Towards a theory of generalized Bayes tests, Ann. Math. Statist.
39, 1-22.
Ferguson, T. S. (1967), Mathematical Statistics, a Decision Theoretic Approach.
New York: Academic Press.
Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Phil.
Trans. Roy. Soc. A 222, 309-368.
Heath, D. C. and Sudderth, W. D. (1972), On a theorem of de Finetti, odds making,
and game theory, Ann. Math. Statist. 43, 2072-2077.
Heath, D. C. and Sudderth, W. D. (1978), On finitely additive priors, coherence, and
extended admissibility, Ann. Statist. 6, 333-345.
Neyman, J. and Pearson, E. S. (1933), On the problem of the most efficient tests of
statistical hypotheses. Phil. Trans. Roy. Soc. A 231, 289-337.
Neyman, J. and Pearson, E. S. (1933), The testing of statistical hypotheses in relation
to probabilities a priori, Proc. Camb. Phil. Soc. 24,492-510.
Wald, A. (1939), Contributions to the theory of statistical estimation and testing
hypotheses, Ann. Math. Statist. 10, 299-326.
Wald, A. (1950), Statistical Decision Functions. New York: John Wiley.
CHAPTER 7
Uniformity Criteria for Selecting
Decisions
7.0. Introduction
The set of admissible decision functions in a particular problem is usually so
large that further criteria must be introduced to guide selection of a decision
function. Many such criteria require that the unknown parameter values be
treated "uniformly" in some way; decision procedures are required to be
invariant or unbiased or minimax or to have confidence properties. Since
selection of a decision function, from a Bayesian point of view, is selection of
a probability distribution on the parameter values according to which the
decision function is optimal, these criteria may be viewed as methods of
selecting indifference probability distributions on the parameter values.
The general conclusion is that the various uniformity criteria are satisfied
by no unitary Bayes decision procedures, establishing the necessity for
considering non-unitary probabilities.
7.1. Bayes Estimates Are Biased or Exact
Let P be a probability on 𝒳, let θ ∈ 𝒳 be such that a conditional probability
P[X | θ] exists satisfying PX = P[P(X | θ)] for all X. Let 𝒴 be a subspace of 𝒳.
An estimate Y in 𝒴 of θ is unbiased if P[Y | θ] = θ and exact if P[Y ≠ θ] = 0.
A Bayes estimate of θ in 𝒴 with respect to squared error loss is a Y such that
P[Y − θ]² is a minimum over P[Y* − θ]² with Y* ∈ 𝒴, (Y* − θ)² ∈ 𝒳.
Theorem. An unbiased Bayes estimate is exact.
PROOF. Let Y be an unbiased Bayes estimate. Define X^A = (−A) ∨ X ∧ A
for each A ≥ 0. Note |X|A ≥ (X^A)², so (X^A)² ∈ 𝒳.

Setting Y* = Y + εY^A,

    P[Y − θ]² ≤ P[Y* − θ]² = P[Y − θ]² + 2εP[Y^A(Y − θ)] + ε²P(Y^A)².

Taking ε small and of sign P[Y^A(θ − Y)],

    P[Y^A(θ − Y)] = 0.

If Y is unbiased,

    P[Y | θ] = θ
    P[Y − θ | θ] = 0
    P[(Y − θ)θ^A | θ] = 0
    P[(Y − θ)θ^A] = 0.

Thus if Y is Bayes and unbiased,

    P[Y − θ][θ^A − Y^A] = 0.

Since x − y ≥ x^A − y^A if x ≥ y,

    P[θ^A − Y^A]² ≤ 0,

which implies

    P[θ^A ≠ Y^A] = 0 all A,

P[θ ≠ Y] = 0, so Y is exact. □
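The key inequality in the last step, that truncation x^A = (−A) ∨ x ∧ A is monotone and contracting, can be checked numerically. A sketch, not from the text:

```python
import random

# Sketch: with trunc(x, A) = max(-A, min(x, A)), if x >= y then
# x - y >= trunc(x, A) - trunc(y, A) >= 0, hence
# (x - y)(x^A - y^A) >= (x^A - y^A)^2 for all x, y.

def trunc(x, A):
    return max(-A, min(x, A))

random.seed(0)
for _ in range(10_000):
    x = random.uniform(-10, 10)
    y = random.uniform(-10, 10)
    A = random.uniform(0, 5)
    xa, ya = trunc(x, A), trunc(y, A)
    if x >= y:
        assert x - y >= xa - ya - 1e-12
    assert (x - y) * (xa - ya) >= (xa - ya) ** 2 - 1e-12
print("inequalities hold")
```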
Note. It may happen that a posterior mean is unbiased. For example, let
Y given θ be distributed as N(θ, 1) and let θ be uniform on the line; the
posterior mean of θ given Y is Y, and

    P[(θ − Y)² | Y] ≤ P[(θ − D(Y))² | Y]   for all Borel functions D.

Also Y is unbiased, P[Y | θ] = θ. However Y is not the Bayes estimate of θ
in the class of functions D(Y), because none of the functions [D(Y) − θ]² are
integrable.
7.2. Unbiased Location Estimates
Let X₁, X₂, …, Xₙ and θ be real valued random variables on 𝒳. Suppose
X₁, …, Xₙ are independent and identically distributed given θ, and that
X₁ − θ given θ has a distribution which does not depend on θ; assume that
this distribution has density f with respect to Lebesgue measure.

An invariant estimator δ of θ satisfies δ(X + a) = δ(X) + a.
Theorem. Suppose that X₁ has finite second moment given θ. The Pitman
estimator, the posterior mean of θ given X corresponding to a uniform prior
probability on θ, is unbiased and has minimum mean square error given θ of
all invariant estimators.
PROOF. Consider first the case of one observation. Any invariant estimator
is of the form δ(X) = X + a and has mean square error var X + [P_θδ(X) − θ]²,
so the Pitman estimate δ₀ will be optimal if it is unbiased.
Now

    δ₀ = ∫θf(x − θ)dθ / ∫f(x − θ)dθ = ∫(x − u)f(u)du = x − ∫uf(u)du
    P_θδ₀ = ∫xf(x − θ)dx − ∫uf(u)du = θ.

Thus the Pitman estimator is unbiased and optimal.
For n observations, consider the behavior of an invariant estimator
δ(X) and the Pitman estimator δ₀(X) conditional on X₂ − X₁, X₃ − X₁, …,
Xₙ − X₁. The conditional density of X₁ is Πf(Xᵢ − θ)/∫Πf(Xᵢ − θ)dθ;
the conditional Pitman estimate corresponding to this density is just δ₀(X).
Also δ(X) is an invariant estimator of θ, considered as a function of X₁
with Xᵢ − X₁ fixed, i = 2, …, n. Thus the conditional risk of δ₀ is no greater
than that of δ, and hence the unconditional risk of δ₀ is optimal. Similarly,
since δ₀ is conditionally unbiased for θ, it is unconditionally unbiased. □
Note: The Pitman estimator is not the Bayes estimator corresponding to
a uniform prior, because it has constant risk which is not integrable. Stein
(1959) shows that the Pitman estimator is admissible whenever X₁ has
finite third moment given θ, and Brown and Fox (1974) have shown
admissibility under weak conditions.
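A Monte Carlo sketch of the one-observation case (not in the text): for X = θ + U with U ~ Exponential(1), the Pitman estimate is δ₀(x) = x − ∫uf(u)du = x − 1; it is unbiased and beats the invariant estimator δ(x) = x in mean square error.

```python
import random

# Sketch: Pitman estimator for a single observation X = theta + U,
# U ~ Exponential(1): delta0(x) = x - E[U] = x - 1.

random.seed(2)
theta = 3.7
n = 200_000
se_pitman = se_naive = bias = 0.0
for _ in range(n):
    x = theta + random.expovariate(1.0)
    bias += (x - 1) - theta
    se_pitman += ((x - 1) - theta) ** 2
    se_naive += (x - theta) ** 2

print(bias / n)       # near 0: Pitman estimate is unbiased
print(se_pitman / n)  # near var U = 1
print(se_naive / n)   # near bias^2 + variance = 2
assert se_pitman / n < se_naive / n
```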
7.3. Unbiased Bayes Tests
In testing, a decision is made whether a parameter θ lies in a set H₀ or in a
set H₁, H₀H₁ = 0, H₀ + H₁ = 1. Thus d takes the values H₀ or H₁ and has
loss

    L(d, θ) = α{d = H₁}H₀ + β{d = H₀}H₁.

The loss is α if you mistakenly decide θ ∈ H₁ and β if you mistakenly decide
θ ∈ H₀.

A decision function based on a random variable Y is a function δ(Y)
taking values H₀ or H₁. The decision function is unbiased if
P_θL(δ(Y), θ) ≤ P_θL(δ(Y), θ′) for θ, θ′, which is equivalent to

    P_θ[δ = H₁] ≤ β/(α + β) ≤ P_θ′[δ = H₁]   for θ ∈ H₀, θ′ ∈ H₁.

Since α and β are usually arbitrary, the more usual definition of unbiasedness
requires this inequality for all α, β:

    P_θ[δ = H₁] ≤ P_θ′[δ = H₁]   for θ ∈ H₀, θ′ ∈ H₁.
Suppose now that Y and θ are random variables on 𝒳, a probability P
is defined on 𝒳, and a quotient probability P^θ_Y exists such that P^θP^Y_θ = P^YP^θ_Y.
The conditional Bayes decision δ(Y) minimizes P_Y[α{δ(Y) = H₁}H₀ +
β{δ(Y) = H₀}H₁], which requires

    P_Y[θ ∈ H₁] ≤ β/(α + β) ≤ P_{Y′}[θ ∈ H₁]   for δ(Y) = H₀, δ(Y′) = H₁.

Compare the form of the Bayes decision with the unbiasedness requirement.
The conditional Bayes decision δ is the Bayes decision if L(δ(Y), θ) ∈ 𝒳. The
Bayes decision is saying the obvious, that you decide θ ∈ H₀ if the conditional
probability of H₀ is large, and that you decide θ ∈ H₁ if the conditional
probability of H₀ is small.
To test θ = θ₀ against θ ≠ θ₀, assume that P^Y_θ has density f_θ(Y) with respect
to some probability Q. The posterior distribution of θ given Y is given by

    (P^θ_Y g)(t) = P[g(θ)f_θ(t)]/P[f_θ(t)]

and the conditional Bayes decision is:

    δ(t) = 1   if P{θ = θ₀}f_{θ₀}(t)/P[f_θ(t)] ≤ β/(α + β).

For a given prior P on θ, consider the mixture P_a,

    P_aX = aX(θ₀) + PX.

The conditional Bayes decision is δ(t) = 1 if f_{θ₀}(t)/P[f_θ(t)] ≤ k where k
depends on a, α, β. The atom at {θ₀} affects only k. A test of this form will be
called a P-Bayes test for θ = θ₀ against θ ≠ θ₀.
Theorem. Let θ be a real valued random variable. Let Y be a random variable such that P^Y_θ has density f_θ with respect to a probability Q. Assume that f_θ(t) is θ-differentiable for each t, and that sup_{θ∈I} |df_θ/dθ| is Q-integrable for each finite interval I of θ values. Let the prior probability for θ be P^θ.

The test δ(t) = 1 if f_θ₀(t)/P[f_θ(t)] ≤ k is unbiased for every θ₀, k if and only if f_θ₀(t)/P[f_θ(t)] has the same distribution for every θ₀, letting t have the distribution of Y.
PROOF. Let h(t) = P^θ f_θ(t), g(θ, Y) = f_θ(Y)/h(Y). Unbiasedness requires

P^Y_θ {g(θ, ·) ≤ k} ≤ P^Y_θ′ {g(θ, ·) ≤ k},
Q[{g(θ, ·) ≤ k}(f_θ − f_θ′)] ≤ 0,
(d/dθ) Q[{g(θ, ·) ≤ k} f_θ] = 0,

where the differentiation is justified because sup_{θ∈I} |df_θ/dθ| is Q-integrable.
For φ bounded continuous,

(d/dθ) Q[h φ(g(θ, ·))] = 0,

so

Q[h (dg(θ, ·)/dθ) φ′[g(θ, ·)]] = 0

for φ twice differentiable. But P(φ[g(θ, ·)]) = Q[h φ(g(θ, ·))] for θ fixed. Thus g(θ, Y) must have the same distribution for all θ. The converse follows by running the steps of the proof in reverse. □
Note. Unbiased Bayes tests for unitary P rarely exist, but the above condition is met for some other P. For example, if Y ~ N(θ, 1) and θ is uniform, the Bayes test is: accept θ = θ₀ if |Y − θ₀| ≤ k; and the test statistic Y − θ₀ has the same distribution for all θ₀ since the distribution of Y is uniform.
7.4. Confidence Regions
Let Y and θ be random variables, and let P^Y_θ be the quotient probability on Y given θ. Suppose we wish to select a set of likely θ-values; a decision d will be a set of θ-values, and a decision function δ(Y) selects such a set for each Y value.

A set selection function δ is a confidence procedure if P^Y_θ[θ ∈ δ] = α₀ for all θ. This requirement is analogous to invariance or unbiasedness in that all θ-values are given the same treatment.

Consider a family of testing decisions d(θ), where d(θ₀) = 1 decides θ = θ₀ and d(θ₀) = 0 decides θ ≠ θ₀. The set {θ | d(θ) = 1} is selected by d, giving a correspondence between families of tests and set selection decisions.
The loss (analogous to testing loss) for d is

L(d, θ₀, θ) = α{d(θ₀) ≠ 1}{θ = θ₀} + β{d(θ₀) = 1}{θ ≠ θ₀}
P^Y_θ[L(δ, θ₀, θ)] = α P^Y_θ[θ₀ ∉ δ]{θ = θ₀} + β P^Y_θ[θ₀ ∈ δ]{θ ≠ θ₀}.

Thus we want a large probability P^Y_θ[θ ∈ δ] and a small probability P^Y_θ[θ₀ ∈ δ] with θ ≠ θ₀. The standard decision theory is not applicable because of the appearance of both θ and θ₀ in the loss function; it is necessary to consider a prior distribution over θ and θ₀ to discover admissible set selection procedures δ. For a given prior P^{θ,θ₀}, the Bayes set δ given Y minimizes

α P^{θ,θ₀}_Y{d(θ₀) ≠ 1}{θ = θ₀} + β P^{θ,θ₀}_Y{d(θ₀) = 1}{θ ≠ θ₀},

which requires d(θ₀) = 1 if P^{θ,θ₀}_Y{θ = θ₀} ≥ β/(α + β).

For a given prior P^θ on θ, the conditional prior P^θ_θ₀ suggested by testing is

P^θ_θ₀ X = aX(θ₀) + P^θ X,

and then the set selection procedure is d(θ₀) = 1 if

f_θ₀(Y)/P^θ[f_θ(Y)] ≥ K
where Y has density f_θ given θ. Regions of this form are called Bayes high density regions; see for example Box and Tiao (1973) and Hartigan (1966).
From theorem 5.2, it follows that unitary Bayes high density regions are
confidence regions for all K only if the probability P is a maximal learning
probability; in the many cases where maximal learning probabilities do not
exist, confidence regions cannot be unitary Bayes high density regions.
However, confidence regions are often Bayes high density regions
corresponding to non-unitary prior measures. Hartigan (1966) shows that
high density regions are asymptotically closest to confidence regions for the
Jeffreys density.
7.5. One-Sided Confidence Intervals Are Not
Unitary Bayes
Let Y be a random variable on 𝒳, and let θ be a real valued random variable on 𝒳. A one-sided confidence interval [−∞, θ̄(Y)] is such that P_θ[θ ≤ θ̄(Y)] = α for all θ. A one-sided Bayes interval [−∞, θ̄(Y)] is such that P_Y[θ ≤ θ̄(Y)] = α for all Y.

Theorem. A one-sided unitary Bayes interval of size α, 0 < α < 1, is not a confidence interval.

Proof. Let P_Y[θ ≤ θ̄(Y)] = α. Then P[θ ≤ θ̄(Y)] = α. For each fixed θ₀,

P[θ ≤ θ̄(Y) | θ ≤ θ₀] > α   if P[θ̄(Y) ≥ θ₀ | θ ≤ θ₀] > 0.

If P[θ̄(Y) ≥ θ₀ | θ ≤ θ₀] = 0 for all θ₀, then P[θ̄(Y) ≤ θ] = 1, which contradicts 0 < α < 1. Thus P[θ ≤ θ̄(Y) | θ ≤ θ₀] > α for some θ₀.

If θ ≤ θ̄(Y) is a confidence interval, P_θ[θ ≤ θ̄(Y)] = α for all θ, so P_θ[θ ≤ θ̄(Y) | θ ≤ θ₀] = α and P[θ ≤ θ̄(Y) | θ ≤ θ₀] = α, which is a contradiction. □
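The theorem can be illustrated numerically. A sketch, assuming as a concrete unitary example the prior θ ~ N(0, 1) with Y | θ ~ N(θ, 1) (so the posterior is N(Y/2, 1/2) and the one-sided 95% Bayes upper bound is Y/2 + 1.645/√2; these choices are illustrative, not from the text):

```python
import random, math

random.seed(0)

def coverage(theta, reps=200_000):
    """Frequentist coverage P_theta[theta <= Y/2 + 1.645/sqrt(2)]
    of the 95% posterior upper bound, for a fixed true theta."""
    bound_shift = 1.645 / math.sqrt(2)
    hits = 0
    for _ in range(reps):
        y = theta + random.gauss(0.0, 1.0)
        if theta <= y / 2 + bound_shift:
            hits += 1
    return hits / reps

c0 = coverage(0.0)   # over-covers (about 0.99)
c2 = coverage(2.0)   # under-covers (about 0.63)
print(c0, c2)
```

Since the coverage depends on θ, this unitary Bayes interval cannot be a confidence interval, as the theorem asserts.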
7.6. Conditional Bets
Let Y and θ be random variables on 𝒳. A bet Z(Y, θ) is conditionally probable given Y if Z(Y, θ) ∈ 𝒳 for each Y and P_Y Z(Y, θ) ≥ 0. A bet Z(Y, θ) is conditional given Y if P_θ[Z(Y, θ)f(Y)] < 0 for all θ for no f ≥ 0 such that Z(Y, θ)f(Y) ∈ 𝒳 for all θ.

If Y and θ take finitely many values, the two conditions are equivalent: for any matrix X_ij there exists no p_j ≥ 0 such that Σ_j X_ij p_j < 0 for all i, if and only if there exist α_i ≥ 0 (α ≠ 0) such that Σ_i α_i X_ij ≥ 0 for all j. (Equivalently, a convex set disjoint from the negative quadrant is separated from the negative quadrant by a hyperplane.)

In general, the two senses of conditionality are not equivalent. For example, suppose that Y and θ take values on the integers 1, 2, ….
Define P_θ{Y = i}Z(i, θ) = [−{θ ≤ i} + {θ > i}]/i². Then

Σ_i P_θ{Y = i}Z(i, θ)g(i) = −Σ_{i≥θ} g(i)/i² + Σ_{i<θ} g(i)/i².

This quantity is defined only if Σ g(i)/i² converges, and thus it cannot be negative for every θ when g ≥ 0, and Z(i, θ) is a conditional bet. For any probability P^θ,

Σ_θ P^θ P_θ{Y = i}Z(i, θ) = [−P(θ ≤ i) + P(θ > i)]/i²

is necessarily negative for some i, so Z cannot be conditionally probable.
Theorem. Let Y and θ be real valued random variables, and suppose that (−∞, Y) is a confidence interval of size α for θ, and that 0 < P_θ(Y ≤ a) < 1 for all a. Then Z(Y, θ) = {θ ≤ Y} − α is not a conditional bet given Y.

Proof. For θ ≤ a,

P_θ(Z(Y, θ){Y ≤ a}) = P_θ{θ ≤ Y ≤ a} − α P_θ{Y ≤ a}
= (1 − α)P_θ{Y ≤ a} − P_θ{Y < θ}
= (1 − α)[P_θ{Y ≤ a} − 1] < 0.

For θ > a,

P_θ(Z(Y, θ){Y ≤ a}) = −α P_θ(Y ≤ a) < 0.

Thus Z(Y, θ) is not a conditional bet. □
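A numerical sketch of the proof, for the concrete case Y = θ + ε + 1.645 with ε ~ N(0, 1), so that (−∞, Y) is a one-sided 95% confidence interval for θ; the distribution and the cutoff a = 0 are illustrative assumptions:

```python
import random

random.seed(1)
ALPHA = 0.95
SHIFT = 1.645   # z_0.95, so P_theta(theta <= Y) is about 0.95 for every theta

def mean_Z(theta, a=None, reps=200_000):
    """Monte Carlo mean of Z(Y, theta) = {theta <= Y} - ALPHA,
    optionally multiplied by the nonnegative function f(Y) = {Y <= a}."""
    total = 0.0
    for _ in range(reps):
        y = theta + random.gauss(0.0, 1.0) + SHIFT
        z = (1.0 if theta <= y else 0.0) - ALPHA
        if a is not None and y > a:
            z = 0.0
        total += z
    return total / reps

fair = mean_Z(theta=0.0)        # unconditional expectation: about 0
neg = mean_Z(theta=0.0, a=0.0)  # against f(Y) = {Y <= 0}: strictly negative
print(fair, neg)
```

The same negativity holds at every θ, which is what makes Z fail to be a conditional bet.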
Note. See Olshen (1973) for references and an application to confidence ellipsoids. From 7.5, we know that one-sided confidence intervals are not conditionally probable with respect to a unitary probability, but they may be conditionally probable with respect to a non-unitary probability. If the definition of a conditional bet is weakened, so that Z(Y, θ) is weakly conditional given Y provided P_θ[Z(Y, θ)f(Y)] < 0 for no f such that Z(Y, θ)f(Y) ∈ 𝒳 for all θ AND Z(Y, θ)f(Y) ∈ 𝒳 taking Y and θ random, then if Z(Y, θ) is conditionally probable it is weakly conditional given Y. Thus for example if Y ~ N(θ, 1) where θ is uniform, θ < Y + 1.64 is a 95% confidence interval, conditionally probable, and weakly conditional given Y, but not conditional given Y. The bets ({θ < Y + 1.64} − .95){Y ≤ 0} have negative conditional probability given θ, but are not integrable overall.

Freedman and Purves (1969) and Dawid and Stone (1972) show, under regularity conditions, that the notions of conditionally probable and conditional bet coincide if and only if the distributions P^θ_Y and P^Y_θ are constructed according to Bayes theorem.
7.7. Problems
E1. Let t, 0 ≤ t ≤ n, be an observation from the binomial distribution with P_p{t} = (n choose t) p^t (1 − p)^(n−t). Show that the posterior mean for p, corresponding to any unitary prior probability P, is biased.
P1. Show that the Bayes estimate θ̂ corresponding to the loss function L(d, θ) = |d − θ| is the median of the posterior probability of θ given Y. Does there exist a non-atomic posterior probability for which θ̂ is median unbiased, that is,

P_θ[θ < θ̂] = P_θ[θ ≥ θ̂] = ½   for all θ?
Q1. Let Y₁, Y₂, …, Y_n denote independent observations from f(θ, Y), and let P have density g(θ) with respect to Lebesgue measure on the line. Under suitable regularity conditions, when θ₀ is true, show that the posterior mean is

P[θ | Y₁, …, Y_n] = θ₀ + (1/n)[−(∂/∂θ₀) log g + P_θ₀(f₁f₂ + f₃)/P_θ₀ f₂]/P_θ₀ f₂ + O(n^(−3/2)),

where f_i = [∂^i/∂θ^i log f]_{θ=θ₀}. Then show that g is asymptotically unbiased if

g(θ) = −P_θ[(∂²/∂θ²) log f(θ, Y)]

(the square of the Jeffreys density). Hartigan (1965).
E2. Let Y, 0 ≤ Y < 2π, have density f(θ, Y) = (1/2π)[1 + cos(Y − θ)], where 0 ≤ θ ≤ 2π. Let the prior probability be uniform over 0 ≤ θ ≤ 2π. Show that the Bayes high density region is a confidence region.

Q2. As the number of observations Y₁, …, Y_n becomes infinite, under suitable regularity conditions, find an asymptotic expression for the confidence size of the Bayes high density region with respect to P,

P_θ₀{Π f(θ₀, Y_i)/P_θ[Π f(θ, Y_i)] > k}.
P2. Let the decision d be an ordering of the parameter values θ, with L(d, θ) = l{d(θ′) > d(θ)} − l{d(θ′) < d(θ)} for some loss measure l. Show that the Bayes decision given the observation Y orders θ according to dP_Y/dl.

P3. Let the decision d be an interval (−∞, c) on the line. Let θ be on the line, and set L(d, θ) = −{θ ≤ c} + K(c − θ)₊. Show that the Bayes decision corresponding to a probability P with density g satisfies

K ∫_{−∞}^{c} g(θ) dθ = g(c).
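P3's optimality condition can be checked numerically. A sketch, assuming purely for illustration a standard normal density g and K = 1: the expected loss −∫_{−∞}^c g + K ∫_{−∞}^c (c − θ)g(θ)dθ is minimized over a grid of c, and the minimizer satisfies K ∫_{−∞}^c g ≈ g(c).

```python
import math

K = 1.0

def g(x):   # illustrative prior density: standard normal
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def expected_loss(c, lo=-8.0, steps=2000):
    """Trapezoid rule for the expected loss of the interval (-inf, c]."""
    h = (c - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = h / 2 if i in (0, steps) else h
        total += w * (-1.0 + K * (c - x)) * g(x)
    return total

# minimize over a grid of c values in [-2, 1)
c_star = min((i / 200 for i in range(-400, 200)), key=expected_loss)
lhs = K * 0.5 * (1 + math.erf(c_star / math.sqrt(2)))  # K * integral of g up to c*
print(c_star, lhs, g(c_star))   # lhs is close to g(c_star)
```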
P4. Consider the 95% confidence interval for θ based on one sample from N(θ, 1), {θ ≤ Y + 1.64}. Bet $95 to win $5 that θ ≤ Y + 1.64 whenever Y ≥ 0, and bet $5 to win $95 that θ > Y + 1.64 whenever Y < 0. Find your probable gain as a function of θ.

P5. In the normal location case, show that the 95% confidence interval for θ, (Y − 1.96, Y + 1.96), is not a conditional bet. (Bet an amount proportional to e^Y that θ lies outside the interval.)

P6. Let x₁, x₂, …, x_n be a sample from N(μ, σ²), let x̄ be the mean of x₁, …, x_n and let s be the standard deviation. Show that the confidence interval for μ, (x̄ − ks, x̄ + ks), can be beaten by betting that the interval contains μ if s > 1, and that it doesn't contain μ when s ≤ 1. (Buehler and Feddersen (1963).)
P7. If y, x₁, …, x_n are sampled from N(μ, 1), show that the 95% tolerance interval for y, {y < x̄ + 1.64[1 + (1/n)]^(1/2)}, may be beaten by betting differently according to the value of x̄.
E3. The decision d chooses one of two parameter values s₁, s₂ and L(d, s) = |s − d|. Show that the Bayes decision, given an observation t with density f(s, t), for any prior probability which has P{s₁} > 0, P{s₂} > 0, is

d = s₁   if f(s₁, t)/f(s₂, t) > c,
d = s₂   if f(s₁, t)/f(s₂, t) ≤ c.

P8. x₁, x₂ are observations from N(μ, σ²). A test for μ = 0 against μ ≠ 0 is similar if the probability of deciding μ ≠ 0 when μ = 0 is independent of σ². Are any Bayes tests similar?
P9. If X and Y are random variables with the same distribution, show that P(X − Y > a) ≤ P(|X| > a/2). Let P_θ be a family of probability distributions with positive densities f(θ, Y), 0 ≤ θ ≤ ∞, satisfying the conditions of theorem 7.3, such that f(θ, Y)/f(θ₀, Y) → 0 as θ → ∞ for each Y. Show that no unbiased unitary Bayes test exists.

P10. Let X be an observation with density f(x − θ) with respect to Lebesgue measure, where

f(u) = (2/|u| − 1) f(2 − |u|) {|u| ≤ 2}.

Let g be a prior density with respect to Lebesgue measure,

g(θ) = {[2θ] = 2[θ]},

where [θ] is the largest integer ≤ θ. Show that the posterior mean with respect to g is unbiased for θ. [The uniform distribution is not the only unbiased distribution in location problems.]
7.8. References
Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis. Reading: Addison-Wesley.
Brown, L. D. and Fox, M. (1974), Admissibility in statistical problems involving a location or scale parameter, Ann. Statist. 2, 807-814.
Buehler, R. J. and Feddersen, A. P. (1963), Note on a conditional property of Student's t, Ann. Math. Statist. 34, 1098-1100.
Dawid, A. P. and Stone, M. (1972), Expectation consistency of inverse probability distributions, Biometrika 59, 486-489.
Freedman, D. and Purves, R. A. (1969), Bayes method for bookies, Ann. Math. Statist. 40, 1177-1186.
Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math. Statist. 36, 1137-1154.
Hartigan, J. A. (1966), Estimation by ranking parameters, J. Roy. Statist. Soc. B 28, 32-44.
Olshen, R. A. (1973), The conditional level of the F-test, J. Amer. Statist. Assoc. 68, 692-698.
Pitman, E. J. C. (1939), Location and scale parameters, Biometrika 30, 391-421.
Stein, C. (1959), The admissibility of Pitman's estimator of a single location parameter, Ann. Math. Statist. 30, 970-979.
CHAPTER 8
Exponential Families
8.0. Introduction
Let μ be a probability on 𝒴, and choose a unitary probability P on 𝒴 to minimize the information P(log(dP/dμ)) subject to PY_i = c_i, i = 1, …, k. The optimal probability P has density

dP/dμ = exp[ Σ_{i=1}^{k} a_i Y_i(t) + b ].

Such a P is said to be exponential with respect to μ, for the functions Y and the parameters a, denoted E[μ, Y, a]. The further parameter b is determined as a function of a by P1 = 1. An exponential family {P_s, s ∈ S} consists of E[μ, Y, s], s ∈ S, where S is a subset of k-dimensional Euclidean space. The set of all values s with μ[exp s′Y] < ∞ is convex because exp is convex.

Exponential families are attractive for statistical analyses because they remain exponential under repeated sampling and under formation of posterior distributions. If X is distributed as E[μ, Y, s], then the random sample X₁, X₂, …, X_n is distributed as E[μ^n, ΣY(X_i), s]. If s has prior probability P, and X is distributed as E[μ, Y, s], the posterior probability of s given X₁, X₂, …, X_n is

P_X = E[P, (s₁, s₂, …, s_k, λ(s)), (ΣY₁(X_i), …, ΣY_k(X_i), n)]

where λ(s) = −log μ exp[s′Y(t)]. Thus the posterior probabilities, for all data X, belong to the same exponential family! (This occurs for all prior probabilities P; there is NO special family of "conjugate" prior distributions!)
8.1. Examples of Exponential Families
(i) BERNOULLI. T = {0, 1}, μ{0} = μ{1} = 1.

P_p{t} = p^t (1 − p)^(1−t) for t = 0, 1,   0 ≤ p ≤ 1.

Let s = log p/(1 − p); then

P_s{t} = exp[t log p + (1 − t) log(1 − p)] = exp[ts + λ(s)],   λ(s) = −log(1 + e^s).

For n observations, this becomes the binomial.

(ii) POISSON. T = {0, 1, …}, μ{t} = 1/t!.

P_λ{t} = λ^t e^(−λ)/t! = exp[t log λ − λ] μ{t}.
P_s{t} = exp[ts − e^s] with s = log λ.

(iii) EXPONENTIAL. T = [0, ∞), μ uniform on T.

f_λ(t) = e^(−t/λ)/λ,   s = −1/λ,
f_s(t) = exp[ts + log(−s)].

(iv) NORMAL LOCATION. T = (−∞, ∞), μ unit normal.

f_θ(t) = exp[θt − θ²/2],   s = θ.

(v) NORMAL SCALE. T = (−∞, ∞), μ uniform.

f_σ(t) = (1/(σ√(2π))) exp[−t²/2σ²],
f_s(t) = exp[st² + ½ log(−s/π)],   s = −1/2σ².

[Note. The λ(s) for the posterior density is the second expression in the exponential argument.]

[Note. Here parameters have been transformed to demonstrate reduction to exponential form; in computations, it is usually better to leave the parameters untransformed.]
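The Bernoulli and Poisson reductions can be verified numerically in a short sketch (the particular values p = 0.3 and λ = 2.5 are arbitrary illustrations):

```python
import math

def bernoulli_nat(t, s):
    # P_s{t} = exp(t*s - log(1 + e^s)), t in {0, 1}
    return math.exp(t * s - math.log(1 + math.exp(s)))

def poisson_nat(t, s):
    # P_s{t} = exp(t*s - e^s) * mu{t}, with mu{t} = 1/t!
    return math.exp(t * s - math.exp(s)) / math.factorial(t)

p, lam = 0.3, 2.5
s_bern = math.log(p / (1 - p))
s_pois = math.log(lam)

# agreement with the usual parametrizations
b_ok = abs(bernoulli_nat(1, s_bern) - p) < 1e-12
p_ok = abs(poisson_nat(3, s_pois) - lam ** 3 * math.exp(-lam) / 6) < 1e-12

# normalization: the densities sum to 1 over T
b_sum = bernoulli_nat(0, s_bern) + bernoulli_nat(1, s_bern)
p_sum = sum(poisson_nat(t, s_pois) for t in range(60))
print(b_ok, p_ok, b_sum, p_sum)
```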
8.2. Prior Distributions for the Exponential Family
If P_s = E[μ, Y, s], the Jeffreys density is [var_s Y]^(1/2); only the binomial among the standard families has the Jeffreys density unitary.

Since Y is of minimum variance among unbiased estimators of P_s Y, it is of interest to discover the prior distribution such that the posterior mean is Y. This prior distribution is necessarily not unitary.

Theorem. Suppose P_s = E[μ, Y, s], s₁ < s < s₂, and that, for a given Y, exp(Ys)/μ[exp(Ys)] → 0 as s → s₁ or s → s₂. The posterior mean, given Y, of P_s Y is Y if the prior probability of s is uniform over (s₁, s₂).
PROOF.

P_s Y = μ(Y exp(Ys))/μ(exp(Ys)) = (∂/∂s) log μ[exp(Ys)].

∫_{s₁}^{s₂} (P_s Y − Y)[exp(Ys)/μ(exp Ys)] ds = ∫_{s₁}^{s₂} −(∂/∂s) exp{Ys − log[μ(exp(Ys))]} ds = 0. □

EXAMPLE. In the binomial case t/n is the minimum variance unbiased estimate of p. If s = log(p/(1 − p)) is uniform over (−∞, ∞), then t/n is the posterior mean of p, provided 0 < t < n; in the cases t = 0, n the posterior mean is not defined (the limiting conditions of the theorem break down).
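A numerical check of the example, with illustrative values n = 10, t = 4: under the uniform prior on s, the prior density of p is proportional to 1/(p(1 − p)), and the posterior mean of p integrates to B(t + 1, n − t)/B(t, n − t) = t/n.

```python
import math

n, t = 10, 4

def simpson(f, lo, hi, steps=20_000):
    """Composite Simpson's rule; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += f(lo + i * h) * (4 if i % 2 else 2)
    return total * h / 3

def weight(p):   # binomial likelihood times the prior density 1/(p(1-p))
    return p ** (t - 1) * (1 - p) ** (n - t - 1)

eps = 1e-9   # stay inside the open interval (0, 1)
post_mean = (simpson(lambda p: p * weight(p), eps, 1 - eps)
             / simpson(weight, eps, 1 - eps))
print(post_mean)   # equals t/n = 0.4 up to integration error
```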
Name              Family                           Parameter            The Jeffreys Density (for s)
Binomial          (n choose t) p^t (1 − p)^(n−t)   s = log p/(1 − p)    [np(1 − p)]^(1/2)
Poisson           λ^t e^(−λ)/t!                    s = log λ            e^(s/2)
Exponential       e^(−t/λ)/λ                       s = −1/λ             1/s
Normal Location   (1/√(2π)) e^(θt − θ²/2)          s = θ                1
Normal Scale      (1/(σ√(2π))) e^(−t²/2σ²)         s = −1/2σ²           1/s
8.3. Normal Location
Assume X₁, …, X_n, θ are random variables such that X₁, …, X_n given θ are independent, P^{X_i}_θ = N(θ, 1).

(i) The posterior distribution. If θ has prior probability P, the posterior probability P_{X₁,…,X_n} has density exp[−n(X̄ − θ)²/2]/P exp[−n(X̄ − θ)²/2] with respect to P, where X̄ denotes the mean of X₁, …, X_n. The posterior probability is defined for n large enough provided P[exp(−Aθ²)] < ∞ for some A. Say that such a P is docile.

(ii) Asymptotics. If θ₀ is in the support of a docile P, then P_X eventually concentrates on θ₀ as P_θ₀. (For each ε > 0, P_θ₀{P_X(|θ − θ₀| > ε) ↛ 0} = 0.)

If a docile P has a density wrt Lebesgue measure that is continuous and positive at θ₀, the posterior distribution is asymptotically normal N(X̄, 1/n)
given θ₀. (That is,

P_X(θ ≤ X̄ + z/√n) → ∫_{−∞}^{z} (1/√(2π)) exp(−u²/2) du   [in P_θ₀].)

If a docile P has a density p wrt Lebesgue measure that is continuously differentiable and positive at θ₀, the posterior distribution is asymptotically N(X̄ + [(∂/∂θ₀) log p]/n, 1/n) given θ₀. (That is,

√n [ P_X(θ ≤ X̄ + [(∂/∂θ₀) log p]/n + z/√n) − ∫_{−∞}^{z} (1/√(2π)) exp(−u²/2) du ] → 0   [in P_θ₀].)

Thus the principal effect asymptotically of a smooth prior density is a shift in location of the mean of the posterior distribution.
(iii) The uniform prior. The uniform prior is Lebesgue measure on the line. It is docile, the Jeffreys density, the only density invariant under location and sign changes, and the only density for which the corresponding location estimate is unbiased.

The posterior distribution is P^θ_X = N(X̄, 1/n). The posterior mean X̄ is mean-square admissible, unbiased, and of minimum variance among unbiased estimators.

The high density regions {θ | n(θ − X̄)² < k} of posterior probability α are confidence regions of confidence size α.

To test θ < 0 against θ ≥ 0, the Bayes decision accepts θ < 0 if P_X(θ < 0) > k, which is equivalent to X̄ < c, the uniformly most powerful test of θ < 0 against θ ≥ 0. If X̄₀ is observed, it is customary to report the tail probability P_{θ=0}(X̄ > X̄₀) in testing θ < 0 against θ ≥ 0; this is the same as P_{X̄₀}(θ < 0), the posterior probability of the null hypothesis.

To test θ = 0 against θ ≠ 0, the Bayes test is of the form: accept θ = 0 if |X̄| < c, which is the most powerful unbiased test. If X̄₀ is observed, it is customary to report the tail probability P_{θ=0}(|X̄| > |X̄₀|), which is the same as P_{X̄₀}(|X̄₀ − θ| > |X̄₀|), the posterior probability that the true mean is farther away from the observed mean than 0 is.
(iv) Normal priors. If P = N(θ₀, σ₀²), then P^θ_X = N{[(θ₀/σ₀²) + nX̄]/[(1/σ₀²) + n], [(1/σ₀²) + n]^(−1)}; the posterior is of the same form as the prior. (This is true for any prior; see 8.0.)

The formulae for means and variances may be remembered by the following scheme:

PRIOR:         θ ~ N(θ₀, σ₀²)
OBSERVATION:   X̄ ~ N(θ, 1/n)

Act as if θ₀ is an observation on θ; combine with the observation X̄ inversely weighting by variances:

1/var_X(θ) = 1/var_θ(X̄) + 1/var(θ)
P_X(θ)/var_X(θ) = X̄/var_θ(X̄) + θ₀/var(θ).
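The weighting scheme in code, with illustrative numbers (θ₀ = 1, σ₀² = 4, n = 25, X̄ = 2):

```python
def normal_posterior(theta0, var0, xbar, n):
    """Combine prior theta ~ N(theta0, var0) with X-bar ~ N(theta, 1/n)
    by inverse-variance weighting."""
    precision = 1.0 / var0 + n          # posterior precision
    mean = (theta0 / var0 + n * xbar) / precision
    return mean, 1.0 / precision

mean, var = normal_posterior(theta0=1.0, var0=4.0, xbar=2.0, n=25)
print(mean, var)   # posterior mean lies between theta0 and xbar, close to xbar
```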
It may happen that the prior N(θ₀, σ₀²) and the observed X̄ contradict each other. Note that X̄ − θ₀ ~ N(0, σ₀² + 1/n). Thus if (X̄ − θ₀)²/(σ₀² + 1/n) is very large, we might decide to revise the observation X̄, its distribution given θ, or the prior for θ. The contradiction will not arise if σ₀² is very large.

The posterior distribution P_X approaches the posterior distribution N(X̄, 1/n) corresponding to the uniform as σ₀ → ∞, and so this family of priors is useful in showing admissibility of classical statistical procedures corresponding to the uniform.

(v) Two stage normal priors. Consider a family of normal priors P_λ = N[θ(λ), σ²(λ)]. Given λ, the posterior distribution P_{λ,X} is normal with parameters given in (iv). Suppose that λ itself has a prior distribution Q; then the prior distribution on θ is Q(P_λ), a mixture of normal priors. The posterior distribution corresponding to Q(P_λ) is also a mixture of normals Q_X(P_{λ,X}), where Q_X is the posterior distribution for λ for the prior Q and for the observation X̄ ~ N[θ(λ), σ²(λ) + 1/n]. [Given λ, θ is distributed as N[θ(λ), σ²(λ)] and X̄ is distributed as N(θ, 1/n); ignoring θ, X̄ ~ N(θ(λ), σ²(λ) + 1/n).] Thus values of λ for which [X̄ − θ(λ)]²/[σ²(λ) + 1/n] is large will be downweighted, and contradiction between the prior mean θ(λ) and the observed X̄ is prevented.
8.4. Binomial
The number of successes t, 0 ≤ t ≤ n, has probability (n choose t) p^t (1 − p)^(n−t). The prior P is docile if p^A (1 − p)^B is integrable for some A, B.

(i) The posterior density with respect to the docile prior P is p^t (1 − p)^(n−t)/P[p^t (1 − p)^(n−t)]. The posterior density concentrates with probability 1 on p₀ if p₀ is true and p₀ lies in the support of P. If P has a density that is continuous and positive at p₀, then P_t is asymptotically normal N[t/n, p₀(1 − p₀)/n].
(ii) Beta priors. If P is Be(α, β), having density p^(α−1) (1 − p)^(β−1)/B(α, β) wrt Lebesgue measure, then P_t is Be(α + t, β + n − t). Note that if P = Be(α, β), then Pp = α/(α + β), var p = αβ/[(α + β)²(α + β + 1)]. The Jeffreys density is Be(1/2, 1/2). The "unbiased" prior P = Be(0, 0) has posterior mean P_t p = t/n for 0 < t < n; the posterior distribution is not defined for t = 0, t = n. The admissibility of t/n may be demonstrated by considering it as a limit of the Bayes posterior means (t + α)/(n + 2α) corresponding to priors Be(α, α) as α → 0.
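The beta updating and the Be(α, α) → Be(0, 0) limit in code (the data t = 7 successes in n = 10 trials are illustrative):

```python
def beta_update(a, b, t, n):
    """Prior Be(a, b) plus t successes in n trials gives Be(a + t, b + n - t)."""
    return a + t, b + n - t

def beta_mean(a, b):
    return a / (a + b)

t, n = 7, 10
means = []
for a in (1.0, 0.5, 0.1, 0.001):      # Be(a, a) priors approaching Be(0, 0)
    pa, pb = beta_update(a, a, t, n)
    means.append(beta_mean(pa, pb))   # posterior mean (t + a)/(n + 2a)
print(means)   # tends to t/n = 0.7
```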
(iii) Confidence properties of beta priors. The discreteness of t makes it impossible to achieve unbiasedness of two-sided tests p = p₀ against p ≠ p₀, or to find set selection procedures that have the confidence property. Welch and Peers (1963) show that the Jeffreys density generates Bayes one-sided intervals which are most nearly confidence intervals; but their proof is invalid for discrete observations.
For a particular prior, consider the one-sided Bayes intervals [0, p_t] such that P_t{p ≤ p_t} = α, 0 ≤ t ≤ n. The confidence properties of such intervals are determined by the function P_p(p ≤ p_t); this function is discontinuous in general at p₀, p₁, …, p_n.

For the prior Be(0, 1), P_p(p ≤ p_t) ≤ α for all p > 0, and for the prior Be(1, 0), P_p(p ≤ p_t) ≥ α for all p < 1 (Thatcher, 1964). These might be viewed as "liberal" and "conservative" confidence regions for p. The inequalities follow from the identity

Σ_{j=t}^{n} (n choose j) p^j (1 − p)^(n−j) = ∫₀^p x^(t−1) (1 − x)^(n−t) dx / B(t, n − t + 1),

which relates binomial and beta (confidence and Bayes) probabilities.

For Be(1, 0), P_p(p ≤ p_t) ↓ α as p ↑ p_k, 0 < k ≤ n.
For Be(0, 1), P_p(p ≤ p_t) ↑ α as p ↓ p_k, 0 ≤ k < n.
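The binomial-beta identity can be checked numerically; a sketch (with illustrative values n = 10, t = 4, p = 0.3, integrating by Simpson's rule):

```python
import math

n, t, p = 10, 4, 0.3

# left side: binomial upper tail P(Bin(n, p) >= t)
tail = sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(t, n + 1))

# right side: incomplete beta integral of x^(t-1) (1-x)^(n-t) / B(t, n-t+1)
B = math.factorial(t - 1) * math.factorial(n - t) / math.factorial(n)
steps = 10_000          # even, for Simpson's rule
h = p / steps
def f(x):
    return x ** (t - 1) * (1 - x) ** (n - t) / B
integral = (f(0.0) + f(p)
            + sum(f(i * h) * (4 if i % 2 else 2) for i in range(1, steps))) * h / 3
print(tail, integral)   # the two sides agree
```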
Let p_{t,λ} be such that P_t(p ≤ p_{t,λ}) = α for the prior Be[λ, 1 − λ], 0 ≤ λ ≤ 1. Then p_{t,λ} is increasing in λ. [The prior is equivalent to observing λ successes and 1 − λ failures; the larger λ, the more the posterior for a particular t is shifted to the right.] Thus for each prior Be[λ, 1 − λ], 0 ≤ λ ≤ 1,

lim_{p↑p_{t,λ}} P_p(p ≤ p_t) ≥ α,   lim_{p↓p_{t,λ}} P_p(p ≤ p_t) ≤ α.

The confidence values cross the correct probability α at each of the points of discontinuity p_{t,λ}.

In Figure 1, n = 10, α = 0.9, and the upper and lower bounding confidence curves for Be(0, 1) and Be(1, 0) are given, together with the intermediate curve for Be(1/2, 1/2), the Jeffreys density. Note that P_p(p ≤ p_t) → 1 as p → 0 and P_p(p ≤ p_t) → 0 as p → 1 for the Jeffreys density, so that it can never give confidence values uniformly near α. It does give confidence values which are closer on the average to the correct α than the bounding priors Be(0, 1) and Be(1, 0).
[Figure 1. Confidence functions P_p(p ≤ p_t) plotted against the binomial parameter p, for n = 10, α = 0.9: densities ∝ 1/p (Be(0, 1)), ∝ 1/(1 − p) (Be(1, 0)), and ∝ [p(1 − p)]^(−1/2) (the Jeffreys density).]
An arbitrary interval selection procedure specifies an interval {p ≤ p_t} for each of t = 0, 1, …, n. Its confidence properties are given by the function P_p{p ≤ p_t}, which is discontinuous at each of the points p₀ < p₁ < … < p_n. The overall error of the procedure might be assessed by sup_{ε<p<1−ε} |P_p(p ≤ p_t) − α|; it is necessary to bound p away from 0 and 1, because 0 lies in every interval and 1 usually lies in no intervals, so that P₀(p ≤ p_t) = 1 and P₁(p ≤ p_t) = 0.

The maximum error is achieved at the points of discontinuity; it will be minimized by ensuring that

½(lim_{p↑p_s} P_p(p ≤ p_t) + lim_{p↓p_s} P_p(p ≤ p_t)) = α

at each point of discontinuity p_s. In this case, the asymptotic error at p_s is

(1/(2√(2π))) exp(−Z_α²/2) · (1/√(p_s(1 − p_s))) · (1/√n),

where Z_α is such that P(Z ≥ Z_α) = α for a normally distributed Z. (This result is obtained by equating binomial and beta tail areas and then using Edgeworth expansions for the beta distribution, involving the first three moments.) For a Bayes procedure, the asymptotic error is

(1/√(2π)) exp(−Z_α²/2) · (1/√(p_s(1 − p_s))) · (1/√n) · sup(|Δ|, |Δ − 1|),

where the prior density h satisfies Δ = [(∂/∂p) log(h(p)) p(1 − p)]_{p=p_s}. This error is minimized for all p_s precisely when h = j, the Jeffreys density, and in this case the interval selection procedure is as close as possible to being a confidence procedure. (The error sup_{ε<p<1−ε} |P_p(p ≤ p_t) − α| is O(n^(−1/2)) for every prior, but is smallest for the Jeffreys prior. If p₀, p₁, …, p_n are the upper bounds of intervals taken to ensure ½[lim_{p↑p_s} P_p(p ≤ p_t) + lim_{p↓p_s} P_p(p ≤ p_t)] = α for each p_s, and if p_t* denote the Bayes upper bounds, then p_t* = p_t + O(1/n) for any Bayes procedure, and p_t* = p_t + o(1/n) for the Jeffreys density.)
In the table below, the intervals {p ≤ p_t} are specified corresponding to the three priors Be(0, 1), Be(1, 0), Be(1/2, 1/2), and also for a confidence procedure minimizing maximum error.

EXAMPLE. For n = 10, α = .90, the intervals for various methods:
      Be(0, 1)   Be(1, 0)   Be(1/2, 1/2)   Confidence
p₀    .0000      .2057      .1236          .1487
p₁    .2057      .3368      .2744          .2981
p₂    .3368      .4496      .3948          .4063
p₃    .4496      .5517      .5018          .5118
p₄    .5517      .6458      .5997          .6090
p₅    .6458      .7327      .6901          .6990
p₆    .7327      .8124      .7735          .7823
p₇    .8124      .8842      .8494          .8584
p₈    .8842      .9455      .9164          .9257
p₉    .9455      .9895      .9704          .9799
p₁₀   .9895      1.0000     .9998          1.0000
8.5. Poisson
The number of occurrences t, 0 ≤ t < ∞, has probability P_λ{t} = λ^t e^(−λ)/t!. The prior P is docile if λ^K e^(−λ) is integrable for some K.

(i) The posterior density with respect to the prior probability P is λ^t e^(−λ)/P(λ^t e^(−λ)). The posterior density concentrates on λ₀ with probability 1 if λ₀ is true and lies in the support of P. If P has a continuous positive density at λ₀, then P_t is asymptotically normal N(λ₀, λ₀/n).

(ii) Gamma priors. The prior G(m, a) has density a^m λ^(m−1) e^(−aλ)/Γ(m), the gamma density. The posterior given t is G(m + t, a + 1). The Jeffreys density is G(1/2, 0), not unitary. The "unbiased" prior is G(0, 0), which has posterior mean t; the posterior distribution is not defined for t = 0.

(iii) Confidence properties of gamma priors. Similar considerations to those for the binomial apply. Exact confidence intervals are not possible because of the discreteness of the Poisson. For a particular prior P, let 0 ≤ λ ≤ λ_t be the α-probability interval, P_t(0 ≤ λ ≤ λ_t) = α. The confidence function P_λ(λ ≤ λ_t) will be discontinuous at λ₀, λ₁, ….

For the prior G(0, 0), P_λ(λ ≤ λ_t) ≤ α for all t, and for the prior G(1, 0), P_λ(λ ≤ λ_t) ≥ α for all t. These results follow from the equivalence between Poisson and gamma tails:

Σ_{t=t₀}^{∞} (λ^t/t!) e^(−λ) = ∫₀^λ (x^(t₀−1)/(t₀ − 1)!) e^(−x) dx.

As in the binomial case, Jeffreys' density gives intervals which are closest to confidence intervals in that

½(lim_{λ↑λ_s} P_λ(λ ≤ λ_t) + lim_{λ↓λ_s} P_λ(λ ≤ λ_t))

is closest to α at every λ_s.
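The Poisson-gamma tail identity can be checked numerically; a sketch (with illustrative values t₀ = 4 and λ = 2.5):

```python
import math

t0, lam = 4, 2.5

# left side: Poisson upper tail P(Poisson(lam) >= t0)
tail = 1.0 - sum(lam**t * math.exp(-lam) / math.factorial(t) for t in range(t0))

# right side: gamma integral of x^(t0-1) e^(-x) / (t0-1)! over [0, lam]
steps = 10_000          # even, for Simpson's rule
h = lam / steps
def f(x):
    return x ** (t0 - 1) * math.exp(-x) / math.factorial(t0 - 1)
integral = (f(0.0) + f(lam)
            + sum(f(i * h) * (4 if i % 2 else 2) for i in range(1, steps))) * h / 3
print(tail, integral)   # the two sides agree
```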
8.6. Normal Location and Scale
Suppose X₁, X₂, …, X_n are from N(μ, σ²), where μ and σ² are unknown; then X₁, …, X_n is

E[ν_n, (ΣX_i, ΣX_i²), (μ/σ², −1/2σ²)],

where ν_n is Lebesgue measure on R^n. The prior P is docile if exp[−(A + Bμ²)/σ²] is integrable for some A, B > 0.

(i) General priors. For the prior P, the posterior P_t has density

σ^(−n) exp[−ΣX_i²/2σ² + ΣX_i μ/σ² − nμ²/2σ²] k(X₁, …, X_n)

with respect to P. If (μ₀, σ₀²) lies in the support of P, and (μ₀, σ₀²) is the true value, then the posterior distribution concentrates on (μ₀, σ₀²) with probability 1. If P has a positive continuous density at (μ₀, σ₀²), then the posterior density is asymptotically normal.
(ii) Invariance generated priors. A prior with density [with respect to Lebesgue measure on (μ, σ)]

σ^(−A) exp(−B/2σ² + Cμ/σ² − Dμ²/2σ² + K)

is called an invariance generated prior IG(A, B, C, D). After the observations X₁, X₂, …, X_n, the posterior is IG(A + n, B + ΣX_i², C + ΣX_i, D + n).

Priors of the form IG(A, 0, 0, 0) are improper, invariant priors under the transformations X_i → a + bX_i, μ → a + bμ, σ → |b|σ; by considering posterior distributions obtained from invariant priors, for various types of data, we obtain the distributions IG(A, B, C, D) where B ≥ 0, D is an integer, C² ≤ BD.

The Jeffreys density is IG(2, 0, 0, 0).

The density IG(5, 0, 0, 0), for parameters (1/n)P_{μ,σ²}(ΣX_i) = μ and (1/n)P_{μ,σ²}(ΣX_i²) = σ² + μ², has posterior means (1/n)ΣX_i and (1/n)ΣX_i²; this corresponds to (μ/σ², −1/2σ²) being uniform in the plane, see 8.2.

(iii) Marginal distributions of μ and σ². The marginal density of μ corresponding to IG(A, B, C, D) is K₁(B/2 − Cμ + Dμ²/2)^(−(A−1)/2), which is a Student distribution with A − 2 degrees of freedom. (The conditional density of μ given σ² is normal.)

The marginal density of u = 1/σ² corresponding to IG(A, B, C, D) is K₂ u^((A−2)/2−1) exp[−(B − C²/D)u/2], which is a gamma distribution with A − 2 degrees of freedom.
(iv) The "confidence" prior IG(1, 0, 0, 0). For this prior, the posterior is IG(n + 1, ΣX_i², ΣX_i, n), and the marginal distributions of μ and σ² are:

√n(X̄ − μ)/s ~ T_{n−1},
(n − 1)s²/σ² ~ χ²_{n−1},

where X̄ = (1/n)ΣX_i, s² = Σ(X_i − X̄)²/(n − 1), T_{n−1} denotes a Student distribution on (n − 1) degrees of freedom, and χ²_{n−1} denotes a chi-square distribution on (n − 1) degrees of freedom.

Since the same distributions hold when μ and σ² are fixed, and X̄ and s are random, Bayes intervals and regions have a confidence interpretation. For example, the high density region of μ, {μ | √n|X̄ − μ| ≤ sT_{n−1,α}}, has posterior probability α given X₁, …, X_n, but also probability α given μ, σ². (Here P(|T_{n−1}| ≤ T_{n−1,α}) = α.) Or, the one-sided Bayes interval for σ², {σ² | (n − 1)s²/σ² ≤ χ²_{n−1,α}}, has posterior probability α of containing σ², but also probability α given σ². (Here P(χ²_{n−1} ≤ χ²_{n−1,α}) = α.)

(It should be noted that the high posterior density region for μ and σ² jointly is not a confidence region for this prior; the Jeffreys prior IG(2, 0, 0, 0) gives a high posterior density region

(s/σ)^n exp[−(n − 1)s²/2σ² − n(X̄ − μ)²/2σ²] ≥ c,

which is also a confidence region.)
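The confidence interpretation of the Bayes region for μ can be checked by simulation; a sketch (taking n = 10 with the standard two-sided 95% Student point T₉ ≈ 2.262, and arbitrary illustrative values μ = 3, σ = 2):

```python
import random, statistics, math

random.seed(2)
n, mu, sigma = 10, 3.0, 2.0
T = 2.262   # P(|T_9| <= 2.262) is approximately 0.95

reps = 100_000
hits = 0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s = statistics.stdev(x)   # sample standard deviation, divisor n - 1
    if math.sqrt(n) * abs(xbar - mu) <= s * T:
        hits += 1
print(hits / reps)   # close to 0.95, whatever mu and sigma
```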
The Bayes test for μ = μ₀ against μ ≠ μ₀ is: accept μ = μ₀ if |X̄ − μ₀| < cs. The tail area P[|T_{n−1}| ≥ √n|X̄ − μ₀|/s] is the Bayes posterior probability P_X[|μ − X̄| ≥ |μ₀ − X̄|], the probability that μ is further from the observed X̄ than μ₀.
(v) Unbetworthiness of the confidence interval for μ. The interval √n|μ − X̄| ≤ t_{n−1,α}s, where P(|T_{n−1}| < t_{n−1,α}) = α, is not betworthy. If s < 1, bet 1 − α to receive 1 if √n|μ − X̄| > t_{n−1,α}s. If s ≥ 1, bet α to receive 1 if √n|μ − X̄| ≤ t_{n−1,α}s. The strategy is to bet that μ does not lie in the interval when s is small, and to bet that μ does lie in the interval when s is large. Since P[√n|X̄ − μ| ≤ t_{n−1,α}s | s, μ, σ] increases strictly with s, and averages α over all s,
P[√n|X̄ − μ| ≤ t_{n−1,α}s | s < 1] < α < P[√n|X̄ − μ| ≤ t_{n−1,α}s | s > 1].
The above bet will always have positive expectation, no matter what the value of μ, σ. However, as σ² → 0 or ∞, the net gain from the bet will be arbitrarily close to zero.
More generally, bet αk(s) to receive k(s) if √n|μ − X̄| ≤ t_{n−1,α}s, where k(s) may be negative. Whenever the function k(s) is strictly increasing in s, the net gain from the bet is P_{μ,σ}[k(s)({√n|μ − X̄| ≤ t_{n−1,α}s} − α)] > 0. For k(s) = s^λ, the net gain is σ^λ K(n, α, λ), where K(n, α, λ) has the same sign as λ. Thus the bet s²/K(n, α, 2) + s⁻²/K(n, α, −2) has gain σ² + σ⁻² ≥ 2. It is thus possible to devise bets whose net gain is at least 2 for all σ². See Brown (1967).
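The phenomenon behind the bet — conditional coverage below nominal for small s, above it for large s — is easily seen by simulation. A sketch (this conditioning argument is Brown's; the code and its constants are illustrative assumptions, with n = 4, σ = 1):

```python
# Conditional coverage of the t-interval given s < 1 versus s >= 1,
# for N(mu, 1) samples of size n = 4; t_{3,0.95} = 3.182 is assumed.
import random, statistics, math

random.seed(2)
n, t_crit, mu = 4, 3.182, 0.0
small = [0, 0]                    # [count, covered] for s < 1
large = [0, 0]                    # [count, covered] for s >= 1
for _ in range(100000):
    x = [random.gauss(mu, 1.0) for _ in range(n)]
    s = statistics.stdev(x)
    cover = math.sqrt(n) * abs(statistics.mean(x) - mu) <= s * t_crit
    side = small if s < 1 else large
    side[0] += 1
    side[1] += cover
cov_small = small[1] / small[0]   # below the nominal 0.95
cov_large = large[1] / large[0]   # above the nominal 0.95
```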
(vi) The Behrens–Fisher problem. Suppose that X₁, …, Xₙ are a sample from N(μ₁, σ₁²), and Y₁, …, Yₘ are a sample from N(μ₂, σ₂²). Taking the "confidence" prior density σ₁⁻¹σ₂⁻¹ (with respect to Lebesgue measure ν on μ₁, μ₂, σ₁, σ₂), and letting
X̄ = (1/n)ΣXᵢ, s_x² = Σ(Xᵢ − X̄)²/(n − 1),
Ȳ = (1/m)ΣYᵢ, s_y² = Σ(Yᵢ − Ȳ)²/(m − 1),
the posterior distributions of μ₁ and μ₂ are independently
μ₁ ~ X̄ + s_x T_{n−1}/√n,
μ₂ ~ Ȳ + s_y T_{m−1}/√m.
Then μ₁ − μ₂ is the convolution of two Student distributions.
In order to test μ₁ = μ₂ against μ₁ ≠ μ₂, Behrens and Fisher propose the test which rejects μ₁ = μ₂ if P_{X,Y}[|μ₁ − μ₂ − (X̄ − Ȳ)| > |X̄ − Ȳ|] is sufficiently small.

8. Exponential Families

The Bayes test would reject μ₁ = μ₂ if the posterior density of μ₁ − μ₂ at 0 is sufficiently small—that is, if
∫ [Σᵢ₌₁ⁿ(Xᵢ − v)²]^(−n/2) [Σᵢ₌₁ᵐ(Yᵢ − v)²]^(−m/2) dv
is small enough.
Both tests have probability of rejection, given μ₁ = μ₂, that depends on σ₁/σ₂.
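The Bayes test statistic is a one-dimensional integral and can be evaluated by simple quadrature. A minimal sketch (the samples x, y, the grid, and the step count are illustrative assumptions):

```python
# Unnormalized posterior density of mu1 - mu2 at 0, evaluated by the
# midpoint rule; x and y are made-up samples for illustration.
import math

x = [1.2, 0.4, 0.9, 1.5]           # sample of size n = 4
y = [0.1, -0.3, 0.6]               # sample of size m = 3

def integrand(v):
    sx = sum((xi - v) ** 2 for xi in x)
    sy = sum((yi - v) ** 2 for yi in y)
    return sx ** (-len(x) / 2) * sy ** (-len(y) / 2)

# the integrand decays polynomially, so a wide finite grid suffices
lo, hi, steps = -50.0, 50.0, 200000
h = (hi - lo) / steps
density0 = h * sum(integrand(lo + (i + 0.5) * h) for i in range(steps))
```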
8.7. Problems

E1. Suppose X₁, …, Xₙ is a random sample from N(θ, 1), and that the prior distribution P has P{−1} = P{1} = ½. If θ = 0, what is the asymptotic behavior of the posterior distribution?

E2. If X is an observation from N(θ, 1), show that for every unitary prior P, PP_θ[θ − P_X(θ)]² < 1.

P1. If X is an observation from N(θ, 1), show that aX is an admissible estimate of θ using the loss function L(d, θ) = (d − θ)², for 0 ≤ a ≤ 1.

P2. If X is Poisson with parameter λ, and the prior on λ is gamma, G(m, a), find the Bayes estimate of λ with loss L(d, λ) = (d/λ − λ/d)².
Q1. In a test of 10 questions, a child gets t questions correct where t is binomial with parameter p. Over many children, the parameter p has prior distribution P. The observed number of successes over many children is:

NUMBER CORRECT:      0    1    2    3    4    5    6    7    8    9   10
NUMBER OF CHILDREN: 66  240  540  960 2450 3016 2520 2520 2970 2640 1716

Estimate P.
P3. If an observation t has probability P_s = E[μ, Y, s], show that Bayes tests of s ≤ s₀ against s > s₀ are of form: decide s ≤ s₀ if Y ≤ y₀.

P4. If X₁, …, Xₙ are observations from N(θ, 1), for the uniform prior on θ, find the conditional distribution of X_{k+1}, …, Xₙ given X₁, X₂, …, X_k.

E3. For a normal sample X₁, …, Xₙ from N(μ, σ²), with the prior IG(1, 0, 0, 0), find the posterior mode of μ and σ², and the posterior means of μ and σ², based on the posterior density of μ and σ².

P5. For a normal sample X₁, …, Xₙ from N(μ, σ²), with prior IG(1, 0, 0, 0), find the Bayes estimator of σ² using loss function L(d, σ²) = (d − σ²)², and compare its risk function with those of the maximum likelihood and unbiased estimates of σ².
8.8. References

Brown, L. (1967), The conditional level of the t-test, Ann. Math. Statist. 38, 1068–1071.
Thatcher, A. R. (1964), Relationships between Bayesian and confidence limits for prediction, J. Roy. Statist. Soc. B 26, 176–210.
Welch, B. L. and Peers, H. W. (1963), On formulae for confidence points based on intervals of weighted likelihoods, J. Roy. Statist. Soc. B 25, 318–329.
CHAPTER 9
Many Normal Means

9.0. Introduction

Given X, suppose P^{Yᵢ} = N(Xᵢ, 1), i = 1, 2, …, n, and the Yᵢ are independent. The straight estimate Yᵢ of Xᵢ is least squares, maximum likelihood, of minimum variance among unbiased estimators, and the posterior mean with respect to the Jeffreys density (the X₁, …, Xₙ are uniform), but for all these virtues inadmissible with loss function Σᵢ₌₁ⁿ(dᵢ − Xᵢ)² for n > 2, Stein (1956).
9.1. Baranchik's Theorem

Lemma. If Y ~ N(X, 1), X ≥ 0, and f is integrable,
P[f(Y²)] = Σₖ₌₀^∞ pₖ P[f(χ²_{2k+1})]
where χ²_{2k+1} denotes a variable with the chi-square distribution on 2k + 1 degrees of freedom, and
pₖ = exp(−½X²)(½X²)ᵏ/k!
are Poisson probabilities with expectation ½X².

PROOF. The first result, that a non-central chi-square is a mixture of central chi-squares with Poisson mixing probabilities, should have a nice probabilistic proof, but I don't know one.
(i) P[f(Y²)] = ∫ f(y²) exp[−½(y − X)²] dy/√(2π)
= ∫ f(y²) exp(−½y²) exp(Xy) dy · exp(−½X²)/√(2π)
= ∫ f(y²) Σₖ₌₀^∞ (Xᵏyᵏ/k!) exp(−½y²) dy · exp(−½X²)/√(2π).
The odd powers of y integrate to zero, so
P[f(Y²)] = 2∫₀^∞ f(y²) Σₖ₌₀^∞ [X^{2k}y^{2k}/(2k)!] exp(−½y²) dy · exp(−½X²)/√(2π)
= 2∫₀^∞ f(y²) Σₖ (½X²)ᵏ(½y²)ᵏ/[k!(k − ½)(k − 3/2)⋯½] exp(−½y²) dy · exp(−½X²)/√(2π).
{Note that χ²_{2k+1} = y² = u has density u^{k−1/2} exp(−½u)(½)^{k+1/2}/Γ(k + ½).}
Substituting u = y² and using Γ(½) = √π,
P[f(Y²)] = Σₖ pₖ ∫₀^∞ f(u) u^{k−1/2} exp(−½u)(½)^{k+1/2}/Γ(k + ½) du = Σₖ pₖ P[f(χ²_{2k+1})].
(ii) Keeping instead the odd powers of y,
P[Yf(Y²)] = 2∫₀^∞ f(y²) Σₖ [X^{2k+1}y^{2k+2}/(2k + 1)!] exp(−½y²) dy · exp(−½X²)/√(2π)
= X Σₖ₌₀^∞ pₖ P[f(χ²_{2k+3})] after some algebra. ∎
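The mixture representation can be checked numerically. A sketch (the choice f(u) = √u and all constants are arbitrary illustrative assumptions):

```python
# Monte Carlo check of P[f(Y^2)] = sum_k p_k P[f(chisq_{2k+1})] for
# Y ~ N(X, 1); f(u) = sqrt(u), i.e. f(Y^2) = |Y|, is an arbitrary choice.
import random, math

random.seed(3)
X, reps = 1.5, 200000
f = math.sqrt

lhs = sum(f(random.gauss(X, 1.0) ** 2) for _ in range(reps)) / reps

rhs = 0.0
for k in range(40):                       # Poisson(X^2/2) mixing weights
    p_k = math.exp(-X * X / 2) * (X * X / 2) ** k / math.factorial(k)
    # chi-square on 2k+1 df is Gamma(shape = k + 1/2, scale = 2)
    mean_f = sum(f(random.gammavariate(k + 0.5, 2.0))
                 for _ in range(2000)) / 2000
    rhs += p_k * mean_f
```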
Theorem (Baranchik (1970)). Let Yᵢ be independent N(Xᵢ, 1), i = 1, …, n. Let S = ΣYᵢ², and let f be a non-decreasing non-negative function with f < 2(n − 2). Then P[Σ{Yᵢ[1 − f(S)/S] − Xᵢ}²] < P[Σ(Yᵢ − Xᵢ)²] for every X₁, X₂, …, Xₙ, if n > 2.

PROOF. Since S = Σᵢ₌₁ⁿYᵢ² is invariant under rotations of the Yᵢ, Σ{Yᵢ[1 − f(S)/S] − Xᵢ}² has the same distribution if Y and X undergo the same rotation. It is sufficient therefore to consider X₁ ≥ 0, Xᵢ = 0 for every i ≥ 2.
Let g(S) = 1 − f(S)/S. Then
P[Σ(Yᵢg(S) − Xᵢ)²] = P[ΣYᵢ²g²(S) − 2X₁Y₁g(S) + X₁²],
and S = Y₁² + Z where Z ~ χ²_{n−1} independent of Y₁.
P[Sg²(S)] = P^Z P^{Y₁}[(Y₁² + Z)g²(Y₁² + Z)] = P^Z Σpₖ P[(χ²_{2k+1} + Z)g²(χ²_{2k+1} + Z)] from the lemma
= Σpₖ P[χ²_{2k+n} g²(χ²_{2k+n})], pₖ = exp(−½X₁²)(½X₁²)ᵏ/k!
P[Y₁g(S)] = X₁ Σpₖ P[g(χ²_{2k+n+2})] from the lemma
P[Σ(Yᵢg(S) − Xᵢ)²] = Σpₖ P[χ²_{2k+n} g²(χ²_{2k+n}) − 2X₁² g(χ²_{2k+n+2}) + X₁²]
= Σₖ₌₀^∞ pₖ P[χ²_{2k+n} g²(χ²_{2k+n}) − 4k g(χ²_{2k+n}) + 2k]
P[Σ(Yᵢg(S) − Xᵢ)²] − P[Σ(Yᵢ − Xᵢ)²] = Σpₖ P[χ²_{2k+n} g²(χ²_{2k+n}) − 4k g(χ²_{2k+n}) + 2k − n].
This expression is to be shown < 0.
Set g(S) = 1 − f(S)/S and note that
P[f(χ²_{2k+n})/χ²_{2k+n}] ≤ P[f(χ²_{2k+n})] P[1/χ²_{2k+n}]
because f is non-decreasing, so f(χ²) and 1/χ² are negatively correlated. Then
P[χ²_{2k+n} g²(χ²_{2k+n}) − 4k g(χ²_{2k+n}) + 2k − n] = P[f(χ²_{2k+n})(−2 + f/χ²_{2k+n} + 4k/χ²_{2k+n})]
< P[f(χ²_{2k+n})] · [−2 + (2(n − 2) + 4k)/(n + 2k − 2)] = 0,
since f < 2(n − 2) and P[1/χ²_{2k+n}] = 1/(n + 2k − 2). Thus Yg(S) beats Y. ∎

Note. Yg(S) shrinks the estimate towards 0, but the same result holds if it is shrunk towards any other point Z by Z + (Y − Z)g(S_Z) where S_Z = Σ(Yᵢ − Zᵢ)². Or shrunk towards Ȳ, with f < 2(n − 3).
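The theorem can be illustrated by simulation with the constant choice f(S) = n − 2, which is non-negative, non-decreasing, and below 2(n − 2) for n > 4 — this is the James–Stein estimate. A sketch (the mean vector and sample sizes are illustrative assumptions):

```python
# Compare total squared error of Y and of Y(1 - (n-2)/S), S = sum Y_i^2,
# for one fixed mean vector X; f(S) = n - 2 satisfies Baranchik's conditions.
import random

random.seed(4)
n, reps = 10, 20000
X = [0.5] * n                            # illustrative means
loss_straight = loss_shrunk = 0.0
for _ in range(reps):
    Y = [random.gauss(x, 1.0) for x in X]
    S = sum(y * y for y in Y)
    g = 1.0 - (n - 2) / S
    loss_straight += sum((y - x) ** 2 for y, x in zip(Y, X))
    loss_shrunk += sum((y * g - x) ** 2 for y, x in zip(Y, X))
risk_straight = loss_straight / reps     # about n = 10
risk_shrunk = loss_shrunk / reps         # strictly smaller
```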
9.2. Bayes Estimates Beating the Straight Estimate

Theorem. Suppose Yᵢ ~ N(Xᵢ, σ²), i = 1, 2, …, n, σ² known. Let the prior distribution for Xᵢ be Xᵢ ~ N(0, σ₀²) independently given σ₀², where σ₀² has a density g such that log g is concave in log(σ² + σ₀²) and (σ² + σ₀²)^{1−α/2} g is increasing for some α. The posterior mean of Xᵢ given Yᵢ has smaller mean square error as an estimate of Xᵢ than Yᵢ for every choice of X₁, …, Xₙ, whenever n ≥ 4 − α.

PROOF. First fixing σ₀²,
P[Xᵢ | Y, σ₀²] = Yᵢ/(1 + σ²/σ₀²), and Yᵢ ~ N(0, σ² + σ₀²) independently.
The posterior density of σ₀² given Y₁, …, Yₙ is
g[σ₀² | Y] ∝ [σ² + σ₀²]^{−n/2} exp[−½ΣYᵢ²/(σ² + σ₀²)] g(σ² + σ₀²).
Letting S = ΣYᵢ², V = S/(σ² + σ₀²),
P[V | Y] = ∫V^{n/2−1} exp(−½V)g(S/V)dV / ∫V^{n/2−2} exp(−½V)g(S/V)dV
= P[χ²_{n−2} g(S/χ²_{n−2})] / P[g(S/χ²_{n−2})].
Now
P[Xⱼ | Y] = Yⱼ P[(1 + σ²/σ₀²)^{−1} | Y] = Yⱼ[1 − P(V | Y)σ²/S].
From Theorem 9.1, the estimate P[Xⱼ | Y] will beat Yⱼ if P(V | S) = P(V | Y) is a non-negative, non-decreasing function of S such that P(V | S) < 2(n − 2). It is obviously non-negative.
Let k(V) = V^{n/2−2} exp(−½V). For S > S′,
P[V | S]/P[V | S′] = ∫∫V k(V)g(S/V) k(U)g(S′/U) dV dU / ∫∫U k(U)g(S′/U) k(V)g(S/V) dV dU ≥ 1
if ∫∫(V − U)k(V)k(U)[g(S/V)g(S′/U) − g(S/U)g(S′/V)] dU dV ≥ 0. Since log g is concave in log(σ² + σ₀²), g(S/V)g(S′/U) − g(S/U)g(S′/V) ≥ 0 for S ≥ S′, V ≥ U. Thus P[V | S] is increasing in S.
Since (σ² + σ₀²)^{1−α/2} g = h is increasing,
P[V | S] = ∫V^{(n−α)/2} exp(−½V)h(S/V)dV / ∫V^{(n−α)/2−1} exp(−½V)h(S/V)dV
= P[χ²_{n−α} h(S/χ²_{n−α})] / P[h(S/χ²_{n−α})]
≤ P(χ²_{n−α}) = n − α ≤ 2(n − 2) if n ≥ 4 − α.
Thus P[V | S] satisfies the conditions of Theorem 9.1 and the theorem is proved. ∎
Note. Priors of the above type will be unitary only if α < 0, so that for a unitary Bayes estimate n ≥ 5 is required to beat the straight estimate. For α < 2, the loss Σ(Xⱼ − P[Xⱼ | Y])² is integrable, so the posterior mean is Bayes and hence admissible; a Bayes estimate may thus be obtained for n ≥ 3. Strawderman (1971) considers the densities g(σ₀²) ∝ (σ₀² + σ²)^{α/2−1}; then P[V | S] = P[χ²_{n−α} | χ²_{n−α} < S/σ²]. The particular choice α = 0 is suggested by Jeffreys (1961).
James and Stein (1961) showed that the estimates Yⱼ(1 − (n − 2)σ²/ΣYⱼ²) beat Yⱼ whenever n > 2; these estimates are not admissible. This estimate may be justified by noting that P[(n − 2)/ΣYⱼ²] = 1/(σ² + σ₀²) under the conditions of the theorem, so that the shrinking factor is estimated unbiasedly. The Bayes estimates, in contrast, shrink rather less when S is small than when it is large; for large S, the shrinking factor P[V | S] will be close to (n − α); for small S, it will be close to zero.
If (n − α) is even, P[V | S] = (n − α)P[Z > (n − α)/2]/P[Z ≥ (n − α)/2] where Z is Poisson with expectation ½S/σ². For example, for n = 3, α = 1,
P[V | S] = 2[1 − e^{−S/2σ²}(1 + ½S/σ²)]/[1 − e^{−S/2σ²}],
and the estimate is Yᵢ[1/(1 − exp(−½S/σ²)) − 2σ²/S].
Let σ = 1, and consider the sample Y₁ = 1.2, Y₂ = −0.6, Y₃ = 0.8. Then S = 2.44, the shrinking factor is .6, and the new estimates are .72, −.36, .48.
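The numerical example can be reproduced directly (a sketch; σ = 1, n = 3, α = 1 as in the text):

```python
# Reproduce the n = 3, alpha = 1 example: shrinking factor
# 1 - P[V|S]/S = 1/(1 - exp(-S/2)) - 2/S at S = 2.44.
import math

Y = [1.2, -0.6, 0.8]
S = sum(y * y for y in Y)                      # 2.44
factor = 1.0 / (1.0 - math.exp(-S / 2)) - 2.0 / S
estimates = [round(y * factor, 2) for y in Y]
print(round(S, 2), round(factor, 2), estimates)
# -> 2.44 0.6 [0.72, -0.36, 0.48]
```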
9.3. Shrinking towards the Mean

Lindley and Smith (1972) use the prior Xᵢ ~ N(θ₀, σ₀²), independently for i = 1, 2, …, n given θ₀, and θ₀ ~ N(0, τ²). For the moment σ₀² and τ² will be assumed known. Then
P[Xᵢ | Y, θ₀] = [Yᵢ/σ² + θ₀/σ₀²]/(1/σ² + 1/σ₀²),
and Yᵢ ~ N[θ₀, σ² + σ₀²] independently for i = 1, 2, …, n.
If τ = ∞, note that
P[Xᵢ | Y, σ₀²] = [Yᵢ/σ² + Ȳ/σ₀²]/(1/σ² + 1/σ₀²) = Ȳ + (Yᵢ − Ȳ)[1 − σ²/(σ² + σ₀²)].
If σ₀² has prior density (σ² + σ₀²)^{α/2−1}, then
P[Σ(Yᵢ − Ȳ)²/(σ² + σ₀²) | Y] = P[χ²_{n−α−1} | χ²_{n−α−1} < Σ(Yᵢ − Ȳ)²/σ²],
and the estimate beats Yᵢ if n ≥ 5 − α, from Theorem 9.1. Shrinking towards the mean rather than towards an arbitrary constant loses a degree of freedom, but doesn't change the basic arguments.
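A simulation sketch of shrinking towards the sample mean, using the constant choice f = n − 3 allowed by the note to Theorem 9.1 (the mean vector, σ = 1, and sample sizes are illustrative assumptions):

```python
# Shrinking towards ybar with coefficient (n-3)/sum(Y_i - ybar)^2
# versus the straight estimate; sigma = 1, means are made up.
import random

random.seed(5)
n, reps = 10, 20000
X = [2.0 + 0.3 * i for i in range(n)]          # illustrative means
loss_straight = loss_shrunk = 0.0
for _ in range(reps):
    Y = [random.gauss(x, 1.0) for x in X]
    ybar = sum(Y) / n
    Sz = sum((y - ybar) ** 2 for y in Y)
    g = 1.0 - (n - 3) / Sz
    loss_straight += sum((y - x) ** 2 for y, x in zip(Y, X))
    loss_shrunk += sum((ybar + (y - ybar) * g - x) ** 2
                       for y, x in zip(Y, X))
```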
9.4. A Random Sample of Means

Suppose Yᵢ ~ N(Xᵢ, 1) independently, and the Xᵢ are a random sample from some prior P₀. The Yᵢ are then a random sample from the density g,
g(y) = P₀[exp(−½(X − y)²)/√(2π)].
Assuming first that g is known, the posterior mean of Xᵢ given Yᵢ is
P₀[X exp(−½(X − Yᵢ)²)]/P₀[exp(−½(X − Yᵢ)²)] = Yᵢ + (d/dYᵢ) log g.
If g is not known, it is necessary to place a prior distribution on it so that the posterior expectation of the "correction" (d/dY) log g may be computed.
An "empirical Bayes" approach permits estimation of g by any method, not necessarily a Bayesian method. It is known that Y₁, …, Yₙ form a random sample from g. A density estimation technique might be used to estimate (d/dY) log g. For example, if log g has a continuous first derivative at y,
P[(Y − y) | |Y − y| < ε]/P[(Y − y)² | |Y − y| < ε] → (d/dy) log g as ε → 0.
Thus (d/dy) log g may be estimated by Σ_{|Yᵢ−y|<ε}(Yᵢ − y)/Σ_{|Yᵢ−y|<ε}(Yᵢ − y)² for sensibly selected ε. (For ε large enough to include all data values, the estimate will be similar to the James–Stein estimate of 9.2.) The estimate Yᵢ will be replaced by an estimate closer to the mean of those observations near Yᵢ.
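The local estimate of the correction can be sketched as follows. Here the true prior is taken to be N(0, 2) — an assumption made so the answer is checkable: then g = N(0, 3) and (d/dy) log g(y) = −y/3.

```python
# Estimate (d/dy) log g at y = 1 from a sample Y_i = X_i + Z_i with
# X_i ~ N(0, 2) (assumed prior); true correction at y = 1 is -1/3.
import random, math

random.seed(6)
N, y, eps = 400000, 1.0, 0.25
Y = [random.gauss(0, math.sqrt(2)) + random.gauss(0, 1) for _ in range(N)]
near = [yi - y for yi in Y if abs(yi - y) < eps]
correction = sum(near) / sum(v * v for v in near)   # about -1/3
estimate = y + correction                           # about 2/3
```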
9.5. When Most of the Means Are Small

In 9.2 and 9.3, g is normal with mean 0 and unknown variance, and a prior distribution is placed on the variance. In many regression and analysis of variance problems, most of the means Xᵢ are very close to zero, but a few are quite large. Such a situation is not well represented by a normal g, because it is not sufficiently long-tailed. One alternative is to assume that Xᵢ comes from a distribution pδ₀ + (1 − p)N(0, σ₀²) where δ₀{0} = 1. Then Yᵢ is a random sample from pN(0, 1) + (1 − p)N(0, σ₀² + 1),
g(y) = (1/√(2π)){p exp(−½y²) + [(1 − p)/(1 + σ₀²)^{1/2}] exp(−½y²/(1 + σ₀²))}
(d/dy) log g(y) = −y{p exp(−½y²) + [(1 − p)/(1 + σ₀²)^{3/2}] exp(−½y²/(1 + σ₀²))} / {p exp(−½y²) + [(1 − p)/(1 + σ₀²)^{1/2}] exp(−½y²/(1 + σ₀²))}.
If y is small, the adjustment is close to −y; if y is large it is close to −y/(1 + σ₀²); in this way the small observed values Yᵢ are moved very close to zero, but
the large observed values Yᵢ are relatively unchanged. In practice p and σ₀² must be estimated from the Yᵢ. A Bayesian approach requires computation of the posterior mean of (d/dy) log g(y), but no prior on p and σ₀² is known which permits explicit computation. It is computationally straightforward to estimate p and σ₀² to maximize the likelihood of the observations, but explicit expressions are not available, and it is not known whether the resulting estimates of the Xᵢ beat the straight estimates.
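The two limiting behaviors of the adjustment can be illustrated numerically (p = 0.9 and σ₀² = 100 are illustrative assumptions, not values from the text):

```python
# Adjustment (d/dy) log g for the mixture prior p*delta_0 + (1-p)N(0, s0sq):
# a small observation is pulled almost to zero, a large one barely moves.
import math

p, s0sq = 0.9, 100.0                     # assumed illustrative values

def adjustment(y):
    a = p * math.exp(-y * y / 2)
    b = (1 - p) / math.sqrt(1 + s0sq) * math.exp(-y * y / (2 * (1 + s0sq)))
    c = (1 - p) / (1 + s0sq) ** 1.5 * math.exp(-y * y / (2 * (1 + s0sq)))
    return -y * (a + c) / (a + b)

small = 0.5 + adjustment(0.5)            # near 0
large = 20.0 + adjustment(20.0)          # near 20
```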
A standard approach to the problem of many small means is to carry out a significance test on each mean separately, and to set to zero all those means which do not reach some significance level. Here, the estimate would be X̂ᵢ = Yᵢ{|Yᵢ| ≥ c}, where c is the cutoff point in the significance test. Then
ΣP(Yᵢ − Xᵢ)² − ΣP(X̂ᵢ − Xᵢ)² = ΣP({|Yᵢ| < c}(Yᵢ² − 2YᵢXᵢ)) = ΣP({|Zᵢ + Xᵢ| < c}(Zᵢ² − Xᵢ²))
where Zᵢ ~ N(0, 1). If |Xᵢ| > 1, P({|Z + Xᵢ| < c}(Z² − Xᵢ²)) < 0 for every choice of c. Thus there is no way to choose c so that the estimates X̂ᵢ have uniformly smaller mean square error than Yᵢ; it does not help to allow c to depend on the Yᵢ.
Yet there is practical value in setting many small means to be exactly zero if there is no evidence of significant departure from zero. Suppose the loss function is
L(d, s) = {d ≠ s} + k(d − s)².
Let P₀ be a unitary probability on S which has an atom P₀{s₀} only at s₀. Then the probable loss for d is
P₀{d ≠ S} + kP₀(d − S)² = P₀{s₀} + {d = s₀}(1 − 2P₀{s₀}) + k(d − P₀S)² + kP₀(S − P₀S)².
The Bayes decision is d = s₀ if 2P₀{s₀} > k(s₀ − P₀S)² + 1, and d = P₀S otherwise.
If Yᵢ ~ N(Xᵢ, 1) independently, where the Xᵢ are sampled from pδ₀ + (1 − p)N(0, σ₀²), then
Xᵢ | Yᵢ ~ p_{Yᵢ}δ₀ + (1 − p_{Yᵢ}) N[Yᵢ/(1 + 1/σ₀²), 1/(1 + 1/σ₀²)],
where
p_{Yᵢ} = {1 + [(1 − p)/p](1 + σ₀²)^{−1/2} exp[½Yᵢ²σ₀²/(1 + σ₀²)]}^{−1}
is the posterior probability that Yᵢ came from the δ₀ component. The Bayes estimate will be X̂ᵢ = 0 if
k(1 − p_{Yᵢ})²Yᵢ²/(1 + 1/σ₀²)² + 1 < 2p_{Yᵢ},
and
X̂ᵢ = (1 − p_{Yᵢ})Yᵢ/(1 + 1/σ₀²)
otherwise.
9.6. Multivariate Means

Let Y ~ N(X, Σ), X ~ N(0, kΣ₀), where Y and X are n-dimensional vectors, Σ and Σ₀ are known covariance matrices, and k is unknown. By a linear transformation applied to Y and X, this case may be reduced to Yᵢ ~ N(Xᵢ, vᵢ²), Xᵢ ~ N(0, v₀²) where v₀² is unknown, and the distributions are independent for different i. See Efron and Morris (1973).
Given v₀², P(Xᵢ | Y) = Yᵢ(1 − vᵢ²/(v₀² + vᵢ²)).
A Bayes procedure for a prior density f(v₀²) on v₀² would use
P[1/(v₀² + vᵢ²) | Y] = ∫(v₀² + vᵢ²)^{−1} Π(v₀² + vⱼ²)^{−1/2} exp[−½ΣYⱼ²/(v₀² + vⱼ²)] f dv₀² / ∫Π(v₀² + vⱼ²)^{−1/2} exp[−½ΣYⱼ²/(v₀² + vⱼ²)] f dv₀²,
but no magical f exists that permits a simple explicit computation. As a practical matter, taking a uniform discrete prior on v₀² from 0 to 2 maxᵢ(1 + Yᵢ²) in 100 steps should give a reasonable Bayes estimate of 1/(v₀² + vᵢ²). [For a continuous f, the above integrals will have to be approximated as if the prior were discrete, anyway.]
A simple alternative to a Bayes procedure uses P[Yᵢ² | v₀²] = v₀² + vᵢ², P[ΣYᵢ² | v₀²] = nv₀² + Σvᵢ², so that v₀² is estimated unbiasedly by Σ(Yᵢ² − vᵢ²)/n. This estimate is sometimes embarrassed by being negative, and may not lead to a good estimate of 1/(v₀² + vᵢ²).
A slightly better non-Bayesian method is maximum likelihood, which finds v₀² to maximize −Σ log(vᵢ² + v₀²) − ΣYᵢ²/(vᵢ² + v₀²). The maximum occurs at 0, or at a solution of the equation Σ(Yᵢ² − vᵢ² − v₀²)/(vᵢ² + v₀²)² = 0; thus v₀² is a weighted average of the Yᵢ² − vᵢ² with weights inversely proportional to the variances of the Yᵢ² given v₀². However the solution may not be unique, and checking a spectrum of v₀² values is about as difficult as doing a Bayes approximate integration.
The above procedures are not known to be uniformly better than the straight estimates Yᵢ, which have sum of squared error loss Σvᵢ². Given v₀², the loss of the Bayes estimate is Σ[vᵢ²v₀⁴/(v₀² + vᵢ²)² + vᵢ⁴Xᵢ²/(v₀² + vᵢ²)²], which is less than Σvᵢ² if
Σvᵢ⁴Xᵢ²/(v₀² + vᵢ²)² < Σvᵢ⁴(2v₀² + vᵢ²)/(v₀² + vᵢ²)².
This condition is analogous to one given in 9.3; it will always be satisfied for v₀² large enough. This suggests that an estimate beating Yᵢ might be obtained by overestimating v₀². Of course, if loss is measured by P[Σ(X̂ᵢ − Xᵢ)²/vᵢ² | X], the problem may be transformed to one in which all the vᵢ² are equal, and Stein's estimate and unitary Bayes estimates exist beating Yᵢ. Brown (1966) shows the estimate Yᵢ to be inadmissible for a large class of loss functions; better estimators are given by Brandwein and Strawderman (1978).
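The discrete-prior computation can be sketched as follows (the data, variances, and grid range are illustrative assumptions):

```python
# Approximate Bayes estimate of 1/(v0sq + v_i^2) under a uniform discrete
# prior on v0sq in 100 steps; Y and the known v_i^2 are made up.
import math

Y = [2.1, -0.7, 1.4, 3.0, -0.2, 0.9]
v = [1.0, 0.5, 1.5, 1.0, 2.0, 0.8]          # known v_i^2

hi = 2 * max(1 + yi * yi for yi in Y)
grid = [(j + 0.5) * hi / 100 for j in range(100)]

def loglik(v0sq):
    return sum(-0.5 * math.log(v0sq + vj) - 0.5 * yj * yj / (v0sq + vj)
               for yj, vj in zip(Y, v))

w = [math.exp(loglik(v0sq)) for v0sq in grid]
total = sum(w)
# posterior mean of 1/(v0sq + v_1^2) for the first coordinate:
inv_est = sum(wi / (v0sq + v[0]) for wi, v0sq in zip(w, grid)) / total
shrunk0 = Y[0] * (1 - v[0] * inv_est)       # Bayes estimate of X_1
```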
9.7. Regression

Suppose Y | X ~ N(AX, σ²Iₙ), X ~ N(0, σ₀²Iₚ), where Y is n × 1, A is n × p, X is p × 1, and Iₙ denotes the n × n identity matrix. Then
X | Y ~ N[(A′A/σ² + I/σ₀²)^{−1}A′Y/σ², (A′A/σ² + I/σ₀²)^{−1}].
The estimate P(X | Y) of X is often advocated for purely computational reasons, to guard against singularity or near-singularity of A′A; Hoerl and Kennard (1970). As in 9.6, it is difficult to estimate σ₀² by a simple Bayes procedure. It is tempting to use the unbiased estimate
σ̂₀² = [Y′A(A′A)^{−1}A′Y − pσ²]/trace(A′A),
but this is dangerous because it might be negative.
The maximum likelihood estimate for σ₀², assuming σ² known, minimizes log|σ₀²AA′ + σ²I| + Y′(σ₀²AA′ + σ²I)^{−1}Y. This looks nasty, but when AA′ is diagonalized, it reduces to the likelihood expression in 9.6.
See Lindley and Smith (1972).
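The posterior mean (the ridge-type estimate) can be sketched for a tiny design without any linear algebra library (the design matrix, data, and variances below are illustrative assumptions):

```python
# Posterior mean (A'A/s2 + I/s0sq)^(-1) A'Y/s2 for a small p = 2 design,
# solving the 2x2 system by Cramer's rule; all numbers are made up.
s2, s0sq = 1.0, 4.0
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]     # n = 4, p = 2
Y = [0.9, 2.1, 2.9, 4.2]

# M = A'A/s2 + I/s0sq and b = A'Y/s2
M = [[sum(a[i] * a[j] for a in A) / s2 + (1 / s0sq if i == j else 0.0)
      for j in range(2)] for i in range(2)]
b = [sum(a[i] * y for a, y in zip(A, Y)) / s2 for i in range(2)]

det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
x_hat = [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
         (M[0][0] * b[1] - M[1][0] * b[0]) / det]
```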
9.8. Many Means, Unknown Variance

Let Yᵢ | Xᵢ ~ N(Xᵢ, σ²), Xᵢ ~ N(0, σ₀²), i = 1, 2, …, n, independently for each i, and suppose there is an independent estimate of variance S, with S ~ σ²χ²ₖ. Such a situation arises in regression problems. Given σ² and σ₀²,
P[X | Y] = Y[1 − σ²/(σ² + σ₀²)].
The density of S, ΣYᵢ² given σ² and σ₀² is proportional to
[S^{k/2−1}/(σ²)^{k/2}] exp(−½S/σ²) · [(ΣYᵢ²)^{n/2−1}/(σ² + σ₀²)^{n/2}] exp[−½ΣYᵢ²/(σ² + σ₀²)].
We may estimate σ² and σ₀² unbiasedly by solving S = kσ², ΣYᵢ² = n(σ² + σ₀²), but it is better to estimate the coefficient σ²/(σ² + σ₀²) unbiasedly by [(n − 2)/k] S/ΣYᵢ²; the estimator Yᵢ[1 − ((n − 2)/k) S/ΣYᵢ²] beats Yᵢ, from Baranchik (1971). Even so the estimator can occasionally give foolish results, with the coefficient of Yᵢ negative.
A maximum likelihood procedure gives the same results as the unbiased method S = kσ², ΣYᵢ² = n(σ² + σ₀²) except when σ₀² is estimated negative; in that case σ₀² is estimated to be zero, and σ² is estimated by (S + ΣYᵢ²)/(n + k).
For the prior density (σ² + σ₀²)^{α/2−1}(σ²)^{β/2−1}, from 9.2, σ⁻² ~ χ²_{k−β}/S and (σ² + σ₀²)^{−1} ~ χ²_{n−α}/ΣYᵢ², where the χ²_{k−β}, χ²_{n−α} are sampled from independent chi-squares but accepted only if σ⁻² ≥ (σ² + σ₀²)^{−1}. Thus σ²/(σ² + σ₀²) ~ (S/ΣYᵢ²)χ²_{n−α}/χ²_{k−β} constrained not to exceed 1, and
P[σ²/(σ² + σ₀²) | Y] = (S/ΣYᵢ²) P[χ²_{n−α}/χ²_{k−β} | χ²_{n−α}/χ²_{k−β} ≤ ΣYᵢ²/S].
The computation is an incomplete beta integral. From Baranchik (1971), the estimator Yᵢ[1 − (S/ΣYᵢ²) r(ΣYᵢ²/S)] beats Yᵢ if r is non-decreasing, r ≤ 2(n − 2)/(k + 2). Here r is obviously non-decreasing, and r ≤ P[χ²_{n−α}/χ²_{k−β}] = (n − α)/(k − β − 2). Thus the posterior mean beats Yᵢ if (n − α)/(k − β − 2) ≤ 2(n − 2)/(k + 2). (These estimates are not Bayes because the loss is not integrable. For example, when α = 0, β = 0 the condition is satisfied for no k, n; when α = 2, β = −4, it is satisfied for n ≥ 3.)
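A simulation sketch of the estimator with an independent variance estimate (the means, σ = 1, and the sizes n, k are illustrative assumptions):

```python
# Shrinkage Y_i[1 - ((n-2)/k) S / sum Y_j^2] with S ~ sigma^2 chi^2_k,
# compared to the straight estimate; sigma = 1, means are made up.
import random

random.seed(7)
n, k, reps = 10, 20, 20000
X = [0.3] * n                              # illustrative means
loss_straight = loss_shrunk = 0.0
for _ in range(reps):
    Y = [random.gauss(x, 1.0) for x in X]
    S = sum(random.gauss(0, 1.0) ** 2 for _ in range(k))
    SY = sum(y * y for y in Y)
    g = 1.0 - ((n - 2) / k) * S / SY
    loss_straight += sum((y - x) ** 2 for y, x in zip(Y, X))
    loss_shrunk += sum((g * y - x) ** 2 for y, x in zip(Y, X))
```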
9.9. Variance Components, One Way Analysis of Variance

Suppose that a number of normal samples estimate the means X₁, …, Xₙ; for the jth sample
Yᵢⱼ ~ N(Xⱼ, σ²), i = 1, …, m.
The Xⱼ's are assumed to be sampled from N(X₀, σ₀²). Finally X₀ ~ N(0, σ₁²). Since Ȳⱼ ~ N[Xⱼ, σ²/m], this is essentially the same situation considered in 9.3. Given σ₁², σ², σ₀² there will be posterior mean estimates of the Xⱼ. In practice, it is necessary to estimate the "variance components" σ₁², σ², σ₀² somehow, and they are of interest in themselves to indicate how important between group effects (represented by σ₀²) and within group effects (represented by σ²) are. Here
ΣⱼΣᵢ(Yᵢⱼ − Ȳⱼ)² ~ σ²χ²_{n(m−1)},
Σⱼ(Ȳⱼ − Ȳ)² ~ (σ²/m + σ₀²)χ²_{n−1},
independently. (The distributions are not so simple if there are unequal numbers in the different samples.) Unbiased estimates of σ², σ₀² and σ₁² may easily be constructed from linear combinations of the sums of squares in Y, but the estimates are inadmissible because they may be negative. Maximum likelihood gives the same estimates if the solutions to the equations are positive.
For a prior uniform in log σ², log(σ²/m + σ₀²) and log(σ²/mn + σ₀²/n + σ₁²), the posterior distribution of σ², σ²/m + σ₀², σ²/mn + σ₀²/n + σ₁² is ΣΣ(Yᵢⱼ − Ȳⱼ)²/χ²_{n(m−1)}, Σ(Ȳⱼ − Ȳ)²/χ²_{n−1}, Ȳ²/χ²₁, where the chi-squares are taken independently, but accepted only if the appropriate inequalities hold between the three variables. Computation of posterior means would require formidable numerical integrations in three dimensions. Similar considerations arise in estimating variance components for more complicated analysis of variance models.
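The unbiased estimates built from the two sums of squares can be sketched on simulated data (group counts and true components are illustrative assumptions):

```python
# Unbiased variance components from a balanced one-way layout:
# sigma^2_hat = SSW/(n(m-1)); sigma0^2_hat = SSB/(n-1) - sigma^2_hat/m.
# True values assumed for the check: sigma^2 = 1, sigma0^2 = 4.
import random

random.seed(8)
n, m = 500, 20                     # n groups, m observations per group
sigma, sigma0 = 1.0, 2.0
groups = []
for _ in range(n):
    Xj = random.gauss(0.0, sigma0)
    groups.append([random.gauss(Xj, sigma) for _ in range(m)])

means = [sum(g) / m for g in groups]
grand = sum(means) / n
ssw = sum(sum((y - mj) ** 2 for y in g) for g, mj in zip(groups, means))
ssb = sum((mj - grand) ** 2 for mj in means)
s2_hat = ssw / (n * (m - 1))           # estimates sigma^2 = 1
s02_hat = ssb / (n - 1) - s2_hat / m   # estimates sigma0^2 = 4
```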
9.10. Problems

P1. A surveyor, poor but honest, measures the three angles (θ₁, θ₂, θ₃) of a triangle with independent errors N(0, 1). The measured angles are θ₁ = 63°, θ₂ = 31°, θ₃ = 92°. For a suitable prior on θ, find the posterior distributions of each of θ₁, θ₂, θ₃ given the data. [The true values should add to 180°.]

P2. In football, the scoring difference between team i and team j is distributed as N[μᵢ − μⱼ, σ²]. The prior distributions at the beginning of a season are, independently,
Yale: μ₁ ~ N(0, σ²)
Harvard: μ₂ ~ N(0, σ²)
Princeton: μ₃ ~ N(0, σ²)
Dartmouth: μ₄ ~ N(6, σ²)
Game scores are: Harvard 13–Princeton 6; Princeton 27–Dartmouth 20; Princeton 21–Yale 3.
Compute the probability that Harvard will beat Yale, given the observed scores.

P3. For Yᵢ ~ N(Xᵢ, 1) independent, Xᵢ ~ N(X₀, σ₀²) independent, assume g(X₀, σ₀²) = 1/(σ₀² + 1). Find the posterior mean of Xᵢ given Y₁, …, Yₙ.
P4. Votes for the Democratic candidate for President:

South:       21 29 30 27 21
Central:     43 47 42
New England: 61 62 65

Construct a model II analysis of variance, and estimate variance components, using unbiased estimates and Bayes estimates.

P5. Show that the following estimate in the Stein problem, Yᵢ ~ N(θᵢ, 1), is Bayes and beats Yᵢ. Show that the multiplier is never negative.
9.11. References

Baranchik, A. J. (1970), A family of minimax estimators of the mean of a multivariate normal distribution, Ann. Math. Statist. 41, 642–645.
Brandwein, A. R. and Strawderman, W. E. (1978), Minimax estimation of location parameters for spherically symmetric unimodal distributions under quadratic loss, Ann. Statist. 6, 377–416.
Brown, L. D. (1966), On the admissibility of invariant estimators of one or more location parameters, Ann. Math. Statist. 37, 1083–1136.
Efron, B. and Morris, C. (1973), Stein's estimation rule and its competitors—an empirical Bayes approach, J. Amer. Statist. Assoc. 68, 117–130.
Hoerl, A. E. and Kennard, R. W. (1970), Ridge regression: biased estimation for non-orthogonal problems, Technometrics 12, 69–82.
James, W. and Stein, C. (1961), Estimation with quadratic loss, Proc. Fourth Berkeley Symposium, University of California Press, 1, 361–379.
Jeffreys, H. (1961), Theory of Probability, Cambridge University Press, Cambridge.
Lindley, D. V. and Smith, A. F. M. (1972), Bayes estimates for the linear model, J. Roy. Statist. Soc. B 34, 1–41.
Stein, C. (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal population, Proc. Third Berkeley Symposium 1, 197–206.
Strawderman, W. (1971), Proper Bayes minimax estimators of the multivariate normal mean, Ann. Math. Statist. 42, 385–388.
CHAPTER 10
The Multinomial Distribution

10.0. Introduction

A discrete random variable X takes values i = 1, 2, …, k with probabilities {pᵢ, i = 1, 2, …, k}. A sample of size n from X gives the value X = i, nᵢ times. The multivariate distribution {nᵢ, i = 1, …, k} is multinomial with parameters n, {pᵢ, i = 1, …, k}. It is ubiquitous in problems dealing with discrete data. The values 1, 2, …, k are called categories or cells.
If (n₁, n₂, …, nₖ) is multinomial n, {pᵢ}, then n₁ + n₂, n₃, …, nₖ is multinomial n, (p₁ + p₂, p₃, …, pₖ); and n₁, n₂, …, nⱼ given Σᵢ₌₁ʲnᵢ is multinomial Σᵢ₌₁ʲnᵢ, {pᵢ/Σᵢ₌₁ʲpᵢ, i = 1, …, j}.
The multinomial is obtained from k independent Poissons nᵢ with expectations λᵢ; the distribution of n₁, …, nₖ given n = Σnᵢ is multinomial with parameters n, {λᵢ/Σᵢ₌₁ᵏλᵢ, i = 1, …, k}. This fact is very convenient in formulating models and handling computations, because the Poisson nᵢ are independent.
In general, the interesting problems in asymptotics and decision theory arise when some of the pᵢ are small. For example, the usual maximum likelihood estimates of pᵢ are inadmissible under the loss Σᵢ₌₁ᵏ(dᵢ − pᵢ)²/pᵢ.
Standard families of prior distributions exist for the multinomial, but they don't work too well for many-parameter problems. It is necessary to incorporate expected similarities between the pᵢ's into the prior for many-parameter problems.
10.1. Dirichlet Priors

The multinomial X taking values i = 1, 2, …, k with probabilities pᵢ has
p(X) = Πpᵢ^{{X=i}} = exp[Σᵢ₌₁^{k−1}{X = i} log(pᵢ/pₖ) + log pₖ]
and is exponential E[μ, T, θ], where T is the vector {X = i}, i = 1, …, k − 1, θ is the vector log(pᵢ/pₖ), i = 1, …, k − 1, and μ is counting measure on 1, 2, …, k. Because of the asymmetry of this parameterization, it is often convenient to think of the multinomial as E{μ, [{X = i}, i = 1, …, k], [log pᵢ, i = 1, …, k]}, where the parameters log pᵢ are constrained to lie in a (k − 1)-dimensional subset (Σpᵢ = 1) of Rᵏ. For n observations, with nᵢ = Σⱼ{Xⱼ = i},
p[X₁, …, Xₙ] = Πpᵢ^{nᵢ}.
If {pᵢ} is uniformly distributed over the simplex pᵢ ≥ 0, Σpᵢ = 1, the posterior density given n₁, …, nₖ is Πpᵢ^{nᵢ} (n + k − 1)!/Πnᵢ!. More generally, the Dirichlet density on {pᵢ}, with respect to the uniform μ over pᵢ ≥ 0, Σpᵢ = 1, is d_α(p) = Γ(Σαᵢ)Πpᵢ^{αᵢ−1}/ΠΓ(αᵢ); the Dirichlet probability is D_α = E[μ, {log pᵢ}, {αᵢ − 1}].
The Dirichlet generalizes the beta to many dimensions. If the Uᵢ are independent gamma with densities ∝ uᵢ^{αᵢ−1} exp(−auᵢ), then {Uᵢ/Σⱼ₌₁ᵏUⱼ} is Dirichlet D_α (similarly to the multinomial being independent Poissons n₁, …, nₖ conditioned by Σnᵢ = n). If the prior density is d_α, then the posterior given n₁, …, nₖ is d_{α+n}.
D_α[p] = α/Σαᵢ,  D_α[pp′] = (αα′ + diag[α₁, …, αₖ])/[Σαᵢ(Σαᵢ + 1)].
If (p₁, …, pₖ) is Dirichlet D_α, then p₁, p₂, …, p_r, 1 − Σᵢ₌₁^r pᵢ is D_{α₁,…,α_r,Σ_{i>r}αᵢ}, and p₁, …, p_r given p_{r+1}, …, pₖ is (1 − Σ_{i>r}pᵢ)D_{α₁,…,α_r}.
10.2. Admissibility of Maximum Likelihood, Multinomial Case

The maximum likelihood estimates of pᵢ are nᵢ/n; these are posterior means for the non-unitary prior density 1/Πpᵢ. They are the only estimates that do not depend on the fineness of subdivision of the multinomial cells, Johnson (1932). If there are very many pᵢ, all probability estimates will be 0/n or 1/n, which is unsatisfactory.

Theorem (Johnson (1971)). The maximum likelihood estimator p̂ᵢ = nᵢ/n is an admissible estimator of pᵢ with loss function L(d, p) = Σᵢ₌₁ᵏ(pᵢ − dᵢ)².

PROOF. The technique of 6.3 would approximate nᵢ/n by Bayes estimates for densities Πpᵢ^{α−1}, with α → 0, but this is not effective for k > 2 because more than one of the nᵢ may be zero, and this causes irretrievable degeneracy in the
posterior densities when the corresponding pᵢ are zero. The essence of Johnson's proof is careful handling of the cases where some nᵢ are zero.
Consider first k = 2, and suppose δ has risk nowhere greater than the risk of n/n. Then
r(δ, p) = Σ_{n₁+n₂=n} (n choose n₁)[(δ₁ − p₁)² + (δ₂ − p₂)²] p₁^{n₁} p₂^{n₂} = 0 at p₁ = 0 or p₂ = 0,
since n/n has zero risk for p₁p₂ = 0. Thus δ₁(0, n) = 0, δ₂(0, n) = 1, and so δ and n/n agree when n₁ = 0 or n₁ = n.
r(δ, p) − r(n/n, p) = Σ_{0<n₁<n} (n choose n₁) Σᵢ₌₁² [(δᵢ − pᵢ)² − (nᵢ/n − pᵢ)²] p₁^{n₁} p₂^{n₂}
∫[r(δ, p) − r(n/n, p)] dp₁/(p₁p₂) = Σ_{0<n₁<n} (n choose n₁) ∫ Σᵢ₌₁² [(δᵢ − pᵢ)² − (nᵢ/n − pᵢ)²] p₁^{n₁−1} p₂^{n₂−1} dp₁ ≥ 0,
since n/n is Bayes on the prior 1/p₁p₂, with equality only if δᵢ = nᵢ/n, i = 1, 2. Thus if δ has risk no greater than n/n, it equals n/n; thus n/n is admissible.
Note that integration is possible after multiplying by 1/(p₁p₂) because the cases n₁ = 0 and n₁ = n have been eliminated.
Consider next k = 3, and suppose that δ has risk no greater than that of n/n. Letting p₁ = 0,
r(δ, p) = Σ_{n₂+n₃=n} (n choose n₂) [Σᵢ₌₂³ (δᵢ(0, n₂, n₃) − pᵢ)² + δ₁²(0, n₂, n₃)] Πᵢ₌₂,₃ pᵢ^{nᵢ}.
Since r(δ, p) ≤ r(n/n, p), then Σ_{n₂+n₃=n} (n choose n₂) Σᵢ₌₂³ (δᵢ(0, n₂, n₃) − pᵢ)² Πᵢ₌₂,₃ pᵢ^{nᵢ} is not greater than r((0, n₂/n, n₃/n); p), which implies δᵢ(0, n₂, n₃) = nᵢ/n, i = 1, 2, 3, by the two-cell case just proved. [Note that δ₁(0, n₂, n₃) = 0 since otherwise, for p₁ = 0, r(δ, p) > r(n/n, p).] Similarly, δ agrees with n/n whenever n₁ or n₂ or n₃ = 0.
∫[r(δ, p) − r(n/n, p)] dp₁ dp₂/(p₁p₂p₃) = Σ_{nᵢ>0} (n choose n₁, n₂, n₃) ∫ Σᵢ [(δᵢ − pᵢ)² − (nᵢ/n − pᵢ)²] Πpᵢ^{nᵢ−1} ≥ 0,
since n/n is Bayes for the density 1/Πpᵢ, with equality only if δᵢ = nᵢ/n. The integration is justified because nᵢ > 0. Thus if r(δ, p) ≤ r(n/n, p), δ = n/n, so n/n is admissible.
General k is handled by induction; if one of the p's is set to zero, the decision procedure with the corresponding nᵢ zero must coincide with maximum likelihood; so the difference between the two procedures need only be assessed over nᵢ > 0; and integrating with respect to 1/Πpᵢ shows that δ cannot beat n/n. [Decisions of the form: take δⱼ with probability dⱼ, have risk exceeding that of Σdⱼδⱼ, so they can't beat n/n either.] ∎
10.3. Inadmissibility of Maximum Likelihood, Poisson Case

If the nᵢ are independent Poissons with expectations λᵢ, then n₁, …, nₖ given Σnᵢ = n are multinomial with parameters λᵢ/Σλᵢ. For this reason, it is often convenient to formulate multinomial models using independent Poissons. A convenient family of prior densities for the Poisson P_λ, P_λ{n} = e^{−λ}λⁿ/n!, is the gamma G(m, a) with density aᵐλ^{m−1}e^{−aλ}/Γ(m); given observation n, the posterior density is G(m + n, a + 1).

Theorem (Clevenson and Zidek (1975)). Let nᵢ be independent Poisson with expectations λᵢ, i = 1, 2, …, k, and let aₙ = n/(n + k − 1), where n = Σnᵢ. Then for k ≥ 2
P[Σ(aₙnᵢ − λᵢ)²/λᵢ] < P[Σ(nᵢ − λᵢ)²/λᵢ]
for all λᵢ. Thus nᵢ is inadmissible as an estimate of λᵢ.

PROOF. Given n = Σᵢ₌₁ᵏ nᵢ, the nᵢ are multinomial with parameters pᵢ = λᵢ/Σλᵢ, so
P[Σ(anᵢ − λᵢ)²/λᵢ | n] = Σ[a²npᵢ(1 − pᵢ)/λᵢ + (anpᵢ − λᵢ)²/λᵢ] = a²n(k − 1)/Σλᵢ + (an − Σλᵢ)²/Σλᵢ.
The value a which minimizes this expression is Σλᵢ/(n + k − 1), which is estimated by n/(n + k − 1) = aₙ. Set Λ = Σλᵢ. Then
P[Σ(aₙnᵢ − λᵢ)²/λᵢ] = P[aₙ²(n(k − 1) + n²)]/Λ − 2P[aₙn] + Λ = P[n³/(n + k − 1)]/Λ − 2P[n²/(n + k − 1)] + Λ,
and
n³/(n + k − 1) − 2n²Λ/(n + k − 1) = n² − (2Λ + k − 1)n + (k − 1)(2Λ + k − 1) − (k − 1)²(2Λ + k − 1)/(n + k − 1),
so that
P[(n³ − 2n²Λ)/(n + k − 1)] ≤ Λ² + Λ + (2Λ + k − 1)(k − 1 − Λ) − (k − 1)²(2Λ + k − 1)/(Λ + k − 1),
100
10. The Multinomial Distribution
since X and I/X are negatively correlated.
PO)ann j
-
.A. j )2/AJ < 2.A. + 1 - (2.A. + k - 1).A./(.A. + k - 1)
< k - (k - 1)2/(.A. + k - 1) < k = PO::<n j
-
0
),Y/.A.J.
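A quick Monte Carlo check of the theorem; the λ values below are arbitrary test points, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 2.0, 3.0, 4.0])     # arbitrary lambda_i
k, reps = len(lam), 200_000

n = rng.poisson(lam, size=(reps, k))
tot = n.sum(axis=1)
a_n = tot / (tot + k - 1)                      # shrinkage factor n/(n + k - 1)
loss_mle = ((n - lam) ** 2 / lam).sum(axis=1)  # risk of n_i is exactly k
loss_cz = ((a_n[:, None] * n - lam) ** 2 / lam).sum(axis=1)

print(loss_mle.mean(), loss_cz.mean())
```

The first average should be near k = 5 and the second strictly smaller, consistent with the bound k − (k − 1)²/(Λ + k − 1).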
10.4. Selection of Dirichlet Priors

Jeffreys's prior is D_{1/2, 1/2, …, 1/2}, giving density ∝ Πp_j^{−1/2} and posterior means
(n_j + 1/2)/(Σn_j + k/2); the cell estimate depends significantly on k, so that
if other cells are subdivided into, say, 100 more cells, a given estimate is
substantially reduced. Perks (1947) suggests Πp_j^{1/k − 1}, which gives estimates
(n_j + 1/k)/(Σn_j + 1). Following the binomial case, it is useful to consider
the family of prior densities Πp_j^{α_j − 1}, α_j > 0, Σα_j = 1, which gives ranges of
estimates unaffected by amalgamation or subdivision of cells; these are
analogous to confidence priors in the binomial case.
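The sensitivity to subdivision can be seen numerically; a small sketch (the counts are invented) comparing Jeffreys's prior with the unit-total-weight family:

```python
import numpy as np

def cell_estimate(counts, total_weight):
    # posterior mean of cell 0 under a symmetric Dirichlet prior whose
    # parameters sum to total_weight
    counts = np.asarray(counts, float)
    a = total_weight / len(counts)
    return (counts[0] + a) / (counts.sum() + total_weight)

coarse = [3, 7, 5]            # k = 3 cells
fine = [3] + [1] * 12         # same first cell; the other cells subdivided

for counts in (coarse, fine):
    k = len(counts)
    jeffreys = cell_estimate(counts, k / 2)   # alpha_j = 1/2 each
    perks = cell_estimate(counts, 1.0)        # alpha_j = 1/k, total weight 1
    print(k, round(jeffreys, 4), round(perks, 4))
```

The Jeffreys estimate of the first cell drops substantially under subdivision, while the total-weight-one estimate moves only slightly.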
Another possibility is to estimate the Dirichlet prior; assume the prior
∝ Πp_j^{α−1}; then P[n_j] = n/k, P[n_j²] = n/k + n(n − 1)(α + 1)/(k(kα + 1)), and

P[Σn_j²] = n + n(n − 1)(α + 1)/(kα + 1).

Thus α is estimated by solving Σn_j² = n + n(n − 1)(α + 1)/(kα + 1), or equivalently
(1/(k − 1))Σ(n_j − n/k)² = n(kα + n)/((kα + 1)k). There may be no
non-negative solution α if the n_j have small enough variance; set α = ∞ if
(1/(k − 1))Σ(n_j − n/k)² ≤ n/k.
A more satisfactory (and more difficult) procedure due to Good (1965)
selects α to maximize the likelihood

P(n|α) = [Γ(kα)Γ(n + 1)/(Γ(α)^k Γ(n + αk))] Π Γ(n_j + α)/Γ(n_j + 1).

Good (1975) shows that P(n|α) is maximized by α = ∞ when the chi-square
goodness of fit statistic X = (k/n)Σ(n_j − n/k)² ≤ k − 1. He suggests
using G = sup_α [2 log P(n|α)/P(n|∞)]^{1/2} as a test statistic for deciding
p_j = 1/k, where P(n|∞) is the likelihood under p_j = 1/k. In Good (1967) it is asserted that G² is distributed
as χ₁² given that G > 0, asymptotically as n → ∞ for k fixed. However,
asymptotically, the expansions for gamma functions (Abramowitz and
Stegun, 1964, p. 257) show

G(α) = log P(n|α)/P(n|∞) ≈ ½(k − 1) log(αk/(n + αk)) + ½Xn/(n + αk),

which is maximized by α = ∞ if X ≤ k − 1, and α = (n − n/k)/(X − k + 1)
if X > k − 1, which is the same as the estimate based on the first two moments.
Thus sup_α G(α) ≈ {½(X − k + 1) − ½(k − 1) log[1 + (X − k + 1)/(k − 1)]}⁺, which
is a monotone function of X; its asymptotic distribution is determined
from the asymptotic distribution of X, which is χ²_{k−1}; and Good's test
statistic G is just a monotone function of X asymptotically, without the
asymptotic behavior stated in Good (1967). Of course, a complete Bayesian
notes that P[p_i|n, α] = (n_i + α)/(n + kα); specifies a prior density for α; and
computes P[p_i|n] = P[(n_i + α)/(n + kα)|n], averaging over the posterior
density of α given n. The likelihood P(n|α) is messy enough to suggest no
simple closed form expression will be available. Good (1967) uses a log-Cauchy distribution for α.
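Good's maximum likelihood choice of α is easy to compute numerically; a minimal sketch using only the standard library (the counts are invented, and a grid search stands in for a proper optimizer):

```python
from math import lgamma

def log_marginal(alpha, counts):
    # log P(n | alpha) for a symmetric Dirichlet(alpha) prior, dropping the
    # multinomial coefficient, which does not involve alpha
    n, k = sum(counts), len(counts)
    return (lgamma(k * alpha) - k * lgamma(alpha)
            - lgamma(n + k * alpha) + sum(lgamma(c + alpha) for c in counts))

counts = [12, 3, 7, 1, 9]      # invented counts with X > k - 1
grid = [10 ** (j / 50) for j in range(-100, 201)]   # alpha from 0.01 to 10**4
alpha_hat = max(grid, key=lambda a: log_marginal(a, counts))
print(alpha_hat)
```

For these counts X exceeds k − 1, so the maximizing α is finite, close to the moment estimate (n − n/k)/(X − k + 1).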
10.5. Two Stage Poisson Models

Suppose n_i are Poisson λ_i, and the λ_i are drawn from some distribution P₀,
as in 9.4. The n_i are sampled from the discrete distribution with density p₀,

p₀(n) = P₀[λⁿe^{−λ}/n!].

The posterior mean of λ_i given n_i is P₀[λ^{n_i+1}e^{−λ}/n_i!]/P₀[λ^{n_i}e^{−λ}/n_i!],
which equals (n_i + 1)p₀(n_i + 1)/p₀(n_i).
Thus we can compute posterior means (and variances and other moments)
if we know p₀. If P₀ is not completely known, as good Bayesians, we would
need a prior distribution for it; the whole data set n₁, …, n_k would
determine a posterior distribution for P₀ and the estimate P[λ_i|n] =
(n_i + 1)P[p₀(n_i + 1)/p₀(n_i)|n].
For many Poissons n_i, we might estimate p₀(n) by #[n_i = n]/#n_i,
the maximum likelihood estimate; however this does not take advantage of
smoothness induced by p₀(n) = P₀[λⁿe^{−λ}/n!]. See Robbins (1956).
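Robbins's empirical Bayes rule can be sketched numerically; the gamma mixing distribution below is invented, and for it the exact posterior mean is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(1)
# two-stage sample: lambda_i from a gamma "prior" (unknown to the estimator),
# then n_i Poisson(lambda_i); shape and scale are illustrative choices
lam = rng.gamma(shape=2.0, scale=1.5, size=20_000)
obs = rng.poisson(lam)

counts = np.bincount(obs, minlength=obs.max() + 2)

def robbins(n):
    # empirical Bayes posterior mean (n+1) p0(n+1)/p0(n), with p0 estimated
    # by the observed frequencies of the whole data set
    return (n + 1) * counts[n + 1] / max(counts[n], 1)

# exact posterior mean under this gamma prior is (n + 2.0) * 1.5 / 2.5
for m in range(4):
    print(m, robbins(m), (m + 2.0) * 1.5 / 2.5)
```

The frequency-based estimate tracks the exact posterior mean well at small n, though as the text notes it ignores the smoothness of p₀.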
A special case is P₀ gamma with density a^α λ^{α−1} e^{−aλ}/Γ(α). Then

p₀(n) = a^α (1 + a)^{−(α+n)} (α + n − 1 choose n),

the negative binomial; and α, a may be estimated from the observed n_i by
maximum likelihood or by the moments Pn = α/a, P(n − Pn)² = α(1 + a)/a².
Also P[λ_i|n] = P[(n_i + α)/(1 + a)|n]; if only we could think of a nice prior
distribution g of α, a, the posterior density

∝ g(α, a) Π_{i=1}^k a^α (1 + a)^{−(α+n_i)} (α + n_i − 1 choose n_i)

could be used to obtain a Bayes estimate.
10.6. Multinomials with Clusters

In previous sections, all cells have been treated symmetrically, but it will
frequently happen that some groups of probabilities p_i will be expected to
be similar. One possibility is that the cells are grouped in clusters
C₁, C₂, …, C_J, and then the prior density might be taken to be Π_j(Σ_{i∈C_j} p_i)^{α_j − 1};
this is as if we had made previous observations in which α_j individuals
occurred in the cluster C_j. If the clusters are hierarchical, so that C_i and
C_j overlap only if C_i ⊂ C_j or C_j ⊂ C_i, this model may be reformulated as a
number of Dirichlet priors on conditional probabilities, and probability
estimates may be simply computed. Suppose for example the multinomial is
a 2 × 3 contingency table:

n₁₁ n₁₂ n₁₃
n₂₁ n₂₂ n₂₃

Let C₁ = (11, 12, 13), C₂ = (21, 22, 23), C_ij = {ij}. The prior density

Π_k(Σ_{ij∈C_k} p_ij)^{α_k − 1} = (p₁₁ + p₁₂ + p₁₃)^{α₁ − 1}(p₂₁ + p₂₂ + p₂₃)^{α₂ − 1} Π p_ij^{α_ij − 1}

may be transformed to a density on the marginal probabilities p₁. =
p₁₁ + p₁₂ + p₁₃, p₂. = 1 − p₁., and conditional probabilities p_{j|i} = p_ij/p_i.:

(p₁.)^{α₁ − 1}(p₂.)^{α₂ − 1} Π p_ij^{α_ij − 1} dp₁₁ dp₁₂ dp₁₃ dp₂₁ dp₂₂
= p₁.^{α₁ + 1 + Σ_j(α₁ⱼ − 1)} p₂.^{α₂ + 1 + Σ_j(α₂ⱼ − 1)} Π p_{j|i}^{α_ij − 1} dp₁. dp_{1|1} dp_{2|1} dp_{1|2} dp_{2|2}.

(Note that only five parameters appear in the differential element, since
Σp_ij = 1.) The advantage of this formulation is that the marginal and conditional
probabilities are independent, and so it is easy to do posterior computations.
For example P[p_ij|n] = P[p_i.|n]P[p_{j|i}|n]. See Good (1965) for other methods
of generating priors for contingency tables.
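A minimal numerical sketch of these posterior computations, using independent Dirichlet priors on the row margin and on the within-row conditionals as the factorization permits; the counts and hyperparameters are invented:

```python
import numpy as np

n = np.array([[8.0, 3.0, 1.0],
              [2.0, 5.0, 6.0]])       # invented 2 x 3 table
alpha_row = np.array([1.0, 1.0])      # cluster-level prior weights
alpha_cell = np.ones((2, 3))          # within-row prior weights

row_tot = n.sum(axis=1)
# marginal and conditional probabilities are independent a posteriori
p_row = (row_tot + alpha_row) / (n.sum() + alpha_row.sum())
p_cond = (n + alpha_cell) / (row_tot + alpha_cell.sum(axis=1))[:, None]
p_cell = p_row[:, None] * p_cond      # P[p_ij | n] = P[p_i. | n] P[p_j|i | n]
print(p_cell)
```

The cell estimates automatically sum to one, and each row of estimates sums to the corresponding marginal estimate.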
10.7. Multinomials with Similarities

It may happen that the cells of a multinomial are ordered in such a way
that neighboring probabilities are likely to be close. The prior density
ensures that neighboring probabilities are not too different. Pioneering
work in this area occurs in Good and Gaskins (1971, 1980), studying density
functions. For multinomial probabilities, Leonard (1973) presents the following
prior density. Let [log p_i] be multivariate normal, subject to the constraint Σp_i = 1;
neighboring p_i's are required to be highly correlated. A similar prior density
is considered by Simonoff (1980), the density exp[−A Σ_{i=1}^{k−1} log²(p_i/p_{i+1})];
the penalty function Σ log²(p_i/p_{i+1}) ensures that p_i and p_{i+1} must be close.

In order to avoid the pesky dependence Σp_i = 1, let us assume n_i Poisson
with expectation λ_i, and that the log λ_i form a normal autoregressive process
with lag one,

log λ_{i+1} − μ = ρ(log λ_i − μ) + σε_{i+1},  ε_i independent N(0, 1).
A simple limiting case, with ρ = 1, has log λ₁ uniform, log λ_{i+1}/λ_i independent
N(0, σ²). The posterior density with respect to Lebesgue measure on {log λ_i}
is ∝ exp[Σn_i log λ_i − ½Σ log²(λ_{i+1}/λ_i)/σ² − Σλ_i]. It is difficult to compute
posterior means, but the posterior mode is easier to compute: the function
to be maximized is called a penalized likelihood function by Good and
Gaskins (1971), with penalty function Σ log²(λ_{i+1}/λ_i) requiring neighboring
λ's to be close. It is not feasible to estimate σ² in the obvious way, to
maximize the posterior density, because the inaccessible constant of proportionality includes σ.

The modal value of u_i = log λ_i satisfies

n_i − (2u_i − u_{i−1} − u_{i+1})/σ² − e^{u_i} = 0.

Concavity of Σn_iu_i − Σ(u_{i+1} − u_i)²/2σ² − Σe^{u_i} guarantees the existence
and uniqueness of a modal value. The solution may be found by a Newton-Raphson technique. Simonoff (1980) shows that for large k, with n_i moderate,
the estimates λ̂_i are weighted averages of the n_j for j near i, giving asymptotic
behavior similar to kernel estimates. These techniques are related to spline
fitting methods used in regression and density estimation; see for example
Wahba's remarks in the discussion of Stone (1977).

An alternative prior on log λ_i is exp[−A Σ|log(λ_{i+1}/λ_i)|], which specifies
the absolute differences to be exponentially distributed. The mode u_i =
log λ̂_i maximizes Σn_iu_i − A Σ|u_{i+1} − u_i| − Σe^{u_i}; thus e^{u_i} = n_i − 2A, n_i + 2A,
e^{u_{i+1}}, or e^{u_{i−1}}. The solution may be described by a number of intervals (I_r, J_r)
such that u_i is constant for I_r ≤ i ≤ J_r. If u_{I_r − 1} < u_{I_r} < u_{J_r + 1}, then
(J_r − I_r + 1)e^{u_{I_r}} = Σ_{I_r ≤ i ≤ J_r} n_i; this is equivalent to amalgamating the cells i,
I_r ≤ i ≤ J_r. Search for the optimal intervals requires techniques similar to
Barlow et al. (1972). This method clusters the cells which have similar n_i and
is clearer in its action than the normal prior considered previously. Its
asymptotic properties are unknown.
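The posterior mode under the normal smoothing prior can be found by the Newton-Raphson technique the text suggests; a minimal sketch, with invented counts and an invented σ²:

```python
import numpy as np

def posterior_mode(n, sigma2, iters=50):
    # Newton-Raphson for the concave penalized likelihood
    #   sum n_i u_i - sum (u_{i+1} - u_i)^2 / (2 sigma2) - sum exp(u_i)
    # in u_i = log lambda_i (free boundaries)
    n = np.asarray(n, float)
    k = len(n)
    D = np.zeros((k, k))                # matrix of the quadratic penalty
    for i in range(k - 1):
        D[i, i] += 1.0; D[i + 1, i + 1] += 1.0
        D[i, i + 1] -= 1.0; D[i + 1, i] -= 1.0
    D /= sigma2
    u = np.log(n + 0.5)                 # crude starting value
    for _ in range(iters):
        grad = n - D @ u - np.exp(u)    # the modal equations of the text
        hess = -D - np.diag(np.exp(u))
        u -= np.linalg.solve(hess, grad)
    return np.exp(u)

lam_hat = posterior_mode([0, 2, 1, 7, 9, 8, 3, 1], sigma2=0.5)
print(lam_hat)
```

At the mode the stationarity equations force Σλ̂_i = Σn_i, and each λ̂_i is pulled toward its neighbors, as Simonoff's kernel-like behavior suggests.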
10.8. Contingency Tables

The entries in a contingency table may be regarded as multinomial or
Poisson; the special structure of the contingency table requires special
priors for the parameters. Good (1965) has many useful ideas for such priors.
See also Leonard (1975).

For a two way contingency table with entries n_ij, 1 ≤ i ≤ I, 1 ≤ j ≤ J, and
probabilities {p_ij}, we often expect independence p_ij = p_i. p._j, where p_i. = Σ_j p_ij,
p._j = Σ_i p_ij. Good considers putting a prior density on the parameters
[p_ij/(p_i. p._j)], which has the effect of moving all parameter estimates p̂_ij towards
independence.

For large tables with ordered rows and columns, the prior density in the
Poisson model, exp[−A Σ log²(λ_ij λ_{i+1 j+1}/(λ_{i j+1} λ_{i+1 j}))], with respect to
log λ_ij Lebesgue, encourages each 2 × 2 table of neighboring cells to be
nearly independent. The modal posterior estimates of λ are then approximate
weighted averages of counts in nearby cells.

With prior density exp(−A Σ|log(λ_ij λ_{i+1 j+1}/(λ_{i j+1} λ_{i+1 j}))|), the posterior
mode requires blocks of neighboring 2 × 2 tables to be independent, and so
breaks the contingency table into a number of (unbalanced) subtables where
independence is achieved. Computations with both these techniques are
formidable.
10.9. Problems

P1. For all n_j large, find an approximate expression for the Dirichlet parameter α
maximizing the likelihood (Γ(kα)/Γ(α)^k)(Γ(n + 1)/Γ(n + αk))ΠΓ(n_j + α)/Γ(n_j + 1).

P2. Let λ_i be independent gamma variables with density a(aλ)^{α−1} exp(−aλ)/Γ(α).
Let n_i be independent Poisson with expectations λ_i. Show that {λ_i/Σλ_j} given
{n_i} has the same posterior distribution as {p_i} in the multinomial model with
Dirichlet prior density ∝ Πp_i^{α−1}.

P3. In a binomial model with n = 10, compute the mean square error of the Bayes
estimators corresponding to beta prior densities [p(1 − p)]^{α−1} for α = −1, 0, ½, 1, 10,
and sketch the risks as a function of p. [Hand computation will suffice.]
Obtain the distribution of r (the number of successes) given α, and estimate α
given r. If P₀[α = 1/2] = P₀[α = 1] = ½, find P[p|r].

E1. If p can take only the values k/N, 0 ≤ k ≤ N, show that the proportion of successes
in n trials in the binomial model is inadmissible as an estimate of p with squared
error loss, when n > N.

P4. On visiting a new cafeteria, a distinguished statistician took five cubes of sugar for
his coffee. On each wrapper was pictured a bird; of the first four, the third was a
cardinal but the other three were swallows; what bird is likely to appear on the fifth
wrapper? (See Good (1965).)
E2. In a week, books were borrowed from a library by persons in the following categories:

First year students = 6
Second year students = 10
Third year students = 7
Fourth year students = 5
Statistics faculty = 3
Undergraduates = 2
Other graduate students = 8
Other faculty = 1
Other persons = 3

Estimate the probability that the next book is borrowed by a person in each of the
above categories.
P5. For a 2 × 2 table, find a prior distribution on the probabilities p₁₁, p₁₂, p₂₁, p₂₂
so that the Bayes test for independence is Fisher's test, rejecting independence if
the first observation n₁₁ is too large or too small given n₁. and n.₁.

E3. Is the estimate p̂ = 0 admissible as an estimate of p in binomial problems, with mean
squared error loss?

Q1. Do a two stage analysis of the multinomial model analogous to the two stage
Poisson model, 10.5.

P6. In the binomial, is the maximum likelihood estimate p̂ admissible with loss (p − p̂)²/p(1 − p)
or p log(p/p̂) + (1 − p) log[(1 − p)/(1 − p̂)]? Assume 0 < p < 1.

P7. Johnson (1971). In the binomial problem, given r successes in n trials, admissible
estimates of p are of the form:
p̂ = 0 for r ≤ L,
p̂ = P₀[p^{r−L}(1 − p)^{U−r−1}]/P₀[p^{r−L−1}(1 − p)^{U−r−1}] for L < r < U,
p̂ = 1 for r ≥ U,
where −1 ≤ L < U ≤ n + 1, and P₀ is not carried by {0, 1}.

P8. (Clevenson and Zidek, 1975.) For p independent Poissons n_i with means λ_i, show
that δ_i(n) = (1 − (β + p − 1)/(Σn_j + β + p − 1))n_i beats n as an estimate of λ, using
loss function Σ(δ_i − λ_i)²/λ_i, for 1 ≤ β ≤ p − 1.

P9. In the multinomial, show that {n_i/n} is inadmissible for {p_i}, with squared error loss,
if the parameter values satisfy p_i ≥ ε, i = 1, 2, …, k.
10.10. References
Abramowitz, M. and Stegun, I. A. (1964), Handbook of Mathematical Functions.
U.S. Department of Commerce.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972), Statistical
Inference under Order Restrictions. New York: John Wiley.
Clevenson, M. L. and Zidek, J. V. (1975), Simultaneous estimation of the means of
independent Poisson Laws, J. Am. Stat. Ass. 70, 698-705.
Good, I. J. (1965), The Estimation of Probabilities. Cambridge, Mass: M.I.T. Press.
--(1967), A Bayesian significance test for multinomial distributions, J. Roy. Statist.
Soc. B 29,399-431.
--(1975), The Bayes factor against equiprobability of a multinomial population
using a symmetric Dirichlet prior, Annals of Statistics, 3, 246-250.
--and Gaskins, R. (1971), Nonparametric roughness penalties for probability
densities, Biometrika 58, 255-277.
--(1980). Density estimation and bump hunting by the penalized likelihood method
exemplified by scattering and meteorite data, J. Am. Stat. Ass. 75, 42-73.
Johnson, B. M. (1971), On the admissible estimators for certain fixed sample binomial
problems, Annals of Math. Statistics 42, 1579-1587.
Johnson, W. E. (1932), Appendix to probability: deductive and inductive problems,
Mind 41,421-423.
Leonard, T. (1973), A Bayesian method for histograms, Biometrika 60,297-308.
--(1975), Bayesian estimation methods for two-way contingency tables, J. Roy.
Stat. Soc. B 37,23-37.
Perks, W. (1947), Some observations on inverse probability including a new indifference
rule, J. Inst. Actuaries 73,285-312.
Robbins, H. E. (1956), An empirical Bayes approach to statistics, Proc. III Berkeley
Symposium, 157-163.
Simonoff, J. S. (1980), A penalty function approach to smoothing large sparse contingency tables, Ph.D. Thesis, Yale University.
Stone, C. J. (1977), Consistent non-parametric regression, Annals of Statistics 5,
595-645.
CHAPTER 11
Asymptotic Normality of Posterior Distributions
11.0. Introduction

Suppose X₁, …, X_n are independent observations from P_θ, θ ∈ R. Suppose
that P_θ has density f_θ(x) with respect to some measure ν. The maximum
likelihood estimate of θ (or the value of θ that maximizes the density of
the posterior probability relative to the prior probability), maximizing
Π_{i=1}^n f_θ(X_i), is denoted by θ̂_n. As n → ∞, Fisher established that θ̂_n is
asymptotically normal with mean θ₀ and variance (nI(θ₀))⁻¹, where θ₀ is the true
value of θ, and I(θ₀) is Fisher's information {−(d²/dθ²)P_{θ₀}[log f_θ(X)]}_{θ=θ₀}.
The asymptotic normality requires a tedious list of regularity conditions,
first promulgated by Wald.

Under almost the same conditions, with the additional requirement that
the prior density be positive and continuous in the neighborhood of θ₀,
the posterior distribution of θ given X₁, …, X_n is asymptotically normal
with mean θ̂_n and variance [nI(θ̂_n)]⁻¹.

In the same way that the posterior distribution is consistent for θ₀ under
very general conditions, it may be shown to be normal under very general
conditions; however these elegant general conditions are often more difficult
to verify than the longer maximum likelihood list.

The prior density does not affect the asymptotic distribution of θ in the
terms of O(1) or O(n^{−1/2}). It does shift the mean of the asymptotic distribution
by a term O(n⁻¹).

Similar results hold for k-dimensional parameter spaces.
11.1. A Crude Demonstration of Asymptotic
Normality

Let P₀ denote the prior probability. The posterior distribution P_X is given by

P_X Y = P₀[Πf_θ(X_i) Y(θ)]/P₀[Πf_θ(X_i)].

For θ near θ̂_n,

Σ log f_θ(X_i) = Σ log f_{θ̂_n}(X_i) + ½(θ − θ̂_n)² Σ(d²/dθ̂_n²) log f_{θ̂_n}(X_i) + small.

Assume that X₁, X₂, …, X_n are drawn from Q, not necessarily a member
of the family P_θ, θ ∈ R; if Q = P_{θ₀}, say θ₀ is the true value of θ. Then

Σ log f_θ(X_i) = Σ log f_{θ̂_n}(X_i) + ½(θ − θ̂_n)² n (d²/dθ̂_n²) Q(log f_{θ̂_n}) + small.

Let P₀ have density p₀ with respect to Lebesgue measure μ. Then
log p₀(θ) = log p₀(θ̂_n) + small for θ near θ̂_n. And P_X has density p_X with
respect to Lebesgue measure,

p_X(θ) ≈ exp[½(θ − θ̂_n)² n (d²/dθ̂_n²) Q(log f_{θ̂_n})] p_X(θ̂_n),

so that θ has the asymptotic density of a normal distribution with mean
θ̂_n and variance (nI(θ̂_n))⁻¹.
It is necessary to produce regularity conditions which will validate the
omission of various "small" terms.
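The crude approximation can be checked against an exact conjugate posterior; a sketch for the binomial with a uniform prior (the data are invented):

```python
import numpy as np
from math import lgamma

# exact Beta(r+1, n-r+1) posterior vs the N(theta_hat, [n I(theta_hat)]^{-1})
# approximation of the crude demonstration
n, r = 200, 130
theta_hat = r / n
sd = np.sqrt(theta_hat * (1 - theta_hat) / n)   # (n I(theta_hat))^{-1/2}

def beta_logpdf(t, a, b):
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * np.log(t) + (b - 1) * np.log(1 - t))

ts = theta_hat + sd * np.linspace(-1, 1, 9)
exact = np.exp(beta_logpdf(ts, r + 1, n - r + 1))
approx = np.exp(-0.5 * ((ts - theta_hat) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
print(np.max(np.abs(exact / approx - 1)))
```

Within a standard deviation of the mode the relative error is at the percent level for n this large; the omitted "small" terms show up as skewness in the tails.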
11.2. Regularity Conditions for Asymptotic
Normality

See also Walker (1969).

Theorem.
(i) Observations X₁, X₂, …, X_n are drawn from the unitary probability Q on 𝒴.
(ii) It is contemplated that X₁, …, X_n might be drawn from P_θ, some
θ ∈ R. It is assumed that P_θ has density f_θ(X) with respect to a measure
ν on 𝒴.
(iii) The function Q(log f_θ) has a unique maximum at θ = θ₀. (If Q = P_{θ₀′},
necessarily θ₀ = θ₀′.)
(iv) The prior P₀ has a density p₀ with p₀(θ₀) > 0, p₀ continuous at θ₀.
(v) P_X[|θ − θ₀| > ε] → 0 as Q, each ε > 0.
(vi) In a neighborhood of θ₀, the derivatives (d/dθ) log f_θ, (d²/dθ²) log f_θ
exist and are continuous in θ, uniformly in X.
(vii) Q[(d/dθ₀) log f_{θ₀}]² < ∞, Q((d²/dθ₀²) log f_{θ₀}) < 0.
(viii) Let θ̂_n be the maximum likelihood estimate for θ near θ₀ (necessarily
unique as n → ∞). Let φ_n = (θ − θ̂_n)[−Σ(d²/dθ²) log f_θ(X_i)]^{1/2}|_{θ=θ̂_n}. Then
the posterior density p_n of φ_n with respect to Lebesgue measure satisfies

sup_{|φ_n| ≤ K} |p_n(φ_n)/[(1/√2π) exp(−½φ_n²)] − 1| → 0

as Q, for each K > 0.
PROOF. (1) First it will be shown that Σ log f_θ(X_i) is maximized by a unique
θ̂_n in a small neighborhood of θ₀, as Q, as n → ∞, with (d/dθ̂_n)Σ log f_{θ̂_n}(X_i) = 0.

Since (d/dθ) log f_θ is continuous in θ uniformly in X, (d/dθ)Q(log f_θ) =
Q((d/dθ) log f_θ). Since Q(log f_θ) has a unique maximum at θ = θ₀,
(d/dθ₀)Q(log f_{θ₀}) = 0.

Let Q[(d²/dθ₀²) log f_{θ₀}(X_i)] = −Δ₀. Then

(1/n)Σ(d²/dθ²) log f_θ(X_i) = (1/n)Σ(d²/dθ₀²) log f_{θ₀}(X_i) + Δ,

where |Δ| < Δ₀/2 whenever |θ − θ₀| < δ, by uniform continuity, (vi). Since

(1/n)Σ(d²/dθ₀²) log f_{θ₀}(X_i) → Q[(d²/dθ₀²) log f_{θ₀}(X)] as Q,

(1/n)Σ(d²/dθ²) log f_θ(X_i) < −Δ₀/2

whenever |θ − θ₀| < δ, for all large n as Q.
Thus Σ log f_θ(X_i) has at most one maximizing value in |θ − θ₀| < δ as Q. Also

(1/n)Σ(d/dθ) log f_θ(X_i) = (1/n)Σ(d/dθ₀) log f_{θ₀}(X_i) + (θ − θ₀)(1/n)Σ(d²/dθ*²) log f_{θ*}(X_i),

where |θ − θ*|, |θ₀ − θ*| ≤ |θ − θ₀|. As n → ∞, (1/n)Σ(d/dθ₀) log f_{θ₀}(X_i) → 0
as Q. Thus

(1/n)Σ(d/dθ) log f_θ(X_i) < −(θ − θ₀)Δ₀/2 for θ₀ < θ < θ₀ + δ, as Q,
(1/n)Σ(d/dθ) log f_θ(X_i) > −(θ − θ₀)Δ₀/2 for θ₀ − δ < θ < θ₀, as Q.

Thus Σ(d/dθ) log f_θ(X_i) has a zero in |θ − θ₀| < δ as Q as n → ∞, and the
zero is unique because Σ(d²/dθ²) log f_θ(X_i) < −nΔ₀/2 in |θ − θ₀| < δ as Q;
the zero at θ = θ̂_n maximizes Σ log f_θ(X_i); since θ̂_n lies in |θ − θ₀| < δ
for n large enough, for each choice of δ, θ̂_n → θ₀ as Q.
(2) θ̂_n is asymptotically normal with mean θ₀ and variance σ²/n, where
σ⁻² = −Q[(d²/dθ₀²) log f_{θ₀}(X)]:

0 = (d/dθ̂_n)Σ log f_{θ̂_n}(X_i) = (d/dθ₀)Σ log f_{θ₀}(X_i) + (θ̂_n − θ₀)(d²/dθ₀²)Σ log f_{θ₀}(X_i) + nε(θ̂_n, X),

where ε(θ̂_n, X) → 0 as Q, from (vi).
Now (1/n)(d/dθ₀)Σ log f_{θ₀}(X_i) is asymptotically normal

N[0, (1/n)Q((d/dθ₀) log f_{θ₀})²],

and

(1/n)(d²/dθ₀²)Σ log f_{θ₀}(X_i) → Q[(d²/dθ₀²) log f_{θ₀}(X)] as Q.

Thus θ̂_n − θ₀ is asymptotically normal with mean 0 and variance σ²/n.
(3) To conclude, let θ → θ₀;

log p_X(θ)/p_X(θ̂_n) = log p₀(θ)/p₀(θ̂_n) + Σ log f_θ(X_i) − Σ log f_{θ̂_n}(X_i)
= ε(θ, θ̂_n) + ½(θ − θ̂_n)²(Σ(d²/dθ̂_n²) log f_{θ̂_n}(X_i) + nε(θ, X)),

where by (iv) and (vi), ε(θ, θ̂_n) → 0 as Q and ε(θ, X) → 0 as Q, uniformly over
|θ − θ₀| ≤ δ_n.

From (vii), (1/n)Σ(d²/dθ̂_n²) log f_{θ̂_n}(X_i) → −Δ, and φ_n is a linear transformation of θ, with φ_n(θ̂_n) = 0. Thus p_X(θ)/p_X(θ̂_n) = p_n(φ_n)/p_n(0), and

(A) log p_n(φ_n)/p_n(0) + ½φ_n² → 0 as Q, uniformly over |φ_n| ≤ K.
(B) Also log p_n(φ_n)/p_n(0) < −¼(θ − θ̂_n)²nΔ for all |θ − θ̂_n| ≤ δ_n, as Q, as
n → ∞.
(C) Finally, from (v), P_X(|θ − θ₀| > δ_n) → 0 as Q for some δ_n → 0.

It is necessary to combine facts (A), (B), (C) to determine p_n(0). From (C),

∫_{|θ−θ₀| ≤ δ_n} p_n(φ_n) dφ_n → 1.

From (B),

∫_{|θ−θ₀| ≤ δ_n, |θ−θ̂_n| > K/√n} p_n(φ_n) dφ_n ≤ p_n(0) ∫_{|θ−θ̂_n| > K/√n} exp[−¼(θ − θ̂_n)²nΔ] dφ_n.

From (A),

p_n(0)⁻¹ ∫_{|θ−θ̂_n| < K/√n} p_n(φ_n) dφ_n → ∫_{|θ−θ̂_n| < K/√n} exp(−½φ_n²) dφ_n.

Combining these relations expresses p_n(0)⁻¹ as a normal integral up to error
factors C_n and C_n′, which are bounded by 1 as K → ∞, n → ∞, so p_n(0)√2π → 1.
Thus p_n(φ_n)/[exp(−½φ_n²)/√2π] − 1 → 0 uniformly over |φ_n| ≤ K as
required. □
Notes: The conditions of the theorem look forbidding, but they are merely
those conditions which permit neglect of "small" terms in the Taylor series
expansion. It is not necessary that the X₁, …, X_n be sampled from a member
of the family P_θ, but it is necessary that a unique member P_{θ₀} be "closest"
to Q in maximizing Q[log dP_θ/dQ]; if there are θ₀, θ₀′ such that P_{θ₀} and P_{θ₀′}
are both closest, then the limiting posterior distribution should be bimodal
with modes near θ₀ and θ₀′. The regularity conditions on log f_θ near θ₀
are far stronger than is necessary. It is necessary that the prior density be
positive at θ₀, and that it be continuous at θ₀ (the conclusion of the theorem
requires p_X(θ̂_n)/p_X(θ₀) → 1 as √n(θ̂_n − θ₀) → 0, n → ∞; continuity of
(d/dθ) log f_θ(X) requires p₀(θ̂_n)/p₀(θ₀) → 1 as √n(θ̂_n − θ₀) → 0, which requires
continuity of p₀).

It is necessary that the posterior distribution concentrate at θ₀; maximum
likelihood conditions for convergence of θ̂_n to θ₀ might be given, governing
the behavior of the likelihood outside neighborhoods of θ₀, but it may be
easier to check convergence of the posterior distribution directly.
11.3. Pointwise Asymptotic Normality

Theorem.
(i) Let X₁, X₂, …, X_n be sampled from a unitary Q on 𝒴.
(ii) Let P_θ, θ ∈ R, be a family of probabilities on 𝒴 with densities f_θ with respect to some measure ν on 𝒴.
(iii) Let θ = θ₀ be a local maximum of Q(log f_θ).
(iv) Let a prior probability P₀ have density p₀ positive and continuous at θ₀.
(v) (d/dθ₀) log f_{θ₀} exists as Q and Q((1/δ) log(f_{θ₀+δ}/f_{θ₀}) − (d/dθ₀) log f_{θ₀})² → 0 as δ → 0.
(vi) Q((d/dθ₀) log f_{θ₀})² < ∞; (d²/dθ₀²)Q(log f_θ) = −1/v < 0.

Then the posterior density p_X of θ is pointwise asymptotically normal in
the neighborhood of θ₀ with mean θ₀ + (v/n)Σ(d/dθ₀) log f_{θ₀}(X_i) and variance
v/n; that is,

p_X(θ₀ + ζ√(v/n))/p_X(θ₀) − φ[ζ − √(v/n) Σ(d/dθ₀) log f_{θ₀}(X_i)]/φ[√(v/n) Σ(d/dθ₀) log f_{θ₀}(X_i)] → 0

in Q-probability for each ζ, where φ(u) = exp(−½u²)/√2π.
PROOF. Let

h_δ(X) = (1/δ) log[f_{θ₀+δ}(X)/f_{θ₀}(X)],
h₀(X) = (d/dθ₀) log f_{θ₀}(X) = lim_{δ→0} h_δ(X), defined as Q.

Let δ_n = ζ√(v/n). Then

p_X(θ₀ + δ_n)/p_X(θ₀) = [p₀(θ₀ + δ_n)/p₀(θ₀)] × Π[f_{θ₀+δ_n}(X_i)/f_{θ₀}(X_i)],
L_n(ζ) = log p_X(θ₀ + δ_n)/p_X(θ₀) = log p₀(θ₀ + δ_n)/p₀(θ₀) + Σδ_n h_{δ_n}(X_i).

From (iv), L_n(ζ) − Σδ_n h_{δ_n}(X_i) → 0 as n → ∞.
Also Σδ_n[h_{δ_n}(X_i) − h₀(X_i)] has mean nδ_nQ(h_{δ_n} − h₀) and variance
nδ_n²Q(h_{δ_n} − h₀)² → 0 as n → ∞, by (v).
From (iii) and (v), (d/dθ₀)Q(log f_{θ₀}) = Q((d/dθ₀) log f_{θ₀}) = Qh₀ = 0.
From (vi), Q(log f_{θ₀+δ_n}) = Q(log f_{θ₀}) + δ_n(d/dθ₀)Q(log f_{θ₀}) + ½δ_n²[−1/v +
ε_{δ_n}], where ε_δ → 0 as δ → 0. Thus nδ_nQ(h_{δ_n} − h₀) + ½nδ_n²/v → 0 as n → ∞.
Therefore L_n(ζ) − δ_nΣh₀(X_i) + ½nδ_n²/v → 0 in Q-probability, i.e.

(*) L_n(ζ) − ζ√(v/n) Σh₀(X_i) + ½ζ² → 0

in Q-probability. Hence

log{p_X(θ₀ + ζ√(v/n))/p_X(θ₀)} − log{φ[ζ − √(v/n)Σh₀(X_i)]/φ[√(v/n)Σh₀(X_i)]} → 0

in Q-probability, and

p_X(θ₀ + ζ√(v/n))/p_X(θ₀) − φ[ζ − √(v/n)Σh₀(X_i)]/φ[√(v/n)Σh₀(X_i)] → 0

in Q-probability as required. □
Notes: The condition (iii) is weaker than the corresponding condition (iii)
of Theorem 11.2; also there is no condition corresponding to 11.2(v), which
requires the posterior distribution to concentrate on θ₀. Thus the posterior
density may be asymptotically normal in the neighborhood of θ₀ without
concentrating there!

Condition 11.3(v) is much weaker than 11.2(vi); thus it is only possible
to prove pointwise convergence of the posterior density, rather than uniform
convergence. The core of the proof is showing that the log posterior density
is parabolic near the "optimal" θ₀, as in equation (*).

Note that the expression for asymptotic variance involves (d²/dθ₀²)Q(log f_θ)
rather than Q((d²/dθ₀²) log f_θ); the second derivatives of log f_θ may not
exist for many X and θ, but the second derivatives of Q(log f_θ), averaging
out X, may well exist. See Ibragimov and Khas'minskii (1973) and LeCam
(1970) for some related results in maximum likelihood asymptotics.
11.4. Asymptotic Normality of Martingale Sequences
11.4. Asymptotic Normality of Martingale Sequences
Theorem. Let ito c it 1 C .. , cit" c .. , c it be probability spaces, and
let Po be a unitary probability on it. Let Pi be probabilities on it to iti with
Pi = PiP j, i ~j. Let X be an element ofit 00' where it 00 is the minimal complete
probability space with a probability P 00 equal to Po on it" each n. Define
= p"X - Pn _ 1 X
S2 = P (X - P X)2
n
n
"
U"
s;
Assume (i) 0 < < 00 all n, as Po'
(ii) 2:.i="+ 1 Pj- 1(u;)/s; ~ 1 as n ~ 00, as Po'
(iii) ~up Pj - 1(u;)/s; ~ 0 as n ~ 00, as Po'
)>"
(iv) 2:.i=n+l Pj_llujI3/S; ~ 0 as n ~
00,
as Po'
Then
PJ[(X - p"X)/s"] ~ Jfof(u)e-(1/2IU2dU,
as Po'
for each bounded continuous function f (so that X is asymptotically normal
with mean p"X and variance given itJ
s;
PROOF. The proof parallels the usual method for proving a central limit
theorem for sums. Here X − P_nX = Σ_{j=n+1}^∞ u_j = lim_{N→∞} Σ_{j=n+1}^N u_j; the quantities {u_j} play the role of the summands in the central limit theorem; they
are not independent, but u_j is P₀-uncorrelated with f(u_k) for k < j, f
measurable.

Let f_n(u) = exp(itu/s_n). Then

P_n[f_n(u_{n+1} + u_{n+2})/(P_nf_n(u_{n+1})P_{n+1}f_n(u_{n+2}))] = P_n[f_n(u_{n+1})/P_nf_n(u_{n+1})] = 1.

By induction, P_n[f_n(Σ_{i=n+1}^N u_i)/Π_{i=n+1}^N P_{i−1}f_n(u_i)] = 1 (this makes the
characteristic function of P_NX − P_nX nearly the same as a product). From
Theorem 4.2, P_n|P_NX − X| → 0 as N → ∞.
Also |f_n(x) − f_n(y)| ≤ |t||x − y|/s_n, and |f_n(x)| ≤ 1. Therefore

|P_n([f_n(Σ_{i=n+1}^N u_i) − f_n(X − P_nX)]/Π_{i=n+1}^N P_{i−1}f_n(u_i))| → 0 as N → ∞.

Expanding,

P_{i−1}f_n(u_i) = 1 + (it/s_n)P_{i−1}u_i − ½t²P_{i−1}(u_i²)/s_n² + v|t|³P_{i−1}|u_i|³/s_n³, with |v| ≤ 1,

and P_{i−1}u_i = 0, so

Σ_{i=n+1}^N log P_{i−1}f_n(u_i) = −½t² Σ_{i=n+1}^N P_{i−1}(u_i²)/s_n² + v|t|³ Σ_{i=n+1}^N P_{i−1}|u_i|³/s_n³ + e_n,

where e_n → 0 as n → ∞ by (iii).
[Using the facts x − ½x² < log(1 + x) < x,
Σx_i − sup|x_i|·Σ|x_i| < Σ log(1 + x_i) < Σx_i.]

Thus Σ_{i=n+1}^∞ log P_{i−1}f_n(u_i) exists by (ii) and (iv), and approaches −½t²
as n → ∞, as P₀. Since Σ_{i=N+1}^∞ log P_{i−1}f_n(u_i) → 0 as N → ∞,

P_n[f_n(X − P_nX)/Π_{i=n+1}^∞ P_{i−1}f_n(u_i)]
= lim_{N→∞} P_n[f_n(X − P_nX) Z_N/Π_{i=n+1}^N P_{i−1}f_n(u_i)], where Z_N → 1 as N → ∞,
= 1,

and

Π_{i=n+1}^∞ P_{i−1}f_n(u_i) → e^{−½t²}.

Thus the characteristic function of (X − P_nX)/s_n approaches the characteristic
function of the unit normal, as P₀. Since any continuous function that is zero
outside a compact set can be uniformly approximated on the set by a finite
sum Σa_j exp(it_ju), the same result holds for such continuous functions.
Extension to arbitrary bounded continuous functions is straightforward. □
Notes: These results are very free of regularity conditions, and especially
of independence conditions. The conditions on the increments in posterior
means P_nX − P_{n−1}X might be difficult to verify. Condition (iv) might be
weakened to the Lindeberg-like condition Σ_{j=n+1}^∞ P_{j−1}[|u_j|²{|u_j| > εs_n}]/s_n² →
0. See Hall and Heyde (1980), and Brown (1971).
EXAMPLE. Suppose 𝒳_n is generated by X₁, …, X_n, a sample from the
Bernoulli distribution P[x] = p^{x=1}(1 − p)^{x=0} given p, and where p has
some prior distribution P₀. See Awad (1978), p. 53. Letting r = Σx_i,

P_n(p) = f(r, n) = P₀[p^{r+1}(1 − p)^{n−r}]/P₀[p^r(1 − p)^{n−r}],
u_{n+1} = f(r + x_{n+1}, n + 1) − f(r, n).

Given 𝒳_n, u_{n+1} takes the value f(r + 1, n + 1) − f(r, n) with probability f(r, n),
and f(r, n + 1) − f(r, n) with probability 1 − f(r, n).

P_n(u_{n+1}) = 0, so f(r, n + 1)[1 − f(r, n)] + f(r + 1, n + 1)f(r, n) = f(r, n),
s_n² = P_n(p − P_np)² = f(r + 1, n + 1)f(r, n) − f(r, n)²,
P_n(u_{n+1}²) = s_n⁴/[f(r, n)(1 − f(r, n))],
P_n|u_{n+1}|³ = s_n⁶[f^{−2}(r, n) + (1 − f(r, n))^{−2}].

Note that f(r, n) = P_np → p as n → ∞, as P₀.

Assume that ns_n² → p(1 − p), as P₀. Then

Σ_{j=n+1}^∞ P_{j−1}(u_j²)/s_n² − n Σ_{j=n+1}^∞ [p(1 − p)/(j − 1)²]/p(1 − p) → 0.

Since n Σ_{j=n+1}^∞ 1/(j − 1)² → 1 as n → ∞, (ii) is satisfied.

sup_{j>n} P_{j−1}(u_j²)/s_n² ≤ (1 + ε)/n → 0 as n → ∞, satisfying (iii).

Σ_{j=n+1}^∞ P_{j−1}|u_j|³/s_n³ ≈ (n/[p(1 − p)])^{3/2} [p(1 − p)]³ [1/p² + 1/(1 − p)²] Σ_{j=n+1}^∞ 1/(j − 1)³ → 0,

satisfying (iv).

Thus the posterior distribution of p given x₁, …, x_n is normal whenever
ns_n² → p(1 − p), as P₀; that is, whenever the posterior variance converges
to the asymptotic variance of the maximum likelihood estimator.
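For a Beta prior the quantities in the example are explicit; a quick numerical check (prior parameters and p invented) that ns_n² approaches p(1 − p):

```python
import numpy as np

rng = np.random.default_rng(2)
# Bernoulli sampling with a Beta(a, b) prior: the posterior mean is
# f(r, n) = (r + a)/(n + a + b), and n s_n^2 should approach p(1 - p)
a, b, p = 2.0, 3.0, 0.3
n = 5000
r = int((rng.random(n) < p).sum())

post_mean = (r + a) / (n + a + b)
post_var = (r + a) * (n - r + b) / ((n + a + b) ** 2 * (n + a + b + 1))
print(post_mean, n * post_var)   # near p and near p(1 - p)
```

Here n times the posterior variance settles near p(1 − p) = 0.21, the condition under which the martingale theorem gives asymptotic normality.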
11.5. Higher Order Approximations to Posterior
Densities

Let's just blast away with Taylor series expansions and leave the regularity
conditions till later. See Johnson (1967) and Hartigan (1965).

(i) Assume X₁, …, X_n are a sample from P_θ having density f_θ with respect
to some measure ν on 𝒴. Let θ be a real valued random variable. Let

h_r(X) = [(d/dθ)^r log f_θ(X)]_{θ=θ₀},
g_r = Q[h_r(X)] with respect to some measure Q on 𝒴.

(ii) Assume Q(log f_θ) is maximal at θ = θ₀.
(iii) Assume the prior P₀ on θ has density p₀, and the posterior P_X has
density p_X. Then

log p_X(θ) = log p_X(θ₀) + log p₀(θ)/p₀(θ₀) + Σ_{i=1}^n log f_θ(X_i)/f_{θ₀}(X_i),

log[p_X(θ)/p_X(θ₀)] = (θ − θ₀)(d/dθ₀) log p₀(θ₀)
+ (θ − θ₀)Σh₁(X_i) + ½(θ − θ₀)²Σh₂(X_i)
+ (1/6)(θ − θ₀)³Σh₃(X_i)
+ o(θ − θ₀) + o[n(θ − θ₀)³].

(iv) This expansion is justified by requiring the first three derivatives
(d/dθ)^r log f_θ(X) to be continuous in a neighborhood of θ₀, uniformly in
X; and by requiring the derivative (d/dθ) log p₀(θ) to be continuous in a
neighborhood of θ₀. In order to ensure that large deviations |θ − θ₀| have
negligible probability, assume P_X[|θ − θ₀| > n^{−1/2+ε}]n^k → 0 for every k > 0.
The later terms are negligible if Qh₂ < 0. Then

p_X(θ) = c(X) exp{½Σh₂(X_i)[θ − θ₀ + (Σh₁(X_i) + (d/dθ₀) log p₀)/Σh₂(X_i)]²}
× {1 + (1/6)(θ − θ₀)³Σh₃(X_i) + o(θ − θ₀) + o[n(θ − θ₀)³]}.

Here the term (1/6)(θ − θ₀)³Σh₃(X_i) causes an O(n^{−1/2}) skewness departure
from normality. The only effect of the prior is in shifting the mean by
−(d/dθ₀) log p₀/Σh₂(X_i). The first three moments determine the asymptotic
distribution:

P_nθ = θ₀ − Σh₁(X_i)/Σh₂(X_i) − {(d/dθ₀) log p₀ + ½[(Σh₁)² − Σh₂]Σh₃/(Σh₂)²}/Σh₂ + O(n⁻²),
P_n(θ − P_nθ)² = (Σh₁Σh₃/Σh₂ − Σh₂)⁻¹ + O(n⁻²),
P_n(θ − P_nθ)³ = −Σh₃/(Σh₂)³ + O(n⁻³).
11.6. Problems

E1. Show that the binomial model satisfies conditions 11.2, when the prior density is
continuous and positive at p₀, 0 < p₀ < 1, and p₀ is assumed true.

E2. In the binomial case, if the prior distribution has an atom at p₀, show that the conditions of 11.4 are not satisfied.

E3. Let f[X₁, X₂, …, X_n] be the marginal density of the observations, and let p(θ)
be the prior density. Show, under conditions 11.2, that

f(X₁, X₂, …, X_n)/Πf(X_i|θ̂_n) → p(θ₀)√(2π){nQ[−(∂²/∂θ₀²) log f_θ]}^{−1/2}.

P1. Observations X_i are N(μ, 1) and μ has prior density uniform on |μ| ≤ 1. Give the
asymptotic behavior of the posterior distribution as the true μ₀ ranges from −∞
to ∞.

E4. Under the conditions 11.2, when θ₀ is true, show that P(θ < θ₀|X_n) is asymptotically
uniformly distributed.

E5. Under the conditions 11.2, show that the posterior distribution of log[Πf(X_i|θ)/Πf(X_i|θ̂_n)]
is asymptotically −½χ₁². [Bayes intervals for θ thus coincide approximately with
maximum likelihood intervals.]

E6. X₁, …, X_n are uniform over [θ − ½, θ + ½] and θ is uniform over −∞ to ∞. Give
the asymptotic behavior of the posterior density when θ = 0.
E7. f(x | θ) = 1/θ if 0 < x < θ, = 1/(1 − θ) if θ ≤ x < 1, = 0 elsewhere. If θ is uniform over (0, 1), specify the asymptotic behavior of the posterior density of θ when θ = 1/2 is true.

P2. f(x | μ) = ½ exp{−|x − μ|}, μ uniform. Specify the asymptotic behavior of the posterior density of μ, given μ = 0.

P3. Let g be such that g(X) and g²(X) are P₀ integrable. If X₁, ..., Xₙ is a sample from P₀, and P on Θ is unitary, show that

    P(n var[P_θ g(X) | X]) → P P_θ[g²(X)] − P[P_θ g(X)]².

Thus P_θ g(X) is known, given X, to order n^{−1/2}.

E8. Generalise theorem 11.2 to k-dimensional parameters.

P4. For the binomial model, the prior on p has density ½{0 ≤ p ≤ ½} + (3/2){½ ≤ p ≤ 1}. What is the posterior distribution of p asymptotically, when the true value is p = ½?

P5. The observation X, Y is bivariate normal, means a(θ), b(θ), identity covariance matrix, a(θ) = 0 for θ ≤ 0, a(θ) = θ for θ ≥ 0, and b(θ) = a(−θ). Find the asymptotic posterior distribution of θ, when the true value of the means is (1, 1). Assume a uniform prior distribution for θ.

P6. In the binomial model, find nondegenerate prior distributions for p for which n var(p | Xₙ) → 0.
P7. For X₁, ..., Xₙ from N(μ, σ²), prior μ ~ N(μ₀, σ₀²), verify the conditions of the martingale central limit theorem.

Q1. Let X₁, ..., Xₙ be from the normal mixture pN(μ₁, σ₀²) + (1 − p)N(μ₂, σ₀²) where p has uniform prior, μ₁ and μ₂ are independently N(0, 1), σ₀² fixed. What is the asymptotic posterior distribution of p, μ₁, μ₂ for various true values of p, μ₁ and μ₂?

Q2. Let 𝒴₁ ⊂ 𝒴₂ ⊂ ... ⊂ 𝒴ₙ ... be increasing, θ ∈ 𝒴_∞, and suppose Zₙ ∈ 𝒴ₙ, Zₙ → θ has the property that (Zₙ − θ)/σₙ(θ) → N(0, 1) in distribution given θ. Show that (θ − Zₙ)/σₙ(Zₙ) → N(0, 1) in distribution given Zₙ, provided θ has continuous positive density on the line. (Note: Zₙ may not have a convergent density.) [Here σₙ(θ) is the standard deviation of θ given 𝒴ₙ and σₙ(Zₙ) is the standard deviation of Zₙ given 𝒴ₙ.]

P8. Let X₀, X₁, X₂, ... be observations from an autoregressive process Xₜ = αXₜ₋₁ + εₜ, where the εₜ are i.i.d. normal. Assume α is uniform on (−1, 1). Find the asymptotic behavior of the posterior distribution of α given X₀, X₁, X₂, ..., Xₙ.

P9. Let X₁, ..., Xₙ be a sample from the density exp(θ − x), x ≥ θ. Let θ have a prior density which is continuous and positive at θ = 0. Find the asymptotic distribution of θ given X₁, ..., Xₙ if X₁, ..., Xₙ are sampled from the uniform on (0, 1).
11.7. References
Awad, A. M. (1978), A martingale approach to the asymptotic normality of posterior
distributions, Ph.D. Thesis, Yale University.
Brown, B. M. (1971), The martingale central limit theorem, Ann. Math. Statist. 42,
59-66.
Hall, P. and Heyde, C. C. (1980), Martingale Limit Theory and Its Applications. New
York: Academic Press.
Hartigan, J. A. (1965), The asymptotically unbiased prior distribution, Ann. Math.
Statist. 36, 1137-1154.
Ibragimov, I. A. and Khas'minskii, R. Z. (1973), Asymptotic behaviour of some
statistical estimators II. Limit theorems for the a posteriori density and Bayes
estimators, Theor. Probability Appl. 18, 76-91.
--(1975), Local asymptotic normality for non-identically distributed observations,
Theor. Probability Appl. 20, 246-260.
Johnson, R. A. (1967), An asymptotic expansion for posterior distributions, Ann.
Math. Statist. 38,1899-1907.
LeCam, L. (1958), Les proprietes asymptotiques des solutions de Bayes, Publ. Inst.
Statist. Univ. Paris 7, 17-35.
--(1970), On the assumptions used to prove asymptotic normality of maximum
likelihood estimates, Ann. Math. Statist. 41, 802-828.
Walker, A. M. (1969), Asymptotic behavior of posterior distributions, J. Roy. Stat.
Soc. B 31, 80-88.
CHAPTER 12
Robustness of Bayes Methods
12.0. Introduction
A statistical procedure is robust if its behavior is not very sensitive to the
assumptions which justify it. In classical statistics these are assumptions
about a probability model {P_θ, θ ∈ Θ} for the observations in 𝒳, and about
a loss function L connecting the decision and unknown parameter value.
In Bayesian statistics, there is in addition an assumed prior distribution.
Bayesian techniques have been used by Box and Tiao (1973) and others
to study classical robustness questions such as the choice of a good estimate
of a location parameter for "near-normal" distributions; they imbed the
normal in a family with one more parameter, and then use standard Bayesian
techniques to determine the posterior distribution of the location parameter.
The usual robustness studies allow for a much larger neighborhood of
distributions, however.
In studying Bayesian robustness, we wish to evaluate the effect on the
posterior distribution and on Bayesian decisions of various components
of the probability model. Since the loss function is chosen by the decision
maker it seems plausible to concentrate on the probability parts of the
model.
(i) the likelihood component {P_θ, θ ∈ Θ};
(ii) the prior component P₀.
Here we consider mainly the prior component P₀, using the techniques
of de Robertis and Hartigan (1981).
12.1. Intervals of Probabilities

Let Q₁ and Q₂ be probabilities on 𝒳.
Define Q₁ ≤ Q₂ if Q₁X ≤ Q₂X whenever X ≥ 0.
The interval of probabilities (L, U) is the set of probabilities Q with L ≤ Q ≤ U. The probability L will be called the lower probability, and the probability U will be called the upper probability. If L has density l, and U has density u with respect to ν, then (L, U) consists of the probabilities with density q, l ≤ q ≤ u.

Theorem. Inf{Q(Y)/Q(X) | L ≤ Q ≤ U} is the unique solution λ of

    U(Y − λX)⁻ + L(Y − λX)⁺ = 0,   provided UX⁻ + LX⁺ > 0.

Note that X⁺ = X{X(s) ≥ 0}, X⁻ = X{X(s) ≤ 0}.

PROOF. Since Q(Y) ≥ UY⁻ + LY⁺ for L ≤ Q ≤ U, and Q₀Z = U[{Y ≤ 0}Z] + L[{Y ≥ 0}Z] satisfies L ≤ Q₀ ≤ U,

    inf{Q(Y) | L ≤ Q ≤ U} = Q₀Y = UY⁻ + LY⁺.

Now inf Q(Y)/Q(X) ≥ λ if and only if inf[Q(Y) − λQ(X)] ≥ 0, since Q(X) > 0 for L ≤ Q ≤ U. Thus inf Q(Y)/Q(X) ≥ λ if and only if U(Y − λX)⁻ + L(Y − λX)⁺ ≥ 0. Also

    (d/dλ){U(Y − λX)⁻ + L(Y − λX)⁺} = U(−X{Y ≤ λX}) + L(−X{Y ≥ λX})
        ≤ −UX⁻ − LX⁺ < 0

(since Q₀Z = U{Y ≤ λX}Z + L{Y ≥ λX}Z lies in (L, U)). Thus U(Y − λX)⁻ + L(Y − λX)⁺ is strictly decreasing, and is zero at λ = inf Q(Y)/Q(X) as required. □
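In a finite state space the theorem gives a practical recipe: bisect on λ in U(Y − λX)⁻ + L(Y − λX)⁺, which is strictly decreasing in λ. A sketch (not from the text; the weights are invented for illustration):

```python
import random

# Finite-state version of Theorem 12.1: L and U put weights l[s] <= u[s]
# on points s = 0..n-1, and inf Q(Y)/Q(X) over L <= Q <= U is the root
# in a of  G(a) = U(Y - aX)^- + L(Y - aX)^+.
def inf_ratio(l, u, X, Y, lo=-1e6, hi=1e6):
    def G(a):
        # U picks up the negative part, L the positive part
        return sum(u[s] * min(Y[s] - a * X[s], 0.0) +
                   l[s] * max(Y[s] - a * X[s], 0.0) for s in range(len(l)))
    for _ in range(200):          # G is strictly decreasing in a
        mid = 0.5 * (lo + hi)
        if G(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(1)
n = 6
l = [random.uniform(0.5, 1.0) for _ in range(n)]
u = [li * random.uniform(1.0, 3.0) for li in l]
X = [1.0] * n                      # denominator is total mass, so the
Y = [float(s) for s in range(n)]   # ratio is the mean of s under Q
lam = inf_ratio(l, u, X, Y)
# any feasible density q (l <= q <= u) must give a ratio >= lam
for _ in range(100):
    q = [random.uniform(l[s], u[s]) for s in range(n)]
    assert sum(q[s] * Y[s] for s in range(n)) / sum(q) >= lam - 1e-6
```

The check at the end exercises the theorem: no density between l and u can produce a smaller mean than the bisection root.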
12.2. Intervals of Means

Theorem. Let L = N(0, 1), U = kL. For Q ∈ (L, U), the mean of Q is QX/Q1 where X(s) = s. Then QX/Q1 has range [−y(k), y(k)] where y(k) satisfies

    ky = (k − 1)[φ(y) + yΦ(y)],

where φ(x) = exp(−½x²)/√(2π), Φ(x) = ∫_{−∞}^{x} φ(u) du.

PROOF. From 12.1, inf QX/Q1 is the solution of U(X − λ)⁻ + L(X − λ)⁺ = 0. That is

    ∫_{x≤λ} (x − λ) kφ(x) dx + ∫_{x≥λ} (x − λ) φ(x) dx = 0,

that is

    φ(λ) − λ[1 − Φ(λ)] − kφ(λ) − kλΦ(λ) = 0,

or, putting λ = −y,

    ky = (k − 1)[φ(y) + yΦ(y)].

Thus −y(k) = inf QX/Q1. Similarly y(k) = sup QX/Q1. □
    k      1     1.25   1.50   1.75   2      2.5    3      4      5      6      7      8      9      10
    y(k)   0     .089   .162   .223   .276   .364   .436   .549   .636   .707   .766   .817   .862   .901
Thus quite substantial changes in the probability Q do not affect the mean
too much. Similar Bayes estimates will arise from a wide range of priors.
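The tabled values of y(k) can be reproduced numerically. A minimal sketch (not from the text), solving ky = (k − 1)[φ(y) + yΦ(y)] by bisection:

```python
import math

def phi(x):    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):    # standard normal distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def y_of_k(k):
    """Root of k*y - (k - 1)*(phi(y) + y*Phi(y)) = 0, for 1 < k <= 10."""
    f = lambda y: k * y - (k - 1) * (phi(y) + y * Phi(y))
    lo, hi = 0.0, 1.0      # f(0) < 0 and f(1) > 0 for the tabled k
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For k = 2 this gives y ≈ .276, the constant that reappears in Problem E2 of Section 12.9.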
12.3. Intervals of Risk

Theorem. Suppose that the risk r(d, θ) is the loss in making decision d when θ is true. Let the Bayes risk B(Q) = inf_d (Q[r(d, θ)]/Q(1)), the probable loss when the best decision is taken. Assume 0 < L(1) ≤ U(1) < ∞.
Inf{B(Q) | L ≤ Q ≤ U} is the unique solution of

    β₁(λ) = inf_d [U(r(d, θ) − λ)⁻ + L(r(d, θ) − λ)⁺] = 0.

Sup{B(Q) | L ≤ Q ≤ U} is no greater than the unique solution of

    β₂(λ) = inf_d [L(r(d, θ) − λ)⁻ + U(r(d, θ) − λ)⁺] = 0.

PROOF. It is straightforward to show that β₁(λ) and β₂(λ) are continuous, strictly decreasing and have unique zeroes λ₁ and λ₂.

    inf{B(Q) | L ≤ Q ≤ U} = sup{λ | B(Q) ≥ λ for Q such that L ≤ Q ≤ U}
        = sup{λ | Q[r(d, θ)] ≥ λQ(1) for all d and Q, L ≤ Q ≤ U}
        = sup{λ | Q[r(d, θ) − λ] ≥ 0 all d and Q, L ≤ Q ≤ U}
        = sup{λ | inf_d (U[r(d, θ) − λ]⁻ + L[r(d, θ) − λ]⁺) ≥ 0}
        = sup{λ | β₁(λ) ≥ 0} = λ₁.

    sup{B(Q) | L ≤ Q ≤ U} = inf{λ | B(Q) < λ for Q, L ≤ Q ≤ U}
        = inf{λ | Q[r(d, θ)] < λQ(1) some d, for each Q, L ≤ Q ≤ U}
        ≤ inf{λ | sup_Q Q[r(d, θ) − λ] < 0 some d}
        = inf{λ | L[r(d, θ) − λ]⁻ + U[r(d, θ) − λ]⁺ < 0 some d}
        = inf{λ | β₂(λ) < 0} = λ₂. □
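In a finite decision problem the lower bound λ₁ can be computed by bisection on β₁. A sketch (not from the text; the risks and weights are invented):

```python
import random

# Finite version of Theorem 12.3: states theta = 0..m-1 carry lower and
# upper weights l, u; decisions have risks r[d][theta].  beta1 is
# decreasing in lam, and its root equals inf{B(Q) : L <= Q <= U}.
def beta1(lam, l, u, r):
    m = len(l)
    return min(sum(u[t] * min(row[t] - lam, 0.0) +
                   l[t] * max(row[t] - lam, 0.0) for t in range(m))
               for row in r)

def inf_bayes_risk(l, u, r, lo=0.0, hi=100.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if beta1(mid, l, u, r) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(2)
m = 4
l = [random.uniform(0.5, 1.0) for _ in range(m)]
u = [li * random.uniform(1.0, 2.0) for li in l]
r = [[random.uniform(0.0, 10.0) for _ in range(m)] for _ in range(3)]
lam1 = inf_bayes_risk(l, u, r)
# every feasible Q has Bayes risk at least lam1
for _ in range(100):
    q = [random.uniform(l[t], u[t]) for t in range(m)]
    B = min(sum(q[t] * row[t] for t in range(m)) for row in r) / sum(q)
    assert B >= lam1 - 1e-6
```

The loop at the end checks the theorem's claim against randomly chosen feasible priors.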
12.4. Posterior Variances

For the normal location problem, take L = N(0, 1), U = kL. For r(d, θ) = (d − θ)², B(Q) is the variance of θ. It is bounded by the solutions of

    inf_d L{ k[(d − θ)² − λ₁]⁻ + [(d − θ)² − λ₁]⁺ } = 0,
    inf_d L{ [(d − θ)² − λ₂]⁻ + k[(d − θ)² − λ₂]⁺ } = 0.

By symmetry of U about 0, the optimal solution is d = 0 for both equations. (Consider probabilities of form kφ{(d − θ)² < λ₁} + φ{(d − θ)² > λ₁}; note that the mean value of such probabilities lies between d and 0 for k > 1; this implies that the only solution to the first equation is d = 0. For the second equation, k[(d − θ)² − λ₂]⁺ + [(d − θ)² − λ₂]⁻ is convex so its minimizing value is unique. By symmetry if d is a minimum, so is −d. Therefore d = 0.)

The solutions λ₁ and λ₂ are posterior variances of elements of (L, U), so the bounds of Theorem 12.3 are sharp.

    λ₁ = Λ₁², where Λ₁ solves Λφ(Λ) + (Λ² − 1)[Φ(Λ) − (k − 2)/(2k − 2)] = 0;
    λ₂ = Λ₂², where Λ₂ solves Λφ(Λ) + (Λ² − 1)[Φ(Λ) − (2k − 1)/(2k − 2)] = 0.
    k      1.25   1.50   1.75   2      2.5    3      4      5      6      7      8      9      10
    Λ₁     .947   .904   .870   .840   .792   .754   .697   .654   .621   .592   .574   .552   .535
    Λ₂     1.055  1.100  1.140  1.174  1.233  1.282  1.360  1.421  1.472  1.515  1.552  1.585  1.615
Thus, again, a very large change in the probability density causes a relatively minor change in posterior variance. [A factor of 10 for the density gives a factor of 2 for the variance.]
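The tabled Λ values can be checked numerically. A sketch (not from the text; it assumes the two defining equations Λφ(Λ) + (Λ² − 1)[Φ(Λ) − c] = 0 with c = (k − 2)/(2k − 2) and c = (2k − 1)/(2k − 2)):

```python
import math

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def solve(c, lo, hi):
    # root of  L*phi(L) + (L^2 - 1)*(Phi(L) - c) = 0  on [lo, hi]
    f = lambda L: L * phi(L) + (L * L - 1) * (Phi(L) - c)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def posterior_sd_bounds(k):
    """Lambda_1, Lambda_2 of Section 12.4 for U = kL, L = N(0, 1)."""
    L1 = solve((k - 2) / (2 * k - 2), 0.01, 1.0)
    L2 = solve((2 * k - 1) / (2 * k - 2), 1.0, 3.0)
    return L1, L2
```

At k = 2 this reproduces the tabled pair Λ₁ = .840, Λ₂ = 1.174.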
12.5. Intervals of Posterior Probabilities

Lemma. L ≤ cQ ≤ U for some c, if and only if QX/QY ≤ UX/LY for each X ≥ 0, Y ≥ 0.

PROOF. The "only if" is obvious.
If QX · LY ≤ UX · QY for all X, Y ≥ 0, then

    sup_{X≥0} QX/UX = c₁ ≤ c₂ = inf_{Y≥0} QY/LY,

so that c₂LX ≤ QX ≤ c₁UX ≤ c₂UX, that is L ≤ (1/c₂)Q ≤ U as required. □

Theorem. Let X, Y be random variables satisfying the conditions of Bayes' theorem (3.4):

(i) X, Y and X × Y are random variables from Ω, 𝒵 to S, 𝒳, to T, 𝒴, and to S × T, 𝒳 × 𝒴.
(ii) f is a density on 𝒳 × 𝒴.
(iii) 𝒳 and 𝒴 are σ-finite.
(iv) f_T(t): s → f(s, t) ∈ 𝒳 each t.
(v) P^Y_X g = R^Y(g f_s) for some probability R on 𝒴.
(vi) For each Q^X, L^X ≤ Q^X ≤ U^X, f/Q^X f_T is a density on 𝒳 × 𝒴.

Then the quotient probability Q^X_Y corresponding to the prior probability Q^X satisfies, for some k(Y),

    L^X_Y ≤ Q^X_Y k(Y) ≤ U^X_Y (U^X f_T / L^X f_T).

PROOF. By Bayes' theorem,

    Q^X_Y h = Q^X(h f_T)/Q^X(f_T),

so

    Q^X_Y h₁ / Q^X_Y h₂ = Q^X(h₁ f_T)/Q^X(h₂ f_T) ≤ U^X(h₁ f_T)/L^X(h₂ f_T).

The result follows from the lemma. □
12.6. Asymptotic Behavior of Posterior Intervals

Theorem.
(i) Let X, Y₁, Y₂, ..., Yₙ, ... be random variables from 𝒵 to 𝒳, 𝒴₁, ..., 𝒴ₙ, ..., and assume that Yᵢ⁻¹(𝒴ᵢ) is increasing.
(ii) Let the quotient probability P^{Yₙ}_X[g] = R^{Yₙ}(g fⁿ_X) for some fⁿ which is a density with respect to 𝒳 × 𝒴ₙ on S × Tₙ, and some probability R on 𝒴ₙ. Assume that these quotient probabilities agree with a conditional probability P_X defined on the smallest probability space including all Yᵢ⁻¹(𝒴ᵢ): P_X gₙ(Yₙ) = P^{Yₙ}_X gₙ for gₙ ∈ 𝒴ₙ.
(iii) Assume that P^{Yₙ}_X is unitary.
(iv) Let L^X, U^X be unitary probabilities such that fⁿ/L^X(fⁿ_T) is a density with respect to 𝒳 × 𝒴ₙ, and L^X g = U^X[l g] where U^X{l = 0} = 0.
(v) Assume that g₀ ∈ 𝒳 and l ∈ 𝒳 are 𝒴ₙ-approximable in U-probability; that is, for some kₙ ∈ 𝒴ₙ, k′ₙ ∈ 𝒴ₙ,

    U^X P^{Yₙ}_X |g₀ − kₙ| → 0,   U^X P^{Yₙ}_X |l − k′ₙ| → 0.

Then

    sup_{L^X ≤ Q^X ≤ U^X} |Q^{Yₙ}_X g₀ − g₀(X)| → 0 almost surely.

PROOF. Assume l(X) > 0 without loss of generality. Let A be the set of values of X such that, for all rational λ,

    U^{Yₙ}_X {l[g₀ − λ]⁺ + [g₀ − λ]⁻} → l(X)[g₀(X) − λ]⁺ + [g₀(X) − λ]⁻

as P_X. By Doob's theorem (Doob, 1949), U^X(A^c) = 0.

For a fixed value of X in A, suppose g₀(X) > α, α rational. Then l(X)[g₀(X) − α]⁺ + [g₀(X) − α]⁻ > 0, so U^{Yₙ}_X{l[g₀ − α]⁺ + [g₀ − α]⁻} > 0 for large n, so sup_{L≤Q≤U} Q^{Yₙ}_X g₀ > α for all large n, as P_X. Similarly, if g₀(X) < β for β rational, inf_{L≤Q≤U} Q^{Yₙ}_X g₀ < β for all large n, as P_X. Since these results hold for all rational α and β, sup_{L≤Q≤U} |Q^{Yₙ}_X g₀ − g₀(X)| → 0 as P_X, except for a set of X values of U probability zero. □

Note: A similar theorem is proved in DeRobertis and Hartigan (1981) with L, U σ-finite.
12.7. Asymptotic Intervals under Asymptotic Normality

Theorem. Let X, Y₁, ..., Yₙ, ... be random variables satisfying the conditions of Theorem 12.6, and, in addition, assume that g₀(X) is asymptotically conditionally normal under U:

    U^{Yₙ}_X c[(g₀ − μₙ)/σₙ] → ∫ c(u) exp(−½u²) du/√(2π)

for each bounded continuous c, where μₙ = U^{Yₙ}_X g₀, σₙ² = U^{Yₙ}_X g₀² − μₙ². Then

    σₙ⁻¹[ sup_{L≤Q≤U} Q^{Yₙ}_X g₀ − (μₙ + σₙ y(kₙ)) ] → 0 as U,
    σₙ⁻¹[ inf_{L≤Q≤U} Q^{Yₙ}_X g₀ − (μₙ − σₙ y(kₙ)) ] → 0 as U,

where y(k) is the solution of ky = (k − 1)[φ(y) + yΦ(y)], kₙ = 1/U^{Yₙ}_X l(X).

PROOF. By 12.6, kₙ = 1/U^{Yₙ}_X l → 1/l(X) as U. By asymptotic normality of g₀(X), for each λ,

    U^{Yₙ}_X { [(g₀ − μₙ)/σₙ − λ]⁺ + l[(g₀ − μₙ)/σₙ − λ]⁻ }
        → ∫ { (u − λ)⁺ + l(X)(u − λ)⁻ } φ(u) du as U.

Let A be the set of X values for which l(X) > 0, and the above convergence occurs for all rational λ. If α < y[1/l(X)], then ∫{(u − α)⁺ + l(X)(u − α)⁻}φ(u)du > 0, so that U^{Yₙ}_X{[(g₀ − μₙ)/σₙ − α]⁺ + l[(g₀ − μₙ)/σₙ − α]⁻} > 0 all large n, so sup Q^{Yₙ}_X[(g₀ − μₙ)/σₙ] > α all large n. If α > y[1/l(X)], then sup Q^{Yₙ}_X[(g₀ − μₙ)/σₙ] < α all large n. Also y(kₙ) → y[1/l(X)] as U. Thus the result follows. □

Note: This theorem permits a close approximation to the interval of posterior means Q^{Yₙ}_X g₀(X), computed by assuming that g₀(X) has upper probability N(μₙ, σₙ²) and lower probability N(μₙ, σₙ²)·U^{Yₙ}_X[l(X)].
12.8. A More General Range of Probabilities

If L ≤ Q ≤ U, and the measures have densities l, q, u, then l ≤ q ≤ u and q(s₁)/q(s₂) ≤ u(s₁)/l(s₂). This formulation has the advantage of permitting Q to be a unitary probability. A difficulty in the present interval of probabilities is that q may be dramatically discontinuous.

More generally let q(s₁)/q(s₂) ≤ u(s₁, s₂) define Q ∈ R. The function u(s₁, s₂) might be such that u(s₁, s₂) → 1 as s₁ → s₂. Necessarily u(s₁, s₂) ≤ u(s₁, s₃)u(s₃, s₂). The posterior density satisfies q_t(s₁)/q_t(s₂) ≤ [f_t(s₁)/f_t(s₂)]u(s₁, s₂), so posterior densities are handled in the same framework. It is sometimes convenient to use log[q(s₁)/q(s₂)] ≤ ρ(s₁, s₂); then ρ(s₁, s₂) ≤ ρ(s₁, s₃) + ρ(s₃, s₂). (Note that ρ may be negative, so it is not a metric.)

For conditional densities f_s(t) it seems desirable to constrain movements in f_{s₁}(t) and f_{s₂}(t) where s₁ and s₂ are close. This suggests the baroque [f_{s₁}(t₁)/f_{s₁}(t₂)]/[f_{s₂}(t₁)/f_{s₂}(t₂)] ≤ u(s₁, s₂, t₁, t₂). Again posterior densities obey a bound of the same type. Maybe u(s₁, s₂, t₁, t₂) = u(s₁, s₂)v(t₁, t₂) would be viable, but it doesn't force u(s₁, s₂, t₁, t₂) = 1 if s₁ = s₂ or t₁ = t₂.

It is necessary to decide if QX ≥ 0 for all Q ∈ R. Let A = {s | X(s) ≥ 0}, A^c = {s | X(s) < 0}. Then q(s) = inf_{s′∈A^c} q(s′)u(s, s′) for s ∈ A, and q(s′) = sup_{s∈A} q(s)/u(s, s′) for s′ ∈ A^c, in a solution which minimizes QX. But I can't see any simple way to characterize X with QX ≥ 0, and such a characterization is really necessary to use the range.
12.9. Problems

E1. A prior distribution for the binomial parameter p is such that no interval of length l has more than twice the probability of any other interval of length l, for all l. Show Pp ≥ √2 − 1.

E2. Let X₁, ..., Xₙ be n observations from N(θ, 1). Let the prior for θ be Q, L ≤ Q ≤ U where L is Lebesgue measure and U = 2L. Show that the posterior mean lies in the interval X̄ ± .276/√n.
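The constant in E2 can be checked numerically. A sketch (not from the text): with n = 1 and X̄ = 0, the prior density maximizing the posterior mean is 2 above some cutoff c and 1 below it, and scanning over c recovers .276:

```python
import math

# sup of the posterior mean over priors with density between 1 and 2
# (w.r.t. Lebesgue), one observation X = 0 from N(theta, 1).  The
# maximizing prior has density 2 on {theta > c}, 1 on {theta < c}.
def sup_posterior_mean():
    thetas = [i * 0.004 - 8.0 for i in range(4001)]   # grid on [-8, 8]
    like = [math.exp(-0.5 * t * t) for t in thetas]
    best = 0.0
    for j in range(0, 4001, 20):                      # cutoff c = thetas[j]
        w = [(2.0 if i >= j else 1.0) * like[i] for i in range(4001)]
        mean = sum(wi * t for wi, t in zip(w, thetas)) / sum(w)
        best = max(best, mean)
    return best

best = sup_posterior_mean()
```

The result agrees with y(2) = .276 from the table in Section 12.2, since the posterior here is squeezed between N(0, 1) and 2·N(0, 1).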
P1. Suppose 7 successes are observed in 10 binomial trials. Let the prior for p lie between U = uniform (0, 1) and 2U. Find the posterior mean's range. [Hint: use the binomial cumulative distribution.]

P2. A prior distribution for the binomial parameter p lies between U(0, 1) and 2U(0, 1). Find the range of the variance of p.

P3. Consider densities of form ..., k ≥ 1. Find the value of d for which the density f_d has minimum variance.

P4. Suppose r successes are observed in n binomial trials. Let the prior for p lie between U(0, 1) and 2U(0, 1). Find an asymptotic expression for the interval of posterior means.

P5. Let f = (1/√(2π)) exp[−½x² + ε(x)] where |ε(x)| ≤ 1. Find bounds for the posterior density of θ given X₁, ..., Xₙ, where X₁, ..., Xₙ is a sample from f(x − θ), and θ has uniform prior density.
12.10. References

Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis. Reading:
Addison-Wesley.
Doob, J. L. (1949), Applications of the theory of martingales, Colloques Internationaux
du Centre National de la Recherche Scientifique, Paris, 22-28.
DeRobertis, L. and J. A. Hartigan (1981), Bayesian inference using intervals of
measures, The Annals of Statistics 9, 235-244.
CHAPTER 13
Nonparametric Bayes Procedures
13.0. Introduction
Whereas Bayes procedures require detailed probability models for observations and parameters, nonparametric procedures work with a minimum of
probabilistic assumptions. It is therefore of interest to examine nonparametric problems from a Bayesian point of view.
Usually nonparametric procedures apply to samples of observations
from an unknown distribution function F. Inferences are made which are
true for all continuous F. For example if X₍₁₎, X₍₂₎, ..., X₍ₙ₎ denote order statistics of the sample, [X₍ₖ₎, X₍ₖ₊₁₎] is a confidence interval for the population median of size (n choose k)·2⁻ⁿ, provided the true F is continuous.
We must give some sort of family of distributions over distribution functions F which can be used as priors and posteriors in a Bayesian approach.
Ferguson (1973) suggests the Dirichlet process, which for a general observation space 𝒴, gives a distribution over probabilities P on 𝒴 such that P(B₁), ..., P(Bₖ) is Dirichlet whenever {Bⱼ} is a partition of the sample space.
No unitary prior is known to reproduce nonparametric confidence procedures; worse, no prior of any sort is known that reproduces such confidence procedures. However some confidence procedures correspond to
families of conditional probabilities, and Lane and Sudderth (1978) have
used finitely additive probabilities to generate confidence procedures.
13.1. The Dirichlet Process

Let 𝒴 on T be a probability space, let 𝒫 be the set of unitary probabilities P on 𝒴, and let 𝒳 denote the smallest probability space on 𝒫 such that X: P → PY lies in 𝒳 for all Y in 𝒴. A Dirichlet process D_α on 𝒳, corresponding to a bounded measure α on T (α(T) < ∞), is such that PB₁, PB₂, ..., PBₖ is distributed as a Dirichlet D_{α(B₁),α(B₂),...,α(Bₖ)} for each partition B₁, B₂, ..., Bₖ of T. Proofs that a Dirichlet process exists are given in Ferguson (1973) and Blackwell and MacQueen (1973).

Following Blackwell and MacQueen, a sequence of random variables Y₁, Y₂, Y₃, ... taking values in 𝒴, T is a Pólya sequence with parameter α if

    P[f(Y₁)] = α(f)/α(T), for f ∈ 𝒴,
    Pᵢf = P[f(Yᵢ₊₁) | Y₁, Y₂, ..., Yᵢ] = [α(f) + Σ_{j≤i} f(Yⱼ)]/[α(T) + i].

Given Y₁, Y₂, ..., Yᵢ, the distribution of Yᵢ₊₁ is a mixture, in the proportion of i to α(T), of the empirical distribution based on Y₁, Y₂, ..., Yᵢ and the distribution α/α(T). As i approaches ∞, the empirical component predominates and the limiting distribution of Yᵢ₊₁ given Y₁, Y₂, ..., Yᵢ is the limiting frequency distribution of Y₁, Y₂, ..., Yᵢ. This limiting distribution, when it exists, will be taken to be a realization P of the Dirichlet process. The different limiting distributions P, for different sequences Y₁, Y₂, ..., Yₙ, ..., give a distribution of probabilities P that satisfy the definition of the Dirichlet process.
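The predictive rule defining a Pólya sequence is easy to state in code. A minimal sketch (not from the text), for an atomic base measure α on a finite set:

```python
from collections import Counter
from fractions import Fraction

# Predictive distribution of Y_{i+1} in a Polya sequence: a mixture of
# the base measure alpha/alpha(T) and the empirical distribution of the
# first i draws, in proportion alpha(T) : i.
def predictive_prob(alpha, history, t):
    """P[Y_{i+1} = t | history] for atomic alpha = {atom: integer mass}."""
    total = sum(alpha.values())            # alpha(T)
    counts = Counter(history)
    return Fraction(alpha.get(t, 0) + counts[t], total + len(history))

alpha = {"a": 1, "b": 2, "c": 1}           # alpha(T) = 4
p = predictive_prob(alpha, ["b", "b", "a"], "b")   # (2 + 2)/(4 + 3) = 4/7
```

Iterating this rule and tracking the empirical frequencies of the draws generates a realization of the Dirichlet process, as the theorem below makes precise.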
Theorem (Blackwell-MacQueen). Let 𝒴 on T be separable (there exists a sequence of 0-1 functions A₁, A₂, ..., Aₙ, ... such that 𝒴 is the smallest probability space including A₁, A₂, ..., Aₙ, ...). Let {Yᵢ} be a Pólya sequence with parameter α. For each Y₁, Y₂, ..., Yᵢ, ..., define

    P*f = lim_{n→∞} Σ f(Yᵢ)/n   when the limit exists for all f in 𝒴,
    P*f = αf/αT                 when the limit does not exist for all f in 𝒴.

Then P* is distributed as a Dirichlet process D_α on 𝒳, and the conditional distribution of Y₁, Y₂, ..., Yᵢ, ... given P* is such that the Yᵢ are independent each with distribution P*.

PROOF. If Y₁, Y₂, Y₃, ... is a Pólya sequence, the Ionescu Tulcea theorem (Neveu, 1965, p. 162) states that a probability P exists on the product space T × T × ..., 𝒴 × 𝒴 × ... such that P is consistent with each of the conditional probabilities Pᵢ. The separability of 𝒴 guarantees that functions of the form {Y₁ = Y₂} lie in 𝒴 × 𝒴 provided 𝒴 includes all singleton functions {t}, t ∈ T. [{Y₁ ≠ Y₂} = sup_i |{Y₁ ∈ Aᵢ} − {Y₂ ∈ Aᵢ}| ∈ 𝒴 × 𝒴; otherwise there exist y₁ ≠ y₂ such that y₁, y₂ lie both inside or both outside of every Aᵢ, and the 𝒴 generated by the Aᵢ consists of functions f with f(y₁) = f(y₂); thus the singleton function {y₁} is excluded from 𝒴.] If 𝒴 does not include all singletons, consider the space 𝒴*, T* where T* consists of the equivalence classes B_t, t ∈ T; t′ ∈ B_t if and only if {t ∈ Aᵢ} = {t′ ∈ Aᵢ} for all Aᵢ. And 𝒴* consists of the functions f*(B_t) = f(t), f ∈ 𝒴. Note that f* is well defined since f(t) = f(t′)
whenever t′ ∈ B_t. Now 𝒴* is separable and includes all singleton functions, and the theorem may be proved for 𝒴*. The Dirichlet process P* defined on 𝒴* × 𝒴* × ... has the desired properties. It will therefore be assumed that 𝒴 includes singletons.

It will be shown first that P*f = lim_{n→∞} Σ f(Yᵢ)/n except for sequences {Yᵢ} in a set of probability zero. Let f₁(t) = {Y₁ = t}. Then f₁(Yₙ) is an exchangeable sequence given Y₁, so from de Finetti's theorem, 4.5, Σ_{n≥2} f₁(Yₙ)/n converges to say P₁ with probability 1. Next let f₂(t) = {Y₁ = t or Y₂ = t}. Then {f₂(Yₙ), n > 2} is an exchangeable sequence given Y₁, Y₂, and so Σ_{n≥3} f₂(Yₙ)/n converges to say P₂ with probability 1. Similarly, if fₖ(t) = ∪_{i≤k} {Yᵢ = t}, then Σ fₖ(Yᵢ)/n converges to say Pₖ with probability 1, for all k, 1 ≤ k ≤ ∞. Now

    P[fₖ(Yₙ) | Y₁, Y₂, ..., Yₖ] = [k + αfₖ]/[k + αT] → 1 as k → ∞,
    P[Σ fₖ(Yᵢ)/(n − k) | Y₁, ..., Yₖ] = P[fₖ(Yₙ) | Y₁, ..., Yₖ] → 1 as k → ∞,
    P[Σ fₖ(Yᵢ)/n > 1 − ε | Y₁, ..., Yₖ] → 1 as k → ∞, each ε > 0,
    P[Pₖ > 1 − ε | Y₁, ..., Yₖ] → 1 as k → ∞.

A probability P exists on 𝒴 × 𝒴 × 𝒴 × ... consistent with these conditional probabilities, and from separability, the functions fₖ(Yᵢ) lie in 𝒴 × 𝒴 × 𝒴 × .... Thus with probability 1, all the limits Σ fₖ(Yᵢ)/n exist and

    lim_{k→∞} lim_{n→∞} Σ fₖ(Yᵢ)/n = 1.

This guarantees that P*f = lim Σ f(Yᵢ)/n exists for all f in 𝒴; P* is a discrete distribution carried by {Yᵢ}. To show this,

    Σ_{i=1}^n f(Yᵢ)/n = f(Y₁)[Σ f₁(Yᵢ)/n] + f(Y₂)[Σ f₂(Yᵢ) − Σ f₁(Yᵢ)]/n + ...
        + f(Yₖ)[Σ fₖ(Yᵢ) − Σ fₖ₋₁(Yᵢ)]/n + A[n − Σ fₖ(Yᵢ)]/n   where |A| ≤ sup|f|,

    lim_{n→∞} |Σ f(Yᵢ)/n − (f(Y₁)P₁ + ... + f(Yₖ)(Pₖ − Pₖ₋₁))| ≤ sup|f|(1 − Pₖ).

Thus all the limits Σ f(Yᵢ)/n exist if the Pᵢ exist and lim_k Pₖ = 1. It is straightforward to show that P*f = Σ f(Yᵢ)(Pᵢ − Pᵢ₋₁), P₀ = 0, defines a probability on 𝒴, for each sequence Y₁, Y₂, ..., Yᵢ where the limits exist. Since P*f = αf/αT defines a probability on 𝒴 when the limits don't exist, P* always takes values in 𝒫.

To show that P* is distributed as D_α, it is necessary to show that P*B₁, ..., P*Bₖ is distributed as Dirichlet D_{α(B₁),...,α(Bₖ)} for each partition B₁, ..., Bₖ of T. Define Zᵢ = j when Yᵢ ∈ Bⱼ.
Then Zᵢ is a random variable taking k discrete values, and it may be shown that Zᵢ is a Pólya sequence with parameter α*, α*{j} = α(Bⱼ). Since P*Bⱼ = lim Σ{Zᵢ = j}/n, the problem is reduced to showing that P* is distributed as D_α when T is finite. If T = {1, 2, ..., k}, let P* = {P₁, P₂, ..., Pₖ}, ΣPᵢ = 1, and note from de Finetti's theorem that the Yᵢ are independent multinomial given P*, with P[Yᵢ = j | P*] = Pⱼ. Then

    P[P₁^{y₁} P₂^{y₂} ... Pₖ^{yₖ}] = P[first y₁ Y's = 1, next y₂ Y's = 2, ..., last yₖ Y's = k]
        = [α(1)/α(T)] · [(α(1) + 1)/(α(T) + 1)] ··· [(α(1) + y₁ − 1)/(α(T) + y₁ − 1)]
          × [α(2)/(α(T) + y₁)] · [(α(2) + 1)/(α(T) + y₁ + 1)] ··· [(α(2) + y₂ − 1)/ ···] ···

The expression on the right is the y₁, y₂, ..., yₖ-th moment of a Dirichlet distribution with parameters α(1), α(2), ..., α(k), Σα(i) = α(T). Since the Dirichlet distribution is characterized by its moments, the result follows.

It remains to be shown that the Yᵢ are independent given P* with distribution P*. It is necessary to check that P[Π fᵢ(Yᵢ) | P*] = Π P*[fᵢ(Yᵢ)] obeys the product law: P[Π P*fᵢ(Yᵢ)] = P[Π fᵢ(Yᵢ)]. If the fᵢ are each members of the partition B₁, B₂, ..., Bₖ this follows from de Finetti's theorem for the case 𝒴 finite. More general fᵢ may be approximated by linear combinations of these simple fᵢ. □
13.2. The Dirichlet Process on (0, 1)

(1) Let α be uniform.
F(x) has expectation x; thus F(x) is beta with density proportional to F^{x−1}(1 − F)^{−x}.

[Figure: densities of F(x) for the Dirichlet process with uniform α.]
We could generate a single random F from D_α as follows:

(a) select F(½) from Be(½, ½);
(b) select F(¼)/F(½) from Be(¼, ¼), and [F(¾) − F(½)]/[1 − F(½)] from Be(¼, ¼);
(c) select [F((2k + 1)/2ⁿ) − F(k/2ⁿ⁻¹)]/[F((k + 1)/2ⁿ⁻¹) − F(k/2ⁿ⁻¹)] from Be(1/2ⁿ, 1/2ⁿ);
(d) continue forever.

[Figure: a realization of F built up by successive subdivision.]

Note that F will be quite bumpy, because the relative changes in F will be near 0 or 1.

(2) α gives weights 1 to 1/4, 1/2, 3/4, 1 and is zero elsewhere.
F(x) is Be[α(0, x], α(x, 1]]. Thus F(x) = 0 for x < 1/4, and F(x) = 1 for x = 1; F(x) changes value only at x = 1/4, 1/2, 3/4, 1 and has atoms ΔF(¼), ΔF(½), ΔF(¾), ΔF(1) which are Dirichlet D₁,₁,₁,₁.

[Figure: another realization of F.]
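The subdivision scheme for uniform α can be simulated directly. A sketch (not from the text; it stops after a fixed number of dyadic levels):

```python
import random

# Stick-splitting construction of F ~ D_alpha, alpha uniform on (0,1),
# carried down to dyadic level `levels`: the mass of each interval is
# split between its halves by an independent Be(w, w) variable, where
# w is the alpha-measure of each half.
def dirichlet_cdf_dyadic(levels, rng):
    masses = [1.0]                      # masses of current dyadic cells
    width = 1.0
    for _ in range(levels):
        width /= 2.0                    # alpha-measure of each half
        new = []
        for m in masses:
            b = rng.betavariate(width, width)
            new.extend([m * b, m * (1 - b)])
        masses = new
    F, tot = [], 0.0                    # cumulative sums give F at the
    for m in masses:                    # dyadic points k/2^levels
        tot += m
        F.append(tot)
    return F

rng = random.Random(3)
F = dirichlet_cdf_dyadic(6, rng)        # F at k/64, k = 1..64
```

Plotting such an F shows the bumpiness noted above: the Be(1/2ⁿ, 1/2ⁿ) splits concentrate near 0 and 1, so almost all the mass in a cell goes to one of its halves.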
13.3. Bayes Theorem for a Dirichlet Process

Theorem. Let α be a finite measure on 𝒴, and let D_α be a prior distribution on 𝒫, the family of probabilities P on 𝒴. Let t be an observation from T according to P. The posterior distribution of P given t is D_{α+δₜ}, where δₜY = Y(t).

PROOF. The joint probability Q on 𝒫 × 𝒴 is defined by

    Q[Z(P) × Y] = Q^𝒫(Z(P)Q^𝒴_P Y) = D_α(Z(P)PY).

The marginal distribution on 𝒫 is D_α, the marginal distribution on 𝒴 is αY/αT.

If B₁, ..., Bₖ is a partition of T, then P(YBᵢ)/PBᵢ is independent of PB₁, ..., PBₖ when P is a Dirichlet process. (Since P(YBᵢ) is the limit of a linear combination of disjoint sets Bᵢⱼ, ∪ⱼ Bᵢⱼ = Bᵢ, and {PBᵢⱼ/PBᵢ} are Dirichlet {α(Bᵢⱼ)}, independent of PBᵢ.)

Thus if Z(P) is of form Z(PB₁, ..., PBₖ) = Zₖ say,

    D_α[Z(PB₁, ..., PBₖ)PY] = Σ_{j=1}^k D_α[Z(PB₁, ..., PBₖ)P(Bⱼ)] D_α[P(YBⱼ)/PBⱼ]
        = Σ D_α[Z(PB₁, ..., PBₖ)PBⱼ] α(YBⱼ)/αBⱼ.

If t ∈ Bᵢ,

    D_{α+δₜ}Zₖ = D_{αB₁,...,αBᵢ+1,...,αBₖ} Zₖ = D_α(ZₖPBᵢ)/D_α(PBᵢ).

Thus

    D_{α+δₜ}Zₖ = Σⱼ {t ∈ Bⱼ} D_α(ZₖPBⱼ)/D_α PBⱼ,
    QD_{α+δₜ}(Zₖ × Y) = Q(Y Σⱼ {t ∈ Bⱼ} D_α(ZₖPBⱼ)/D_α PBⱼ)
        = Σ α(YBⱼ) D_α(ZₖPBⱼ)/αBⱼ.

Thus QD_{α+δₜ}(Zₖ × Y) = Q(Zₖ × Y) whenever Z = Zₖ depends only on {PBⱼ}. Taking limits, the result holds for all Z, and so D_{α+δₜ} is the posterior distribution of P given t. □
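After a sample t₁, ..., tₙ the posterior is therefore D_{α+Σδ_{tᵢ}}, so on any partition the posterior of (PB₁, ..., PBₖ) is Dirichlet with parameters α(Bᵢ) plus the cell counts. A small sketch (not from the text):

```python
from collections import Counter

# Posterior Dirichlet parameters on a partition after observing a
# sample: alpha(B_i) + #{t_j in B_i}.  The partition is described by a
# labelling function mapping each observation to its cell.
def posterior_partition_params(alpha_cells, observations, cell_of):
    counts = Counter(cell_of(t) for t in observations)
    return {cell: a + counts[cell] for cell, a in alpha_cells.items()}

# Example: alpha uniform on (0, 1), partition split at 1/2.
alpha_cells = {"left": 0.5, "right": 0.5}
obs = [0.1, 0.3, 0.45, 0.7, 0.9]
post = posterior_partition_params(alpha_cells, obs,
                                  lambda t: "left" if t < 0.5 else "right")
# post == {"left": 3.5, "right": 2.5}
```

This is the conjugacy that makes the Dirichlet process usable in practice: the update touches only the measure α.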
13.4. The Empirical Process

The limiting case of the Dirichlet D_α occurs when α ≡ 0; this corresponds to the prior density 1/[p(1 − p)] for a binomial parameter p, which is not unitary. It is very difficult to imagine generating P from D₀. The conditional distributions of P and tₙ₊₁ given t₁, ..., tₙ are nice and simple:

(i) PₙY = P[Y(tₙ₊₁) | t₁, ..., tₙ] = (1/n)Σ Y(tᵢ), the empirical distribution over {tᵢ};
(ii) P | t₁, ..., tₙ is Dirichlet D_{Σδ_{tᵢ}}; thus [P(t₁), ..., P(tₙ)] is Dirichlet D₁,...,₁, and P is a discrete distribution carried by the observed sample points t₁, ..., tₙ. See Hartigan (1971).

We are in the embarrassing position of declaring tₙ₊₁ to be surely equal to one of the previous sample points; carried back, this would imply tₙ = t₁ with probability 1, which is dull. We can pretend to be surprised at each new observation tₙ which is not equal to t₁, ..., tₙ₋₁; after all, events of probability zero do occur. But our credibility may be weakened by our always insisting that the next observation is just one of the previous ones, and our always being surprised!

To do error analysis of a parameter Y(P) estimated by Y(Pₙ), we compute the distribution of YP where P ~ D_{Σδ(tᵢ)}. For example, if Y(Pₙ) = (1/n)Σ tᵢ = t̄, Y(P) = ∫ t dF is distributed as Σ Pᵢtᵢ where P ~ D₁; P(Σ Pᵢtᵢ) = t̄, var(Σ Pᵢtᵢ) = Σ(tᵢ − t̄)²/n(n + 1). This procedure gives approximate error behavior for any statistic based on the empirical distribution; it works best when, as here for the mean, the excessive discreteness is smoothed out by the statistic.
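This error analysis can be simulated. A sketch (not from the text): draw P ~ D₁,...,₁ as normalized Exp(1) variables and check the mean and variance formulas above by Monte Carlo:

```python
import random

# P ~ Dirichlet(1, ..., 1) can be drawn as normalized Exp(1) variables.
# For the mean functional, sum(P_i t_i) has expectation tbar and
# variance sum((t_i - tbar)^2) / (n (n + 1)).
def bayes_boot_means(t, reps, rng):
    out = []
    for _ in range(reps):
        g = [rng.expovariate(1.0) for _ in t]
        s = sum(g)
        out.append(sum(gi * ti for gi, ti in zip(g, t)) / s)
    return out

rng = random.Random(0)
t = [1.0, 2.0, 3.0, 4.0, 5.0]
draws = bayes_boot_means(t, 50000, rng)
tbar = sum(t) / len(t)                                        # 3.0
exact_var = sum((ti - tbar) ** 2 for ti in t) / (len(t) * (len(t) + 1))
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)
```

This resampling scheme is what later literature calls the Bayesian bootstrap.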
13.5. Subsample Methods

Consider the following competitors of the empirical process for generating posterior distributions of a functional Y(P) estimated by Y(Pₙ).

(i) Subsamples: Select a random subsample t_{i₁}, ..., t_{iᵣ} of t₁, ..., tₙ where the i-th observation lies in the subsample with probability 1/2; regard k random subsample values Y(Pₙ) as a random sample from the posterior distribution of Y(P). See Hartigan (1969).

(ii) Jackknife: Divide the sample into disjoint groups of size k (randomly say). Define the i-th pseudo-value by

    Ŷᵢ = (n/k) Y(t₁, ..., tₙ) − (n/k − 1) Y(t₁, ..., tₙ less i-th group).

Act as if Y(P) is a location parameter and {Ŷᵢ} is a sample from N(Y(P), σ²). See Tukey (1958).

For example, let n = 50 and suppose Y(P) denotes the correlation of a bivariate distribution. For the empirical process, sample P₁, ..., P₅₀ from D₁, and recompute the correlation on the data values weighted by Pᵢ; obtain 3 such values. For subsamples, select 3 random subsamples each of size roughly 25. Do jackknifing with group size 25; if r₁ and r₂ are the correlations on the groups, the pseudo-values are 2r − r₁, 2r − r₂; the values 2r − r₁, 2r − r₂, 2r − (r₁ + r₂)/2 are regarded as a random sample from the posterior distribution.
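The pseudo-value computation can be sketched as follows (not from the text; the mean is used as the functional so the answer can be checked by hand, since each pseudo-value then reduces to the corresponding group mean):

```python
# Jackknife pseudo-values for a functional Y of the sample:
#   Yhat_i = g * Y(all) - (g - 1) * Y(all minus group i),
# where the n observations are split into g = n/k disjoint groups of
# size k.
def pseudo_values(data, k, Y):
    n = len(data)
    g = n // k                          # number of groups
    groups = [data[i * k:(i + 1) * k] for i in range(g)]
    y_all = Y(data)
    out = []
    for i in range(g):
        rest = [x for j, grp in enumerate(groups) if j != i for x in grp]
        out.append(g * y_all - (g - 1) * Y(rest))
    return out

mean = lambda xs: sum(xs) / len(xs)
data = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0]
pv = pseudo_values(data, 2, mean)
# for the mean, pseudo-value i is just the i-th group mean:
# groups (1,3), (2,6), (4,8) -> 2.0, 4.0, 6.0
```

For a nonlinear functional such as the correlation the pseudo-values no longer collapse this way, which is where the jackknife earns its keep.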
Each of the techniques gives 3 values, which divide the line into four intervals; in 100 repetitions, the true correlations lay in the four intervals as follows:

                          Bivariate normal ρ = .95      Mixture of normals
    Expected              25   25   25   25             25   25   25   25
    Empirical process     31   25   22   22             37   28   19   16
    Subsamples            31   23   28   20             33   23   27   17
    Jackknife             28   29   21   21             28   27   25   21

That's a bit nasty! By what accident could the humble ad hoc jackknife beat such delightful Bayesian trickery? Hartigan (1975) shows that the asymptotic inclusion probabilities are correct for the various techniques.
13.6. The Tolerance Process
If t I ' ... , tn form a sample from a continuous distribution function F, and
if t(1)' ... ,t(n) denote the order statistics, then {t(k _ I) < tIl + I < tIkI} is a
tolerance interval for tn+ I of size l/(n + 1); that is, P[t(k-l) < tn+ I < t(k)] =
1/(11 + 1), averaging over all t I ' ... , tn' tn+ I ' After all, why should tn+ I be
any particular place in the ordered sample of t I ' t 2, ... , tn' tn+ 1 ?
The tolerance process defines tn + 1 given t l ' ... , tn to be such that
P[t(k_1) < tn+1 < t(k)lt l
, .. ·,
tn]
1
= n + l'
More detailed probability statements are made as evidence accumulates.
The joint distribution of t_{n+1}, t_{n+2} given t_1, ..., t_n may be computed by
combining t_{n+1} | t_1, ..., t_n with t_{n+2} | t_1, ..., t_n, t_{n+1}; more generally
t_{n+1}, t_{n+2}, ... | t_1, ..., t_n has a certain joint distribution. Obviously, P(A) =
lim (1/n) Σ {t_i ∈ A}, so the distribution of P may be obtained from the
distribution of t_{n+1}, t_{n+2}, ...; the distribution of P is just that

F(t_(1)), F(t_(2)) - F(t_(1)), ..., F(t_(n)) - F(t_(n-1)), 1 - F(t_(n))

is D_1, or F(t_1), ..., F(t_n) is a random sample from U(0, 1).
P[t_(k) < median < t_(k+1) | t_1, ..., t_n] = (n choose k) 2^{-n} reproduces non-parametric
confidence intervals for the median.
Here the probability space on which the distribution of P is defined changes
as evidence accumulates; Hill (1968) shows that no unitary probability on
P and Y_1, ..., Y_n will reproduce these conditional probabilities, but Lane
and Sudderth (1978) show that a finitely additive probability P exists which
produces these conditional probabilities.
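The median probabilities above can be computed directly (a Python sketch, not from the book; the boundary convention t_(0) = -∞, t_(n+1) = +∞ is our assumption):

```python
# Tolerance-process probabilities for the median from n order statistics:
# P[t_(k) < median < t_(k+1)] = C(n, k) 2^{-n}, with t_(0) = -inf and
# t_(n+1) = +inf, applied to the data of Problem E1.
from math import comb

def median_interval_probs(n):
    # Probability that the median falls between successive order statistics.
    return [comb(n, k) / 2 ** n for k in range(n + 1)]

data = sorted([5, 7, 10, 11, 15])
n = len(data)
probs = median_interval_probs(n)   # [1, 5, 10, 10, 5, 1] / 32

# The central interval (t_(2), t_(4)) = (7, 11) carries probability
# (10 + 10)/32 = 0.625, the shortest interval with probability >= 1/2.
central = probs[2] + probs[3]
```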
13.7. Problems
E1. For observations 5, 7, 10, 11, 15 compute 50% confidence intervals for the median
using the empirical process, subsamples, jackknife with group size 1, and the
tolerance process.
P1. For P ~ D_1, show that Σ p_i X_i has skewness of opposite sign to that of μ - X̄.
Thus if the X's are positively skew, Σ p_i X_i tends to be less than X̄, but μ tends to be
greater than X̄.
P2. Let t = Σ a_i X_i where a_i = [(Z_i - (1/n) Σ Z_j)/Y] + (1/n), the Z_i are independent
N(0, 1), and Y is independently distributed as [n χ²_{n-1}]^{1/2}. If {X_i} is a sample
from N(μ, σ²), and the prior density for (μ, σ²) is 1/σ², show that t | X_1, ..., X_n and
μ | X_1, ..., X_n have the same distribution.
P3. Let X_1, ..., X_n be independent and symmetrically distributed about 0. Let Y_1, ...,
Y_{2^n - 1} denote the ordered means of the 2^n - 1 non-empty subsets of X_1, ..., X_n. Show that
P(Y_k < 0 < Y_{k+1}) = 2^{-n}, 1 ≤ k ≤ 2^n - 1.
E2. If X_{n+1} | X_1, ..., X_n is such that P(X_(k) ≤ X_{n+1} ≤ X_(k+1) | X_1, ..., X_n) = 1/(n + 1),
find the joint distribution of X_{n+1}, X_{n+2} | X_1, ..., X_n.
13.8. References
Blackwell, David and MacQueen, James B. (1973), Ferguson distributions via Polya
urn schemes, Annals of Statistics 1, 353-355.
Ferguson, T. S. (1973), A Bayesian analysis of some non-parametric problems, Annals
of Statistics 1, 209-230.
Hartigan, J. A. (1969), Use of subsample values as typical values, J. Am. Stat. Ass.
64, 1303-1317.
-- (1971), Error analysis by replaced samples, J. Roy. Statist. Soc. B 33, 98-110.
-- (1975), Necessary and sufficient conditions for asymptotic joint normality of a
statistic and its subsample values, Annals of Statistics 3, 573-580.
Hill, Bruce M. (1968), Posterior distributions of percentiles: Bayes theorem for sampling
from a population, J. Am. Stat. Ass. 63, 677-691.
Lane, David A. and Sudderth, William D. (1978), Diffuse models for sampling and
predictive inference, Annals of Statistics 6, 1318-1336.
Neveu, J. (1965), Mathematical Foundations of the Calculus of Probability, San
Francisco: Holden-Day.
Tukey, J. W. (1958), Bias and confidence in not-quite large samples, Ann. Math.
Statist. 29, 614.
Author Index
A
Abramowitz, M. 100, 105
B
Baranchik, A.J. 85, 93, 95
Barlow, R.E. 103, 105
Bartholomew, D.J. 105
Berger, J.O. 57, 62
Berk, R.H. 38, 43
Bernardo, J.M. 46, 50, 55
Bernoulli, J. 2, 5, 13
Blackwell, D. 128, 135
Borel, E. 7, 13
Box, G.E.P. 68, 71, 119, 126
Brandwein, A.R. 92, 95
Bremner, J.M. 105
Brown, B.M. 114, 118
Brown, L.D. 65, 71, 81, 92, 95
Brunk, H.D. 105
Buehler, R.J. 70, 71
C
Christensen, R. 45, 55
Church, A. 4, 13
Clevenson, M.L. 99, 105
Cox, D.R. 7, 13
D
Dawid, A.P. 23, 28, 33, 69, 71
DeFinetti, B. ix, 6, 7, 9, 10, 13, 14, 15, 17, 22, 40
DeRobertis, L. 119, 126
Doob, J.L. 34, 38, 43, 124, 126
Dunford, N. 15, 22
E
Efron, B. 91, 95
F
Farrell, R. 61, 62
Fedderson, A.P. 70,71
Ferguson, T.S. 57,62, 127, 128, 135
Fine, T. 20,22
Fisher, R.A. 56,62, 107
Fox, M. 65, 71
Fraser, D.A.S. 47,55
Freedman, D. 40,43,69,71
G
Gaskins, R. 102, 103, 105
Geiringer, H. 13
Good, I.J. ix, 7, 8, 9, 13, 45, 46, 55, 101, 102, 103, 105
H
Hall, P. 114, 118
Hartigan, J.A. 47, 50, 51, 55, 68, 70, 71, 115, 118, 119, 126, 132, 133, 135
Heath, D.C. 58, 61, 62
Heyde, C.C. 114, 118
Hill, B.M. 134
Hinkley, D. 7, 13
Hoerl, A.E. 92, 95
I
Ibragimov, I.A. 112, 118
J
James, W. 87, 95
Jaynes, E.T. 45, 55
Jeffreys, H. ix, x, 3, 8, 13, 15, 22, 48, 49, 50, 55, 68, 73, 74, 75, 76, 77, 78, 79, 84, 87, 95, 100
Johnson, B.M. 97, 105
Johnson, R.A. 115, 118
Johnson, W.E. 97, 105
K
Kennard, R.W. 92, 95
Keynes, J.M. ix, 2, 3, 8, 13, 15
Khasminskii, R.Z. 112, 118
Kolmogorov, A.N. ix, 4, 5, 10, 13, 15, 22, 23, 24, 33
Koopman, B.O. 20, 22
Kraft, C. 20, 21, 22
Kullback, S. 45, 55
L
Lane, D.A. 127, 134, 135
Laplace, P.S. ix, 1, 2, 8, 13
Leibler, R.A. 45, 55
Leonard, T. 102, 103, 105
Lindley, D.V. 88, 92, 95
Loeve, M. 41, 43
Loomis, L.H. 17, 22
M
MacQueen, J.B. 128, 135
Martin-Lof, M. 4, 13
Morris, C. 91, 95
N
Neveu, J. 128, 135
Neyman, J. 56, 62
O
Olshen, R.A. 69, 71
P
Pearson, E.S. 56, 62
Peers, H.W. 49, 55, 76, 83
Perks, W. 49, 55, 100, 106
Pollard, D. vii
Pratt, J. 22
Purves, R.A. 69, 71
R
Ramsey, F. 7, 13, 19, 22
Renyi, A. 15, 22, 31, 33
Robbins, H.E. 101, 102, 103, 104, 105
S
Savage, L.J. 6, 7, 9, 13, 19, 20, 22
Schwartz, J.T. 15, 22
Schwartz, L. 38, 43
Scott, D. 20, 22
Seidenberg, A. 22
Shannon, C.E. 45, 55
Simonoff, J.S. 103, 105
Smith, A.F.M. 88, 92, 95
Smith, C.A.B. 7, 13
Stegun, I.A. 100, 105
Stein, C. 65, 71, 84, 87, 95
Stone, C.J. 103, 106
Stone, M. 23, 28, 33, 47, 55, 69, 71
Strawderman, W. 87, 92, 95
Sudderth, W. 28, 33, 58, 61, 62, 127, 134, 135
T
Thatcher, A.R. 77, 83
Tiao, G.C. 68, 71, 119, 126
Tjur, T. 25, 33
Tukey, J.W. 133, 135
Tulcea, I. 128
V
Von Mises, R.V. ix, 3, 4, 13
W
Wahba, G. 103
Wald, A. 56, 57, 62, 107
Walker, A.M. 108, 118
Wallsten 7, 13
Welch, B.L. 49, 55, 76, 83
Winkler, R.L. 44, 55
Z
Zellner, A. 46, 55
Zidek, J.V. 28, 33, 99, 105
Subject Index
A
Absolute distance 48
Admissibility 56-62,63,75,76,87,
97,98,104,105
of Bayes decisions x, 56-62
various definitions of x,61-62
Analogy, Keynes uses 2
Approximable, mean- 35
square- 35
Approximating sequence 16,38
Asymptotic normality,
crude demonstration xii, 108
examples 74-79
martingale sequences xii, 113 - 115
of posterior distributions xii, 107 -118
pointwise xii, 111, 112
regularity conditions xii, 108, 109
Autoregressive process 102, 117
Axioms ix, 14-22
Kolmogorov's 5, 10, 15
of conditional probability 23, 24
B
Baire functions 40,41
Baranchik's Theorem xi, 84-86
Bayes 13
decisions 57-62,75
definition of probability 6
estimates xi, 63, 86-87, 90, 92
postulate 2
robustness of methods xii, 119 - 126
theorem x, 30
theory iii
unbiased tests xi, 65-66, 75
Bayesian law of large numbers 36
Behrens-Fisher 81,82
beta priors 76, 104
Bets 6,9
Binomial,
admissibility 61, 104
asymptotics 116, 117, 126
conditional probability for x, 31-32
convergence x, 38-39
exponential family 73
methods 76-78
priors xi, 76-79
C
Chisquares 93, 94
Clusters, multinomials with 101-102
Coherence 6
Collectives 3, 4
Complete Bayesian 101
Complete class 57
Complexity 4
Conditional Bayes decisions 58 - 59
Conditional bets 68 - 69
Conditional probability ix, 9, 23-33
axioms x,24
binomial x, 31,32
Conditionally probable 68 -69
Confidence regions xi, 67 -69
beaten for normal location 70
for binomial 76-78
for Poisson 79
not conditional bets 68-69
not unitary Bayes xi,68
prior for location and scale 80
Confidence interval 127
Conjugate priors, a chimera 72
Consistency, of posterior distributions x,
38
Constructing probabilities 3
Contingency tables xii, 103, 105, 106
Contradiction 76
Convergence x, 34-43
definitions x, 35
in distribution 35
of conditional probabilities almost
sure x, 36-38
of conditional probabilities in mean x,
35-36
Countable additivity 10
D
Decision theory x, 56-62
Degree of belief 7
Density estimation 103
Dirichlet, priors xi, 96-97, 104
process xii, 127-131
selection of xi, 100, 102
Dirichlet process 127-135
Bayes theorem for 131, 132
existence 127-130
on (0,1) 130, 131
Docile priors 74, 75, 76
Dutch book 6
E
Edgeworth expansions 78
Elementary events 1
Elephant 44
Elicitation of probabilities 44
Empirical Bayes 89
Empirical process xii, 132, 133
Empirical theories ix, 3-6
Kolmogorov 5
falsifiable models 5
Von Mises 3-4
Entropy 45
Exact Bayes estimates 63, 64
Exchangeable sequences x, 9, 10, 40, 41, 52
Exponential families xi, 72-83
prior distributions for xi, 73
Extension,
from a prespace 16
from a ring 17
F
Falsifiable models ix, 5
Fineness 20, 21
Finite additivity 10, 15, 57-61
Fisher's test 105
Frequency theory 5
neither necessary nor sufficient 9
Fubini's theorem 26
Future, can't be sure about it 2
but like the past, probably 12
G
Gambling system, impossible 3, 6
Gamma priors 79, 97, 99, 100, 104
H
Haldane's prior 31
Hellinger distance 48, 49, 50
High density region 68, 75, 80
Higher order approximations 115, 116
I
Imaginary results 8
Improper 15, 16, 28
Improper distributions, embarrassingly
frequent vii
Inadmissibility of maximum likelihood,
Poisson 99
multinomial 105
Inadmissible means 84,92
Independence, like insufficient reason 6
conditional 28
of random variables 28
Indifference, principle of 2
Indifference priors 63
Infinite axioms ix, 9-10
Information 44-45, 72, 107
Information distance 48
Insufficient reason, principle of 2
Internal point 19
Intervals, asymptotic behavior of xii,
123-125
of means xii, 120, 121
of posterior probabilities xii, 122,
123
of probabilities xii, 120
of risk xii, 121, 122
of variances xii, 122
Invariance x, 3, 47-48
inconsistency of 3, 54
priors 75, 80
J
Jackknife, beats all 133
Jeffreys density x, 48-50,68,73-79,
84, 100
L
Least squares 84
Lebesgue measure 18
Leibniz, probability 1, 15
Likelihood 31
Limit space 16-18
Location parameter, for the normal 56
Logical theories ix, 1-2
and randomness 4
Jeffreys 3, 8
Keynes 2, 8
Laplace 1-2, 8
M
Many normal means 84-95
Marginal distribution 23, 80
Marginalization paradoxes x, 23,
28-29
Markov chain 43
Martingale sequences xii, 37, 113-115, 117
Maximal learning probability 45-47,
68
Maximum likelihood 91,92,97,101,
105, 107, 116
Means, many normal xi, 84-95
mostly small xi, 89
multivariate xi, 89
random sample of xi, 89
shrinking towards xi, 88
unknown variance xi, 92
Measurable 34
Median 127
Minimum variance prior 73
Minimum variance unbiased 84
Multinomial distribution xi - xii,
96-105
maximum likelihood xi, 97
with clusters xi, 101
with similarities xii, 102
Multivariate means 91,92
Mystical methods 53
N
Neutral 19
Non-central chisquare 84
Non-parametric Bayes xii, 127-136
Non-unitary 16, 25, 26, 30, 41, 63
Normal location xi, 74-76
location and scale xi, 79-82
many means xi, 84-95
priors 75
scale 73
Nuclear war, probability of 11-12
O
Order statistics 127
P
P-Bayes 57-62
Penalized likelihood 103
Personal probability 6, 10
Pitman estimator 64, 65
not the Bayes estimator 65
Poisson xi, 73, 79, 84, 96, 101-105
inadmissibility of maximum likelihood xi, 99
two stage models xi, 101
Polya sequence 128, 135
Posterior distributions, consistency x, 38
asymptotic normality xii, 107-118
higher order approximations xii, 115
Posterior mean 86, 87, 89, 93, 94, 101
Practically certain, in interpreting frequencies 5
Prespaces ix, 16-18
Prior density 31
Probable bets ix, 18-20
Probability,
axioms ix, 14-22
betting definition 6-7
comparative ix, 20
complete 15
conditional ix, 23-33
finitely additive 15
making x, 44-55
maximum learning x, 45-47
product x, 26-27
quotient x, 27-28
space 14
theories ix, 1-13
R
Random variables ix, 18
Randomness 4
Range of probabilities xii, 125
Rational belief 2
Recursive functions 3
Regression xi, 89, 92, 103
Relatively invariant 47
Rings ix, 16-18
Robustness 119-126
S
Sample mean 56
admissibility of 60
Sample median 56
Shrinking 88
Significance tests 5, 6, 8, 90
Similarities, probabilities from vii
and probability ix, 11-12
multinomials with 102
Similarity probability 50-53
Spline methods 103
Subjective theories ix, 6-8
bad probabilities 7
betting definition 6-7
de Finetti 6-7
Good 7-9
Subsample methods xii, 133
Support 39, 59, 60
T
Tail probability 75
Tolerance process xii, 134
Tortoise 44
Two stage normal priors 76
U
Unbetworthiness 81
Unbiased,
Bayes tests xi, 65-66
location estimates xi, 63-65, 91-94
Uniform distribution 10, 15, 75, 84
on the integers 25
on the plane 26
on the square 25
Uniform integrability, generalized 35
Uniformity criteria x, 63-71
Unitary probability 10, 15, 25, 26, 30, 108, 113, 123
V
Variance 73, 74, 75
Variance, components xi, 93-94
X
Xn-limit Bayes 59, 60
Y
Yale 94