Misspecification in Infinite-Dimensional Bayesian Statistics
Author(s): B. J. K. Kleijn and A. W. van der Vaart
Source: The Annals of Statistics, Vol. 34, No. 2 (Apr., 2006), pp. 837-877
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/25463439
Accessed: 22/07/2011 07:06
The Annals of Statistics
2006, Vol. 34, No. 2, 837-877
DOI: 10.1214/009053606000000029
© Institute of Mathematical Statistics, 2006

MISSPECIFICATION IN INFINITE-DIMENSIONAL BAYESIAN STATISTICS

By B. J. K. Kleijn and A. W. van der Vaart
Vrije Universiteit Amsterdam

We consider the asymptotic behavior of posterior distributions if the model is misspecified. Given a prior distribution and a random sample from a distribution P_0, which may not be in the support of the prior, we show that the posterior concentrates its mass near the points in the support of the prior that minimize the Kullback-Leibler divergence with respect to P_0. An entropy condition and a prior-mass condition determine the rate of convergence. The method is applied to several examples, with special interest for infinite-dimensional models. These include Gaussian mixtures, nonparametric regression and parametric models.
1. Introduction. Of all criteria for statistical estimation, asymptotic consistency is among the least disputed. Consistency requires that the estimation procedure come arbitrarily close to the true, underlying distribution, if enough observations are used. It is of a frequentist nature, because it presumes a notion of an underlying, true distribution for the observations. If applied to posterior distributions, it is also considered a useful property by many Bayesians, as it could warn one away from prior distributions with undesirable, or unexpected, consequences. Priors which lead to undesirable posteriors have been documented, in particular, for non- or semiparametric models (e.g., [4, 5]), in which case it is also difficult to motivate a particular prior on purely intuitive, subjective grounds.
In the present paper we consider the situation where the posterior distribution cannot possibly be asymptotically consistent, because the model, or the prior, is misspecified. From a frequentist point of view, the relevance of studying misspecification is clear, because the assumption that the model contains the true, underlying distribution may lack realistic motivation in many practical situations. From an objective Bayesian point of view, the question is of interest, because, in principle, the Bayesian paradigm allows an unrestricted choice of prior, and, hence, we must allow for the possibility that the fixed distribution of the observations does not belong to the support of the prior. In this paper we show that in such a case the posterior will concentrate near a point in the support of the prior that is closest to the true sampling distribution as measured through the Kullback-Leibler divergence, and we give a characterization of the rate of concentration near this point.
Received March 2003; revised March 2005.
AMS 2000 subject classifications. 62G07, 62G08, 62G20, 62F05, 62F15.
Key words and phrases. Misspecification, infinite-dimensional model, posterior distribution, rate of convergence.
Throughout the paper we assume that X_1, X_2, \ldots are i.i.d. observations, each distributed according to a probability measure P_0. Given a model \mathcal{P} and a prior \Pi supported on \mathcal{P}, the posterior mass of a measurable subset B \subset \mathcal{P} is given by

(1.1)    \Pi_n(B \mid X_1, \ldots, X_n) = \int_B \prod_{i=1}^n p(X_i) \, d\Pi(P) \Big/ \int \prod_{i=1}^n p(X_i) \, d\Pi(P).

Here it is assumed that the model is dominated by a \sigma-finite measure \mu, and the density of a typical element P \in \mathcal{P} relative to the dominating measure is written p and assumed appropriately measurable.
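When the prior is supported on finitely many densities, the ratio (1.1) can be evaluated directly. The following sketch is an illustration of ours, not part of the paper: the centered-exponential truth and the normal location model are assumptions chosen for concreteness. Under this misspecified, skewed truth the posterior mode lands near the Kullback-Leibler projection, which here is the normal with the true mean 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth P0: a standard exponential shifted to have mean 0 (skewed, non-Gaussian).
x = rng.exponential(size=500) - 1.0

# Model: normal densities N(theta, 1) on a grid, with a uniform prior on the
# grid playing the role of Pi (a discrete stand-in for the prior in (1.1)).
thetas = np.linspace(-2.0, 2.0, 401)
prior = np.full(thetas.size, 1.0 / thetas.size)

# Log-likelihood of each model element; (1.1) becomes a ratio of finite sums.
loglik = -0.5 * ((x[:, None] - thetas[None, :]) ** 2).sum(axis=0)
logpost = loglik + np.log(prior)
logpost -= logpost.max()              # stabilize before exponentiating
posterior = np.exp(logpost)
posterior /= posterior.sum()

# The KL projection of P0 onto {N(theta,1)} is theta* = E[X] = 0.
theta_hat = thetas[np.argmax(posterior)]
print(theta_hat)
```
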
If we assume that the model is well specified, that is, P_0 \in \mathcal{P}, then posterior consistency means that the posterior distributions concentrate an arbitrarily large fraction of their total mass in arbitrarily small neighborhoods of P_0, if the number of observations used to determine the posterior is large enough. To formalize this, we let d be a metric on \mathcal{P} and say that the Bayesian procedure for the specified prior is consistent if, for every \varepsilon > 0, \Pi_n(\{P : d(P, P_0) > \varepsilon\} \mid X_1, \ldots, X_n) \to 0 in P_0-probability. More specific information concerning the asymptotic behavior of an estimator is given by its rate of convergence. Let \varepsilon_n be a sequence that decreases to zero and suppose that, for any constants M_n \to \infty,

(1.2)    \Pi_n(P \in \mathcal{P} : d(P, P_0) > M_n \varepsilon_n \mid X_1, \ldots, X_n) \to 0,

in P_0-probability. The sequence \varepsilon_n corresponds to a decreasing sequence of neighborhoods of P_0, the d-radius of which goes to zero with n, while still capturing most of the posterior mass. If (1.2) is satisfied, then we say that the rate of convergence is at least \varepsilon_n.
If P_0 is at a positive distance from the model \mathcal{P} and the prior concentrates all its mass on \mathcal{P}, then the posterior is inconsistent, as it will concentrate all its mass on \mathcal{P} as well. However, in this paper we show that the posterior will still settle down near a given measure P* \in \mathcal{P}, and we shall characterize the sequences \varepsilon_n such that the preceding display is valid with d(P, P*) taking the place of d(P, P_0). One would expect the posterior to concentrate its mass near minimum Kullback-Leibler points, since asymptotically the likelihood \prod_{i=1}^n p(X_i) is maximal near the points of minimal Kullback-Leibler divergence. The integrand in the numerator of (1.1) is the likelihood, so subsets of the model in which the (log-)likelihood is large account for a large fraction of the total posterior mass. Hence, it is no great surprise that the appropriate point of convergence P* is a minimum Kullback-Leibler point in \mathcal{P}, but the general issue of rates (and which metric d to use) turns out to be more complicated than expected. We follow the work by Ghosal, Ghosh and van der Vaart [8] for the well-specified situation, but need to adapt, change or extend many steps.
After deriving general results, we consider several examples in some detail, including Bayesian fitting of Gaussian mixtures using Dirichlet priors on the mixing distribution, the regression problem and parametric models. Our results on the regression problem allow one, for instance, to conclude that a Bayesian approach in the nonparametric regression problem that uses a prior on the regression function, but employs a normal distribution for the errors, will lead to consistent estimation of the regression function, even if the regression errors are non-Gaussian. This result, which is the Bayesian counterpart of the well-known fact that least squares estimators (the maximum likelihood estimators if the errors are Gaussian) perform well even if the errors are non-Gaussian, is important to validate the Bayesian approach to regression, but appears to have received little attention in the literature.
1.1. Notation and organization. Let L_1(\mathcal{X}, \mathcal{A}) denote the set of all finite signed measures on (\mathcal{X}, \mathcal{A}) and let conv(\mathcal{Q}) be the convex hull of a set of measures \mathcal{Q}: the set of all finite linear combinations \sum_i \lambda_i Q_i for Q_i \in \mathcal{Q} and \lambda_i \ge 0 with \sum_i \lambda_i = 1. For a measurable function f, let Qf denote the integral \int f \, dQ.

The paper is organized as follows. Section 2 contains the main results of the paper, in increasing generality. Sections 3, 4 and 5 concern the three classes of examples that we consider: mixtures, the regression model and parametric models. Sections 6 and 7 contain the proofs of the main results, where the necessary results on tests are developed in Section 6 and are of independent interest. The final section is a technical appendix.
2. Main results. Let X_1, X_2, \ldots be an i.i.d. sample from a distribution P_0 on a measurable space (\mathcal{X}, \mathcal{A}). Given a collection \mathcal{P} of probability distributions on (\mathcal{X}, \mathcal{A}) and a prior probability measure \Pi on \mathcal{P}, the posterior measure is defined as in (1.1) (where 0/0 = 0 by definition). Here it is assumed that the "model" \mathcal{P} is dominated by a \sigma-finite measure \mu and that p is a density of P \in \mathcal{P} relative to \mu such that the map (x, p) \mapsto p(x) is measurable relative to the product of \mathcal{A} and an appropriate \sigma-field on \mathcal{P}, so that the right-hand side of (1.1) is a measurable function of (X_1, \ldots, X_n) and a probability measure as a function of B, for every X_1, \ldots, X_n such that the denominator is positive. The "true" distribution P_0 may or may not belong to the model \mathcal{P}. For simplicity of notation, we assume that P_0 possesses a density p_0 relative to \mu as well.
Informally we think of the model \mathcal{P} as the "support" of the prior \Pi, but we shall not make this precise in a topological sense. At this point we only assume that the prior concentrates on \mathcal{P} in the sense that \Pi(\mathcal{P}) = 1 (but we note later that this too can be relaxed). Further requirements are made in the statements of the main results. Our main theorems implicitly assume the existence of a point P* \in \mathcal{P} minimizing the Kullback-Leibler divergence of P_0 to the model \mathcal{P}. In particular, the minimal Kullback-Leibler divergence is assumed to be finite, that is, P* satisfies

(2.1)    -P_0 \log \frac{p^*}{p_0} < \infty.

By the convention that \log 0 = -\infty, the above implies that P_0 \ll P* and, hence, that the density p* is strictly positive at the observations; we may assume this without loss of generality.
Our theorems give sufficient conditions for the posterior distribution to concentrate in neighborhoods of P* at a rate that is determined by the amount of prior mass "close to" the minimal Kullback-Leibler point P* and the "entropy" of the model. To specify the terms between quotation marks, we make the following definitions.

We define the entropy and the neighborhoods in which the posterior is to concentrate its mass relative to a semi-metric d on \mathcal{P}. The general results are formulated relative to an arbitrary semi-metric and next the conditions will be simplified for more specific choices. Whether or not these simplifications can be made depends on the model \mathcal{P}, convexity being an important special case (see Lemma 2.2). Unlike in the case of well-specified priors, considered, for example, in [8], the Hellinger distance is not always appropriate in the misspecified situation. The general entropy bound is formulated in terms of a covering number for testing under misspecification, defined as follows: for \varepsilon > 0, we define N_t(\varepsilon, \mathcal{P}, d; P_0, P*) as the minimal number N of convex sets B_1, \ldots, B_N of probability measures on (\mathcal{X}, \mathcal{A}) needed to cover the set \{P \in \mathcal{P} : \varepsilon < d(P, P*) < 2\varepsilon\} such that, for every i,

(2.2)    \inf_{0 < \alpha < 1} \sup_{P \in B_i} -\log P_0 \Big(\frac{p}{p^*}\Big)^{\alpha} \ge \frac{\varepsilon^2}{4}.

If there is no finite covering of this type, we define the covering number to be infinite. We refer to the logarithms \log N_t(\varepsilon, \mathcal{P}, d; P_0, P*) as entropy numbers for testing under misspecification. These numbers differ from ordinary metric entropy numbers in that the covering sets B_i are required to satisfy the preceding display rather than to be balls of radius \varepsilon. We insist that the sets B_i be convex and that (2.2) hold for every P \in B_i. This implies that (2.2) may involve measures P that do not belong to the model \mathcal{P} if this is not convex itself.

For \varepsilon > 0, we define a specific kind of Kullback-Leibler neighborhood of P* by

(2.3)    B(\varepsilon, P^*; P_0) = \Big\{P \in \mathcal{P} : -P_0 \log \frac{p}{p^*} \le \varepsilon^2, \; P_0 \Big(\log \frac{p}{p^*}\Big)^2 \le \varepsilon^2\Big\}.
THEOREM 2.1. For a given model \mathcal{P}, prior \Pi on \mathcal{P} and some P* \in \mathcal{P}, assume that -P_0 \log(p^*/p_0) < \infty and P_0(p/p^*) < \infty for all P \in \mathcal{P}. Suppose that there exist a sequence of strictly positive numbers \varepsilon_n with \varepsilon_n \to 0 and n\varepsilon_n^2 \to \infty and a constant L > 0, such that, for all n,

(2.4)    \Pi(B(\varepsilon_n, P^*; P_0)) \ge e^{-L n \varepsilon_n^2},

(2.5)    N_t(\varepsilon, \mathcal{P}, d; P_0, P^*) \le e^{n \varepsilon_n^2}    for all \varepsilon > \varepsilon_n.

Then, for every sufficiently large constant M, as n \to \infty,

(2.6)    \Pi_n(P \in \mathcal{P} : d(P, P^*) > M\varepsilon_n \mid X_1, \ldots, X_n) \to 0    in L_1(P_0^n).
The proof of this theorem is given in Section 7. The theorem does not explicitly require that P* be a point of minimal Kullback-Leibler divergence, but this is implied by the conditions (see Lemma 6.4 below). The theorem is extended to the case of nonunique minimal Kullback-Leibler points in Section 2.4.

The two main conditions of Theorem 2.1 are a prior mass condition (2.4) and an entropy condition (2.5), which can be compared to Schwartz' conditions for posterior consistency in the well-specified situation (see [13]), or to the two main conditions in [8]. Below we discuss the background of these conditions in turn.
The prior mass condition (2.4) reduces to the corresponding condition for the correctly specified case in [8] if P* = P_0. Because -P_0 \log(p^*/p_0) < \infty, we may rewrite the first inequality in the definition (2.3) of the set B(\varepsilon, P^*; P_0) as

-P_0 \log \frac{p}{p_0} \le -P_0 \log \frac{p^*}{p_0} + \varepsilon^2.

Therefore, the set B(\varepsilon, P^*; P_0) contains only P \in \mathcal{P} with a Kullback-Leibler divergence with respect to P_0 that is within \varepsilon^2 of the minimal Kullback-Leibler divergence over the model. The lower bound (2.4) on the prior mass of B(\varepsilon, P^*; P_0) requires that the prior measure assign a certain share of its total mass to Kullback-Leibler neighborhoods of P*. As argued in [8], a rough understanding of the exact form of (2.4) for the "optimal" rate \varepsilon_n is that an optimal prior spreads its mass "uniformly" over \mathcal{P}. In the proof of Theorem 2.1, the prior mass condition serves to lower-bound the denominator in the expression for the posterior.
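For a concrete feel for the neighborhoods B(\varepsilon, P^*; P_0) of (2.3), the following sketch is a toy example of ours (not from the paper): P_0 = N(0,2) and the model \{N(\theta,1)\}, so that P^* = N(0,1). Both moments in (2.3) have closed forms here, \theta^2/2 and 2\theta^2 + \theta^4/4, which numerical integration recovers; for \varepsilon = 0.1 the neighborhood is roughly the interval |\theta| \le \varepsilon/\sqrt{2}.

```python
import numpy as np

# Members of B(eps, P*; P0) from (2.3) for P0 = N(0,2), model {N(theta,1)},
# KL projection P* = N(0,1).  Both defining moments are computed by
# numerical integration on a grid.
xs = np.linspace(-20, 20, 400001)
dx = xs[1] - xs[0]
p0 = np.exp(-xs**2 / 4) / np.sqrt(4 * np.pi)   # N(0,2) density

def log_ratio(theta):                           # log(p_theta / p*)
    return -0.5 * (xs - theta) ** 2 + 0.5 * xs**2

eps = 0.1
thetas = np.linspace(-0.5, 0.5, 1001)
in_B = []
for th in thetas:
    lr = log_ratio(th)
    kl_excess = -np.sum(p0 * lr) * dx           # -P0 log(p/p*)  (= theta^2/2)
    second = np.sum(p0 * lr**2) * dx            # P0 (log(p/p*))^2
    in_B.append(kl_excess <= eps**2 and second <= eps**2)
halfwidth = thetas[np.array(in_B)].max()
print(halfwidth)
```

Here the second-moment constraint is the binding one, which is one way to see that (2.3) is genuinely smaller than a plain Kullback-Leibler ball.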
The background of the entropy condition (2.5) is more involved, but it can be compared to a corresponding condition in the well-specified situation given in Theorem 2.1 of [8]. The purpose of the entropy condition is to measure the complexity of the model, a larger entropy leading to a slower rate of convergence. The entropy used in [8] is either the ordinary metric entropy \log N(\varepsilon, \mathcal{P}, d), or the local entropy \log N(\varepsilon/2, \{P \in \mathcal{P} : \varepsilon < d(P, P_0) < 2\varepsilon\}, d). For d the Hellinger distance, the minimal \varepsilon_n satisfying \log N(\varepsilon_n, \mathcal{P}, d) = n\varepsilon_n^2 is roughly the fastest rate of convergence for estimating a density in the model \mathcal{P} relative to d obtainable by any method of estimation (cf. [2]). We are not aware of a concept of "optimal rate of convergence" if the model is misspecified, but a rough interpretation of (2.5) given (2.4) would be that in the misspecified situation the posterior concentrates near the closest Kullback-Leibler point at the optimal rate pertaining to the model \mathcal{P}.
Misspecification requires that the complexity of the model be measured in a different, somewhat complicated way. In examples, depending on the semimetric d, the covering numbers N_t(\varepsilon, \mathcal{P}, d; P_0, P*) can be related to ordinary metric covering numbers N(\varepsilon, \mathcal{P}, d). For instance, we show below (see Lemmas 2.1-2.3) that, if the model \mathcal{P} is convex, then the numbers N_t(\varepsilon, \mathcal{P}, d; P_0, P*) are bounded by the covering numbers N(\varepsilon, \mathcal{P}, d) if the distance d(P_1, P_2) equals the Hellinger distance between the measures Q_i defined by dQ_i = (p_0/p^*) \, dP_i, that is, a weighted Hellinger distance between P_1 and P_2.
In the well-specified situation we have P* = P_0, and the entropy numbers for testing can be bounded above by ordinary entropy numbers for the Hellinger distance. Thus, Theorem 2.1 becomes a refinement of the main theorem of [8]. To see this, we first note that, since -\log x \ge 1 - x for x > 0,

-\log P_0 \sqrt{\frac{p}{p_0}} \ge 1 - \int \sqrt{p p_0} \, d\mu = \tfrac{1}{2} h^2(p, p_0),

so the left-hand side of (2.2) with \alpha = 1/2 and p_0 = p^* is bounded below by \inf_{P \in B_i} \tfrac{1}{2} h^2(p, p_0). Because Hellinger balls are convex, they are eligible candidates for the sets B_i required in the definition of the covering numbers for testing. If we cover the set \{P \in \mathcal{P} : 2\varepsilon \le h(P, P_0) \le 4\varepsilon\} by a minimal set of Hellinger balls of radius \varepsilon/2, then these balls automatically satisfy (2.2), by the triangle inequality. It follows that

N_t(\varepsilon, \mathcal{P}, h; P_0, P_0) \le N(\varepsilon/2, \{P \in \mathcal{P} : 2\varepsilon \le h(P, P_0) \le 4\varepsilon\}, h),

a local covering number of the type used by [8].
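The inequality -\log P_0 \sqrt{p/p_0} \ge \tfrac{1}{2} h^2(p, p_0) used above can be checked numerically. The sketch below (two unit-variance normals, an assumed example of ours) computes both sides by Riemann integration; for normals the affinity \int \sqrt{p p_0}\,d\mu also has the closed form e^{-\mu^2/8}.

```python
import numpy as np

# Check -log P0 sqrt(p/p0) >= (1/2) h^2(p, p0) in the well-specified case
# p* = p0, for p = N(1.5, 1) against p0 = N(0, 1), by numerical integration.
xs = np.linspace(-12, 12, 240001)
dx = xs[1] - xs[0]

def norm_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

p0 = norm_pdf(xs, 0.0)
p = norm_pdf(xs, 1.5)

affinity = np.sum(np.sqrt(p * p0)) * dx             # int sqrt(p p0) dmu
lhs = -np.log(affinity)                             # -log P0 sqrt(p/p0)
h2 = np.sum((np.sqrt(p) - np.sqrt(p0)) ** 2) * dx   # squared Hellinger distance
print(lhs, 0.5 * h2)
```
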
Because the covering numbers for testing allow general convex sets rather than balls of a given diameter, they appear to be genuinely smaller in general than local covering numbers. (Notably, the convex sets need not satisfy a size restriction.) However, we do not know of any examples of the setting we are interested in where such gains can be realized, so that in the well-specified case there appears to be no use for the extended covering numbers: the right-hand side of the preceding display is exactly a local entropy of the type used in [8], so in the well-specified situation they coincide. In the general, misspecified situation the covering numbers as defined by (2.2) are essential, even for standard parametric models, such as the one-dimensional normal location model.
At a more technical level, the entropy condition of [8] ensures the existence of certain tests of the measures P versus the true measure P_0. In the misspecified case it is necessary to compare the measures P to the minimal Kullback-Leibler point P*, rather than P_0. It turns out that the appropriate comparison is not a test of the measures P versus the measure P* in the ordinary sense of testing, but a test of the measures P versus the measure Q(P) defined by dQ(P) = (p_0/p^*) \, dP [see (7.4)]. With \mathcal{Q} the set of measures Q(P) where P ranges over \mathcal{P}, this leads to consideration of minimax testing risks of the type

\inf_{\phi} \sup_{Q \in \mathcal{Q}} \big(P_0^n \phi + Q^n(1 - \phi)\big),

where the infimum is taken over all measurable functions \phi taking values in [0, 1]. A difference with the usual results on minimax testing risks is that the measures Q may not be probability measures (and may in fact be infinite in general). Extending arguments of Le Cam and Birge, we show in Section 6 that, for a convex set \mathcal{B}, the minimax testing risk in the preceding display is bounded above by

(2.7)    \inf_{0 < \alpha < 1} \sup_{Q \in \mathcal{B}} \rho_{\alpha}(P_0, Q)^n,
where \alpha \mapsto \rho_\alpha(P_0, Q) is the Hellinger transform

\rho_\alpha(P, Q) = \int p^{\alpha} q^{1-\alpha} \, d\mu.

For Q = Q(P), the Hellinger transform reduces to the function \alpha \mapsto \rho_{1-\alpha}(P_0, Q(P)) = P_0 (p/p^*)^{\alpha}.

If the inequality in (2.2) is satisfied, then P_0(p/p^*)^{\alpha} \le e^{-\varepsilon^2/4} and, hence, the set of measures Q(P) with P ranging over B_i can be tested with error probabilities bounded by e^{-n\varepsilon^2/4}, for the \varepsilon also encountered in (2.2). For \varepsilon bounded away from zero, or converging slowly to zero, these probabilities are exponentially small, ensuring that the posterior does not concentrate on the "unlikely alternatives" B_i.

The testing bound (2.7) is valid for convex alternatives, but the alternatives of interest \{P \in \mathcal{P} : d(P, P^*) > M\varepsilon\} are complements of balls and, hence, typically not convex. A test function for nonconvex alternatives can be constructed using a covering of \mathcal{P} by convex sets. The entropy condition (2.5) controls the size of this cover and, hence, the rate of convergence in misspecified situations is determined by the covering numbers N_t(\varepsilon, \mathcal{P}, d; P_0, P^*).

Because the validity of the theorem relies only on the existence of suitable tests, the entropy condition (2.5) could be replaced by a testing condition. To be precise, condition (2.5) can be replaced by the condition that the conclusion of Theorem 6.3 is satisfied with D(\varepsilon) \le e^{n\varepsilon_n^2}.
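The role of \alpha in (2.2) and (2.7) can be made concrete with a toy misspecified setting of our own (an assumed example, not from the paper): P_0 = N(0,1), model \{N(\mu, 1/4)\}, so P^* = N(0, 1/4), and alternative P = N(2, 1/4). Then \log(p/p^*) = 8x - 8 and g(\alpha) = P_0(p/p^*)^{\alpha} = e^{32\alpha^2 - 8\alpha}; the choice \alpha = 1/2 is useless here since g(1/2) > 1, yet \inf_\alpha g(\alpha) = e^{-1/2} < 1, giving an exponentially small testing error e^{-n/2}.

```python
import numpy as np

# Toy misspecified setting: P0 = N(0,1), model {N(mu, 1/4)}, KL projection
# P* = N(0, 1/4), alternative P = N(2, 1/4).  Then log(p/p*) = 8x - 8 and
# g(alpha) = P0 (p/p*)^alpha = exp(32 alpha^2 - 8 alpha) in closed form.
xs = np.linspace(-30, 40, 700001)
dx = xs[1] - xs[0]
p0 = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)
log_ratio = 8 * xs - 8                          # log(p/p*)

alphas = np.linspace(0.01, 0.99, 197)
g = np.array([np.sum(p0 * np.exp(a * log_ratio)) * dx for a in alphas])

# alpha = 1/2 fails (g > 1), but a small alpha yields a testing exponent.
i_half = np.argmin(np.abs(alphas - 0.5))
print(g[i_half], g.min(), alphas[np.argmin(g)])
```
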
2.1. Distances and testing entropy. Because the entropies for testing are somewhat abstract, it is useful to relate them to ordinary entropy numbers. For our examples, the bound given by the following lemma is useful. We assume that, for some fixed constants c, C > 0 and for every m \in \mathbb{N}, \lambda_1, \ldots, \lambda_m \ge 0 with \sum_i \lambda_i = 1 and every P, P_1, \ldots, P_m \in \mathcal{P} with d(P, P_i) \le c\,d(P, P^*) for all i,

(2.8)    \sum_i \lambda_i d^2(P_i, P^*) - C \sum_i \lambda_i d^2(P_i, P) \le \sup_{0 < \alpha < 1} -\log P_0 \Big(\frac{\sum_i \lambda_i p_i}{p^*}\Big)^{\alpha}.

LEMMA 2.1. If (2.8) holds, then there exists a constant A > 0 depending only on c and C such that, for all \varepsilon > 0, N_t(\varepsilon, \mathcal{P}, d; P_0, P^*) \le N(A\varepsilon, \{P \in \mathcal{P} : \varepsilon \le d(P, P^*) \le 2\varepsilon\}, d). [Any constant A \le (1/8) \wedge (1/(4\sqrt{C})) \wedge (c/2) works.]
PROOF. For a given constant A > 0, we can cover the set \mathcal{P}_\varepsilon := \{P \in \mathcal{P} : \varepsilon \le d(P, P^*) \le 2\varepsilon\} with N = N(A\varepsilon, \mathcal{P}_\varepsilon, d) balls of radius A\varepsilon. If the centers of these balls are not contained in \mathcal{P}_\varepsilon, then we can replace these N balls by N balls of radius 2A\varepsilon with centers in \mathcal{P}_\varepsilon whose union also covers the set \mathcal{P}_\varepsilon. It suffices to show that (2.2) is valid for B_i equal to the convex hull of a typical ball B in this cover. Choose 2A \le c. If P \in \mathcal{P}_\varepsilon is the center of B and P_i \in B for every i, then d(P_i, P^*) \ge d(P, P^*) - 2A\varepsilon \ge \varepsilon - 2A\varepsilon by the triangle inequality and, hence, by assumption (2.8), the left-hand side of (2.2) with B_i = conv(B) is bounded below by \sum_i \lambda_i((\varepsilon - 2A\varepsilon)^2 - C(2A\varepsilon)^2). This is bounded below by \varepsilon^2/4 for sufficiently small A.
The logarithms \log N(A\varepsilon, \{P \in \mathcal{P} : \varepsilon \le d(P, P^*) \le 2\varepsilon\}, d) of the covering numbers in the preceding lemma are called "local entropy numbers" and also measure the Le Cam dimension of the model \mathcal{P} relative to the semi-metric d. They are bounded above by the simpler ordinary entropy numbers \log N(A\varepsilon, \mathcal{P}, d). The preceding lemma shows that the entropy condition (2.5) can be replaced by the ordinary entropy condition \log N(\varepsilon_n, \mathcal{P}, d) \le n\varepsilon_n^2 whenever the semi-metric d satisfies (2.8).

If we evaluate (2.8) with m = 1 and P_1 = P, then we obtain, for every P \in \mathcal{P},

(2.9)    d^2(P, P^*) \le \sup_{0 < \alpha < 1} -\log P_0 \Big(\frac{p}{p^*}\Big)^{\alpha}.

(Up to a factor 16, this inequality is also implied by finiteness of the covering numbers for testing.) This simpler condition gives an indication about the metrics d that may be used in combination with ordinary entropy. In Lemma 2.2 we show that if d and the model \mathcal{P} are convex, then the simpler condition (2.9) is equivalent to (2.8).
Because -\log x \ge 1 - x for every x > 0, we can further simplify by bounding minus the logarithm in the right-hand side by 1 - P_0(p/p^*)^{\alpha}. This yields the bound

d^2(P, P^*) \le \sup_{0 < \alpha < 1} \big[1 - P_0 (p/p^*)^{\alpha}\big].

In the well-specified situation we have P_0 = P^* and the right-hand side for \alpha = 1/2 becomes 1 - \int \sqrt{p p_0} \, d\mu, which is 1/2 times the squared Hellinger distance between P and P_0. In misspecified situations this method of lower bounding can be useless, as 1 - P_0(p/p^*)^{\alpha} may be negative for \alpha = 1/2. On the other hand, a small value of \alpha may be appropriate, as it can be shown that, as \alpha \downarrow 0, the expression 1 - P_0(p/p^*)^{\alpha} is proportional to the difference of Kullback-Leibler divergences P_0 \log(p^*/p), which is positive by the definition of P^*. If this approximation can be made uniform in p, then a semi-metric d which is bounded above by the Kullback-Leibler divergence can be used in the main theorem. We discuss this further in Section 6 and use it in the examples of Sections 4 and 5.
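The \alpha \downarrow 0 behavior can be observed numerically in the same assumed toy setting as before (P_0 = N(0,1), P^* = N(0,1/4), P = N(2,1/4), so that P_0 \log(p^*/p) = 8): the ratio (1 - P_0(p/p^*)^{\alpha})/\alpha increases toward the Kullback-Leibler difference 8 as \alpha decreases.

```python
import numpy as np

# As alpha -> 0, (1 - P0 (p/p*)^alpha)/alpha -> P0 log(p*/p), the difference
# of Kullback-Leibler divergences.  Here log(p/p*) = 8x - 8 under P0 = N(0,1),
# so the limit is P0 log(p*/p) = 8.
xs = np.linspace(-30, 30, 600001)
dx = xs[1] - xs[0]
p0 = np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi)
log_ratio = 8 * xs - 8

vals = []
for a in [0.1, 0.01, 0.001]:
    g = np.sum(p0 * np.exp(a * log_ratio)) * dx   # P0 (p/p*)^alpha
    vals.append((1 - g) / a)
print(vals)
```
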
The case of convex models \mathcal{P} is of interest, in particular, for non- or semiparametric models, and permits some simplification. For a convex model, the point of minimal Kullback-Leibler divergence (if it exists) is automatically unique (up to a redefinition on a null-set of P_0). Moreover, the expectations P_0(p/p^*) are automatically finite, as required in Theorem 2.1, and condition (2.8) is satisfied for a weighted Hellinger metric. We show this in Lemma 2.3, after first showing that the validity of the simpler lower bound (2.9) on the convex hull of \mathcal{P} (if the semi-metric d is defined on this convex hull) implies the bound (2.8).

LEMMA 2.2. If d is defined on the convex hull of \mathcal{P}, the maps P \mapsto d^2(P, P') are convex on conv(\mathcal{P}) for every P' \in \mathcal{P} and (2.9) is valid for every P in the convex hull of \mathcal{P}, then (2.8) is satisfied for \tfrac{1}{2}d instead of d.
LEMMA 2.3. If \mathcal{P} is convex and P^* \in \mathcal{P} is a point at minimal Kullback-Leibler divergence with respect to P_0, then P_0(p/p^*) \le 1 for every P \in \mathcal{P} and (2.8) is satisfied with

d^2(P_1, P_2) = \tfrac{1}{2} \int (\sqrt{p_1} - \sqrt{p_2})^2 \frac{p_0}{p^*} \, d\mu.

PROOF OF LEMMA 2.2. For the proof of Lemma 2.2, we repeatedly apply the triangle inequality to find

\sum_i \lambda_i d^2(P_i, P^*) \le 2 \sum_i \lambda_i d^2(P_i, P) + 2 d^2(P, P^*)
    \le 2 \sum_i \lambda_i d^2(P_i, P) + 4 d^2\Big(P, \sum_i \lambda_i P_i\Big) + 4 d^2\Big(\sum_i \lambda_i P_i, P^*\Big)
    \le 6 \sum_i \lambda_i d^2(P_i, P) + 4 d^2\Big(\sum_i \lambda_i P_i, P^*\Big),

by the convexity of d^2. It follows that d^2(\sum_i \lambda_i P_i, P^*) \ge (1/4) \sum_i \lambda_i d^2(P_i, P^*) - (3/2) \sum_i \lambda_i d^2(P_i, P). If (2.9) holds for P = \sum_i \lambda_i P_i, then we obtain (2.8) with d^2 replaced by d^2/4 and C = 6.
PROOF OF LEMMA 2.3. For P \in \mathcal{P}, define a family of convex combinations \{P_\lambda : \lambda \in [0, 1]\} \subset \mathcal{P} by P_\lambda = \lambda P + (1 - \lambda) P^*. For all values of \lambda \in [0, 1],

(2.10)    0 \le f(\lambda) := -P_0 \log \frac{p_\lambda}{p^*} = -P_0 \log\Big(1 + \lambda \Big(\frac{p}{p^*} - 1\Big)\Big),

since P^* is at minimal Kullback-Leibler divergence with respect to P_0 in \mathcal{P} by assumption. For every fixed y > 0, the function \lambda \mapsto \log(1 + \lambda y)/\lambda is nonnegative and increases monotonically to y as \lambda \downarrow 0. The function is bounded in absolute value by 2 for y \in [-1, 0] and \lambda \le \tfrac{1}{2}. Therefore, by the monotone and dominated convergence theorems applied to the positive and negative parts of the integrand in the right-hand side of (2.10),

f'(0+) = 1 - P_0 \frac{p}{p^*}.

Combining the fact that f(0) = 0 with (2.10), we see that f'(0+) \ge 0 and, hence, we find P_0(p/p^*) \le 1. The first assertion of Lemma 2.3 now follows.

For the proof that (2.9) is satisfied, we first note that -\log x \ge 1 - x, so that it suffices to show that 1 - P_0(p/p^*)^{1/2} \ge d^2(P, P^*). Now

2 d^2(P, P^*) = \int \Big(\sqrt{\frac{p}{p^*}} - 1\Big)^2 p_0 \, d\mu = 1 + P_0 \frac{p}{p^*} - 2 P_0 \sqrt{\frac{p}{p^*}} \le 2 - 2 P_0 \sqrt{\frac{p}{p^*}},

by the first part of the proof.
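Lemma 2.3 can be checked numerically on a small convex model (an assumed example of ours): the segment of two-component normal mixtures P_\lambda = (1-\lambda)N(-2,1) + \lambda N(2,1) with true P_0 = N(0,1), for which the Kullback-Leibler projection is P^* = P_{1/2} by symmetry. The sketch verifies P_0(p/p^*) \le 1 and 1 - P_0\sqrt{p/p^*} \ge d^2(P, P^*) for the weighted Hellinger distance of the lemma. (On this segment both hold with near equality, since P \mapsto P_0(p/p^*) is affine and equals 1 at both endpoints.)

```python
import numpy as np

# Check Lemma 2.3 on a convex model: P_lam = (1-lam) N(-2,1) + lam N(2,1),
# true P0 = N(0,1); by symmetry the KL projection is P* = P_{1/2}.
xs = np.linspace(-15, 15, 300001)
dx = xs[1] - xs[0]

def phi(x, m):
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

p0 = phi(xs, 0.0)

def mix(lam):
    return (1 - lam) * phi(xs, -2.0) + lam * phi(xs, 2.0)

pstar = mix(0.5)

for lam in np.linspace(0.0, 1.0, 21):
    p = mix(lam)
    ratio_mean = np.sum(p0 * p / pstar) * dx           # P0 (p/p*)
    root_mean = np.sum(p0 * np.sqrt(p / pstar)) * dx   # P0 sqrt(p/p*)
    # weighted Hellinger distance of Lemma 2.3
    d2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(pstar)) ** 2 * p0 / pstar) * dx
    assert ratio_mean <= 1 + 1e-6
    assert 1 - root_mean >= d2 - 1e-6
print("Lemma 2.3 checks passed")
```
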
2.2. Extensions. In this section we give some generalizations of Theorem 2.1. Theorem 2.2 enables us to prove that optimal rates are achieved in parametric models. Theorem 2.3 extends Theorem 2.1 to situations in which the model, the prior and the point P^* are dependent on n. Third, we consider the case in which the priors \Pi_n assign a mass slightly less than 1 to the models \mathcal{P}_n.

Theorem 2.1 does not give the optimal rate of convergence 1/\sqrt{n} for finite-dimensional models \mathcal{P}, both because the choice \varepsilon_n = 1/\sqrt{n} is excluded (by the condition n\varepsilon_n^2 \to \infty) and because the prior mass condition is too restrictive. The following theorem remedies this, but is more complicated. The adapted prior mass condition takes the following form: for all natural numbers n and j,

(2.11)    \frac{\Pi(P \in \mathcal{P} : j\varepsilon_n < d(P, P^*) \le 2j\varepsilon_n)}{\Pi(B(\varepsilon_n, P^*; P_0))} \le e^{n \varepsilon_n^2 j^2 / 8}.
THEOREM 2.2. For a given model \mathcal{P}, prior \Pi on \mathcal{P} and some P^* \in \mathcal{P}, assume that -P_0 \log(p^*/p_0) < \infty and P_0(p/p^*) < \infty for all P \in \mathcal{P}. If \varepsilon_n are strictly positive numbers with \varepsilon_n \to 0 and \liminf_{n \to \infty} n\varepsilon_n^2 > 0, such that (2.11) and (2.5) are satisfied, then, for every sequence M_n \to \infty, as n \to \infty,

(2.12)    \Pi_n(P \in \mathcal{P} : d(P, P^*) > M_n \varepsilon_n \mid X_1, \ldots, X_n) \to 0    in L_1(P_0^n).
There appears to be no compelling reason to choose the model \mathcal{P} and the prior the same for each n. The validity of the preceding theorems does not depend on this. We formalize this fact in the following theorem. For every n, we let \mathcal{P}_n be a set of probability measures on (\mathcal{X}, \mathcal{A}) with densities p_n relative to a \sigma-finite measure \mu_n on this space. Given a prior \Pi_n on an appropriate \sigma-field, we define the posterior by (1.1), with P^* and \Pi replaced by P_n^* and \Pi_n.

THEOREM 2.3. The preceding theorems remain valid if \mathcal{P}, \Pi, P^* and d depend on n, but satisfy the given conditions for each n (for a single constant L).
As a final extension, we note that the assertion P_0^n \Pi_n(P \in \mathcal{P}_n : d_n(P, P_n^*) > M_n\varepsilon_n \mid X_1, \ldots, X_n) \to 0 of the preceding theorems remains valid even if the priors \Pi_n do not put all their mass on the "models" \mathcal{P}_n (but the models \mathcal{P}_n do satisfy the entropy condition). Of course, in such cases the posterior puts mass outside the model and it is desirable to complement the above assertion with the assertion that \Pi_n(\mathcal{P}_n \mid X_1, \ldots, X_n) \to 1 in L_1(P_0^n). The latter is certainly true if the priors put only very small fractions of their mass outside the models \mathcal{P}_n. More precisely, the latter assertion is true if

(2.13)    P_0^n \int_{\mathcal{P}_n^c} \prod_{i=1}^n \frac{p}{p_n^*}(X_i) \, d\Pi_n(P) \le o(e^{-2n\varepsilon_n^2}) \, \Pi_n(B(\varepsilon_n, P_n^*; P_0)).

This observation is not relevant for the examples in the present paper. However, it may prove relevant to alleviate the entropy conditions in the preceding theorems in certain situations. These conditions limit the complexity of the models and it seems reasonable to allow a trade-off between complexity and prior mass. Condition (2.13) allows a crude form of such a trade-off: a small part \mathcal{P}_n^c of the model may be more complex, provided that it receives a negligible amount of prior mass.
2.3. Consistency. The preceding theorems yield a rate of convergence \varepsilon_n \to 0, expressed as a function of prior mass and model entropy. In certain situations the prior mass and entropy may be hard to quantify. In contrast, for inferring consistency of the posterior, such quantification is unnecessary. This could be proved directly, as [13] achieved in the well-specified situation, but it can also be inferred from the preceding rate theorems. [A direct proof might actually give the same theorem with a slightly bigger set B(\varepsilon, P^*; P_0).] We consider this for the situation of Theorem 2.1 only.

COROLLARY 2.1. For a given model \mathcal{P}, prior \Pi on \mathcal{P} and some P^* \in \mathcal{P}, assume that -P_0 \log(p^*/p_0) < \infty and P_0(p/p^*) < \infty for all P \in \mathcal{P}. Suppose that, for every \varepsilon > 0,

(2.14)    \Pi(B(\varepsilon, P^*; P_0)) > 0,    \sup_{\eta > \varepsilon} N_t(\eta, \mathcal{P}, d; P_0, P^*) < \infty.

Then, for every \varepsilon > 0, as n \to \infty,

(2.15)    \Pi_n(P \in \mathcal{P} : d(P, P^*) > \varepsilon \mid X_1, \ldots, X_n) \to 0    in L_1(P_0^n).

PROOF. Define functions f and g by f(\varepsilon) = \Pi(B(\varepsilon, P^*; P_0)) and g(\varepsilon) = \sup_{\eta > \varepsilon} N_t(\eta, \mathcal{P}, d; P_0, P^*). We shall show that there exists a sequence \varepsilon_n \to 0 such that f(\varepsilon_n) \ge e^{-n\varepsilon_n^2} and g(\varepsilon_n) \le e^{n\varepsilon_n^2} for all sufficiently large n. This implies that the conditions of Theorem 2.1 are satisfied for this choice of \varepsilon_n and, hence, the posterior converges with rate at least \varepsilon_n.
To show the existence of \varepsilon_n, define functions h_n by

h_n(\varepsilon) = \frac{1}{n\varepsilon^2}\Big(\log g(\varepsilon) + \log \frac{1}{f(\varepsilon)}\Big).

By the assumptions, h_n(\varepsilon) \to 0 as n \to \infty for every fixed \varepsilon > 0. Therefore, there exists \varepsilon_n \downarrow 0 with h_n(\varepsilon_n) \to 0 [e.g., fix n_1 < n_2 < \cdots such that h_n(1/k) \le 1/k for all n \ge n_k, which is well defined and finite, and next define \varepsilon_n = 1/k for n_k \le n < n_{k+1}]. In particular, there exists an N such that h_n(\varepsilon_n) \le 1 for n \ge N. This implies that f(\varepsilon_n) \ge e^{-n\varepsilon_n^2} and g(\varepsilon_n) \le e^{n\varepsilon_n^2} for every n \ge N.
2.4. Multiple points of minimal Kullback-Leibler divergence. In this section we extend Theorem 2.1 to the situation where there exists a set \mathcal{P}^* of minimal Kullback-Leibler points.

Multiple minimal points can arise in two very different ways. First consider the situation where the true distribution P_0 and the elements of the model \mathcal{P} possess different supports. Because the observations are sampled from P_0, they fall with probability one in the set where p_0 > 0 and, hence, the exact nature of the elements of the model on the set \{p_0 = 0\} is irrelevant. Clearly, multiple minima arise if the model contains multiple extensions of P^* to the set on which p_0 = 0. In this case the observations do not provide the means to distinguish between these extensions and, consequently, no such distinction can be made by the posterior either. Theorems 2.1 and 2.2 may apply under this type of nonuniqueness, as long as we fix one of the minimal points, and the assertion of the theorem will be true for any of the minimal points as soon as it is true for one of the minimal points. This follows because, under the conditions of the theorem, d(P_1^*, P_2^*) = 0 whenever P_1^* and P_2^* agree on the set p_0 > 0, in view of (2.9).
Genuine multiple
Kullback-Leibler
may occur
points of minimal
divergence
as well.
For
with means
distribution
both have
artificial
of normal distributions
fit a model
instance, one might
consisting
?
in (?oo,
1] U [1, oo) and variance one, in a situation where the true
? 1
is normal with mean 0. The normal distributions with means
and 1
the minimal
and we
Kullback-Leibler
divergence.
This
situation
is somewhat
are not aware of more
or semiparametric
it appears
because
in the nonparametric
interesting examples
us most
in the present paper. Nevertheless,
of an
that the situation might arise, we give a brief discussion
case
that interests
2.1.
of Theorem
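The two-point example can be checked in closed form: for $P_0 = N(0, 1)$ the divergence $-P_0\log(p_\theta/p_0)$ equals $\theta^2/2$, so over the restricted mean set it is minimized at both $\theta = -1$ and $\theta = 1$. A quick numerical confirmation (the grid and quadrature settings are arbitrary choices):

```python
import math

# KL(N(0,1) || N(theta,1)) = theta^2 / 2 in closed form.
def kl_to_normal_mean(theta):
    return theta ** 2 / 2.0

def kl_numeric(theta, n=4000, lo=-10.0, hi=10.0):
    # Trapezoid-rule check of the closed form: integrate
    # phi(x) * log(phi(x) / phi(x - theta)) over a wide grid.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
        logratio = 0.5 * (x - theta) ** 2 - 0.5 * x ** 2
        w = 0.5 if i in (0, n) else 1.0
        total += w * phi * logratio * h
    return total

# Means restricted to (-inf,-1] U [1,inf): both endpoints minimize the KL.
grid = [s * (1.0 + 0.01 * j) for s in (-1, 1) for j in range(0, 301)]
best = min(kl_to_normal_mean(t) for t in grid)
minimizers = {t for t in grid if kl_to_normal_mean(t) == best}
assert best == 0.5
assert minimizers == {-1.0, 1.0}
assert abs(kl_numeric(1.0) - 0.5) < 1e-8
assert abs(kl_numeric(-1.0) - 0.5) < 1e-8
```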
Rather than to a single measure $P^* \in \mathcal{P}^*$, the extension refers to a finite subset $\mathcal{P}^* \subset \mathcal{P}$ of points at minimal Kullback-Leibler divergence. We give conditions under which the posterior distribution asymptotically concentrates near this set of points. We redefine our "covering numbers for testing under misspecification" $N_t(\varepsilon, \mathcal{P}, d; P_0, \mathcal{P}^*)$ as the minimal number $N$ of convex sets $B_1, \ldots, B_N$ of probability measures on $(\mathfrak{X}, \mathscr{A})$ needed to cover the set $\{P \in \mathcal{P} : \varepsilon < d(P, \mathcal{P}^*) < 2\varepsilon\}$ such that, for every $i$,

(2.16)    $\sup_{P^* \in \mathcal{P}^*}\,\inf_{P \in B_i}\,\sup_{0<\alpha<1} -\log P_0\Bigl(\frac{p}{p^*}\Bigr)^{\alpha} \ge \frac{\varepsilon^2}{4}$.

This roughly says that, for every $P \in \mathcal{P}$ in the covering sets, there exists some minimal point relative to which we can apply arguments as before.
THEOREM 2.4. For a given model $\mathcal{P}$, prior $\Pi$ on $\mathcal{P}$ and some subset $\mathcal{P}^* \subset \mathcal{P}$, assume that $P_0\log(p^*/p_0) < \infty$ and $P_0(p/p^*) < \infty$ for all $P \in \mathcal{P}$ and $P^* \in \mathcal{P}^*$. Suppose that there exist a sequence of strictly positive numbers $\varepsilon_n$ with $\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ and a constant $L > 0$, such that, for all $n$ and all $\varepsilon > \varepsilon_n$,

(2.17)    $\inf_{P^* \in \mathcal{P}^*} \Pi(B(\varepsilon_n, P^*; P_0)) \ge e^{-Ln\varepsilon_n^2}$,

(2.18)    $N_t(\varepsilon, \mathcal{P}, d; P_0, \mathcal{P}^*) \le e^{n\varepsilon_n^2}$.

Then for every sufficiently large constant $M > 0$, as $n \to \infty$,

(2.19)    $\Pi_n(P \in \mathcal{P} : d(P, \mathcal{P}^*) \ge M\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$.

3. Mixtures. Let $\mu$ denote the Lebesgue measure on $\mathbb{R}$. For each $z \in \mathbb{R}$, let $x \mapsto p(x|z)$ be a fixed $\mu$-probability density on a measurable space $(\mathfrak{X}, \mathscr{A})$ that depends measurably on $(x, z)$, and for a distribution $F$ on $\mathbb{R}$, define the $\mu$-density

$p_F(x) = \int p(x|z)\,dF(z)$.

Let $P_F$ be the corresponding probability measure. In this section we consider mixture models $\mathcal{P} = \{P_F : F \in \mathscr{F}\}$, where $\mathscr{F}$ is the set of all probability distributions on a given compact interval $[-M, M]$. We consider consistency for general mixtures and derive a rate of convergence in the special case that the family $p(\cdot|z)$ is the normal location family, that is, with $\phi$ the standard normal density,

(3.1)    $p_F(x) = \int \phi(x - z)\,dF(z)$.

The observations are an i.i.d. sample $X_1, \ldots, X_n$ drawn from a distribution $P_0$ on $(\mathfrak{X}, \mathscr{A})$ with $\mu$-density $p_0$ which is not necessarily of the mixture form. As a prior distribution on $F$, we use a Dirichlet process prior (see [6, 7]).
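For a discrete mixing distribution $F = \sum_j p_j\delta_{z_j}$ the location mixture (3.1) is simply $p_F(x) = \sum_j p_j\phi(x - z_j)$. A minimal sketch (the particular support points and weights are hypothetical):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def mixture_density(x, support, weights):
    # p_F(x) = sum_j p_j * phi(x - z_j) for discrete F on [-M, M], as in (3.1)
    return sum(p * phi(x - z) for z, p in zip(support, weights))

# Hypothetical mixing distribution on [-M, M] with M = 2.
support = [-1.5, 0.0, 1.0]
weights = [0.2, 0.5, 0.3]

# p_F integrates to one (trapezoid check) and, being a convex combination
# of shifted normal densities, is bounded by phi(0) everywhere.
lo, hi, n = -12.0, 12.0, 6000
h = (hi - lo) / n
total = sum((0.5 if i in (0, n) else 1.0)
            * mixture_density(lo + i * h, support, weights) * h
            for i in range(n + 1))
assert abs(total - 1.0) < 1e-8
assert all(mixture_density(lo + i * h, support, weights) <= phi(0.0) + 1e-12
           for i in range(n + 1))
```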
3.1. General mixtures. We say that the model is $P_0$-identifiable if, for all pairs $F_1, F_2 \in \mathscr{F}$,

$F_1 \ne F_2 \implies P_0(p_{F_1} \ne p_{F_2}) > 0$.

Imposing this condition on the model excludes the first way in which nonuniqueness of $P^*$ may occur (as discussed in Section 2.4).

LEMMA 3.1. Assume that $P_0\log(p_F/p_0) < \infty$ for some $F \in \mathscr{F}$. If the map $z \mapsto p(x|z)$ is continuous for all $x$, then there exists an $F^* \in \mathscr{F}$ that minimizes $F \mapsto -P_0\log(p_F/p_0)$ over $\mathscr{F}$. If the model is $P_0$-identifiable, then this $F^*$ is unique.
PROOF. If $F_n$ is a sequence in $\mathscr{F}$ with $F_n \to F$ for the weak topology on $\mathscr{F}$, then $p_{F_n}(x) \to p_F(x)$ for every $x$, since the kernel is continuous in $z$ (and, hence, also bounded as a result of the compactness of $[-M, M]$) and by use of the portmanteau lemma. Consequently, $p_{F_n} \to p_F$ in $L_1(\mu)$ by Scheffé's lemma. It follows that the map $F \mapsto p_F$ from $\mathscr{F}$ into $L_1(\mu)$ is continuous for the weak topology on $\mathscr{F}$. The set $\mathscr{F}$ is compact for this topology, by Prohorov's theorem. The Kullback-Leibler divergence $p \mapsto -P_0\log(p/p_0)$ is lower semi-continuous as a map from $L_1(\mu)$ to $\mathbb{R}$ (see Lemma 8.2). Therefore, the map $F \mapsto -P_0\log(p_F/p_0)$ is lower semi-continuous on the compactum $\mathscr{F}$ and, hence, attains its infimum on $\mathscr{F}$.

The map $p \mapsto -P_0\log(p/p_0)$ is also convex. By the strict convexity of the function $x \mapsto -\log x$, we have, for any $\lambda \in (0, 1)$,

$-P_0\log\frac{\lambda p_1 + (1-\lambda)p_2}{p_0} < -\lambda P_0\log\frac{p_1}{p_0} - (1-\lambda)P_0\log\frac{p_2}{p_0}$,

unless $P_0(p_1 = p_2) = 1$. This shows that the point of minimum is unique if $\mathscr{F}$ is $P_0$-identifiable. □
Thus, a minimal Kullback-Leibler point $P_{F^*}$ exists and is unique under mild conditions on the kernel $p$. Because the model is convex, Lemma 2.3 next shows that (2.9) is satisfied for the weighted Hellinger distance, whose square is equal to

(3.2)    $d^2(P_{F_1}, P_{F_2}) = \frac{1}{2}\int \bigl(\sqrt{p_{F_1}} - \sqrt{p_{F_2}}\bigr)^2\,\frac{p_0}{p_{F^*}}\,d\mu$.

If $p_0/p_{F^*} \in L_\infty(\mu)$, then this expression is bounded by a multiple of the squared Hellinger distance between the measures $P_{F_1}$ and $P_{F_2}$.

Because $\mathscr{F}$ is compact for the weak topology and the map $F \mapsto p_F$ from $\mathscr{F}$ to $L_1(\mu)$ is continuous (cf. the proof of Lemma 3.1), the model $\mathcal{P} = \{P_F : F \in \mathscr{F}\}$ is compact relative to the total variation distance. Because the Hellinger and total variation distances define the same uniform structure, the model is also compact relative to the Hellinger distance $H$ and, hence, it is totally bounded, that is, the covering numbers $N(\varepsilon, \mathcal{P}, H)$ are finite for all $\varepsilon$. Combined with the result of the preceding paragraph and Lemmas 2.2 and 2.3, this yields the result that the entropy condition of Corollary 2.1 is satisfied for $d$ as in (3.2) if $p_0/p_{F^*} \in L_\infty(\mu)$ and we obtain the following theorem.
THEOREM 3.1. If $p_0/p_{F^*} \in L_\infty(\mu)$ and $\Pi(B(\varepsilon, P_{F^*}; P_0)) > 0$ for every $\varepsilon > 0$, then $\Pi_n(F : d(P_F, P_{F^*}) > \varepsilon \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$ for $d$ given by (3.2).
3.2. Gaussian mixtures. Next we specialize to the situation where $p(x|z) = \phi(x - z)$ is a Gaussian convolution kernel and derive the rate of convergence. The Gaussian convolution model is well known to be $P_0$-identifiable if $P_0$ is absolutely continuous (see, e.g., [12]). Let $d$ be defined as in (3.2). We assume that $P_0$ is such that $P_0\log(p_F/p_0)$ is finite for some $F$, so that there exists a minimal Kullback-Leibler point $F^*$, by Lemma 3.1.
LEMMA 3.2. If for some constant $C_1 > 0$, $d(P_{F_1}, P_{F_2}) \le C_1 H(P_{F_1}, P_{F_2})$, then the entropy condition $\log N(\varepsilon_n, \mathcal{P}, d) \le n\varepsilon_n^2$ is satisfied for $\varepsilon_n$ a large enough multiple of $\log n/\sqrt{n}$.

PROOF. Because the square of the Hellinger distance is bounded above by the $L_1$-norm, the assumption implies that $d^2(P_{F_1}, P_{F_2}) \le C_1^2\|P_{F_1} - P_{F_2}\|_1$. Hence, for all $\varepsilon > 0$, we have $N(C_1\varepsilon, \mathcal{P}, d) \le N(\varepsilon^2, \mathcal{P}, \|\cdot\|_1)$. As a result of Lemma 3.3 in [9], there exists a constant $C_2 > 0$ such that

(3.3)    $\|P_{F_1} - P_{F_2}\|_1 \le C_2\,\|p_{F_1} - p_{F_2}\|_\infty \max\Bigl\{1, M, \sqrt{\log_+\frac{1}{\|p_{F_1} - p_{F_2}\|_\infty}}\Bigr\}$,

from which it follows that $N(C_2\varepsilon(\log(1/\varepsilon))^{1/2}, \mathcal{P}, \|\cdot\|_1) \le N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty)$ for small enough $\varepsilon$. With the help of Lemma 3.2 in [9], we see that there exists a constant $C_3 > 0$ such that, for all $0 < \varepsilon < e^{-1}$,

$\log N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty) \le C_3\Bigl(\log\frac{1}{\varepsilon}\Bigr)^2$.

Combining all of the above, we note that, for small enough $\varepsilon > 0$,

$\log N\Bigl(C_1C_2\varepsilon^{1/2}\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4}, \mathcal{P}, d\Bigr) \le \log N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty) \le C_3\Bigl(\log\frac{1}{\varepsilon}\Bigr)^2$.

So if we can find a sequence $\varepsilon_n$ such that, for all $n \ge 1$, there exists an $\varepsilon > 0$ such that

$C_1C_2\varepsilon^{1/2}\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4} \le \varepsilon_n$ and $C_3\Bigl(\log\frac{1}{\varepsilon}\Bigr)^2 \le n\varepsilon_n^2$,

then we have demonstrated that

$\log N(\varepsilon_n, \mathcal{P}, d) \le \log N\Bigl(C_1C_2\varepsilon^{1/2}\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4}, \mathcal{P}, d\Bigr) \le n\varepsilon_n^2$.

One easily shows that this is the case for $\varepsilon_n = \max\{C_1C_2, C_3\}(\log n/\sqrt{n})$ (in which case we choose, for fixed $n$, $\varepsilon = 1/n$), if $n$ is taken large enough. □
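The final step of this proof is easy to check numerically. A sketch with the hypothetical normalization $C_1C_2 = C_3 = 1$, taking $\varepsilon = 1/n$ and $\varepsilon_n$ a multiple of $\log n/\sqrt{n}$:

```python
import math

def lhs_entropy(eps):
    # C1*C2 * eps^(1/2) * (log(1/eps))^(1/4), with C1*C2 = 1 (hypothetical)
    return math.sqrt(eps) * (math.log(1.0 / eps)) ** 0.25

def rhs_entropy(eps):
    # C3 * (log(1/eps))^2, with C3 = 1 (hypothetical)
    return (math.log(1.0 / eps)) ** 2

# With eps = 1/n and eps_n = 2*log(n)/sqrt(n), both displayed requirements
# hold for moderately large n.
for n in [100, 1000, 10 ** 6]:
    eps = 1.0 / n
    eps_n = 2.0 * math.log(n) / math.sqrt(n)
    assert lhs_entropy(eps) <= eps_n            # first requirement
    assert rhs_entropy(eps) <= n * eps_n ** 2   # second requirement
```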
We are now in position to apply Theorem 2.1. We consider, for given $M > 0$, the location mixtures (3.1) with the standard normal density $\phi$ as the kernel. We choose the prior $\Pi$ equal to a Dirichlet prior on $\mathscr{F}$ specified by a finite base measure $\alpha$ with compact support and positive, continuous Lebesgue density on $[-M, M]$.

THEOREM 3.2. Let $P_0$ be a distribution on $\mathbb{R}$ dominated by Lebesgue measure $\mu$. Assume that $p_0/p_{F^*} \in L_\infty(\mu)$. Then the posterior distribution concentrates its mass around $P_{F^*}$ asymptotically at the rate $\log n/\sqrt{n}$ relative to the distance $d$ on $\mathcal{P}$ given by (3.2).
PROOF. The set of mixture densities $p_F$ with $F \in \mathscr{F}$ is bounded above and below by the upper and lower envelope functions

$U(x) = \phi(x + M)1_{\{x < -M\}} + \phi(x - M)1_{\{x > M\}} + \phi(0)1_{\{-M \le x \le M\}}$,

$L(x) = \phi(x - M)1_{\{x \le 0\}} + \phi(x + M)1_{\{x > 0\}}$.

So for any $F \in \mathscr{F}$,

$P_0\Bigl(\frac{p_F}{p_{F^*}}\Bigr) \le P_0\Bigl(\frac{U}{L}\Bigr) \le \frac{\phi(0)}{\phi(2M)} + P_0\bigl(e^{-2MX}1_{\{X < -M\}} + e^{2MX}1_{\{X > M\}}\bigr) < \infty$,

because $p_0$ is essentially bounded by a multiple of $p_{F^*}$ and $P_{F^*}$ has sub-Gaussian tails. In view of Lemmas 2.2 and 2.3, the covering number for testing $N_t(\varepsilon, \mathcal{P}, d; P_0, P^*)$ in (2.5) is bounded above by the ordinary metric covering number $N(A\varepsilon, \mathcal{P}, d)$, for some constant $A$. Then Lemma 3.2 demonstrates that the entropy condition (2.5) is satisfied for $\varepsilon_n$ a large multiple of $\log n/\sqrt{n}$.

It remains to verify the prior mass condition (2.4). Let $\varepsilon$ be given such that $0 < \varepsilon < e^{-1}$. By Lemma 3.2 in [9], there exists a discrete distribution function $F' \in D[-M, M]$ supported on at most $N \le C_2\log(1/\varepsilon)$ points $\{z_1, \ldots, z_N\} \subset [-M, M]$ such that $\|p_{F^*} - p_{F'}\|_\infty \le C_1\varepsilon$, where $C_1, C_2 > 0$ are constants that depend on $M$ only. We write $F' = \sum_{j=1}^N p_j\delta_{z_j}$ with $\sum_{j=1}^N p_j = 1$. Without loss of generality, we may assume that the set $\{z_j : j = 1, \ldots, N\}$ is $2\varepsilon$-separated. Namely, if this is not the case, we may choose a maximal $2\varepsilon$-separated subset of $\{z_j : j = 1, \ldots, N\}$ and shift the weights $p_j$ to the nearest point in the subset. A discrete $F''$ obtained in this fashion satisfies $\|p_{F'} - p_{F''}\|_\infty \le 2\varepsilon\|\phi'\|_\infty$, by virtue of the triangle inequality and the fact that the derivative of the standard normal kernel $\phi$ is bounded. So a given $F'$ may be replaced by a $2\varepsilon$-separated $F''$ if the constant $C_1$ is changed accordingly.

By Lemma 3.3 in [9], there exists a constant $D_1$ such that the $L_1$-norm difference satisfies

$\|p_{F^*} - p_{F'}\|_1 \le D_1\varepsilon\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2}$,

for small enough $\varepsilon$. Using Lemma 3.6 in [9], we note, moreover, that there exists a constant $D_2$ such that, for any $F \in \mathscr{F}$,

$\|p_F - p_{F'}\|_1 \le D_2\Bigl(\varepsilon + \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr|\Bigr)$.

So there exists a constant $D > 0$ such that, if $F$ satisfies $\sum_{j=1}^N |F[z_j - \varepsilon, z_j + \varepsilon] - p_j| \le \varepsilon$, then

$\|p_F - p_{F^*}\|_1 \le D\varepsilon\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2}$.

Let $Q(P)$ be the measure defined by $dQ(P) = (p_0/p_{F^*})\,dP$. The assumption that $p_0/p_{F^*}$ is essentially bounded implies that there exists a constant $K > 0$ such that $\|Q(P_{F_1}) - Q(P_{F_2})\|_1 \le K\|P_{F_1} - P_{F_2}\|_1$ for all $F_1, F_2 \in \mathscr{F}$. Since $Q(P_{F^*}) = P_0$, it follows that there exists a constant $D' > 0$ such that, for small enough $\varepsilon > 0$,

$\Bigl\{F \in \mathscr{F} : \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr| \le \varepsilon\Bigr\} \subset \Bigl\{F \in \mathscr{F} : \|Q(P_F) - P_0\|_1 \le (D')^2\varepsilon\Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2}\Bigr\}$.

We have that $dQ(P_F)/dP_0 = p_F/p_{F^*}$ and $P_0(p_F/p_{F^*}) \le P_0(U/L) < \infty$. The Hellinger distance is bounded by the square root of the $L_1$-distance. Therefore, applying Lemma 8.1 with $\eta = \eta(\varepsilon) = D'\varepsilon^{1/2}(\log(1/\varepsilon))^{1/4}$, we see that the set of measures $P_F$ with $F$ in the set on the right-hand side of the last display is contained in the set

$\Bigl\{p_F : F \in \mathscr{F},\ -P_0\log\frac{p_F}{p_{F^*}} \le \zeta^2(\varepsilon),\ P_0\Bigl(\log\frac{p_F}{p_{F^*}}\Bigr)^2 \le \zeta^2(\varepsilon)\Bigr\}$,

where $\zeta(\varepsilon) = D''\eta(\varepsilon)\log(1/\eta(\varepsilon)) \le D''D'\varepsilon^{1/2}(\log(1/\varepsilon))^{5/4}$, for an appropriate constant $D''$, and small enough $\varepsilon$. It follows that

$\Bigl\{F \in \mathscr{F} : \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr| \le \varepsilon\Bigr\} \subset B(\zeta(\varepsilon), P_{F^*}; P_0)$.

Following [8] (Lemma 6.1) or Lemma A.2 in [9], we see that the prior measure of the set on the left-hand side of the previous display is lower bounded by

$c_1\exp\Bigl(-c_2N\log\frac{1}{\varepsilon}\Bigr) \ge \exp\Bigl(-L\Bigl(\log\frac{1}{\varepsilon}\Bigr)^2\Bigr) \ge \exp\Bigl(-L'\Bigl(\log\frac{1}{\zeta(\varepsilon)}\Bigr)^2\Bigr)$,

where $c_1, c_2 > 0$ are constants and $L = c_2C_2 > 0$. So if we can find a sequence $\varepsilon_n$ such that, for each sufficiently large $n$, there exists an $\varepsilon > 0$ such that

$\varepsilon_n \ge \zeta(\varepsilon)$ and $n\varepsilon_n^2 \ge \Bigl(\log\frac{1}{\zeta(\varepsilon)}\Bigr)^2$,

then

$\Pi(B(\varepsilon_n, P_{F^*}; P_0)) \ge \Pi(B(\zeta(\varepsilon), P_{F^*}; P_0)) \ge \exp(-L'n\varepsilon_n^2)$

and, hence, (2.4) is satisfied. One easily shows that, for $\varepsilon_n = \log n/\sqrt{n}$ and $\varepsilon$ chosen such that $\zeta(\varepsilon) = 1/\sqrt{n}$, the two requirements are fulfilled for sufficiently large $n$. □
4. Regression. Let $P_0$ be the distribution of a random vector $(X, Y)$ satisfying $Y = f_0(X) + e_0$ for independent random variables $X$ and $e_0$ taking values in a measurable space $(\mathfrak{X}, \mathscr{A})$ and in $\mathbb{R}$, respectively, and a measurable function $f_0 : \mathfrak{X} \to \mathbb{R}$. The variables $X$ and $e_0$ have given marginal distributions, which may be unknown, but are fixed throughout the following. The purpose is to estimate the regression function $f_0$ based on a random sample of variables $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as $(X, Y)$.

A Bayesian approach to this problem might start from the specification of a prior distribution $\Pi$ on a given class $\mathscr{F}$ of measurable functions $f : \mathfrak{X} \to \mathbb{R}$. If the distributions of $X$ and $e_0$ are known, this is sufficient to determine a posterior. If these distributions are not known, then one might proceed to introduce priors for these unknowns as well. The approach we take here is to fix the distribution of $e_0$ to a normal or Laplace distribution, while aware of the fact that its true distribution may be different. We investigate the consequences of the resulting model misspecification. We shall show that misspecification of the error distribution does not have serious consequences for estimation of the regression function. In this sense a nonparametric Bayesian approach possesses the same robustness to misspecification as minimum contrast estimation using least squares or minimum absolute deviation. We shall also see that the use of the Laplace distribution requires no conditions on the tail of the distribution of the errors, whereas the normal distribution appears to give good results only if these tails are not too big. Thus, the tail robustness of minimum absolute deviation versus the nonrobustness of the method of least squares also extends to Bayesian regression.

We build the posterior based on a regression model $Y = f(X) + e$ for $X$ and $e$ independent, as is the assumption on the true distribution of $(X, Y)$. If we assume that the distribution $P_X$ of $X$ has a known form, then this distribution cancels out of the expression for the posterior on $f$. If, instead, we put independent priors on $f$ and $P_X$, respectively, then the prior on $P_X$ would disappear upon marginalization of the posterior of $(f, P_X)$ relative to $f$. Thus, for investigating the posterior for $f$, we may assume without loss of generality that the marginal distribution of $X$ is known. It can be absorbed into the dominating measure $\mu$ for the model.

For $f \in \mathscr{F}$, let $P_f$ be the distribution of the random variable $(X, Y)$ satisfying $Y = f(X) + e$ for independent variables $X$ and $e$, $X$ having the same distribution as before and $e$ possessing a given density $p$, possibly different from the density of the true error $e_0$. We shall consider the cases that $p$ is normal and Laplace. Given a prior $\Pi$ on $\mathscr{F}$, the posterior distribution for $f$ is given by

$\Pi_n(B \mid X_1, Y_1, \ldots, X_n, Y_n) = \frac{\int_B \prod_{i=1}^n p(Y_i - f(X_i))\,d\Pi(f)}{\int \prod_{i=1}^n p(Y_i - f(X_i))\,d\Pi(f)}$.
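The posterior display is easy to evaluate exactly for a *finite* prior. A minimal sketch with a hypothetical three-point prior on candidate regression functions and a possibly misspecified standard normal error density:

```python
import math

def p_normal(z):
    # working error density p (standard normal; may be misspecified)
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def posterior(prior, data):
    # prior: list of (f, weight); data: list of (x, y) pairs.
    # Pi_n(f | data) is proportional to Pi(f) * prod_i p(y_i - f(x_i)).
    weights = []
    for f, w in prior:
        lik = w
        for x, y in data:
            lik *= p_normal(y - f(x))
        weights.append(lik)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates; the data are generated (noise-free) from f0(x) = x,
# so the posterior should favor the middle candidate.
candidates = [(lambda x: 0.0, 1 / 3), (lambda x: x, 1 / 3), (lambda x: 2 * x, 1 / 3)]
data = [(x / 10.0, x / 10.0) for x in range(1, 11)]
post = posterior(candidates, data)
assert max(post) == post[1]
assert abs(sum(post) - 1.0) < 1e-12
```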
We shall show that this posterior concentrates asymptotically near $f_0 + \mathrm{E}e_0$ in the case that $p$ is a normal density and near $f_0 + \mathrm{median}(e_0)$ if $p$ is Laplace, if these translates of the true regression function $f_0$ are contained in the model $\mathscr{F}$. If the prior is misspecified also in the sense that $f_0 + \mu \notin \mathscr{F}$ (where $\mu$ is the expectation or median of $e_0$), then, under some conditions, this remains true with $f_0$ replaced by a "projection" $f^*$ of $f_0$ on $\mathscr{F}$. In agreement with the notation in the rest of the paper, we shall denote the true distribution of an observation $(X, Y)$ by $P_0$ (stressing that, in general, $P_0$ is different from $P_f$, even for $f = f_0$). The model $\mathcal{P}$ in the statement of the main results is the set of all distributions $P_f$ on $\mathfrak{X} \times \mathbb{R}$ with $f \in \mathscr{F}$.
4.1. Normal regression. Suppose that the density $p$ is equal to the standard normal density

(4.1)    $p(z) = (2\pi)^{-1/2}\exp\bigl(-\tfrac{1}{2}z^2\bigr)$.

Then, with $\mu = \mathrm{E}e_0$,

$\log\frac{p_f}{p_{f_0}}(X, Y) = -\tfrac{1}{2}(f - f_0)^2(X) + e_0(f - f_0)(X)$,

$-P_0\log\frac{p_f}{p_{f_0}} = \tfrac{1}{2}P_0(f - f_0 - \mu)^2 - \tfrac{1}{2}\mu^2$.
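The second display rests on the algebraic identity $\tfrac{1}{2}P_0(f - f_0)^2 - \mu P_0(f - f_0) = \tfrac{1}{2}P_0(f - f_0 - \mu)^2 - \tfrac{1}{2}\mu^2$ (complete the square under the expectation). A quick exact check with a discrete design distribution (all values hypothetical):

```python
# Verify: 0.5*P0(f-f0)^2 - mu*P0(f-f0) == 0.5*P0(f-f0-mu)^2 - mu^2/2.
xs = [0.0, 0.5, 1.0, 2.0]        # support of X (hypothetical)
ws = [0.1, 0.4, 0.3, 0.2]        # P0-probabilities of the support points
f = lambda x: 1.0 + 0.5 * x      # candidate regression function
f0 = lambda x: 0.2 * x           # true regression function
mu = 0.7                         # mean of the true error e0

def P0(g):
    # expectation under the design distribution
    return sum(w * g(x) for x, w in zip(xs, ws))

lhs = 0.5 * P0(lambda x: (f(x) - f0(x)) ** 2) - mu * P0(lambda x: f(x) - f0(x))
rhs = 0.5 * P0(lambda x: (f(x) - f0(x) - mu) ** 2) - 0.5 * mu ** 2
assert abs(lhs - rhs) < 1e-12
# Minimizing over f therefore amounts to minimizing P0 (f - f0 - mu)^2.
```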
It follows that the Kullback-Leibler divergence $f \mapsto -P_0\log(p_f/p_0)$ is minimized for $f = f^* \in \mathscr{F}$ minimizing the map $f \mapsto P_0(f - f_0 - \mu)^2$. In particular, if $f_0 + \mu \in \mathscr{F}$, then the minimizer is $f_0 + \mu$ and $P_{f_0+\mu}$ is the point in the model that is closest to $P_0$ in the Kullback-Leibler sense. If also $\mu = 0$, then, even though the posterior on $P_f$ will concentrate asymptotically near $P_{f_0}$, which is typically not equal to $P_0$, the induced posterior on $f$ will concentrate near the true regression function $f_0$. This favorable property of Bayesian estimation is analogous to that of least squares estimators, which also work for nonnormal error distributions.

If $f_0 + \mu$ is not contained in the model, then the posterior for $f$ will in general not be consistent. We assume that there exists a unique $f^* \in \mathscr{F}$ that minimizes $f \mapsto P_0(f - f_0 - \mu)^2$, as is the case, for instance, if $\mathscr{F}$ is a closed, convex subset of $L_2(P_0)$. Under some conditions we shall show that the posterior concentrates asymptotically near $f^*$. If $\mu = 0$, then $f^*$ is the projection of $f_0$ into $\mathscr{F}$ and the posterior still behaves in a desirable manner. For simplicity of notation, we assume that $\mathrm{E}_0e_0 = 0$.

The following lemma shows that (2.8) is satisfied for a multiple of the $L_2(P_0)$-distance on $\mathscr{F}$.

LEMMA 4.1. Let $\mathscr{F}$ be a class of uniformly bounded functions $f : \mathfrak{X} \to \mathbb{R}$ such that either $f_0 \in \mathscr{F}$ or $\mathscr{F}$ is convex and closed in $L_2(P_0)$. Assume that $f_0$ is uniformly bounded, that $\mathrm{E}_0e_0 = 0$ and that $\mathrm{E}_0e^{M|e_0|} < \infty$ for every $M > 0$. Then there exist positive constants $C_1, C_2, C_3$ such that, for all $m \in \mathbb{N}$, $f, f_1, \ldots, f_m \in \mathscr{F}$ and $\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$,

$P_0\log\frac{p_f}{p_{f^*}} \le -\tfrac{1}{2}P_0(f - f^*)^2$,

(4.2)    $P_0\Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le C_1P_0(f - f^*)^2$,

(4.3)    $\sup_{0<\alpha<1} -\log P_0\Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^\alpha \ge C_2\sum_i \lambda_i\bigl(P_0(f_i - f^*)^2 - C_3P_0(f - f_i)^2\bigr)$.

PROOF. We have

$\log\frac{p_f}{p_{f^*}}(X, Y) = -\tfrac{1}{2}\bigl[(f_0 - f)^2 - (f_0 - f^*)^2\bigr](X) - e_0(f^* - f)(X)$.
The second term on the right-hand side has mean zero by assumption. The first term on the right-hand side has expectation $-\tfrac{1}{2}P_0(f^* - f)^2$ if $f_0 = f^*$, as is the case if $f_0 \in \mathscr{F}$. Furthermore, if $\mathscr{F}$ is convex, the minimizing property of $f^*$ implies that $P_0(f_0 - f^*)(f^* - f) \ge 0$ for every $f \in \mathscr{F}$, and then the expectation of the first term is bounded above by $-\tfrac{1}{2}P_0(f^* - f)^2$. Therefore, in both cases the first inequality of the lemma holds. From the preceding display we also have, with $M$ a uniform upper bound on $\mathscr{F}$ and $f_0$,

$P_0\Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le P_0\bigl[\tfrac{1}{2}(f^* - f)^2(2f_0 - f - f^*)^2 + 2e_0^2(f^* - f)^2\bigr]$,

and the right-hand side can be further bounded by a constant times $P_0(f - f^*)^2$, where the constant depends on $M$ and the distribution of $e_0$ only. Thus, (4.2) holds.

In view of Lemma 4.3 (below) with $p = p_{f^*}$ and $q_i = p_{f_i}$, we see that there exists a constant $C > 0$ depending on $M$ only such that, for all $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$,

(4.4)    $\Bigl|1 - P_0\Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^\alpha - \alpha P_0\log\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr| \le 2\alpha^2C\sum_i \lambda_i P_0(f_i - f^*)^2$.

By Lemma 4.3 with $\alpha = 1$ and $p = p_f$ and similar arguments, we also have, for any $f \in \mathscr{F}$,

$\Bigl|1 - P_0\Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_f}\Bigr) - P_0\log\frac{\sum_i \lambda_i p_{f_i}}{p_f}\Bigr| \le 2C\sum_i \lambda_i P_0(f_i - f)^2$.

For $\lambda_i = 1$ this becomes

$\Bigl|1 - P_0\Bigl(\frac{p_{f_i}}{p_f}\Bigr) - P_0\log\frac{p_{f_i}}{p_f}\Bigr| \le 2CP_0(f_i - f)^2$.

Taking differences, we obtain that

$\Bigl|P_0\log\frac{\sum_i \lambda_i p_{f_i}}{p_f} - \sum_i \lambda_i P_0\log\frac{p_{f_i}}{p_f}\Bigr| \le 4C\sum_i \lambda_i P_0(f_i - f)^2$.

By the fact that $\log(ab) = \log a + \log b$ for every $a, b > 0$, this inequality remains true if $f$ on the left is replaced by $f^*$. Combine the resulting inequality with (4.4) to find that

$1 - P_0\Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^\alpha \ge -\alpha\sum_i \lambda_i P_0\log\frac{p_{f_i}}{p_{f^*}} - 2\alpha^2C\sum_i \lambda_i P_0(f^* - f_i)^2 - 4\alpha C\sum_i \lambda_i P_0(f_i - f)^2$

$\ge \alpha\bigl(\tfrac{1}{2} - 2\alpha C\bigr)\sum_i \lambda_i P_0(f^* - f_i)^2 - 4\alpha C\sum_i \lambda_i P_0(f_i - f)^2$,

where we have used the first inequality of the lemma. For sufficiently small $\alpha > 0$ and suitable constants $C_2, C_3$, the right-hand side is bounded below by the right-hand side of the lemma. Finally the left-hand side of the lemma can be bounded by the supremum over $\alpha \in (0, 1)$ of the left-hand side of the last display, since $-\log x \ge 1 - x$ for every $x > 0$. □
In view of the preceding lemma, the estimation of the quantities involved can be based on the $L_2(P_0)$ distance. The "neighborhoods" $B(\varepsilon, P_{f^*}; P_0)$ involved in the prior mass conditions (2.4) and (2.11) in the main theorems can be interpreted in the form

$B(\varepsilon, P_{f^*}; P_0) = \bigl\{f \in \mathscr{F} : P_0(f - f_0)^2 \le P_0(f^* - f_0)^2 + \varepsilon^2,\ P_0(f - f^*)^2 \le \varepsilon^2\bigr\}$.

If $P_0(f - f^*)(f^* - f_0) = 0$ for every $f \in \mathscr{F}$ (as is the case if $f^* = f_0$ or if $f^*$ lies in the interior of $\mathscr{F}$), then this reduces to an $L_2(P_0)$-ball around $f^*$ by Pythagoras' theorem.
In view of the preceding lemma and Lemma 2.1, the entropy for testing in (2.5) can be replaced by the local entropy of $\mathcal{P}$ for the $L_2(P_0)$-metric. The rate of convergence of the posterior distribution guaranteed by Theorem 2.1 is then also relative to the $L_2(P_0)$-distance. These observations yield the following theorem.

THEOREM 4.1. Assume the assertions of Lemma 4.1 and, in addition, that $P_0(f - f^*)(f^* - f_0) = 0$ for every $f \in \mathscr{F}$. If $\varepsilon_n$ is a sequence of strictly positive numbers with $\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ such that, for a constant $L > 0$ and all $n$,

(4.5)    $\Pi\bigl(f \in \mathscr{F} : P_0(f - f^*)^2 \le \varepsilon_n^2\bigr) \ge e^{-Ln\varepsilon_n^2}$,

(4.6)    $N(\varepsilon_n, \mathscr{F}, \|\cdot\|_{P_0,2}) \le e^{n\varepsilon_n^2}$,

then $\Pi_n\bigl(f \in \mathscr{F} : P_0(f - f^*)^2 \ge M\varepsilon_n^2 \mid (X_1, Y_1), \ldots, (X_n, Y_n)\bigr) \to 0$ in $L_1(P_0^n)$, for every sufficiently large constant $M$.
There are many special cases of interest of this theorem and the more general results that can be obtained from Theorems 2.1 and 2.2 using the preceding reasoning. Some of these are considered in the context of the well-specified regression model in [14]. The necessary estimates on the prior mass and the entropy are not different for problems other than the regression model. Entropy estimates can also be found in work on rates of convergence of minimum contrast estimators. For these reasons we exclude a discussion of concrete examples.

The following pair of lemmas was used in the proof of the preceding results.
LEMMA 4.2. There exists a universal constant $C$ such that, for any probability measure $P_0$ and any finite measures $P$ and $Q$ and any $0 < \alpha < 1$,

$\Bigl|1 - P_0\Bigl(\frac{q}{p}\Bigr)^\alpha - \alpha P_0\log\frac{q}{p}\Bigr| \le \alpha^2CP_0\Bigl[\Bigl(\log\frac{q}{p}\Bigr)^2\Bigl(\frac{q}{p}1_{\{q>p\}} + 1_{\{q\le p\}}\Bigr)\Bigr]$.

PROOF. The function $R$ defined by $R(x) = (e^x - 1 - x)/(x^2e^x)$ for $x > 0$ and $R(x) = (e^x - 1 - x)/x^2$ for $x \le 0$ is uniformly bounded on $\mathbb{R}$ by a constant $C$. We can write

$P_0\Bigl(\frac{q}{p}\Bigr)^\alpha = 1 + \alpha P_0\log\frac{q}{p} + P_0\Bigl[R\Bigl(\alpha\log\frac{q}{p}\Bigr)\Bigl(\alpha\log\frac{q}{p}\Bigr)^2\Bigl(\Bigl(\frac{q}{p}\Bigr)^\alpha 1_{\{q>p\}} + 1_{\{q\le p\}}\Bigr)\Bigr]$.

Since $(q/p)^\alpha \le q/p$ on the set $\{q > p\}$ for $0 < \alpha < 1$, the lemma follows. □

LEMMA 4.3. There exists a universal constant $C$ such that, for any probability measure $P_0$ and all finite measures $P, Q_1, \ldots, Q_m$ and constants $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$ and $0 < \alpha < 1$,

$\Bigl|1 - P_0\Bigl(\frac{\sum_i \lambda_i q_i}{p}\Bigr)^\alpha - \alpha P_0\log\frac{\sum_i \lambda_i q_i}{p}\Bigr| \le 2\alpha^2C\sum_i \lambda_i P_0\Bigl[\Bigl(\log\frac{q_i}{p}\Bigr)^2\Bigl(\frac{q_i}{p}1_{\{q_i>p\}} + 1_{\{q_i\le p\}}\Bigr)\Bigr]$.

PROOF. In view of Lemma 4.2 with $q = \sum_i \lambda_i q_i$, it suffices to bound

$P_0\Bigl[\Bigl(\log\frac{\sum_i \lambda_i q_i}{p}\Bigr)^2\Bigl(\frac{\sum_i \lambda_i q_i}{p}1_{\{\sum_i \lambda_i q_i > p\}} + 1_{\{\sum_i \lambda_i q_i \le p\}}\Bigr)\Bigr]$

by the right-hand side of the lemma. We bound the two terms corresponding to the decomposition by indicators separately; each is bounded by the sum on the right, which accounts for the factor 2. If $\sum_i \lambda_i q_i > p$, then by the convexity of the map $x \mapsto x(\log x)^2$ on $[1, \infty)$ (extended by 0 on $[0, 1)$),

$\frac{\sum_i \lambda_i q_i}{p}\Bigl(\log\frac{\sum_i \lambda_i q_i}{p}\Bigr)^2 \le \sum_i \lambda_i\frac{q_i}{p}\Bigl(\log\frac{q_i}{p}\Bigr)^2 1_{\{q_i > p\}}$.

By the concavity of the logarithm,

$-\log\frac{\sum_i \lambda_i q_i}{p} \le -\sum_i \lambda_i\log\frac{q_i}{p}$.

On the set $\{\sum_i \lambda_i q_i \le p\}$ the left-hand side is positive and we can take squares on both sides and preserve the inequality; convexity of the map $x \mapsto x^2$ then allows us to bound the square of the right-hand side by $\sum_i \lambda_i(\log(q_i/p))^2$, as in the lemma. □
4.2. Laplace regression. Suppose that the error density $p$ is equal to the Laplace density $p(x) = \tfrac{1}{2}\exp(-|x|)$. Then

$\log\frac{p_f}{p_{f_0}}(X, Y) = -\bigl(|e_0 + f_0(X) - f(X)| - |e_0|\bigr)$,

$-P_0\log\frac{p_f}{p_{f_0}} = P_0\Phi(f - f_0)$,

for $\Phi(v) = \mathrm{E}_0(|e_0 - v| - |e_0|)$. The function $\Phi$ is minimized over $v \in \mathbb{R}$ at the median of $e_0$. It follows that if $f_0 + m$, for $m$ the median of $e_0$, is contained in $\mathscr{F}$, then the Kullback-Leibler divergence $-P_0\log(p_f/p_0)$ is minimized over $f \in \mathscr{F}$ at $f = f_0 + m$. If $\mathscr{F}$ is a compact, convex subset of $L_1(P_0)$, then in any case there exists $f^* \in \mathscr{F}$ that minimizes the Kullback-Leibler divergence, but it appears difficult to determine this concretely in general. For simplicity of notation, we shall assume that $m = 0$.

If the distribution of $e_0$ is smooth, then the function $\Phi$ will be smooth too. Because it is minimal at $v = m = 0$, it is reasonable to expect that, for $v$ in a neighborhood of $m = 0$ and some positive constant $C_0$,

(4.7)    $\Phi(v) = \mathrm{E}_0(|e_0 - v| - |e_0|) \ge C_0v^2$.

Because $\Phi$ is convex, it is also reasonable to expect that its second derivative, if it exists, is strictly positive.
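The behavior of $\Phi$ is easy to inspect numerically. A sketch for a discrete, asymmetric error distribution (all values hypothetical): $\Phi$ vanishes at the median of $e_0$ and is nonnegative elsewhere, as the text asserts.

```python
# Phi(v) = E(|e0 - v| - |e0|) for a discrete error distribution whose
# median is 0 (cumulative probability reaches 0.5 between -0.5 and 0.0).
support = [-2.0, -0.5, 0.0, 1.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.25, 0.15]

def Phi(v):
    return sum(p * (abs(e - v) - abs(e)) for e, p in zip(support, probs))

grid = [i / 100.0 for i in range(-300, 301)]
v_min = min(grid, key=Phi)
assert abs(v_min) <= 0.01          # minimum attained at the median 0
assert Phi(0.0) == 0.0
assert all(Phi(v) >= 0.0 for v in grid)
```

Note that $\Phi$ is piecewise linear here, so the quadratic lower bound (4.7) holds near 0 with $C_0$ depending on the neighborhood, consistent with the smoothness caveat in the text.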
LEMMA 4.4. Let $\mathscr{F}$ be a class of uniformly bounded functions $f : \mathfrak{X} \to \mathbb{R}$ and let $f_0$ be uniformly bounded. Assume that either $f_0 \in \mathscr{F}$ and (4.7) holds, or that $\mathscr{F}$ is convex and compact in $L_1(P_0)$ and that $\Phi$ is twice continuously differentiable with strictly positive second derivative. Then there exist positive constants $C_0, C_1, C_2, C_3$ such that, for all $m \in \mathbb{N}$, $f, f_1, \ldots, f_m \in \mathscr{F}$ and $\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$,

$P_0\log\frac{p_f}{p_{f^*}} \le -C_0P_0(f - f^*)^2$,

(4.8)    $P_0\Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le C_1P_0(f - f^*)^2$,

$\sup_{0<\alpha<1} -\log P_0\Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^\alpha \ge C_2\sum_i \lambda_i\bigl(P_0(f_i - f^*)^2 - C_3P_0(f - f_i)^2\bigr)$.
PROOF. Suppose first that $f_0 \in \mathscr{F}$, so that $f^* = f_0$. As $\Phi$ is monotone on $(0, \infty)$ and $(-\infty, 0)$, inequality (4.7) is automatically also satisfied for $v$ in a given compactum (with $C_0$ depending on the compactum). Choosing the compactum large enough such that $(f - f^*)(X)$ is contained in it with probability one, we conclude that (4.8) holds (with $f_0 = f^*$).

If $f_0$ is not contained in $\mathscr{F}$ but $\mathscr{F}$ is convex, we obtain a similar inequality with $f^*$ replacing $f_0$, as follows. Because $f^*$ minimizes $f \mapsto P_0\Phi(f - f_0)$ over $\mathscr{F}$, for $f \in \mathscr{F}$ and $f_t = (1 - t)f^* + tf \in \mathscr{F}$, $t \in [0, 1]$, the right derivative of the map $t \mapsto P_0\Phi(f_t - f_0)$ is nonnegative at $t = 0$. This yields $P_0\Phi'(f^* - f_0)(f - f^*) \ge 0$. By a Taylor expansion,

$-P_0\log\frac{p_f}{p_{f^*}} = P_0\bigl(\Phi(f - f_0) - \Phi(f^* - f_0)\bigr) = P_0\Phi'(f^* - f_0)(f - f^*) + \tfrac{1}{2}P_0\Phi''(\tilde f - f_0)(f - f^*)^2$,

for some $\tilde f$ between $f$ and $f^*$. The first term on the right-hand side is nonnegative and the function $\Phi''$ is bounded away from zero on compacta by assumption. Thus, the right-hand side is bounded below by a constant times $P_0(f - f^*)^2$ and (4.8) follows.

Because $\log(p_f/p_{f^*})$ is bounded in absolute value by $|f - f^*|$, we also have

$P_0\Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le P_0(f^* - f)^2$.

As in the proof of Lemma 4.1, we can combine these inequalities, (4.8) and Lemma 4.3 to obtain the result. □
As in the case of regression using the normal density for the error distribution, the preceding lemma reduces the entropy calculations for the application of Theorem 2.1 to estimates of the $L_2(P_0)$-entropy of the class of regression functions $\mathscr{F}$. The resulting rate of convergence is the same as in the case where a normal distribution is used for the error. A difference with the normal case is that presently no tail conditions of the type $\mathrm{E}_0e^{M|e_0|} < \infty$ are necessary. Instead the lemma assumes a certain smoothness of the true distribution of the error $e_0$.
the lemma assumes
of posterior
for finite
The behavior
distributions
5. Parametric
models.
was
more
in
models
considered
and
dimensional,
[1]
recently
by
misspecified
we
in
In
the
the
also
this
section
references
Bunke and Milhaud
[3] (see
latter).
near a minimal Kullback
show that the basic result that the posterior concentrates
Leibler point at the rate y/n follows from our general theorems under some natural
in a general metric
indexed by a parameter
We first consider models
rate
to
of
the
the
metric
of
the parameter set.
and
relate
convergence
space
entropy
sets.
to Euclidean
Next we specialize
parameter
conditions.
of probability densities
indexed by a parameter 0
[po :0 e 0} be a collection
of the data and assume that
in a metric space (&,d). Let Po be the true distribution
there exists a #* e 0, such that, for all 0, 0\, 02 e 0 and some constant C > 0,
Let
(5.1)
?
Po log
<
-Cd2(0,0*),
Po*
(5.2)
pJJK_x\
\ Vpo2
(5.3) Po(log^)
)
<d2(eue2),
2
<d2(0u02).
Kullback-Leibler
di
inequality
implies that #* is a point of minimal
?
m*
9
between
and
the
model.
The
second
and
third
Po
vergence
Polog(pe/po)
are (integrated) Lipschitz
on the dependence
conditions
conditions
of po on 9.
The following
lemma shows that in the application
of Theorems
2.1 and 2.2 these
The
first
one
to replace the entropy for testing by the local entropy of 0
to (a multiple
of) the natural metric d.
In examples
to relax the conditions
itmay be worthwhile
In partic
somewhat.
can
the
conditions
be
Rather
"localized."
that they
than
ular,
(5.2)-(5.3)
assuming
conditions
allow
relative
are valid
the same results can be obtained
if they are valid
0\,02 e 0,
for every pair (9\, O2) with d(9\, O2) sufficiently
small and every pair (9\, O2) with
= 9*. For
= 9* and
=
and
situa
6\
92
92
Po
P$*
(i.e., the well-specified
arbitrary
a
on
condition
is
bound
the
distance
between
and
tion),
(5.2)
Pq*
Hellinger
P#,.
LEMMA 5.1. Under the preceding conditions, there exist positive constants $C_1, C_2$ such that, for all $m \in \mathbb{N}$, $\theta, \theta_1, \ldots, \theta_m \in \Theta$ and $\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$,

$\sup_{0<\alpha<1} -\log P_0\Bigl(\frac{\sum_i \lambda_i p_{\theta_i}}{p_{\theta^*}}\Bigr)^\alpha \ge C_1\sum_i \lambda_i d^2(\theta_i, \theta^*) - C_2\sum_i \lambda_i d^2(\theta, \theta_i)$.
YJ^d2(0i,9^-CXTkid2i9,9i)<C2
PROOF.
In view
a constant
exists
1-
(5.4)
By Lemma
of Lemma
/
/ / 0?x<l
5.3
with
(below)
p
=
pe*,
(5.2)
and (5.3),
there
C such that
\ Pe* )
\
Po(^^Y-aPo(log-^-)
5.3 with
a =
1, p = po,
<
Hi^iPOi/ t
2a2Cj:kid2i6i,e%
(5.2) and (5.3),
V Pe J
V Ei^iPOiJ <2C?M2(^).
l-Po(^^)-/>o(log-^-)
i
We
can evaluate
combination
this with
k(
of the resulting
=
1 (for each / in turn) and next subtract the convex
from the preceding
inequalities
display to obtain
Pe
Poflog
)-J2kiP0(log^-)
<4Cj^kid2i9i,9).
this remains valid if we replace 6 on the left
By the additivity of the logarithm,
hand side by 0*. Combining
the resulting inequality with (5.1) and (5.4), we obtain
l-Po( \
/
Po*iP0i)
The
lemma
follows
?
> 1 ? x.
logx
>aYkid2iei,6*)iC-2a)-4CYkid2i9i,9).
upon
i
choosing
a
>
0
small
sufficiently
and
using
If the prior on the model $\{p_\theta : \theta \in \Theta\}$ is induced by a prior on the parameter set $\Theta$, then the prior mass condition (2.11) translates into a lower bound for the prior mass of the set

$B(\varepsilon, \theta^*; P_0) = \Bigl\{\theta \in \Theta : -P_0\log\frac{p_\theta}{p_{\theta^*}} \le \varepsilon^2,\ P_0\Bigl(\log\frac{p_\theta}{p_{\theta^*}}\Bigr)^2 \le \varepsilon^2\Bigr\}$.

In addition to (5.1), it is reasonable to assume a lower bound of the form

(5.5)    $P_0\log\frac{p_\theta}{p_{\theta^*}} \ge -Cd^2(\theta, \theta^*)$,

at least for small values of $d(\theta, \theta^*)$. This together with (5.3) implies that $B(\varepsilon, \theta^*; P_0)$ contains a ball of the form $\{\theta : d(\theta, \theta^*) \le C_1\varepsilon\}$ for small enough $\varepsilon$. Thus, in the verification of (2.4) or (2.11) we may replace $B(\varepsilon, P^*; P_0)$ by a ball of radius $\varepsilon$ around $\theta^*$. These observations lead to the following theorem.
THEOREM 5.1. Let (5.1)-(5.5) hold. If for sufficiently small $A$ and $C$, and every $j$,

$\sup_{\varepsilon > \varepsilon_n}\log N\bigl(A\varepsilon, \{\theta \in \Theta : \varepsilon < d(\theta, \theta^*) < 2\varepsilon\}, d\bigr) \le n\varepsilon_n^2$,

$\frac{\Pi(\theta : j\varepsilon_n < d(\theta, \theta^*) \le 2j\varepsilon_n)}{\Pi(\theta : d(\theta, \theta^*) \le C\varepsilon_n)} \le e^{n\varepsilon_n^2j^2/8}$,

then $\Pi(\theta : d(\theta, \theta^*) \ge M_n\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$ for any $M_n \to \infty$.

5.1. Finite-dimensional models. Let $\Theta$ be an open subset of $m$-dimensional Euclidean space equipped with the Euclidean distance $d$ and let $\{p_\theta : \theta \in \Theta\}$ be a model satisfying (5.1)-(5.5). Then the local covering numbers as in the preceding theorem satisfy, for some constant $B$,

$N\bigl(A\varepsilon, \{\theta \in \Theta : \varepsilon < d(\theta, \theta^*) < 2\varepsilon\}, d\bigr) \le \Bigl(\frac{B}{A}\Bigr)^m$

(see, e.g., [8], Section 5). In view of Lemma 2.2, condition (2.5) is satisfied for $\varepsilon_n$ a large multiple of $1/\sqrt{n}$. If the prior $\Pi$ on $\Theta$ possesses a density that is bounded away from zero and infinity, then

$\frac{\Pi(\theta : d(\theta, \theta^*) \le j\varepsilon)}{\Pi(B(\varepsilon, \theta^*; P_0))} \le C_2\,j^m$

for some constant $C_2$. It follows that (2.11) is satisfied for the same $\varepsilon_n$. Hence, the posterior concentrates at rate $1/\sqrt{n}$ near the point $\theta^*$.
The preceding situation arises if the minimal point $\theta^*$ is interior to the parameter set $\Theta$. An example is fitting an exponential family, such as the Gaussian model, to observations that are not sampled from an element of the family. If the minimal point $\theta^*$ is not interior to $\Theta$, then we cannot expect (5.1) to hold for the natural distance and different rates of convergence may arise. We include a simple example of the latter type, which is somewhat surprising.

EXAMPLE. Suppose that $P_0$ is the standard normal distribution and the model consists of all $N(\theta, 1)$-distributions with $\theta \ge 1$. The minimal Kullback-Leibler point is $\theta^* = 1$. If the prior possesses a density on $[1, \infty)$ that is bounded away from 0 and infinity near 1, then the posterior concentrates near $\theta^*$ at the rate $1/n$.
One easily shows that

(5.6)    $-P_0\log\frac{p_\theta}{p_{\theta^*}} = \tfrac{1}{2}(\theta - \theta^*)(\theta + \theta^*)$,

$-\log P_0\Bigl(\frac{p_\theta}{p_{\theta^*}}\Bigr)^\alpha = \tfrac{1}{2}\alpha(\theta - \theta^*)\bigl(\theta + \theta^* - \alpha(\theta - \theta^*)\bigr)$.

This shows that (2.9) is satisfied for a multiple of the metric $d(p_{\theta_1}, p_{\theta_2}) = \sqrt{|\theta_1 - \theta_2|}$ on $\Theta = [1, \infty)$. Its strengthening (2.8) can be verified by the same methods as before, or, alternatively, the existence of suitable tests can be established directly based on the special nature of the normal location family. [A suitable test for an interval $(\theta_1, \theta_2)$ can be obtained from a suitable test for its left endpoint.] The entropy and prior mass can be estimated as in regular parametric models and conditions (2.5)-(2.11) can be shown to be satisfied for $\varepsilon_n$ a large multiple of $1/\sqrt{n}$. This yields the rate $1/\sqrt{n}$ relative to the metric $\sqrt{|\theta_1 - \theta_2|}$ and, hence, the rate $1/n$ in the natural metric.
The theorem only gives an upper bound on the rate of convergence. In the present situation this appears to be sharp. For instance, for a uniform prior on [1, 2], the posterior mass of the interval [c, 2] can be seen to be, with Z_n = √n X̄_n,

  (Φ(2√n − Z_n) − Φ(c√n − Z_n)) / (Φ(2√n − Z_n) − Φ(√n − Z_n))
    ≈ e^{−(1/2)(c²−1)n + Z_n(c−1)√n} (√n − Z_n)/(c√n − Z_n),

where we use Mills' ratio to see that Φ(y_n) − Φ(x_n) ≈ (1/x_n)φ(x_n) if x_n, y_n → ∞ such that x_n/y_n → 0. This is bounded away from zero for c = c_n = 1 + C/n and fixed C.
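Both identities in (5.6) follow from Gaussian moment-generating-function algebra; the following sketch double-checks them by plain trapezoidal quadrature (the values θ* = 1, θ = 1.7, α = 0.3 and the integration grid are illustrative choices, not taken from the text):

```python
import math

def npdf(x, mu):
    # density of N(mu, 1)
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(f, a=-12.0, b=12.0, n=160001):
    # simple trapezoidal rule on a fine grid
    h = (b - a) / (n - 1)
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n - 1)))

theta_star, theta, alpha = 1.0, 1.7, 0.3

# -P0 log(p_theta/p_theta*) for P0 = N(0,1): should equal (theta - theta*)(theta + theta*)/2
kl = integrate(lambda x: npdf(x, 0.0) *
               (math.log(npdf(x, theta_star)) - math.log(npdf(x, theta))))

# P0 (p_theta/p_theta*)^alpha: should equal the exponential in (5.6)
rho = integrate(lambda x: npdf(x, 0.0) *
                (npdf(x, theta) / npdf(x, theta_star)) ** alpha)
closed = math.exp(-0.5 * alpha * (theta - theta_star) *
                  (theta + theta_star - alpha * (theta - theta_star)))

assert abs(kl - 0.5 * (theta - theta_star) * (theta + theta_star)) < 1e-6
assert abs(rho - closed) < 1e-6
```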
LEMMA 5.2. There exists a universal constant C such that, for any probability measure P₀, any finite measures P and Q and any 0 < α < 1,

  |1 − P₀(q/p)^α − αP₀ log(p/q)| ≤ α²C P₀[(√(q/p) − 1)² 1{q>p} + (log(p/q))² 1{q≤p}].

LEMMA 5.3. There exists a universal constant C such that, for any probability measure P₀, any finite measures P, Q₁, ..., Q_m and any λ₁, ..., λ_m ≥ 0 with Σᵢ λᵢ = 1 and 0 < α < 1, the following inequality holds:

  |1 − P₀(Σᵢ λᵢqᵢ/p)^α − αP₀ log(p/Σᵢ λᵢqᵢ)| ≤ α²C Σᵢ λᵢ P₀[(√(qᵢ/p) − 1)² + (log(p/qᵢ))²].
PROOFS. The function R defined by R(x) = (e^x − 1 − x)/x² for x ≤ 0 and R(x) = (e^x − 1 − x)/(α²(e^{x/2α} − 1)²) for x > 0 is uniformly bounded on ℝ by a constant C, independent of α ∈ (0, 1]. [This may be proved by noting that the functions (e^x − 1 − x)/(e^{x/2} − 1)² and (e^x − 1)/(α(e^{x/α} − 1)) are bounded; for the first this follows by developing the exponentials in their power series.] For the proof of the first lemma, we can proceed as in the proof of Lemma 4.2. For the proof of the second lemma, we proceed as in the proof of Lemma 4.3, this time also making use of the convexity of the map x ↦ |√x − 1|² on [0, ∞). □
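The uniform boundedness of R claimed in this proof can be checked on a grid (a sketch; the grid, the overflow cutoff and the test constant 2.0 are illustrative choices — the argument above only produces some constant C):

```python
import math

def R(x, alpha):
    # R(x) = (e^x - 1 - x)/x^2 for x <= 0,
    # R(x) = (e^x - 1 - x)/(alpha^2 (e^{x/(2 alpha)} - 1)^2) for x > 0
    if x < 0:
        return (math.exp(x) - 1.0 - x) / (x * x)
    t = x / (2.0 * alpha)
    if t > 300.0:
        return 0.0  # denominator dwarfs the numerator; R is numerically 0 here
    return (math.exp(x) - 1.0 - x) / (alpha * alpha * (math.exp(t) - 1.0) ** 2)

vals = [R(x / 10.0, a / 100.0)
        for x in range(-500, 501) if x != 0
        for a in range(1, 101)]
assert all(v >= 0.0 for v in vals)   # R is nonnegative everywhere on the grid
assert max(vals) < 2.0               # bounded, uniformly in alpha in (0, 1]
```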
6. Existence of tests. The proofs of Theorems 2.1 and 2.2 rely on tests of P₀ versus the positive, finite measures Q(P) obtained from points P that are at positive distance from the set of points with minimal Kullback–Leibler divergence. Because we need to test P₀ against finite measures (i.e., not necessarily probability measures), known results on tests using the Hellinger distance, such as in [10] or [8], do not apply. It turns out that in this situation the Hellinger distance may not be appropriate and instead we use the full Hellinger transform. The aim of this section is to prove the existence of suitable tests and give upper bounds on their power. We first formulate the results in a general notation and then specialize to the application in misspecified models.
6.1. General setup. Let P be a probability measure on a measurable space (𝒳, 𝒜) (playing the role of P₀) and let 𝒬 be a class of finite measures on (𝒳, 𝒜) [playing the role of the measures Q with dQ = (p₀/p*) dP]. We wish to bound the minimax risk for testing P versus 𝒬, defined by

  π(P, 𝒬) = inf_φ sup_{Q∈𝒬} (Pφ + Q(1 − φ)),

where the infimum is taken over all measurable functions φ: 𝒳 → [0, 1]. Let conv(𝒬) denote the convex hull of the set 𝒬.

LEMMA 6.1. If there exists a σ-finite measure that dominates all Q ∈ 𝒬, then

  π(P, 𝒬) = sup_{Q∈conv(𝒬)} (P(p < q) + Q(p ≥ q)).

Moreover, there exists a test φ that attains the infimum in the definition of π(P, 𝒬).
PROOF. If μ′ is a measure dominating 𝒬, then a σ-finite measure μ exists dominating both 𝒬 and P (e.g., μ = μ′ + P). Let p and q be μ-densities of P and Q, for every Q ∈ 𝒬. The set of test-functions φ can be identified with the positive unit ball Φ of L_∞(𝒳, 𝒜, μ), which is dual to L₁(𝒳, 𝒜, μ), since μ is σ-finite. If equipped with the weak-* topology, the positive unit ball Φ is Hausdorff and compact by the Banach–Alaoglu theorem (see, e.g., [11], Theorem 2.6.18, and note that the positive functions form a closed and convex subset of the unit ball). The convex hull conv(𝒬) (or rather the corresponding set of μ-densities) is a convex subset of L₁(𝒳, 𝒜, μ). The map

  L_∞(𝒳, 𝒜, μ) × L₁(𝒳, 𝒜, μ) → ℝ,  (φ, Q) ↦ φP + (1 − φ)Q

is concave in Q and convex in φ. [Note that in the current context we write φP instead of Pφ, in accordance with the fact that we consider φ as a bounded linear functional on L₁(𝒳, 𝒜, μ).] Moreover, the map is weak-*-continuous in φ for every fixed Q [note that every weak-*-converging net φ_a ⇀ φ by definition satisfies φ_a Q → φQ for all Q ∈ L₁(𝒳, 𝒜, μ)]. The conditions for application of the minimax theorem (see, e.g., [15], page 239) are satisfied and we conclude that

  inf_{φ∈Φ} sup_{Q∈conv(𝒬)} (φP + (1 − φ)Q) = sup_{Q∈conv(𝒬)} inf_{φ∈Φ} (φP + (1 − φ)Q).

The expression on the left-hand side is the minimax testing risk π(P, 𝒬). The infimum on the right-hand side is attained at the point φ = 1{p < q}, which leads to the first assertion of the lemma upon substitution.

The second assertion of the lemma follows because the function φ ↦ sup{φP + (1 − φ)Q: Q ∈ conv(𝒬)} is a supremum of weak-*-continuous functions and, hence, attains its minimum on the compactum Φ. □
It is possible to express the right-hand side of the preceding lemma in the L₁-distance between P and Q, but this is not useful for the following. Instead, we use a bound in terms of the Hellinger transform ρ_α(P, Q), defined for 0 < α < 1 by

  ρ_α(P, Q) = ∫ p^α q^{1−α} dμ.

By Hölder's inequality, this quantity is finite for all finite measures P and Q. The definition is independent of the choice of the dominating measure μ.

For any pair (P, Q) and every α ∈ (0, 1), we can bound

(6.1)
  P(p < q) + Q(p ≥ q) = ∫_{p<q} p dμ + ∫_{p≥q} q dμ
    ≤ ∫_{p<q} p^α q^{1−α} dμ + ∫_{p≥q} p^α q^{1−α} dμ = ρ_α(P, Q).

Hence, the right-hand side of the preceding lemma is bounded by sup_Q ρ_α(P, Q) for all α ∈ (0, 1). The advantage of this bound is the fact that it factorizes if P and Q are product measures. For ease of notation, define

  ρ_α(𝒫, 𝒬) = sup{ρ_α(P, Q): P ∈ conv(𝒫), Q ∈ conv(𝒬)}.

LEMMA 6.2. For any 0 < α < 1 and classes 𝒫₁, 𝒫₂, 𝒬₁, 𝒬₂ of finite measures,

  ρ_α(𝒫₁ × 𝒫₂, 𝒬₁ × 𝒬₂) ≤ ρ_α(𝒫₁, 𝒬₁) ρ_α(𝒫₂, 𝒬₂),

where 𝒫₁ × 𝒫₂ denotes the class of product measures {P₁ × P₂: P₁ ∈ 𝒫₁, P₂ ∈ 𝒫₂}.
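The bound (6.1) can be illustrated numerically for a pair of unit-variance normals (an illustrative choice of P and Q, not taken from the text; for N(0,1) versus N(1,1) one has ρ_α(P, Q) = e^{−α(1−α)/2} by a standard Gaussian computation):

```python
import math

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(f, a=-12.0, b=13.0, n=200001):
    # simple trapezoidal rule on a fine grid
    h = (b - a) / (n - 1)
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n - 1)))

def p(x):
    return npdf(x, 0.0)   # P = N(0, 1)

def q(x):
    return npdf(x, 1.0)   # Q = N(1, 1)

# left-hand side of (6.1): the two error terms of the test 1{p < q}
err = integrate(lambda x: p(x) if p(x) < q(x) else q(x))

for alpha in (0.2, 0.5, 0.8):
    rho = integrate(lambda x: p(x) ** alpha * q(x) ** (1.0 - alpha))
    assert abs(rho - math.exp(-0.5 * alpha * (1.0 - alpha))) < 1e-6
    assert err <= rho   # the bound (6.1)
```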
PROOF. Let P ∈ conv(𝒫₁ × 𝒫₂) and Q ∈ conv(𝒬₁ × 𝒬₂) be given. Since both are (finite) convex combinations, σ-finite measures μ₁ and μ₂ can always be found such that both P and Q have μ₁ × μ₂-densities, which both can be written in the form of a finite convex combination

  p(x, y) = Σᵢ λᵢ p₁ᵢ(x) p₂ᵢ(y),  λᵢ ≥ 0, Σᵢ λᵢ = 1,
  q(x, y) = Σⱼ κⱼ q₁ⱼ(x) q₂ⱼ(y),  κⱼ ≥ 0, Σⱼ κⱼ = 1,

for μ₁ × μ₂-almost-all pairs (x, y) ∈ 𝒳 × 𝒴. Here the p₁ᵢ and q₁ⱼ are μ₁-densities for measures belonging to 𝒫₁ and 𝒬₁, respectively (and, analogously, the p₂ᵢ and q₂ⱼ for measures in 𝒫₂ and 𝒬₂). This implies that we can write

  ∫ p^α q^{1−α} d(μ₁ × μ₂)
    = ∫ (Σᵢ λᵢ p₁ᵢ(x))^α (Σⱼ κⱼ q₁ⱼ(x))^{1−α}
      [∫ (Σᵢ λᵢ p₁ᵢ(x) p₂ᵢ(y)/Σᵢ λᵢ p₁ᵢ(x))^α (Σⱼ κⱼ q₁ⱼ(x) q₂ⱼ(y)/Σⱼ κⱼ q₁ⱼ(x))^{1−α} dμ₂(y)] dμ₁(x)

(where, as usual, the integrand of the inner integral is taken equal to zero whenever the μ₁-density equals zero). The inner integral is bounded by ρ_α(𝒫₂, 𝒬₂) for every fixed x ∈ 𝒳, since for fixed x the conditional mixtures are densities of elements of conv(𝒫₂) and conv(𝒬₂). After substituting this upper bound, the remaining integral is bounded by ρ_α(𝒫₁, 𝒬₁). □
Combining (6.1) with Lemmas 6.2 and 6.1, we obtain the following theorem.

THEOREM 6.1. If P is a probability measure on (𝒳, 𝒜) and 𝒬 is a dominated set of finite measures on (𝒳, 𝒜), then, for every n ≥ 1, there exists a test φ_n: 𝒳ⁿ → [0, 1] such that, for all 0 < α < 1,

  sup_{Q∈𝒬} (Pⁿφ_n + Qⁿ(1 − φ_n)) ≤ ρ_α(P, 𝒬)ⁿ.

The bound given by the theorem is useful only if ρ_α(P, 𝒬) < 1. For probability measures P and Q, we have

  ρ_{1/2}(P, Q) = 1 − ½ ∫ (√p − √q)² dμ

and, hence, we might use the bound with α = 1/2 if the Hellinger distance of the measure P to conv(𝒬) is positive. For a general finite measure Q, the quantity ρ_{1/2}(P, Q) may be bigger than 1 and, depending on Q, the Hellinger transform ρ_α(P, Q) may even lie above 1 for every α. The following lemma shows that this is controlled by the (generalized) Kullback–Leibler divergence −P log(q/p).
LEMMA 6.3. For a probability measure P and a finite measure Q, the function α ↦ ρ_α(Q, P) is convex on [0, 1] with ρ_α(Q, P) → P(q > 0) as α ↓ 0, ρ_α(Q, P) → Q(p > 0) as α ↑ 1 and

  dρ_α(Q, P)/dα |_{α=0} = P (log (q/p)) 1{q>0}

(which may be equal to −∞).
PROOF. The function α ↦ e^{αy} is convex on (0, 1) for all y ∈ [−∞, ∞), implying the convexity of α ↦ ρ_α(Q, P) = P(q/p)^α on (0, 1). The function y^α = e^{α log y} is continuous in α ∈ [0, 1] for any y ≥ 0, is decreasing in α for y < 1, increasing for y > 1 and constant for y = 1. By monotone convergence, as α ↓ 0,

  Q(p/q)^α 1{0<p≤q} ↑ Q(0 < p ≤ q).

By the dominated convergence theorem, with dominating function (p/q)^{1/2} 1{p>q}, for α ≤ 1/2 we have, as α ↓ 0,

  Q(p/q)^α 1{p>q} → Q(p > q).

Combining the two preceding displays, we see that

  ρ_{1−α}(Q, P) = Q(p/q)^α → Q(p > 0) as α ↓ 0,

that is, ρ_α(Q, P) → Q(p > 0) as α ↑ 1. The limit as α ↓ 0 follows upon exchanging the roles of P and Q.

By the convexity of the function α ↦ e^{αy}, the map α ↦ f_α(y) = (e^{αy} − 1)/α decreases, as α ↓ 0, to (d/dα)|_{α=0} e^{αy} = y, for every y. For y ≤ 0, we have f_α(y) ≤ 0, while, for y > 0, by Taylor's formula,

  f_α(y) ≤ sup_{0<α′<α} y e^{α′y} ≤ ε^{−1} e^{(α+ε)y}.

Consequently, the quotient f_α(log(q/p)) = α^{−1}((q/p)^α − 1) decreases as α ↓ 0 and, for α ≤ ε, is bounded above by 0 ∨ ε^{−1}(q/p)^{2ε} 1{q>p}, which is P-integrable for small ε > 0. Hence, by the monotone convergence theorem,

  (1/α)(ρ_α(Q, P) − ρ₀(Q, P)) = (1/α) P ((q/p)^α − 1) 1{q>0} ↓ P (log (q/p)) 1{q>0}

as α ↓ 0. □
Two typical graphs of the Hellinger transform α ↦ ρ_α(Q, P) are shown in Figure 1 [corresponding to fitting a unit variance normal location model in a situation where the observations are sampled from a N(0, 2)-distribution]. For P a probability measure with P ≪ Q, the Hellinger transform is equal to 1 at α = 0, but will eventually increase to a level that is above 1 near α = 1 if Q(p > 0) > 1.

FIG. 1. The Hellinger transforms α ↦ ρ_α(Q, P), for P = N(0, 2) and Q the measure defined by dQ = (dN(3/2, 1)/dN(0, 1)) dP (left) and dQ = (dN(3/2, 1)/dN(1, 1)) dP (right). The intercepts at 0 and 1 equal P(q > 0) and Q(p > 0), respectively, and the slope at 0 equals (minus) the Kullback–Leibler divergence P log(p/q).

Unless the slope at zero, P log(q/p), is negative, the graph will never decrease below the level 1. For probability measures P and Q, this slope equals minus the Kullback–Leibler distance and, hence, is strictly negative unless P = Q. In that case, the graph is strictly below 1 on (0, 1) and ρ_{1/2}(P, Q) is a convenient choice to work with. For a general finite measure Q, the Hellinger transform ρ_α(Q, P) is guaranteed to assume values strictly less than 1 near α = 0, provided that the Kullback–Leibler divergence −P log(q/p) is positive, which is not automatically the case. For testing a composite alternative 𝒬, we shall need that this is the case uniformly in Q ∈ conv(𝒬).
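The qualitative features of the left panel of Figure 1 can be reproduced numerically. For P = N(0, 2) and dQ = (dN(3/2, 1)/dN(0, 1)) dP, a quick moment-generating-function calculation (not spelled out in the text) gives ρ_α(Q, P) = e^{9α²/4 − 9α/8}, so the intercepts are P(q > 0) = 1 and Q(p > 0) = e^{9/8} ≈ 3.08, and the slope at zero is P log(q/p) = −9/8. A sketch verifying this by quadrature (the grid is an illustrative choice):

```python
import math

def npdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate(f, a=-25.0, b=25.0, n=100001):
    h = (b - a) / (n - 1)
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n - 1)))

def p(x):
    return npdf(x, 0.0, 2.0)                                # P = N(0, 2)

def q(x):
    return p(x) * npdf(x, 1.5, 1.0) / npdf(x, 0.0, 1.0)     # dQ = (dN(3/2,1)/dN(0,1)) dP

def rho(alpha):
    return integrate(lambda x: q(x) ** alpha * p(x) ** (1.0 - alpha))

# closed form of the transform
for a in (0.1, 0.25, 0.5, 0.9):
    assert abs(rho(a) - math.exp(2.25 * a * a - 1.125 * a)) < 1e-5

# intercept at 1: Q(p > 0) = Q(R) = e^{9/8} > 1, so the curve ends above 1
total_q = integrate(q)
assert abs(total_q - math.exp(1.125)) < 1e-5

# slope at 0: P log(q/p) = -9/8 < 0, so the curve dips below 1 near 0
slope = integrate(lambda x: p(x) * math.log(q(x) / p(x)))
assert abs(slope + 1.125) < 1e-6
assert rho(0.25) < 1.0   # the transform lies strictly below 1 in the interior
```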
For a convex alternative 𝒬, Theorem 6.1 guarantees the existence of tests based on n observations with error probabilities bounded by e^{−nε²} if

  ε² ≤ sup_{0<α<1} inf_{Q∈conv(𝒬)} log (1/ρ_α(Q, P)).

In some of the examples we can achieve inequalities of this type by bounding the right-hand side below by a (uniform) Taylor expansion of α ↦ log ρ_α(P, Q) in α near α = 0. Such arguments are not mere technical generalizations: they can be necessary to prove posterior consistency already relative to standard misspecified parametric models.

If P(q = 0) > 0, then the Hellinger transform is strictly less than 1 at α = 0 and, hence, good tests exist, even though it may be true that ρ_{1/2}(P, Q) > 1. The existence of good tests is obvious in this case, since we can reject Q if the observations land in the set q = 0.
In the above we have assumed that 𝒬 is dominated. If this is not the case, then the results go through, provided that we use Le Cam's generalized tests [10]; that is, we define

  π(P, 𝒬) = inf sup (φP + (1 − φ)Q),

where the infimum is taken over the set of all continuous, positive linear maps φ: L₁(𝒳, 𝒜) → ℝ such that φP ≤ 1 for all probability measures P. This collection of functionals includes the linear maps that arise from integration of measurable functions φ: 𝒳 → [0, 1], but may be larger. Such tests would be good enough for our purposes, but the generality appears to have little additional value for our application to misspecified models.
The next step is to extend the upper bound to alternatives 𝒬 that are possibly not convex. We are particularly interested in alternatives that are complements of balls around P in some metric. Let 𝒬 be a set of finite measures on (𝒳, 𝒜) and let τ: L₁(𝒳, 𝒜) × L₁(𝒳, 𝒜) → ℝ be such that τ(P, ·): 𝒬 → ℝ is a nonnegative function (written in a notation so as to suggest a distance from P to Q). For Q ∈ 𝒬, set

(6.2)
  τ²(P, Q) = sup_{0<α<1} log (1/ρ_α(P, Q)).

For ε > 0, define N_τ(ε, 𝒬) to be the minimal number of convex sets contained in {Q ∈ L₁(𝒳, 𝒜): τ(P, Q) > ε/2} needed to cover {Q ∈ 𝒬: ε < τ(P, Q) < 2ε}, and assume that 𝒬 is such that this number is finite for all ε > 0. (The requirement that these convex subsets have τ-distance ε/2 to P is essential.) Then the following theorem applies.
THEOREM 6.2. Let P be a probability measure and 𝒬 be a dominated set of finite measures on (𝒳, 𝒜). Then for all ε > 0 and all n ≥ 1, there exists a test φ_n such that, for all J ∈ ℕ,

(6.3)
  Pⁿφ_n ≤ Σ_{j=1}^∞ N_τ(jε, 𝒬) e^{−nj²ε²/4},
  sup_{Q: τ(P,Q)>Jε} Qⁿ(1 − φ_n) ≤ e^{−nJ²ε²/4}.
PROOF. Fix n ≥ 1 and ε > 0 and define 𝒬_j = {Q ∈ 𝒬: jε < τ(P, Q) ≤ (j + 1)ε}. By assumption, there exists for every j ≥ 1 a finite cover of 𝒬_j by N_j = N_τ(jε, 𝒬) convex sets C_{j,1}, ..., C_{j,N_j} of finite measures, with the further property that

(6.4)
  inf_{Q∈C_{j,i}} τ(P, Q) ≥ jε/2,  1 ≤ i ≤ N_j.

According to Theorem 6.1, for all n ≥ 1 and for each set C_{j,i}, there exists a test φ_{n,j,i} such that, for all α ∈ (0, 1), we have

  Pⁿφ_{n,j,i} ≤ ρ_α(P, C_{j,i})ⁿ,  sup_{Q∈C_{j,i}} Qⁿ(1 − φ_{n,j,i}) ≤ ρ_α(P, C_{j,i})ⁿ.

By (6.4), we have

  sup_{Q∈C_{j,i}} inf_{0<α<1} ρ_α(P, Q) ≤ sup_{Q∈C_{j,i}} e^{−τ²(P,Q)} ≤ e^{−j²ε²/4}.

For fixed P and Q, the function α ↦ ρ_α(P, Q) is convex and can be extended continuously to a convex function on [0, 1]. The function Q ↦ ρ_α(P, Q) with domain L₁(𝒳, 𝒜) is concave. By the minimax theorem (see, e.g., [15], page 239), the left-hand side of the preceding display equals

  inf_{0<α<1} sup_{Q∈C_{j,i}} ρ_α(P, Q) = inf_{0<α<1} ρ_α(P, C_{j,i}).

It follows that

  Pⁿφ_{n,j,i} ≤ e^{−nj²ε²/4},  sup_{Q∈C_{j,i}} Qⁿ(1 − φ_{n,j,i}) ≤ e^{−nj²ε²/4}.

Now define a new test function φ_n by

  φ_n = sup_{j≥1} max_{1≤i≤N_j} φ_{n,j,i}.

Then, for every J ≥ 1,

  Pⁿφ_n ≤ Σ_{j=1}^∞ Σ_{i=1}^{N_j} Pⁿφ_{n,j,i} ≤ Σ_{j=1}^∞ N_j e^{−nj²ε²/4},

  sup_{Q∈𝒬^J} Qⁿ(1 − φ_n) ≤ sup_{j≥J} max_{1≤i≤N_j} sup_{Q∈C_{j,i}} Qⁿ(1 − φ_{n,j,i}) ≤ sup_{j≥J} e^{−nj²ε²/4} = e^{−nJ²ε²/4},

where 𝒬^J = {Q: τ(P, Q) > Jε} ⊂ ∪_{j≥J} 𝒬_j. □
6.2. Application to misspecification. When applying the above in the proof of consistency in misspecified models, the problem is to test the true distribution P₀ against measures Q = Q(P) taking the form dQ = (p₀/p*) dP for P ∈ 𝒫. In this case, the Hellinger transform takes the form ρ_α(Q, P₀) = P₀(p/p*)^α and its right derivative at α = 0 is equal to P₀ log(p/p*). This is negative for every P ∈ 𝒫 if and only if P* is the point in 𝒫 at minimal Kullback–Leibler divergence relative to P₀.

This observation illustrates that the measure P* in Theorem 2.1 is necessarily a point of minimal Kullback–Leibler divergence, even if this is not explicitly assumed. We formalize this in the following lemma.
LEMMA 6.4. If P* and P are such that P₀ log(p₀/p*) < ∞ and P₀(p/p*) < ∞ and the right-hand side of (2.9) is positive, then P₀ log(p₀/p*) ≤ P₀ log(p₀/p). Consequently, the covering numbers for testing N_t(ε, 𝒫, d; P₀, P*) in Theorem 2.1 can be finite only if P* is a point of minimal Kullback–Leibler divergence relative to P₀.
PROOF. The assumptions imply that P₀(p* > 0) = 1. If P₀(p = 0) > 0, then P₀ log(p₀/p) = ∞ and there is nothing to prove. Thus, we may assume that p is also strictly positive under P₀. Then, in view of Lemma 6.3, the function g defined by g(α) = P₀(p/p*)^α = ρ_α(Q, P₀) is continuous on [0, 1] with g(0) = P₀(p > 0) = 1, and the right-hand side of (2.9) can be positive only if g(α) < 1 for some α ∈ [0, 1]. By convexity of g and the fact that g(0) = 1, this can happen only if the right derivative of g at zero is nonpositive. In view of Lemma 6.3, this derivative is g′(0+) = P₀ log(p/p*), so that P₀ log(p₀/p*) ≤ P₀ log(p₀/p).

Finiteness of the covering numbers for testing N_t(ε, 𝒫, d; P₀, P*) for some ε > 0 implies that every P ∈ 𝒫 with d(P, P*) > ε must be contained in one of the sets 𝒫_i in the definition of the covering numbers, in which case the right-hand side of (2.9) for every such P is bounded below by ε²/4 and, hence, is positive. □
If P₀(p/p*) ≤ 1 for every P ∈ 𝒫, then the measure Q defined by dQ = (p₀/p*) dP is a subprobability measure and, hence, by convexity, the Hellinger transform α ↦ ρ_α(P₀, Q) is never above the level 1 and is strictly less than 1 at α = 1/2 unless P₀ = Q. In such a case there appears to be no loss in generality in working with the choice α = 1/2 only, leading to the distance d as in Lemma 2.3. This lemma shows that this situation arises if 𝒫 is convex.

The following theorem translates Theorem 6.2 into the form needed for the proofs of our main results. Recall the definition of the covering numbers for testing N_t(ε, 𝒫, d; P₀, P*) in (2.2).
THEOREM 6.3. Suppose P* ∈ 𝒫 and P₀(p/p*) < ∞ for all P ∈ 𝒫. Assume that there exists a nonincreasing function D such that, for some ε_n ≥ 0 and every ε > ε_n,

(6.5)
  N_t(ε, 𝒫, d; P₀, P*) ≤ D(ε).

Then for every ε > ε_n there exists a test φ_n (depending on ε > 0) such that, for every J ∈ ℕ,

(6.6)
  P₀ⁿφ_n ≤ D(ε) e^{−nε²/4} / (1 − e^{−nε²/4}),
  sup_{P∈𝒫: d(P,P*)>Jε} Q(P)ⁿ(1 − φ_n) ≤ e^{−nJ²ε²/4}.
PROOF. Define 𝒬 as the set of all finite measures Q(P) as P ranges over 𝒫 (where p₀/p* = 0 if p₀ = 0) and define τ(Q₁, Q₂) = d(P₁, P₂). Then Q(P*) = P₀ and, hence, d(P, P*) = τ(Q(P), P₀). Identify the P of Theorem 6.2 with the present measure P₀. By the definitions (2.2) and (6.2), we have N_τ(ε, 𝒬) ≤ N_t(ε, 𝒫, d; P₀, P*) ≤ D(ε) for every ε > ε_n. Therefore, the test function φ_n guaranteed to exist by Theorem 6.2 satisfies

  P₀ⁿφ_n ≤ Σ_{j=1}^∞ D(jε) e^{−nj²ε²/4} ≤ D(ε) Σ_{j=1}^∞ e^{−njε²/4},

since D is nonincreasing. This can be bounded further as in the assertion because, for all 0 < x < 1, Σ_{n≥1} xⁿ ≤ x/(1 − x). The second line in the assertion is simply the second line in (6.3). □
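The final bounding step in this proof, comparing Σ_j e^{−nj²ε²/4} with the geometric series, can be sketched numerically (the values of x below are illustrative choices):

```python
# For 0 < x < 1, sum_{j>=1} x^{j^2} <= sum_{j>=1} x^j = x/(1 - x),
# since j^2 >= j; this is used with x = e^{-n eps^2/4}.
for x in (0.9, 0.5, 0.1, 0.01):
    s = sum(x ** (j * j) for j in range(1, 200))
    assert s <= x / (1.0 - x) + 1e-12
```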
7. Proofs of the main theorems. The following lemma is analogous to Lemma 8.1 in [8] and can be proved in the same manner.

LEMMA 7.1. For given ε > 0 and P* ∈ 𝒫 such that P₀ log(p₀/p*) < ∞, define B(ε, P*; P₀) by (2.3). Then for every C > 0 and every probability measure Π on 𝒫,

  P₀ⁿ( ∫ ∏_{i=1}^n (p/p*)(X_i) dΠ(P) ≤ Π(B(ε, P*; P₀)) e^{−(1+C)nε²} ) ≤ 1/(C²nε²).
PROOF OF THEOREM 2.2. In view of (2.5), the conditions of Theorem 6.3 are satisfied with the function D(ε) = e^{nε_n²} (i.e., constant in ε > ε_n). Let φ_n be the test as in the assertion of this theorem for ε = Mε_n and M a large constant, to be determined later in the proof.

For C > 0, also to be determined later in the proof, let Ω_n be the event

(7.1)
  ∫ ∏_{i=1}^n (p/p*)(X_i) dΠ(P) ≥ e^{−(1+C)nε_n²} Π(B(ε_n, P*; P₀)).

Then P₀ⁿ(𝒳ⁿ \ Ω_n) ≤ 1/(C²nε_n²), by Lemma 7.1.

Set Π_n(ε) = Π_n(P ∈ 𝒫: d(P, P*) > ε | X₁, ..., X_n). For every n ≥ 1 and J ∈ ℕ, we can decompose

(7.2)
  P₀ⁿΠ_n(JMε_n) = P₀ⁿ(Π_n(JMε_n)φ_n) + P₀ⁿ(Π_n(JMε_n)(1 − φ_n)1{Ω_nᶜ}) + P₀ⁿ(Π_n(JMε_n)(1 − φ_n)1{Ω_n}).

We estimate the three terms on the right-hand side separately. Because Π_n(ε) ≤ 1, the middle term is bounded by 1/(C²nε_n²). This converges to zero as nε_n² → ∞ for fixed C and/or can be made arbitrarily small by choosing a large constant C if nε_n² is bounded away from zero.

By the first inequality in (6.6), the first term on the right-hand side of (7.2) is bounded by

  P₀ⁿ(Π_n(JMε_n)φ_n) ≤ P₀ⁿφ_n ≤ e^{(1−M²/4)nε_n²} / (1 − e^{−nM²ε_n²/4}).

For sufficiently large M, the expression on the right-hand side is bounded above by 2e^{−nε_n²M²/8} for sufficiently large n and, hence, can be made arbitrarily small by the choice of M, or converges to 0 for fixed M if nε_n² → ∞.

Estimation of the third term on the right-hand side of (7.2) is more involved. Because P₀(p* > 0) = 1, we can write

(7.3)
  P₀ⁿ(Π_n(JMε_n)(1 − φ_n)1{Ω_n})
    = P₀ⁿ( [∫_{d(P,P*)>JMε_n} ∏_{i=1}^n (p/p*)(X_i) dΠ(P) / ∫ ∏_{i=1}^n (p/p*)(X_i) dΠ(P)] (1 − φ_n) 1{Ω_n} ),

where we have written the arguments X_i for clarity. By the definition of Ω_n, the integral in the denominator is bounded below by e^{−(1+C)nε_n²} Π(B(ε_n, P*; P₀)). Inserting this bound, writing Q(P) for the measure defined by dQ(P) = (p₀/p*) dP, and using Fubini's theorem, we can bound the right-hand side of the preceding display by

(7.4)
  (e^{(1+C)nε_n²} / Π(B(ε_n, P*; P₀))) ∫_{d(P,P*)>JMε_n} Q(P)ⁿ(1 − φ_n) dΠ(P).

Setting 𝒫_{n,j} = {P ∈ 𝒫: Mε_n j < d(P, P*) ≤ Mε_n(j + 1)}, we can decompose {P: d(P, P*) > JMε_n} = ∪_{j≥J} 𝒫_{n,j}. The tests φ_n have been chosen to satisfy the inequality Q(P)ⁿ(1 − φ_n) ≤ e^{−nj²M²ε_n²/4} uniformly in P ∈ 𝒫_{n,j}. [Cf. the second inequality in (6.6).] It follows that the preceding display is bounded by

  (e^{(1+C)nε_n²} / Π(B(ε_n, P*; P₀))) Σ_{j≥J} e^{−nj²M²ε_n²/4} Π(𝒫_{n,j}) ≤ Σ_{j≥J} e^{(1+C)nε_n² + nε_n²M²j²/8 − nj²M²ε_n²/4},

by (2.11). For fixed C and sufficiently large M, this converges to zero if nε_n² is bounded away from zero and J = J_n → ∞. □

PROOF OF THEOREM 2.1. Because Π is a probability measure, the numerator in (2.11) is bounded above by 1. Therefore, the prior mass condition (2.11) is implied (for large j) by the prior mass condition (2.4). We conclude that the assertion of Theorem 2.1, but with M = M_n → ∞, follows from Theorem 2.2. That in fact it suffices that M is sufficiently large follows by inspection of the preceding proof. □
PROOF OF THEOREM 2.4. The proof of this theorem follows the same steps as the preceding proofs. A difference is that we cannot appeal to the preparatory lemmas and theorems in separate steps. The shells 𝒫_{n,j} = {P ∈ 𝒫: Mjε_n < d(P, 𝒫*) ≤ M(j + 1)ε_n} must be covered by sets B_{n,j,i} as in the definition (2.16), and for each such set we use the appropriate element P*_{n,j,i} ∈ 𝒫* to define a test φ_{n,j,i} and to rewrite the left-hand side of (7.3). We omit the details. □
8. Technical lemmas. Lemma 8.1 is used to upper bound the Kullback–Leibler divergence and the expectation of the squared logarithm by a function of the L₁-norm. A similar lemma was presented in [16], where both p and q were assumed to be densities of probability distributions. We generalize this result to the case where q is the density of a finite measure and we are forced to use the Hellinger distance instead of the L₁-distance.
LEMMA 8.1. For every b > 0, there exists a constant ε_b > 0 such that, for every probability measure P and finite measure Q with 0 < h²(p, q) < ε_b P(p/q)^b,

  P log (p/q) ≤ h²(p, q) + ‖p − q‖₁ + (4/b) h²(p, q) log ( P(p/q)^b / h²(p, q) ),
  P (log (p/q))² ≤ 12 h²(p, q) + (16/b²) h²(p, q) ( log ( P(p/q)^b / h²(p, q) ) )².
PROOF. The function r: (0, ∞) → ℝ defined implicitly by

  log x = 2(√x − 1) − r(x)(√x − 1)²

possesses the following properties:
— r is nonnegative and decreasing.
— r(x) ∼ log(1/x) as x ↓ 0, whence there exists ε′ > 0 such that r(x) ≤ 2 log(1/x) on [0, ε′]. (A computer graph indicates that ε′ = 0.4 will do.)
— For every b > 0, there exists ε′_b > 0 such that x^b r(x) is increasing on [0, ε′_b]. (For b ≥ 1, we may take ε′_b = 1, but for b close to zero, ε′_b must be very small.)

In view of the definition of r and the first property, we can write

  P log (p/q) = −2P(√(q/p) − 1) + P r(q/p)(√(q/p) − 1)²
    ≤ h²(p, q) + ‖p − q‖₁ + r(s) h²(p, q) + P r(q/p)(√(q/p) − 1)² 1{q/p<s},

for any 0 < s < 4, where we use the fact that |√(q/p) − 1| < 1 if q/p < 4. Next we choose s ≤ ε′_b and use the third property to bound the last term on the right-hand side by s^b r(s) P(p/q)^b. Combining the resulting bound with the second property, we then obtain, for s ≤ ε′ ∧ ε′_b ∧ 4,

  P log (p/q) ≤ h²(p, q) + ‖p − q‖₁ + 2 log(1/s) h²(p, q) + 2 s^b log(1/s) P(p/q)^b.

For s^b = h²(p, q)/P(p/q)^b, the last two terms on the right-hand side coincide. If h²(p, q) ≤ ε_b P(p/q)^b for a sufficiently small ε_b, then this choice of s is eligible and the first inequality of the lemma follows. Specifically, we can choose ε_b ≤ (ε′ ∧ ε′_b ∧ 4)^b.

To prove the second inequality, we first note that, since |log x| ≤ 2|√x − 1| for x ≥ 1,

  P (log (q/p))² 1{q≥p} ≤ 4P(√(q/p) − 1)² ≤ 4h²(p, q).

Next, with r as in the first part of the proof,

  P (log (q/p))² 1{q<p} ≤ 8h²(p, q) + 2r²(s) h²(p, q) + 2 s^b r²(s) P(p/q)^b,

for s ≤ ε′_{b/2}, in view of the third property of r. (The power of 4 in the first line of the array can be lowered to 2 or 0, as |√(q/p) − 1| ≤ 1.) We can use the second property of r to bound r(s) and next choose s^b = h²(p, q)/P(p/q)^b to finish the proof. Specifically, we can choose ε_b ≤ (ε′ ∧ ε′_{b/2})^b. □
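The three properties of r used in this proof, including the claim that ε′ = 0.4 will do, can be verified on a grid (a sketch; the grid is an illustrative choice):

```python
import math

def r(x):
    # r is defined implicitly by log x = 2(sqrt(x) - 1) - r(x)(sqrt(x) - 1)^2, x != 1
    s = math.sqrt(x) - 1.0
    return (2.0 * s - math.log(x)) / (s * s)

xs = [k / 1000.0 for k in range(1, 401)]   # grid in (0, 0.4]
vals = [r(x) for x in xs]

# (i) nonnegative and decreasing (on the grid)
assert all(v >= 0.0 for v in vals)
assert all(a >= b for a, b in zip(vals, vals[1:]))

# (ii) r(x) <= 2 log(1/x) on (0, 0.4], i.e. eps' = 0.4 works
assert all(r(x) <= 2.0 * math.log(1.0 / x) for x in xs)

# (iii) x^b r(x) increasing near 0, here for b = 1
prods = [x * r(x) for x in xs]
assert all(a <= b for a, b in zip(prods, prods[1:]))
```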
LEMMA 8.2. Let p, p_n, p_∞ be probability densities in L₁(μ) such that p_n → p_∞ as n → ∞, and let P be the probability measure with density p. Then lim inf_{n→∞} P log(p/p_n) ≥ P log(p/p_∞).

PROOF. If X_n = p_n/p and X_∞ = p_∞/p, then X_n → X_∞ in P-probability and in mean. We can write P log(p_n/p) as the sum of P(log X_n)1{X_n>1} and P(log X_n)1{X_n≤1}. Because 0 ≤ (log x)1{x>1} ≤ x, the sequence (log X_n)1{X_n>1} is dominated in absolute value by the sequence |X_n| and, hence, is uniformly integrable. By a suitable version of the dominated convergence theorem, we have P(log X_n)1{X_n>1} → P(log X_∞)1{X_∞>1}. Because the variables −(log X_n)1{X_n≤1} are nonnegative, we can apply Fatou's lemma to see that lim sup P(log X_n)1{X_n≤1} ≤ P(log X_∞)1{X_∞≤1}. Combining the two limits yields lim sup P log(p_n/p) ≤ P log(p_∞/p), which is the assertion. □
REFERENCES

[1] BERK, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect. Ann. Math. Statist. 37 51–58. [Corrigendum 37 745–746.] MR0189176
[2] BIRGÉ, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch. Verw. Gebiete 65 181–238. MR0722129
[3] BUNKE, O. and MILHAUD, X. (1998). Asymptotic behavior of Bayes estimates under possibly incorrect models. Ann. Statist. 26 617–644. MR1626075
[4] DIACONIS, P. and FREEDMAN, D. (1986). On the consistency of Bayes estimates (with discussion). Ann. Statist. 14 1–67. MR0829555
[5] DIACONIS, P. and FREEDMAN, D. (1986). On inconsistent Bayes estimates of location. Ann. Statist. 14 68–87. MR0829556
[6] FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230. MR0350949
[7] FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2 615–629. MR0438568
[8] GHOSAL, S., GHOSH, J. K. and VAN DER VAART, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531. MR1790007
[9] GHOSAL, S. and VAN DER VAART, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist. 29 1233–1263. MR1873329
[10] LE CAM, L. M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. MR0856411
[11] MEGGINSON, R. E. (1998). An Introduction to Banach Space Theory. Springer, New York. MR1650235
[12] PFANZAGL, J. (1988). Consistency of maximum likelihood estimators for certain nonparametric families, in particular: Mixtures. J. Statist. Plann. Inference 19 137–158. MR0944202
[13] SCHWARTZ, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4 10–26. MR0184378
[14] SHEN, X. and WASSERMAN, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29 687–714. MR1865337
[15] STRASSER, H. (1985). Mathematical Theory of Statistics. de Gruyter, Berlin. MR0812467
[16] WONG, W. H. and SHEN, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Statist. 23 339–362. MR1332570
DEPARTMENT OF MATHEMATICS
VRIJE UNIVERSITEIT
DE BOELELAAN 1081a
1081 HV AMSTERDAM
THE NETHERLANDS
E-MAIL: [email protected]
        [email protected]