Faculty of Science
Department of Mathematics

Limit Theorems for General Empirical Processes

Master thesis submitted in partial fulfillment of the requirements for the degree of Master in Mathematics

Gauthier Dierickx

Promotor: Prof. Dr. Uwe Einmahl

MAY 2012
Thanks.
I would like to thank first of all my parents for their support during my studies, and then my fellow students. I am also very grateful to all the professors who introduced me to the many exciting disciplines of modern mathematics and physics with an everlasting enthusiasm.
I would like to express my deepest gratitude to Prof. Dr. U. Einmahl for accepting me for a master thesis under his direction, for his unlimited patience and for the stimulating talks we had on different subjects. I am also thankful for his invitation to the biennial congress Deutsche Stochastiktage 2012 held at Mainz, which motivated me even more to explore the large field of probability theory and mathematical statistics.
Abstract.
This master thesis is about uniform limit theorems and its main goal is to present
a uniform strong law and a uniform weak convergence result for the empirical
process indexed by general classes of functions.
The topic of uniform versions of the classical limit theorems in probability
started in the 1930’s, when Glivenko and then Cantelli proved that the empirical
distribution function converges uniformly with probability one to the unknown
distribution function of the underlying random variables. This can be considered
as a uniform version of the strong law of large numbers. First versions of the
central limit theorem for the empirical process were obtained by Doob, Donsker,
Kolmogorov and Skorokhod around 1950. In the 1970’s a general theory was
started by Vapnik and Červonenkis, who introduced a new method suitable for treating
empirical processes indexed by general function classes. Dudley also made major
contributions to this theory. Later Giné and Zinn obtained further refinements.
More recently, a weak convergence theory for nonmeasurable mappings has
been developed, from which one can infer the limit results for empirical processes.
The purpose of this thesis is to give an account of this approach based on the
books by Dudley and van der Vaart and Wellner.
This version is not the definitive one, but a slightly adapted version, in which the main typos of the first version are corrected and some shortcuts of proofs are implemented, following the comments of the members of the jury during the presentation.
Contents

Thanks.
Abstract.
Introduction.
1 Empirical measures and processes.
  1.1 Definitions and problems.
  1.2 The classical cases.
    1.2.1 The Glivenko–Cantelli theorem.
    1.2.2 Donsker's theorem.
  1.3 The problems.
2 Weak Convergence
  2.1 Outer probability and expectation
  2.2 Perfect functions
  2.3 Convergence: almost uniformly, outer probability
  2.4 Convergence in law
  2.5 The (extended) Portmanteau Theorem
  2.6 Asymptotic tightness and measurability
  2.7 Spaces of bounded functions
3 Vapnik–Červonenkis classes.
  3.1 Introduction: definitions, fundamental lemma.
  3.2 Uniform bounds for packing number of VC class.
4 On measurability.
  4.1 Admissibility.
  4.2 Suslin image admissibility.
5 Uniform limit theorems.
  5.1 Entropy and Covering Numbers.
  5.2 A Symmetrization Lemma.
  5.3 Martingale property, Glivenko–Cantelli theorem.
  5.4 Pollard's Central Limit Theorem.
A Topology and Measure Theory.
  A.1 Metric and topological spaces.
    A.1.1 Definitions.
    A.1.2 Some important theorems.
  A.2 Measure Theory.
    A.2.1 Rings, algebras, σ–algebras and (outer) measures.
    A.2.2 (Sub)Martingales and reversed (sub)martingales.
B Gaussian Processes.
C More about Suslin / Analytic Sets.
  C.1 The Borel Isomorphism Theorem.
  C.2 Definitions and properties of Analytical Sets.
  C.3 Universal measurability of Analytic Sets.
  C.4 A selection theorem for Analytic Sets.
D Entropy and useful inequalities.
  D.1 Entropy.
  D.2 Exponential inequalities.
Introduction.
This master thesis deals with uniform limit theorems for empirical measures and processes, i.e. limit theorems for normalized finite sums like
\[ n^{-1}\sum_{i=1}^{n}\delta_{X_i}(\cdot) := n^{-1}\sum_{i=1}^{n} I_{(\cdot)}(X_i) \qquad\text{and}\qquad n^{1/2}\Bigl(n^{-1}\sum_{i=1}^{n}\delta_{X_i}(\cdot) - P(\cdot)\Bigr). \]
The name empirical refers to the fact that the measure and process are based upon the data, namely the X_i, which are supposed to be i.i.d. and to come from a given distribution F.
In the classical case, for a fixed subset A of S (some metric space, e.g. R), the empirical measure converges a.s. to P(A) by the law of large numbers, where P denotes the probability measure associated with the distribution function F. The CLT can also be applied to the empirical process and tells us that, for fixed A, the empirical process, which is then just a normalized binomial random variable, converges in distribution to some normal random variable.
In the case S = R and A_t := ]−∞, t], for t ∈ R, a theorem of Glivenko and Cantelli assures us that we have a.s. convergence of the empirical measure to P uniformly over all the sets A_t, t ∈ R. For the empirical process a similar theorem, proved by Donsker, is available. It says that we have convergence to the Brownian bridge Y in the space (D[0, 1], d_∞), the space of càdlàg functions equipped with the uniform topology;
\[ \| \sqrt{n}(F_n - F) - Y_n \|_\infty \to 0, \]
where the Y_n are Brownian bridges, F is the distribution function of the uniform distribution U[0, 1], and
\[ \sqrt{n}(F_n(t) - t) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \bigl( I_{[0,t]}(X_i) - t \bigr). \]
In other words, the empirical process converges uniformly over all closed intervals [0, t], t ∈ [0, 1], in distribution to a Brownian bridge. The main goal is
to present similar results, but for larger classes of functions, instead of classes of
sets and in general spaces.
In chapter one we will briefly give the general framework in which we will be
working, and go a bit further into details for the classical cases.
In order to achieve our goal, many problems have to be tackled: the class over which the supremum is taken could possibly be too big to still have measurability of the empirical measure and process. So we will need a general theory of convergence a.s., in probability and in distribution for non measurable functions. The development of this theory will be our main concern in the second chapter. Definitions of almost uniform convergence, convergence in outer probability and weak convergence for non measurable functions will be given. It will be seen that we can recover pretty much everything of the measurable case for non measurable functions: a general, extended portmanteau theorem will be proven together with conditions characterizing non measurable weak convergence.
The third chapter, a rather short one, will be about special classes of functions.
One way of measuring the size of classes of functions (and sets) will be presented,
and then Vapnik–Červonenkis classes will be defined and two important properties
about those classes will be proven. Some examples of VC classes are given at the
end.
Because in general measurability of the empirical measure and process is not satisfied, we will dedicate a chapter, the fourth one, to our investigation of a method which will imply measurability of the empirical measure and process.
The main chapter is where everything from the preceding chapters is combined to prove uniform limit theorems. The main theorem, a uniform central limit theorem for the empirical process due to David Pollard and extended by Richard Mansfield Dudley, is stated and proved. Thereafter two corollaries about weak convergence for special types of VC classes are shown.
Because empirical process theory makes use of topology, a large appendix on topology has been written, and some definitions and theorems of measure theory are recalled, sometimes with added proofs. A third part of the appendix is concerned with some properties of analytic sets, which play a crucial role in the development of conditions implying measurability of the empirical measure and the empirical process.
Chapter 1
Empirical measures and processes.
1.1 Definitions and problems.
In the following part, we will mainly be involved with "empirical measures" and "empirical processes". In order to define such objects, we need a sequence of i.i.d. random variables. One way to do so is to take a countable product of copies of a probability space (S, B, P) (called the sample space), namely (S^∞, B^∞, P^∞) with product σ–algebra and product measure, and let the coordinate projections onto (S, B, P) be the random variables. Defining X_1, X_2, · · · in this way will be called the standard model.
Now we are able to define empirical measures:
\[ P_n : S^\infty \times \mathcal{B} \to [0,1] : (s^\infty, A) \mapsto n^{-1}\sum_{i=1}^{n} \delta_{X_i(s^\infty)}(A), \]
where δ_x(A) := I_A(x). Defined this way, P_n is a probability measure on B, and actually on 2^S := P(S). The empirical process is defined as
\[ \nu_n : S^\infty \times \mathcal{B} \to [-\sqrt{n}, \sqrt{n}] : (s^\infty, A) \mapsto n^{1/2}\bigl(P_n(s^\infty, A) - P(A)\bigr). \]
If we fix a Borel set B, then ν_n(B), for each positive integer n, is a random variable on (S^∞, B^∞, P^∞). P(B) is fixed, so constant, and P_n(ω, B) is then a normalized binomial variable, being a sum of i.i.d. Bernoulli(P(B)) random variables, each of which records whether X_i lies in B or not. Therefore, letting n tend to infinity and applying the classical one–dimensional Central Limit Theorem A.2.21, we know that ν_n(B) converges in distribution to a normal random variable with mean zero and variance P(B)(1 − P(B)).
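As a quick numerical illustration of this fixed–set central limit statement, the following is a minimal simulation sketch (an addition, not part of the thesis; the choice S = [0, 1] with the uniform distribution and B = [0, 0.3] is only an assumption for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
p = 0.3  # P(B) for B = [0, 0.3] under the uniform distribution on [0, 1]

# reps independent samples of size n; nu_n(B) = sqrt(n) * (P_n(B) - P(B))
X = rng.uniform(size=(reps, n))
Pn_B = (X <= p).mean(axis=1)      # empirical measure of B for each sample
nu_n = np.sqrt(n) * (Pn_B - p)    # empirical process evaluated at the fixed set B

# the mean should be close to 0 and the variance close to P(B)(1 - P(B)) = 0.21
print(nu_n.mean(), nu_n.var(), p * (1 - p))
```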
For purposes later on, we will need a certain stochastic process, commonly called the Brownian bridge. This will appear as a limit distribution of empirical processes, and is an example of a Gaussian process. So we will study general Gaussian processes indexed by some classes of functions, and we will see that the Brownian bridge is an example of such a class–indexed Gaussian process.
Here
\[ G_P : (S^\infty, \mathcal{B}^\infty, P^\infty) \to \ell^\infty(\mathcal{F}), \]
where G_P(f), f ∈ F, is a one–dimensional Gaussian random variable with mean zero and variance Var(G_P(f)) := Var(f).
The covariance w.r.t. P also induces a pseudometric on L^2(P) as follows: ρ_P(f, g) := {E[(G_P(f) − G_P(g))^2]}^{1/2}. On L^2_0 := {f ∈ L^2 : ∫ f dP = 0} this pseudometric coincides with the usual one, since
\[ \rho_P(f, g) = \{E[(G_P(f - g))^2]\}^{1/2} = \mathrm{Var}(f - g)^{1/2} = E[(f - g)^2]^{1/2}. \]
1.2 The classical cases.
1.2.1 The Glivenko–Cantelli theorem.
Let At :=] − ∞, t], closed half lines.
Theorem 1.2.1 (Glivenko–Cantelli). Let X_1, X_2, · · · be i.i.d. random variables with common distribution function F. Then sup_x |F_n(x) − F(x)| → 0 a.s. as n → ∞, where
\[ F_n(x, \omega) := \frac{1}{n} \sum_{i=1}^{n} I_{]-\infty, x]}(X_i(\omega)). \]
I.e. ‖P_n − P‖_{{A_t : t ∈ R}} → 0 a.s.
Proof. We refer to [Bill1] chapter 4 theorem 20.6 on page 286 or [Dud1] theorem
11.4.2 on page 400 or even [Pol] pages 13–16 for a proof.
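The almost sure convergence can also be illustrated numerically. The following minimal sketch (an addition, not from the thesis) uses i.i.d. Uniform[0, 1] observations, for which sup_x |F_n(x) − F(x)| can be computed exactly from the order statistics:

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_deviation(n):
    """sup_x |F_n(x) - F(x)| for n i.i.d. Uniform[0,1] observations.

    With order statistics X_(1) <= ... <= X_(n) and F(x) = x, the supremum is
    attained at the jump points: max_i max(i/n - X_(i), X_(i) - (i-1)/n)."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - x, x - (i - 1) / n))

for n in [10, 100, 1000, 10_000, 100_000]:
    print(n, sup_deviation(n))  # the sup distance shrinks towards 0 as n grows
```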
1.2.2 Donsker's theorem.
Theorem 1.2.2 (Donsker). Let the X_i be i.i.d. random variables, uniformly distributed on [0, 1]. Let α_n be the n-th empirical process; then α_n converges weakly in D[0, 1], with the Skorokhod topology, to a certain Gaussian process, known as the Brownian bridge.
Recall that by definition we have weak convergence iff for every real–valued, bounded function G on D[0, 1] that is continuous w.r.t. the Skorokhod topology:
\[ \int G \, d\mu_n \to \int G \, dW^\circ, \]
where μ_n denotes the probability measure associated with the empirical process and W^◦ the one of the Brownian bridge.
Proof. We refer to [Bill2] chapter 3 theorem 14.3 on page 149 for a proof.
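Donsker's theorem, combined with the continuous mapping theorem, implies in particular that √n sup_t |F_n(t) − t| converges in distribution to sup_t |Y(t)| for a Brownian bridge Y. The following minimal simulation sketch (an addition, not from the thesis) shows that the empirical quantiles of this statistic stabilize as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_statistic(n):
    """sqrt(n) * sup_t |F_n(t) - t| for n i.i.d. Uniform[0,1] observations."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return np.sqrt(n) * np.max(np.maximum(i / n - x, x - (i - 1) / n))

for n in [50, 500, 5000]:
    sample = np.array([ks_statistic(n) for _ in range(2000)])
    # quantiles approach those of sup|Y| for a Brownian bridge Y (about 1.36 at the 95% level)
    print(n, np.quantile(sample, [0.5, 0.9, 0.95]).round(3))
```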
In Billingsley’s classical book [Bill2], the Skorokhod topology is defined at
the beginning of chapter 3. This is done, since actually D with the uniform metric
is not separable; it actually contains an uncountable discrete set, which causes
measurability problems. In the original proof of Donsker there was thus a mistake,
due to the nonmeasurability: one cannot associate a probability measure with a
nonmeasurable function. But this problem was taken care of by Skorokhod, who
introduced a new metric on D, which turns D into a Polish space, i.e. a separable
complete metric space.
As seen in [Bill2] chapter 3 section 15, changing the metric of D amounts in fact to reducing the Borel σ–algebra of D for the uniform topology to the ball σ–algebra of that same uniform topology. The ball σ–algebra is the σ–algebra generated by the open balls, and in non separable spaces the two σ–algebras can differ. The relation between the uniform topology and the Skorokhod topology in terms of σ–algebras is as follows: the ball σ–algebra of the uniform topology equals the Borel σ–algebra of the Skorokhod topology.
1.3 The problems.
The problem with empirical processes in non separable metric spaces is that the σ–algebra (on the codomain) is often too large to allow the process to be Borel measurable, and thus G ∘ ν_n need not be measurable for every real–valued, continuous and bounded function G. This is exactly what happens in Donsker's (classical) theorem, since D[0, 1] with the uniform metric is non separable. We refer to [Bill2] chapter 3 section 15 for a proof of the non separability of (D, ‖·‖_∞) and for the non measurability of the empirical process as a stochastic process in (D, ‖·‖_∞).
As mentioned after Donsker’s theorem in the previous section, one can take
care of this non measurability, by introducing a new metric on D[0, 1]. This is
the same as considering the empirical process as a random variable in D w.r.t.
the ball σ–algebra for the uniform topology. So one could think of developing
a theory of weak convergence for that specific (ball) σ–algebra, restricting the
real–valued bounded continuous G to be measurable w.r.t. the ball σ–algebra, as
done in [Bill2] chapter 1 section 6.
But it turns out that for general classes of functions the empirical process indexed by such classes is not even measurable for the ball σ–algebra of the uniform topology. So one actually needs a new approach to what weak convergence means and how it can be defined (such that it is consistent with the theory of weak convergence for measurable random variables). One can then address the problem for which classes we still have Donsker–type results, i.e. central limit theorems for stochastic processes in the space of uniformly bounded real–valued functions with the uniform norm, for those general index classes.
In particular we need a new definition for convergence in law; the integral of
a nonmeasurable function is not defined. To achieve this, we will use the concept
of upper integral; that will be one of the subjects of the next chapter.
Chapter 2
Weak convergence for non-measurable functions
2.1 Outer probability and expectation
Let (X , A, P ) be a probability space and set R := [−∞, ∞].
Definition 2.1.1. Let g : X → R be a (not necessarily measurable) function. The outer expectation (or the upper P–integral) of g is defined as
\[ E_P^*[g] := \int^* g \, dP := \inf\bigl\{ E_P[h] : h \ge g,\ h : \mathcal{X} \to R,\ h \ P\text{–semi–integrable} \bigr\}. \]
Recall that a function h : X → R is called P–semi–integrable if it is A–measurable and at least one of the integrals ∫ h^+ dP, ∫ h^− dP is finite.
Similarly, we can define an outer measure P^* : 2^X (:= P(X)) → [0, 1] by setting P^*(B) = inf{P(A) : A ⊃ B, A ∈ A}, for B ⊂ X.
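As a concrete illustration of these definitions (an added toy example, not from the original text), consider the finite probability space
\[ \mathcal{X} = \{a, b, c\}, \qquad \mathcal{A} = \{\emptyset, \{a\}, \{b, c\}, \mathcal{X}\}, \qquad P(\{a\}) = P(\{b, c\}) = \tfrac{1}{2}. \]
The set B = {b} is not measurable, and the only measurable sets containing it are {b, c} and X, so
\[ P^*(B) = \inf\{P(A) : A \supset B,\ A \in \mathcal{A}\} = P(\{b, c\}) = \tfrac{1}{2}; \]
the set {b, c} plays the role of a measurable cover of B in the sense of the next lemma.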
Lemma 2.1.1. Given B ⊂ X , we can find a set A ∈ A, A ⊃ B such that
P ∗ (B) = P (A).
Moreover, this set is P–almost surely unique, i.e. if A_1, A_2 ⊃ B are two sets with P^*(B) = P(A_1) = P(A_2), we have P(A_1 ∆ A_2) = 0. We call any set A ∈ A with B ⊂ A and P^*(B) = P(A) a measurable cover of B.
Proof. Choose a sequence B ⊂ A_n ∈ A such that P(A_n) → P^*(B). Then clearly, by monotonicity,
\[ P^*(B) \le P\Bigl(\bigcap_{n=1}^{\infty} A_n\Bigr) \le \inf_{n \ge 1} P(A_n) = P^*(B), \]
which shows that \bigcap_{n=1}^{\infty} A_n is a measurable cover.
If A1 , A2 are measurable covers of B we obtain again by monotonicity that A1 ∩A2
is a measurable cover of B and hence that P (A1 ∩ A2 ) = P (A1 ) = P (A2 )
which implies that P (A1 ∆A2 ) = 0. Indeed A1 is the disjoint union of A1 \A2
and A1 ∩ A2 . Hence P (A1 \A2 ) = 0, similarly P (A2 \A1 ) = 0 and recall that the
symmetric difference is defined as A1 ∆A2 := (A1 \A2 ) ∪ (A2 \A1 ).
We next show that there are also measurable cover functions for the outer
integrals. We need some additional notation. Let L0 = L0 (X , A, P, R) be the set
of all Borel measurable functions f : X → R.
Definition 2.1.2. If J ⊂ L0 (X , A, P, R), a function f ∈ L0 is called an essential
infimum of J , if
i) f ≤ j P –a.s. for all j ∈ J ;
ii) f ≥ g P –a.s. for all measurable g satisfying: g ≤ j for all j ∈ J .
From this definition it immediately follows that the essential infimum is unique, P–almost surely, provided that it exists; existence will be shown in the following theorem.
Theorem 2.1.2. Let (X , A, P ) be a probability space and J ⊂ L0 (X , A, P, R),
then an essential infimum of J exists.
Proof. W.l.o.g. we assume that J ≠ ∅. Define
J1 = {min(f1 , · · · , fn ) : fi ∈ J , i = 1, · · · , n, n ∈ N}
and set c = inf{E[arctan(j)] : j ∈ J1 }. Clearly, c ∈ [−π/2, π/2]. Choose then a
sequence jn so that E[arctan(jn )] → c. Set hn = min(j1 , · · · , jn ). Since hn ∈ J1
we still have E[arctan(hn )] → c. The sequence arctan(hn ) is monotone, thus it
converges to a measurable function H = arctan(h). By bounded convergence
(theorem A.2.19) we have E[arctan(h)] = c. Take an arbitrary j ∈ J . Then
hn ∧ j ∈ J1 and hn ∧ j → h ∧ j. Using again bounded convergence we see that
c ≤ E[arctan(hn ∧ j)] → E[arctan(h ∧ j)] ≤ E[arctan(h)] = c.
Thus E[arctan(h ∧ j)] = E[arctan(h)] which implies that h ∧ j = h or h ≤ j
P –almost surely.
If g is another A–measurable function satisfying g ≤ j, j ∈ J , we also have
g ≤ hn , n ≥ 1 which trivially implies that g ≤ limn→∞ hn = h a.s. Thus, h is an
essential infimum for J .
Consider the function class J = {j ∈ L0 : j ≥ f everywhere }. From the
above proof it is then clear that we can choose a version of the essential infimum
which we will denote in the sequel by f ∗ so that f ∗ ≥ f everywhere.
It is easy to see that if E ∗ [f ] < ∞, f ∗ is P –semi–integrable and E ∗ [f ] = E[f ∗ ].
Lemma 2.1.3. We have for all functions f, g : X → ]−∞, ∞]
i) (f + g)^* ≤ f^* + g^* a.s.,
ii) (f − g)^* ≥ f^* − g^* a.s., whenever both sides are defined.
Proof.
i) The RHS is well–defined, because −∞ < f^*, g^* ≤ ∞ everywhere; it is also measurable and ≥ f + g everywhere by definition, hence (f + g)^* ≤ f^* + g^* a.s.
ii) On the set where g^* = +∞, f^* is finite a.s. by assumption. So on that set f^* − g^* = −∞ and thus ≤ (f − g)^*. On the set where g^* is finite, g is too, and then f = (f − g) + g. So by part (i): f^* ≤ (f − g)^* + g^* a.s., and then f^* − g^* ≤ (f − g)^* a.s.
Lemma 2.1.4. Let V be a vector space equipped with a seminorm ‖·‖. For any pair of functions X, Y : X → V:
i) ‖X + Y‖^* ≤ (‖X‖ + ‖Y‖)^* ≤ ‖X‖^* + ‖Y‖^* a.s.;
ii) ‖cX‖^* = |c| ‖X‖^* a.s., c ∈ R.
Proof. Because of the triangle inequality and the definition of measurable cover, the first inequality in (i) follows, whereas the second inequality in (i) is a consequence of the previous lemma (2.1.3).
For c = 0 the second assertion is trivial, so consider c ∈ R_0. Clearly ‖cX‖^* ≤ |c| ‖X‖^* a.s., because the RHS is measurable and ‖cX‖ = |c| ‖X‖ ≤ |c| ‖X‖^*, while by definition ‖cX‖^* ≤ h a.s. for each h ∈ L^0 with h ≥ ‖cX‖. The converse inequality holds too: ‖cX‖^* ≥ ‖cX‖, so ‖cX‖^*/|c| ≥ ‖X‖, and since ‖cX‖^*/|c| is measurable, ‖cX‖^*/|c| ≥ ‖X‖^* a.s.
In order to distribute the star over a product or a sum, one needs independence.
Lemma 2.1.5. Let (X_j, A_j, P_j), j = 1, · · · , n, n ∈ N, be probability spaces and let f_j be functions from X_j into R.
i) Suppose either f_j ≥ 0 for j = 1, · · · , n, or f_1 ≡ 1 and n = 2. Then on \prod_{j=1}^{n} (X_j, A_j, P_j), with x := (x_1, · · · , x_n) and f(x) = \prod_{j=1}^{n} f_j(x_j), we have f^*(x) = \prod_{j=1}^{n} f_j^*(x_j) a.s. (as usual 0 · ∞ := 0).
ii) If f_j > −∞ for all x_j, j = 1, · · · , n, and g(x) := \sum_{j=1}^{n} f_j(x_j), then g^*((x_1, · · · , x_n)) = \sum_{j=1}^{n} f_j^*(x_j) a.s.
Proof. We start with the proof for the sum. By induction it is enough to consider the case n = 2. By lemma 2.1.3, g^* ≤ f_1^*(x_1) + f_2^*(x_2) a.s.
Now we continue by contradiction, so assume that on a set C ⊂ X_1 × X_2 of strictly positive probability we have strict inequality. Then
\[ C = C \cap \bigcup_{q \in \mathbb{Q}} C^q, \qquad C^q := \{(x, y) : g^*(x, y) < q < f_1^*(x) + f_2^*(y)\} \]
(note that C^q is indeed a measurable set). Then P(C) ≤ \sum_{q∈Q} P(C^q ∩ C), so for at least one q we have P(C^q ∩ C) > 0. Denote that q by t. Thus there is a rational t such that g^*(x, y) < t < f_1^*(x) + f_2^*(y) for all (x, y) ∈ C^t.
Now consider h : Q × Q → Q : (q, p) ↦ q + p, and define
\[ Q_t := \{(q, p) \in \mathbb{Q} \times \mathbb{Q} : h(q, p) \in\, ]t, \infty[\,\} \quad\text{and}\quad C^{q,p} := \{(x, y) : f_1^*(x) > q,\ f_2^*(y) > p\}. \]
As above, C^t ⊂ \bigcup_{(q,p)∈Q_t} C^{q,p}, and because 0 < P(C^t) ≤ \sum_{(q,p)∈Q_t} P(C^{q,p} ∩ C^t), there exist rationals q, r with q + r > t such that q < f_1^*(x), r < f_2^*(y) for all (x, y) ∈ C := C^{q,r} ∩ C^t, and P(C) > 0.
Consider the section C_x := {y ∈ X_2 : (x, y) ∈ C}. By (the proof of) the Tonelli–Fubini theorem (A.2.17) there is a set D_1 ⊂ X_1 with P_1(D_1) > 0 such that P_2(C_x) > 0 for all x ∈ D_1. If f_1(x) ≤ q for all x ∈ D_1, then f_1^* ≤ q a.s. on D_1; but for any x ∈ D_1 and y ∈ C_x ≠ ∅ we have f_1^*(x) > q, a contradiction. So f_1(x) > q for some x ∈ D_1; fix that x. Then for any y ∈ C_x: q + f_2(y) < f_1(x) + f_2(y) ≤ g^*(x, y). Thus for all y ∈ C_x: f_2(y) < g^*(x, y) − q, and hence f_2^*(y) ≤ g^*(x, y) − q for almost all y ∈ C_x. For any such y: q + f_2^*(y) ≤ g^*(x, y) < t < q + r, so f_2^*(y) < r, a contradiction.
Now we treat the case of the products. It is clear from the definition that f^*(x) ≤ \prod_{j=1}^{n} f_j^*(x_j) a.s., and also 1^* ≡ 1. As before, it is enough to consider the case n = 2 for the converse inequality. Suppose that on a set of strictly positive probability one has f^*((x, y)) < f_1^*(x) f_2^*(y). As before, for some rational r, on a set of strictly positive probability: f^*(x, y) < r < f_1^*(x) f_2^*(y).
If f_1 ≡ 1, we have f(x, y) ≤ f^*(x, y) < r < f_2^*(y) on some set of strictly positive probability. Then by Tonelli–Fubini (A.2.17), for some x: f_2(y) ≤ f^*(x, y) < r < f_2^*(y) on a set of y of strictly positive probability, contradicting the choice of f_2^*.
If f_1 ≥ 0 and f_2 ≥ 0, then as before, on a set C ⊂ X_1 × X_2 of strictly positive probability and for some rationals a, b with ab > r, f_1^*(x) > a > 0, f_2^*(y) > b > 0, we have f^*(x, y) < r < ab < f_1^*(x) f_2^*(y). Repeating the argument above, there is a set D_1 ⊂ X_1 with P_1(D_1) > 0 and P_2(C_u) > 0 for every u ∈ D_1, where f_1(u) > a. Then for any v ∈ C_u: f_2(v) ≤ f^*(u, v)/a, and so f_2^*(v) ≤ f^*(u, v)/a for almost all v ∈ C_u. For such a v the following holds: a f_2^*(v) < ab, so f_2^*(v) < b, again a contradiction.
Lemma 2.1.6. Let f : X → R be a function. Then for any real number t:
i) P^*(f > t) = P(f^* > t);
ii) for all ε > 0: P^*(f ≥ t) ≤ P(f^* ≥ t) ≤ P^*(f > t − ε).
Proof.
i) Because f ≤ f^* implies {f > t} ⊂ {f^* > t} and the latter is measurable, by definition of the outer measure P^*(f > t) ≤ P(f^* > t).
Now let A be a measurable cover of {f > t}; since we can always redefine A := A ∩ {f^* > t}, w.l.o.g. one may suppose A ⊂ {f^* > t}.
If P(A) < P(f^* > t), or in other words if P(f^* > t, A^c) > 0, then define g := f^* I_A + t I_{A^c}. By construction g ≥ f, but on {f^* > t} ∩ A^c (a set of strictly positive probability) f^*(ω) > t I_{A^c}(ω) = g(ω). Because g is measurable, this contradicts the definition of f^*. Hence P(A) ≥ P(f^* > t).
ii) As in (i), {f ≥ t} ⊂ {f^* ≥ t}, implying P^*(f ≥ t) ≤ P(f^* ≥ t). If 0 < δ ≤ ε, then using the result from (i) we get
\[ P(f^* > t - \delta) = P^*(f > t - \delta) \le P^*(f > t - \varepsilon). \]
Because of the continuity from above of P,
\[ P(f^* \ge t) = \lim_{\delta \to 0} P(f^* > t - \delta) \le P^*(f > t - \varepsilon). \]
Let (X, A, P) be a probability space. Then for any function f on X into R, let
\[ \int_* f \, dP := \sup\Bigl\{ \int h \, dP : h \ P\text{-semi-integrable and } h \le f \Bigr\}. \]
It is easy to see that \int_* f \, dP = - \int^* (-f) \, dP.
We can also define f_* as the essential supremum of all measurable functions g ≤ f. Then, in exactly the same way as is seen for f^*, f_* is P–a.s. unique, and whenever one of both exists, \int_* f \, dP = \int f_* \, dP. The relation between essential supremum and essential infimum is −f_* = (−f)^*.
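Continuing the toy example given after Definition 2.1.1 (again an added illustration, not from the original text): for B = {b} and f = I_B one finds f^* = I_{\{b,c\}} and f_* = 0 P–a.s., so
\[ \int^* I_B \, dP = P(\{b, c\}) = \tfrac{1}{2}, \qquad \int_* I_B \, dP = 0, \]
which also shows that the upper and lower integrals of a non-measurable function can differ.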
2.2 Perfect functions
For a function g : X_1 → X_2, define g[A] := {g(x) : x ∈ A}, for A ⊂ X_1. In this part we will investigate which conditions a measurable function g needs to satisfy so that for each real–valued f: (f ∘ g)^* = f^* ∘ g. It will turn out later that such functions g are useful.
Theorem 2.2.1. Let (X_1, A, P) be a probability space, (X_2, B) a measurable space, and g : X_1 → X_2 a measurable function. Let Q := P ∘ g^{−1} be the image measure through g on B. For any real–valued function f on X_2, define f^* w.r.t. Q. Then the following are equivalent:
a) For any A ∈ A there is a B ∈ B with B ⊂ g[A] and Q(B) ≥ P(A);
b) for any A ∈ A with P(A) > 0, there is a B ∈ B with B ⊂ g[A] and Q(B) > 0;
c) for every real–valued function f on X_2, (f ∘ g)^* = f^* ∘ g P–a.s.;
d) for any D ⊂ X_2, (I_D ∘ g)^* = I_D^* ∘ g P–a.s.
Proof.
a) ⇒ b) is trivial.
b) ⇒ c) Note that (f ∘ g)^* ≤ f^* ∘ g always holds. So suppose (f ∘ g)^* < f^* ∘ g on some set of strictly positive probability. Note that
\[ \{f^* \circ g > (f \circ g)^*\} = \bigcup_{q \in \mathbb{Q}} \bigl( \{f^* \circ g > q\} \cap \{q > (f \circ g)^*\} \bigr), \]
and thus, as in the proof of lemma 2.1.5, for some rational r: (f ∘ g)^* < r < f^* ∘ g on a set A ∈ A with P(A) > 0. Using (b) we find a B ∈ B with B ⊂ g[A] and Q(B) > 0. Then f ∘ g < r on A implies f < r on B, and so f^* ≤ r on B a.s. But f^* > r on g[A] ⊃ B, a contradiction.
c) ⇒ d) Again this is trivial.
d) ⇒ a) Take a set A ∈ A, and define D := X_2\g[A]. Then there is some set C ∈ B with I_D^* = I_C. Since I_D^* ≥ I_D, D ⊂ C, and I_D^* ∘ g = (I_D ∘ g)^* = 0 a.s. on A (if not, this would contradict our choice of I_D^*). Let B := X_2\C ∈ B. Then B ⊂ g[A] and
\[ Q(B) = 1 - \int I_C \, dQ = 1 - \int I_C \circ g \, dP = 1 - \int I_D^* \circ g \, dP = 1 - \int (I_D \circ g)^* \, dP \ge P(A), \]
where the second equality is the image measure theorem, and the last inequality holds since (I_D ∘ g)^* = 0 a.s. on A and (I_D ∘ g)^* ≤ 1 everywhere.
Definition 2.2.1. A function g satisfying one, and hence by the previous theorem all four, of the conditions above will be called perfect or P–perfect.
Next, we show that on product spaces the projections are perfect.
Proposition 2.2.2. Suppose X = X_1 × X_2, P is a product probability ν × µ on A = A_1 ⊗ A_2 and g is the natural projection from X onto X_2. Then g is perfect.
Proof. Note that P ∘ g^{−1} = µ.
For any B ⊂ X let B_{x_2} := {x_1 : (x_1, x_2) ∈ B}, x_2 ∈ X_2, be a section of B. If B is measurable, define C := {x_2 : ν(B_{x_2}) > 0}. This set belongs to A_2, since x_2 ↦ ν(B_{x_2}) is A_2-measurable by Tonelli–Fubini A.2.17. Clearly C ⊂ g[B] and µ(C) ≥ ∫ ν(B_{x_2}) dµ(x_2) = P(B). So by condition (a) of the previous theorem, i.e. theorem 2.2.1, g is perfect.
This proof is symmetric in the arguments, so the projection onto the first coordinate is perfect too. Moreover one can easily consider a countable product space
without changing the argument.
2.3 Convergence almost uniformly and in outer probability
Definition 2.3.1. Let (Ω, A, P) be a probability space, (S, d) a metric space and (f_n)_{n∈N}, f_0 functions from Ω into S.
(i) f_n is said to converge to f_0 in outer probability iff d(f_n, f_0)^* → 0 in probability, or equivalently P^*{d(f_n, f_0) > ε} → 0 as n → ∞ for every ε > 0.
(ii) f_n is said to converge almost uniformly to f_0 iff d(f_n, f_0)^* → 0 P–almost surely.
The following result gives a characterization of almost uniform convergence.
Proposition 2.3.1. Let (Ω, A, P) be a probability space, (S, d) a metric space and f_0, f_n, n ∈ N, functions from Ω into S. Then the following are equivalent:
A) f_n → f_0 almost uniformly;
B) there exist measurable h_n ≥ d(f_n, f_0) with h_n → 0 a.s.;
C) for any ε > 0: P^*{sup_{n≥m} d(f_n, f_0) > ε} ↓ 0 as m → ∞;
D) for any δ > 0 there is some B ∈ A with P(B) ≥ 1 − δ such that f_n → f_0 uniformly on B.
Proof. Clearly from the definition of almost uniform convergence we have that (A) and (B) are equivalent.
A) ⇒ C) From the definition of essential infimum one immediately gets d(f_n, f_0) ≤ d(f_n, f_0)^*, and thus (sup_{n≥m} d(f_n, f_0))^* ≤ sup_{n≥m} d(f_n, f_0)^*. Consequently, for ε > 0,
\[ P^*\{\sup_{n\ge m} d(f_n, f_0) > \varepsilon\} = P\{(\sup_{n\ge m} d(f_n, f_0))^* > \varepsilon\} \le P\{\sup_{n\ge m} d(f_n, f_0)^* > \varepsilon\} \to 0 \quad (m \to \infty). \]
C) ⇒ D) Take for k = 1, 2, · · · the sets
\[ C_k := \bigl\{ \sup_{n \ge m(k)} d(f_n, f_0) > 1/k \bigr\}. \]
By (C), for m(k) large enough, P^*(C_k) < 2^{-k}. If we take measurable covers B_k for C_k with P(B_k) < 2^{-k}, then on A_r := ∩_{k≥r} B_k^c we have that f_n → f_0 uniformly, and P(A_r) ≥ 1 − 2^{-r+1}.
D) ⇒ A) Assuming (D) we obtain sets B_k with P(B_k) ↑ 1 and f_n → f_0 uniformly on B_k. Set C_k := ∪_{j=1}^{k} B_j, so that C_1 ⊂ C_2 ⊂ · · · and P(C_k) ↑ 1. Now for m(k) large enough, d(f_n, f_0) < 1/k on C_k for all n ≥ m(k). Then also d(f_n, f_0)^* ≤ 1/k a.s. on C_k for n ≥ m(k), and so d(f_n, f_0)^* → 0 a.s.
Part (D) is the same as what is usually called "Egorov's theorem" (for almost surely convergent sequences of measurable functions) in the literature.
Like for measurable functions and convergence in probability, continuous functions preserve convergence in outer probability.
Proposition 2.3.2. Let (S, d) and (Y, e) be metric spaces and (Ω, A, P ) a probability space. Let fn be functions from Ω into S for n = 0, 1, · · · such that fn → f0
in outer probability as n → ∞. Assume that f0 has separable range and is Borel
measurable. Let g be continuous from S into Y . Then g(fn ) → g(f0 ) in outer
probability.
Proof. Given ε > 0 and k = 1, 2, · · · define
\[ B_k := \{x \in S : d(x, y) < 1/k \text{ implies } e(g(x), g(y)) \le \varepsilon, \text{ for all } y \in S\}. \]
We claim that B_k is closed. To see this, take a sequence x_n in B_k converging to some x. If d(x, y) < 1/k we also have d(x_m, y) < 1/k for m ≥ m_k (k is fixed!), and by definition of B_k it follows that e(g(x_m), g(y)) ≤ ε, m ≥ m_k. By continuity of g, g(x_m) → g(x), and for m ≥ m_k, ε ≥ e(g(x_m), g(y)) → e(g(x), g(y)), whence x ∈ B_k. Secondly, note that B_k ↑ S as k → ∞ by the continuity of g, which implies that f_0^{-1}(B_k) ↑ Ω.
Pick k large enough such that P(f_0^{-1}(B_k)) > 1 − ε. Now it follows from the definition of B_k that {e(g(f_n), g(f_0)) > ε} ∩ f_0^{-1}(B_k) ⊂ {d(f_n, f_0) ≥ 1/k}. Hence
\[ P^*\{e(g(f_n), g(f_0)) > \varepsilon\} \le P^*\bigl(\{e(g(f_n), g(f_0)) > \varepsilon\} \cap f_0^{-1}(B_k)\bigr) + P^*\bigl(\{e(g(f_n), g(f_0)) > \varepsilon\} \cap (f_0^{-1}(B_k))^c\bigr) \]
\[ \le P^*\{d(f_n, f_0) \ge 1/k\} + P\{(f_0^{-1}(B_k))^c\} < P^*\{d(f_n, f_0) \ge 1/k\} + \varepsilon < 2\varepsilon \]
for n large enough.
Lemma 2.3.3. Let (Ω, A, P) be a probability space and {g_n}_{n=0}^{∞} a sequence of uniformly bounded real–valued functions on Ω such that g_0 is measurable. If g_n → g_0 in outer probability, then lim sup_{n→∞} ∫^* g_n dP ≤ ∫ g_0 dP.
Proof. Let |g_n(x)| ≤ M < ∞ for all n ∈ N and x ∈ Ω. By replacing g_n with g_n/M we can and do assume that M = 1. Given ε > 0, we have for n large enough
\[ P^*(|g_n - g_0| > \varepsilon) = P(A_n) < \varepsilon, \qquad \text{where } A_n := \{|g_n - g_0|^* > \varepsilon\}. \]
Then
\[ g_n^* \le |g_n - g_0|^* + g_0 \le |g_n - g_0|^* I_{A_n} + |g_n - g_0|^* I_{A_n^c} + g_0 \le 2 I_{A_n} + \varepsilon + g_0. \]
It follows that for any ε > 0,
\[ \int^* g_n \, dP = \int g_n^* \, dP \le 2 P(A_n) + \varepsilon + \int g_0 \, dP \le 3\varepsilon + \int g_0 \, dP, \]
and the lemma has been proven.
2.4 Convergence in law
Now we are able to give a definition for convergence of laws, where only the limit has to be measurable. In this section we assume that (S, d) is a metric space.
Definition 2.4.1 (J. Hoffmann–Jørgensen). Let (Ω_n, A_n, P_n), n = 0, 1, · · · , be probability spaces. Consider further a sequence Y_n : Ω_n → S, n ≥ 0. Suppose that the range of Y_0 is included in some separable subset of S and that Y_0 is Borel measurable with respect to the Borel sets on its range. Then, for n → ∞, Y_n will be said to converge in law to Y_0, denoted Y_n ⇒ Y_0, if for every bounded and continuous real–valued function g on S,
\[ \int^* g(Y_n) \, dP_n \to \int g(Y_0) \, dP_0. \]
Remark. Note that an equivalent condition could be given in terms of the lower integrals ∫_* g(Y_n) dP_n, as we have −∫_* g(Y_n) dP_n = ∫^* (−g(Y_n)) dP_n, n ≥ 1.
Similarly as in the classical case we have that convergence in outer probability
implies convergence in law.
Theorem 2.4.1. Let Y_n : Ω → S be a sequence such that Y_n → Y_0 in outer probability. If Y_0 is measurable with separable range, then Y_n ⇒ Y_0.
Proof. Let G be a bounded, continuous function from S into R. Applying proposition 2.3.2 to G, we obtain that G(Y_n) → G(Y_0) in outer probability. The same holds for −G. Then lemma 2.3.3 tells us that
\[ \limsup_{n\to\infty} \int^* G(Y_n) \, dP \le \int G(Y_0) \, dP \qquad\text{and}\qquad \limsup_{n\to\infty} \int^* (-G(Y_n)) \, dP \le \int (-G(Y_0)) \, dP. \]
Rewriting the second statement gives us
\[ \limsup_{n\to\infty} \Bigl( - \int_* G(Y_n) \, dP \Bigr) \le - \int G(Y_0) \, dP, \qquad\text{i.e.}\qquad \liminf_{n\to\infty} \int_* G(Y_n) \, dP \ge \int G(Y_0) \, dP. \]
Noticing that trivially \liminf_{n\to\infty} \int_* G(Y_n) \, dP \le \liminf_{n\to\infty} \int^* G(Y_n) \, dP, we obtain
\[ \int^* G(Y_n) \, dP \to \int G(Y_0) \, dP, \]
i.e. convergence in law.
Here is a theorem about convergence where perfect functions are used. With the aid of those functions, one is able to prove that convergence in outer probability implies convergence in distribution. Note that the domain of the functions f_n is here only a measurable space; no measure is yet defined, which is different from theorem 2.4.1.
Theorem 2.4.2. Let, for n ≥ 0, (Y_n, B_n) be measurable spaces, g_n : Ω → Y_n be perfect A, B_n-measurable mappings and f_n : Y_n → S. Suppose also that the range of f_0 is separable and that f_0 is Borel measurable. Let Q_n := P ∘ g_n^{−1} on B_n. If f_n ∘ g_n → f_0 ∘ g_0 in outer probability as n → ∞, then f_n ⇒ f_0 as n → ∞ for f_n on (Y_n, B_n, Q_n).
Proof. By Theorem 2.4.1, f_n ∘ g_n ⇒ f_0 ∘ g_0. Let H be a bounded, continuous, real–valued function on S. Writing out the definition gives
\[ \int^* H(f_n(g_n)) \, dP \to \int H(f_0(g_0)) \, dP = \int H(f_0) \, dQ_0, \]
where the last equality follows from the image measure theorem. The following equalities will finish the proof:
\[ \int^* H(f_n(g_n)) \, dP = \int H(f_n(g_n))^* \, dP = \int H(f_n)^* \circ g_n \, dP = \int H(f_n)^* \, dQ_n = \int^* H(f_n) \, dQ_n. \]
The first step holds because of the definition of upper integral and essential infimum, the second and third because g_n is perfect (and thus also measurable), and the last one is as the first equality. Combining the two facts, we get
\[ \int^* H(f_n) \, dQ_n = \int^* H(f_n(g_n)) \, dP \to \int H(f_0(g_0)) \, dP = \int H(f_0) \, dQ_0, \]
as claimed.
2.5 The (extended) Portmanteau Theorem
In the previous section we defined the outer measure and expectation of any set or function. We also defined weak convergence for non measurable maps. This was motivated by the definition of weak convergence for the usual case where one considers (sequences of) measurable functions. In the classical case we can give an intuitive explanation of what weak convergence is: by the portmanteau theorem, weak convergence of a sequence of probability measures is the same as convergence of the measures for some, thus not necessarily all, Borel sets. Like in the classical case there is also a portmanteau theorem for (a sequence of) non measurable functions.
Theorem 2.5.1 ((Extended) portmanteau theorem). Let (S, d) be a metric space. For n = 0, 1, 2, · · · let (X_n, A_n, Q_n) be a probability space and f_n : X_n → S a mapping. Suppose that f_0 has separable range S_0 and is measurable. Let P_0 := Q_0 ∘ f_0^{−1}. Then the following are equivalent:
a) f_n ⇒ f_0;
b) lim sup_{n→∞} E^*_{Q_n}[G(f_n)] ≤ ∫ G dP_0 for each bounded continuous / Lipschitz real–valued function G on S;
c) E^*_{Q_n}[G(f_n)] → ∫ G dP_0 for each bounded real–valued Lipschitz function G on S;
d) for any closed F ⊂ S, P_0(F) ≥ lim sup_{n→∞} Q_n^*({f_n ∈ F});
e) for any open U ⊂ S, P_0(U) ≤ lim inf_{n→∞} (Q_n)_*({f_n ∈ U});
f) for any continuity set A ⊂ S of P_0, i.e. P_0(∂A) = 0, Q_n^*({f_n ∈ A}) → P_0(A) and (Q_n)_*({f_n ∈ A}) → P_0(A).
Before starting we briefly recall the definition of weak convergence in the general (i.e. not necessarily measurable) setting, see also definition 2.4.1. So f_n ⇒ f_0 iff for each G ∈ C_b(S): ∫^* G(f_n) dQ_n → ∫ G(f_0) dQ_0, or equivalently iff ∫_* G(f_n) dQ_n → ∫ G(f_0) dQ_0 for all G ∈ C_b(S).
Proof. a) ⇒ b) trivial.
b) ⇒ c) Considering −G instead of G, we find that
\[ \int (-G(f_0)) \, dQ_0 \ge \limsup_{n\to\infty} \int^* (-G(f_n)) \, dQ_n, \qquad\text{i.e.}\qquad - \int G(f_0) \, dQ_0 \ge - \liminf_{n\to\infty} \int_* G(f_n) \, dQ_n. \]
Next note that
\[ \limsup_{n\to\infty} E^*[G(f_n)] \le E[G(f_0)] \le \liminf_{n\to\infty} E_*[G(f_n)] \le \liminf_{n\to\infty} E^*[G(f_n)]. \]
Thus (b) implies (c).
c) ⇒ d) Let F be any closed set of (S, d). Let {g_k}_{k≥1} be a sequence of Lipschitz functions with I_F ≤ g_k and g_k ↓ I_F, e.g. g_k(x) := max(1 − k d(x, F), 0). Such functions satisfy g_k ≥ I_F and are obviously non-increasing in k; they are Lipschitz (lemma A.1.13) and bounded by 1.
From (c) (or also (b)), for every k:
\[ \limsup_{n\to\infty} \int^* g_k \circ f_n \, dQ_n \le \int g_k \, dP_0. \]
As g_k ≥ I_F, for every k we have
\[ \int^* g_k \circ f_n \, dQ_n \ge \int^* I_F \circ f_n \, dQ_n = Q_n^*(f_n \in F). \]
First taking the lim sup on both sides:
\[ \limsup_{n\to\infty} Q_n^*(f_n \in F) \le \limsup_{n\to\infty} \int^* g_k \circ f_n \, dQ_n \le \int g_k \, dP_0, \]
and then letting k → ∞, we finally get
\[ P_0(F) = \lim_{k\to\infty} \int g_k \, dP_0 \ge \limsup_{n\to\infty} Q_n^*(f_n \in F). \]
d) ⇒ e) For U ⊂ S open, let F := S\U; then F is closed, hence by (d)
\[ P_0(F) \ge \limsup_{n\to\infty} Q_n^*(f_n \in F). \]
Rewriting the whole expression in terms of U:
\[ 1 - P_0(U) \ge \limsup_{n\to\infty} \int^* (1 - I_U) \circ f_n \, dQ_n = \limsup_{n\to\infty} \Bigl( 1 - \int_* I_U \circ f_n \, dQ_n \Bigr) = 1 - \liminf_{n\to\infty} (Q_n)_*(f_n \in U), \]
hence \liminf_{n\to\infty} (Q_n)_*(f_n \in U) \ge P_0(U).
In fact it is not hard to see that (d) ⇐⇒ (e).
d) + e) ⇒ f) By definition the boundary of a set A ⊂ S, with S a topological space, is ∂A := Ā\Å. For continuity sets A of P_0, due to the trivial inclusions Å ⊂ A ⊂ Ā, one has P_0(Ā) = P_0(A) = P_0(Å).
To prove Q_n^*({f_n ∈ A}) → P_0(A), start by noting
\[ \limsup_{n\to\infty} Q_n^*(\{f_n \in A\}) \le \limsup_{n\to\infty} Q_n^*(\{f_n \in \bar{A}\}) \le P_0(\bar{A}) = P_0(\mathring{A}), \]
where the second inequality follows from assumption (d) and the equality from the remark above. Using assumption (e) we get similarly
\[ P_0(\mathring{A}) \le \liminf_{n\to\infty} (Q_n)_*(\{f_n \in \mathring{A}\}) \le \liminf_{n\to\infty} (Q_n)_*(\{f_n \in A\}) \le \liminf_{n\to\infty} Q_n^*(\{f_n \in A\}). \]
Taking all facts together,
\[ \lim_{n\to\infty} Q_n^*(\{f_n \in A\}) = P_0(A). \]
The proof of (Q_n)_*({f_n ∈ A}) → P_0(A) goes along the same inequalities. We use
\[ \limsup_{n\to\infty} (Q_n)_*(\{f_n \in A\}) \le \limsup_{n\to\infty} Q_n^*(\{f_n \in \bar{A}\}) \le P_0(\bar{A}) = P_0(\mathring{A}) \]
and
\[ P_0(\mathring{A}) \le \liminf_{n\to\infty} (Q_n)_*(\{f_n \in \mathring{A}\}) \le \liminf_{n\to\infty} (Q_n)_*(\{f_n \in A\}). \]
f) ⇒ a) Let G be a bounded real–valued continuous function. Then for some M > 0: G(S) ⊂ [−M, M]. For at most countably many t ∈ [−M, M], {G < t} is not a continuity set of P_0. This is easy to see: ∂{G < t} ⊂ {G = t} by continuity of G. Indeed, {G < t} is open, because G is continuous, and its closure in S is certainly contained in {G ≤ t}, which is a closed set; hence ∂{G < t} ⊂ {G = t}. Let F_t := {G = t}; then for at most countably many t, P_0(F_t) > 0, since
\[ A := \{u \in [-M, M] : P_0(F_u) > 0\} = \bigcup_{n \ge 1} \Bigl\{ u \in [-M, M] : P_0(F_u) > \frac{1}{n} \Bigr\}, \]
the number of t ∈ [−M, M] with P_0(F_t) > 1/n is at most n (as P_0{G ∈ [−M, M]} = P_0(S_0) = 1), and a countable union of finite (or at most countable) sets remains (at most) countable. Let ε > 0, and for any integer k let
\[ B_{k,\varepsilon} := \{ s \in S : k\varepsilon \le G(s) < (k+1)\varepsilon \}, \]
so that ∂B_{k,ε} ⊂ F_{kε} ∪ F_{(k+1)ε}. By choosing ε appropriately we can take all B_{k,ε} to be continuity sets. Then
\[ \int G \, dP_0 - \varepsilon \le \sum_k k\varepsilon\, P_0(B_{k,\varepsilon}) = \lim_{n\to\infty} \sum_k k\varepsilon\, Q_n^*(f_n \in B_{k,\varepsilon}) \le \liminf_{n\to\infty} \int^* G \circ f_n \, dQ_n \]
\[ \le \limsup_{n\to\infty} \int^* G \circ f_n \, dQ_n \le \lim_{n\to\infty} \sum_k (k+1)\varepsilon\, Q_n^*(f_n \in B_{k,\varepsilon}) = \sum_k (k+1)\varepsilon\, P_0(B_{k,\varepsilon}) \le \int G \, dP_0 + \varepsilon. \]
Because G is bounded, these sums over k are finite sums, thus they exist. So
\[ \lim_{n\to\infty} \int^* G \circ f_n \, dQ_n \in \Bigl[ \int G \, dP_0 - \varepsilon, \int G \, dP_0 + \varepsilon \Bigr] \]
for every ε > 0, hence \lim_{n\to\infty} \int^* G \circ f_n \, dQ_n = \int G \, dP_0 and f_n ⇒ f_0.
2.6 Asymptotic tightness and measurability
Let (S, d) be a metric space and set, for A ⊂ S and δ > 0,
\[ A^\delta := \{y \in S : d(y, A) < \delta\}. \]
Note that these sets are open in S (d(·, A) is 1–Lipschitz continuous by lemma A.1.13).
Definition 2.6.1. A p-measure Q defined on the Borel σ–algebra of (S, d) is said to be tight iff for every ε > 0 there exists a compact set K in S such that Q(K) ≥ 1 − ε.
A map f : X → S which is measurable w.r.t. A and the Borel sets of S is said to be a tight random variable iff its law L(f) := Q ∘ f^{−1} is a tight measure on (S, d).
If (X_n, A_n, Q_n) are p-spaces, a sequence f_n : X_n → S, n ≥ 1, is said to be asymptotically tight iff for every ε > 0 there is a compact set K such that
\[ \liminf_{n\to\infty} (Q_n)_*(f_n \in K^\delta) \ge 1 - \varepsilon \quad\text{for every } \delta > 0. \]
Such a sequence is said to be asymptotically measurable iff
\[ \int^* G \circ f_n \, dQ_n - \int_* G \circ f_n \, dQ_n \to 0 \]
for all G ∈ C_b(S).
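For instance (an added illustration, not from the original text), every Borel probability measure Q on R is tight: since
\[ Q([-M, M]) \uparrow Q(\mathbb{R}) = 1 \quad\text{as } M \to \infty, \]
for any ε > 0 one can take the compact set K = [−M, M] with M large enough so that Q(K) ≥ 1 − ε.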
It is easy to see that if f is tight, it has separable support. From the definition
of weak convergence it also follows that fn has to be asymptotically measurable
if fn converges weakly to an f : X → S with separable support. Moreover, fn
being asymptotically tight and fn ⇒ f implies that f is tight.
The next theorem is an extension of Prohorov’s theorem to possibly non measurable mappings and gives a kind of reverse implication, namely that asymptotic
measurability and asymptotic tightness imply that there is a weakly convergent
subsequence.
Theorem 2.6.1 (Prohorov). Let (X_n, A_n, Q_n) be probability spaces and f_n : X_n → S be mappings for n = 1, 2, · · · . If {f_n} is asymptotically tight and asymptotically measurable, then there exists a subsequence f_{n_j} that converges weakly to a tight Borel law.
Proof. We refer to [vdVaartAndWell] theorem 1.3.9 page 21 for a proof.
The following lemma will help to verify whether a given sequence f_n is asymptotically measurable.
Lemma 2.6.2. Let (X_n, A_n, Q_n) be a probability space and f_n : X_n → S a mapping for n = 1, 2, · · · . If {f_n}_{n≥1} is asymptotically tight, and
\[ \int^* G \circ f_n \, dQ_n - \int_* G \circ f_n \, dQ_n \to 0 \quad\text{for all } G \in \mathcal{F}, \tag{2.6.1} \]
where F is a subalgebra of C_b(S) that separates the points of (S, d), then {f_n}_{n≥1} is asymptotically measurable.
Proof. Let ε > 0 and choose a compact set K ⊂ S such that
\[ \liminf_{n\to\infty} (Q_n)_*(\{f_n \in K^\delta\}) \ge 1 - \varepsilon \quad\text{for every } \delta > 0. \]
Next note that one can clearly add all constant functions to F; F then remains a subalgebra for which the assumption, equation 2.6.1, still holds.
Then by the Stone–Weierstrass theorem (A.1.20) the restriction of F to K is uniformly dense in C_b(K). So for G ∈ C_b(S) there exists an F ∈ F with
\[ |G(x) - F(x)| \le \varepsilon/4 \quad\text{for all } x \in K. \]
Using the compactness of K, we can say more: it is even true that
\[ |G(x) - F(x)| < \varepsilon/3 \]
for all x ∈ K^δ and for some δ > 0. This follows from lemma A.1.23.
For {f_n ∈ K^δ} choose a measurable subset A_n, e.g. {f_n ∈ K^δ}_*, such that
\[ Q_n(A_n) = (Q_n)_*(\{f_n \in K^\delta\}). \]
Then, from the definition of inner and outer cover,
\[ Q_n(\mathcal{X}_n \setminus \{f_n \in K^\delta\}_*) = Q_n^*(\{f_n \notin K^\delta\}), \]
and for n large enough and G ∈ C_b(S):
\[ Q_n\bigl(|(G \circ f_n)^* - (G \circ f_n)_*| > \varepsilon\bigr) \le Q_n\bigl(\{|(G \circ f_n)^* - (G \circ f_n)_*| > \varepsilon\} \cap \{f_n \in K^\delta\}_*\bigr) + Q_n\bigl(\mathcal{X}_n \setminus \{f_n \in K^\delta\}_*\bigr) \]
\[ \le Q_n\bigl(\{|(G \circ f_n)^* - (F \circ f_n)^*| + |(F \circ f_n)^* - (F \circ f_n)_*| + |(F \circ f_n)_* - (G \circ f_n)_*| > \varepsilon\} \cap A_n\bigr) + 2\varepsilon \]
\[ \le Q_n\bigl(\{|(F \circ f_n)^* - (F \circ f_n)_*| > \varepsilon/3\}\bigr) + 2\varepsilon. \]
The second inequality holds by asymptotic tightness (for n large enough, Q_n^*({f_n ∉ K^δ}) ≤ 2ε), and the last one by uniform approximation on K^δ. Note that we used part (ii) of lemma 2.1.3 and also lemma 2.1.4 for
\[ |(G \circ f_n)^* - (F \circ f_n)^*| \le |(G \circ f_n) - (F \circ f_n)|^*, \]
and that, since {f_n ∈ K^δ}_* ⊂ {f_n ∈ K^δ} and |G − F| < ε/3 on K^δ,
\[ I_{\{f_n \in K^\delta\}_*} |(G \circ f_n) - (F \circ f_n)| \le (\varepsilon/3)\, I_{\{f_n \in K^\delta\}_*}, \quad\text{hence}\quad I_{\{f_n \in K^\delta\}_*} |(G \circ f_n) - (F \circ f_n)|^* \le (\varepsilon/3)\, I_{\{f_n \in K^\delta\}_*} \ \text{a.s.} \]
Also
\[ |(F \circ f_n)_* - (G \circ f_n)_*| = |-(-F \circ f_n)^* + (-G \circ f_n)^*| \le |(G \circ f_n) - (F \circ f_n)|^*, \]
and the latter term is at most ε/3 a.s. on A_n = {f_n ∈ K^δ}_*.
Since, for F ∈ F, E[(F ∘ f_n)^* − (F ∘ f_n)_*] → 0, we have (F ∘ f_n)^* − (F ∘ f_n)_* → 0 in probability, and so (G ∘ f_n)^* − (G ∘ f_n)_* → 0 in probability too. This is a uniformly bounded sequence and it easily follows that
\[ \int^* G \circ f_n \, dQ_n - \int_* G \circ f_n \, dQ_n = E[(G \circ f_n)^* - (G \circ f_n)_*] \to 0. \]
We still need a uniqueness condition for tight Borel measures on the Borel subsets of S. Combined with Prohorov's theorem, which gives weakly convergent subsequences, one can then prove weak convergence to a unique limit.
Lemma 2.6.3. Let µ, ν be two finite Borel measures on a metric space (S, d).
i) If ∫ F dµ = ∫ F dν for all F ∈ C_b(S), then µ = ν.
ii) Let µ, ν be tight Borel probability measures on (S, d). If ∫ F dµ = ∫ F dν for every F in a vector lattice F ⊂ C_b(S) (see A.1.12 for a definition) that contains the constants and separates points of S, then µ = ν.
Proof.
i) Let G be an open set of S, and define h_k(x) := min(k d(x, S\G), 1). Then 0 ≤ h_k ≤ I_G, because h_k(x) = 0 for x ∈ S\G and h_k is bounded by 1 everywhere. Also h_k ≤ h_{k+1}: when x ∉ G, h_k(x) = 0 = h_{k+1}(x), and for x ∈ G, k d(x, S\G) ≤ (k+1) d(x, S\G), so min(k d(x, S\G), 1) ≤ min((k+1) d(x, S\G), 1). The h_k are clearly continuous and bounded (they are actually k–Lipschitz, compare with the proof of (c) ⇒ (d) in the extended portmanteau theorem 2.5.1) and converge monotonically to I_G. By the Monotone Convergence theorem A.2.18,
\[ \mu(G) = \lim_{k\to\infty} \int h_k \, d\mu = \lim_{k\to\infty} \int h_k \, d\nu = \nu(G). \]
The open sets generate the Borel σ–algebra and form a π–system, i.e. they are closed under finite intersections. So both measures µ and ν extend uniquely to the whole σ–algebra and thus are equal, theorem A.2.16.
ii) Let ε > 0 and choose a compact K ⊂ S for which min(µ(K), ν(K)) ≥ 1 − ε. By a version of the Stone–Weierstrass theorem (A.1.21), a vector lattice F ⊂ C_b(K) which contains the constants and separates points of S is uniformly dense in C_b(K). Let G ∈ C_b(S); then G is uniformly bounded, say ‖G‖_∞ ≤ M for some M > 0. By adding M to G and dividing G + M by 2M, one has 0 ≤ (G + M)/2M ≤ 1; let g := (G + M)/2M.
Since F is uniformly dense in C_b(K), take F ∈ F with |g(x) − F(x)| ≤ ε for all x ∈ K. Because 0 ≤ g ≤ 1 we can also take f := max(min(F, 1), 0) as an ε–approximation of g: first, if y ∈ {F > 1} ∩ K,
\[ F(y) - g(y) = |g(y) - F(y)| \le \varepsilon \quad\text{and}\quad 0 \le 1 - g(y) \le |F(y) - g(y)|, \]
thus |min(F, 1)(x) − g(x)| ≤ ε for all x ∈ K. In the second step, for z ∈ {min(F, 1) < 0} ∩ K,
\[ 0 \le g(z) \le g(z) - \min(F, 1)(z) = |g(z) - \min(F, 1)(z)| \le \varepsilon. \]
Now we will bound A := |∫ g dµ − ∫ g dν|. Since F is a vector lattice, f := max(min(F, 1), 0) ∈ F, and f is uniformly bounded by 1. Then
\[ A = \Bigl| \int g \, d\mu - \int g \, d\nu \Bigr| = \Bigl| \int (g - f) \, d\mu + \int f \, d\mu - \int (g - f) \, d\nu - \int f \, d\nu \Bigr| \]
\[ \le 2\varepsilon + \Bigl| \int_{S\setminus K} (g - f) \, d\mu \Bigr| + \Bigl| \int f \, d\mu - \int f \, d\nu \Bigr| + \Bigl| \int_{S\setminus K} (g - f) \, d\nu \Bigr| \le 2\varepsilon + 2\varepsilon + \Bigl| \int f \, d\mu - \int f \, d\nu \Bigr| + 2\varepsilon = 6\varepsilon, \]
since by assumption ∫ H dµ = ∫ H dν for all H ∈ F. Thus, since g := (G + M)/2M,
\[ \Bigl| \int g \, d\mu - \int g \, d\nu \Bigr| = \frac{1}{2M} \Bigl| \int G \, d\mu - \int G \, d\nu \Bigr|, \]
so that |∫ G dµ − ∫ G dν| ≤ 12Mε for any ε > 0. Hence ∫ G dµ = ∫ G dν for all G ∈ C_b(S), and then by part (i), µ equals ν on the Borel σ–algebra of (S, d).
2.7 Spaces of bounded functions
Let T be an arbitrary set. We now look at the space
\[ \ell^\infty(T) := \{F : T \to \mathbb{R} : F \text{ is uniformly bounded}\}. \]
This is a normed space with the norm ‖z‖_T := sup_{t∈T} |z(t)| and corresponding metric d(z_1, z_2) = ‖z_1 − z_2‖_T, z_1, z_2 ∈ ℓ^∞(T).
Any (pointwise) bounded stochastic process (f(t, ·))_{t∈T} on a probability space (X, A, P) gives a mapping f : X → ℓ^∞(T). This mapping possesses some measurability, since the functions f(t, ·) : X → R have to be A-measurable. This will allow us to simplify the general criterion for asymptotic measurability, see lemma 2.6.2, somewhat.
Lemma 2.7.1. Let (Xn , An , Qn ) be a probability space, T a set and fn a function from Xn into `∞ (T ) for n = 1, 2, · · · . Suppose that fn is asymptotically
tight. Then fn is asymptotically measurable iff (fn (t1 ), · · · , fn (tk )) : Xn → Rk is
asymptotically measurable for any choice t1 , . . . , tk ∈ T and k ≥ 1.
Proof. Assume f_n : X_n → ℓ^∞(T) is asymptotically measurable. Then we have to show that the same is true for (f_n(t_1), . . . , f_n(t_k)) : X_n → R^k for t_1, . . . , t_k ∈ T. Note that (f_n(t_1), . . . , f_n(t_k)) = g ∘ f_n, where g = (π_{t_1}, . . . , π_{t_k}) : ℓ^∞(T) → R^k, with π_t : ℓ^∞(T) → R : z ↦ z(t) the projection onto the t–th coordinate of z. π_t is trivially continuous since ℓ^∞(T) bears the uniform topology. This of course implies that g : ℓ^∞(T) → R^k is continuous as well.
Let H ∈ C_b(R^k). Then we have trivially
\[ E[(H(f_n(t_1), \ldots, f_n(t_k)))^* - (H(f_n(t_1), \ldots, f_n(t_k)))_*] = E[((H \circ g) \circ f_n)^* - ((H \circ g) \circ f_n)_*]. \]
Since H ∘ g ∈ C_b(ℓ^∞(T)), the latter term converges to zero as n → ∞. Hence, as claimed, (f_n(t_1), . . . , f_n(t_k)) is asymptotically measurable.
Conversely, consider the collection of continuous functions
\[ \mathcal{F} := \{ h : \ell^\infty(T) \to \mathbb{R} : z \mapsto h(z) := G(z(t_1), \cdots, z(t_k)),\ G \in C_b(\mathbb{R}^k),\ t_i \in T,\ i = 1, \cdots, k,\ k \in \mathbb{N} \}. \]
Such a class of functions has certain properties: it is an algebra and a vector lattice (see definition A.1.12). For the proofs we refer to lemma A.1.22. Moreover, F also separates elements of ℓ^∞(T): let z_1 ≠ z_2, then for some t ∈ T, z_1(t) ≠ z_2(t). Both functions are bounded by some M; then h(z) := max(min(M, z(t)), −M) is bounded (by M) and continuous, as a composition of continuous functions (the projection onto the t–th coordinate, min and max), and h(z_1) ≠ h(z_2).
By Lemma 2.6.2 we have asymptotic measurability if
\[ E[(F \circ f_n)^* - (F \circ f_n)_*] \to 0 \ \text{as } n \to \infty, \quad \forall F \in \mathcal{F}. \]
By definition of the function class F this is the same as asymptotic measurability of (f_n(t_1), . . . , f_n(t_k)) for all t_1, . . . , t_k ∈ T and k ≥ 1.
In the sequel we will call the (finite-dimensional) distributions of f, i.e. all distributions of (f(t_1), . . . , f(t_k)) for t_1, . . . , t_k ∈ T and k ≥ 1, the marginal distributions of f, or simply the marginals. Using the same argument as in the above proof, we can infer from Lemma 2.6.3 the following.
Lemma 2.7.2. Let f, g be two tight Borel measurable maps from a probability space (X, A, P) into ℓ^∞(T). Then f and g are equal in Borel law iff all corresponding marginals of f and g are equal in law.
Proof. One implication is trivial: if f and g have the same Borel law, then, because the projections are measurable, the f(t) are measurable, and for t_i ∈ T, i = 1, · · · , k, (f(t_1), · · · , f(t_k)) is measurable too. Hence
\[ P((f(t_1), \cdots, f(t_k)) \in B) = P(((\pi_{t_1}, \cdots, \pi_{t_k}) \circ f) \in B) = P(f \in A) = P(g \in A) = P(((\pi_{t_1}, \cdots, \pi_{t_k}) \circ g) \in B) = P((g(t_1), \cdots, g(t_k)) \in B), \]
where B is a Borel set of R^k and A := {F ∈ ℓ^∞(T) : (F(t_1), · · · , F(t_k)) ∈ B}.
Conversely, suppose all marginals are equal in distribution. Consider again the collection of functions
\[ \mathcal{F} := \{ h : \ell^\infty(T) \to \mathbb{R} : z \mapsto h(z) := G(z(t_1), \cdots, z(t_k)),\ G \in C_b(\mathbb{R}^k),\ t_i \in T,\ i = 1, \cdots, k,\ k \in \mathbb{N} \}. \]
Now we note that ∫ h ∘ f dP = ∫ h ∘ g dP for all h ∈ F. But this is true, since h depends only on finitely many coordinates and the marginals of f and g are equal. Then, by lemma 2.6.3, f and g are equal in Borel law.
Combining the two above lemmas with Prohorov's theorem, we get the following.
Theorem 2.7.3. Let (X_n, A_n, Q_n) be a probability space, and let f_n be a mapping from X_n to ℓ^∞(T) for n = 1, 2, · · · .
Then f_n converges weakly to a tight limit iff {f_n}_{n≥1} is asymptotically tight and the marginals (f_n(t_1), · · · , f_n(t_k)) converge weakly to a limit for every finite subset {t_1, · · · , t_k} ⊂ T, k = 1, 2, · · · .
Proof. Suppose f_n converges to a tight limit, say f : X_0 → ℓ^∞(T), which is defined on a p-space (X_0, A_0, Q_0).
We first show that f_n is asymptotically tight. Let ε > 0. Since f is tight, take a compact subset K of ℓ^∞(T) with Q_0{f ∈ K} ≥ 1 − ε. By part (e) of the extended portmanteau theorem 2.5.1, since K^δ is open,
\[ 1 - \varepsilon \le Q_0(\{f \in K^\delta\}) \le \liminf_{n\to\infty} (Q_n)_*(\{f_n \in K^\delta\}) \]
for any δ > 0.
That the marginals converge follows from the continuous mapping theorem.
Now assume f_n is asymptotically tight and the marginals converge for any finite subset T_0 of T. From the definition of weak convergence it follows then that (f_n(t))_{t∈T_0} is asymptotically measurable. So, by lemma 2.7.1, f_n is asymptotically measurable, and it was given that f_n is also asymptotically tight. Thus by Prohorov's theorem 2.6.1, {f_n} is relatively compact. So one may possibly have different weak limits, but since all marginals converge to a certain (unique) limit and, by lemma 2.7.2, convergence of all marginals is enough to characterize a tight limit, the limit must always be the same. So f_n converges weakly to a unique limit.
As seen from theorem 2.7.3, two conditions have to be satisfied for weak convergence. Convergence of marginals is usually the easier part, because a lot of techniques are already available. The second condition, tightness, is much harder. This will be related to an asymptotic uniform equicontinuity condition on the sample paths t ↦ f_n(t).
Definition 2.7.1. Let ρ be a semimetric on T. A sequence f_n : X_n → ℓ^∞(T) is said to be asymptotically uniformly ρ–equicontinuous in probability iff for every ε, η > 0 there exists a δ > 0 such that
\[ \limsup_{n\to\infty} Q_n^*\Bigl( \sup\{|f_n(s) - f_n(t)| : s, t \in T, \rho(s, t) < \delta\} > \varepsilon \Bigr) < \eta. \tag{2.7.1} \]
Condition (2.7.1) will also be called the asymptotic equicontinuity condition.
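As an added illustration (not from the original text): substituting the classical uniform empirical process of Section 1.2.2, with T = [0, 1], f_n(t) = α_n(t) = √n(F_n(t) − t) and ρ(s, t) = |s − t|, condition (2.7.1) reads
\[ \limsup_{n\to\infty} Q_n^*\Bigl( \sup_{|s - t| < \delta} |\alpha_n(s) - \alpha_n(t)| > \varepsilon \Bigr) < \eta, \]
which is the classical asymptotic equicontinuity condition underlying Donsker-type theorems.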
Theorem 2.7.4. A sequence f_n : (X_n, A_n, Q_n) → ℓ^∞(T) is asymptotically tight iff f_n(t) is asymptotically tight in R for every t ∈ T, and there exists a semimetric ρ on T such that (T, ρ) is totally bounded and {f_n}_{n≥1} is asymptotically uniformly ρ–equicontinuous in probability.
Addendum. Moreover, if f_n ⇒ f_0, then almost all paths t ↦ f_0(·)(t) are uniformly ρ–continuous; and the semimetric ρ can, without loss of generality, be taken equal to any semimetric ρ̃ for which (T, ρ̃) is totally bounded and the paths t ↦ f_0(·)(t) are uniformly ρ̃–continuous.
Proof.
⇐ Let ζ > 0, and {m }m≥1 a sequence:
m > 0 and m ↓ 0.
We claim that kfn kT := supt∈T |fn (t)| is a tight sequence in R. Indeed, let
= 1 and since we have asymptotic ρ–equicontinuity of fn in probability,
2.7. SPACES OF BOUNDED FUNCTIONS
27
condition 2.7.1, for that particular choice of and ζ =: η implies that there
is a δ > 0 such that
lim inf (Qn )∗ sup{|fn (s) − fn (t)| : s, t ∈ T, ρ(s, t) < δ} ≤ 1 ≥ 1 − ζ
n→∞
and since (T, ρ) is totally bounded: T ⊂ ∪rj=1 B(T,ρ) (tj , δ) for some tj ∈ T .
Hence if xn ∈ sup{|fn (s) − fn (t)| : s, t ∈ T, ρ(s, t) < δ} ≤ 1 , we
have kfn (xn )k∞ ≤ maxrj=1 |fn (xn )(tj )| + 1.
Moreover, the functions fn (tj ) : Xn → R are asymptotically tight. This
implies that there exist constants Mj > 0 so that (Qn )∗ (|fn (tj )| ≤ Mj ) ≥
1 − η/r, 1 ≤ j ≤ r. Setting M = max(M1 , . . . , Mr ) + 1 we can conclude
that
lim inf (Qn )∗ kfn kT ≤ M ≥ 1 − 2ζ.
n→∞
For εm and η := 2^{−m} ζ, choose δm > 0 such that equation (2.7.1) is valid. For each m, finitely many balls of radius δm cover T; they could have non–empty intersection. To avoid that we make a partition out of them, as usual by excluding the previous ones, i.e.

B_k^{(m)} := B_{(T,ρ)}(t_k^{(m)}, δm) \ ∪_{j=1}^{k−1} B_{(T,ρ)}(t_j^{(m)}, δm)

for all m ≥ 1 and k ∈ {1, · · · , qm}. For m fixed, let zj, 1 ≤ j ≤ pm, be the functions from T into R that are constant on each B_k^{(m)}, 1 ≤ k ≤ qm, and take one of the values 0, ±εm, · · · , ±⌈M/εm⌉εm on each B_k^{(m)}. The number of such functions is bounded by (2⌈M/εm⌉ + 1)^{qm}, hence finite. Further let

Km := ∪_{j=1}^{pm} B_{ℓ∞(T)}(zj, εm).

Next, for xn in

{‖fn‖T ≤ M} ∩ { max_{1≤k≤qm} sup_{s,t∈B_k^{(m)}} |fn(s) − fn(t)| ≤ εm }

it follows that xn lies in {fn ∈ Km}. Indeed, for t_k^{(m)} as above one has |fn(xn)(t_k^{(m)})| ≤ M ≤ ⌈M/εm⌉εm; moreover fn(xn)(t_k^{(m)}) ∈ [lεm, (l + 1)εm] for some −⌈M/εm⌉ ≤ l ≤ ⌈M/εm⌉ − 1, and for all s ∈ B_k^{(m)}:

fn(xn)(s) ∈ B_{ℓ∞(T)}(lεm I_{B_k^{(m)}}, εm) ∪ B_{ℓ∞(T)}((l + 1)εm I_{B_k^{(m)}}, εm).
Let K := ∩_{m≥1} Km; because all Km are closed, K is closed too. Moreover K is also totally bounded in ℓ∞(T): for ξ > 0, choose m̃ with εm̃ < ξ, which can be done since εm ↓ 0, and note that

K ⊂ Km̃ = ∪_{j=1}^{pm̃} B_{ℓ∞(T)}(zj, εm̃) ⊂ ∪_{j=1}^{pm̃} B_{ℓ∞(T)}(zj, ξ).

Hence K is compact. Moreover, given any δ > 0 there is an m = mδ ≥ 1 so that

K^δ ⊃ ∩_{i=1}^m Ki.
This will be proved by contradiction. Let δ > 0 be arbitrary and suppose that for each m: K^δ ⊉ ∩_{i=1}^m Ki. Then there is a sequence {um}m≥1 with um ∈ ∩_{i=1}^m Ki and um ∉ K^δ.
In particular um ∈ K1 for all m ≥ 1. Since K1 = ∪_{j=1}^{p1} B_{ℓ∞(T)}(zj, ε1), there exist a subsequence {u_{k1(n)}}n≥1 and a closed ball B1 of K1 such that

u_{k1(n)} ∈ B1 := B_{ℓ∞(T)}(z̃1, ε1).

Next, since k1(n) > n: u_{k1(n)} ∈ K2 for all n ≥ 1. So there exist a (closed) ball B2 of K2 and a further subsequence {u_{k2(n)}}n≥1 of {u_{k1(n)}}n≥1 such that u_{k2(n)} ∈ B2, n ≥ 1. Continuing recursively we obtain at stage j a subsequence {u_{kj(n)}}n≥1 of {u_{kj−1(n)}}n≥1 such that u_{kj(n)} ∈ Bj, with Bj a closed ball making up Kj.
Now let {u_{kn(n)}}n≥1 denote the diagonal subsequence; for each m ≥ 0 and n ≥ m: u_{kn(n)} ∈ Bm, a closed ball of Km, which has radius εm. So {u_{kn(n)}}n≥1 is a Cauchy sequence in ℓ∞(T) (at every stage m, elements of the tail of the diagonal sequence lie at most 2εm away from each other, and εm ↓ 0). It is well known that the space (ℓ∞(T), ‖ · ‖T) is complete, in other words any Cauchy sequence converges. Let u denote the limit of {u_{kn(n)}}n≥1. Since the limit of {u_{km(m)}}m≥1 is the same as that of {u_{km(m)}}m≥j+1 for any j ≥ 1, one obtains that u ∈ Bj for any j ≥ 1, because {u_{km(m)}}m≥j+1 ⊂ Bj and Bj is closed.
So in particular u ∈ K. On the other hand, since all um ∉ K^δ, the sequence {u_{km(m)}}m≥1 lies in (K^δ)^c, which is closed; hence u ∈ (K^δ)^c as well. Thus u ∈ (K^δ)^c ∩ K, which is clearly a contradiction.
Finally, if fn(xn) ∉ K^δ, then fn(xn) ∉ ∩_{i=1}^m Ki for some fixed m, and it follows that

lim sup_{n→∞} Q*_n{fn ∉ K^δ}
≤ lim sup_{n→∞} Q*_n( fn ∉ ∩_{i=1}^m Ki )
≤ lim sup_{n→∞} Q*_n( ∪_{i=1}^m [ {‖fn‖T > M} ∪ { max_{1≤k≤qi} sup_{s,t∈B_k^{(i)}} |fn(s) − fn(t)| > εi } ] )
≤ lim sup_{n→∞} ( Q*_n{‖fn‖T > M} + Σ_{i=1}^m Q*_n{ max_{1≤k≤qi} sup_{s,t∈B_k^{(i)}} |fn(s) − fn(t)| > εi } )
≤ 2ζ + Σ_{i=1}^m 2^{−i} ζ < 3ζ.
fn is thus asymptotically tight.
⇒ If fn is asymptotically tight, then fn(t) will also be asymptotically tight for any t ∈ T. Let ε > 0; then there exists a compact K ⊂ ℓ∞(T) such that

lim inf_{n→∞} (Qn)∗(fn ∈ K^δ) ≥ 1 − ε

for all δ > 0. Now since πt is continuous, πt(K) is compact (A.1.4). Moreover πt(K^δ) ⊂ (πt(K))^δ, because if h ∈ K^δ then for some k ∈ K: sup_{u∈T} |k(u) − h(u)| < δ, in particular |πt(h) − πt(k)| = |h(t) − k(t)| < δ, and

(Qn)∗(fn ∈ K^δ) ≤ (Qn)∗(fn(t) ∈ (πt(K))^δ).

We now take compact sets K1 ⊂ K2 ⊂ · · · such that for any ε > 0:

lim inf_{n→∞} (Qn)∗(fn ∈ K_m^ε) ≥ 1 − 1/m.

And for each m fixed we define the semimetric ρm : T × T → R+ by

ρm(s, t) := sup_{z∈Km} |z(s) − z(t)|, s, t ∈ T.

The triangle inequality follows from the triangle inequality for the absolute value.
We now claim that (T, ρm) is a totally bounded space. Let η > 0; since Km is compact it is also totally bounded: Km ⊂ ∪_{i=1}^k B_{d∞}(zi, η). Next we partition R^k in cubes of edge η. Since each zi, i = 1, · · · , k, is uniformly bounded in R, (z1, z2, · · · , zk) is uniformly bounded in R^k. Hence the set A := {(z1(t), z2(t), · · · , zk(t)) : t ∈ T} has non–empty intersection with only finitely many cubes, say p such cubes. For each cube we pick one element s from T such that (z1(s), z2(s), · · · , zk(s)) lies in that cube. Thus we have got only finitely many tj, j = 1, · · · , p, such that (z1(tj), z2(tj), · · · , zk(tj)) lies in a cube.
Then T ⊂ ∪_{i=1}^p B_{ρm}(ti, 3η). First note that

sup_{z∈Km} |z(t) − z(s)| ≤ 2η + max_{j=1,...,k} |zj(t) − zj(s)|,

since Km ⊂ ∪_{i=1}^k B_{d∞}(zi, η). Secondly, (z1(t), · · · , zk(t)) lies in the cube of some (unique) (z1(ti), · · · , zk(ti)) of edge η. So for t ∈ T fixed, choose ti ∈ {t1, · · · , tp} such that (z1(t), · · · , zk(t)) lies in the cube of (z1(ti), · · · , zk(ti)):

ρm(t, ti) = sup_{z∈Km} |z(t) − z(ti)| ≤ 2η + max_{j=1,...,k} |zj(t) − zj(ti)| ≤ 2η + η = 3η.
Without loss of generality we can take another metric that is bounded by 1 and induces the same topology as ρm, e.g. min(ρm, 1). We now define a metric

ρ(s, t) := Σ_{m≥1} 2^{−m} min(ρm(s, t), 1)

on T. With this new metric T will still be totally bounded. Let η > 0, and take m ∈ N with 2^{−m} < η. (T, ρm) was totally bounded, so there are finitely many ti, i = 1, · · · , p, with T ⊂ ∪_{i=1}^p B_{ρm}(ti, η). Because Kn ⊂ Km, we have ρn ≤ ρm for 1 ≤ n ≤ m. For t ∈ T, take ti so that t ∈ B_{ρm}(ti, η). Then

ρ(t, ti) ≤ Σ_{l=1}^m 2^{−l} ρl(t, ti) + Σ_{l≥m+1} 2^{−l} ≤ Σ_{l=1}^m 2^{−l} η + 2^{−m} < 2η.
Now we prove the asymptotic uniform ρ–equicontinuity in probability of fn. Let z ∈ Km; from the definition of ρm it follows that |z(t) − z(s)| ≤ ρm(t, s). Because fn is asymptotically tight, it would suffice to have

K_m^ε ⊂ { z ∈ ℓ∞(T) : sup_{ρ(s,t)<2^{−m}ε} |z(s) − z(t)| ≤ 3ε }

for some m ∈ N. Indeed, if z ∈ K_m^ε, then for some z̃ ∈ Km:

|z(s) − z(t)| ≤ |z(s) − z̃(s)| + |z̃(s) − z̃(t)| + |z̃(t) − z(t)| ≤ ε + ρm(s, t) + ε ≤ 2ε + ρ(s, t) 2^m,

since min(ρm, 1) ≤ ρ 2^m. So if δ < 2^{−m}ε:

lim inf_{n→∞} (Qn)∗( sup_{ρ(s,t)<δ} |fn(s) − fn(t)| ≤ 3ε ) ≥ 1 − 1/m.

Since this is true for any m ≥ 1 and any ε > 0, we easily see that condition (2.7.1) is satisfied.
It remains to prove the addendum. So assume that fn ⇒ f0, where f0 is defined on the probability space (X0, A0, Q0). Defining the sets Km as in the above proof of the implication "⇒", we have that Q0(f0 ∈ Km) ≥ 1 − 1/m, m ≥ 1, which trivially implies that Q0(f0 ∈ ∪_{m=1}^∞ Km) = 1. By definition of ρm all functions in Km are uniformly ρm–continuous, since

|z̃(t) − z̃(s)| ≤ sup_{z∈Km} |z(t) − z(s)| ≤ ρm(s, t) for z̃ ∈ Km,

and this also implies that they are uniformly ρ–continuous (since ρm ∧ 1 ≤ 2^m ρ). Thus ∪_{m=1}^∞ Km is a subset of the uniformly ρ–continuous functions on T.
To prove the last part of the addendum, we note first that the set of uniformly continuous functions on a totally bounded semi–metric space T is separable and complete, so that f0 is tight (for a proof see [Bill2] chapter 1 theorem 1.3 on page 8). Indeed, let

(UC(T), ‖ · ‖T) := {g : (T, ρ̃) → R : g is uniformly continuous}.

Then it is well known (from standard real analysis) that UC(T) is a closed subset of the complete space (ℓ∞(T), ‖ · ‖T), hence (UC(T), ‖ · ‖T) is complete too. For the separability, recall that any metric space can be completed; in particular there exist a complete metric space (S, e) and an isometry φ from T into S such that φ(T) is dense in S (e.g. see [Dud1] theorem 2.5.1 on page 58 for a proof). Also φ(T) is still totally bounded (since an isometry preserves distances). We claim that the closure of φ(T) (= S) is still totally bounded, so that (S, e) would be compact. Let ε > 0 and let x be in the closure of φ(T); since φ(T) is totally bounded there exist φ(tj) (an isometry is always injective, so φ : T → φ(T) is a bijection), 1 ≤ j ≤ n, with

φ(T) ⊂ ∪_{j=1}^n B(φ(tj), ε/2).

Also there exists a sequence {yn} ⊂ φ(T) with yn → x; hence for k large enough e(yk, x) < ε/2, and for some φ(tl): e(φ(tl), yk) < ε/2, so that the closure of φ(T) satisfies

closure(φ(T)) ⊂ ∪_{j=1}^n B(φ(tj), ε).
Hence (S, e) is a compact metric space and T is isometric to a dense subset of S, so w.l.o.g. one can assume that T is a dense subset of a compact space. Now the separability of (UC(T), ‖ · ‖T) follows from the separability of C(S) (theorem A.1.20: consider the polynomial functions on S) together with the string of equalities

UC(T) = UC(S) = C(S),

where the first equality follows by extending uniformly continuous functions on T by means of a limit argument to S, and the second from the fact that continuous functions on a compact space are uniformly continuous (corollary A.1.11). Further consider

F_{δ,ε} = {z ∈ ℓ∞(T) : sup{|z(s) − z(t)| : s, t ∈ T, ρ̃(s, t) < δ} ≥ ε}.

Note that such sets are closed subsets of ℓ∞(T). Consequently, we have

lim sup_{n→∞} Q*_n(fn ∈ F_{δ,ε}) ≤ Q0(f0 ∈ F_{δ,ε})

(by theorem 2.5.1 (d)). Since the paths t 7→ f0(t) are Q0–a.s. uniformly ρ̃–continuous, we have for any ε > 0

Q0(f0 ∈ F_{δ,ε}) → 0 as δ → 0.

Combining the two last conditions, we get (2.7.1) for the semimetric ρ̃.
Chapter 3
Vapnik–Červonenkis classes.

3.1 Introduction: definitions and a fundamental lemma.
This part is concerned with a special kind of classes of sets. If one desires to obtain results on convergence in law, in (outer) probability or almost uniformly for a sequence of probability measures, uniformly over a class of sets, it should not be surprising that such (uniform) convergence will not hold for an arbitrary class. One of the paths leading towards a solution uses the concept of a VC class. The condition imposed ensures that the class of sets considered is, in some sense to be defined later on, not too big. Unfortunately this isn't enough: one also has to impose some (mild) measurability conditions on that class. The good news, however, is that this last remark is more of a theoretical nature; in daily practice the statistician hardly encounters classes that violate the measurability condition.
Definition 3.1.1. Let X be any set and C a class of subsets of X. For A ⊂ X, let CA := C ⊓ A := A ⊓ C := {C ∩ A : C ∈ C}. Denote the cardinality of A by |A| and let 2^A := {B : B ⊂ A}. Let ∆C(A) := |CA|.
C is said to shatter A iff C ⊓ A = 2^A. If A is finite, then C shatters A iff ∆C(A) = 2^{|A|}.
Let mC(n) := max{∆C(F) : F ⊂ X, |F| = n}, for n = 0, 1, 2, · · · . If X is finite, say |X| < n, then mC(m) ≤ 2^n for all m. Now we define

V(C) := inf{n : mC(n) < 2^n} if this infimum is finite, and V(C) := +∞ if mC(n) = 2^n for all n;
S(C) := sup{n : mC(n) = 2^n}, and S(C) := −1 if C is empty.

We clearly have S(C) = V(C) − 1, so if one is finite, then the other is too. S(C) is, by definition, the largest cardinality of a set shattered by C, and V(C) is the smallest n such that no set of cardinality n is shattered by C.
C will be called a Vapnik–Červonenkis or VC class whenever V(C) < ∞.
In the case X = R, or more generally R^d, one can give classes of sets which are VC classes. We will work out only one example, and give references to books where more examples are given. In the case of R, the class C := {]−∞, t] : t ∈ R} is a VC class with S(C) = 1. Indeed, any singleton is shattered. Now take any set of two points {x1, x2} and w.l.o.g. assume x1 < x2; then {x1, x2} ⊓ C = {∅, {x1}, {x1, x2}}, but x2 cannot be isolated by sets of C: every set that contains x2 has to contain x1 too. For similar results in higher dimensions, we refer to [vdVaartAndWell] Example 2.6.1 on page 135.
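The computation ∆C({x1, x2}) = 3 < 4 above can be checked mechanically. The following sketch (illustrative only; the brute force applies to any class restricted to a finite point set) collects the traces C ⊓ A for the half–lines ]−∞, t] and tests shattering:

```python
def traces(points, thresholds):
    """All distinct traces {x in points : x <= t}, t ranging over thresholds."""
    return {frozenset(x for x in points if x <= t) for t in thresholds}

def shattered(points, thresholds):
    """A finite set A is shattered iff the trace class has all 2^|A| subsets."""
    return len(traces(points, thresholds)) == 2 ** len(points)

A = (1.0, 2.0)
ts = (0.0, 1.0, 1.5, 2.0, 3.0)     # enough thresholds to realize every possible trace
print(traces(A, ts))               # only 3 traces: {}, {1.0}, {1.0, 2.0}
print(shattered((1.0,), ts))       # True: singletons are shattered
print(shattered(A, ts))            # False: x2 cannot be isolated from x1
```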
A nice property of VC classes is that the property is preserved by (finitely
many) Boolean operations. This is an easy way to generate VC classes, starting
from simple VC classes. For more on this subject we refer to [vdVaartAndWell]
section 2.6.5 (146–149).
If X is finite, with n elements, then 2^X is a VC class, with S(2^X) = n. Let

n C≤k := Σ_{j=0}^{k} \binom{n}{j},

where \binom{n}{j} is 0 for j > n. We have an identity as in Pascal's triangle.

Proposition 3.1.1. N C≤k = N−1 C≤k + N−1 C≤k−1 for k = 1, 2, · · · and N = 1, 2, · · · .

Proof. For each j = 1, 2, · · · , N we have

\binom{N}{j} = \binom{N−1}{j} + \binom{N−1}{j−1}.

Then summing over all j finishes the proof (by definition \binom{N}{j} = 0 for j > N).
The next facts illustrate why we consider VC classes, and why in some sense VC classes are small. Clearly, for any non–VC class mC(n) = 2^n for all n, meaning that arbitrarily large finite sets are shattered. If however C is a VC class, then the next facts will show that mC(n) = O(n^r) for some r ∈ N. Therefore for a VC class mC(n) grows only polynomially in n, which is much slower than the exponential growth for non–VC classes; a small numerical comparison is sketched below.
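The sketch (illustrative only; the VC index S = 2 is an arbitrary choice) compares the Sauer bound n C≤S with 2^n:

```python
from math import comb

def C_le(n, k):
    """The quantity  n C_{<=k} = sum_{j=0}^{k} binom(n, j)."""
    return sum(comb(n, j) for j in range(k + 1))

S = 2  # hypothetical VC index, for illustration only
for n in (5, 10, 20, 40):
    print(n, C_le(n, S), 2 ** n)
# e.g. for n = 40: 1 + 40 + 780 = 821, versus 2^40, which is about 1.1e12
```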
Theorem 3.1.2 (Sauer's lemma). If mC(n) > n C≤k−1 for some n ≥ k ≥ 1, then mC(k) = 2^k. So if S(C) < ∞, then mC(n) ≤ n C≤S(C) for all n.

Proof. The proof goes by induction on k and n. For k = 1: n C≤0 = 1 < mC(n). Thus mC(n) ≥ 2, so C ⊓ F contains at least two different sets for some finite F; hence for some singleton G = {x}: ∆C(G) = 2. If k > n, then n C≤k−1 = 2^n ≥ mC(n), so the assumption implies k ≤ n.
Assume that the statement is true for k ≤ K and n ≥ k. Now fix k := K + 1. As above, we only need to consider n ≥ k = K + 1. So for n = k = K + 1, the condition

mC(n) > n C≤n−1 = 2^n − 1

implies mC(n) = 2^n. To continue, suppose now that the statement holds for all (k ≤) n ≤ N. We will prove it then for n := N + 1. So one starts with

mC(N + 1) = mC(n) > n C≤k−1 = N+1 C≤K;

by definition of mC(n) there exists a set Hn := {x1, · · · , xn} such that ∆C(Hn) > N+1 C≤K.
Let HN := Hn \ {xn} (recall n := N + 1). If ∆C(HN) > N C≤K we have, by our induction hypothesis, that mC(k) = 2^k. So assume now that

∆C(HN) ≤ N C≤K.

Let Cn := Hn ⊓ C := {Hn ∩ A : A ∈ C}. Furthermore we will need the following sets, called full sets. A set E ⊂ HN is said to be full iff E and E ∪ {xn} both belong to Cn, i.e. there exist C1 ∈ C with E = C1 ∩ Hn and C2 ∈ C with E ∪ {xn} = C2 ∩ Hn. Denote by f the number of full sets. The map

Cn → CN : D 7→ D ∩ HN

is onto; for B := D ∩ HN with D ∈ C, take D ∩ Hn ∈ Cn, and (D ∩ Hn) ∩ HN = B. We claim ∆C(Hn) = ∆C(HN) + f. Indeed, for full sets E we have two candidates, namely E and E ∪ {xn}, and for non–full sets only one possibility.
Let F be the collection of full sets. Suppose f = ∆F(HN) > N C≤K−1. Then by our induction hypothesis there is a set G ⊂ HN of cardinality K with ∆F(G) = 2^K, i.e. G is shattered by the collection of full sets. Let J := G ∪ {xn}; then card(J) = K + 1. For any A ⊂ G there is a full set E with E ∩ G = A, and the corresponding C1, C2 ∈ C with C1 ∩ Hn = E and C2 ∩ Hn = E ∪ {xn} satisfy C1 ∩ J = A and C2 ∩ J = A ∪ {xn}; hence ∆C(J) = 2^{K+1} and mC(k) = 2^{K+1} = 2^k.
If however f ≤ N C≤K−1, then

∆C(Hn) = ∆C(HN) + f ≤ N C≤K + N C≤K−1 = N+1 C≤K,

where the last equation follows from proposition 3.1.1, a contradiction with our assumption ∆C(Hn) > N+1 C≤K.
For the second part it is enough to note that S(C) < +∞ implies mC(S(C) + j) < 2^{S(C)+j} for all j ≥ 1, so by the lemma just proven mC(n) ≤ n C≤S(C) for all n (for all n ≥ S(C) + 1 by the lemma, and vacuously for all n ≤ S(C)).
Proposition 3.1.3 (Vapnik–Červonenkis). Let n be any nonnegative integer and k an integer with k + 2 ≤ n; then n C≤k ≤ 1.5 n^k / k!.

Proof. Two proofs are provided: one, nearly optimal, relying on Stirling's formula, and another, more elementary and based on purely probabilistic methods, where we prove n C≤k ≤ (e/k)^k n^k for all n ≥ k + 1.

• For k = 0: n C≤0 = 1 < 1.5 = 1.5 n^0 / 0!, so we may assume k ≥ 1. Recall the binomial theorem: (a + b)^p = Σ_{j=0}^p \binom{p}{j} a^{p−j} b^j for p = 1, 2, · · · and a, b ∈ R. We deduce easily:

(n + 1)^k = Σ_{j=0}^k \binom{k}{j} n^{k−j} 1^j ≥ \binom{k}{0} n^k + \binom{k}{1} n^{k−1} = n^{k−1}(k + n).

And further, by dividing the above inequality by k!:

(n + 1)^k / k! ≥ n^k / k! + n^{k−1} / (k − 1)!.    (3.1.1)
We will continue by induction on n and k. For k = 1 and n ≥ k + 2 (= 3), n C≤1 = 1 + n < (1/2)n + n = 1.5 n^1 / 1!. Until now, we proved the claim for k = 0, 1 and n ≥ k + 2. Before considering general k and n, we prove it for a specific value of n, namely n := k + 2, with k ≥ 2. For n = k + 2 the desired inequality is

2^n − n − 1 ≤ 1.5 n^{n−2} / (n − 2)! = 1.5 n^{n−1} (n − 1) / n!    (3.1.2)

(the equality holds since n! = n(n − 1)(n − 2)!). Indeed, from the binomial theorem it is deduced that

2^n = (1 + 1)^n = Σ_{j=0}^n \binom{n}{j} = Σ_{j=0}^{n−2} \binom{n}{j} + \binom{n}{n−1} + \binom{n}{n} = n C≤n−2 + n + 1,

so that n C≤k = n C≤n−2 = 2^n − n − 1 when n = k + 2. Since for k = 0 nothing was to be proved, we verify by hand that the inequality holds for k = 1, 2, 3 and 4, or equivalently for n = 3, 4, 5 and 6.
So let n = 3, 4, 5, 6; then

2^3 − 3 − 1 = 4 < 4.5 = 1.5 · 3^2 · 2 / 3!, so for n = 3, (3.1.2) is true;
2^4 − 4 − 1 = 11 < 12 = 1.5 · 4^3 · 3 / 4!, so for n = 4, (3.1.2) is true;
2^5 − 5 − 1 = 26 < 31.25 = 1.5 · 5^4 · 4 / 5!, so for n = 5, (3.1.2) is true;
2^6 − 6 − 1 = 57 < 81 = 1.5 · 6^5 · 5 / 6!, so for n = 6, (3.1.2) is true.
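The four hand–checked cases, and any further ones, can be re–verified mechanically; a small sketch (the range of n is an arbitrary choice):

```python
from math import factorial

def lhs(n):
    return 2 ** n - n - 1                     # = n C_{<= n-2}

def rhs(n):
    return 1.5 * n ** (n - 1) * (n - 1) / factorial(n)

for n in range(3, 12):
    assert lhs(n) < rhs(n), n                 # inequality (3.1.2)
    print(n, lhs(n), round(rhs(n), 2))
```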
For greater n, Stirling's formula gives

n! ≤ (n/e)^n (2πn)^{1/2} exp(1/(12n)),

see [Dud2] theorem 1.3.13 on page 17. So for n ≥ 7 (i.e. k ≥ 5 and n = k + 2), since (2^n − n − 1) n! ≤ 2^n n!, it is enough to have

2^n (n/e)^n (2πn)^{1/2} exp(1/(12n)) ≤ 1.5 n^{n−1}(n − 1) = 1.5 n^n (1 − 1/n).

This will follow from noticing that (e/2)^n ≥ 2 n^{1/2} for n ≥ 7. Indeed, if this holds, then

2^n (n/e)^n (2πn)^{1/2} exp(1/(12n)) = (2/e)^n n^n (2πn)^{1/2} exp(1/(12n))
≤ (1/2) n^{−1/2} n^n (2πn)^{1/2} exp(1/(12n)) = (π/2)^{1/2} exp(1/(12n)) n^n < 1.5 n^n (1 − 1/n),

since, for n = 7,

(π/2)^{1/2} exp(1/84) < 1.27 < 1.28 < 1.5 (1 − 1/7),

and for larger n the left–hand side becomes smaller and the right–hand side greater. It remains to compare the functions (e/2)^x and 2x^{1/2} on [7, +∞[. At x = 7:

(e/2)^7 > 8 > 6 > 2√7.

Also ((e/2)^x)' = (e/2)^x ln(e/2) and (2x^{1/2})' = x^{−1/2}, and for x ≥ 7

(e/2)^x ln(e/2) > 8 · 0.3 > 1 > x^{−1/2}.

And so from the fundamental theorem of calculus, one deduces that (e/2)^x ≥ 2x^{1/2} for all x ≥ 7. Thus for all n, equation (3.1.2) is valid.
Now we have come to the last step. Suppose that for k = 1, · · · , K and n ≥ k + 2 the result holds (notice that we did prove it for k = 1). Then let k = K + 1; we have to prove the claim for n = k + j = (K + 1) + j, where j ≥ 2, by induction on j. For j = 2 nothing has to be proved, since it was done in the previous paragraph. So we do the step from j to j + 1:

n C≤k = (K+1)+j+1 C≤K+1 = (K+1)+j C≤K+1 + (K+1)+j C≤K
≤ 1.5 ((K + 1) + j)^{K+1} / (K + 1)! + 1.5 (K + 1 + j)^K / K!
≤ 1.5 ((K + 1) + j + 1)^{K+1} / (K + 1)! = 1.5 n^k / k!,

where the first equation is the definition, the second is true by proposition 3.1.1, the first inequality by our induction hypothesis, and the second inequality by equation (3.1.1).
• A probabilistic proof of the Vapnik–Červonenkis proposition. Let Y be a Binomial(n, 1/2) random variable. Then, for s a positive integer,

n C≤s := Σ_{j=0}^s \binom{n}{j} = 2^n Pr({Y ≤ s}) = 2^n Pr({s − Y ≥ 0}).

Let a > 1 be arbitrary for now; then

2^n Pr({s − Y ≥ 0}) = 2^n Pr({a^{s−Y} ≥ 1}),

and by Markov's inequality the last term is bounded by 2^n E[a^{s−Y}]. So

n C≤s ≤ 2^n a^s E[a^{−Y}] = 2^n a^s E[r^Y],

where r = 1/a < 1. Since Y ∼ Bin(n, 1/2), we can decompose it as a sum of n independent Bernoulli(1/2), i.e. Binomial(1, 1/2), random variables Yi (Pr({Yi = 0}) = 1/2 = Pr({Yi = 1})) such that Σi Yi is equal in distribution to Y. We rewrite the above expectation, using independence of the Yi's, as

2^n a^s E[r^Y] = 2^n a^s E[r^{Σi Yi}] = 2^n a^s E[Πi r^{Yi}] = 2^n a^s Πi E[r^{Yi}].
Recall that the expectation of a discrete random variable Z is given by Σ_z z Pr(Z = z). Hence

2^n a^s Π_{i=1}^n E[r^{Yi}] = 2^n a^s (r^0 · 1/2 + r^1 · 1/2)^n = 2^n a^s (1/2)^n (1 + r)^n = a^s (1 + r)^n.

Thus let C be a VC class; then by Sauer's lemma 3.1.2, S(C) < +∞ and mC(n) ≤ n C≤S(C) for all n. Let s := S(C); then by the previous discussion one has the inequality

mC(n) ≤ n C≤S(C) ≤ a^s (1 + r)^n

for each a > 1 and r = 1/a. Let n > s (= S(C)). Then, for r = s/n < 1,

a^s (1 + r)^n = r^{−s} (1 + r)^n ≤ (n/s)^s (1 + s/n)^n ≤ (n/s)^s exp(s) = n^s (e/s)^s,

and (e/s)^s is a constant that depends only on S(C).
Definition 3.1.2. Let (X, A) be a measurable space and C ⊂ A. We define

dens(C) := inf{r > 0 : there is a K < +∞ such that mC(n) ≤ K n^r for all n ≥ 1}.
Corollary 3.1.3. For any set X and C ⊂ 2^X, dens(C) ≤ S(C); conversely, if dens(C) < +∞, then S(C) < +∞ too.

Proof. We first show dens(C) ≤ S(C). By Sauer's lemma (3.1.2), since S(C) < +∞, we have mC(n) ≤ n C≤S(C) for all n ≥ 1, and by proposition 3.1.3 there is a K such that mC(n) ≤ K n^{S(C)} for all n ≥ S(C) + 2. By taking a larger constant K, the result holds for all n ≥ 1.
In the other direction, if dens(C) < +∞, then there are a constant K and a strictly positive real number r such that mC(n) ≤ K n^r for all n ≥ 1. Since n^r / 2^n → 0 as n → ∞, we get mC(n) < 2^n for all n from some n0 on, and hence S(C) < +∞.
3.2 Uniform bounds for the packing number of a VC class.

Definition 3.2.1. Let (X, A) be a measurable space and P a law on A. We define a semimetric dP on A as follows:

dP : A × A → R+ : (A, B) 7→ dP(A, B) := P(A∆B),

where A∆B := (A\B) ∪ (B\A). That dP is symmetric is obvious, and P(A∆B) = 0 iff A = B P–a.s. The triangle inequality follows from A∆B ⊂ (A∆C) ∪ (B∆C) for any A, B, C and the subadditivity of P.
Proof.

A∆B = (A\B) ∪ (B\A) = (A ∩ B^c) ∪ (A^c ∩ B)
    = [(A ∩ B^c ∩ C) ∪ (A ∩ B^c ∩ C^c)] ∪ [(A^c ∩ B ∩ C) ∪ (A^c ∩ B ∩ C^c)]
    ⊂ [(B^c ∩ C) ∪ (A ∩ C^c)] ∪ [(A^c ∩ C) ∪ (B ∩ C^c)]
    = [(A ∩ C^c) ∪ (A^c ∩ C)] ∪ [(B ∩ C^c) ∪ (B^c ∩ C)]
    = (A∆C) ∪ (B∆C).
Definition 3.2.2. Let (X, A) be a measurable space and C ⊂ A. We define

s(C) := inf{w : there is a K = K(w, C) < +∞ such that, for every law P on A and every 0 < ε ≤ 1, D(ε, C, dP) ≤ K ε^{−w}},

the index of C.

We state the definition of the packing number and of the envelope function for a class of real–valued measurable functions on some probability space.

Definition 3.2.3. Let (S, d) be a metric space, let ε > 0 and A ⊂ S. The packing number of the set A of radius ε is the quantity

D(ε, A, d) := sup{n : there exists a set F ⊂ A with n elements satisfying d(x, y) > ε for all x ≠ y, x, y ∈ F}.

Let F ⊂ L²(X, A, P), where (X, A, P) is some probability space. Let

F_F(x) := sup{|f(x)| : f ∈ F} = ‖δx‖_F,

where δx(g) := g(x). A measurable function F ≥ F_F is called an envelope function for F. If F_F is A–measurable, then it is said to be the envelope function. For any law Q on (X, A), F_F^*, the essential infimum as defined in chapter 2 section 1, is an envelope function for F depending on Q.
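For intuition, the packing number of definition 3.2.3 can be bounded from below by a greedy sweep. The sketch below (illustrative only; all concrete choices are ours) does this for the class C = {]−∞, t] : t ∈ [0, 1]} under the semimetric dP of definition 3.2.1 with P the uniform law on [0, 1], so that dP(]−∞, s], ]−∞, t]) = |s − t|.

```python
import numpy as np

def greedy_packing(points, dist, eps):
    """Greedy lower bound for D(eps, A, d): keep a point only if it lies more
    than eps away from every point kept so far; the kept set is a packing."""
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return len(kept)

# C = {(-inf, t] : t in [0,1]} identified with its parameter t;
# under P = Uniform[0,1]:  d_P((-inf,s], (-inf,t]) = P(]s,t]) = |s - t|.
ts = np.linspace(0.0, 1.0, 10001)
for eps in (0.5, 0.1, 0.01):
    print(eps, greedy_packing(ts, lambda s, t: abs(s - t), eps))
# roughly 1/eps points fit, consistent with a bound of order eps^{-1} for this VC class
```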
The next theorem will be useful when dens(C) is finite, because then the index s(C) is finite too, and we then have a uniform bound on the packing number valid for any law P.
Theorem 3.2.1. For any measurable space (X, A) and C ⊂ A, dens(C) ≥ s(C).

Proof. Let P be a probability measure on A and let ε ∈ ]0, 1]. By definition, dens(C) is the infimum of all real numbers r such that mC(n) ≤ M n^r for all n, with M some constant. It will therefore be enough to show that for each r > dens(C) and each s > r there is a constant K = K(s) with D(ε, C, dP) ≤ K(s) ε^{−s} for every law P and every 0 < ε ≤ 1.
Let m ≤ D(ε, C, dP); then there are A1, · · · , Am ∈ C satisfying dP(Ai, Aj) = P(Ai∆Aj) > ε for all 1 ≤ i < j ≤ m. We show that mC(n) ≥ m for n large enough.
As usual the Xi denote the coordinates on the countable product space (X^∞, A^∞, P^∞), each i.i.d. of law P. Let Pn be the empirical measure. Consider, for n = 1, 2, · · · :

Pr{for some i ≠ j, Xk ∉ Ai∆Aj for every k ≤ n}
= Pr( ∪_{1≤i<j≤m} ∩_{k=1}^n {Xk ∉ Ai∆Aj} )
≤ Σ_{1≤i<j≤m} Pr( ∩_{k=1}^n {Xk ∉ Ai∆Aj} )
= Σ_{1≤i<j≤m} Π_{k=1}^n Pr{Xk ∉ Ai∆Aj} = Σ_{1≤i<j≤m} (1 − P(Ai∆Aj))^n
≤ Σ_{1≤i<j≤m} (1 − ε)^n = (m(m − 1)/2) (1 − ε)^n.

Now an easy calculation shows that m(m − 1) 2^{−1} (1 − ε)^n is strictly smaller than 1 if

n > − ln( m(m − 1)/2 ) / ln(1 − ε).

For such n there is a strictly positive probability that for all i ≠ j

Pn(Ai∆Aj) = (1/n) Σ_{k=1}^n I_{Ai∆Aj}(Xk) > 0.

It is now easy to see that mC(n) ≥ m, since then the traces of A1, · · · , Am on {X1, · · · , Xn} are pairwise distinct.
Let r > dens(C); then for such r there is an M = M(r, C) < +∞ for which mC(n) ≤ M n^r for all n. Remark that − ln(1 − ε) ≥ ε, since by Taylor's formula, for x around 0, ln(1 + x) ≤ x. Then m ≤ mC(n) ≤ M n^r for n large enough, in particular for

n := 2 ln(m²) ε^{−1} > − ln( m(m − 1)/2 ) / ln(1 − ε).
If m ≥ 2, then m ≤ M (2 ln(m²))^r ε^{−r}, which can be written as

m (ln m)^{−r} ≤ M1 ε^{−r}

for M1 := 4^r M. Let δ > 0; then (ln m)^r ≤ C m^δ for a suitable constant C, because ln(x) grows, from a certain point on, slower than any x^{1/n} (by de l'Hôpital's rule). More precisely,

ln(exp(n²)) = n² < e^n = (exp(n²))^{1/n},

and in fact ln(x) < n² x^{1/n} for all x > 0, so C can be taken equal to n². Hence, with n := ⌈r/δ⌉ + 1 (so that r/n < δ),

ln(m) < n² m^{1/n} and (ln m)^r < n^{2r} m^{r/n} ≤ n^{2r} m^δ,

and this holds for all m ≥ 2. It follows that

m^{1−δ} < m (ln m)^{−r} n^{2r} ≤ n^{2r} M1 ε^{−r},

and thus

m ≤ (n^{2r} M1)^{1/(1−δ)} ε^{−r/(1−δ)} =: M2(r, C, δ) ε^{−r/(1−δ)},

with M2(r, C, δ) < +∞ for every δ > 0. Since m ≤ D(ε, C, dP) was arbitrary (and the bound is trivial for m ≤ 1), we obtain D(ε, C, dP) ≤ M2 ε^{−r/(1−δ)} for every law P and every 0 < ε ≤ 1, so that s(C) ≤ r/(1 − δ). Letting r ↓ dens(C) and δ ↓ 0 gives s(C) ≤ dens(C), i.e. dens(C) ≥ s(C).
Remark. Actually dens(C) and s(C) are equal, but this inequality is sufficient for our purposes later on in chapter 5. For a complete proof, we refer to [Dud1] theorem 4.6.1 on pages 156–157.
Definition 3.2.4. Let X be a set and F a class of real–valued functions on X. For f ∈ F, the subgraph of f is the subset

{(x, t) ∈ X × R : 0 ≤ t ≤ f(x) if f(x) > 0, or f(x) ≤ t ≤ 0 if f(x) < 0}

of X × R. If D is a class of subsets of X × R and if for all f ∈ F the subgraph of f lies in D, then F will be called a subgraph class. If D is a VC class, then F is called a VC subgraph class.
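As a concrete illustration of definition 3.2.4, the following sketch tests whether a point (x, t) belongs to the subgraph of f, following the definition as stated above; the particular function f(x) = x² − 1 is an arbitrary choice.

```python
def in_subgraph(f, x, t):
    """(x, t) lies in the subgraph of f:
    0 <= t <= f(x) when f(x) > 0, or f(x) <= t <= 0 when f(x) < 0."""
    y = f(x)
    if y > 0:
        return 0 <= t <= y
    if y < 0:
        return y <= t <= 0
    return False  # f(x) = 0: neither clause of the definition applies

f = lambda x: x * x - 1
print(in_subgraph(f, 2.0, 1.5))    # True:  0 <= 1.5 <= 3
print(in_subgraph(f, 0.0, -0.5))   # True:  -1 <= -0.5 <= 0
print(in_subgraph(f, 2.0, -0.5))   # False: f(2) = 3 > 0 but t < 0
```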
Like the previous theorem, i.e. theorem 3.2.1, the next theorem provides a uniform bound on the packing number, valid for any probability measure Q.
Theorem 3.2.2. Let 1 ≤ p < +∞. Let (X, A, Q) be a probability space and F a VC subgraph class of measurable real–valued functions on X. Let F ∈ L^p(X, A, Q) be an envelope function for F such that ∫ F dQ > 0 and F ≥ 1. Let C be the collection of all the subgraphs of functions in F. Then for any W > S(C) there is an A := A(W, S(C)), A < +∞, such that

D_F^{(p)}(ε, F, Q) ≤ A (2^{p−1}/ε^p)^W for 0 < ε ≤ 1.
Proof. Let ε ∈ ]0, 1[, and let m be maximal and f1, · · · , fm be such that

∫ |fi − fj|^p dQ > ε^p ∫ F^p dQ for 1 ≤ i < j ≤ m,

which can be done since D_F^{(p)}(ε, F, Q) is finite.
Suppose p = 1. For any B ∈ A let

(F Q)(B) := ∫_B F dQ and Q_F := (F Q)/Q(F),

where Q(F) = ∫ F dQ; then Q_F is a probability measure on A. Furthermore let k = k(m, ε) be the smallest integer such that

exp(kε/2) > \binom{m}{2};    (3.2.1)

then k ≤ 1 + (4 ln(m))/ε. Let X1, · · · , Xk be i.i.d. with law Q_F. Given Xi, define Yi as a random variable uniformly distributed on [−F(Xi), F(Xi)], such that the vectors (Xi, Yi), 1 ≤ i ≤ k, are independent. Denote the subgraph of fj by Cj, 1 ≤ j ≤ m. For all i and j ≠ s:

Pr((Xi, Yi) ∈ Cj∆Cs) = ∫ (|fj(Xi) − fs(Xi)| / (2F(Xi))) dQ_F(Xi),

since, conditionally on Xi = xi, Yi is uniformly distributed over [−F(xi), F(xi)] and

λ({t ∈ [−F(xi), F(xi)] : (xi, t) ∈ Cj∆Cs}) = |fj(xi) − fs(xi)|.

Indeed, let M(xi) := max(fj(xi), fs(xi)) and m(xi) := min(fj(xi), fs(xi)); in each case (both values positive, both negative, or of opposite signs) the set above is an interval, possibly minus the point 0, of length M(xi) − m(xi) = |fj(xi) − fs(xi)|.
We continue:

∫ (|fj(Xi) − fs(Xi)| / (2F(Xi))) dQ_F(Xi)
= ∫ (|fj(Xi) − fs(Xi)| / (2F(Xi))) (F(Xi)/Q(F)) dQ(Xi)
= ∫ |fj(Xi) − fs(Xi)| dQ(Xi) / (2Q(F)) > ε/2.

The last inequality is valid since p = 1, m = D_F^{(1)}(ε, F, Q) and the fi were chosen such that ∫ |fi − fj| dQ > ε Q(F).
Let Ajsk be the event that (Xi, Yi) ∉ Cj∆Cs for all 1 ≤ i ≤ k. By independence of the vectors (Xi, Yi) and by the previous inequality:

Pr(Ajsk) ≤ (1 − ε/2)^k ≤ exp(−kε/2)

and

Pr(∪_{j≠s} Ajsk) ≤ \binom{m}{2} exp(−kε/2) < 1,

because k was chosen as the smallest positive integer such that equation (3.2.1) is valid. So with positive probability, for every j ≠ s there exists an i such that (Xi, Yi) ∈ Cj∆Cs. This means nothing more than that the traces of the sets Cj on {(X1, Y1), · · · , (Xk, Yk)} are all different. Hence mC(k) ≥ m, since otherwise we would have fewer than m different sets (as in the proof of theorem 3.2.1).
Let S := S(C). By Sauer's lemma (3.1.2) and the Vapnik–Červonenkis proposition (3.1.3), mC(k) ≤ 1.5 k^S / S! for all k ≥ S + 2. Hence for some constant C depending only on S, mC(k) ≤ C k^S. So

m ≤ mC(k) ≤ C k^S ≤ C (1 + 4 ln(m)/ε)^S.

As in the proof of theorem 3.2.1, for any α > 0 there is an m0 such that for all m ≥ m0: 1 + 4 ln(m) ≤ m^α, and then m^{1−αS} ≤ C ε^{−S}. We choose α > 0 small enough such that αS < 1; this implies m ≤ C1 ε^{−S/(1−αS)} for all m ≥ m0, with C1 some constant. For any W > S, e.g. W = S/(1 − αS), we can solve for α = (W − S)/(WS). For such α we certainly have αS < 1, and since ε < 1, ε^{−W} > 1, so m ≤ A ε^{−W}, where A := max(C1, m0) and m0 depends only on W and S (through the choice of α), so that A is a function of W and S only.
The theorem is true for p = 1.
Consider the case 1 < p < +∞. Let Q_{F,p} := F^{p−1} Q / Q(F^{p−1}). Then for m ≤ D_F^{(p)}(ε, F, Q) and f1, · · · , fm satisfying ‖fi − fj‖^p_{L^p(Q)} > ε^p Q(F^p) for i ≠ j,

ε^p ∫ F^p dQ < ∫ |fi − fj|^p dQ ≤ ∫ |fi − fj| (2F)^{p−1} dQ    (3.2.2)
= ∫ |fi − fj| dQ_{2F,p} · Q((2F)^{p−1}) = ∫ |fi − fj| dQ_{F,p} · Q((2F)^{p−1})

(recall that we assumed F ≥ 1). The last equation is valid because Q_{2F,p} = Q_{F,p}. Let

δ := ε^p Q(F^p) / ( Q_{F,p}(F) Q((2F)^{p−1}) );

plugging that value of δ into equation (3.2.2):

δ Q_{F,p}(F) = ε^p Q(F^p) / Q((2F)^{p−1}) < ∫ |fi − fj| dQ_{F,p}.

By the case p = 1 we conclude that the following inequalities hold:

D_F^{(p)}(ε, F, Q) ≤ D_F^{(1)}(δ, F, Q_{F,p}) ≤ A δ^{−W}.

We simplify our expression for δ:

δ = ε^p Q(F^p) / ( Q_{F,p}(F) Q((2F)^{p−1}) ) = ε^p Q(F^p) / ( (Q(F^p)/Q(F^{p−1})) · 2^{p−1} Q(F^{p−1}) ) = ε^p / 2^{p−1}.

As claimed,

D_F^{(p)}(ε, F, Q) ≤ A δ^{−W} = A (2^{p−1}/ε^p)^W.
Chapter 4
On measurability.

As we mentioned earlier, the empirical measure and process, and even the supremum over a possibly uncountable class of functions, need not be measurable. Here we provide a way to ensure measurability, or some more general measurability conditions, like universal measurability.
4.1 Admissibility.

Let X be a set and F a family of real–valued functions on X that are measurable for some σ–algebra A on X. Consider the natural evaluation map

F × X → R : (f, x) 7→ f(x).

In general there need not be any σ–algebra on F such that this evaluation map is jointly measurable.
Definition 4.1.1. Let (X , B) be a measurable space. Then (X , B) will be called
separable if B is generated by some countable family C ⊂ B and B contains all
singletons {x}, x ∈ X . From now on, a space (X , B) will always be assumed to
be a separable measurable space!
Let F be a family of real–valued functions on X . Then F is said to be admissible iff there is a σ–algebra T on F such that the evaluation map
F × X → R : (f, x) 7→ f (x).
is T ⊗ B, R–measurable, where R is the Borel σ–algebra on R. If this condition,
namely joint measurability, is satisfied, T will be called an admissible structure
for F.
F will be called image admissible via (Y, S, T) if (Y, S) is a measurable space and T is a function from Y onto F such that the map

Y × X → R : (y, x) 7→ T(y)(x)

is jointly measurable. These last two definitions can also be applied to classes of sets C by letting F := {I_C : C ∈ C}.
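A minimal concrete instance of image admissibility (illustrative only, not taken from the text): for the class C = {]−∞, y] : y ∈ [0, 1]} take (Y, S) = ([0, 1], B([0, 1])) and T(y) := I_{]−∞,y]}. The map (y, x) 7→ T(y)(x) = 1{x ≤ y} is jointly measurable, since {(y, x) : x ≤ y} is a closed, hence Borel, subset of the plane.

```python
def T(y):
    """T(y) = indicator of the half-line (-inf, y]; T maps Y = [0, 1] onto C."""
    return lambda x: 1.0 if x <= y else 0.0

def evaluation(y, x):
    """The jointly measurable map (y, x) -> T(y)(x) = 1{x <= y}."""
    return T(y)(x)

print(evaluation(0.5, 0.3), evaluation(0.5, 0.7))   # 1.0 0.0
```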
Remark. Note that every separable metric space equipped with its Borel σ–algebra is a separable measurable space in the sense of the above definition. Metrizability implies the Hausdorff property, i.e. singletons are closed, hence Borel. Secondly, for metric spaces separability is equivalent to second countability, i.e. the topology having a countable base; the σ–algebra generated by that countable base then equals the Borel σ–algebra.
If one considers general (Hausdorff) separable topological spaces, not necessarily metrizable and hence not necessarily second countable, then the space with its Borel σ–algebra can fail to be a separable measurable space. Indeed, consider e.g. R^R with the product topology, i.e. all real–valued functions on R with the topology of pointwise convergence. It is well known that this space is not metrizable, but is separable and Hausdorff (property A.1.8 for the non–metrizability, the remark in the proof of property A.1.9 for the separability, and [Will] chapter 5 theorem 13.8(b) on page 87 for the Hausdorff property and chapter 5 theorem 16.2(c) on page 108 for the non–existence of a countable base). Hence R^R with its Borel σ–algebra is not a separable measurable space. Indeed, as in the case of the Borel σ–algebra on R, one shows that the cardinality of a countably generated (Borel) σ–algebra is bounded by the cardinality of the continuum (c). On the other hand, the Borel σ–algebra of R^R is at least as large as the space itself, since the singletons are all closed sets; so its cardinality is at least that of R^R, which is strictly greater than c. For the second claim, recall that Card(R^R) = Card(P(R)) > Card(R) = c. The first claim follows from an argument using transfinite induction. As later on, before theorem 4.1.2, one defines a Borel hierarchy as follows. Let (W, ≤) be an uncountable well–ordered set such that every segment is at most countable. Then set

A0 := a countable class of Borel sets generating the σ–algebra;
Ax := the class formed out of all differences and (at most) countable unions of elements of Ay, for any x with finite segment and y := max{a : a < x};

and finally, for x with infinite segment,

Ax := the class formed out of all differences and (at most) countable unions of elements of ∪_{t<x} At.

One proves that B = ∪_{w∈(W,≤)} Aw, and notes that each class Aw has cardinality at most c, hence

|B| = | ∪_{w∈(W,≤)} Aw | ≤ Σ_{w∈(W,≤)} |Aw| ≤ Σ_{w∈(W,≤)} c = c · c = c.

We refer the interested reader to [Jech] chapter 1 section 6 for cardinal arithmetic and chapter 7 section 39 for a description from the inside of (Borel) σ–algebras.
Example 4.1.1. Let us first consider an example of such X and F. Take (K, d) a compact metric space, and F some set of continuous real–valued functions on K such that F is compact for the supremum norm on C(K). Then, by the Arzelà–Ascoli theorem (theorem A.1.18), the functions in F are uniformly equicontinuous on K (definition A.1.10), i.e. for all ε > 0 there is a δ > 0 such that for all x, y ∈ K and f ∈ F one has

|f(x) − f(y)| < ε whenever d(x, y) < δ.

It follows that the evaluation map is jointly continuous. Indeed, for a neighbourhood B_R(f(x), ε) of f(x), the set B_{(C(K),d∞)}(f, ε/2) × B_{(K,d)}(x, δ), with δ chosen such that

d(x, y) < δ implies |g(x) − g(y)| < ε/2 for all g ∈ F,

is a neighbourhood of (f, x) that is sent by the evaluation map into B_R(f(x), ε). Indeed, let (h, z) ∈ B_{(C(K),d∞)}(f, ε/2) × B_{(K,d)}(x, δ); then

|eval(f, x) − eval(h, z)| ≤ |f(x) − f(z)| + |f(z) − h(z)| < ε,

where the first term is small by uniform (equi)continuity of f, and the second by definition of the supremum metric.
Now K, which plays the role of X, was chosen to be a compact set; in particular K is separable. (C(K), d∞) is a metric space and by the Stone–Weierstrass theorem A.1.20 it is separable too. F endowed with the relative topology is separable as a subspace of a separable metric space. By A.2.6 the Borel σ–algebra of the product topology is the same as the product σ–algebra generated by the product of the two Borel σ–algebras. Therefore the evaluation map is jointly measurable and F is admissible for the Borel σ–algebra generated by the relative topology of uniform convergence on F.
Theorem 4.1.1. For any separable space (X , B), there is a subset Y of I := [0, 1]
and a 1–1 function M from X onto Y which is a measurable isomorphism for the
Borel σ–algebra on Y .
Proof. The Borel sets of Y are the intersections of Borel sets of I with Y. Let A := {Aj}j≥1 be a countable set of generators for B. Consider the map

f : X → 2^∞ : x 7→ {I_{Aj}(x)}j≥1,

with the discrete topology on {0, 1} and 2^∞ := {0, 1}^N equipped with the product topology (see definition A.1.6). Then f is a 1–1 function onto its range f(X) = Z, see lemma A.2.7.
Moreover, f is a measurable isomorphism of X onto Z. First consider Cj := πj^{−1}{1} and Dj := πj^{−1}{0}. Such sets form a subbase for the product topology on 2^∞, see definition A.1.6. The space is separable and metrizable (propositions A.1.9 and A.1.8), hence second–countable (proposition A.1.3); therefore open sets are countable unions of finite intersections of those subbase elements, so the open sets lie in the σ–algebra generated by the subbase, and that σ–algebra equals the Borel σ–algebra of 2^∞. Now f^{−1}(Cj) = Aj and f^{−1}(Dj) = Aj^c, which are measurable. Conversely, to prove the measurability of f^{−1} : Z → X it is enough to look at the generating sets Aj, and because f is a bijection (it preserves unions and intersections):

f(Aj) = { {xn}n≥1 ∈ Z : xj = 1 } = πj^{−1}({1}) ∩ Z = Cj ∩ Z,

where πj : Π_{m≥1} {0, 1} → {0, 1} is the j–th coordinate projection. Cj ∩ Z is a Borel set of the relative σ–algebra on Z, because Cj is a Borel (indeed closed) subset of 2^∞. Next we look at the map

g : 2^∞ → I : {zj}j≥1 7→ Σ_{j=1}^∞ 2zj / 3^j.
Then g(2^∞) is the Cantor set. The Cantor set can also be described recursively as a countable intersection of closed sets (see e.g. [Will] chapter 6 section 17). Let C1 = [0, 1/3] ∪ [2/3, 1] and let Cn be the subset of I obtained by removing the open middle third of each of the intervals [j/3^{n−1}, (j + 1)/3^{n−1}], j = 0, · · · , 3^{n−1} − 1, that make up C_{n−1}. Then C := g(2^∞) = ∩n Cn. Using that representation, g is seen to be 1–1: if x ≠ y, let n ∈ N be the first index with xn ≠ yn; then in the n–th step x lies in the first or third part of such an interval [j/3^{n−1}, (j + 1)/3^{n−1}], and y in the other one.
If we equip 2^∞ with the product topology (definition A.1.6) then it is metrizable (lemma A.1.8), and one may take the following metric:

e(x, y) := Σ_{j=1}^∞ d(xj, yj)/2^j,

where d denotes the discrete metric on {0, 1}. Now it is easy to see that g is Lipschitz continuous and thus continuous:

|g(x) − g(y)| = | Σ_{j=1}^∞ 2xj/3^j − Σ_{j=1}^∞ 2yj/3^j | = | Σ_{j=1}^∞ 2(xj − yj)/3^j |
≤ Σ_{j=1}^∞ 2|xj − yj|/3^j ≤ 2 Σ_{j=1}^∞ |xj − yj|/2^j = 2 e(x, y).
By Tychonoff's theorem A.1.16, 2^∞ is a compact topological space, I is a Hausdorff topological space, and g is a continuous bijection onto its image; it is then a well known fact that g is a homeomorphism onto g(2^∞) (theorem A.1.7). Because the product σ–algebra and the Borel σ–algebra on 2^∞ are the same, g is easily seen to be a measurable isomorphism between 2^∞ and g(2^∞). So the restriction of g to Z is a measurable isomorphism onto its range Y. The composition g ◦ f, called the Marczewski function,

M(x) := Σ_{j=1}^∞ 2 I_{Aj}(x) / 3^j,

is 1–1 from X onto Y ⊂ I := [0, 1] and is also a measurable isomorphism onto Y.
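A small sketch of the Marczewski function for one concrete countable generating family (here Aj := [0, 1/j] on X = [0, 1], an arbitrary choice made only for illustration): truncating the series at finitely many terms already shows how M sends each point into the Cantor set, the j–th ternary digit recording whether x ∈ Aj.

```python
def marczewski(x, sets, terms=30):
    """M(x) = sum_j 2 * 1_{A_j}(x) / 3^j, truncated to the first `terms` sets."""
    return sum(2.0 * (1.0 if indicator(x) else 0.0) / 3 ** j
               for j, indicator in zip(range(1, terms + 1), sets))

# generating family A_j = [0, 1/j] on X = [0, 1]
sets = [(lambda j: (lambda x: 0.0 <= x <= 1.0 / j))(j) for j in range(1, 31)]

print(marczewski(0.0, sets))   # x in every A_j: all ternary digits are 2, so M(x) is close to 1
print(marczewski(1.0, sets))   # x only in A_1: M(x) = 2/3
print(marczewski(0.3, sets))   # x in A_1, A_2, A_3 only: 2/3 + 2/9 + 2/27
```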
Let (X, B) be a separable space, where B is generated by A1, A2, · · · . By taking the union of the (finite) algebras generated by A1, A2, · · · , An, we can and do take A := {Aj}j≥1 to be an algebra.
Indeed, let Gn be the algebra generated by Ai, 1 ≤ i ≤ n. Then by proposition A.2.1, Gn is finite. Let A := ∪n Gn; then A is at most countable, and note that Gn ⊂ Gn+1, so that A is an algebra: it trivially contains X, it contains complements, and it contains finite unions, since any finite number of sets in A already lies in a single algebra Gk. Remark that the same does not hold for σ–algebras; consider on ]0, 1] the finite σ–algebras generated by (refined) partitions of ]0, 1]:

An := σ({ ]j/2^n, (j + 1)/2^n] : 0 ≤ j ≤ 2^n − 1 });

none of the An contains {1}, but {1} = ∩n ](2^n − 1)/2^n, 1], and the latter is contained in σ(∪n An).
Let F0 be the class of all finite sums Σ_{j=1}^n qj I_{Aj} with rational qj and n = 1, 2, · · · . Now we are going to define the Borel or Banach classes. Here we will need the existence of a well–ordered set with certain properties. Recall that a well–ordered set is a pair (X, ≤), where X is a set and ≤ is a partial order on X, i.e. ≤ is reflexive (a ≤ a), antisymmetric (if a ≤ b and b ≤ a, then a = b) and transitive (if a ≤ b and b ≤ c, then a ≤ c), for all a, b, c ∈ X. Moreover, that partial order must be linear, i.e. a < b or b < a for all different a, b ∈ X, and such that every non–empty subset of X has a smallest element.
Let W be uncountable and such that for every a ∈ W the segment Wa, i.e. {x ∈ W : x ≤ a, x ≠ a} = {x ∈ W : x < a}, is at most countable. Such a set exists; one can e.g. take Ω, the first uncountable ordinal, with its usual order. For more information about ordinals we refer the (interested) reader to the excellent book on Real and Abstract Analysis by Hewitt and Stromberg, chapter 1 section 4.
Denote by 0 the smallest element of (W, ≤), which exists since W is well–ordered. We proceed by transfinite induction to define Fv for all v ∈ W. F0 is already defined; supposing Fv is defined for all v < w, let Fw be the class of all functions f that are the everywhere pointwise limit of a sequence of functions from ∪_{v<w} Fv. Let U := ∪_{v∈W} Fv.
Theorem 4.1.2. U is the set of all measurable real functions on X, with (X, B) a separable space.

Proof. By construction every function in U is measurable, since the functions in F0 are simple and pointwise limits of measurable functions are measurable. (For the converse inclusion, recall that every measurable function can be written as the everywhere pointwise limit of simple functions, see theorem A.2.12; this will be used in Step 4 below.)
Conversely, let
C := {B ∈ B : IB ∈ U }.
Step 1. By definition of F0, C contains A, the generating algebra. We show that C is a monotone class (definition A.2.4).
Let Bn ∈ C for n ≥ 1; then I_{Bn} ∈ U for every n, and thus I_{Bn} ∈ F_{vn} for some vn ∈ (W, ≤). Denote the segment of vn (i.e. all x ∈ W with x < vn) by W_{vn}; then W_{vn} ∪ {vn} is an at most countable subset of W (by definition of (W, ≤)), and hence also ∪n (W_{vn} ∪ {vn}) is at most countable. (W, ≤) was chosen to be an uncountable set such that every segment is at most countable, so

V := W \ ∪n (W_{vn} ∪ {vn}) ≠ ∅.

Choose w ∈ V and a further element w0 ∈ (W, ≤) with w < w0, which can also be found since W \ (Ww ∪ {w}) is non–empty ((W, ≤) being uncountable, the segment of w being at most countable). Then clearly I_{Bn} ∈ Fw ⊂ F_{w0}. If there is an everywhere limit for {I_{Bn}}n≥1, then it necessarily belongs to F_{w0}, and hence to U.

• Suppose that Bn ↑ B for some B ∈ B. Let x ∈ B; then x ∈ ∪n Bn, so x ∈ B_{k0} for some k0. From Bn ⊂ Bn+j one gets I_{Bm}(x) = 1 = I_B(x) for all m ≥ k0. And if x ∉ B, then x ∉ Bn for any n, implying I_{Bn}(x) = 0 = I_B(x). So in any case I_{Bn}(x) → I_B(x) as n → ∞, for all x ∈ X.

• Suppose Bn ↓ B. For x ∈ B, x lies in every Bn and I_{Bn}(x) = 1 = I_B(x). On the other hand, if x ∉ B, then x ∉ B_{k0} for some k0; since Bm ⊃ Bm+p for all p ≥ 1, x ∉ Bm for all m ≥ k0. Thus in both cases I_{Bn}(x) → I_B(x) as n → ∞.

We deduce that for Bn ↑ B or Bn ↓ B, I_{Bn} → I_B everywhere, and thus I_B ∈ F_{w0}, by definition of F_{w0}, and thus I_B ∈ U. This ensures that C is a monotone class, and since it contains A, the algebra generating B, by the Monotone Class Theorem (A.2.10), C = B.
Step 2.1. In the next step we fix real constants a, b and a set A ∈ A. Let

C_A^{(a,b)} := {B ∈ B : a I_A + b I_B ∈ U}.

That A ⊂ C_A^{(a,b)} is easy to see: indeed, for E ∈ A consider an I_A + bn I_E, where an, bn are rationals converging to a and b, and note that an I_A + bn I_E → a I_A + b I_E and an I_A + bn I_E ∈ F0 for all n. As in step 1 we prove that C_A^{(a,b)} is a monotone class, and as such it must equal B. Let Bn ∈ C_A^{(a,b)} for all n ≥ 1.

• Suppose Bn ↑ B; as in step 1, I_{Bn} ↑ I_B pointwise on X, hence a I_A + b I_{Bn} → a I_A + b I_B everywhere.

• Suppose Bn ↓ B; as in step 1, I_{Bn} ↓ I_B pointwise on X, hence a I_A + b I_{Bn} → a I_A + b I_B everywhere.

In both cases a I_A + b I_{Bn} and a I_A + b I_B lie in F_{w0} for some w0 ∈ W, by an argument as in step 1. So for all a, b ∈ R and all A ∈ A, by the monotone class theorem (A.2.10), C_A^{(a,b)} = B.
Step 2.2. We continue by letting

C_C^{(c,d)} := {D ∈ B : c I_C + d I_D ∈ U}, with c, d ∈ R and C ∈ B.

We claim C_C^{(c,d)} ⊃ A. Indeed, by step 2.1, for any A ∈ A we have C_A^{(d,c)} = B, which means d I_A + c I_C ∈ U for any C ∈ B. This is the same as saying c I_C + d I_A ∈ U, which is equivalent to C_C^{(c,d)} ⊃ A for any c, d ∈ R and C ∈ B.
Now, by arguments as in step 1 (using the monotone class theorem), it is not hard to see that for any c, d ∈ R and C ∈ B, C_C^{(c,d)} = B.
As a temporary conclusion: e I_E + f I_F ∈ U for any e, f ∈ R and E, F ∈ B.
Step 3. Then we go to classes of the form

C_{A1,A2}^{(c1,c2,c)} := {C ∈ B : c1 I_{A1} + c2 I_{A2} + c I_C ∈ U},

for c1, c2, c arbitrary fixed real constants and A1, A2 fixed elements of A. C_{A1,A2}^{(c1,c2,c)} will turn out to be a monotone class, and since A ⊂ C_{A1,A2}^{(c1,c2,c)}, it will equal B. Letting

C_{B1,B2}^{(c1,c2,c)} := {C ∈ B : c1 I_{B1} + c2 I_{B2} + c I_C ∈ U},

we have, again by similar arguments as above, that C_{B1,B2}^{(c1,c2,c)} contains A and is a monotone class, so it equals B. One proceeds by first showing that C_{A,B}^{(c1,c2,c)} and C_{B,A}^{(c1,c2,c)} equal B, for any c1, c2, c ∈ R, A ∈ A and B ∈ B.
It is clear that we can repeat those steps inductively for any n real constants and n elements of B. In particular Σ_{i=1}^n ci I_{Bi} lies in U.
Step 4. Let f be a measurable function. Decompose f as f^+ − f^−, where f^+ := max(f, 0) and f^− := max(−f, 0). Each term is a measurable function and is the everywhere limit of simple functions (theorem A.2.12). So one has two sequences of simple functions, which can be combined into one sequence of simple functions converging to f everywhere. We now argue as in step 1 to obtain f ∈ U. Each function fn of the combined sequence is simple, and by our previous discussion (steps 1 to 3) belongs to U, so to some F_{vn}. Each segment W_{vn} together with {vn} is at most countable, and a countable union of at most countable sets remains at most countable. Since W is uncountable, there is some element w ∈ (W, ≤) with vn < w for all n. Then all fn are contained in Fw. Since Ww ∪ {w} is again at most countable, there is a w0 ∈ (W, ≤) with w < w0; then f ∈ F_{w0} as an everywhere limit of the fn.
The following main theorem provides useful characterizations of the admissibility condition of a class of (real–valued measurable) functions.
Theorem 4.1.3 (Aumann). Let I := [0, 1] with its usual Borel σ–algebra. Given
a separable measurable space (X , B) and a class F of measurable real–valued
functions on X , the following are equivalent:
i) F ⊂ Fw for some w ∈ (W, ≤);
ii) there is a jointly measurable function G : I × X → R such that for each
f ∈ F, f = G(t, ·) for some t ∈ I;
iii) there is a separable admissible structure for F;
iv) F is admissible;
v) 2^F is an admissible structure for F;
vi) F is image admissible via some (Y, S, T ).
Proof. The implication from (iii) to (iv) is trivial, as is the one from (iv) to (v), and (v) to (iv) also follows immediately. From (iv) to (iii): every real–valued measurable function H is measurable w.r.t. some countably generated sub–σ–algebra (e.g. the one generated by the sets {H > q} for rational q). In particular the evaluation map is measurable for a countably generated sub–σ–algebra. Let T be an admissible structure for F; we have

eval : (F × X, T ⊗ B) → (R, B(R)) : (f, x) 7→ eval((f, x)) := f(x),

and we set C := σ({ {eval > q} : q ∈ Q }) ⊂ T ⊗ B, so C is countably generated and eval is C–measurable. Each Aq := {eval > q} lies in T ⊗ B, which is generated by the measurable rectangles; by lemma A.2.5, each Aq already lies in the σ–algebra generated by (at most) countably many measurable rectangles T_i^{(q)} × B_i^{(q)}, i ≥ 1. But then C is also generated by at most countably many measurable rectangles, namely T_i^{(q)} × B_i^{(q)}, q ∈ Q and i ≥ 1. Let

D := σ({T_i^{(q)} : q ∈ Q, i ≥ 1}) =: σ({Rj : j ≥ 1}),

where we have renumbered the sets T_i^{(q)}.
It suffices now to show that {f} ∈ D for all f ∈ F, since then D would be separable, i.e. countably generated and containing the singletons. Let f ∈ F; we claim that

{f} = ( ∩_{j∈N1} Rj ) ∩ ( ∩_{k∈N2} (Rk)^c ),

where j ∈ N1 iff f ∈ Rj and k ∈ N2 iff f ∉ Rk. That {f} is contained in that intersection is trivial. Let g ≠ f; then f(x) ≠ g(x) for some x ∈ X. The function eval(·, x) : F → R : h 7→ h(x) is D–measurable, as seen in the proof of the Tonelli–Fubini theorem A.2.17. Let |f(x) − g(x)| = ε > 0; then g ∉ {h ∈ F : |f(x) − h(x)| < ε/2}, and the latter set lies in D. Suppose now that g were also contained in (∩_{j∈N1} Rj) ∩ (∩_{k∈N2} (Rk)^c); we derive a contradiction. As in the proof of lemma A.2.7, since f and g lie in Rj together or in Rj^c together for every j, we have that f ∈ C iff g ∈ C for all C ∈ D, a contradiction with the set {h ∈ F : |f(x) − h(x)| < ε/2} lying in D. As hoped, D separates the elements of F, hence D is separable.
Thus (iii) ⇐⇒ (iv) ⇐⇒ (v).
The equivalence between (ii) and (iii). First assume (ii); by AC, choose for each f ∈ F a unique t ∈ I such that f = G(t, ·), and let J be the set of all those t's. The restriction of G to J × X remains jointly measurable for J ⊓ B(I), the relative Borel σ–algebra on J, and B, since it is the composition of (i, id_X) : J × X → I × X and G, where i : J → I is the natural injection; the function (i, id_X) is measurable since both components are measurable. Transporting the (separable) relative Borel σ–algebra of J to F via the bijection t 7→ G(t, ·) then gives a separable admissible structure for F.
For the converse implication we use theorem 4.1.1 to find a measurable isomorphism Φ from a subset Y of I, with its Borel σ–algebra, onto F; each f ∈ F is then of the form Φ(t)(·) for some t ∈ Y. Then by admissibility

G := eval ◦ (Φ, id_X) : Y × X → R

is measurable, as it is the composition of the measurable functions: the evaluation map F × X → R and the map (Φ, id_X) : Y × X → F × X (consider the measurable rectangles which generate T ⊗ B). Using theorem 4.2.5 in [Dud1] on page 127, which states that any measurable real–valued function can be extended to the whole domain by a measurable function, we obtain (ii).
Thus (ii) ⇐⇒ (iii).
(iv) implies (vi): let T be an admissible structure for F; in particular (F, T) is a measurable space. Let (Y, S) := (F, T) and let T : Y → F be the identity. Then F is trivially image admissible via that (Y, S, T).
Now the converse: if (vi) holds, there is a function T from (Y, S) onto F such that the map (y, x) 7→ T(y)(x) is (jointly) measurable. By AC, we choose a subset Z of Y such that T restricted to Z is 1–1 and onto F. By S_Z and T_Z we denote the restrictions of S and T to Z. By taking the composition of the measurable maps (i, id_X) : (Z × X, S_Z ⊗ B) → (Y × X, S ⊗ B), where i : Z → Y denotes the natural injection, and Y × X → R, F remains image admissible via (Z, S_Z, T_Z). Because T restricted to Z is a bijection, we can define a σ–algebra on F as follows: T := {T[A] : A ∈ S_Z}, where T[A] := {T(a) : a ∈ A}; with this σ–algebra the evaluation map on F × X is jointly measurable, so F is admissible.
Thus (iv) ⇐⇒ (vi).
Thus (iv) ⇐⇒ (vi)
Finally, it only remains to prove the equivalence between (i) and (ii). Assume
(ii) holds, then (I×X , B(I)⊗B) is a separable space, for if {Ai }i≥1 are Borel sets
on I that generate B(I) and {Bi }i≥1 : σ({Bi : i ∈ N}) = B. And both families of
generators are algebra’s, then {Ak , Bl }k,l≥1 generates the product σ–algebra as
4.1. ADMISSIBILITY.
57
seen in theorem A.2.11. Thus the product σ–algebra
B(I) ⊗ B := σ(A × B) where A × B := {A × B : A ∈ A, B ∈ B}
is countably generated.
Let G be the given jointly measurable map
G : (I × X , B(I) ⊗ B) → R
such that for each f ∈ F : f = G(t, ·) for some t ∈ I. Then, as a real measurable
function, G, by theorem 4.1.2, belongs to U and thus there exists a v ∈ (W, ≤) :
G ∈ Fv . Now the sections G(t, ·) which equal the functions from F are contained
in Fv on X . For the last assertion we will proceed by transfinite induction .
Let v = 0, i.e. v is the smallest element of (W, ≤), then any H ∈ F0 is of the
form:
n
X
H(t, x) :=
cj IAij ×Blj (t, x), with cj ∈ Q,
j=1
and its sections:
H(t, ·) =
n
X
cj IAij (t)IBlj (·)
j=1
for each t ∈ [0, 1] fixed, are trivially in F0 for the algebra {Bj }j≥1 . Suppose
the assertion, i.e. the sections F (t, ·) of any function F (·, ·) ∈ Fu are in Fu for
the algebra {Bj }j≥1 , holds for all u < w, then we have to show if H ∈ Fw , all
its section H(t, ·) lie in Fw too. By our assumption H ∈ Fw , then by definition
of Fw there is a sequence {Hn }n≥1 , with Hn ∈ Fvn , vn < w and such that
Hn (z, y) → H(z, y) as n → ∞ for all (z, y) ∈ I × X . Fix t ∈ I, by our induction
hypothesis: Hn (t, ·) ∈ Fvn for all n, and trivially Hn (t, x) → H(t, x) for all
x ∈ X . Hence H(t, ·) ∈ Fw for every t ∈ I.
Next assume (i) holds; in proving (ii) we will need the concept of a universal class α function. Recall that one possible choice for (W, ≤) is Ω, the first uncountable ordinal. Ordinals and their elements, which are also ordinals, are traditionally denoted by Greek letters such as α, β, γ and so on; this explains the name of such functions. Since Ω and our (W, ≤) are order isomorphic, we may also call such a function a universal class w function.
First we state the definition of such a function: a function G : I × X → R is called a universal class w function iff every function f ∈ Fw on X is of the form G(t, ·) for some t ∈ I. To finish the proof, we use Lebesgue's theorem A.2.13, which asserts that such universal class α or w functions exist.
Thus (i) ⇐⇒ (ii).
Theorem 4.1.4. Let (X, B) be a separable measurable space, 0 < p < +∞ and F ⊂ L^p(X, B, Q), where F is admissible. If F is also image admissible via (Y, S, T), U ⊂ F and U is relatively d_{p,Q}–open in F, then T^{−1}(U) ∈ S.

Proof. (X, B) is separable, thus in particular countably generated. Elements of L^p(X, B, Q) are approximated by simple functions, which in turn are approximated by the (countable) class of functions F0 as defined before Aumann's theorem (4.1.3); thus L^p(X, B, Q) is separable for d_{p,Q}. By lemma A.1.3 a metric space is separable iff it is second–countable, so an open set U in L^p(X, B, Q) can be written as a countable union of open sets, even open balls:

U = ∪_{n≥1} {f : d_{p,Q}(f, gn) < rn} with gn ∈ U and 0 < rn < +∞.

The pseudometric d_{p,Q} depends on the value of p:

d_{p,Q}(f, g) = ( ∫ |f − g|^p dQ )^{1/p} if 1 ≤ p < +∞, and d_{p,Q}(f, g) = ∫ |f − g|^p dQ if 0 < p < 1.

Hence T^{−1}(U) = ∪_{n≥1} T^{−1}({f : d_{p,Q}(f, gn) < rn}). To prove T^{−1}(U) ∈ S it therefore suffices to prove T^{−1}({f : d_{p,Q}(f, g) < r}) ∈ S for any r > 0 and g ∈ L^p(X, B, Q).
For p > 0 the functions x 7→ x^{1/p} are continuous real–valued functions, so measurable. Thus we only need to prove the measurability of the maps

y 7→ ∫ |T(y) − g|^p dQ

for any fixed g. Let g be fixed; because T is jointly measurable, the function

X × Y → R : (x, y) 7→ |T(y)(x) − g(x)|^p

is jointly measurable. So we can reduce the proof to the case p = 1, g ≡ 0 and T(y)(x) ≥ 0 for all x ∈ X, y ∈ Y. By Tonelli–Fubini (theorem A.2.17) the function

Y → R : y 7→ ∫ T(y)(x) dQ(x)

is measurable.
4.2 Suslin image admissibility.

In the previous section we discussed the concept of admissibility for a class F. This was useful in the sense that it provides conditions under which the evaluation function (x, f) 7→ f(x) is jointly measurable. What we would really like is to deduce measurability of ‖ · ‖_F, the supremum over F, of e.g. the empirical measure or process. It turns out that rather easy counterexamples to the measurability of ‖ · ‖_F exist, even when F is admissible. Therefore we need a stronger notion. This notion, called Suslin image admissibility, will assure us that ‖Pn‖_F, or ‖νn‖_F, satisfy some measurability condition. Here we state such an example.
Example 4.2.1. Let (X, B) be (I, B(I)) and let P be Lebesgue measure on I. Furthermore let A be a non–Lebesgue measurable set (such a set exists, under the assumption that the Axiom of Choice (AC) is true). Let C := {{x} : x ∈ A}.

• We claim that C is an admissible class. Indeed, part (ii) of Aumann's theorem (4.1.3) is satisfied for the choice

G : I × X → R : (t, x) 7→ G(t, x) := 1 if t = x, and 0 otherwise,

because G = I_D, where D denotes the diagonal of I × I, which is closed in a Hausdorff space, hence Borel.

• We now claim that ‖P1‖_C is nonmeasurable, where

‖P1‖_C : (X^∞, B^∞, P^∞) → (I, B(I)) : {xn}n≥1 7→ sup_{{a}∈C} |δ_{X1}({xn}n≥1)({a})| = sup_{a∈A} δ_{x1}(a).

One can also see ‖P1‖_C as a function from X into I. Then {‖P1‖_C = 1} = A, since sup_{a∈A} δa(x) = 1 iff x = a for some a ∈ A. Hence ‖P1‖_C is not measurable. In fact ‖Pn‖_C will never be measurable for any n, since we can repeat the previous argument, namely {‖Pn‖_C = 1} = Π_{i=1}^n Ai, where Ai = A for all i, a nonmeasurable set.

It is now clear that admissibility alone is not strong enough to imply measurability of the empirical measure or process.
Definition 4.2.1. If (X, A) is a measurable space and F a set of functions on X, then a function

Ψ : F × X → R, (f, x) 7→ Ψ(f, x),

will be called image admissible Suslin via (Y, S, T) iff (Y, S) is a Suslin measurable space, T a function from Y onto F, and

Ψ ◦ (T, id_X) : (Y × X, S ⊗ B) → (R, B(R)) : (y, x) 7→ Ψ(T(y), x)

is measurable.
60
CHAPTER 4. ON MEASURABILITY.
This main theorem will have as important corollary that the supremum over a
Suslin image admissible class F will be (universally) measurable.
Theorem 4.2.1 (Selection Theorem of Sainte–Beuve). Let (X , B) be any measurable space and Ψ : F × X → R be image admissible Suslin via (Y, S, T ). Then
for any Borel set B ⊂ R,
ΠΨ (B) := {x ∈ X : Ψ(x, f ) ∈ B for some f ∈ F}
= {x ∈ X : (x, f ) ∈ Ψ−1 (B)} = projX (Ψ−1 (B))
is a u.m. set in X , and there is a u.m. function H from ΠΨ (B) into Y such that
Ψ(T (H(x)), x) ∈ B for all x ∈ ΠΨ (B).
Proof. Since F is image admissible Suslin via some (Y, S, T ) there is a Polish
space (Z, Z) and a Borel map g from Z onto Y . So F is also image admissible
Suslin via (Z, σ(Z), T ◦ g). (g, idX ) : Z × X → Y × X is measurable, so the
composition with Ψ remains measurable.
Firstly for any measurable set V in a product σ–algebra, there are countably
many An ⊂ Y and En ⊂ X such that V is contained in the σ–generated by
countably many rectangles of the form An × En , see lemma A.2.5, since the
product σ–algebra is generated by all measurable rectangles. Here in particular
one will consider sets of the form:
V := {(y, x) : Ψ(T (y), x) ∈ B} ⊂ Y × X .
Secondly Ψ is a real–valued map, and recall that the Borel σ–algebra of R
is countably generated (the topology is countably generated hence the σ–algebra
too). Exactly as in the proof of Aumann’s theorem (4.1.3) in the implication from
(iv) to (iii), one can show that Ψ is measurable for an at most countably generated
(sub–)σ–algebra of S ⊗ B, e.g.
σ({Ψ > q} : q ∈ Q).
Combining those two facts just mentioned one has that for any B, one of the
(countably many) generators of B(R) that the set:
{Ψ(T (y), x) ∈ B},
is contained in a countably generated sub–σ–algebra of the product σ–algebra. As
is already said B(R) is countably generated, so one concludes that
(y, x) 7→ Ψ(T (y), x) is measurable for the product σ–algebra of S (on Y ) and
C := σ({An ⊂ X : An ∈ B, n ∈ N}) (on X ) (exactly as in the proof of
Aumann’s theorem, (iv) implies (iii)).
4.2. SUSLIN IMAGE ADMISSIBILITY.
61
By repeating the argument in the proof of theorem 4.1.1 we can define a Marczewski function
X
b : (X , C) → ([0, 1], B(I)) : x 7→ b(x) :=
2IAj (x)/3j
j≥1
for the sequence An . In fact, b will map X in a measurable way in the Cantor set
(for a defintion of the Cantor set, see proof of theorem 4.1.1). Note that b will not
necessarily be a measurable isomorphism this time, because the sub–σ–algebra
generated by An doesn’t need to be separable.
Our next step will be to rewrite Ψ in a different way. So Ψ(T (·), ·) = F (·, b(·)),
for some F : Y × C → R, which is S ⊗ BC /B(R) measurable, by lemma 4.2.2.
The set
{(y, c) ∈ Y × C : F (y, c) ∈ B} = {F ∈ B},
with B a Borel set of R, is contained in S ⊗ BC by measurability of F . Our aim
is to prove that {F ∈ B} is an analytic set (or Suslin space, which is the same).
Note that C as a Borel set of the Polish space [0, 1] is Suslin, by the remark after
definition/theorem C.2.1. By Suslin image admissibility Y is Suslin too. If the
product Y × C is Suslin too, then we are done, since {F ∈ B} as a Borel set of
a Suslin space is Suslin, see the remark after definition/theorem C.2.1. But Polish
spaces are stable under products, lemma A.1.14, so Y ×C is the measurable image
of a Polish space, hence is a Suslin set.
To continue, note that by definition/theorem C.2.1, projections of analytic sets
remain analytic, hence
ΠF (B) := {c ∈ C : for some y ∈ Y, F (y, c) ∈ B} = πC {F −1 (B)}
is Suslin.
Suslin subsets of Polish spaces need not be Borel sets, but they are universally
measurable, i.e. they are contained in the measure–theoretic completion (definition A.2.2 ) of the Borel σ–algebra (of the Polish space) for any measure µ
defined on the Borel σ–algebra, theorem C.3.1 . By another selection theorem
C.4.1, there is a u.m. function h from ΠF (B) into Y such that F (h(c), c) ∈ B for
all c ∈ ΠF (B). Using lemma A.2.4, which states that inverse images of u.m. sets
through measurable functions remain u.m. sets, we conclude that
ΠΨ (B) := b−1 ({c ∈ C : for some y ∈ Y, F (y, c) ∈ B}) = b−1 (ΠF (B))
is a u.m. set of (X , B). Let
H := h ◦ b : (X , B) → (Y, S);
62
CHAPTER 4. ON MEASURABILITY.
then H is u.m. Indeed, for any E ∈ S, consider H −1 (E) = b−1 (h−1 (E)). Now
since h is u.m., by the selection theorem for analytic sets C.4.1, by definition
one gets that h−1 (E) is a u.m. set in ΠF (B)(⊂ (C, B(C))) with relative Borel
σ–algebra. Then, again by lemma A.2.4, b (which is measurable) preserves u.m.
sets, so b−1 (h−1 (E)) is also a u.m. set in (X , B). And as claimed, H is u.m.
We conclude with the following:
Ψ(T (H(x)), x) = Ψ(T (h(b(x))), x) = F (h(b(x)), b(x)) = F (h(c), c) ∈ B
for all x ∈ ΠΨ (B) = b−1 (ΠF (B)), because Ψ(T (·), ·) = F (·, b(·)) and for all
c ∈ C; (h(c), c) ∈ {F ∈ B}.
Lemma 4.2.2. Let (X , C) and (Y, S) be measurable spaces and
φ : (Y × X , S ⊗ C) → (R, B(R))
a measurable function. Then there is a S ⊗ B(C)/B(R) measurable function
F : Y × C → R : (y, c) 7→ F (y, c) , where C is the Cantor set and B(C) its Borel
σ–algebra and such that φ(y, x) = (F ◦ (idY , b))(y, x) for all (y, x) ∈ Y × X . As
in theorem 4.2.1, b : X → C is a Marczewski function..
Proof. We will work in different steps and use the monotone class theorem (A.2.10).
(i) In the first step we consider indicator functions of some measurable rectangle, so let IA×B ; A ∈ S, B ∈ B. Then if we define
F := IA×b−1 (B) ◦ (T, idC ),
Pn
the result holds. For
Pn a finite weighted sum of indicators i=1 ci IAi ×Bi ; we
can take a F as i=1 ci IAi ×b−1 (Bi ) ◦ (T, idC ). Hence the result holds for
finite sums of weighted indicator functions of measurable rectangles.
(ii) In the second step we prove that indicator functions of finite unions of measurable rectangles satisfy the condition of the lemma.
This follows easily from collecting some trivial facts: first of all σ–algebra’s
are in particular semirings, and a (finite) product of semirings is again a
semiring ( [Dud1] proposition 3.2.2 on page 95 for a proof). Hence the
class of measurable rectangles is contained in a semiring.
Next it is also well known that the class S + existing of all finite disjoint
unions of elements of a semiring is a ring, and in fact is even the ring generated by the semiring R(S ⊗ C), i.e. the smallest ring containing the elements of the semiring. (See [Dud1] proposition 3.2.3 on page 96 for a
4.2. SUSLIN IMAGE ADMISSIBILITY.
63
proof.) Since a ring is closed under finite unions, finite unions of measurable rectangles are contained in R(S ⊗ C) and thus also in S + . Elements of
S + are finite unions of mutually disjoint sets, hence a finite union of measurable rectangles has a representation as a finite disjoint union of (other)
measurable rectangles.
If we combine this with the result of the first step, we have that indicator
functions of finite unions of measurable rectangles, i.e. I∪ni=1 Ai ×Bi satisfy
the condition of the lemma.
(iii) In the third step we claim that the class D of all finite unions of measurable
rectangles is an algebra. This follows directly from noting that D = S + by
the previous discussion in (ii). Now S + is a ring and contains the whole
space Y × X , hence is an algebra.
(iv) In the fourth step we show that the lemma is true for the indicator function
of any measurable set. In the third step we had that
D :=
n
n[
(Ai × Bi ) : Ai ∈ S, Bi ∈ C
o
i=1
is an algebra. Consider the class
E := {C ∈ A ⊗ B : IC satisfies the statement of the lemma }.
In the second step we had that indicator functions of elements of D can be
written as an F : Y × C → R, in other words: D ⊂ E. We claim that E is a
monotone class, definition A.2.4.
Let En ↑ E, with En ∈ E, then IEn ↑ IE everywhere, and Fn ↓ F ,
with Fn ∈ F, then IFn ↓ IF everywhere. In both cases we have functions ψn : Y × X → R, such that there exists a Fn : Y × C → R, with
ψn (y, x) = Fn (y, b(x)) and the sequence ψn converges pointwise to some (
S ⊗ C measurable) ψ. Define
F : (Y × C, S ⊗ B(C)) → (R, B(R)
(
limn→∞ Fn (y, c)
: (y, c) 7→ F (y, c) :=
0
whenever the limit exists;
otherwise.
As seen in the proof of theorem 4.2.5 on page 127 the set where the Fn
converge is measurable, hence F is measurable. For any (y, x):
ψ(y, x) = lim ψn (y, x) = lim Fn (y, b(x)) = F (y, b(x))
n→∞
n→∞
64
CHAPTER 4. ON MEASURABILITY.
since the limit clearly exists. So in particular IE , IF ∈ E. So E is a monotone class, that contains an algebra, namely D, which generates product
σ–algebra S ⊗ C (= σ(S × C ) ). By the monotone class theorem ( A.2.10 )
E = S ⊗ C.
(v) Let ψ be any S ⊗ C measurable function, by theorem A.2.12, there exists a
sequence of (measurable) simple functions ψn which converges pointwise
everywhere to ψ. Those ψn are weighted sums of indicators functions, indicator functions satisfy the lemma, by the fourth step, and as in the first step
weighted sums of indicators which satisfy the lemma, also satisfy the condition of the lemma. Then again as in the fourth step, ψ, as the everywhere
limit of the ψn satisfy the lemma.
In the last step we proved that any measurable function satisfy the condition of
the lemma, hence we are done.
An easy corollary will assure that taking the supremum over a class of functions of an image admissible map gives a universally measurable set.
Corollary 4.2.2. Let (X , A) be a measurable space, F set of real–valued measurable functions from X and (f, x) 7→ Ψ(f, x) be real–valued and image admissible Suslin via some (Y, S, T ). Then x 7→ sup{Ψ(f, x) : f ∈ F} and
x 7→ sup{|Ψ(f, x)| : f ∈ F} are u.m. functions.
Proof. The class of all u.m. sets is a σ–algebra, by theorem A.2.3. Then because
σ(f −1 (C)) = f −1 (σ(C)) and the Borel σ–algebra on R is generated by sets of the
form ]t, +∞[ it is enough to prove that sets like
{x ∈ X : sup Ψ(f, x) ∈]t, +∞[}
f ∈F
are universally measurable. This will follow from Saint–Beuve’s Selection Theorem ( 4.2.1 );
{x ∈ X : sup Ψ(f, x) ∈]t, +∞[} = {x ∈ X : Ψ(f, x) > t, for some f ∈ F}
f ∈F
= ΠΨ (]t, +∞[).
The latter set is universally measurable.
Finally we state a theorem which proves that the image admissible property of
finitely many classes Fi is preserved when composing with a measurable function.
4.2. SUSLIN IMAGE ADMISSIBILITY.
65
Theorem 4.2.3. Let Ψi ; i = 1, · · · , d be image admissible Suslin real–valued
functions on X × Fi , for one measurable space (X , A), via some (Yi , Si , Ti ).
Let g : Rd → R be a Borel measurable function, then
(X × F1 × · · · × Fd ) → R : (x, f1 , · · · , fd ) 7→ g(Ψ1 (x, f1 ), · · · , Ψd (x, fd ))
is image admissible Suslin.
Proof. It suffices to note that, by lemma A.1.15, (Y d , ⊗di=1 Si ) is again a Suslin
measurable space and
T : (Y
d
, ⊗di=1 Si )
→
d
Y
Fi : (y1 , · · · , yd ) 7→ (T1 (y1 ), · · · , Td (yd ))
i=1
is onto and
Ψ ◦ T : (x, y1 , · · · , yd ) 7→ g ◦ (Ψ1 (x, T1 (y1 )), · · · , Ψd (x, Td (yd )))
is A ⊗ S, B(R) measurable as a composition of two measurable functions. g is
given to be measurable and the second can be written as a composition of
X × Y d → (X × Y )d
(x, y1 , · · · , yd ) 7→ ((x, y1 ), · · · , (x, yd ))
→R
7→ (Ψ1 (x, T1 (y1 )), · · · , Ψd (x, Td (yd )))
Since Rd is separable and (X × Y )d is also separable (measurable space) it is
enough, to look at the component functions. Ψi (·, Ti (·)) are measurable functions
for 1 ≤ i ≤ d. And (x, y1 , · · · , yd ) 7→ (x, yi ) for 1 ≤ i ≤ d as a projection is
measurable too.
66
CHAPTER 4. ON MEASURABILITY.
Chapter 5
Uniform limit theorems for the
empirical process and measure.
5.1
Entropy and Covering Numbers.
For F a class of measurable functions on (X , A) let
FF (x) := sup{|f (x)| : f ∈ F} = kδx kF ,
where δx : F → R is linear functional.
A measurable function F ≥ FF is called an envelope function for F. If FF
is A–measurable, then it is said to be the envelope function. Given a law P on
(X , A) we call FF∗ the essential infimum of FF the envelope function of F for P .
P
Definition 5.1.1. Let Γ be the set of all laws on (X , A) of the form n−1 nj=1 δxj
for some xj ∈ X and j = 1, · · · , n, n ∈ N0 . For > 0, 0 < p < ∞ and γ ∈ Γ,
then for an envelope function F of F let
(
DFp (, F, γ) := sup m : f1 , · · · , fm ∈ F, and all
Z
i 6= j,
|fi − fj |p dγ > p
Z
(5.1.1)
)
F p dγ .
log DFp (, F, γ) is called the Koltchinskii Pollard entropy.
One also sets supγ∈Γ DFp (, F, γ) = DFp (, F).
In other words m = DFp (, F, γ) is the greatest non negative integer such that
for m different functions of F such
no function fi is contained inR the closed
R that
p
ball of any fj , j 6= i with radius ( F dγ)1/p , for p ∈ [1, +∞[ and p F p dγ for
67
68
CHAPTER 5. UNIFORM LIMIT THEOREMS.
p ∈]0, 1[. Also having an envelope function that is finite, allows us to look at the
unit ball of Lp (X , A, γ). This entropy number is also easily seen to be a covering
number. Indeed if some part of F were not covered by those balls then one could
easily add one such that the condition remains satisfied. In fact DFp (, F, γ) is
the greatest
R number of functions of F one can pick such that: the closed balls of
radius ( F p dγ)1/p cover F and such that the centers don’t overlap.
5.2
A Symmetrization Lemma.
Our objective is to try to bound the probability of a not necessarily measurable
supremum. In proving limit theorems it is often useful to make use of symmetrization. If we have an empirical process νn , then we can construct two (or actually
0
00
as many as we want) identical and independent copies νn and νn of the original
process νn . The Koltchinskii–Pollard method provides a way to bound the probability of the original empirical process by the probability of the difference of the
two copies. This difference is easier to handle when conditioned in a suitable way,
namely by conditioning one reduces the differences of this processes to a discrete
random variables taking at most 2n values. For discrete random variables one
disposes of very strong techniques, i.e. exponential inequalities, which will allow
us to consider very large (index) classes of sets.
For n = 1, 2, · · · we are given X1 , · · · , X2n independent and identically distributed random variables, namely in our case they will be the coordinates of the
product probability space (X 2n , A2n , P 2n ). Now let {σi }ni=1 be independent random variables and independent of the {Xi }2n
i=1 with
Pr(σi = 2i) = 1/2 = Pr(σi = 2i − 1).
One can e.g. let the σi be the coordinates of the space {1, 2} × · · · × {2n − 1, 2n}.
Then it is easy to see that {Xσi (ω) (ω)}ni=1 are i.i.d. with law P . We first define
other random variables. Then we will calculate their joint distribution. Define the
random variables {τi }ni=1 as follows:
(
2i,
if σi = 2i − 1;
τi =
2i − 1, if σi = 2i.
Furthermore, let
P
P
0
00
Pn := n−1 nj=1 δXσj Pn := n−1 nj=1 δXτj
0
0
00
00
νn := n1/2 (Pn − P )
νn := n1/2 (Pn − P )
and the differences
0
00
0
00
Pn0 := Pn − Pn νn0 := νn − νn = n1/2 Pn0 .
5.2. A SYMMETRIZATION LEMMA.
69
Independence. We claim that {Xσi (ω) (ω)}ni=1 and {Xτj (ω) (ω)}nj=1 are mutually
independent with law P .
Proof. Mutual independence and law of the Xσ , Xτ .
Let A1i := {σi = 2i}, A0i := {σi = 2i − 1}. Assume (w.l.o.g.) that the Xj , σk are
defined on the same probability space (Ω, A, Q) which is taken to be
(X 2n × Y n , B n ⊗ C n , P n × Qn ) with Xj , σk the projections onto X , respectively
Y . Pick measurable sets Bj ⊂ X , for j ∈ {1, · · · , 2n}. Let
Y := (Y1 , · · · , Y2n ) := (Xσ1 , · · · , Xσn , Xτ1 , · · · , Xτn )
Q
and for b ∈ ni=1 {0, 1}i notice that when σi = 2i − bi , τi = 2i − 1 + bi . We
calculate the distribution of Y.
Pr
2n
\
{Yj ∈ Bj }
j=1
= Pr
2n
\
{Yj ∈ Bj } ∩
j=1
= Pr
=
=
[
˙
b∈{0,1}n
X
Pr
X
n
\
b∈{0,1}n
{Yj ∈ Bj } ∩
{σi = 2i − bi }
n
\
{σi = 2i − bi }
!
i=1
{Yj ∈ Bj } ∩
j=1
!
i=1
b∈{0,1}n
Pr
b∈{0,1}n
j=1
2n
\
2n
\
n
\
[
˙
n
\
{σi = 2i − bi }
!
i=1
{X2j−bj ∈ Bj } ∩
n
\
j=1
{X2j−bj +1 ∈ Bn+j }
j=1
n
\
∩
{σi = 2i − bi }
!
i=1
=
X
2−n
b∈{0,1}n
=
2n
Y
n
Y
j=1
Pr({X2j−bj ∈ Bj })
n
Y
Pr({X2j−bj +1 ∈ Bn+j })
j=1
P (Bj )
j=1
The fourth equality follows from the fact that for each b ∈ {0, 1}n the σi , τi
are all different, and each value of {1, · · · , 2n} is taken by one of them. The fifth
is implied by the mutual independence of the σi and Xj .
0
00
It follows, by the definition of (νn (f ))f ∈F and (νn )(f )f ∈F , that they are independent, under suitable measurability conditions, i.e. assumptions on F, of both
70
CHAPTER 5. UNIFORM LIMIT THEOREMS.
processes. Here by independence of stochastic processes we mean that all the finite dimensional distributions are independent, i.e. for any k ≥ 1, f1 , · · · , fk ∈ F:
0
0
0
00
00
00
νn (f1 ), νn (f2 ), · · · , νn (fk ) ⊥
⊥ νn (f1 ), νn (f2 ), · · · , νn (fk ) ,
where ⊥
⊥ stands for (statistical) independence.
Identical distribution. By similar arguments one sees that both are identically
distributed, assuming again measurability of the processes.
The σi are defined on another probability space; e.g. one could take the product space Πni=1 {2i − 1, 2i} with usual product σ–algebra and product measure
Q := ⊗ni=1 Qi , where for 1 ≤ i ≤ n:
Qi ({2i}) = Qi ({2i − 1}),
0
00
(i.e. Qi = 1/2(δ{2i} +δ{2i−1} ). So the processes νn , νn will be defined on X 2n ×Z.
Remark. Note also that Pn0 has a symmetric distribution, this is easily seen conditioning on the σ’s.
Theorem 5.2.1 (Symmetrization Lemma). Let ζ > 0 and F ⊂ L2 (X , A, P ) s.t.
Z
|f |2 dP ≤ ζ 2 for all f ∈ F (or F ⊂ BL2 (P ) (0, ζ)).
Assume further that F is image admissible Suslin via some (Y, S, T ). Then for
any η > 0,
Pr{kνn0 kF > η} ≥ (1 − ζ 2 η −2 ) Pr{kνn kF > 2η}.
Remark. By abuse of notation we will write νn for the empirical process on
0
00
(X ∞ , A∞ ) and its restriction on (X n , An ) and the same for νn , νn and any pre–images
of (measurable) sets through the empirical process or one of its copies.
Proof. We don’t need to work with outer probabilities, since the events will turn
out to be (universally) measurable, implying that they equal (at least P ∞ –a.s. a
measurable event.
Because we are dealing with suprema over large classes, measurability isn’t
necessarily satisfied. It is here that the Suslin condition appears to ensure measurability. So we first prove that the events considered are indeed elements of the
(completed, see definition A.2.2) σ–algebra A∞ . Recall that
!
n
X
00
00
νn := n1/2 (Pn − P ) := n1/2 n−1
(δXτi − P ) ,
i=1
5.2. A SYMMETRIZATION LEMMA.
71
or more precisely
00
νn : (X ∞ × Z) × F → R :
n
1 X
((a, z), f ) 7→ √
(f ◦ Xτi (z) )(a) − P (f ) .
n i=1
It will turn out to be image admissible Suslin.
Step 1: Measurability of the events and image Suslin admissibility of the
empirical process and its copies.
i) Because F is image admissible Suslin, there exists a Suslin space (Y, S)
and a map T : Y → F onto, such that
eval(T (·), ·) :(X × Y, A ⊗ S) → R :
(x, y) 7→ eval(T (y), x) := T (y)(x)
is A ⊗ S/B(R) measurable. Secondly, as immediate consequence of the
above remark together with the proof of the theorem of Tonelli–Fubini
(A.2.17), on gets the measurable map
Z
(X × Y, A ⊗ S) → R : (x, y) 7→ IX (x) T (y)(u) dP (u).
One then trivially obtains measurability of
Z
Φ : (X × Y, A ⊗ S) → R : ((x, z), y) 7→ T (y)(x) −
T (y)(u) dP (u).
ii) Recall that the Xi were the coordinate functions on (X ∞ , A∞ ) and are thus
measurable A∞ /A. Since the spaces (X , A) and (Y, S) are separable measurable spaces, the product spaces (X ×Y ) and (X n ×Y ) are also separable,
as a consequence:
(Xi , idY ) : (X ∞ × Y, A∞ ⊗ S) → (X × Y, A ⊗ S),
i≥1
and
(Xn , idY ) : (X ∞ × Y, A∞ ⊗ S) → (X n × Y, An ⊗ S),
where Xn := (X1 , · · · , Xn ), n ≥ 1; are measurable maps.
Hence
Ψi := Φ ◦ (Xi , idY )
is also measurable.
72
CHAPTER 5. UNIFORM LIMIT THEOREMS.
Remark that since the space (X , A) is a separable measurable space,
(X ∞ , A∞ ) and (X ∞ × Z, A∞ ⊗ Z)
are also separable measurable spaces (proof of lemma A.2.14).
It is now clear that the (normalised) sum of Ψi , 1 ≤ i ≤ n, which equals νn ,
is measurable. An immediate consequence is measurability of
n
o
H̃ := (x, f ) : |νn (f )| > 2η, f ∈ F ⊂ X ∞ × F.
iii) Recall that Xτi : (X ∞ × Z, A∞ ⊗ Z) → (X , A), and
(
x2i
if τi (z) = 2i;
Xτi ({xn }n≥1 , z) =
x2i−1 if τi (z) = 2i − 1.
Then Xτi is measurable, since for any A ∈ A: {Xτi ∈ A}
n
o[n
o
(X2i , τi ) ∈ (A × {2i})
(X2i−1 , τi ) ∈ (A × {2i − 1}) .
iv) Let Xτ := (Xτ1 , · · · , Xτn ). Then Xτ : X ∞ × Z → X n is measurable, as
said earlier in step 1, point three. We will see that we can go in a measurable
00
way from νn to νn . Indeed our goal is to construct a function
γn : (X ∞ × Z, A∞ ⊗ Z) → (X ∞ , A∞ )
Q
such that Xn ◦ γn = Xτ . Recall that Z := ∞
j=1 {2j − 1, 2j}, let
γn (x∞ , z) = (xz1 , xz2 , · · · , xzn , x2n+1 , x2n+2 , · · · )
Recall
A∞ = σ(A1 × · · · , ×An × X ∞ : Ai ∈ A, 1 ≤ i ≤ n, n ≥ 1),
then it is enough to consider sets B := {γn ∈ A1 × · · · × An × X ∞ } to
prove measurability. Trivially:
!
2n
Y
[
Y
B=
Ck × X ∞ × {z} ×
{2i − 1, 2i}
z∈
Qn
i≥n+1
k=1
j=1 {2j−1,2j}
where
C2k
(
Ak ,
:=
X,
if zk = 2k,
otherwise.
5.2. A SYMMETRIZATION LEMMA.
and
C2k−1
73
(
Ak
:=
X,
, if zk = 2k − 1,
otherwise.
for k ∈ {1, · · · , n}. And the equality Xn ◦ γn = Xτ is clearly satisfied too.
Step 2: Further let
o
n
00
H := ((x, z), f ) : |νn (f )| > 2η, f ∈ F ⊂ (X ∞ × Z) × F.
We now show that H is a measurable subset of X ∞ × Z × F.
00
By step 1, it follows that νn = νn ◦ (γn , idF ) and moreover
00
H = {|νn | ∈ ]2η, +∞[} = {|νn ◦ (γn , idF )| ∈ ]2η, +∞[}
= {(γn , idF ) ∈ H̃}
and is thus a measurable subset of (X ∞ × Z) × F.
00
One also has kνn kF = kνn kF ◦ γn . In step 1, νn was showed to be Suslin
image admissible, so kνn kF is a u.m. function by corollary 4.2.2 and further from
00
lemma A.2.4 universal measurability of kνn kF is deduced.
By theorem 4.2.1, there exists a universally measurable selector
h̃ : projX ∞ (H̃) → F
such that (x∞ , h̃(x∞ )) ∈ H̃.
Define a universally measurable function (lemma A.2.4)
h : projX ∞ ×Z (H) → F
by letting h = h̃ ◦ γn .
By abuse of notation νn stands also for
n
1 X
X × F 7→ R : √
(f (Xi ) − P (f ))
n i=1
n
where Xi : X n → Xi ; this because the empirical process depends only on the first
n coordinates. Then it is easy to see, that one can suppose:
Xτ : X 2n × Z → X ,
00
νn : X 2n × Z × F → R
and that as above there is a γn : X 2n × Z → X n measurable such that γn = Xτ .
So H̃ will also stand for a measurable set of X n × F, and H := γn−1 (H̃).
Then it is clear that h = h1 ◦ γn = h1 ◦ Xτ , where both function are u.m. Let
74
CHAPTER 5. UNIFORM LIMIT THEOREMS.
J = proj(H̃), then J is a u.m. set of X 2 and h is a u.m. function, as the composition h1 ◦ γn = h1 ◦ Xτ . The set γn−1 (J) = {Xτ ∈ J} is a u.m. set of X 2n × Z
(lemma A.2.4). And h acts as a fixed f ∈ F on the former set.
Step 3: let Tn be the smallest σ–algebra for which
Xτ (·) = (Xτ1 , · · · , Xτn ) : (X 2n × Z) → X n
is measurable. In other words: Tn = σ{Xτi : i = 1, · · · , n}.
From the definition of conditional probability one has the following identity:
0
(P (2n) ⊗ Q) {|νn (·, h1 (Xτ (·))| ≤ η} ∩ {Xτ ∈ J}||Tn
0
=(P (2n) ⊗ Q) {|νn (·, h1 (Xτ (·))| ≤ η}||Tn I{Xτ ∈J} .
We will bound the conditional probability, on the set {Xτ (·) ∈ J}, so that one
can consider h1 (Xτ )(·) to act as a given, fixed f ∈ F. Recall that νn (f, Xσ ) =
0
νn (f, ·) so that
0
νn ((·, h1 (Xτ (·)) = νn (Xσ , Xτ ).
Because (Xσ , Xτ ) are independent:
E[I{|νn0 (·,h1 (Xτ (·))|≤η} ||Tn ]
= E[(G ◦ ν ◦ (idX 2n ×Z , h1 )Xσ , Xτ )||Tn ]
= E[(G ◦ ν ◦ (idX 2n ×Z , h1 )Xσ , x)||Xτ = x]
= E[I{|νn0 (·,h1 (x)|≤η} ||Xτ = x].
(5.2.1)
Further by monotonicity of the conditional probability one sees that:
0
(P (2n) ⊗ Q) {|νn (·, h1 (Xτ (·))| ≤ η}||Tn
!
n 00
o
00
0
νn (·, h1 (Xτ (·)) − (νn (·, h1 (Xτ (·)) − νn (·, h1 (Xτ (·)) ≤ η ||Tn
= Pr
n 00
νn (·, h1 (Xτ (·)) − (νn00 (·, h1 (Xτ (·)) − νn0 (·, h1 (Xτ (·)) ≤ η}||Tn
≤ Pr
00
0
00
Now on {Xτ ∈ J}, by definition kνn kF ≥ 2η. Recalling that νn0 := νn − νn ;
00
0
(2n)
(P
⊗ Q) 2η − η ≤ νn (·, h1 (Xτ (·)) − νn (·, h1 (Xτ (·)) ||Tn
(2n)
0
≤ (P
⊗ Q) {η ≤ kνn kF }||Tn
!
5.2. A SYMMETRIZATION LEMMA.
75
So we have, on the set {Xτ ∈ J}, the following inequality:
n 0
o
(2n)
(P
⊗ Q) νn (·, h1 (Xτ (·)) ≤ η ||Tn
≤ (P
(2n)
0
⊗ Q) {η ≤ kνn kF }||Tn .
The conditional probability on Xτ allows one to handle h1 (Xτ )(·) as a fixed
0
f (∈ F), see equation 5.2.1. Also since by construction νn is independent of
Tn = σ(Xτ ); one has by Chebyshev’s inequality,
(P
(2n)
0
Var(νn (f )) ζ 2
⊗ Q)(|νn (f )| ≥ η) ≤
≤
,
η2
η
0
where the bound of the variance is deduced from:
0
Var(νn (f )) = Var(νn (f )) = nVar(Pn (f ) − P (f ))
!
n
X
−1
= n Var
(f (Xi ) − P (f ))
i=1
= Var(f (X1 ) − P (f )) = Var(f (X1 ))
!2 Z
Z
Z
=
f 2 dP −
f dP
≤ f 2 dP ≤ ζ 2
0
(recall that νn and νn were identically distributed and that the Xi were all i.i.d.).
0
Taking the complement of the event {|νn (f )| ≥ η}:
0
0
(P (2n) ⊗ Q)(|νn (f )| ≤ η) ≥ (P (2n) ⊗ Q)(|νn (f )| < η)
ζ 2
≥1−
.
η
Hence, still on the set {Xτ ∈ J}:
!
0
(2n)
0
(2n)
(P
⊗ Q) {η < kνn kF }||Tn
≥ (P
⊗ Q) {|νn (·, h1 (Xτ (·))| ≤ η}||Tn
≥
1−
ζ 2
!
η
More precisely :
(2n)
0
(P
⊗ Q) {η < kνn kF }||Tn I{Xτ ∈J} ≥
1−
ζ 2
η
!
I{Xτ ∈J}
76
CHAPTER 5. UNIFORM LIMIT THEOREMS.
00
Finally, integrating both sides out, and using {Xtτ ∈ J} = {kνn kF ≥ 2η} together
00
with νn = νn gives:
(P (2n) ⊗ Q) {η < kνn0 kF } ≥
5.3
1−
ζ 2
η
!
P (n) ({kνn kF > 2η}).
A martingale property and a uniform law of large
numbers for the empirical process.
A technique often used to prove a limit exists is based upon proving the sequence
has some martingale property (or is related to a martingale), because we have nice
limit theorems for martingales and also maximal inequalities. We first state formally the definitions of martingale, submartingale and reversed (sub)martingale
and then prove two theorems about martingale behaviour of the empirical measures.
The empirical measures enjoy many interesting properties and one of them is
that its supremum over a class of integrable functions is a reversed submartingale.
Remark that here n and k will denote negative integers, as it is usual for reversed submartingales.
Theorem 5.3.1. Let (X , A, P ) be a probability space, F ⊂ L1 (X , A, P ) and let
{Pn }n≥1 be the empirical measures for P . Let Sn be the smallest σ–algebra for
which
k
1X
f (Xi (·)),
Pk (·, f ) :=
k i=1
for all k ≥ n and all f ∈ L1 (X , A, P ) are measurable. Then
i) For any f ∈ F, {Pn (f ), Sn }n≥1 is a reversed martingale, i.e.
E[Pn−1 (f )||Sn ] = Pn (f ) a.s. if n ≥ 2.
ii) (F. Strobl) If F has an envelope function F ∈ L1 (X , A, P ) and if for each
n, kPn − P kF is measurable for the completion of P n (the product measure
on (X n , An )). Then {kPn − P kF , Sn }n≥1 is a reversed submartingale, i.e.
kPn − P kF ≤ E[kPk − P kF ||Sn ] a.s. for k ≤ n
for n ≥ 2.
5.3. MARTINGALE PROPERTY, GLIVENKO–CANTELLI THEOREM.
77
Proof.
i) Sm ⊂ Sm−1 , m = 2, 3, · · · is an immediate consequence of the
definition of Sn , n ≥ 1. Hence
S1 ⊃ S2 ⊃ S3 ⊃ · · · .
For n fixed, each set of Sn will be invariant under permutation of the first
n coordinates Xi . Indeed let A ∈ Sn , then for some k ≤ n, f ∈ F and
B ∈ B(R) : A = Pk (f )−1 (B).
This follows from the particular choice of the Xi (the projections) and the
commutativity of the addition in R.
Sn = σ({(Pk (f ))−1 (B) : B ∈ B(R), f ∈ F, k ≤ n})
All sets of the form (Pk (f ))−1 (B) are invariant under permutation of the
first n coordinates. Moreover the class C of all sets invariant under permutations of the first n coordinates is a σ–algebra, lemma A.2.8. Thus Sn ⊂ C
and indeed sets in Sn are invariant under permutations of the first n coordinates.
Let 1 ≤ i < j ≤ n:
E[f (Xi )||Sn ] = E[f (Xj )||Sn ],
is valid because both are Sn measurable (by definition of the conditional
expectation) and for any A ∈ Sn , Π ∈ Sym(n) (the symmetric group of
order n):
(x1 , · · · , xn , xn+1 , · · · ) ∈ A ⇐⇒ (xΠ(1) , · · · , xΠ(n) , xn+1 , · · · ) ∈ A.
Hence Xi (A) = Xj (A), if x ∈ Xi (A), then for some {ym } ∈ A: yi = x,
but then (yΠ(1) , · · · , yΠ(n) , yn+1 , · · · ) ∈ A where Π ∈ Sym(n): Π(j) = i,
so
Xj ((yΠ(1) , · · · , yΠ(n) , yn+1 , · · · )) = yΠ(j) = yi = x ⇐⇒ x ∈ Xj (A).
It follows:
Z
f (Xi ({xm }m≥1 )) dP ∞ ({xm }m≥1 )
ZA
Z
−1
∞
=
f (xi ) d(P ◦ Xi )(xi ) =
f (x) dP (x)
Xi (A)
Xi (A)
Z
Z
=
f (x) dP (x) =
f (xj ) d(P ∞ ◦ Xj−1 )(xj )
X (A)
Xj (A)
Z j
=
f (Xj ({xm }m≥1 )) dP ∞ ({xm }m≥1 )
E[f (Xi )IA ] =
A
= E[f (Xj )IA ]
78
CHAPTER 5. UNIFORM LIMIT THEOREMS.
We used the image measure formula and the fact that our random variables
are identically distributed. Recall: Xn : (X ∞ , A∞ , P ∞ ) → (X , A, P ) the
n–th projection. Thus summing over 1 ≤ i ≤ k for k = 1, · · · , n and
dividing by k gives:
E[Pk (f )||Sn ] = E[f (X1 )||Sn ].
In particular for the choices: k = n − 1 and k = n;
E[Pn−1 (f )||Sn ] = E[f (X1 )||Sn ] = E[Pn (f )||Sn ] = Pn (f ).
ii) Now we proof the second part. We first prove the integrability of kPn −P kF ,
taking the expectation (w.r.t. the completion of P n ) is allowed since we
assume it completion measurable. Because one has an integrable envelope
function:
kP kF = sup(|P (f )|) ≤ sup(P (|f |)) ≤ P (F ) < +∞
f ∈F
f ∈F
kPn kF = sup{|Pn (f )| : f ∈ F} ≤ Pn (F ) < +∞
so
E[kPn − P kF ] ≤ E[kPn kF + kP kF ] < 2P (F ) < +∞.
So it remains to prove
kPn − P kF ≤ E[kPk − P kF ||Sn ] a.s. for k ≥ n.
in order to have a reversed submartingale. Next, it will be shown that
kPn − P kF is measurable for the completion of Sn . Because the open half
lines {]q, +∞[: q ∈ Q+ } form a generating class for the Borel σ–algebra
of the real line; it suffices to proof that the sets:
Aq := Aq,n := {kPn − P kF ∈]q, +∞[}
are in the completion of Sn (which is a σ–algebra according to definition
A.2.2 ). First by our assumption on kPn − P kF we have that there exists
sets
Cq , Dq ∈ An : Cq ⊂ Aq ⊂ Dq and P n (Dq \Cq ) = 0.
Our purpose is now to find sets Uq , Vq ∈ Sn such that the above equation
remains true (for these sets). Consider
Π
:= {(xΠ(−1) , · · · , xΠ(n) ) : (x−1 , · · · , xn ) ∈ Cq }
CqΠ := Cq,n
5.3. MARTINGALE PROPERTY, GLIVENKO–CANTELLI THEOREM.
79
the set obtained by permuting the n first coordinates of the set Cq,n according to Π ∈ Sym(n). Then if we let Uq := ∪Π∈Sym(n) CqΠ , then Uq is
invariant under any permutation the the first n coordinates and because Aq
was invariant: Uq ⊂ Aq . Now in the same way we define DqΠ and we let
Vq := ∩Π∈Sym(n) DqΠ , then Vq is invariant under permutations and Aq ⊂ Vq .
Because the mappings
fΠ : (X n , An ) → (X n , An ) :
(x1 , · · · , xn ) 7→ fΠ ((x1 , · · · , xn )) := (xΠ(1) , · · · , xΠ(n) )
are measurable; Uq and Vq are both measurable too. Thus Uq , Vq are in An
and are invariant under permutations, by theorem A.2.9 they lie in Sn and
Aq lies in the completion of Sn so that kPn − P kF is measurable for the
completion of Sn .
For i = 1, · · · , n + 1 let
Pn,i := n
n+1
X
−1
δXj .
j=1,j6=i
Then trivially Pn,n+1 := Pn and for i 6= n + 1, Pn,i has the same properties
as Pn . In particular, as above, kPn,i − P kF is measurable for the completion
of An+1 .
For any n ≥ 1 one has got:
kPn+1 − P kF
n+1
X
−1
δXl − P = (n + 1)
1 −1
=
n
n+1
l=1
n+1
X
F
nδXl
− (n + 1)P l=1
F
n+1
1 X
=
Pn,i − (n + 1)P n+1
F
i=1
where the last P
step requires only an easy, rapid calculation (namely rearrange the sum n+1
l=1 nδXl . By
P the triangle inequality:
kPn+1 − P kF ≤ (n + 1)−1 n+1
i=1 kPn,i − P kF .
Also since kPn+1 − P kF is Sn+1 measurable:
kPn+1 − P kF = E kPn+1 − P kF ||Sn+1 . And, by the previous calculation
and monotonicity and linearity of the conditional expectation:
n+1
E kPn+1 − P kF ||Sn+1 ≤
1 X
E kPn,i − P kF ||Sn+1
n + 1 i=1
80
CHAPTER 5. UNIFORM LIMIT THEOREMS.
The submartingale property
would follow if E kPn,i − P kF ||Sn+1 = E kPn − P kF ||Sn+1 a.s. But this is easy to see.
Recall:
n+1
n
X
X
Pi,n+1 :=
δXj and Pn :=
δXj .
j=1
j=1,j6=i
And thus
kPi,n+1 − P kF = H ◦ (X1 , · · · , Xi−1 , Xi+1 , · · · , Xn+1 )
kPn − P kF = H ◦ (X1 , · · · , Xi−1 , Xi , Xi+1 , · · · , Xn );
for some (universally) measurable function H. Now since
(X1 , · · · , Xi−1 , Xi+1 , · · · , Xn+1 )(A) = (X1 , · · · , Xi−1 , Xi , Xi+1 , · · · , Xn )(A)
for all A ∈ Sn+1 (by invariance of permutation of the first n + 1 coordinates
for all sets in Sn+1 ), one has:
Z
Z
∞
kPi,n+1 − P kF dP =
kPn − P kF dP ∞ .
A
A
Hence the proof is complete.
Before coming to the main theorem of this section we state a definition.
Definition 5.3.1. Let (X, A, P ) be a probability space, F ⊂ L(X, A, P ) a class
of integrable real–valued functions. Then F will be called a strong (respectively
weak) Glivenko–Cantelli class for P iff kPn − P kF → 0 as n → ∞ almost
uniformly (respectively in outer probability).
Theorem 5.3.2 (Glivenko–Cantelli theorem). Let (X , A, P ) be a probability space,
F ∈ L1 (X , A, P ), and F ⊂ L1 (X , A, P ) where F is an envelope function of F.
Suppose that F is image admissible Suslin via (Y, S, T ). Also assume that,
(1)
DF (δ, F) < ∞ for all δ > 0.
Recall that
Xj : (X ∞ , A∞ , P ∞ ) → (X , A, P ) : {xn }n≥1 7→ xj
is the standard model.
Then limn→∞ kPn − P kF = 0, P ∞ –a.s., i.e. F is a (strong) Glivenko–Cantelli
class for P or the empirical process satisfies a uniform law of large numbers.
5.3. MARTINGALE PROPERTY, GLIVENKO–CANTELLI THEOREM.
81
Proof. By corollary 4.2.2 the Suslin image admissibility property of the class F
makes kPn − P kF universally measurable, in particular it is completion measurable. Then by theorem 5.3.1 part (ii), kPn − P kF (with the right filtration Sn as in
theorem 5.3.1 ) is a reversed submartingale. Being nonnegative it converges a.s.
and in L1 by Doob’s theorem A.2.24. If we prove that limn→∞ kPn − P kF = 0 in
probability, then the a.s. (and L1 ) limit will also be 0.
F I{F >n} → 0 pointwise for any x ∈ X as n → ∞, so for > 0, take M
large enough such that P (F I{F >M } ) < /4 by Lebesgue’s theorem of dominated
convergence A.2.19.
Next we use this to bound
k(Pn − P )I{F >M } kF := sup{|(Pn − P )(f I{F >M } )| : f ∈ F}
by:
k(Pn − P )I{F >M } kF ≤ kPn I{F >M } kF + kP I{F >M } kF
≤ Pn (F I{F >M } ) + P (F I{F >M } )
Then by the usual Strong Law of Large Numbers (A.2.20) we have
Pn (F I{F >M } ) + P (F I{F >M } ) → 2P (F I{F >M } ) < /2 a.s. as n → ∞.
This means we can look at the set FI{F ≤M } := {f I{F ≤M } : f ∈ F}, which is
also Suslin image admissible, and so pretend that the envelope function is bounded
(a.s.). F is image admissible Suslin via some (Y, S, T ), then FI{F ≤M } is also
image admissible Suslin as follows
X × Y → R : (x, y) 7→ T (y)(x)I{F ≤M } ,
which as a product of two measurable functions is again measurable. Let FM
stand for the modified class FI{F ≤M } and FM for F I{F ≤M } .
In the next step we will use the symmetrization lemma 5.2.1. Since F ≤ M
a.s., let ζ := M and η := n1/2 /4 > 2M for n large enough. Then because of the
symmetrization lemma 5.2.1, which says;
0
00
(P (2n) ⊗ Q){n1/2 kPn − Pn kF > η}
≥ (1 − ζ 2 η −2 )P (2n) {n1/2 kPn − P kF > 2η}.
and which applied to our particular choice of η, ζ:
0
00
(P (2n) ⊗ Q){kPn − Pn kF > /4} ≥(1 − ζ 2 η −2 )P (2n) {kPn − P kF > /2}
3
≥ P (2n) {kPn − P kF > /2}.
4
82
CHAPTER 5. UNIFORM LIMIT THEOREMS.
0
00
Therefore it is enough to show kPn0 kF = kPn − Pn kF → 0 in probability.
(1)
Note that for the modified class FM one also has DF (δ, FM ) < +∞ for all
δ > 0, lemma D.1.1.
Next we let γ := P2n , and we use lemma 5.3.3, in order to choose functions
g1 , · · · , gm ∈ FM which satisfy the condition in equation 5.1.1 for := /(9M )
(x)
and such that gj := gjn depends in a measurable on x := (x1 , · · · , xn ).
(1)
For g ∈ FM and m := DFM (/(9M ), FM , P2n (x)), we have got
min P2n (|g − gj |) ≤
1≤j≤m
P2n (FM )
9M
and for each j where the minimum is reached:
0
00
|Pn0 (g) − Pn0 (gj )| = |Pn (g − gj ) − Pn (g − gj )|
0
00
≤ (Pn + Pn )(|g − gj |)
(5.3.1)
and recalling that for each z ∈ Z:
(σ1 , · · · , σn , τ1 , · · · , τn )(z),
is a permutation of (1, · · · , 2n)
0
00
(Pn + Pn )(|g − gj |)
!
n
n
X
X
= n−1
δXσi +
δXτi (|g − gj |)
= n−1
i=1
2n
X
i=1
!
δXi (|g − gj |)
i=1
= 2P2n (|g − gj |) ≤ 2
P2n (FM )
9M
(5.3.2)
where (ki )ni=1 ∈ {0, 1}n .
As n → ∞ by the usual strong law of large numbers ( A.2.20 ):
2
2
P2n (FM ) →
P (FM )
9M
9M
P ∞ –a.s.
and since FM ≤ M : (2P (FM ))/(9M ) < /4. Further, since
(1)
(1)
m := DFM (/(9M ), FM , P2n (x)) < DF (δ, FM ) < +∞,
(5.3.3)
5.3. MARTINGALE PROPERTY, GLIVENKO–CANTELLI THEOREM.
83
m can be chosen to be independent of n and in particular thus remains bounded
as n → ∞. Recall the well known bound for the probability of a finite maximum:
o
n
(∞)
0
(P
⊗ Q) max |Pn (gj )| > /4
1≤j≤m
(
)
[
= (P (∞) ⊗ Q)
|Pn0 (gj )| > /4
1≤j≤m
≤
X
(P (∞) ⊗ Q){|Pn0 (gj )| > /4}
1≤j≤m
≤ m max
1≤j≤m
(P (∞) ⊗ Q){|Pn0 (gj )| > /4}
We show now that the last term goes to zero, as n → ∞.
Finally one concludes
(P (∞) ⊗ Q){kPn0 kFM > /2} ≤ (P (∞) ⊗ Q){ sup |Pn0 (g)| > /2}
g∈FM
o
n
≤ (P (∞) ⊗ Q) sup min |Pn0 (g − gj )| + max |Pn0 (gj )| > /2
1≤j≤m
g∈F 1≤j≤m
n M
o
(∞)
∗
0
≤ (P
⊗ Q) sup min |Pn (g − gj )| > /4
g∈F 1≤j≤m
M
+ m max (P (∞) ⊗ Q){|Pn0 (gj )| > /4}
1≤j≤m
Finally we will show that both probabilities can be made arbitrary small for n
large enough.
Firstly combining the result of equations 5.3.1 and 5.3.2
o
n
(∞)
∗
0
(P
⊗ Q) sup min |Pn (g − gj )| > /4
g∈FM 1≤j≤m
)
(
≤ P ∞ 2
P2n (FM ) > /4
9M
(
)
≤ P ∞ 2
P2n (FM ) − 2
P (FM ) + 2
P (FM ) > /4
9M
9M
9M
(
)
≤ P ∞ 2
P2n (FM ) − 2
P (FM ) >
< /2
9M
9M
36
where we used equation 5.3.3 for the last inequality.
84
CHAPTER 5. UNIFORM LIMIT THEOREMS.
And secondly
m max
1≤j≤m
≤m
(P (∞) ⊗ Q){|Pn0 (gj )| > /4}
16
max Var(Pn0 (gj ))
2 1≤j≤m
because
E[Pn0 (gj )]
n
n
i
h
X
X
−1
−1
0 = 0.
(gj (Xσi ) − gj (Xτi ) = n
=E n
i=1
i=1
Furthermore since the Xσi , Xτi , 1 ≤ i ≤ n are all independent
16
max Var(Pn0 (gj ))
2 1≤j≤m
n
n
X
X
16 1
≤ m 2 2 max Var
gj (Xτi )
gj (Xσi ) −
n 1≤j≤m
i=1
i=1
m
16 1
2n max Var(gj (X1 ))
2 n2 1≤j≤m
32m
≤ 2 M 2 < /2
n
≤m
if n is large enough.
We have just proved that, by theorems 5.3.2 and A.2.24, kPn − P kF converges
a.s. and in L1 to 0.
Lemma 5.3.3. Let (X n , An , P n ) be a probability space for n ≥ 1 and Xj the
j–th projectionP
onto (X , A), i.e. the standard model. For each x ∈ X 2n and
P2n := (2n)−1 2n
i=1 δXj ,
(1)
DF (δ, F, P2n ) = k(x),
where k is a universally measurable function and for k(x) = k, ηk is a universally
measurable function from X 2n into Y k such that the functions fi , in equation
5.1.1, can be taken as T ((ηk(x) (x))i ); i = 1, · · · , k(x).
(1)
Proof. For any δ > 0 : k(x) := DF (δ, F, P2n (x)) < +∞. If k(x) = k, then
since (Y, S) is a Suslin measurable space, there exists a Polish space (S, d) and
a Borel measurable map b from (S, d) onto (Y, S). From lemma A.1.15 one has
that Y k with the product σ–algebra is a Suslin measurable space too.
5.3. MARTINGALE PROPERTY, GLIVENKO–CANTELLI THEOREM.
85
The set
Bk := {(x, y1 , · · · , yk ) ∈ X 2n × Y k : P2n (|T (yi ) − T (yj )|) > δP2n (F ), i 6= j},
with x = (x1 , · · · , x2n ) ∈ X 2n , is product measurable by image admissibility as
follows.
For 1 ≤ i 6= j ≤ k let gi,j : Rk → R :→ |πi (a) − πj (a)|, then clearly gi,j is a
Borel measurable function. Using theorem 4.2.3 one has that
(|T (yi )(x) − T (yj )(x)| − δF (x)) = gi,j (T (y1 ), · · · , T (yk )) − δF (x)
is Suslin image admissible, i.e. jointly measurable. Since (X 2n , A2n ), by lemma
A.2.11, is again a separable measurable space:
(X 2n × Y k , A2n ⊗ S k ) : (x, y1 , · · · , yk ) 7→
2n X
−1
gi,j (T (y1 ), · · · , T (yk ))(xl ) − δF (xl )
(2n)
l=1
P2n (|T (yi ) − T (yj )|) − δF )(x)
is measurable. Hence
2n
o
\ n
X
−1
Bk =
(2n)
(gi,j (T (y1 ), · · · , T (yk ))(xl ) − δF (xl ) ∈]0, +∞[
1≤i6=j≤k
l=1
is product measurable.
Let y (k) := (y1 , · · · , yk ) and Ak the projection, i.e.
{x ∈ X 2n : (x, y (k) ) ∈ Bk , for some y (k) },
for k ≥ 2,
and A1 := X 2n .
By Sainte–Beuve’s selection theorem 4.2.1; Ak is a universally measurable set of
X 2n and there exists a universally measurable function ηk from Ak into Y k such
that (x, ηk (x)) ∈ Bk for all x ∈ Ak .
Let η(x) := ηk (x) if and only if x ∈ Ak \Ak+1 , k ≥ 1. For each x, we let k(x)
be the unique k such that x ∈ Ak \Ak+1 . This means that (ηk (x))i , i = 1, · · · k
satisfies the condition of equation 5.1.1 for γ = P2n (x), and is maximal, in the
sense we can not add another function such that (x, ηk+1 (x)) ∈ Bk or with other
words equation 5.1.1 remains true for γ = P2n (x) and (ηk+1 (x))i , i = 1, · · · k + 1.
If we let k(x) the maximal k for which x ∈ Ak \Ak+1 , then k(x) is also u.m.
Indeed,
−1
M\
−1
k {n} = (An \An+1 ) ∩
(Aj \Aj+1 )
j=n+1
(1)
(1)
where DF (δ, F, P2n ) ≤ DF (δ, F) = M < +∞. All the sets Al are u.m., so
k(x) is a universally measurable function.
86
5.4
CHAPTER 5. UNIFORM LIMIT THEOREMS.
A uniform central limit theorem with entropy
condition for the empirical process.
Here follows the main theorem: a central limit theorem for the empirical process
νn , where the class of sets is Suslin image admissible via some (Y, S, T ). It was
first stated by D. Pollard, and then extended in the present form by R.M. Dudley.
We will first need a lemma which relates the covering number of F to the covering
of the square of differences of functions of F. Those differences also appear in
the conditions of the asymptotic equicontinuity, definition 2.7.1. After the main
theorem several (easy) corollaries will be given. Those corollaries will allow us
to give concrete examples of classes F for which the central limit theorem for the
empirical process is valid (by mean of Pollards entropy condition being satisfied).
Lemma 5.4.1. Let (X , A, P ) be a probability space, F ∈ L2 (X, A, P ) and F ⊂
L2 (X , A, P ) having as envelope function F . Let H := 4F 2 and
H := {(f − g)2 : f, g ∈ F}.
Then 0 ≤ φ(x) ≤ H(x) for all φ ∈ H and x ∈ X , and for any δ > 0,
(1)
(2)
DH (4δ, H) ≤ DF (δ, F)2
Proof. First note that from the definition of H, it follows that 0 ≤ φ ≤ H, for all
(2)
φ ∈ H. For any given γ ∈ Γ, we choose m ≤ DF (δ, F)2 and f1 , · · · , fm ∈ F
such that the balls BLp (X ,A,γ) (fi , rp ) cover F, where:
Z
rp = δ( F p dγ)1/p for p = 1, 2.
For any f, g ∈ F take fi , fj such that
max γ (fi − f )2 , γ (fj − g)2 ≤ δ 2 γ(F 2 );
this is possible because the closed balls cover F. Then letting:
A := γ (f − g)2 − (fi − fj )2 ,
by the Cauchy–Schwartz inequality (in the second step):
A = γ [f − g − (fi − fj )][f − g + fi − fj ]
1/2
1/2
≤ γ [f − fi − (g − fj )]2
γ (f − g + fi − fj )2
1/2
≤ γ [f − fi − (g − fj )]2
4γ(F 2 )1/2
1/2
≤ 41/2 max γ (f − fi )2 , γ (g − fj )2
4γ(F 2 )1/2
≤ 8δγ(F 2 ) = 2δγ(H)
5.4. POLLARD’S CENTRAL LIMIT THEORM.
87
where we used that
(f − g + fi − fj )2 ≤ 2(f − g)2 + 2(fi − fj )2 ≤ 4H = 16F 2 .
So if one takes functions of the form hk(i,j) := (fi − fj )2 , for k(i, j) = in −
j; i, j = 1, · · · n the closed balls centered in those hk(i,j) with radius 2δγ(H) will
(1)
cover H. To finish the prove let n ≤ DH (4δ, H) and H1 , · · · , Hn ∈ H such that
no Hj lies in the (closed) ball centered in Hi of radius 4δγ(H) for any i 6= j, i.e.
Z
|Hj − Hi | dγ > 4δγ(H).
(2)
Then we would like to have n ≤ DF (δ, F)2 . But this is straightforward for the
closed balls centered in hk(i,j) with radius 2δγ(H) cover H. So with each Hi
associate one hk(i,j) . Because of the triangle inequality the same hk(i,j) can not be
used for two different Hi1 , Hi2 (otherwise the condition γ(|Hi1 − Hi2 | > 4δγ(H)
would be violated). Thus we have an injective map from {Hi : i = 1, · · · , n} into
(1)
(1)
{hk(j,l) : j, l = 1, · · · , m}. This holds for all n ≤ DH (4δ, H), so DH (4δ, H) ≤
(2)
DF (δ, F)2 .
(2)
This lemma is more effective if one has DF (δ, F)2 < ∞ for all δ > 0 not
(1)
just for one single δ, because then DH (, H) < ∞ for all > 0. Also sets of
squared differences of elements of F are important; they show up in the condition
for asymptotic equicontinuity (2.7.1 )
We first give a formal definition of a Donsker class.
Definition 5.4.1. Let (X , A, P ) be a probability space and F ⊂ L2 (Ω, A, P ). F
is said to be a Donsker class for P (or to be a P –Donsker class, or to satisfy the
central limit theorem for empirical measures), if νn ⇒ GP , where GP stands for
a tight Borel measurable map, in the metric space (`∞ (F), k · kF ).
Theorem 5.4.2 (Pollard). Let (X , A, P ) be a probability space and
F ⊂ L2 (X, A, P ). Let F be image admissible Suslin via (Y, S, T ) and have an
envelope function F ∈ L2 (X , A, P ). Suppose that
Z
1
1/2
(2)
log DF (x, F)
dx < ∞.
(5.4.1)
0
Then F is a P –Donsker class, i.e. νn ⇒ GP in `∞ (F) and there exists a tight
version of GP (also GP will thus be a tight Brownian bridge process).
The hypothesis 6.4.1 is called Pollard’s entropy condition.
88
CHAPTER 5. UNIFORM LIMIT THEOREMS.
Remark. Equation 5.4.1 gives a condition on the rate of growth for the entropy. In
(2)
particular we obtain information about the behaviour of DF (x, F) near x = 0,
it may not increase too rapidly if the integral has to converge. One always has
(2)
log DF (x, F) ≥ 0.
(2)
We note also that DF (x, F) is non increasing in x.
Theorem 2.7.3 gives condition on νn to converge weakly to a Gaussian process (here a Brownian Bridge) GP in the space (`∞ (F), k · kF ). As mentioned
in the second chapter, fifth section, we can and will use theorem 2.7.4, which
gives equivalent condition for asymptotic tightness of νn , namely νn (f ) has to be
asymptotically tight for all f ∈ F, but this is trivial since by the usual CLT (
A.2.21 ), it converges to a normal variable on R, and Borel laws on Polish spaces
are tight ( [Bill2] chapter 1 theorem 1.3 on page 8), in the case of Polish spaces,
tightness of the limit is the same as uniform tightness and asymptotical tightness.
Indeed, when the limit is tight (which is the case since any Borel law on a
Polish space is tight), by the portmanteau theorem 2.5.1 part (e), the sequence is
asymptotically tight (K δ is open!). Also a finite family of measures on a Polish
space is still tight, so choosing K a little bigger, we have uniform tightness. (A
family of probability measures Π is called uniform tight, iff for every > 0, there
exists a compact K: for all P ∈ Π: P (K) > 1 − .)
Conversely if a sequence of probability measure is uniformly tight, then it is trivially asymptotically tight. By Prohorov’s theorem (in Polish spaces) (see [Bill2]
chapter 1 theorem 5.9 on page 59), Π is relative compact, hence the limit is a
probability measure and on a Polish space the limit is tight. As claimed we have
that ν(f ) is asymptotically tight for every f ∈ F.
Since the limit process is a Gaussian process in (`∞ (F), k · kF ), it is enough
to prove, by theorem 2.7.4, firstly that F is totally bounded for ρP (actually one
just needs some semimetric as seen in the theorems from chapter 2 section 7) and
secondly that νn satisfies the asymptotic equicontinuity condition (2.7.1 ).
Thus we will split the proof in two parts, and start with the easiest part, the
total boundedness of F.
(2)
Proof. Under the weaker assumption: DF (δ, F) < +∞ for all δ > 0, one has
that F is totally bounded in L2 (X , A, P ). Assume P (F 2 ) > 0, otherwise F is
trivially totally bounded. Let
H := {(f − g)2 : f, g ∈ F}.
(1)
Then by the above lemma 5.4.1: DH (δ, H) < +∞ for all δ > 0. Since H has
an integrable envelope function, is Suslin image admissible (F is and subtraction and squaring are continuous operations on R so are measurable) and has a
5.4. POLLARD’S CENTRAL LIMIT THEORM.
89
finite entropy, H is a P –strong Glivenko–Cantelli class by theorem 5.3.2. In other
words:
sup{|(Pn − P )(f − g)2 : f, g ∈ F} → 0 almost surely as n → ∞.
Also by the usual Strong Law of Large Numbers
Z
2
−1
F dPn = n
n
X
2
F (Xi ) →
Z
F 2 dP a.s. as n → ∞.
i=1
We choose n0 such that for all n ≥ n0 :
Z
Z
2
2 F dP >
F 2 dP2n
and,
/2 > sup{|(P2n − P )(f − g)2 : f, g ∈ F}.
Take also 0 < δ < (/(4P (F 2 )))1/2 and choose f1 , · · · , fm ∈ F, such that no
fi lies in any of the closed ball of radius δ(P2n (F 2 ))1/2 centered in fj , 1 ≤ j ≤
m, j 6= i for all i = 1, · · · , m (recall that the balls cover F). Then for any f ∈ F
we have got, for some j ∈ {1, · · · , m}:
P ((f − fj )2 ) < (P2n − P )((f − fj )2 ) + P2n ((f − fj )2 )
< sup |(P2n − P )((f − fj )2 )| + δ 2 P2n (F 2 )
g,h
< /2 + δ 2 2P (F 2 ) < We proved that F is totally bounded in the L2 metric.
In the second part we are concerned with the asymptotic equicontinuity condition, definition 2.7.1.
Proof. This second part is quite lengthy, many things have to be done and then put
together to arrive at the conclusion. We divide it and start with the rough sketch
of the proof.
i) In the first step, we explain briefly how we will tackle the problem.
So given > 0, we need to find a δ > 0 such that
lim sup Pr∗ {sup{|νn (f − g)| : f, g ∈ F, ρP (f, g) < δ} > } < (5.4.2)
n→∞
R
where Pr∗ = (P ∞ )∗ and ρP (f, g) := ( X (f − g)2 dP )1/2 the L2 (P ) semimetric, see appendix on Gaussian processes for more about the semimetric.
90
CHAPTER 5. UNIFORM LIMIT THEOREMS.
So let > 0. We start defining subclasses:
Z
Fj,δ := {f − fj : f ∈ F, (f − fj )2 dP < δ 2 }
of F, where δ > 0 will be specified later on. Fj,δ is Suslin image admissible
and has a finite entropy, lemma D.1.2. As in the first step, by lemma 5.4.1
(to bound the entropy of the class of squared differences of elements of
Fj,δ ), and theorem 5.3.2 (the strong Glivenko–Cantelli theorem)
sup{|(Pn − P )(f − g)2 : f, g ∈ Fj,δ } → 0 almost surely as n → ∞.
We define the set
2n
Fj,δ
Z
:= {f − fj : f ∈ F,
(f − fj )2 dP2n < δ 2 }.
2n
Note that in the limit the sets Fj,δ and Fj,δ
are concordant. Indeed for
f − fj ∈ Fj,δ :
P2n ((f − fj )2 ) ≤ (P2n − P )((f − fj )2 ) + P ((f − fj )2 ) < n + δ
2n
and vice versa limn→∞ Fj,δ
⊂ Fj,δ (on a set of P ∞ measure 1).
2n
Then for P2n (x) fixed, Fj,δ
is Suslin image admissible. (X 2n , A2n ) is a
separable measurable space, because (X , A) is, see theorem A.2.11. So on
a set of big enough probability, we can change Fj,δ for the more tractable
2n
sets Fj,δ
.
But as seen in equation 5.4.2 we are still stuck with a supremum over many
large sets. A way out is to firstly to condition on x and then use the finite
entropy for the discrete measures P2n (x), to reduce the supremum to a maximum over a finite set, every function lies in some closed ball around one
of the only finitely centers. And this will be done in a measurable way, so
if we are willing add a small error we can reduce the supremum to a finite
maximum.
The (conditional ) probability over that maximum will then be bounded,
using exponential inequalities, and such that the final bound will not depend
on the choice of the centers
Let T (yj ) = fj for yj fixed,
X 2n × Y → R : (x, y) 7→ (2n)−1
2n
X
l=1
(T (y) − T (yj )2 (xl ) − δ 2
5.4. POLLARD’S CENTRAL LIMIT THEORM.
91
is A2n ⊗ S measurable, so the set C :=
{(x, y) ∈ X
2n
× Y : (2n)
−1
2n
X
(T (y) − T (yj )2 (xl ) − δ 2 ∈] − ∞, 0[}
l=1
belongs to A2n ⊗ S. The function IC(x) (·) is S measurable, theorem A.2.17
of Tonelli–Fubini: C(x) := {y : (x, y) ∈ C} is S measurable. The function
2n
(Y, S) → Fj,δ
: y 7→ IC(x) (y)(T (y) − T (yj ))
2n
is onto (zero function lies also in Fj,δ
) and (z, y) 7→ IC(x) (y)(T (y)(z) −
T (yj )(z)) is A ⊗ S measurable.
We want to bound
∗
Pr{sup{|νn (f − g)| : f, g ∈ F, ρP (f, g) < δ} > }
if ρP (f, g) < δ then f − g ∈ Fg,δ := {f − g : f ∈ F, ρP (f, g) < δ} and
we start to bound it, by the symmetrization lemma ( 5.2.1 ), for η = /2 and
ζ = /4 like this:
∗
(1 − (/4)2 (2/)2 ) Pr{sup{|νn0 (f − g)| : f, g ∈ F, ρP (f, g) < δ} > /2}
∗
≥ Pr{sup{|νn (f − g)| : f, g ∈ F, ρP (f, g) < δ} > }
0
00
0
00
where νn0 := νn − νn and νn , νn independent and copies of νn . The probability involving the νn0 will be treated as follows: conditionally on
σ((X1 , · · · , X2n )) =: A2n and on a set of probability P ∞ converging to 1,
for , η > 0:
n h
n
Pr Pr sup |νn0 (f − g)| : f, g ∈ F,
Z
o
i
o
2
2
(f − g) dP2n < δ > 3η A2n > 3
< 3
for δ small enough, and n large enough, since then integrating out over
(X1 , · · · , X2n ) yields an upper bound for equation 5.4.2. We are allowed to
drop the star since the event:
(
)
Z
n
o
0
2
2
sup |νn (f − g)| : f, g ∈ F, (f − g) dP2n < δ > 3η
is measurable, for a proof we refer to lemma D.1.3.
92
CHAPTER 5. UNIFORM LIMIT THEOREMS.
ii) In this second step we go more into detail about the conditioning on x.
Given A2n , or that is given (X1 , · · · , X2n ) = (x1 , · · · , x2n ) := x, let
kf k2n := (P2n (f 2 ))1/2 . Let δi := 2−i ; i ≥ 1. Now we choose subsets
F(i, x); i ≥ 1 of F such that for all i and f ∈ F
min{kf − gk2n : g ∈ F(i, x)} ≤ δi kF k2n ,
with other words F(i, x) is such that the closed balls
B L2 (P2n (x) (g, δi kF k2n )
for g ∈ F(i, x) cover F. This can be done (entropy is finite) and in fact for
all x, i.e. for all P2n (x) ∈ Γ, where Γ is the set in definition 5.1.1, |F(i, x)|
(2)
can be chosen smaller than or equal to DF (δi , F), which is always finite.
But we can say more about the elements of F(i, x). For each i fixed, F(i, x)
can be written as
(x)
(x)
{gi,1 , · · · , gi,k(i,x) } = {T (yi,1 (x)), · · · , T (yi,k(i,x) (x))}
(x)
where, by lemma 5.3.3, gi,j = T (yi,j (x)), 1 ≤ j ≤ k(i, x) and
k(i, ·) : (X 2n , A2n ) → {0, · · · , D(2) (δi , F, P2n (x))} : x 7→ k(i, x)
yim (·) : (X 2n , A2n ) → Y : x 7→ yim (x)
with yim only defined for 1 ≤ m ≤ k(i, x), are universally measurable
functions.
For each f ∈ F, let fi := g := gim ∈ F(i, x) such that
kf − fi k2n = min{kf − gk2n : g ∈ F(i, x)},
(5.4.3)
and in case where for multiple gim the minimum in 5.4.3 is achieved, we
choose the one with minimal m. Let Ak be the σ–algebra of universally
measurable sets for probability measures defined on Ak . We claim that
(2)
m(·, ·, i) : X 2n × Y → {1, · · · , DF (δi , F)} : (x, y) 7→ m(x, y, i)
is A2n ⊗ S measurable (for i fixed).
(2)
Indeed, let Ap ; 1 ≤ p ≤ DF (δi , F) be
2n
n
X
(x)
−1
(T (y)(xl ) − gi,p (xl ))2 −
(x, y) : (2n)
l=1
2n
X
−1
δi (2n)
l=1
o
F (xl ) ∈] − ∞, 0]
5.4. POLLARD’S CENTRAL LIMIT THEORM.
93
Consider first {(x, y) ∈ X 2n × Y : m(x, y, i) = 1} = A1 and
−1 c
{m(x, y, i) = N } := (∩N
l=1 Al ) ∩ AN ∩ {N ≤ k(i, x)}. So it is enough to
prove that Ap is a measurable set.
P
So clearly, since F is measurable (x, y) 7→ δi (2n)−1 2n
l=1 F (xl )IY (y) is
A2 ⊗ S measurable, since F is Suslin image admisible
P
(x)
(x, y) 7→ (2n)−1 2n
l=1 (T (y)(xl ) is also A2n ⊗ S measurable. For gi,m consider the function below:
X 2n → (Y × X )2n
→R
(x)
((yi,m , x1 ), · · ·
x 7→
(x)
, (yi,m , x2n ))
−1
7→ (2n)
2n
X
(x)
T (yi,m )(xl );
l=1
(x)
(x)
(gi,m (u) = T (yi,m )(u)) which is measurable, as composition of two measurable functions. We can conclude that
(x)
(u, x, y) 7→ gi,m(x,y,i) (u)
is A2n+1 ⊗ S measurable. Hence
X 2n × Y → R : (x, y) 7→ νn0 (T (y) − T (y)i ),
(x)
with T (y)i = gi,m(x,y,i) , is A2n ⊗ S measurable (as in the begin of proof of
5.2.1 as a composition of the measurable function (T (y) − T (y)i ) and
(Xσ(1) , · · · , Xσ(n) , Xτ (1) · · · , Xτ (n) ) :
It equals a A2n ⊗ S measurable function G(x, y) for all x on a set of P 2n
measure one. As in the proof of corollary 4.2.2, x 7→ supy |G(x, y)| is a
universally measurable function. Thus for i fixed:
sup |νn0 (T (y) − T (y)i )| is P 2n –measurable in x.
(5.4.4)
y
For any f ∈ F, by our choice of fi :
kfi − f k2n ≤ δi kF k2n = 2−i kF k2n ,
so kfi − f k2n → 0, as i → ∞ and moreover for r any fixed positive integer
and xl ∈ {x1 , · · · , x2n }:
|(f −fr −(
p
X
fj −fj−1 ))(xl )| = |(f −fp )(xl )| ≤ (2n)2−p kF k2n (5.4.5)
j=r+1
so (f − fr )(xl ) =
P
r<j<+∞ (fj
− fj−1 )(xl ).
94
CHAPTER 5. UNIFORM LIMIT THEOREMS.
iii) In the third step we rewrite the integral condition into an equivalent series
condition, which we use to show existence of a related series. That related
series will be useful in later steps.
(2)
Further let Hj := log DF (δj (:= 2−i ), F), j ≥ 0, the condition 5.4.1 is
P
1/2
equivalent to the one for the infinite series j≥0 δj Hj ( δj = 2−j );
Z
1
(2)
(log DF (δ, F))1/2
2−j
XZ
dδ ≤
0
2−j−1
j≥0
2−j
XZ
≤
≤
≤
X
(2)
2−j−1
j≥0
X
(2)
(log DF (δ, F))1/2 dδ
(log DF (2−j−1 , F))1/2 dδ
1/2
2−(j+1) Hj+1
j≥0
1/2
2−j Hj
j≥0
(2)
since the entropy DF (δ, F) is a non increasing function in δ (the remark
just before the proof) and conversely:
X Z 2−j+1
X
(2)
−j 1/2
=
(log DF (2−j , F))1/2 dδ
2 Hj
2−j
j≥0
j≥0
X Z
≤
2
2−j
(2)
2−j−1
j≥0
X Z
≤
2
2−j
(2)
2−j−1
j≥0
1
Z
(log DF (2−j , F))1/2 dδ
(log DF (δ, F))1/2 dδ
(2)
(log DF (δ, F))1/2 dδ
= 2
0
For all $x$: $\operatorname{card}\mathcal{F}(j,x)\le D^{(2)}_F(\delta_j,\mathcal{F})\le\exp(H_j)$. Let
$$\eta_j:=\max\big(j\delta_j,\ (576\,P(F^2)\,\delta_j^2H_j)^{1/2}\big)>0$$
(since $j\delta_j>0$). In particular
$$\eta_j^2\ge576\,P(F^2)\,\delta_j^2H_j.\tag{5.4.6}$$
For this choice of $\eta_j$ we have $\sum_{j\ge1}\eta_j<+\infty$. Indeed, $\eta_j\le j\delta_j+(576\,P(F^2)\,\delta_j^2H_j)^{1/2}$, and $\sum_{j\ge1}j\delta_j<+\infty$: writing $j\delta_j=(j/\sqrt2^{\,j})\sqrt2^{\,-j}$, we have $j/\sqrt2^{\,j}\to0$ as $j\to\infty$ and $\sum_{j\ge1}\sqrt2^{\,-j}<+\infty$, so
$$\sum_{j\ge1}j\delta_j\le\sum_{1\le j\le N_0}\frac{j}{\sqrt2^{\,j}}\sqrt2^{\,-j}+\sqrt2^{\,-N_0}\sum_{j\ge1}\sqrt2^{\,-j}<+\infty$$
if $j/\sqrt2^{\,j}<1$ for all $j\ge N_0$. The second term also gives a convergent series:
$$\sum_{j\ge1}(576\,P(F^2)\,\delta_j^2H_j)^{1/2}=24(P(F^2))^{1/2}\sum_{j\ge1}\delta_jH_j^{1/2}<+\infty.$$
We also have
$$\sum_{j\ge1}\exp\big(-2\eta_j^2/(576\,\delta_j^2P(F^2))\big)\le\sum_{j\ge1}\exp\big(-j^2/(288\,P(F^2))\big)<+\infty,\tag{5.4.7}$$
because $-j^2\delta_j^2\ge-\eta_j^2$.
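As a purely numerical sanity check of the series just discussed, the following short Python sketch (illustrative only; the entropy numbers $H_j$, coming from an assumed polynomial entropy bound $D(\delta)=K\delta^{-2w}$, and the constants $K$, $w$, $P(F^2)$ are assumptions, not part of the proof) verifies that with $\delta_j=2^{-j}$ the sums $\sum\delta_jH_j^{1/2}$, $\sum\eta_j$ and the exponential series 5.4.7 are all finite.

```python
import math

# Assumed constants (illustration only): D(delta) = K * delta**(-2w), so H_j = log K + 2*w*j*log 2
K, w, PF2 = 2.0, 1.5, 1.0   # PF2 plays the role of P(F^2)

def H(j):
    return math.log(K) + 2 * w * j * math.log(2)

def delta(j):
    return 2.0 ** (-j)

def eta(j):
    # eta_j = max(j*delta_j, sqrt(576 * P(F^2) * delta_j^2 * H_j))
    return max(j * delta(j), math.sqrt(576 * PF2 * delta(j) ** 2 * H(j)))

J = 200  # truncation level; all tails are (super)geometrically small
s_entropy = sum(delta(j) * math.sqrt(H(j)) for j in range(J))
s_eta = sum(eta(j) for j in range(1, J))
s_exp = sum(math.exp(-2 * eta(j) ** 2 / (576 * delta(j) ** 2 * PF2)) for j in range(1, J))

print("sum delta_j * H_j^(1/2) ~", round(s_entropy, 4))
print("sum eta_j               ~", round(s_eta, 4))
print("sum exp(-2 eta_j^2/...) ~", round(s_exp, 4))
```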
iv) As announced in the first step, we will bound the supremum, using that it in fact reduces to a maximum; the exponential series of the third step will then come in handy. We turn our attention to
$$\Pr\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\sum_{j>r}\eta_j\,\Big\|\,\mathcal{A}_{2n}\Big\},\tag{5.4.8}$$
where in equation 5.4.4 and the discussion above we showed that $\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|$ is measurable for $r$ fixed.
Since, for $x$ given, $f-f_r=\sum_{j>r}(f_j-f_{j-1})$ pointwise on $\{x_1,\dots,x_{2n}\}$, as seen in equation 5.4.5, it follows that
$$\begin{aligned}
\nu_n^0(f-f_r)&=n^{1/2}P_n^0(f-f_r)=n^{1/2}\big(P_n'(f-f_r)-P_n''(f-f_r)\big)\\
&=n^{1/2}\Big(n^{-1}\sum_{p=1}^n\delta_{X_{\sigma(p)}}(f-f_r)-n^{-1}\sum_{p=1}^n\delta_{X_{\tau(p)}}(f-f_r)\Big)\\
&=n^{1/2}\Big(n^{-1}\sum_{p=1}^n\delta_{X_{\sigma(p)}}\Big(\sum_{j>r}f_j-f_{j-1}\Big)-n^{-1}\sum_{p=1}^n\delta_{X_{\tau(p)}}\Big(\sum_{j>r}f_j-f_{j-1}\Big)\Big)\\
&=\sum_{j>r}n^{1/2}\Big(n^{-1}\sum_{p=1}^n\delta_{X_{\sigma(p)}}(f_j-f_{j-1})-n^{-1}\sum_{p=1}^n\delta_{X_{\tau(p)}}(f_j-f_{j-1})\Big)=\sum_{j>r}\nu_n^0(f_j-f_{j-1}),
\end{aligned}$$
and this because $(X_{\sigma(1)},\dots,X_{\sigma(n)},X_{\tau(1)},\dots,X_{\tau(n)})$ is a permutation of $(X_1,\dots,X_{2n})=x$. Also, given $x$:
$$\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\sum_{j>r}\eta_j\Big\}
\subset\Big\{\sup_{f\in\mathcal{F}}\Big|\nu_n^0\Big(\sum_{j>r}f_j-f_{j-1}\Big)\Big|>\sum_{j>r}\eta_j\Big\}
\subset\Big\{\sup_{f\in\mathcal{F}}\sum_{j>r}|\nu_n^0(f_j-f_{j-1})|>\sum_{j>r}\eta_j\Big\}
\subset\Big\{\sum_{j>r}\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\sum_{j>r}\eta_j\Big\},$$
and since
$$\bigcap_{j>r}\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|\le\eta_j\Big\}\subset\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|\le\sum_{j>r}\eta_j\Big\},$$
we have got that
$$\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\sum_{j>r}\eta_j\Big\}\subset\bigcup_{j>r}\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\eta_j\Big\}.$$
So the term in equation 5.4.8 can be bounded:
$$\begin{aligned}
&=E\Big[I_{\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\sum_{j>r}\eta_j\}}\,\Big\|\,\mathcal{A}_{2n}\Big]\\
&\le E\Big[I_{\bigcup_{j>r}\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\eta_j\}}\,\Big\|\,\mathcal{A}_{2n}\Big]
\le E\Big[\sum_{j>r}I_{\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\eta_j\}}\,\Big\|\,\mathcal{A}_{2n}\Big]\\
&\le\sum_{j>r}E\Big[I_{\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\eta_j\}}\,\Big\|\,\mathcal{A}_{2n}\Big]
\le\sum_{j>r}\Pr\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})|>\eta_j\,\Big\|\,\mathcal{A}_{2n}\Big\},
\end{aligned}$$
which in turn is bounded by
$$\sum_{j>r}\exp(H_j+H_{j-1})\sup_{f\in\mathcal{F}}\Pr\big\{|\nu_n^0(f_j-f_{j-1})|>\eta_j\,\big\|\,\mathcal{A}_{2n}\big\}.\tag{5.4.9}$$
This can be seen as follows: conditionally on $x$ (thus the first $2n$ coordinates of $x^\infty$ are fixed),
$$\Big\{x^\infty:\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})(x^\infty)|>\eta_j\Big\}
=\bigcup_{l=1}^{k(j,x)}\bigcup_{q=1}^{k(j-1,x)}\big\{x^\infty:|\nu_n^0(g_{j,l}-g_{j-1,q})(x^\infty)|>\eta_j\big\}.$$
Hence
$$\begin{aligned}
\Pr\Big\{x^\infty:\sup_{f\in\mathcal{F}}|\nu_n^0(f_j-f_{j-1})(x^\infty)|>\eta_j\,\Big\|\,\mathcal{A}_{2n}\Big\}
&\le\sum_{l=1}^{k(j,x)}\sum_{q=1}^{k(j-1,x)}\Pr\big\{x^\infty:|\nu_n^0(g_{j,l}-g_{j-1,q})(x^\infty)|>\eta_j\,\big\|\,\mathcal{A}_{2n}\big\}\\
&\le\exp(H_j+H_{j-1})\sup_{f\in\mathcal{F}}\Pr\big\{x^\infty:|\nu_n^0(f_j-f_{j-1})(x^\infty)|>\eta_j\,\big\|\,\mathcal{A}_{2n}\big\},
\end{aligned}$$
since $k(j,x)\le\exp(H_j)$.
In the following lines we will use an exponential inequality to get a bound on the (conditional) probability, and then give a second bound not depending on the class $\mathcal{F}$ anymore. Let, for fixed $j(>r)$ and $f\in\mathcal{F}$,
$$z_i:=(f_j-f_{j-1})(x_{2i})-(f_j-f_{j-1})(x_{2i-1}),$$
and let $e_i:=I_{\{\sigma_i=2i-1\}}$, $1\le i\le n$; the $(-1)^{e_i}$ are then random variables taking the values $1$ and $-1$ with probability $1/2$ each, i.e. Rademacher random variables. Then
$$\nu_n^0(f_j-f_{j-1})=n^{-1/2}\sum_{i=1}^n(-1)^{e_i}z_i.$$
So we can apply Hoeffding's inequality D.2.2 (in the last step):
$$\begin{aligned}
\Pr\big\{|\nu_n^0(f_j-f_{j-1})|>\eta_j\,\big\|\,\mathcal{A}_{2n}\big\}
&=\Pr\Big\{\Big|n^{-1/2}\sum_{i=1}^n(-1)^{e_i}z_i\Big|>\eta_j\Big\}\\
&\le\Pr\Big\{\sum_{i=1}^n(-1)^{e_i}z_i\ge\eta_jn^{1/2}\Big\}+\Pr\Big\{\sum_{i=1}^n(-1)^{e_i}z_i\le-\eta_jn^{1/2}\Big\}\\
&\le2\exp\Big(-\tfrac12\,n\eta_j^2\Big/\sum_{i=1}^nz_i^2\Big).
\end{aligned}$$
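To make the role of Hoeffding's inequality above concrete, here is a small Python simulation (an illustrative sketch: the fixed values $z_i$ and the threshold are assumptions chosen for the example, not the thesis' construction) comparing the empirical tail probability of $n^{-1/2}\sum(-1)^{e_i}z_i$ with the bound $2\exp(-\tfrac12 n\eta^2/\sum z_i^2)$.

```python
import math, random

random.seed(0)
n = 50
z = [math.sin(i) for i in range(1, n + 1)]   # assumed fixed increments z_i
sum_z2 = sum(zi * zi for zi in z)
eta = 1.0
trials = 20000

def symmetrized_sum():
    # n^{-1/2} * sum_i (-1)^{e_i} z_i with independent Rademacher signs
    return sum(random.choice((-1, 1)) * zi for zi in z) / math.sqrt(n)

hits = sum(abs(symmetrized_sum()) > eta for _ in range(trials))
empirical = hits / trials
hoeffding = 2 * math.exp(-0.5 * n * eta ** 2 / sum_z2)

print("empirical tail P(|nu| > eta) ~", empirical)
print("Hoeffding bound              ~", hoeffding)   # always dominates the empirical tail
```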
Now, still for fixed $j$, we bound $\sum_{i=1}^nz_i^2$ uniformly for $f\in\mathcal{F}$ in order to obtain a bound for $\Pr\{|\nu_n^0(f_j-f_{j-1})|>\eta_j\,\|\,\mathcal{A}_{2n}\}$:
$$\begin{aligned}
\sum_{i=1}^nz_i^2&=\sum_{i=1}^n\big((f_j-f_{j-1})(x_{2i})-(f_j-f_{j-1})(x_{2i-1})\big)^2\\
&=\sum_{i=1}^n\Big(((f_j-f_{j-1})(x_{2i}))^2-2(f_j-f_{j-1})(x_{2i})(f_j-f_{j-1})(x_{2i-1})+((f_j-f_{j-1})(x_{2i-1}))^2\Big)\\
&\le\sum_{i=1}^n2\Big(((f_j-f_{j-1})(x_{2i}))^2+((f_j-f_{j-1})(x_{2i-1}))^2\Big)\\
&\le2(2n)P_{2n}((f_j-f_{j-1})^2)=4n\|f_j-f_{j-1}\|_{2n}^2
\le4n(\|f_j-f\|_{2n}+\|f-f_{j-1}\|_{2n})^2\\
&\le4n(\delta_j\|F\|_{2n}+\delta_{j-1}\|F\|_{2n})^2=4n\|F\|_{2n}^2(\delta_j+\delta_{j-1})^2
\le4n\|F\|_{2n}^2(3\delta_j)^2=36n\delta_j^2\|F\|_{2n}^2,
\end{aligned}$$
which is at most $72n\delta_j^2P(F^2)$ on the set $B_n:=\{\|F\|_{2n}^2\le2P(F^2)\}$. The probability of $B_n$ goes to $1$, since, by the strong law of large numbers (theorem A.2.20),
$$\|F\|_{2n}^2=P_{2n}(F^2)=(2n)^{-1}\sum_{j=1}^{2n}F^2(X_j)\to P(F^2)$$
$P^\infty$–almost surely, as $n\to\infty$.
v) We have already achieved a lot; we now bring the results obtained so far together. On the set $B_n$, equation 5.4.8 is bounded by equation 5.4.9, which is smaller than
$$\sum_{j>r}\exp(H_j+H_{j-1})\sup_{f\in\mathcal{F}}2\exp\Big(-\tfrac12n\eta_j^2\Big/\sum_{i=1}^nz_i^2\Big)
\le\sum_{j>r}\exp(2H_j)\sup_{f\in\mathcal{F}}2\exp\Big(-\tfrac12n\eta_j^2\Big/\sum_{i=1}^nz_i^2\Big)
\le\sum_{j>r}\exp(2H_j)\,2\exp\Big(\frac{-\eta_j^2}{144\,\delta_j^2P(F^2)}\Big).$$
The last bound equals the first expression below, and by equation 5.4.6 ($\eta_j^2\ge576\,P(F^2)\delta_j^2H_j$), used in the second step,
$$2\sum_{j>r}\exp\Big(\frac{2H_j\,288\,\delta_j^2P(F^2)}{288\,\delta_j^2P(F^2)}\Big)\exp\Big(\frac{-2\eta_j^2}{288\,\delta_j^2P(F^2)}\Big)
\le2\sum_{j>r}\exp\Big(\frac{\eta_j^2}{288\,\delta_j^2P(F^2)}\Big)\exp\Big(\frac{-2\eta_j^2}{288\,\delta_j^2P(F^2)}\Big)
=2\sum_{j>r}\exp\Big(\frac{-\eta_j^2}{288\,\delta_j^2P(F^2)}\Big),$$
and by equation 5.4.7 this last series converges, so for $r$ large enough
$$2\sum_{j>r}\exp\Big(\frac{-\eta_j^2}{288\,\delta_j^2P(F^2)}\Big)<\epsilon.$$
We also chose our $\eta_j$ such that $\sum_{j\ge1}\eta_j$ is convergent. So for some $r$ large enough $\sum_{j>r}\eta_j<\eta$, and almost surely on $B_n$
$$\Pr\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\eta\,\Big\|\,\mathcal{A}_{2n}\Big\}
\le\Pr\Big\{\sup_{f\in\mathcal{F}}|\nu_n^0(f-f_r)|>\sum_{j>r}\eta_j\,\Big\|\,\mathcal{A}_{2n}\Big\}\le\epsilon.$$
vi) This small part is similar to step 4, where we used an exponential inequality to bound a certain probability uniformly in $\mathcal{F}$. If $\|f-g\|_{2n}^2\le\delta_r^2P(F^2)$ and if $\|F\|_{2n}^2\le2P(F^2)$, i.e. $x\in B_n$, then by the definition of $f_r$:
$$\|f_r-g_r\|_{2n}\le\|f_r-f\|_{2n}+\|f-g\|_{2n}+\|g-g_r\|_{2n}
\le\delta_r\|F\|_{2n}+\delta_r(P(F^2))^{1/2}+\delta_r\|F\|_{2n}
\le\delta_r(P(F^2))^{1/2}+2\delta_r\sqrt2(P(F^2))^{1/2}
\le4\delta_r(P(F^2))^{1/2}.$$
By an argument as in the fourth step (measurability follows by arguments as in the second step):
$$\Pr\big\{\sup\{|\nu_n^0(f_r-g_r)|:\|f-g\|_{2n}<\delta,\ f,g\in\mathcal{F}\}>\eta\,\big\|\,\mathcal{A}_{2n}\big\}
\le(\operatorname{card}\mathcal{F}(r,x))^2\,\sup\big\{\Pr\{|\nu_n^0(f_r-g_r)|>\eta\,\|\,\mathcal{A}_{2n}\}:\|f-g\|_{2n}<\delta,\ f,g\in\mathcal{F}\big\},$$
where $(\operatorname{card}\mathcal{F}(r,x))^2\le\big(D^{(2)}_F(\delta_r,\mathcal{F})\big)^2=\exp(2H_r)$. And again by Hoeffding's inequality D.2.2, as in the fourth step, here applied to
$$\nu_n^0(f_r-g_r)=n^{-1/2}\sum_{i=1}^n(-1)^{e_i}z_i,\qquad z_i:=(f_r-g_r)(x_{2i})-(f_r-g_r)(x_{2i-1})$$
(see also step 4 for the definition of the random variables $e_i$), we have, since $\sum_{i=1}^nz_i^2\le4n\|f_r-g_r\|_{2n}^2$, that the last probability is
$$\le(\operatorname{card}\mathcal{F}(r,x))^2\,2\exp\big(-\eta^2/[8\sup\|f_r-g_r\|_{2n}^2]\big)
\le2\exp(2H_r)\exp\Big(\frac{-\eta^2}{8\cdot4^2\,\delta_r^2P(F^2)}\Big)
=2\exp\Big(\frac{(2H_r)(128\,\delta_r^2P(F^2))-\eta^2}{128\,\delta_r^2P(F^2)}\Big)
\le2\exp\Big(\frac{-\eta^2}{256\,\delta_r^2P(F^2)}\Big)$$
if $\eta^2\ge512\,\delta_r^2H_rP(F^2)$, and for $\delta_r$ small enough the last expression will be less than $\epsilon$. Note also that we showed in the third step (at the end) that $\sum_j\delta_jH_j^{1/2}$ is convergent; hence its general term goes to zero, and since $(\cdot)^2$ is continuous, $\delta_j^2H_j\to0$ as $j\to\infty$. So for any $\eta>0$ we can find an $r$ large enough so that $512\,\delta_r^2H_rP(F^2)<\eta^2$ and the exponential is smaller than some given $\epsilon$, and this holds for all $x\in B_n$, where the $B_n$ are sets whose probability converges to one, as seen at the end of the fourth step.
vii) Finally,
$$\sup\{|\nu_n^0(f-g)|:\|f-g\|_{2n}<\delta\}
\le2\sup\{|\nu_n^0(f-f_r)|:\|f-g\|_{2n}<\delta\}+\sup\{|\nu_n^0(f_r-g_r)|:\|f-g\|_{2n}<\delta\}.$$
Hence, by subadditivity of $\Pr$,
$$\begin{aligned}
\Pr\big\{\sup\{|\nu_n^0(f-g)|:\|f-g\|_{2n}<\delta\}>2\eta\,\big\|\,\mathcal{A}_{2n}\big\}
&\le\Pr\big\{2\sup\{|\nu_n^0(f-f_r)|:\|f-g\|_{2n}<\delta\}>\eta\,\big\|\,\mathcal{A}_{2n}\big\}\\
&\quad+\Pr\big\{\sup\{|\nu_n^0(f_r-g_r)|:\|f-g\|_{2n}<\delta\}>\eta\,\big\|\,\mathcal{A}_{2n}\big\}.
\end{aligned}$$
In the sixth step we showed that the second term can be made small if we let $r$ be large, and at the end of the fifth step, again for $r$ large, we showed that the first term can be made small. In parts 4 and 6 we noted that all the events appearing here are measurable.
We now give two corollaries for special classes F, the second of which is also due to Pollard. The first states that classes built from indicators of sets in a Suslin image admissible Vapnik–C̆ervonenkis class (with the measurability seen in chapter 5, section 2) are Donsker classes. The second says that VC subgraph classes, with an additional measurability property, are also Donsker classes.
Corollary 5.4.2. Let (X , A, P ) be a probability space, F ∈ L2 (X , A, P ) and
F := {F IC : C ∈ C} where C is a Suslin image admissible Vapnik–C̆ervonenkis
class of sets. Then F is a P –Donsker class.
Proof. Clearly F is measurable and if C is Suslin image admissible via some
(Y, S, T ), i.e. X × Y → R : (x, y) 7→ T (y)(x) is jointly measurable and T :
Y → C is onto, then Y → F : y 7→ F T (y) is onto, X × Y → R : (x, y) 7→
F (x)T (y)(x) is still jointly measurable. And thus F is Suslin image admissible.
By proposition 5.4.3, the entropy of F satisfies the condition of equation 5.4.1.
Indeed, let $p=2$, $K<+\infty$ and $\operatorname{dens}(\mathcal{C})<\omega<+\infty$. Then
$$\int_0^1\big(\log D^{(2)}_F(x,\mathcal{F})\big)^{1/2}dx
\le\int_0^1\big(\log(Kx^{-p\omega})\big)^{1/2}dx
=\int_0^1\big(\log(Kx^{-2\omega})\big)^{1/2}dx
=\int_0^1\big(\log K+\log(x^{-2\omega})\big)^{1/2}dx.$$
Since $(a+b)^{1/2}\le a^{1/2}+b^{1/2}$ for $a,b\ge0$, this is
$$\le(\log K)^{1/2}+\int_0^1\big(\log(x^{-2\omega})\big)^{1/2}dx
=c+\sqrt{2\omega}\int_0^1\big(\log(1/x)\big)^{1/2}dx
=c+\sqrt{2\omega}\int_1^{+\infty}(\log u)^{1/2}u^{-2}du,$$
where $c:=(\log K)^{1/2}$ and where we put $1/x=u$. Continuing and putting $\log u=t$,
$$=c+\sqrt{2\omega}\int_0^{+\infty}t^{1/2}\exp(-2t)\exp(t)\,dt=c+\sqrt{2\omega}\int_0^{+\infty}t^{1/2}e^{-t}dt.$$
Recall that the gamma function $\Gamma(z)$ has the integral representation $\Gamma(z)=\int_0^{+\infty}t^{z-1}e^{-t}dt$ on the right half plane $\{z\in\mathbb{C}:\Re(z)>0\}$ (proposition IV.1.1 on page 193 in [FreitagAndBusam]), satisfies the functional equation
$$\Gamma(z+1)=z\Gamma(z),\qquad\text{for all }z\in\mathbb{C}\setminus\{0,-1,-2,\dots\},$$
by proposition IV.1.2 on page 195 in [FreitagAndBusam], and, by proposition IV.1.11 on page 201 in [FreitagAndBusam],
$$\Gamma(z)\Gamma(1-z)=\frac{\pi}{\sin(\pi z)}\qquad\text{for all }z\in\mathbb{C}\setminus\mathbb{Z}.$$
So we conclude that
$$\int_0^{+\infty}t^{1/2}e^{-t}dt=\Gamma(3/2)=\tfrac12\Gamma(1/2)=\tfrac12\sqrt\pi.$$
Hence
$$\int_0^1\big(\log D^{(2)}_F(x,\mathcal{F})\big)^{1/2}dx\le\sqrt{\log K}+\sqrt{\frac{\omega\pi}{2}},$$
where the latter term is finite, and so theorem 5.4.2 applies, giving the result.
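The value $\Gamma(3/2)=\tfrac12\sqrt\pi$ used above, and the fact that the bound $\sqrt{\log K}+\sqrt{\omega\pi/2}$ dominates the entropy integral, can be checked numerically; the following Python sketch is illustrative only, with assumed values of $K$ and $\omega$.

```python
import math

# Gamma(3/2) versus (1/2) * sqrt(pi)
print(math.gamma(1.5), 0.5 * math.sqrt(math.pi))

# Assumed constants for the illustration (not from the thesis):
K, w = 3.0, 2.0

def integrand(x):
    # sqrt(log(K * x**(-2w))) = sqrt(log K - 2*w*log x)
    return math.sqrt(math.log(K) - 2 * w * math.log(x))

# crude midpoint rule on (0, 1]; the integrand is integrable at 0
N = 200000
integral = sum(integrand((i + 0.5) / N) for i in range(N)) / N

bound = math.sqrt(math.log(K)) + math.sqrt(w * math.pi / 2)
print("entropy integral ~", round(integral, 4), "  bound ~", round(bound, 4))
```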
Proposition 5.4.3. Let $(\mathcal{X},\mathcal{A},P)$ be a probability space and $\mathcal{C}\subset\mathcal{A}$ with $\operatorname{dens}(\mathcal{C})<+\infty$, and let $F\in L^p(\mathcal{X},\mathcal{A},P)$ for $p\in[1,+\infty[$ with $F\ge0$. If $\mathcal{F}:=\{FI_C:C\in\mathcal{C}\}$, then for any $\omega>\operatorname{dens}(\mathcal{C})$ there is a constant $0<K<+\infty$ such that
$$D^{(p)}_F(\delta,\mathcal{F})\le K\delta^{-p\omega},\qquad\delta\in\,]0,1].$$
Proof. $D^{(p)}_F(\delta,\mathcal{F})=\sup\{D^{(p)}_F(\delta,\mathcal{F},\gamma):\gamma\in\Gamma\}$, so let $\gamma\in\Gamma$ and let $G$ be the smallest set in $\mathcal{X}$ with $\gamma(G)=1$. If $F(x)=0$ for each $x\in G$, then $D^{(p)}_F(\delta,\mathcal{F},\gamma)=1$, since otherwise, by the definition
$$D^{(p)}_F(\delta,\mathcal{F},\gamma):=\sup\Big\{m:f_1,\dots,f_m\in\mathcal{F}\ \text{and for all }i\ne j,\ \int|f_i-f_j|^p\,d\gamma>\delta^p\int F^p\,d\gamma\Big\},$$
we would have, for $f_i,f_j$ different functions,
$$0=\delta^p\int F^p\,d\gamma<\int|f_i-f_j|^p\,d\gamma=(\|f_i-f_j\|_{L^p(\gamma)})^p\le(\|f_i\|_{L^p(\gamma)}+\|f_j\|_{L^p(\gamma)})^p\le2^p(\|F\|_{L^p(\gamma)})^p=0,$$
a contradiction. But if $D^{(p)}_F(\delta,\mathcal{F},\gamma)=1$, then for any $K\ge1$
$$D^{(p)}_F(\delta,\mathcal{F},\gamma)\le K\delta^{-p\omega},\qquad0<\delta\le1.$$
We may therefore suppose that $F(x)>0$ for some $x\in G$. Let $C(1),\dots,C(m)\in\mathcal{C}$, with $m$ maximal, be such that
$$(\|FI_{C(i)}-FI_{C(j)}\|_{L^p(\gamma)})^p>\delta^p\int F^p\,d\gamma$$
for all $i\ne j$. Let $Q=Q_\gamma$ be the probability measure defined by
$$B\mapsto\int_BF^p\,d\gamma\Big/\gamma(F^p).$$
Then, since $|FI_{C(i)}-FI_{C(j)}|^p=F^pI_{C(i)\Delta C(j)}$ (the indicators differ exactly on $C(i)\setminus C(j)$ and $C(j)\setminus C(i)$),
$$Q(C(i)\Delta C(j))=\int_{C(i)\Delta C(j)}F^p\,d\gamma\Big/\gamma(F^p)>\delta^p.$$
Then by maximality and theorem 3.2.1 there is a strictly positive constant $K<+\infty$, depending only on $\omega$ and $\mathcal{C}$, such that
$$D^{(p)}_F(\delta,\mathcal{F},\gamma)\le m\le D(\delta^p,\mathcal{C},d_Q)\le K(\omega,\mathcal{C})\,\delta^{-p\omega},$$
where $d_Q(A,B):=Q(A\Delta B)$ for $A,B\in\mathcal{A}$, as in definition 3.2.1.
Here is the second corollary, as announced just after the central limit theorem (theorem 5.4.2).
Corollary 5.4.3 (Pollard). Let (X , A, P ) be a probability space and let F be a
Suslin image admissible Vapnik–C̆ervonenkis subgraph class of functions with
envelope F ∈ L2 (X , A, P ). Then F is a P –Donsker class.
Proof. This is an immediate consequence of theorem 5.4.2 and theorem 3.2.2 for $p=2$. Indeed, by theorem 3.2.2, $D^{(2)}_F(\epsilon,\mathcal{F})$ is bounded by $A(2/\epsilon^2)^W$ for any $W\ge S(\mathcal{C})$ and some constant $A=A(S(\mathcal{C}),W)<+\infty$. By calculations similar to those carried out in corollary 5.4.2 one shows that $\int_0^1\big(\log D^{(2)}_F(\epsilon,\mathcal{F})\big)^{1/2}d\epsilon$ is finite, so that theorem 5.4.2 applies.
Appendix A
Topology and Measure Theory.
A.1 Metric and topological spaces.
A.1.1 Definitions.
Definition A.1.1. A pair (S, T ), where S is a set and T a class of subsets of S,
which satisfies:
i) S and ∅ belong to T ,
ii) whenever O1 , O2 ∈ T , then O1 ∩ O2 ∈ T ,
iii) let I be any (index)set and A = {Oi | Oi ∈ T , i ∈ I}, then ∪A = ∪i∈I Oi ∈
T too.
is said to be a topological space. T is called a topology on S.
Let S be any set. A function
d : S × S → R+ : (s, r) 7→ d(s, r)
is said to be a metric iff it has the properties:
M1) d(s, r) = 0 ⇐⇒ s = r,
M2) d(s, r) = d(r, s),
M3) d(s, t) ≤ d(s, r) + d(r, t)
for all s, r, t in S. The pair (S, d) is called a metric space .
A topological space (S, T ) is said to be separable iff there is a countable set
D ⊂ S such that for any s ∈ S and s ∈ O ∈ T : O ∩ D 6= ∅, i.e. the set D is
dense.
105
106
APPENDIX A. TOPOLOGY AND MEASURE THEORY.
A subclass B of a topology T is said to be a base for T iff every O ∈
T is a union of elements of B. A topological space (S, T ) is said to be second–countable or A2 iff T has a countable base.
There are many other ways to introduce a topology on a set. We will discuss
two of them which are used in this text.
For the first characterization we need the concept of closure (operator).
Definition A.1.2. Let (S, T ) be a topological space and F ⊂ S: F is said to be
closed in S iff S\F ∈ T , i.e. the complement of F is open.
Let (S, T ) be a topological space and E ⊂ S, the closure of E in S is the set
E = ClS (E) = ∩{F ⊂ S : F is closed and E ⊂ F }.
Theorem A.1.1. The operation $A\mapsto\overline{A}$ on a topological space $(S,\mathcal{T})$ satisfies the following properties:
a) $E\subset\overline{E}$;
b) $\overline{\overline{E}}=\overline{E}$;
c) $\overline{A\cup B}=\overline{A}\cup\overline{B}$;
d) $\overline{\emptyset}=\emptyset$;
e) $E$ is closed in $S$ iff $E=\overline{E}$.
Moreover, given a set $S$ and a mapping $A\mapsto\overline{A}$ of $\mathcal{P}(S)$ into $\mathcal{P}(S)$ satisfying (a) through (d), if we define closed sets in $S$ by condition (e), then $S$ becomes a topological space and its closure operator is the same operation we started with.
Proof. We refer to [Will] chapter 1 theorem 3.7 on page 25 for a proof.
The second characterization is by means of the neighborhoods of points.
Definition A.1.3. Let (S, T ) be a topological space and let s ∈ S, a neighborhood of s is a set U which contains an open set V , i.e. V ∈ T , containing s.
The collection of all neighborhoods of the point s will be denoted by V(s) and is
called a neighborhood system at s.
Theorem A.1.2. Let (S, T ) be a topological space and s ∈ S; the neighborhood
system V(s) at s has the following properties:
a) if U ∈ V(s), then s ∈ U ;
b) for all U, V ∈ V(s): U ∩ V ∈ V(s);
c) for all U ∈ V(s) there exists a V ∈ V(s): U ∈ V(t) for all t ∈ V ;
d) for all U ∈ V(s) and V ⊂ S: if U ⊂ V , then V ∈ V(s);
e) G ⊂ S is open iff G contains a neighborhood of each of its points.
Conversely, if in a set S, a collection V(s) of subsets of S is assigned to each
s ∈ S, so as to satisfy (a) through (d), and if (e) is used to define “open“ sets,
the result is a topology on S, in which the neighborhood system at each s ∈ S is
precisely V(s).
Proof. We refer to [Will] chapter 1 theorem 4.2 on page 31 for a proof.
Here is a lemma which gives the relation between separable and
second–countable topological spaces.
Lemma A.1.3. Let (S, T ) be a second–countable topological space, then S is
separable. If T is induced by a metric, i.e. T is the topology generated by the
metric d, and S is separable, then S is also second–countable.
Proof. For the first assertion let B be a countable base, from each non–empty
B ∈ B pick some b ∈ B. The set D := {b : b ∈ B} is at most countable, and
its intersection with any open set is non void, because it contains an element from
each set of the base of the topology. Thus D is dense.
If T is induced by a metric and D is an at most countable dense set, then define
B := {B(b, 1/n) : b ∈ D, n ≥ 1}.
If O is open, then O = ∪_{b∈O∩D} B(b, 1/n_b), with 1/n_b as big as possible such that B(b, 1/n_b) ⊂ O. Indeed, for every a ∈ O: B(a, 1/m) ⊂ O for some m := m(a). But D is dense, so D ∩ B(a, 1/(2m)) ≠ ∅. Thus for some b ∈ D ∩ O: a ∈ B(b, 1/(2m)) ⊂ O and B(b, 1/(2m)) ⊂ B(b, 1/n_b).
We give here a definition of the important Hausdorff topological space.
Definition A.1.4. Let (S, T ) be a topological space. (S, T ) is called a Hausdorff
topological space iff for every x, y ∈ S with x ≠ y, there exist open neighbourhoods
Vx ∈ V(x) and Vy ∈ V(y) such that Vx ∩ Vy = ∅.
Now we come to the definition of compact topological spaces.
Definition A.1.5. Let (K, T ) be a topological space. (K, T ) is called a compact
topological space iff for every collection of open sets that covers K, there exists
a finite open subcover.
There is an equivalent characterization in terms of (ultra)filters. Recall that a
collection W of subsets of a set X, is said to be a filter, iff it satisfies
a) W ≠ ∅ and ∅ ∉ W;
b) for all W, Z ⊂ X: if W ∈ W, W ⊂ Z, then Z ∈ W;
c) for all W, Z ∈ W: W ∩ Z ∈ W.
A filter U, which is maximal in the class of filters, i.e. for any other filter M such
that U ⊂ M, U = M, is said to be an ultrafilter . Let (S, T ) be a topological
space, a filter W is said to converge to s ∈ S iff V(s) ⊂ W. Now we can state an
equivalent condition for compactness of S, S is said to be a compact topological
space iff every ultrafilter U on S converges.
For a proof of the equivalence between these two apparently different defini-
on page 36.
Proposition A.1.4. Let (K, T ) be a compact topological space and (S, T ′) any topological space. If g is a continuous function from K into S, then g(K) is compact.
Proof. Let {Ui : i ∈ I}, I index set, be an open cover of g(K). Then
{g −1 (Ui ) : i ∈ I} is an open cover of K. Hence there exists a finite open
subcover {g −1 (Uj ) : j ∈ J} that covers K. Because
g(g −1 (Uj )) ⊂ Uj
so
g(K) ⊂ ∪j∈J Uj .
Two propositions about compactness and the Hausdorff property.
Proposition A.1.5. Let (S, T ) be a compact topological space. Then any closed
subset F of S is compact. If S is Hausdorff, then a compact subspace of S is
closed.
Proof. Let F be closed in S. Let {Ui : i ∈ I} be open sets such that F ⊂ ∪i∈I Ui .
Then S ⊂ (∪i∈I Ui ) ∪ (S\F ), since S is compact there exists a finite
J ⊂ I : S ⊂ (∪i∈J Ui ) ∪ (S\F ),
but then F ⊂ (∪i∈J Ui ) hence compact.
Let K be a compact set contained in S, let k ∈ K̄ and suppose k ∉ K. It is then enough to show that we can separate {k} and K by disjoint open neighbourhoods, which would contradict k ∈ K̄. By the Hausdorff property, for each l ∈ K there exist disjoint open neighbourhoods V_l ∈ V(l) and W_l ∈ V(k) of l and of k respectively. Then K ⊂ ∪_{l∈K} V_l and by compactness there exists a finite subset L of K such that K ⊂ ∪_{l∈L} V_l. The sets V := ∪_{l∈L} V_l and W := ∩_{l∈L} W_l remain open. The latter is an open neighbourhood of {k} which has empty intersection with K:
$$W\cap K\subset W\cap\bigcup_{l\in L}V_l\subset\bigcup_{l\in L}(W\cap V_l)\subset\bigcup_{l\in L}(W_l\cap V_l)=\emptyset.$$
Proposition A.1.6. Let (S, T ) be a Hausdorff topological space, K ⊂ S compact and y ∈ S\K. Then there exist disjoint (open) neighborhoods of y and of K.
Proof. Let k ∈ K; since y ≠ k, by the Hausdorff property we choose disjoint open neighborhoods V_k of k and V_y^k of y. Then K ⊂ ∪_{k∈K} V_k and by compactness there exists a finite L ⊂ K with K ⊂ ∪_{k∈L} V_k. Let V_K := ∪_{k∈L} V_k ∈ V_S(K) and W := ∩_{k∈L} V_y^k ∈ V_S(y). Finally V_K ∩ W = ∅.
Theorem A.1.7. Let K be a compact topological space, S a Hausdorff topological space and g a continuous bijection from K onto S. Then g is a homeomorphism, i.e. g⁻¹ is continuous too.
Proof. Let G be an open set of K and consider (g⁻¹)⁻¹(G) = g(G). Since G is open, K\G =: F ⊂ K is compact, as a closed subspace of a compact space. Since g is continuous, g(F) is compact in S, and closed (a compact set in a Hausdorff space is closed). Then g(G) = S\g(F) is open.
Products allow one to create new spaces from old ones. If the spaces in the
beginning were topological spaces, then one can define a new topology on the
product such that all projections are continuous, namely the product topology.
This topology is the smallest which makes the projections continuous.
Definition A.1.6. Let $(S_i,\mathcal{T}_i)$ be topological spaces for $i\in I$, $I$ an index set. Let $\prod_{i\in I}S_i$ denote the product of the spaces $S_i$, and let $\pi_j:\prod_{i\in I}S_i\to S_j$ be the projection onto $S_j$. The product topology on $\prod_{i\in I}S_i$, denoted by $\prod_{i\in I}\mathcal{T}_i$, is the topology generated by (i.e. having as a basis) the sets $\bigcap_{l=1}^kA_{m_l}$, where
$$A_{m_l}:=\pi_{m_l}^{-1}(G)$$
for $G$ any open set of $(S_{m_l},\mathcal{T}_{m_l})$. It is the coarsest (i.e. smallest) topology making all the projections $\pi_j$ continuous. It is the topology of pointwise convergence.
We state three propositions about properties that countable product spaces inherit from their (metric) factors: the first is about metrizability, the second about
separability and the third about completeness.
Proposition A.1.8. Let $(S_n,d_n)$ be metric spaces for $n\ge1$. The product space $S:=\prod_{n=1}^\infty S_n$ equipped with the product topology is metrizable, e.g. by the metric
$$d(\{x_n\},\{y_n\}):=\sum_{n\ge1}2^{-n}f(d_n(x_n,y_n)),$$
with $f(t):=t/(1+t)$.
Proof. We refer to [Dud1] proposition 2.4.4 on page 50 or to [Will] chapter 7 theorem 22.3 on page 161 for a proof.
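A minimal Python sketch of this metric (illustrative only; the truncation of the infinite product to finitely many coordinates and the choice of every factor as (R, |·|) are assumptions made for computability):

```python
def f(t):
    # bounded increasing transform: each coordinate contributes at most 2^{-n}
    return t / (1.0 + t)

def product_metric(x, y):
    # x, y: finite lists standing in for the first coordinates of two sequences;
    # the neglected tail contributes at most sum_{n > len(x)} 2^{-n}
    return sum(2.0 ** (-(n + 1)) * f(abs(a - b))
               for n, (a, b) in enumerate(zip(x, y)))

x = [0.0, 1.0, 5.0, -2.0]
y = [0.1, 1.0, 4.0,  3.0]
print(product_metric(x, y))   # a small number in [0, 1)
print(product_metric(x, x))   # 0.0
```

Convergence in this metric is exactly coordinatewise convergence, which is the point of the proposition.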
Proposition A.1.9. Let $(S_n,\mathcal{T}_n)$ be (non empty) separable topological spaces for $n\ge1$. Then $S:=\prod_{n\ge1}S_n$ with the product topology is separable.
Proof. We prove the proposition only for countable products. For an optimal result we refer to [Will] theorem 16.4(c) on page 109.
Let $D_n$ be a countable dense subset of $S_n$. Choose $\{a_n\}\in S$ (this can be done since, by AC, $S$ is non empty), and define $D$ as the union over all $n$ of $\prod_{j=1}^nD_j\times\prod_{j\ge n+1}\{a_j\}$; then obviously $D$ is at most countable. It is also dense: a basic open set of the product topology is of the form $\bigcap_{j=1}^n\pi_{i_j}^{-1}(A_{i_j})$, with $A_{i_j}$ open (and non empty) in $S_{i_j}$, and then
$$\Big(\bigcap_{j=1}^n\pi_{i_j}^{-1}(A_{i_j})\Big)\cap D\supset\Big(\bigcap_{j=1}^n\pi_{i_j}^{-1}(A_{i_j})\Big)\cap\Big(\prod_{j=1}^{i_n}D_j\times\prod_{j\ge i_n+1}\{a_j\}\Big).$$
The right-hand side is non empty: in each coordinate $i_j$ one can pick a point of $D_{i_j}\cap A_{i_j}\ne\emptyset$ ($D_{i_j}$ is dense and $A_{i_j}$ is open and non empty), in the remaining coordinates $k\le i_n$ a point of $D_k$, and in the coordinates $k>i_n$ the point $a_k$. Hence $D$ has a non empty intersection with every set of the basis, thus also with any open neighbourhood of any point $\{s_n\}$ in $S$. So $D$ is dense.
Since a countable product of metric spaces, with product topology is metrizable, proposition A.1.8, it makes sense to consider if completeness is carried over.
This is the purpose of the next proposition.
Proposition A.1.10. Let $(S_n,d_n)$ be complete metric spaces; the product space $S$ with the product topology and with the metric as in proposition A.1.8 is complete.
Proof. Let $\{\{s_{nm}\}_{n\ge1}\}_{m\ge1}$ be a Cauchy sequence in $S$ for $d$ and set $x_m:=\{s_{nm}\}_{n\ge1}$, so $x_m\in S$. For $n$ fixed, consider $\pi_n(x_m)$. By proposition 2.4.3 on page 49 of [Dud1] the identity from $(S_n,f\circ d_n)$ to $(S_n,d_n)$ is uniformly continuous, so a sequence $\{y_k\}_{k\ge1}$ in $S_n$ is Cauchy for $f\circ d_n$ iff it is Cauchy for $d_n$. From the definition of the metric $d$ on the product space $S$ it is easy to see that $\{\pi_n(x_m)\}_{m\ge1}$ is a Cauchy sequence in $(S_n,f\circ d_n)$, therefore also in $(S_n,d_n)$, which is complete by assumption: given $2^{-n}\epsilon>0$, for any $p,q\ge N$, for some $N\ge1$,
$$f(d_n(s_{np},s_{nq}))2^{-n}\le\sum_{n\ge1}2^{-n}f(d_n(s_{np},s_{nq}))=d(x_p,x_q)<2^{-n}\epsilon.$$
So coordinatewise we have a limit, say $x^{(n)}$. Let $x:=\{x^{(n)}\}_{n\ge1}$; finally we need to show $x_m\to x$ in $(S,d)$. Let $\eta>0$, fix the smallest $n$ such that $2^{-n}<\eta/2$, choose $N_i$, $1\le i\le n$, such that $f(d_i(s_{ip},x^{(i)}))<\eta/(2n)$ for all $p\ge N_i$, and let $N:=\max_{i=1}^nN_i+1$; then for all $q\ge N$:
$$\sum_{m\ge1}2^{-m}f(d_m(s_{mq},x^{(m)}))\le\sum_{i=1}^n2^{-i}f(d_i(s_{iq},x^{(i)}))+2^{-n}\sum_{j\ge1}2^{-j}f(d_j(s_{jq},x^{(j)}))\le\sum_{i=1}^n\eta/(2n)+2^{-n}\sum_{j\ge1}2^{-j}<\eta,$$
since $f\le1$.
Often real–valued sequences do not converge, but bounded monotone sequences do. For bounded sequences one can therefore still define the following limits.
Definition A.1.7. Let $\{x_n\}_{n\ge1}$ be a bounded sequence of real numbers. Define $y_n:=\sup_{k\ge n}x_k$. The $y_n$ form a bounded non increasing sequence, thus a convergent one, with limit $\inf_ny_n$. Denote $\limsup_nx_n:=\lim_{n\to\infty}\sup_{k\ge n}x_k$; as said above, the limit exists and equals $\inf_ny_n$. Similarly one defines $z_n:=\inf_{k\ge n}x_k$; the $z_n$ form a bounded non decreasing sequence, so it converges to its supremum $\sup_nz_n$. Then let $\liminf_nx_n:=\lim_{n\to\infty}\inf_{k\ge n}x_k$; this $\liminf$ then equals $\sup_nz_n$.
Lemma A.1.11. Let {xn }n≥1 be a bounded sequence in R. If y < lim supn→∞ xn
then there exists a subsequence {xk(n) }n≥1 : xk(n) > y.
Proof. By contradiction: suppose such a subsequence does not exist. Then $x_n>y$ for only finitely many indices, so there exists an $n_0$ with $x_n\le y$ for all $n\ge n_0$, hence $\sup_{k\ge n}x_k\le y$ for all $n\ge n_0$, implying $\limsup_nx_n\le y$, a contradiction.
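A short Python sketch of definition A.1.7 and lemma A.1.11 (illustrative only; the particular sequence and the finite truncation are assumptions of the example): the running suprema $y_n$ and infima $z_n$ are computed and their limits recover lim sup and lim inf.

```python
# x_n = (-1)^n * (1 + 1/n): lim sup = 1, lim inf = -1
N = 2000
x = [(-1) ** n * (1 + 1.0 / n) for n in range(1, N + 1)]

M = N // 2  # only use indices with long remaining tails, to avoid truncation artifacts
y = [max(x[n:]) for n in range(M)]   # y_n = sup_{k >= n} x_k, non increasing
z = [min(x[n:]) for n in range(M)]   # z_n = inf_{k >= n} x_k, non decreasing

limsup = min(y)   # = inf_n y_n
liminf = max(z)   # = sup_n z_n
print(limsup, liminf)                # ~ 1.0 and ~ -1.0 (up to truncation error)

# lemma A.1.11: any value strictly below lim sup is exceeded at infinitely many indices
t = 0.9
print(sum(1 for v in x if v > t))    # a large count of indices with x_n > t
```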
Lemma A.1.12. Let $R,S$ be topological spaces, where $S$ moreover is Hausdorff, and $f$ a continuous function from $R$ into $S$. Let $F_n$ be closed sets of $R$ for $n\in\mathbb{N}$ and $K$ a compact set of $R$ such that $F_n\downarrow K$ as $n\to\infty$. If for every open $U\supset K$ there is an $n$ with $F_n\subset U$, then
$$f[K]=\bigcap_{n\ge1}f[F_n]=\bigcap_{n\ge1}\overline{f[F_n]}.$$
Proof. The inclusions from left to right follow from $K=\bigcap_nF_n$ and
$$f[K]=f\Big[\bigcap_{n\ge1}F_n\Big]\subset\bigcap_{n\ge1}f[F_n]\subset\bigcap_{n\ge1}\overline{f[F_n]}.$$
For the converse inclusions we propose two ways. For both it is enough to show $f[K]\supset\bigcap_n\overline{f[F_n]}$.
$\supset$: shorter proof. By considering complements, one has to show $S\setminus f[K]\subset\bigcup_nS\setminus\overline{f[F_n]}$. Let $y\in S\setminus f[K]$. Since $f$ is continuous and $K$ compact, $f[K]$ is compact too (proposition A.1.4). In a Hausdorff space, a compact set and a singleton not belonging to it can be separated by disjoint open neighborhoods (proposition A.1.6). Hence there are open disjoint $V\in\mathcal{V}_S(f[K])$ and $W\in\mathcal{V}_S(y)$. Since $f$ is continuous, there exists an open neighborhood $U\in\mathcal{V}_R(K)$ with $f[K]\subset f[U]\subset V$ (take $U=f^{-1}(V)$). By the hypothesis, $F_n\subset U$ for all $n\ge N$, $N$ large enough. Hence $f[F_n]\subset f[U]\subset V$ and $y\notin\overline{f[F_n]}$ (otherwise the neighborhood $W$ of $y$ would meet $f[F_n]\subset V$, contradicting $V\cap W=\emptyset$). So for $n\ge N$:
$$y\in S\setminus\overline{f[F_n]}\subset\bigcup_{n\ge N}S\setminus\overline{f[F_n]}\subset\bigcup_nS\setminus\overline{f[F_n]}.$$
$\supset$: longer proof. Take $y\in\bigcap_n\overline{f[F_n]}$ and suppose that every $x\in K$ has an open neighborhood $V_x$ with $y\notin\overline{f[V_x]}$. The $V_x$ form an open cover of the compact set $K$; let $V_{x_i}$, $i=1,\dots,n$, be a finite subcover, whose union we denote by $U$. Then $U$ is open, $U\supset K$ and $\overline{f[U]}=\bigcup_{i=1}^n\overline{f[V_{x_i}]}$ (a finite union, using theorem A.1.1), so $y\notin\overline{f[U]}$. But the condition on the $F_n$ implies $F_n\subset U$ from a certain fixed index on, thus $y\notin\overline{f[U]}\supset\overline{f[F_n]}$, a contradiction with $y\in\bigcap_n\overline{f[F_n]}$.
So take $x\in K$ such that $y\in\overline{f[V]}$ for every open neighborhood $V$ of $x$. We claim that $f(x)=y$, which finishes the proof, since then $y\in f[K]$. If $f(x)\ne y$, then by the Hausdorff property there are disjoint open neighborhoods $V_1,V_2$ of $f(x)$ and $y$. Then $V_2^c$ is a closed neighborhood of $f(x)$ and $f^{-1}(V_1)\subset f^{-1}(V_2^c)$, where the first set is open, the second closed, and both are neighborhoods of $x$. Hence
$$\overline{f[f^{-1}(V_1)]}\subset\overline{f[f^{-1}(V_2^c)]}\subset\overline{V_2^c}=V_2^c,$$
and since $y\in V_2$, we get $y\notin\overline{f[f^{-1}(V_1)]}$ with $f^{-1}(V_1)$ an open neighborhood of $x$ — a contradiction with the choice of $x$. So $f(x)=y$.
Definition A.1.8. Let $(S,d)$, $(S',d')$ be two metric spaces and $f:S\to S'$ a function. Then $f$ is said to be $k$–Lipschitz continuous iff for all $r,s\in S$:
$$d'(f(r),f(s))\le k\,d(r,s).$$
Lemma A.1.13. Let (S, d) be a metric space and A any subset of S.
i) The function
d(x, A) := inf{d(x, y) : y ∈ A}
is a function (any non–empty subset of R that is bounded from below has a
greatest lower bound, which is the same as the infimum) that is 1–Lipschitz
continuous.
ii) The function gk (x) := max(1 − kd(x, A), 0) is k–Lipschitz, k = 1, 2, · · · .
Proof.
i) We have to show that |d(x, A)−d(y, A)| ≤ d(x, y). Note first that d(x, A) ≤
d(x, z) for any z ∈ A, so that by the triangle inequality it follows:
d(x, A) ≤ d(x, z) ≤ d(x, y) + d(y, z).
Because this holds for all z ∈ A, we get d(x, A) − d(y, A) ≤ d(x, y).
Interchanging the roles of x and y gives the desired result.
ii) If $g_k(x)=0=g_k(y)$ there is nothing to prove. If $g_k(x)>0$ and $g_k(y)>0$, then
$$|g_k(x)-g_k(y)|=k|d(x,A)-d(y,A)|\le k\,d(x,y)$$
by step (i). Finally, if $g_k(x)=0$ and $g_k(y)>0$, then
$$1-k\,d(x,A)\le0\iff1\le k\,d(x,A),\qquad0<1-k\,d(y,A)\iff k\,d(y,A)<1,$$
hence $k\,d(y,A)<1\le k\,d(x,A)$, i.e. $0<1-k\,d(y,A)\le k\,d(x,A)-k\,d(y,A)$, so
$$|g_k(y)|=|1-k\,d(y,A)|\le k|d(x,A)-d(y,A)|\le k\,d(x,y).$$
Hence $g_k$ is $k$–Lipschitz.
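A short Python sketch of lemma A.1.13 (illustrative only; the finite set A ⊂ R with the usual metric and the grid of test points are assumptions of the example): it computes d(x, A) and g_k(x) = max(1 − k d(x, A), 0) and checks the Lipschitz constants numerically.

```python
def d(x, A):
    # distance from the point x to the finite set A
    return min(abs(x - a) for a in A)

def g(k, x, A):
    return max(1.0 - k * d(x, A), 0.0)

A = [0.0, 2.0, 5.0]
pts = [i / 50.0 - 1.0 for i in range(400)]   # grid on [-1, 7)
k = 3

# 1-Lipschitz check for x -> d(x, A) and k-Lipschitz check for g_k
ok_d = all(abs(d(x, A) - d(y, A)) <= abs(x - y) + 1e-12
           for x in pts for y in pts)
ok_g = all(abs(g(k, x, A) - g(k, y, A)) <= k * abs(x - y) + 1e-12
           for x in pts for y in pts)
print(ok_d, ok_g)   # True True
```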
Definition A.1.9. Let (R, T ) be a topological space; it is said to be Polish iff there exists a metric d that metrizes the topology and is complete and separable. A separable measurable space (Y, S) will be called a Suslin space iff there exist a Polish space R and a Borel measurable map from R onto Y. If (Y, S) is a measurable space, a subset Z of Y will be called a Suslin set iff (Z, Z ⊓ S) is a Suslin space, where Z ⊓ S := {Z ∩ S : S ∈ S} is the relative σ–algebra on Z.
Let (X , B) be a given measurable space and M ⊂ X ; M is called universally measurable (u.m.) iff for every probability measure P on B, M is measurable for the completion of P. In other words, there exist A, B ∈ B with A ⊂ M ⊂ B and P(A) = P(B).
Polish spaces behave well under taking (at most countable) products as is seen
in the following lemma.
Lemma A.1.14. Let $(S_i,d_i)$, $i\ge1$, be Polish spaces; then $S:=\prod_{i\ge1}S_i$ with the product topology is also Polish.
Proof. By proposition A.1.8 the product topology is metrizable. The product of countably many separable spaces remains separable (proposition A.1.9). Finally, as seen in proposition A.1.10, the countable product remains complete. So, as claimed, S with the product topology is metrizable by a complete and separable metric.
Lemma A.1.15. Let (Y, S) be a Suslin measurable space; then Y^k with the product σ–algebra remains a Suslin measurable space.
Proof. Since (Y, S) is a Suslin measurable space, by definition there exist a Polish space (S, d) and a Borel measurable map b from (S, d) onto (Y, S). (S^k, d_sum) is again Polish (lemma A.1.14) and
(b, ..., b) : (S^k, d_sum) → (Y^k, ⊗_{i=1}^k S)
is measurable and onto. (Y^k, ⊗_{i=1}^k S), as a finite product of separable measurable spaces, remains separable (theorem A.2.11).
A.1.2 Some important theorems.
The first theorem, due to Tychonoff, describes compact sets in arbitrary product
spaces. It is a very powerful theorem. Actually it even happens to be equivalent
to the Axiom of Choice.
Theorem A.1.16 (Tychonoff). Let $\{(K_j,\mathcal{T}_j)\}_{j\in J}$ be a family of compact topological spaces, where $J$ is an index set. Then the product space $(\prod_jK_j,\prod_j\mathcal{T}_j)$ endowed with the product topology is compact too.
Proof. We refer to [Dud1] theorem 2.2.8. page 39 or [Will] chapter 6 theorem
17.8 page 120 for a proof.
Compact sets play an important role everywhere in mathematics. In R^d, C^d (with the usual Euclidean metric) the compact sets are exactly the closed and bounded sets. More generally, in uniform spaces compactness is equivalent to completeness together with total boundedness. Here we deal primarily with function spaces, e.g. the space of all real–valued continuous functions or all real–valued càdlàg (right continuous with left limits) functions on some (compact) (Hausdorff) topological space (K, T ). The next theorem, due to Arzelà–Ascoli, completely characterizes relative compactness in such function spaces. It will turn out that the concept of equicontinuity plays an important role. Before going to the theorem we give a definition of equicontinuity.
Definition A.1.10. Let (S, d) be a metric space and F ⊂ C(S). F is said to be equicontinuous at s ∈ S iff for each ε > 0 there is a δ > 0 such that for all t ∈ S with d(s, t) < δ:
|f(s) − f(t)| < ε for all f ∈ F.
If F is equicontinuous at every s ∈ S, then F is said to be equicontinuous. If for any ε > 0 the δ > 0 in the definition of equicontinuity is suitable for any s, t ∈ S less than δ away from each other, then F is called uniformly equicontinuous.
Theorem A.1.17. Let (K, d) be a compact metric space and (S, e) a metric space.
Any equicontinuous family from K into S is uniformly equicontinuous.
Proof. Suppose, by contradiction, that this is not the case. Then, from the negation of the definition of a uniformly equicontinuous family, there exists an ε > 0 such that for δ_n = 1/n > 0 there are x_n, y_n ∈ K and f_n ∈ F with d(x_n, y_n) < 1/n but e(f_n(x_n), f_n(y_n)) ≥ ε. By compactness of K we choose a subsequence {x_{k(n)}}_{n≥1} that converges to some x ∈ K. Because {y_{k(n)}}_{n≥1} is an equivalent sequence, y_{k(n)} → x. For n large enough, by equicontinuity one has
e(f_{k(n)}(x_{k(n)}), f_{k(n)}(x)) < ε/2 and e(f_{k(n)}(y_{k(n)}), f_{k(n)}(x)) < ε/2.
This would imply e(f_{k(n)}(x_{k(n)}), f_{k(n)}(y_{k(n)})) < ε, a contradiction.
A first corollary of theorem A.1.17 is that continuous functions on compact
domains are in fact uniformly continuous. And this remains true for any finite
family of continuous functions.
Corollary A.1.11. Let (K, d) be a compact metric space, (S, e) a metric space
and f a continuous function from K into S. Then f is uniformly continuous.
Proof. By the previous theorem, A.1.17, it is enough to show that the family {f }
is equicontinuous. But in this case this is reduced to the continuity of f .
Theorem A.1.18 (Arzelà–Ascoli). Let (K, e) be a compact metric space and F ⊂ C(K), where C(K) is the space of all real–valued continuous functions equipped with the uniform topology (induced by the uniform metric). F is totally bounded in (C(K), d∞) iff F is uniformly bounded (i.e. bounded for d∞) and equicontinuous.
Proof. Assume that F is totally bounded. We first prove the (uniform) equicontinuity of F. Let ε > 0; then there are f_1, ..., f_n ∈ F such that F ⊂ ∪_{j=1}^n B_{d∞}(f_j, ε). Each f_j is a continuous function from a compact space to a metric space, and so is uniformly continuous. The finite set {f_1, ..., f_n} is uniformly equicontinuous: for ε > 0 choose δ_j > 0 such that
|f_j(x) − f_j(y)| < ε whenever d(x, y) < δ_j.
Let δ := min{δ_j(ε) : j = 1, ..., n}; then δ > 0 and for j = 1, ..., n:
|f_j(x) − f_j(y)| < ε whenever d(x, y) < δ.
For every f ∈ F and this δ, if d(x, y) < δ, then, choosing j with d∞(f, f_j) < ε:
|f(x) − f(y)| ≤ |f(x) − f_j(x)| + |f_j(x) − f_j(y)| + |f_j(y) − f(y)| < 3ε.
F, being totally bounded, is also bounded: for ε = 1 there are f_1, ..., f_m ∈ F such that F ⊂ ∪_{i=1}^m B_{d∞}(f_i, 1); let M := max{d∞(f_i, f_j) : 1 ≤ i < j ≤ m} < +∞. Then obviously F ⊂ B_{d∞}(f_1, M + 1).
Conversely, let F be equicontinuous and uniformly bounded; then by theorem A.1.17 F is uniformly equicontinuous. Because F is uniformly bounded there exists 0 < M < +∞ such that |f(x)| < M for all x ∈ K and f ∈ F. The set [−M, M] is compact. Also C(K) ⊂ R^K := {g : K → R function}. So F ⊂ R^K and, since f(x) ∈ [−M, M] for each x ∈ K, f ∈ F, one sees F ⊂ [−M, M]^K, which is a compact space by Tychonoff's theorem (A.1.16). Taking the closure of F in the product topology of R^K (or the relative topology in [−M, M]^K) and denoting it G, one has that G, as a closed subset of a compact set, is again compact. G also inherits the uniform equicontinuity from F. For every ε > 0, let
A := {k ∈ R^K : |k(x) − k(y)| ≤ ε for every x, y ∈ K}.
This set is closed in R^K. Indeed, for x, y ∈ K let
A_{x,y} := {k ∈ R^K : |k(x) − k(y)| ≤ ε}.
If A_{x,y} is closed for every x, y ∈ K, then A = ∩_{x,y∈K} A_{x,y} will be too. We claim that A_{x,y} is closed for the product topology. Consider h ∈ cl(A_{x,y}), where the closure is taken for the topology of R^K. Then A_{x,y} has a non–empty intersection with any (open) neighbourhood of h, by definition A.1.2, and we need |h(x) − h(y)| ≤ ε. In the product topology open sets are of the form ∩_{i=1}^n π_{x_i}^{-1}(O_i) where O_i is open in R, see definition A.1.6. Take here the open neighbourhood
V_{x,y,n} := {g ∈ R^K : |g(x) − h(x)| < 1/(2n)} ∩ {g ∈ R^K : |g(y) − h(y)| < 1/(2n)}.
Since h ∈ cl(A_{x,y}), V_{x,y,n} ∩ A_{x,y} ≠ ∅; for each n choose g_n ∈ V_{x,y,n} ∩ A_{x,y}, so
|h(x) − h(y)| ≤ |g_n(x) − h(x)| + |g_n(y) − g_n(x)| + |g_n(y) − h(y)| ≤ ε + 1/n.
F was assumed to be uniformly equicontinuous. So for ε > 0 there is a δ > 0 such that for all x, y ∈ K with e(x, y) < δ and f ∈ F: |f(x) − f(y)| ≤ ε. As above, the set
{k ∈ R^K : |k(x) − k(y)| ≤ ε for every x, y ∈ K with e(x, y) < δ}
is closed in the product topology and includes F, thus includes G too. Repeating the previous argument for every ε > 0, G is seen to be uniformly equicontinuous.
The last step is to prove that G is compact for the uniform topology. So let U be an ultrafilter on G. Because G is compact for the product topology, U converges to some g ∈ G, i.e. U contains the neighbourhood filter (in the product topology) of g. G was seen to be uniformly equicontinuous; for ε > 0 let δ > 0 be such that whenever
e(x, y) < δ then |f(x) − f(y)| ≤ ε/4 < ε/3
for all f ∈ G. K is compact, in particular totally bounded; let S be a finite subset of K such that for any y ∈ K: e(x, y) < δ for some x ∈ S. Let
U := {k ∈ R^K : |k(x) − g(x)| < ε/3 for all x ∈ S} = ∩_{x∈S} π_x^{-1}( ]g(x) − ε/3, g(x) + ε/3[ ),
so U is open for the product topology on R^K and contains g. Now U is an (open) neighbourhood of g, hence U ∈ U. Take B_{d∞}(g, ε). If U ∩ G ⊂ B_{d∞}(g, ε), then B_{d∞}(g, ε) ∈ U, meaning U → g in the uniform topology. This in turn implies compactness of G for the uniform topology. Compactness implies total boundedness, which is trivially inherited by subspaces, so F would be totally bounded. Thus it suffices to prove the inclusion U ∩ G ⊂ B_{d∞}(g, ε) to finish the proof. Let k ∈ U ∩ G; for any y ∈ K pick x ∈ S with e(x, y) < δ:
|k(y) − g(y)| ≤ |k(y) − k(x)| + |k(x) − g(x)| + |g(x) − g(y)| < ε.
The terms |k(y) − k(x)| and |g(x) − g(y)| are small due to the uniform equicontinuity of G, the middle one because k ∈ U. Since the above calculation holds uniformly over all y ∈ K, the inclusion is proved.
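To make the total-boundedness conclusion of the Arzelà–Ascoli theorem concrete, here is a small Python sketch (illustrative only; the family {x ↦ sin(x + a)} on [0, 2π], which is uniformly bounded by 1 and 1–Lipschitz, hence uniformly equicontinuous, is an assumed example): finitely many shifts form an ε–net for d∞.

```python
import math

eps = 0.1
# family F = { f_a : x -> sin(x + a) }: |f_a(x) - f_b(x)| <= |a - b| for all x
net_shifts = [k * eps for k in range(int(2 * math.pi / eps) + 1)]   # eps-spaced shifts

xs = [2 * math.pi * i / 200 for i in range(201)]                    # sample points in [0, 2*pi]

def dist_inf(a, b):
    # sup-distance between f_a and f_b, approximated on the sample grid
    return max(abs(math.sin(x + a) - math.sin(x + b)) for x in xs)

# every sampled member of the family is within eps of some net element
worst = max(min(dist_inf(a, b) for b in net_shifts)
            for a in [2 * math.pi * j / 100 for j in range(100)])
print("covering radius on the sampled family ~", round(worst, 4), "<=", eps)
```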
Definition A.1.12. A set V is said to be a real vector space iff there are two
operations, denoted + and .
+ : V × V → V : (v, w) 7→ v + w,
. : R × V → V : (r, v) 7→ r.v
on V such that (V, +) is an abelian group and
(ab).v = a(b.v) and 1.v = v
and
(a + b).v = a.v + b.v
a.(v + w) = a.v + a.w
hold for all a, b, r ∈ R, v, w ∈ V .
A triple (R, +, ·), with R a set and two operations + and · defined on it,
+ : R × R → R : (r, s) ↦ r + s,  · : R × R → R : (r, s) ↦ r · s,
is said to be a ring iff (R, +) is an abelian group, (R, ·) is a semigroup with unity and
(r + s) · t = r · t + s · t,  t · (r + s) = t · r + t · s
for all r, s, t ∈ R. A quadruple (A, +, ·, .) is said to be an R–algebra iff (A, +, ·) is a ring, (A, +, .) is an R–vector space and
r.(a · b) = (r.a) · b
for all r ∈ R; a, b ∈ A.
A vector space V is said to be a vector lattice if for every v ∈ V
v⁺ := max(v, 0) ∈ V.
The next lemma provides an explanation of the definition of vector lattice. Usually a lattice is a set together with a partial order in which every two elements have a unique supremum and infimum.
Lemma A.1.19. Let F be a vector lattice. Then max(f, g), min(f, g) ∈ F for all f, g ∈ F.
Proof. Let f, g ∈ F; it is easy to see that
max(f, g) = ½[(f + g) + |f − g|],  min(f, g) = ½[(f + g) − |f − g|].
So it suffices to show that |f − g| ∈ F. Now |f − g| equals f − g where f ≥ g, and g − f where f ≤ g; hence
|f − g| = (f − g)⁺ + (g − f)⁺,
where (f − g)⁺ = max(f − g, 0) ∈ F and (g − f)⁺ = max(g − f, 0) ∈ F, since F is a vector lattice.
The following theorem, due to Stone and Weierstrass, where the latter proved
it for C([0, 1]) and the former generalized it, states conditions on a subset F ⊂
C(K) = Cb (K), for K some compact topological space, such that it is dense in
the uniform topology. Since weak convergence of variables is actually point wise
convergence of functionals on Cb (S), the Stone–Weierstrass theorem provides a
way to find convergence determining sets.
Theorem A.1.20 (Stone–Weierstrass). Let (K, T ) be any compact Hausdorff topological space and F ⊂ C(K) with the uniform topology, i.e. the topology induced
by d∞ (f, g) = supx∈K kf (x) − g(x)k.
If F is an algebra (definition A.1.12 ), separates points in K, i.e. for every
x, y ∈ K: x 6= y there exists an f ∈ F: f (x) 6= f (y), and contains the constants;
then F is dense in (C(K), d∞ ).
Proof. We refer to [Dud1] theorem 2.4.11. page 54 or [Will] chapter 10 theorem
44.5 page 291 for a proof.
Another version of the Stone–Weierstrass theorem, where less restrictions are
put on the class F.
Theorem A.1.21. Let (K, T ) be any compact Hausdorff topological space and
F ⊂ C(K) with the uniform topology, i.e. the topology induced by d∞ (f, g) =
supx∈K kf (x) − g(x)k.
If F is a vector lattice (definition A.1.12 ), separates points in K, i.e. for every
x, y ∈ K: x 6= y there exists an f ∈ F: f (x) 6= f (y), and contains the constants;
then F is dense in (C(K), d∞ ).
Proof. We refer to the book by J.O. Jameson Topology and Normed Spaces, Chapman and Hall, London, 1974 on page 263 for a proof.
Lemma A.1.22. Let $T$ be a metric space and $\mathcal{F}\subset C_b(\ell^\infty(T))$,
$$\mathcal{F}:=\{h:\ell^\infty(T)\to\mathbb{R}:z\mapsto h(z):=G(z(t_1),\dots,z(t_k)),\ G\in C_b(\mathbb{R}^k),\ t_i\in T,\ i=1,\dots,k,\ k\in\mathbb{N}\}.$$
Then $\mathcal{F}$ is an algebra, a lattice and a vector space.
Proof. This is easy to see.
The next lemma shows that one can not only approximate on compact sets, but that the approximation can be extended to an open neighborhood of K of the form K^δ := {y ∈ S : d(y, K) < δ}.
Lemma A.1.23. Let $(S,d)$ be a metric space, $K\subset S$ compact and $\mathcal{F}$ a subalgebra of $C_b(S)$. Then for all $f\in C_b(S)$ and any $\epsilon>0$ there exist a $\delta>0$ and $F\in\mathcal{F}$ such that $|f(x)-F(x)|\le\epsilon/3$ for all $x\in K^\delta$.
Proof. By the Stone–Weierstrass theorem (A.1.20), $f\in C_b(S)$ can be uniformly approximated on $K$ by some $F\in\mathcal{F}$, say $\sup_{x\in K}|f(x)-F(x)|\le\epsilon/4$. Since $K$ is compact, with
$$K^\delta:=\{s\in S:d(s,K)<\delta\},$$
we have, for any $\delta>0$,
$$K,\ K^\delta\subset\bigcup_{x\in K}B_{(S,d)}(x,\delta),$$
so there exists a finite subset $\{x_1^\delta,\dots,x_{N_\delta}^\delta\}\subset K$ with
$$K\subset\bigcup_{i=1}^{N_\delta}B_{(S,d)}(x_i^\delta,\delta).$$
Now we would like to have $K^\delta\subset\bigcup_{i=1}^{N_\delta}B_{(S,d)}(x_i^\delta,\eta)$ too, for some $\eta>0$. Let $\eta=2\delta$; then $K^\delta\subset\bigcup_{i=1}^{N_\delta}B_{(S,d)}(x_i^\delta,2\delta)$. Indeed,
$$K\subset\bigcup_{i=1}^{N_\delta}B_{(S,d)}(x_i^\delta,\delta)\subset\bigcup_{i=1}^{N_\delta}B_{(S,d)}(x_i^\delta,2\delta),$$
and if $y\in K^\delta\setminus K$:
$$d(x_i^\delta,y)\le d(k,y)+d(k,x_i^\delta)<2\delta,$$
where we choose $k\in K$ with $d(y,K)\le d(y,k)<\delta$ and $x_i^\delta$ with $k\in B_{(S,d)}(x_i^\delta,\delta)\subset B_{(S,d)}(x_i^\delta,2\delta)$. Since $f,F$ are both continuous functions on the compact $K$, they are uniformly continuous on $K$ (corollary A.1.11). For $\epsilon/24$ choose $\delta>0$ such that
$$|f(x)-f(y)|<\epsilon/24\quad\text{and}\quad|F(x)-F(y)|<\epsilon/24$$
for all $x,y$ with $d(x,y)<2\delta$. Then on $K$: $|f(x)-F(x)|\le\epsilon/4<\epsilon/3$, and for $z\in K^\delta\setminus K$, if $d(x_i^\delta,z)<2\delta$:
$$|f(z)-F(z)|\le|f(z)-f(x_i^\delta)|+|f(x_i^\delta)-F(x_i^\delta)|+|F(x_i^\delta)-F(z)|\le\epsilon/24+\epsilon/4+\epsilon/24\le\epsilon/3.$$
Hence, as claimed, $|f(x)-F(x)|\le\epsilon/3$ for $x\in K^\delta$.
A.2 Measure Theory.
A.2.1 Rings, algebras, σ–algebras and (outer) measures.
Let us first recall some classes of sets, which are closed under certain operations.
Definition A.2.1. Let X be a set and A ⊂ P(X ), i.e. the power set of X . If A satisfies the following properties:
i) ∅ ∈ A;
ii) for every A, B ∈ A : A ∪ B ∈ A;
iii) for every A, B ∈ A : A\B ∈ A;
then A is said to be a ring . If A satisfies (i), (ii), (iii) and if one also has X ∈ A,
then A is called an algebra. And if A is an algebra such that it is closed under
countable unions, then A is said to be a σ–algebra .
In some cases, starting from some class of sets, we can explicitly state a formula for the algebra generated by that class, i.e. the smallest algebra which contains that particular class.
Proposition A.2.1. Let $\mathcal{X}$ be a set and $A_1,\dots,A_n$ subsets of $\mathcal{X}$. Let $\mathcal{A}$ be the smallest algebra containing the $A_i$, $1\le i\le n$. Then
$$\mathcal{A}=\Big\{\bigcup_{j\in J}F(j):J\subset\{0,1\}^n\Big\}=:\mathcal{F},$$
where, for $j\in\{0,1\}^n$, $F(j):=\bigcap_{i=1}^nA_i^{j_i}$, with $A_i^1=A_i^c$ and $A_i^0=A_i$.
Proof. Note that $\mathcal{F}\subset\mathcal{A}$, since elements of $\mathcal{F}$ are finite unions of finite intersections of the $A_i$ or their complements. Next we prove $\mathcal{F}\supset\mathcal{A}$ by showing that $\mathcal{F}$ is an algebra containing each of the $A_i$.
$\mathcal{X}$ is contained in $\mathcal{F}$, because $\mathcal{X}=\bigcup_{j\in\{0,1\}^n}F(j)$. Indeed, let $J_1\subset\{0,1\}^n$ be the set of all $j$ with $j_n=0$ and $J_2:=\{j:j_n=1\}$; then $J_1$ and $J_2$ are disjoint and
$$\bigcup_{j\in\{0,1\}^n}F(j)=\bigcup_{j\in J_1}F(j)\cup\bigcup_{j\in J_2}F(j).$$
For $l\in J_1$ take $\tilde l:=l+(0,\dots,0,1)$ in $J_2$. Then
$$F(l)\cup F(\tilde l)=\bigcap_{i=1}^{n-1}A_i^{l_i}.$$
So $\bigcup_{j\in\{0,1\}^n}F(j)=\bigcup_{l\in\{0,1\}^{n-1}}F(l)$. Continuing, we end up with the union of the two sets $A_1$ and $A_1^c$, which is $\mathcal{X}$ as claimed.
That $\mathcal{F}$ is closed under finite unions is trivial.
We have to verify the second axiom: closedness under complementation. We claim that $F^c:=(\bigcup_{j\in J}F(j))^c=\bigcup_{j\in J^c}F(j)$. Since
$$F^c=\mathcal{X}\setminus F=\Big(\bigcup_{j\in J}F(j)\cup\bigcup_{j\in J^c}F(j)\Big)\setminus F=\emptyset\cup\Big(\bigcup_{j\in J^c}F(j)\Big)\setminus F,$$
we get $F^c\subset\bigcup_{j\in J^c}F(j)$. The converse is also true: let $x\in\bigcup_{j\in J^c}F(j)$; then $x\in F(p)$ for some $p\in J^c$, which means $x\in\bigcap_{i=1}^nA_i^{p_i}$. Now
$$F^c=\bigcap_{l\in J}F(l)^c=\bigcap_{l\in J}\bigcup_{i=1}^nA_i^{1-l_i}.$$
Let $q\in J$; then for some $k\in\{1,\dots,n\}$: $q_k\ne p_k$, implying $1-q_k=p_k$. But since $x\in A_k^{p_k}$, then $x\in A_k^{1-q_k}$, hence $x\in F(q)^c$, and this for any $q\in J$. Hence $x\in F^c$.
Note that the minimal algebra containing a given finite number of sets, say $n$, has cardinality at most $2^{2^n}$, and is thus finite.
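The formula of proposition A.2.1 translates directly into a small computation; the following Python sketch (illustrative only: the finite base set X and the particular generators A_1, A_2, A_3 are assumptions of the example) builds the atoms F(j), forms all unions of atoms, and checks the 2^{2^n} cardinality bound.

```python
from itertools import product, combinations

X = set(range(8))
A_sets = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({5})]   # A_1, A_2, A_3
n = len(A_sets)

def F(j):
    # F(j) = intersection over i of A_i (if j_i == 0) or its complement (if j_i == 1)
    out = set(X)
    for ji, Ai in zip(j, A_sets):
        out &= (X - Ai) if ji == 1 else set(Ai)
    return frozenset(out)

atoms = {F(j) for j in product((0, 1), repeat=n)} - {frozenset()}

# the generated algebra = all unions of (nonempty) atoms, including the empty union
algebra = set()
for r in range(len(atoms) + 1):
    for combo in combinations(atoms, r):
        algebra.add(frozenset().union(*combo))

print(len(algebra), "<=", 2 ** (2 ** n))             # cardinality bound
print(all(A in algebra for A in A_sets),             # contains the generators
      frozenset(X) in algebra)                       # and the whole space
```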
Elements of a σ–algebra are called measurable sets. If
f : (X1 , A1 ) → (X2 , A2 ) is a function such that f −1 (A2 ) ∈ A1 for all A2 ∈ A2 ,
then f is said to be measurable A1 /A2 . But sometimes it may happen that the
σ–algebra on the codomain is too large to have measurability of the map. One way
out to have measurability is to allow a slightly bigger σ–algebra on the domain.
But first we recall the notions of measure, outer measure and completion of a
measure.
Here is a proposition about continuous functions and Borel measurability. Re-
Proposition A.2.2. Let (R, Z), (S, T ) be topological spaces and B1 := B(Z),
B2 := B(T ) their respective Borel σ–algebra. If f : (R, Z) → (S, T ) is continuous, then it is also B1 /B2 measurable.
Proof. It suffices to note that the class
Q := {B ∈ B2 : f −1 (B) ∈ B1 }
is a σ–algebra and by continuity of f contains T . Hence
Q ⊂ B2 = σ(T ) ⊂ σ(Q) = Q.
In other words f is Borel measurable.
Definition A.2.2. Let $(\mathcal{X},\mathcal{A})$ be a measurable space. If $\mu$ is a measure on $\mathcal{A}$, i.e. $\mu$ is $\sigma$–additive (if $A_i\in\mathcal{A}$, $i=1,2,\dots$, are disjoint sets, then $\mu(\cup_iA_i)=\sum_{i\ge1}\mu(A_i)$) and $\mu(\cdot)\ge0$ on $\mathcal{A}$, then we define the outer measure of $\mu$ as
$$\mu^*(E):=\inf\Big\{\sum_{n=1}^\infty\mu(A_n):A_n\in\mathcal{A},\ E\subset\bigcup_{n\ge1}A_n\Big\},$$
or $+\infty$ if no such sequence $\{A_n\}$ exists.
Let (X , A, µ) be a measure space. The measure theoretic completion of
the measure µ is defined as following, one extends µ to the smallest σ–algebra
containing A and the null sets of µ, i.e. the sets E ⊂ X : µ∗ (E) = 0. By
proposition 3.3.2. page 102 in [Dud1] it is the same as adding to A all sets E ⊂ X
which differ only up to a set of measure zero of some set in A. More formally one
adds all E ⊂ X such that for some B ∈ A the symmetric difference E∆B :=
(E\B) ∪ (B\E) is negligible (i.e. of measure zero).
A measure µ is said to be σ–finite iff there exists {An }n≥1 ⊂ A, ∪n≥1 An = X
and µ(An ) < +∞ for all n ≥ 1.
Definition A.2.3. A function f from a measurable space (X , B) into another measurable space (Z, A) is called universally measurable (u.m.) iff f −1 (A) is universally measurable in (X , B) for every A ∈ A.
The class of all universally measurable sets is a σ–algebra.
Lemma A.2.3. Let (X , A) be a measurable space. The universally measurable
sets form a σ–algebra.
Proof. i) The first proof. It is trivial that the class of universally measurable sets contains the whole space. It also contains complements: let U ⊂ X be a u.m. set; then for any probability measure P on (X , A) there are sets A ⊂ U ⊂ B, A, B ∈ A, with P(B\A) = P(B ∩ A^c) = 0. Then obviously A^c ⊃ U^c ⊃ B^c with A^c, B^c ∈ A and P(A^c\B^c) = P(A^c ∩ B) = 0. And if U_i, i = 1, 2, ..., are u.m. sets, then for any P there are A_i, B_i ∈ A with A_i ⊂ U_i ⊂ B_i and P(B_i\A_i) = 0. Then ∪_i A_i ⊂ ∪_i U_i ⊂ ∪_i B_i, with ∪_i A_i, ∪_i B_i ∈ A. Finally note that
$$P\Big(\bigcup_iB_i\setminus\bigcup_jA_j\Big)\le\sum_{i=1}^\infty P\Big(B_i\setminus\bigcup_jA_j\Big)\le\sum_{i=1}^\infty P(B_i\setminus A_i)=0.$$
ii) The second proof. By definition, universally measurable sets are the sets contained in the completion of every probability measure on A. Still by definition (A.2.2), they are exactly the sets lying in
$$\bigcap_\mu\sigma(\mathcal{A}\cup\mathcal{N}_\mu),$$
where μ ranges over the measures on A and N_μ denotes the null sets of μ. An intersection of σ–algebras remains a σ–algebra.
The next lemma shows that u.m. sets are preserved under inverse images of measurable functions.
Lemma A.2.4. Let (D, D) and (G, G) be two measurable spaces and g a measurable function from D into G. If U is a u.m. set of G, then g −1 (U ) is a u.m. set of
D.
Proof. Let P be a probability measure on (D, D), then the image measure of P
under g denoted by Pg := P ◦ g −1 is also a probability measure.
Thus by the definition of universal measurability there are
A, B ∈ G : A ⊂ U ⊂ B and Pg (B\A) = 0.
By measurability of g:
g −1 (A), g −1 (B) ∈ D and also g −1 (A) ⊂ g −1 (U ) ⊂ g −1 (B).
Finally
P (g −1 (B)\g −1 (A)) = P (g −1 (B\A)) = Pg (B\A) = 0.
A σ–algebra is closed under countable intersections, unions, symmetric differences and so on. So, starting with a class A of subsets of a set X , it is not surprising that each element of σ(A) is constructed out of countably many operations (with countably many sets of A). In the following lemma this is made rigorous.
Lemma A.2.5. Let X be a set and A a class of subsets of X . For each B ∈ σ(A)
there exists a countable subclass AB of A such that B ∈ σ(AB ).
Proof. Define the following class of sets
C := {B ∈ σ(A) : there is a countable subclass AB ⊂ A : B ∈ σ(AB )}
For B ∈ A clearly B ∈ σ({B}), so A ⊂ C. To finish the proof it is enough to show that C is a σ–algebra, because then, by definition, C ⊂ σ(A) and, by definition of
σ(A) (the smallest σ–algebra that contains A), C ⊃ σ(A).
X is always contained in any σ–algebra by definition, thus X ∈ C, because
X ∈ σ({A}) for any A ∈ A. If B ∈ C, then there is a countable AB : B ∈ σ(AB ),
but then B c ∈ σ(AB ) and B c ∈ C. Finally if Bn , n = 1, 2, · · · are elements of C
then ∪n Bn ∈ σ(∪ABn ) and so C is preserved under countable unions. Thus C is a
σ–algebra.
Lemma A.2.6. Let (R, T ) and (S, U) be two topological spaces. Denote their
respective Borel σ–algebras by B(R, T ) and B(S, U). Then the Borel σ–algebra B(R × S, T × U) on the product space (R × S, T × U) contains the product σ–algebra
B(R, T ) ⊗ B(S, U).
Moreover if both topological spaces are second–countable, then the two σ–algebras
on R × S coincide.
Proof. We refer to [Bill2] theorem M.10 on page 243 or [Dud1] proposition 4.1.7
page 119.
Here is a lemma needed in chapter 4 of our exposition.
Lemma A.2.7. Let (X , B) be a separable measurable space, i.e. {x} ∈ B for all x ∈ X and B is generated by countably many sets. Let I := [0, 1]; then
f : X → 2^∞ : x ↦ {I_{A_j}(x)}_{j≥1}
is a 1–1 function onto its range f(X ) =: Z.
Proof. We present two proofs, one more elementary and the other based on determining classes for measures on σ–algebras.
f : X → 2∞ : x 7→ {IAj (x)}j≥1 ,
Suppose, by contradiction, that f would not be injective. Then for some
x 6= y ∈ X we would have x ∈ Ai ⇐⇒ y ∈ Ai for all i ∈ N.
We define the class
{A ∈ B : {x, y} ⊂ A or {x, y} ∩ A = ∅},
A.2. MEASURE THEORY.
127
i.e. the sets of B such that the set contains both x and y, or such that both
x and y are in the complement. It contains, by assumption, the generating
sets. It then suffices to prove that the above class is a σ–algebra, since then
it would equal B, which is a contradiction.
It contains certainly the whole space X . It is also closed under complements. Because either {x, y} ⊂ A, then {x, y} ∩ Ac = ∅ or {x, y} ∩ A = ∅,
but then {x, y} ⊂ Ac . Finally consider sets Bn in the above class. Two situations are possible, either {x, y} ⊂ Bk for some k, then {x, y} ⊂ ∪n Bn .
Or for all n, {x, y} ∩ Bn = ∅. But then {x, y} ∩ (∪n Bn ) = ∅.
ii) Denote by Fn the algebra generated by A1 , · · · , An . Proposition A.2.1 tells
us that Fn is finite. So without loss of generality we may and do assume
that B is generated by ∪n Fn which consists of at most countably many sets.
Since Fn ⊂ Fn+m for m ≥ 1, and are all algebra’s. It follows that ∪n Fn is
also an algebra, which will be denoted by A0 .
We assumed that I_{A_n}(x) = I_{A_n}(y) for all n ≥ 1. As seen in the proof of A.2.1, the F(j) form a partition of X . Hence I_C(x) = I_C(y) for all
C ∈ A0 .
Now we define two measures µ1 and µ2 on A, where µ1 := δx and µ2 := δy ,
two Dirac delta measures. The previous discussion implies that µ1 and µ2
are equal on the generating algebra A0 of B. By the uniqueness of extension
of measures, theorem A.2.15, µ1 and µ2 agree on B. Hence we can not
differentiate x from y by sets of B, a contradiction with the separability of
B.
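Lemma A.2.7 can be illustrated on a toy example; the Python sketch below (illustrative only: the finite "space" X and the finitely many generating sets, standing in for the countable generating family, are assumptions) maps each point to its vector of indicator values and checks injectivity, which is exactly the separation property used in the proof.

```python
X = list(range(6))
# generating sets A_1, A_2, ...: here the binary-digit sets, which separate points of X
gens = [{x for x in X if (x >> b) & 1} for b in range(3)]

def embed(x):
    # x -> (I_{A_1}(x), I_{A_2}(x), ...), the map into {0,1}^infinity (truncated)
    return tuple(int(x in A) for A in gens)

images = [embed(x) for x in X]
print(images)
print(len(set(images)) == len(X))   # True: the map is one-to-one
```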
The empirical measures are (normalized) sums of i.i.d. random variables of a special kind, namely projections. Because of that their σ–algebra enjoys a nice property: the sets it contains are invariant under permutation of the first n coordinates for P_n. Here is a more general lemma about the class of sets invariant under permutations of the first n coordinates.
Lemma A.2.8. Let (X ∞ , A∞ , P ∞ ) denote the standard model and Pn the n–th
empirical measure. Then the class of all sets invariant under permutations, say C,
of the first n coordinates is a σ–algebra, and thus contains σ(Pn (f )), the smallest
σ–algebra which makes Pn (f ) measurable, where f is a real–valued measurable
function.
Proof. Permutations are in particular bijections so clearly X ∞ ∈ C. If A ∈ C and
Ac is not invariant under a permutation of the first n coordinates, then for some
x ∈ Ac and Π ∈ Sym(n), where Sym(n) stands for the symmetric group of order
n, (xΠ(1) , · · · , xΠ(n) , xn+1 , · · · ) ∈ A, but A is invariant, so x ∈ A; a contradiction.
If Am ∈ C, m ≥ 1, we obviously have ∪Am ∈ C.
The theorem below is about a property of the empirical measure: in some sense empirical measures contain all the information one needs in order to do statistics on them.
Theorem A.2.9. For any measurable space (X , A) and for each n = 1, 2, · · · ,
the empirical measure Pn is sufficient for P n , where P is the class of all laws on
(X , A). In other words the set H of functions x 7→ Pn (B)(x) for all B ∈ A, is
sufficient.
In fact the σ–algebra SH , i.e. the smallest σ–algebra making all function of H
measurable, is exactly Sn ( the σ–algebra of all subsets of An which are invariant
under permutation of the n coordinates).
Proof. We refer to [Dud2] theorem 5.1.9 on page 177.
Next we recall the Monotone Class Theorem. It provides a way to prove
that certain classes are σ–algebras, without having to verify all the axioms of a
σ–algebra.
Definition A.2.4. Let X be a set, and M a family of subsets of X . M is said to
be a monotone class iff for all {An }n≥1 ⊂ M: if An ↑ A, i.e. A1 ⊂ A2 ⊂ · · ·
and A = ∪n An or An ↓ A, i.e. A1 ⊃ A2 ⊃ · · · and A = ∩n An , then A ∈ M too.
Theorem A.2.10 (Monotone Class Theorem). Let (X , A) be a measurable space.
Let C be an algebra that generates A. If B is a monotone class, and if C is
contained in B, then B also contains A.
Proof. Note that it is enough to prove that the smallest monotone class, say M,
that contains C is a σ–algebra, namely σ(C) = A. This will be a consequence of
the fact a σ–algebra being a monotone class. Also then A = M ⊂ N , for N any
monotone class containing C.
Let D := {E ∈ M : X \E ∈ M}. Then, because C is an algebra, one has
C ⊂ D. And D is a monotone class. Indeed let An ∈ D, n = 1, 2, · · · and An ↑ A,
C
then AC
= (∪An )C = ∩AC
n ∈ M by definition of D and moreover A
n . Now
C
C
C
C
because An ↓ A one has A ∈ M or then by definition A ∈ D. The same
reasoning applies to a sequence An ↓ A. This proves D ⊂ M is a monotone class
that contains C. By minimality of M one has M ⊂ D so both are equal and M
is closed under taking complements.
Now for each set Y ⊂ X, let
MY := {E ∈ M : E ∩ Y ∈ M}.
A.2. MEASURE THEORY.
129
Then for each C ∈ C, because C is an algebra, and in particular stable under
intersections, C ⊂ MC . Also MC ⊂ M is a monotone class, so by minimality
of M we have that M ⊂ MC , thus they are the same. That MC is a monotone
class follows from the following considerations. Let An ∈ MC , n = 1, 2, · · · and
An ↑ A (the case for An ↓ A is quite analogue). But then An ∩ C ↑ A ∩ C, since
An ∩ C ∈ M we have by the monotone class property of M that A ∩ C ∈ M.
So by definition of MC one has A ∈ MC .
For every C ∈ C we have the important equality of MC = M.
Next take any B ∈ M (not necessarily contained in C), and note that MB contains
C. Indeed by our previous considerations we have that MC = M for every C ∈ C,
this means that C ∩ B ∈ M for every C ∈ C. Also, as in the previous step one
can show that MB is a monotone class. So M = MB for any B ∈ M, meaning
that M is closed under (finite) intersections.
Thus M is not only a monotone class, but also an algebra (contains the whole
space: X ∈ C ⊂ M, closed under complements, closed under (finite) intersections). We claim now that M is a σ–algebra. It remains only to prove that M is
closed under (arbitrary) countable intersections. So let An ∈ M (where An not
necessarily ↓ A for some A) then:
∩_{n≥1} An = ∩_{n≥1} ( ∩_{i=1}^{n} Ai ) = ∩_{n≥1} Bn ,

with Bn := ∩_{i=1}^{n} Ai ∈ M (in particular because M is an algebra), and note that Bn ↓ ∩_{n≥1} An ; the last set belongs to M by the monotone class property.
As a consequence of the Monotone Class Theorem one shows:
Theorem A.2.11. Let (X , A), (Y, B) be two measurable spaces. Let A be generated by the countable algebra {Ai : i ∈ N} and B by {Bi : i ∈ N}. Then the
product σ–algebra A ⊗ B is again countably generated by {Ai × Bj : i, j ∈ N}.
Proof. In order to see this, define, for fixed Ai and Bj , the following classes of
sets:
DAi := {B ∈ B : Ai × B ∈ σ({Ak × Bl : k, l ≥ 1}) }
DBj := {A ∈ A : A × Bj ∈ σ({Ak × Bl : k, l ≥ 1}) }
Then DAi (respectively DBj ) contains the algebra {Bi : i ∈ N} (respectively
{Ai : i ∈ N}) and is a monotone class, so it equals B (respectively A). Before
moving on let us prove our last assertion. So let Cn ∈ DAi (⊂ B), n = 1, 2, · · · for
some i ∈ N and Cn ↑ C (the case Cn ↓ C is treated in an analogous way). Then Ai × Cn ↑ Ai × C, and since each Ai × Cn ∈ σ({Ak × Bl : k, l ≥ 1}), also Ai × C ∈ σ({Ak × Bl : k, l ≥ 1}), i.e. C ∈ DAi . Next define
DA := {B ∈ B : A × B ∈ σ({Ak × Bl : k, l ≥ 1}) }
DB := {A ∈ A : A × B ∈ σ({Ak × Bl : k, l ≥ 1}) }
By the previous step for each Ai ; i = 1, 2, · · · : DAi = B, and if we fix any
B ∈ B, then we know that:
Ai × B ∈ σ({Ak × Bl : k, l ≥ 1}).
Hence DB contains again the A–generating algebra {Ai : i ∈ N}, and it is also
a monotone class (this is done as in the previous step), so equals A. (Repeating
the argument one also obtains that for any A ∈ A : DA = B.) Thus A × B is
contained in σ({Ak × Bl : k, l ≥ 1}) and thus the product σ–algebra
σ(A × B) = A ⊗ B
is countably generated.
Proving theorems about measurable functions is often easier when one first proves them for simple functions and then passes to the limit, since every measurable function can be written as an everywhere pointwise limit of simple functions.
Theorem A.2.12. Let (X , B) be a measurable space and f a real–valued function
on X , measurable B/B(R). There exists a sequence {fn }n≥1 of simple functions,
each measurable B/B(R), such that
0 ≤ fn (x) ↑ f (x) if f (x) ≥ 0,
and
0 ≥ fn (x) ↓ f (x) if f (x) ≤ 0.
Proof. We refer to [Bill1] chapter 2 theorem 13.5 on page 195 or [Dud1] theorem
4.1.5 on page 116 for a proof.
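As a concrete illustration of theorem A.2.12, the following Python sketch (the function f, the evaluation points and the dyadic construction fn(x) = min(⌊2^n f(x)⌋ 2^{-n}, n) are assumptions of this example, not part of the cited proofs) shows the monotone convergence of simple functions fn to a non–negative f.

import numpy as np

def simple_approx(f, n):
    """n-th dyadic simple-function approximation of a non-negative function f."""
    def fn(x):
        return np.minimum(np.floor((2.0 ** n) * f(x)) / (2.0 ** n), n)
    return fn

f = lambda x: np.exp(x)                     # a non-negative measurable function
x = np.array([0.0, 0.5, 1.3])
for n in (1, 2, 4, 8):
    print(n, simple_approx(f, n)(x))        # values increase towards f(x)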
This theorem is due to Lebesgue. It relates each Borel class to measurable functions on the product I × X . Recall from chapter 5, section 1 that F0 is the class of all finite sums Σ_{j=1}^{n} qj I_{Aj} with rational qj and n = 1, 2, · · · and Aj ∈ A, the algebra generating B. W is still an uncountable well–ordered set where each segment Wa for a ∈ W is at most countable. For w ∈ W, Fw is the class of all functions f which are the everywhere limit of functions from Fv , v < w. And U := ∪v∈W Fv . We will need the following definition.
Definition A.2.5. Let (L, <) be a linearly ordered set. We define the order topology on (L, <) as the topology with base the sets of the form Ax := {a : a < x}, Bx := {a : a > x} and Cx,y := {a : x < a < y}, for x, y ∈ L.
Theorem A.2.13 (Lebesgue). Let (X , B) be a separable measurable space. For
any w ∈ (W, ≤) there exists a universal class w (or α) function G : I × X → R,
i.e. for all f ∈ Fw , there is a t ∈ I such that G(t, ·) = f (·).
Proof. The proof will be based on transfinite induction.
Let w = 0, where 0 denotes the smallest element of (W, ≤). We start by noting that F0 as defined above is countable. Indeed, let f ∈ F0 , thus f = Σ_{j=1}^{n} qj I_{Aj}; for fixed n we only have the choice out of Q^n × A^n , and since both Q and A are at most countable, Q^n × A^n is at most countable too, as is ∪n Q^n × A^n . Now enumerate the functions of F0 by the positive integers and let

G(t, x) := fk (x), if t = 1/k for some k ≥ 1, and G(t, x) := 0 otherwise.

Then G(t, x) = Σ_{k≥1} I_{{1/k}}(t) fk (x), which is jointly measurable.
Next, let (0 <)w ∈ (W, ≤) and suppose the result holds for all v < w. Since
there are at most countably many v’s smaller than w, we can number them with the
positive integers, from the smallest to the greatest. We consider two possibilities,
either the segment is finite or countable.
When the segment is finite define
G : I ∞ × X → R : ({tn }n≥1 , x) 7→ lim sup_{n→∞} H(tn , x) if the limit exists, and 7→ 0 otherwise,

where H := G_{max{v:v<w}} .
• As such G is (jointly) measurable: it is the limit, when it exists, of the measurable functions sup_{k≥n} H ◦ (πk , idX ). The function 0 is measurable too, and, as seen in the proof of theorem 4.2.5 on page 127 in [Dud1], the set of points where a sequence of (real–valued) measurable functions fails to converge is measurable. So G is indeed measurable.
• Let f ∈ Fw ; then f is the everywhere limit of a sequence fn , and without loss of generality all fn can be taken in F_{max{v:v<w}} . For each fn there exists a tn such that H(tn , x) = fn (x) for all x ∈ X . Then, with t := {tn }, we get

f (x) = lim_{n→∞} fn (x) = lim sup_{n→∞} H(tn , x) =: G(t, x),

because the lim sup equals the limit when the limit exists, and this holds for all x ∈ X .
By the Borel isomorphism theorem C.1.1, and since I and I ∞ are Polish
spaces, proposition A.1.14, there exists a Borel measurable map φ : I → I ∞
with Borel measurable inverse. The composition of G and (φ, idX ) remains measurable. So let G := G ◦ (φ, idX ).
Consider now the w which have a countable segment. As in the previous case, let

G : I ∞ × X → R : ({tn }n≥1 , x) 7→ lim sup_{n→∞} G_{vn}(tn , x)

whenever the lim sup exists, and zero otherwise, where vn is a sequence converging to w for the order topology, definition A.2.5. Then Gw so defined is jointly measurable, and the sections are the functions from Fw .
• We start with the measurability of Gw . I ∞ is metrizable, proposition A.1.8,
and separable, proposition A.1.9. Moreover its Borel σ–algebra equals the
product σ–algebra, lemma A.2.14. In particular the projections,
(πk , idX ) : I ∞ × X → I × X
are measurable functions. Gvk : I × X → R are also (jointly) measurable,
so that the composition Gvk ◦ (πk , idX ) is again (jointly) measurable. Since,
by definition A.1.7,
lim sup_{n→∞} G_{vn}(tn , x) = lim_{n→∞} ( sup_{k≥n} (G_{vk} ◦ (πk , idX ))({tj }j≥1 , x) ),

it follows that the LHS is measurable, because it is the limit, when it exists, of the measurable variables sup_{k≥n} G_{vk} ◦ (πk , idX ).
• For the second point, it is clear that the sections of G are by definition elements of Fw .
Conversely any element of Fw can be written as a section of G for some
t ∈ I ∞ . Let f ∈ Fw , by definition, there exists a sequence of measurable
functions {fn }n≥1 , such that f is the everywhere limit of fn and such that
fn ∈ Fvn for some vfn := vn < w. If all vn are strictly smaller than
some v < w, then the assertion is trivial. Indeed, the segment of w is at
most countable, so all its elements can be numbered by the positive integers.
Because f ∈ Fz for all v ≤ z < w, by our induction hypothesis, there exists
for each such z a tz ∈ I with G(tz , ·) = f . So

f = lim sup_{z→w} Gz (tz , ·) =: G({tz }z<w , ·),

where lim sup_{z→w} means lim sup_{n→∞} , since the segment of w is at most countable.
Now the second case where fn ∈ Fun , and un is strictly increasing to w. Let
u be the smallest among the un (u exists, since (W, ≤) is well ordered) and
u equals vk for some k. For each fixed m, the maximum max_{1≤i≤m} ui exists and is not w. Indeed, w has a countable segment and {ui : 1 ≤ i ≤ m} is finite, so Ww \{ui : 1 ≤ i ≤ m} is non–empty. For u1 , take the smallest vn greater than or equal to u1 and denote it by v_{n(1)}; this will be our starting point, and it does not affect the lim sup. So f1 ∈ F_{u1} ⊂ F_{v_{n(1)}}, hence by induction there exists a t1 ∈ I for which f1 = G_{v_{n(1)}}(t1 , ·). For u2 , again there is a smallest v_{n(2)} greater than or equal to u2 , so f2 ∈ F_{u2} ⊂ F_{v_{n(2)}} and f2 = G_{v_{n(2)}}(t_{n(2)} , ·). And for all l between 1 and n(2) we know that u1 < vl ; for each such l, let tl be such that f1 = G_{vl}(tl , ·). Continuing by recursion, we obtain a sequence in which some terms are possibly repeated finitely many times; this does not change the convergence, the limit will still be f . Moreover, for t = {ti }, where ti is defined as above, we get

f (x) = lim_{n→∞} fn (x) = lim sup_{n→∞} G_{vn}(tn , x) =: G({tn }, x).
Repeat the operation from the previous case to find a jointly measurable function with domain I × X and the same properties as G.
Lemma A.2.14. Let I = [0, 1] and I ∞ = Π_{n≥1} In . Then the Borel σ–algebra on I ∞ equals the product σ–algebra.
Proof. Recall that the product σ–algebra is the σ–algebra generated by the sets A ⊂ I ∞ of the form Π_{i≥1} Ai , where Ai ≠ I for at most finitely many i and, if Ai ≠ I, then Ai ∈ B(I).
We first prove that σ(Π_n Tn ) includes the product σ–algebra; this holds for any index set and not just countable ones.
Recall that the product topology is the coarsest (or smallest) topology making
all the projections
πj : Π_{k≥1} (Ik , TEucl ) → (Ij , TEucl )
continuous. In particular, the projections are all B(I ∞ )/B(I) measurable (proposition A.2.2). Also the product σ–algebra is the smallest σ–algebra making all the
projections πj Borel measurable. Indeed, it’s easily seen that the projections are
measurable for the product σ–algebra, since for E ∈ B(I):
πj (πk^{-1}(E)) = E if k = j, and πj (πk^{-1}(E)) = [0, 1] if k ≠ j.
If C is any σ–algebra on I ∞ making the projections Borel measurable, then it has
to contain the sets A of the form described above, and hence it has to contain the
product σ–algebra.
We conclude, by minimality of the product σ–algebra:
⊗_{k≥1} B(Ik , TEucl ) ⊂ B(I ∞ , T ∞ ).
It is for the converse that the notion of separability will be crucial.
Conversely, since I ∞ is separable (proposition A.1.9) each open set is an at most
countable union of basis sets, which are of the form described in definition A.1.6. It is then trivial to see that any open set G ∈ Π_{n≥1} Tn belongs to σ( ∪_{k≥1} ( Π_{l=1}^{k} B(I)l × Π_{l≥k+1} [0, 1]l ) ). So

σ( ∪_{n≥1} ( Π_{l=1}^{n} B(I)l × Π_{l≥n+1} [0, 1]l ) ) ⊃ σ( Π_n Tn ).
The following two theorems are about the extension and the uniqueness of that
extension of a (probability) measure defined on some class of sets, which satisfies
some properties, to the σ–algebra generated by that class.
We first mention the extension theorem.
Theorem A.2.15. Let µ be a set function on a semiring C such that µ is non–negative, µ(∅) = 0, and µ is finitely additive and countably subadditive. Then µ extends uniquely to a measure on σ(C).
Proof. We refer to [Bill1] chapter 2 theorem 11.3 on page 176 for a proof.
Next comes the uniqueness theorem for probability measures.
Theorem A.2.16. Let µ1 , µ2 be two probability measures on σ(P), where P is a
π–system, i.e. for all A, B ∈ P: A ∩ B ∈ P. If µ1 and µ2 agree on P, then they
agree on σ(P).
Proof. We refer to [Bill1] chapter 1 theorem 3.3 on page 44 for a proof.
The theorem of Tonelli–Fubini tells us under what conditions an integral with respect to a product measure can be computed as an iterated integral w.r.t. the measures on the components of the product space.
Theorem A.2.17 (Tonelli–Fubini). Let (Xi , Ai , µi ), i = 1, 2 be two measure spaces,
where µi , i = 1, 2 are σ–finite measures. Let f : X1 × X2 → [0, +∞] be an
A1 ⊗ A2 , B([0, +∞]) measurable function or f ∈ L1 (X1 × X2 , A1 ⊗ A2 , µ1 × µ2 ).
Then

∫ f d(µ1 × µ2 ) = ∫ ( ∫ f (x, y) dµ1 (x) ) dµ2 (y) = ∫ ( ∫ f (x, y) dµ2 (y) ) dµ1 (x)

whenever one of the integrals exists. In particular ∫ f (x, ·) dµ1 (x) is A2 –measurable and ∫ f (·, y) dµ2 (y) is A1 –measurable.
Proof. We refer to [Dud1] theorem 4.4.5. page 137 for a proof.
Limit theorems.
Theorem A.2.18 (Monotone Convergence). Let (X , A, µ) be a measure space and fn : (X , A) → [−∞, +∞], n = 1, 2, · · · measurable functions. If fn ↑ f and ∫ f1 dµ > −∞, then ∫ fn dµ ↑ ∫ f dµ.
Proof. We refer to [Dud1] theorem 4.3.2. page 131 for a proof.
Theorem A.2.19 (Dominated convergence). Let (X , A, µ) be a measure space and g, fn ∈ L1 (X , A, µ), n = 1, 2, · · · . If |fn (x)| ≤ g(x) and if for all x ∈ X : fn (x) → f (x) as n → ∞, then f ∈ L1 (X , A, µ) and ∫ fn dµ → ∫ f dµ.
Proof. We refer to [Dud1] theorem 4.3.5. page 132 for a proof.
Theorem A.2.20 (Strong Law of Large Numbers). Let (X , A, P ) be a probability space and X1 , X2 , · · · independent and identically distributed, abbreviated i.i.d., random variables. Let Sn := Σ_{i=1}^{n} Xi . Then Sn /n → E[X1 ] P –a.s. iff E[|X1 |] < +∞.
Proof. We refer to [Dud1] theorem 8.3.5. page 263 for a proof.
Theorem A.2.21 (The Central Limit Theorem). Let X1 , X2 , · · · be i.i.d. random variables with values in R^k such that E[X1 ] = 0 and E[||X1 ||²_{dE}] ∈ ]0, +∞[. Then, as n → ∞,

Sn /√n → Z in distribution,

where Z is a random variable on R^k with characteristic function

fZ (t) = exp( − (1/2) Σ_{r,s=1}^{k} Crs tr ts )

and Crs := E[X1 X1^t ](r, s).
Proof. We refer to [Dud1] theorem 9.5.6. page 306 for a proof.
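The two limit theorems can be illustrated numerically. The following Monte Carlo sketch in Python (the exponential distribution, the sample sizes and the seed are assumptions of the example) shows the sample mean settling at E[X1] and the centred, √n–rescaled sums behaving approximately like a normal variable.

import numpy as np

rng = np.random.default_rng(1)
mu = 1.0                                   # E[X1] for the Exponential(1) law
n, reps = 2_000, 2_000

# Strong law of large numbers: Sn/n is close to E[X1] for large n.
x = rng.exponential(mu, size=n)
print("Sn/n =", x.mean(), " E[X1] =", mu)

# Central limit theorem: (Sn - n*mu)/sqrt(n) is approximately N(0, Var(X1)) = N(0, 1).
sums = rng.exponential(mu, size=(reps, n)).sum(axis=1)
z = (sums - n * mu) / np.sqrt(n)
print("mean and variance of the rescaled sums:", z.mean(), z.var(), " (theory: 0 and 1)")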
Continuity of (outer) measures.
We recall continuity from below for measures. And in the second lemma continuity from below for outer measures is proved.
Lemma A.2.22. Let (X , A, µ) be a measure space. Then µ is continuous from below, i.e. for any {An }n≥1 ⊂ A with A1 ⊂ A2 ⊂ · · · (shortly An ↑ ∪n An =: A), one has µ(An ) ↑ µ(A).
Proof. Let An ∈ A with An ↑ ∪n An =: A. Define

Bn := An \ ∪_{1≤k≤n−1} Ak , for n ≥ 1.

The Bn as constructed above have two important properties, namely

∪_{1≤k≤n} Bk = An and Bi ∩ Bj = ∅ for i ≠ j.

The second is easy: for i ≠ j one may suppose that i < j and note that Bi ⊂ Ai and Bj = Aj ∩ (∩_{1≤k≤j−1} Ak^c ) ⊂ Ai^c . The former follows by induction. Thus by σ–additivity,

µ(A) = µ( ∪_{n≥1} An ) = µ( ∪_{n≥1} Bn ) = Σ_{n≥1} µ(Bn ) = lim_{N→∞} Σ_{n=1}^{N} µ(Bn ) = lim_{N→∞} µ(AN ).
Lemma A.2.23. Let µ be a σ–additive, non negative measure on an algebra A of
subsets of a set X . Denote its outer measure by µ∗ . Let An , n = 1, 2, · · · and A be arbitrary subsets of X such that An ↑ A. Then µ∗ (An ) ↑ µ∗ (A).
Proof. Clearly µ∗ (A) ≥ µ∗ (An ) by monotonicity of µ∗ . The sequence µ∗ (An )
is bounded above and monotone, so converges to some limit c ≤ µ∗ (A). So it
remains to prove the converse inequality. First assume µ∗ (An ) → +∞, as n → ∞
then obviously (+∞ =)c ≥ µ∗ (A).
So assume c < +∞. Let ε > 0; then for every An , by definition of µ∗ , there is a sequence {Anj }j≥1 ⊂ A with An ⊂ ∪_{j≥1} Anj and µ∗ (An ) + ε ≥ Σ_{j≥1} µ(Anj ).
Let B_{nk} := ∪_{1≤j<k} A_{nj} and let µ̃ be the (unique) extension of µ to σ(A). This can be done by using the Extension Theorem A.2.15. Furthermore let B_n := ∪_{j≥1} A_{nj} . Then clearly one has B_{nk} , B_n ∈ σ(A) for any k, n ∈ N. Now note that for k ≥ 2, B_{nk} = ∪_{1≤j<k} (A_{nj} \B_{nj} ), which follows by induction on k. So B_n = ∪_{j≥1} A_{nj} = A_{n1} ∪ (∪_{j≥2} (A_{nj} \B_{nj} )). Thus we wrote B_n as a disjoint union: if r < s, then (A_{nr} ∩ B_{nr}^c ) ∩ (A_{ns} ∩ B_{ns}^c ) = ∅, because A_{nr} ⊂ B_{ns} . Hence

µ̃(B_n ) = µ̃( A_{n1} ∪ (∪_{j≥2} (A_{nj} \B_{nj} )) )
= µ̃(A_{n1} ) + Σ_{j≥2} µ̃(A_{nj} \B_{nj} )
= µ(A_{n1} ) + Σ_{j≥2} µ(A_{nj} \B_{nj} )
≤ Σ_{j≥1} µ(A_{nj} )
≤ µ∗ (A_n ) + ε.

Let C_n := ∩_{r≥n} B_r ; then C_1 ⊂ C_2 ⊂ · · · , so C_n ↑ ∪_{n≥1} C_n =: C, which is contained in σ(A). Also, because A_n ↑ A and by definition of B_n , one has

A_n ⊂ A_{n+l} and B_k = ∪_{j≥1} A_{kj} ⊃ A_k

for all l ∈ N and k ≥ n. Thus

A_n ⊂ C_n = ∩_{j≥n} B_j and C_n ⊂ B_n .

So from the above calculations one obtains

µ∗ (A_n ) ≤ µ̃(C_n ) ≤ µ̃(B_n ) ≤ µ∗ (A_n ) + ε ≤ c + ε.
But we also have that µ̃ is a measure, and measures are continuous from below (lemma A.2.22), so µ̃(C_n ) ↑ µ̃(C), which implies µ̃(C) ≤ c + ε. Finally,

A = ∪_{n≥1} A_n ⊂ ∪_{n≥1} C_n = C,

thus µ∗ (A) ≤ µ̃(C) ≤ c + ε; because ε > 0 is arbitrary, one has µ∗ (A) ≤ c.
A.2.2 (Sub)Martingales and reversed (sub)martingales.
Definitions.
Definition A.2.6. Let (X , A, P ) be a probability space, T a set and (S, d) a metric
space. A stochastic process is a map
X : T × X → S : (t, ω) 7→ X(t, ω) = Xt (ω)
such that for each t ∈ T fixed, Xt : (X , A) → (S, B(Td )) is measurable. Td is the
topology generated by the metric d and B(Td ) the Borel σ–algebra on S. Usually
T = R or a subset of it (like N).
Suppose moreover that (T, ≤) is a linearly ordered set and {At }t∈T a family of σ–algebras on X with At ⊂ Au ⊂ A for t ≤ u; t, u ∈ T . Such a family is often called a filtration on (X , A, P ). Then {Xt , At }t∈T is said to be a martingale if:
i) Xt is At measurable for each t ∈ T (or more generally measurable for the
P –completion of At );
ii) E[|Xt |] < ∞ for each t ∈ T ;
iii) E[Xu ||At ] = Xt for all t, u ∈ T : u ≥ t.
As usual equality means equality almost surely. If condition (iii) is replaced by:
Xt ≤ E[Xu ||At ] for all t, u ∈ T : t ≤ u,
then (Xt , At )t∈T is said to be a submartingale .
Finally we define the concept of a reversed (sub)martingale. Let {Mn , Bn }n≥1 be a sequence of random variables Mn , each measurable for a σ–algebra Bn , and change the index to negative integers: {Mn , Bn }n≤−1 . The sequence is said to be a reversed martingale if the reversed sequence is a martingale, i.e. · · · ⊂ B−2 ⊂ B−1 and
i) Mn is Bn measurable for each n ≤ −1 (or more generally measurable for the P –completion of Bn );
ii) E[|Mn |] < ∞ for each n ≤ −1;
iii) E[Mk ||Bn ] = Mn for all n, k ≤ −1 : k ≥ n.
In particular E[M−1 ||Bn ] = Mn , n = −1, −2, · · · . If instead one has
E[Mk ||Bn ] ≥ Mn for all n, k ≤ −1 : k ≥ n,
then {Mn , Bn }n≤−1 is called a reversed submartingale. A reversed (sub)martingale can be thought of as a backward (sub)martingale: we start in the future and end in the present.
A convergence theorem for reversed (sub)martingales.
Here we state a convergence theorem without proof.
Theorem A.2.24 (Doob's Decomposition and Convergence for Reversed Submartingales). Let {Mn , Bn }n≤−1 be a reversed submartingale. Suppose that

K := inf_{n≤−1} E[Mn ] > −∞.
Then there is a decomposition Mn ≡ Nn + Zn where Zn is measurable for Bn−1
and {Nn , Bn }n≤−1 is a reversed martingale and
Z−1 ≥ Z−2 ≥ · · · almost surely, with Zn ↓ 0 a.s. as n → −∞.
Thus Mn converges almost surely and in L1 as n → −∞.
Proof. See [Dud1] theorem 10.6.4 page 373.
Appendix B
Gaussian Processes.
Definition B.0.7. Let {Xt }t∈T , be a stochastic process, i.e. a collection of random
variables, with T a set. Then {Xt }t∈T is called Gaussian, if for all F ⊂ T finite:
(Xt )t∈F has a multivariate normal distribution.
Recall that the (multivariate) normal distribution is entirely determined by its
mean and its covariance matrix. The following theorem states (necessary and)
sufficient conditions on functions m, C such that there exists a Gaussian process
with mean and covariance matrix given by the function m, respectively function
C.
This will be done using a theorem of Kolmogorov, which roughly states that if a given collection of finite dimensional distributions satisfies some ”consistency” assumptions, then there exist a probability space and a stochastic process on that probability space with the given distributions as its finite dimensional distributions.
Definition B.0.8. Let T be a set and
{PF : F ⊂ T finite, PF probability distribution on RF }
a collection of probability distributions. The collection is said to be consistent, if
for all finite subsets F, G of T with F ⊂ G, and πGF the natural projection from R^G onto R^F , we have that PG ◦ πGF^{-1} = PF .
The next theorem is due to Kolmogorov and gives conditions for a system of distributions to come from a stochastic process. It can be stated for more general
range spaces, but for our purposes here R as range space will be enough.
Theorem B.0.25 (Kolmogorov). Let T be a set and µ_{t1 ···tk} a system of distributions satisfying
i) µt1 ···tk (H1 ×· · ·×Hk ) = µtπ1 ···tπk (Hπ1 ×· · ·×Hπk ) for π any permutation of
(1, · · · , k) and any k ∈ N and any t1 , · · · , tk ∈ T and Hi ∈ R, 1 ≤ i ≤ k;
ii) µt1 ···tk−1 (H1 × · · · × Hk−1 ) = µt1 ···tk (H1 × · · · × Hk−1 × R) for any k ∈ N,
any choice t1 , · · · , tk ∈ T and any Hi ∈ R, 1 ≤ i ≤ k − 1.
Then there exists a probability measure P on B(RT ) ( the product σ–algebra on
the product space RT ) such that the coordinate variable process [Zt : t ∈ T ] on
(RT , B(RT ), P ) has the µt1 ···tk as its finite dimensional distributions.
If x denotes an element of RT , and πt the projection from RT into R, then
Zt (x) := xt := x(t) := πt (x).
Proof. We refer to [Bill1], Chapter 7, theorem 36.1 on page 517 for a proof.
Theorem B.0.26. Let T be a set and m : T → R, C : T × T → R functions. Then C(s, t) = C(t, s) for every s, t ∈ T and {C(s, t)}s,t∈F is a nonnegative definite matrix for every finite F ⊂ T iff there exists a Gaussian process {Xt }t∈T with mean function m and covariance C.
Proof. We refer to [Dud1] theorem 12.1.3 on page 443 for a proof.
So let GP be a Gaussian process indexed by f ∈ L2 (P ), such that GP (f ) has mean zero and covariance given by the covariance of f, g, i.e.

Cov(f, g) := ∫ (f − ∫ f dP )(g − ∫ g dP ) dP, for all f, g ∈ L2 (P ).

Then, using theorem B.0.26 about the existence of Gaussian processes, one sees that such a process exists, because the usual covariance of square–integrable functions is nonnegative definite.
Proof. Let fi ; i = 1, · · · , n be square–integrable functions and ai ; i = 1, · · · , n
real numbers. Remark that:
a^t F a = Var( Σ_{i=1}^{n} ai fi ),

where F := the covariance matrix of (fi )_{i=1}^{n} , and the latter term is non–negative.
Example B.0.1 (Brownian Bridge). A Brownian Bridge {Yt }t∈[0,1] is a Gaussian
process, with zero mean and Cov(Yt , Ys ) = s(1 − t), if t > s. Let P be Lebesgue
measure on the unit interval; then GP (I[0,t] ) has zero mean and covariance

∫ (I[0,t] − t)(I[0,s] − s) dP = ∫ I[0,t] I[0,s] dP − st = s ∧ t − st.
There is one major difference: there exists a version of the Brownian Bridge which has continuous sample paths, [Dud1] theorem 12.1.5 on page 446, but a general Gaussian process on L2 (P ) cannot have all its sample paths continuous, [Dud2] chapter 2 section 6.
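Example B.0.1 can easily be simulated. The sketch below (the grid, the number of sample paths and the use of NumPy are assumptions of this illustration) draws centred Gaussian vectors with covariance s ∧ t − st on a finite grid, i.e. the finite dimensional distributions of the Brownian bridge, and compares the empirical covariance at two grid points with the theoretical one.

import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.01, 0.99, 50)                  # a grid in (0, 1)
cov = np.minimum.outer(t, t) - np.outer(t, t)    # Cov(Ys, Yt) = s ∧ t − st

# finite dimensional distributions of the Brownian bridge on the grid
paths = rng.multivariate_normal(np.zeros(len(t)), cov, size=5_000)

i, j = 10, 40
print("empirical:", np.cov(paths[:, i], paths[:, j])[0, 1],
      " theoretical:", min(t[i], t[j]) - t[i] * t[j])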
Definition B.0.9. A version of a random variable ( stochastic process ) is any
other random variable (respectively stochastic process) on some (not necessarily
the same) probability space such that their laws coincide.
Definition B.0.10. Let H be a separable Hilbert space.
i) The isonormal process L on H is the Gaussian process such that for each x ∈ H, L(x) is distributed according to the normal law with mean zero and variance ||x||². The finite dimensional distributions of (L(xi ))_{i=1}^{n} are multivariate normal with mean zero and covariance ⟨xk , xl ⟩. Here ⟨·, ·⟩ denotes the inner product on H.
For any A ⊂ H define L(A)∗ as ess.sup_{x∈A} L(x), the smallest random variable Y with Y ≥ L(x) a.s. for all x ∈ A. Similarly |L(A)|∗ := ess.sup_{x∈A} |L(x)|.
Note that GP (c) = 0 a.s. for c some constant function. Also GP is linear, that
is GP (αf + g) = αGP (f ) + GP (g) a.s.
Proof. Both have mean zero and the difference GP (αf + g) − αGP (f ) − GP (g),
which we denote by G, has variance equal to zero.
Var(G) = Var(GP (αf + g)) + Var(αGP (f )) + Var(GP (g))
− 2[Cov(GP (αf + g), αGP (f )) + Cov(GP (αf + g), GP (g)) − Cov(αGP (f ), GP (g))]
= Var(αf + g) + Var(αf ) + Var(g)
− 2[Cov(αf + g, αf ) + Cov(αf + g, g) − Cov(αf, g)]
= Var(αf + g − αf − g)
= 0.
Combining those two facts, we have
0 = Var(GP (αf +g)−αGP (f )−GP (g)) = E(GP (αf +g)−αGP (f )−GP (g))2 ,
implying that (GP (αf + g) − αGP (f ) − GP (g)) = 0 a.s.
The covariance w.r.t. P also induces a pseudometric on L2 (P ) as follows: ρP (f, g) := {E[(GP (f ) − GP (g))²]}^{1/2} . On L²_0(P ) := {f ∈ L2 (P ) : ∫ f dP = 0} this pseudometric coincides with the usual one, for

ρP (f, g) = {E[(GP (f − g))²]}^{1/2} = Var(f − g)^{1/2} = {E[(f − g)²]}^{1/2} .

It plays an important role when proving tightness of the empirical process.
Appendix C
More about Suslin / Analytic Sets.
C.1 The Borel Isomorphism Theorem.
The main theorem from this part is not only of interest for its use throughout the text but is also of interest on its own. It tells us that for some spaces being Borel isomorphic is nothing more than having the same cardinality, and that that cardinality can only be finite, countable, or c (the cardinality of the continuum). This is remarkable: for the spaces about to be specified the continuum hypothesis is a consequence of the other axioms of set theory, whereas its more general form is known to be independent of the usual ZFC axioms of set theory.
Theorem C.1.1. Let (R, d) and (S, e) be two separable metric spaces which are
Borel subsets of their (metric) completions. Then R ∼ S, i.e. R and S are Borel isomorphic (there exists a Borel measurable bijection f from (R, d) onto (S, e) with measurable inverse), iff R and S have the same cardinality, which is moreover finite, countable or c (the cardinality of the continuum [0, 1]).
Proof. We refer to [Dud1] theorem 13.1.1 on page 487–492 for a (quite lengthy)
proof.
We give a short comment on the implications of the theorem. First of all, the theorem tells us that for separable spaces which are Borel sets of their metric completions, being Borel isomorphic is nothing else than having the same size as sets. This is remarkable in that, in general, a bijection between two sets which happen to be topological spaces need not be a Borel isomorphism, and it is not clear how one could make such a set isomorphism and its inverse measurable for the Borel σ–algebras. But this theorem says that there is such a function. Secondly, as mentioned before the theorem, we get the continuum hypothesis (CH) for free for such spaces, whereas in the general case CH is not decidable in ZFC. Recall that CH states
that c, the cardinality of the continuum, is the smallest cardinal strictly greater
than the cardinality of the positive integers.
C.2 Definitions and properties of Analytical Sets.
We present some (well known) facts about Suslin sets and spaces that were used in
the fifth chapter. We start with a characterization of Suslin, also called Analytic,
sets.
Let us first recall the definition, given also in the second section of the fifth
chapter, of a Suslin space.
Definition C.2.1. A separable measurable space (Y, S) will be called a Suslin
space iff there exists a Polish space R and a Borel measurable map from R onto
Y . If (Y, S) is a measurable space, a subset Z of Y will be called Suslin set iff
(Z, Z u S) is a Suslin space.
Theorem C.2.1. Let (S, d) be a Polish space and A a non–empty subset of S, the
following are equivalent:
a) A = f [N∞ ], for f some continuous function;
a’) A = f [N∞ ], for f some Borel measurable function;
b) A = f [R], for f some continuous function and (R, e) Polish;
b’) A = f [R], for f some (Borel) measurable function and (R, e) Polish;
c) A = f [B], where f : R → S is continuous, B Borel and (R, e) Polish;
c’) A = f [B], where f : R → S Borel measurable, B Borel and (R, e) Polish.
Proof. The implications a → a′, b → b′ and c → c′ follow immediately. Moreover, N∞ is, as a countable product of Polish spaces, itself Polish, so a → b → c and a′ → b′ → c′ are straightforward too.
To finish the proof it remains to show c′ → a. We proceed in two steps. Firstly we prove c′ → b′ . We are given a Borel measurable map and we would like to extend its domain. This is easily done as follows: let a ∈ A and f (x) := a for all x ∈ R\B, which is possible because B is Borel, so that the relative Borel σ–algebra on B, B u σ(Te ), is contained in σ(Te ). Secondly, b′ → a. If we can prove that the graph of f is a Borel set in R × S, we are done, because a Borel set in a Polish space can be written as the image of N∞ under a continuous function g. This last statement will be shown in the next lemma (C.2.2).
Lemma C.2.2. Let R, S be Polish spaces, f from R into S a Borel measurable
function. Then the graph of f , i.e. {(r, f (r)) : r ∈ R}, is a Borel set in R × S.
Proof. We make use of the Borel isomorphism theorem C.1.1: two separable spaces which are Borel sets of their completions are Borel isomorphic iff they have the same cardinality, and moreover this cardinality is either finite, countable or c, the cardinality of the continuum. So w.l.o.g. we can and do assume that S is a subset of 2∞ (2∞ has the cardinality of the continuum and is complete and separable). Actually, if S is uncountable, then S ∼ 2∞ ; if S is countable then e.g. S ∼ N and N can be identified with a Borel set of 2∞ , e.g. n 7→ ∩_{i≠n} πi^{-1}({0}) ∩ πn^{-1}({1}). So we can identify S with a Borel set of 2∞ , and a Borel set in S is also Borel in 2∞ . From now on we work with 2∞ rather than S.
Consider the sets

T (n) := {s = {sj }_{1≤j≤n} : sj = 0 or 1, for each j} = {0, 1}^n .

For s ∈ T (n) let

Cs := {u = {uj }_{j≥1} ∈ 2∞ : uj = sj for j = 1, · · · , n} = ∩_{j=1}^{n} πj^{-1}({sj });

then Cs is clopen, i.e. closed and open, in 2∞ . Further define Bn := ∪_{s∈T (n)} f^{-1}(Cs ) × Cs , which is a Borel set of R × 2∞ . Finally let G := ∩n Bn . We claim that G is the graph of f . For x ∈ R, (x, f (x)) belongs to G: for any n ∈ N, define sn := (f (x)1 , · · · , f (x)n ); then clearly f (x) ∈ C_{sn} and x ∈ f^{-1}(C_{sn} ). Conversely, let (x, y) be contained in G; then for each n there is some s := sn ∈ T (n) such that (x, y) ∈ f^{-1}(Cs ) × Cs . This means (y1 , · · · , yn ) = (sn1 , · · · , snn ) and f (x) ∈ C_{sn} , i.e. (f (x)1 , · · · , f (x)n ) = (sn1 , · · · , snn ). Then y = f (x).
So examples of Suslin spaces include Polish spaces: these spaces are separable, so their Borel σ–algebra is countably generated, and they are metric, in particular Hausdorff, hence the singletons are closed. Borel sets of Polish spaces are also Suslin, and more generally separable metric spaces which are Borel sets of their completions are also Suslin.
If an analytic set is equipped with its Borel σ–algebra, then measurable subsets of it are again analytic.
Corollary C.2.2. Let (A, A) be an analytic set with Borel σ–algebra. Then any
Z ∈ A is analytic again.
Proof. Since A is analytic, there exists some Polish space (S, T ) and Borel measurable map f from S onto A. Let Z ∈ A, then f −1 (Z) ∈ B(T ), so by the remark
just above f −1 (Z) is analytic. Since f is Borel measurable and onto, its restriction to f −1 (Z) is also Borel measurable and onto, its image f (f −1 (Z)) = Z is
analytic, definition C.2.1.
C.3 Universal measurability of Analytic Sets.
In what follows we will show the important property of universal measurability of
analytic sets.
Theorem C.3.1. Let (S, d) be a Polish space. Then every analytic subset of S is
universally measurable.
Proof. Let A be an analytic subset of S and P a probability measure on the Borel
σ –algebra of S. One way to prove the assertion of the theorem is to construct a
Borel set, whose difference with A has measure zero for the completion of P . By
P ∗ denote the outer measure of P .
By the definition of analytic subset, definition/theorem C.2.1 we have got a
function f from N∞ into S which is continuous and whose range is A. Let ε > 0; for k, M ∈ N let

N (k, M ) := {{nj }j≥1 ∈ N∞ : nk ≤ M }.
Then clearly one has f [N (1, M )] ↑ f [N∞ ] as M → ∞. Because outer measures are continuous from below, lemma A.2.23, there is an M1 such that

P ∗ (f [N (1, M1 )]) ≥ P ∗ (A) − ε/2.

Next we note that N (1, M1 ) ∩ N (2, M ) ↑ N (1, M1 ) as M → ∞. Again by continuity from below for outer measures, for some M2 ,

P ∗ (f [N (1, M1 ) ∩ N (2, M2 )]) ≥ P ∗ (A) − ε/2 − ε/4.
Recursively define Mk such that

P ∗ (f [∩_{j=1}^{k} N (j, Mj )]) ≥ P ∗ (A) − Σ_{j=1}^{k} ε 2^{-j} .
Every N (j, Mj ) is closed in N∞ for the product topology. So

Fk := ∩_{j=1}^{k} N (j, Mj )

remains closed and Fk ↓ C, where C := ∩_{j≥1} N (j, Mj ). By Tychonoff's theorem, A.1.16, C, which equals Π_{j≥1} {1, · · · , Mj }, is seen to be compact. We would
like to find a Borel set that does not differ too much from A in probability; the above calculations indicate that a possible candidate could be ∩k f [Fk ]. But this last set could possibly fail to be Borel, and could also differ too much from A. Taking the closures cl f [Fk ] resolves the first problem. Now we will see that cl f [Fk ] ↓ f [C]. We will use lemma A.1.12.
Let U be an open set of N∞ such that C ⊂ U . Consider the usual base of the product topology, A.1.6,

B := { ∩_{j∈J} πj^{-1}(Uj ) : J ⊂ N finite, Uj open subset of N }.

Every open set is a union of such basis sets, so write U = ∪_{k∈K} Uk with Uk ∈ B and K an arbitrary index set (actually, because N∞ is separable, we could take K to be a subset of N). Since C is compact in N∞ there is a finite L ⊂ K such that C is covered by ∪_{l∈L} Ul . Because the Ul are sets from the basis, there is an index Nl such that from that index on the projection πn (Ul ) = N for all n ≥ Nl . So if N := max_{l∈L} Nl , then for all n ≥ N : πn (Ul ) = N, l ∈ L. Noting that the projections of Fn on the coordinates n + k, k ∈ N0 , are all of N, and that C ⊂ ∪_{l∈L} Ul , one obtains Fn ⊂ ∪_{l∈L} Ul for all n ≥ N .
Lemma A.1.12 applies and, as we hoped, we obtain that cl f [Fk ] ↓ f [C] as k → ∞. Therefore

P (f [C]) = P (∩k cl f [Fk ]) = lim_{k→∞} P (cl f [Fk ]) ≥ P ∗ (A) − ε,
and we have found a Borel set, f [C], that is close to A. Repeating the argument for ε = 1/n, n = 1, 2, · · · , one obtains Borel sets Bn ⊂ A with

P (Bn ) ≥ P ∗ (A) − 1/n.

This finishes the proof: the Borel set B := ∪n Bn satisfies B ⊂ A, hence P (B) ≤ P ∗ (A), while P (B) ≥ P (Bn ) ≥ P ∗ (A) − 1/n for every n, so P (B) = P ∗ (A) and A is measurable for the completion of P .
This result actually holds for σ–finite measures. Recall that a measure is called σ–finite if there exists a sequence of measurable sets Ωn with finite measure such that ∪n Ωn = Ω. Every σ–finite measure µ is then equivalent (i.e. they dominate each other, so they have the same null sets) to a probability measure of the form

P (·) = Σ_{j≥1} ( µ(· ∩ Ωj ) / µ(Ωj ) ) 2^{-j} .

That P is dominated by µ is easily seen. For the converse, if A is such that P (A) = 0, then µ(A ∩ Ωj ) = 0 for every j, but because ∪_{j≥1} Ωj = Ω one has

µ(A) ≤ Σ_{j≥1} µ(A ∩ Ωj ) = 0.

So P also dominates µ.
C.4 A selection theorem for Analytic Sets.
Next we state a selection theorem for analytic sets.
Theorem C.4.1. Let R and S be Polish spaces and let A be an analytic subset
of the product space R × S. Let C := πS (A) = {s ∈ S : (r, s) ∈ A for
some r ∈ R}. Then there is a function g from the analytic set C into R such that
(g(s), s) ∈ A for all s ∈ C and such that g is measurable from the σ–algebra of
universally measurable sets of S to the Borel sets of R.
Proof. A nice property of analytic sets is that they are preserved under projections, since these are continuous maps, so C is analytic (definition/theorem C.2.1).
There is some link with the usual Axiom of Choice, which is equivalent to the
well ordering principle. Here the proof will go along another, weaker, form of this
well ordering principle.
We define an ordering on N∞ , the lexicographical one: {mj }j≥1 < {nj }j≥1
iff for some i ∈ N : mj = nj for all j < i and mi < ni . Next we claim that
every non–empty closed subset F of N∞ has a smallest element, i.e. there is an
x ∈ F : x ≤ y for all y ∈ F . Indeed let m1 be the smallest positive integer that
appears in the first coordinate of elements of F . This can be done because N is
well ordered. From all elements of F which have m1 as first coordinate choose
m2 the smallest positive integer that appears as second coordinate. Continuing by
induction we obtain {mk }k≥1 , by construction it is a lower bound of F (one never
has {nk }k≥1 < {mk }k≥1 for any {nk }k≥1 ∈ F ). Moreover because F is closed in
N∞ , by picking an element from each set Fl := {{nk }k≥1 ∈ F : mj = nj for
j = 1, · · · , l} we have a sequence of elements in F that converges to {mk }k≥1 , so
{mk }k≥1 ∈ F .
Continuing with the proof, let f be a continuous function from N∞ onto A (definition/theorem C.2.1) and let γ := πS ◦ f . Then γ is a continuous function from N∞ onto C. For s ∈ C, γ^{-1}({s}) is closed in N∞ . By the above consideration there exists a smallest element in γ^{-1}({s}). Let h : C → N∞ : s 7→ the smallest member of γ^{-1}({s}), and let g := πR ◦ f ◦ h. If h is universally measurable, it will
follow that g will be too.
First note that if we set

A(s) := {r ∈ R : (r, s) ∈ A},

the section of A along s ∈ S, then

γ^{-1}(C) = f^{-1}(πS^{-1}(C)) = f^{-1}(A ∩ (R × C)) = f^{-1}( ∪_{s∈C} A(s) × {s} ).

So y = h(s) ∈ γ^{-1}({s}) = f^{-1}(A(s) × {s}), and f (h(s)) ∈ A(s) × {s}.
Now let

Cn := {y ∈ C : there is some m = {mj }j≥1 ∈ N∞ with γ(m) = y and m1 = n} = γ[π_{1,N∞}^{-1}({n})].

Then Cn is an analytic set. In order to see that γ[π_{1,N∞}^{-1}({n})] is analytic, we note that N∞ is a Polish space and π_{1,N∞}^{-1}({n}) is an open set in N∞ , hence Borel. Finally γ is a continuous surjection, and according to definition/theorem C.2.1 continuous images of Borel sets of Polish spaces are analytic sets.
Let h1 : C → N : y 7→ h1 (y) = n, where n is the least positive integer such that y ∈ Cn . This h1 is a measurable function from the σ–algebra of universally measurable sets of C to the Borel σ–algebra on N, because h1^{-1}({n}) = Cn \ ∪_{i=1}^{n−1} Ci . (Recall also that analytic sets are universally measurable.) We continue recursively: if we have h1 , · · · , hk , then hk+1 : C → N is defined by letting hk+1 (y) be the least n such that y = γ(m) for some m ∈ N∞ with mj = hj (y), j = 1, · · · , k, and mk+1 = n.
Our purpose is to show that h is a measurable function, because its codomain is
N∞ we can write h as a sequence hj as above. The space N∞ is equipped with
the Borel σ–algebra generated by the product topology. Here the Borel σ–algebra
equals the product σ–algebra, so in order to prove measurability of h it is sufficient
to prove it for each component function hj .
Before proving measurability of an arbitrary hk we first consider the case k = 2
as it will show how to easily generalize to an arbitrary k ∈ N. So start with
considering k = 2. Now h2 (y) = n iff y = γ(m) for some m in

( π2^{-1}({n}) ∩ π1^{-1}({h1 (y)}) ) \ ∪_{i=1}^{n−1} ( π2^{-1}({i}) ∩ π1^{-1}({h1 (y)}) ),

meaning that h2 (y) = n = m2 , h1 (y) = m1 , and for any other m̃ ∈ N∞ with γ(m̃) = y and m̃1 = h1 (y) one has m̃2 ≥ n. Then y ∈ h2^{-1}({n}) iff

y ∈ ∪_{k∈K} γ[ ( π2^{-1}({n}) ∩ π1^{-1}({k}) ) \ ∪_{i=1}^{n−1} ( π2^{-1}({i}) ∩ π1^{-1}({k}) ) ],
because h1 (y) = k for some (unique) k ∈ N and h1 (C) = K ⊂ N. Actually
for h2 (y) one is looking for the sequence of positive integers in the set π1^{-1}({h1 (y)}) ∩ γ^{-1}({y}) which has the smallest second coordinate. The set h2^{-1}({n}) above is universally measurable, because each set on which γ acts is a Borel set; see definition/theorem C.2.1 and theorem C.3.1.
Now we can move on to the general case. So assume h1 , · · · , hk are measurable
for the Borel σ–algebra on N and the σ–algebra of u.m. sets on C. Then let
hk+1 be as defined above. Denote the range of hj as Kj , j = 1, 2, · · · , k. Then
y ∈ h_{k+1}^{-1}({n}) iff

y ∈ ∪_{l1 ∈K1} · · · ∪_{lk ∈Kk} γ[ ( π_{k+1}^{-1}({n}) ∩ ∩_{j=1}^{k} πj^{-1}({lj }) ) \ ∪_{i=1}^{n−1} ( π_{k+1}^{-1}({i}) ∩ ∩_{j=1}^{k} πj^{-1}({lj }) ) ].
Again all operations are countable operations on open sets so what is inside the
brackets is a Borel set of N∞ , which is Polish. By definition/theorem C.2.1 about
analytic sets, the image of γ of that Borel set is analytic, thus universally measurable by theorem C.3.1. As said before, h = {hj }j≥1 by definition and thus h is
measurable.
Appendix D
Entropy and useful inequalities.
D.1 Entropy.
We often modified our class of functions F to obtain a more tractable class. It
is important to prove that those modified classes still enjoy the properties of the
original one, such as a finite entropy number. This will be proven first, after the
definition of entropy is recalled.
Definition D.1.1. For F a class of measurable functions on (X , A) let

F_F (x) := sup{|f (x)| : f ∈ F} = ||δx ||_F ,

where δx : F → R is the linear (evaluation) functional δx (f ) := f (x).
A measurable function F ≥ F_F is called an envelope function for F. If F_F is A–measurable, then it is said to be the envelope function. Given a law P on (X , A) we call F_F^* , the essential infimum of the measurable functions ≥ F_F , the envelope function of F for P .
Let Γ be the set of all laws on X of the form n^{-1} Σ_{j=1}^{n} δ_{xj} for some xj ∈ X and j = 1, · · · , n, n ∈ N0 . For ε > 0, 0 < p < ∞ and γ ∈ Γ, and for an envelope function F of F, let

D_F^p (ε, F, γ) := sup{ m : there are f1 , · · · , fm ∈ F such that for all i ≠ j, ∫ |fi − fj |^p dγ > ε^p ∫ F^p dγ }.

Furthermore let D_F^{(p)}(ε, F) := sup_{γ∈Γ} D_F^p (ε, F, γ).
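To make the packing numbers D_F^p(ε, F, γ) concrete, here is a hedged numerical sketch in Python (the class of indicators of intervals [0, t], the empirical measure γ and the greedy strategy are assumptions of this illustration; a greedy packing only yields a lower bound for the supremum in the definition).

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(size=30)                    # support points of gamma = (1/30) * sum_j delta_{x_j}
grid = np.linspace(0.0, 1.0, 201)
f_class = [(lambda s: (lambda u: (u <= s).astype(float)))(s) for s in grid]   # f_s = 1_{[0, s]}
envelope = lambda u: np.ones_like(u)        # envelope F = 1

def greedy_packing(eps, p=2):
    """Greedily collect functions whose mutual L^p(gamma) p-th power distances exceed eps^p * gamma(F^p)."""
    threshold = eps ** p * np.mean(envelope(x) ** p)
    kept = []
    for f in f_class:
        vals = f(x)
        if all(np.mean(np.abs(vals - g) ** p) > threshold for g in kept):
            kept.append(vals)
    return len(kept)                         # a lower bound for D_F^p(eps, F, gamma)

for eps in (0.5, 0.3, 0.1):
    print(eps, greedy_packing(eps))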
Lemma D.1.1. Let (X , A, P ) be a probability space, F ∈ L^p (X , A, P ), and F ⊂ L^1 (X , A, P ), where F is the envelope function of F and p ∈ [1, +∞[. If

D_F^{(p)}(δ, F) < ∞ for all δ > 0,

then also, for F_M := {f I_{{F ≤M}} : f ∈ F} with envelope F̃ := F I_{{F ≤M}},

D_{F̃}^{(p)}(ε, F_M ) < +∞ for all ε > 0.
Proof. Indeed, let

γ := n^{-1} Σ_{l=1}^{n} δ_{xl} ∈ Γ

with xl ∈ X , 1 ≤ l ≤ n, and let g1 , · · · , gm ∈ F I_{{F ≤M}} be such that for all i ≠ j

∫ |gi − gj |^p dγ > ε^p γ(F̃^p ).

Now γ(F̃^p ) = n^{-1} Σ_{l∈A_M} (F (xl ))^p , where A_M := {k : F (xk ) ≤ M }. Let γ̃ := card(A_M )^{-1} Σ_{l∈A_M} δ_{xl} and write gi = fi I_{{F ≤M}} . Then

∫ |(fi − fj ) I_{{F ≤M}} |^p dγ > ε^p γ(F̃^p ),

that is,

(|A_M |/n) ∫ |fi − fj |^p dγ̃ > ε^p (|A_M |/n) γ̃(F^p ).

So m ≤ D_F^{(p)}(ε, F, γ̃) ≤ D_F^{(p)}(ε, F).
In the next series of lemmas we will prove that certain classes G derived from F, which is Suslin image admissible, are still Suslin image admissible.
Lemma D.1.2. Let

{Xl : (X ∞ , A∞ , P ∞ ) → (X , A, P ) : {xn }n≥1 7→ xl }l≥1

be the standard model, with (X , A, P ) a probability space. Let F ⊂ L2 (X , A, P ) and let F ∈ L2 (X , A, P ) be an envelope function for F. If F is Suslin image admissible through some (Y, S, T ) and D_F^{(2)}(ξ, F) < +∞ for all ξ > 0, then, for any δ > 0,

F_{j,δ} := {f − fj : f ∈ F, ∫ (f − fj )² dP < δ² },

with fj ∈ L2 (X , A, P ), is also Suslin image admissible and has finite entropy.
Proof. fj + F_{j,δ} ⊂ F ∩ B_{L2}(fj , δ), thus fj + F_{j,δ} is relatively d_{2,P}–open and, according to theorem 4.1.4,

Z := {y ∈ Y : T (y) ∈ fj + F_{j,δ} } = T^{-1}(fj + F_{j,δ} ) ∈ S.

(Y, S) is a Suslin space, so by corollary C.2.2, Z with the relative σ–algebra is also Suslin. Moreover (Z, S u Z) is also a separable measurable space, since (Y, S) is. Hence fj + F_{j,δ} is Suslin image admissible, and because fj is measurable, F_{j,δ} is Suslin image admissible.
Finally D_F^{(2)}(ξ, F_{j,δ}) < +∞: let m ≤ D_F^{(2)}(ξ, F_{j,δ}, γ), γ ∈ Γ, and hi ∈ F_{j,δ} , 1 ≤ i ≤ m, be such that ∫ (hk − hl )² dγ > ξ² γ(F²) for k ≠ l. Then hi + fj ∈ F and

∫ ((hk + fj ) − (hl + fj ))² dγ = ∫ (hk − hl )² dγ > ξ² γ(F²),

so the hi + fj , 1 ≤ i ≤ m, are ξ–separated in F, whence m ≤ D_F^{(2)}(ξ, F, γ) ≤ D_F^{(2)}(ξ, F).
A lemma needed in chapter 6 and in theorems 5.3.2 and 5.4.2.
Lemma D.1.3. Let (X , A) be a separable measurable space and F a class of real–valued measurable functions on X . If the Xi denote the coordinates on (X ∞ , A∞ , P ∞ ), then for any fixed x = (x1 , · · · , x_{2n} )

G := (F × F) ∩ {(f, g) : ∫ (f − g)² dP_{2n}(x) < δ² }

is Suslin image admissible and

|ν_n^0 (f − g)|, for (f, g) ∈ G,

is measurable.
Proof. It is enough to have that
sup{ |ν_n^0 (f − g)| : f, g ∈ F, ∫ (f − g)² dP_{2n} < δ² }

is measurable.
• The product F × F is certainly Suslin image admissible via
(Y 2 , S ⊗ S, (T, T )), as in the proof of theorem 4.2.3. Now we show that
A := { (y, z) : ∫ (T (y) − T (z))² dP_{2n}(x) < δ² }

is measurable. Then G will be image admissible Suslin via (Y ², S ⊗ S, (T, T )I_A ).
• Measurability of A. First of all consider the function

h1 : X^{2n} × Y × Y → R^{4n} : (x, y, z) 7→ (T (y)(x1 ), T (z)(x1 ), · · · , T (y)(x_{2n} ), T (z)(x_{2n} )),

which is measurable because the coordinate functions, i.e.

(x, y, z) 7→ (π1 (x), y) 7→ T (y)(π1 (x)),

are measurable and R^{4n} is separable, implying B(R^{4n} ) = ⊗_{i=1}^{4n} B(R), so that measurability of the coordinate functions is sufficient for measurability of h1 . The function

h2 : R^{4n} → R : r 7→ (2n)^{-1} Σ_{l=1}^{2n} (π_{2l−1}(r) − π_{2l}(r))²

is also measurable, since it is continuous. The composition h := h2 ◦ h1 equals by definition ∫ (T (y) − T (z))² dP_{2n}(x).
• Finally we come to the measurability of ν_n^0 (f − g) for f, g ∈ F with ∫ (f − g)² dP_{2n} < δ² . For the functions h, h1 and h2 as in the previous step, ν_n^0 (f − g) is measurable as the product of (x, y, z) 7→ I_{h^{-1}([0,δ²[)}(x, y, z) and

n^{-1/2} ( Σ_{i=1}^{n} (T (y) − T (z))(X_{σ(i)} ) − Σ_{i=1}^{n} (T (y) − T (z))(X_{τ(i)} ) ).
Then, as in corollary 4.2.2,

sup{ |ν_n^0 (f − g)| : f, g ∈ F, ∫ (f − g)² dP_{2n} < δ² }

is universally measurable. Indeed, noting that (X^{2n} , A^{2n} ) is a separable measurable space (theorem A.2.11, together with (X , A) being a separable measurable space), we look at the jointly measurable map

Y² × X^{2n} → R : (y1 , y2 , x) 7→ I_{A(x)}(y1 , y2 )(T (y1 )(x) − T (y2 )(x))

and we consider

Ψ : (F × F) × X^{2n} → R : ((f1 , f2 ), x) 7→ I_B (f1 , f2 )(f1 (x) − f2 (x)),

which is jointly measurable, where B := (T, T )(A). This is because, by Aumann's theorem (4.1.3), admissibility is equivalent to image admissibility. Hence
sup{Ψ(f1 , f2 , x) : (f1 , f2 ) ∈ F × F} is a universally measurable function.
So ||ν_n^0 (f − g)||_{F×F} is universally measurable as the composition of a universally measurable and a measurable function:

||Ψ||_{F×F} ◦ (X_{σ1} , · · · , X_{σn} , X_{τ1} , · · · , X_{τn} ).
D.2 Exponential inequalities.
We first prove an auxiliary theorem that forms the starting point for a whole family of exponential inequalities.
Theorem D.2.1. Let X be any real random variable and t ∈ R. Then

Pr{X ≥ t} ≤ inf_{u≥0} e^{-tu} E[e^{uX}].

Proof. Fix u ≥ 0; then I_{{X≥t}} ≤ e^{u(X−t)}. Integrating gives Pr{X ≥ t} ≤ e^{-tu} E[e^{uX}]. Taking the infimum over u ≥ 0 finishes the proof.
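A minimal numerical sketch of this bound in Python (the standard normal distribution, the grid of u values and the sample size are assumptions of the example): the infimum over u of e^{-tu} E[e^{uX}] is approximated from simulated data and compared with the simulated tail probability.

import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(200_000)            # X standard normal
t = 2.0
u = np.linspace(0.0, 4.0, 81)

mgf = np.array([np.mean(np.exp(ui * x)) for ui in u])   # Monte Carlo estimate of E[e^{uX}]
bound = np.min(np.exp(-t * u) * mgf)                    # approximates inf_u e^{-tu} E[e^{uX}]
print("P(X >= t) ~", np.mean(x >= t),
      " Chernoff bound ~", bound,
      " exact infimum =", np.exp(-t ** 2 / 2))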
The next inequality concerns one of the simplest random variables there are, namely Rademacher random variables, and is due to Hoeffding.
Definition D.2.1. A real random variable X that takes a.s. the values ±1 with equal probability, i.e. P ({X = 1}) = 1/2 = P ({X = −1}), is called a Rademacher random variable.
Proposition D.2.2 (Hoeffding). Let s1 , · · · , sn be independent Rademacher random variables. For any t ≥ 0 and aj ∈ R:

Pr{ Σ_{j=1}^{n} aj sj ≥ t } ≤ exp( − t² / (2 Σ_{j=1}^{n} aj²) ).
Proof. Note that

(2n)! = (2n)(2n − 1) · · · (2n − (n − 1)) n! > 2^n n!

for n ≥ 2, and equality holds for n = 0, 1. So 1/(2n)! ≤ 2^{-n}/n! for n ≥ 0. Consider the function cosh(x) := (e^x + e^{-x})/2; then by the inequality above
cosh(x) ≤ exp(x²/2) for all x:

cosh(x) = (1/2)(e^x + e^{-x})
= (1/2)( Σ_{n≥0} x^n /n! + Σ_{n≥0} (−1)^n x^n /n! )
= Σ_{n≥0} (x²)^n /(2n)!
≤ Σ_{n≥0} (x²/2)^n /n! = exp(x²/2).
Applying theorem D.2.1 gives

Pr{ Σ_{j=1}^{n} aj sj ≥ t } ≤ inf_{u≥0} e^{-tu} E[ exp( u Σ_{j=1}^{n} aj sj ) ]
= inf_{u≥0} e^{-tu} E[ Π_{j=1}^{n} exp(u aj sj ) ]
= inf_{u≥0} e^{-tu} Π_{j=1}^{n} E[exp(u aj sj )]
= inf_{u≥0} e^{-tu} Π_{j=1}^{n} (1/2)(exp(u aj ) + exp(−u aj ))
≤ inf_{u≥0} e^{-tu} Π_{j=1}^{n} exp((u aj )²/2)
= inf_{u≥0} exp( −tu + Σ_{j=1}^{n} (u aj )²/2 ),
where in the third step we used the independence of the sj ’s and in the fourth their
special form to calculate the expectation. Finally, we only have to calculate the
infimum over u ≥ 0 using ordinary calculus. The derivative with respect to u is given by

exp( −tu + Σ_{j=1}^{n} (u aj )²/2 ) ( −t + u Σ_{j=1}^{n} aj² ),

which switches sign from minus to plus at the point u = t / Σ_{j=1}^{n} aj², which is thus the minimum. Plugging it in one obtains

exp( −t²/ Σ_{j=1}^{n} aj² + (t²/ (Σ_{j=1}^{n} aj²)² ) Σ_{j=1}^{n} aj²/2 ) = exp( −t² / (2 Σ_{j=1}^{n} aj²) ).
Remark. Hoeffding’s inequality is not only useful for Rademacher random variables with fixed coefficients, but one can also consider coefficients which are independent random variables (and independent from the si too).
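As a quick sanity check of Hoeffding's bound, the following Python sketch (the coefficients a_j, the threshold t and the number of repetitions are assumptions of the example) compares the simulated tail probability of Σ a_j s_j with exp(−t²/(2 Σ a_j²)).

import numpy as np

rng = np.random.default_rng(4)
a = np.array([1.0, 2.0, 0.5, 3.0, 1.5])            # fixed coefficients a_j
t = 6.0
reps = 200_000

s = rng.choice([-1.0, 1.0], size=(reps, a.size))   # independent Rademacher variables
tail = np.mean(s @ a >= t)
bound = np.exp(-t ** 2 / (2.0 * np.sum(a ** 2)))
print("simulated P(sum a_j s_j >= t) =", tail, " Hoeffding bound =", bound)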
Bibliography
[Bill1] P. Billingsley, Probability and Measure, John Wiley And Sons, New York,
2012 (ISBN: 978-1-1181-2237-2)
[Bill2] P. Billingsley, Convergence of Probability Measures, John Wiley And
Sons, New York, 1999 (ISBN: 978-0-471-19745-4)
[Cohn] Donald L. Cohn, Measure Theory, Birkhäuser, Quinn-Woodbine, 1980
[Dud1] Richard M. Dudley, Real Analysis and Probability, Cambridge University Press, Cambridge, 2002 (ISBN: 0-521-80972-X)
[Dud2] Richard M. Dudley, Uniform Central Limit Theorems, Cambridge University Press, Cambridge, 2008 (ISBN: 972-0-521-05221-4)
[FreitagAndBusam] E. Freitag and R. Busam, Complex Analysis., Springer Verlag, Berlin Heidelberg, 2009 (ISBN: 978-3-540-93982-5)
[HewAndStrom] E. Hewitt and K. Stromberg, Real and Abstract Analysis,
Springer Verlag, New York, 1965 (ISBN: 3-540-90138-8)
[Jech] T. Jech, Set Theory, Academic Press, London, 1978 (ISBN: 0-12-3819504)
[Pol] D. Pollard, Convergence of Stochastic Processes., Springer Verlag, New
York, 1984 (ISBN: 0-387-90990-7)
[vdVaartAndWell] Aad W. van der Vaart and Jon A. Wellner, Weak Convergence
and Empirical Processes., Springer Verlag, Berlin Heidelberg, 2000 (ISBN:
0-387-94640-3)
[Will] S. Willard, General Topology, Dover Publications, New York, 2004
(ISBN: 978-0-486-43479-7)