Notes on weak convergence and related topics
Shota Gugushvili
Mathematical Institute, Faculty of Science, Leiden University,
P.O. Box 9512, 2300 RA Leiden, The Netherlands
E-mail address: [email protected]
2010 Mathematics Subject Classification. 60-01
Key words and phrases. Central limit theorem, sequential compactness, tightness,
weak convergence, weak law of large numbers
Abstract. These notes deal with weak convergence of probability measures
on the real line. They are largely based on the lecture notes written by Peter
Spreij to accompany the Measure Theoretic Probability course.
Contents

Preface

Chapter 1. Weak convergence
1.1. Generalities
1.2. Criteria for weak convergence
1.3. Convergence of distribution functions
1.4. Sequential compactness
1.5. Continuous mapping theorem
1.6. Almost sure representation theorem
1.7. Relation to other modes of convergence
1.8. Slutsky's lemma
Exercises

Chapter 2. Characteristic functions
2.1. Definition and first properties
2.2. Inversion formula and uniqueness
2.3. Necessary conditions
2.4. Multidimensional case
Exercises

Chapter 3. Limit theorems
3.1. Characteristic functions and weak convergence
3.2. Weak law of large numbers
3.3. Probabilities of large deviations
3.4. Central limit theorem
3.5. Delta method
3.6. Berry-Esseen theorem
Exercises

Bibliography
Preface
These notes deal with weak convergence of probability measures on the real line
and related topics. They are to a large extent based on the lecture notes written by
Peter Spreij to accompany the Measure Theoretic Probability course. Other sources
we used are listed in the bibliography.
Shota Gugushvili
CHAPTER 1
Weak convergence
1.1. Generalities
We start with the definition of weak convergence of probability measures on
(R, B), and that of a sequence of random variables.
Definition 1. Let µ, µ1, µ2, . . . be probability measures on (R, B). It is said that µn converges weakly to µ, and we then write µn →w µ, if µn(f) → µ(f) for all f ∈ Cb(R). If X, X1, X2, . . . are random variables (possibly defined on different probability spaces) with distributions µ, µ1, µ2, . . . , then we say that Xn converges weakly to X, and write Xn ⇝ X, if it holds that µn →w µ.
Another accepted notation for weak convergence of a sequence of random variables is Xn →d X, and one says that Xn converges to X in distribution.
Consider the following example, which illustrates in a special case that Definition 1 is a reasonable one.
Example 2. Let {xn} be a convergent sequence of real numbers with limn→∞ xn = 0. Then for every f ∈ Cb(R) one has f(xn) → f(0). Let µn be the Dirac measure concentrated at xn and µ the Dirac measure concentrated at the origin. Since µn(f) = f(xn), we see that µn →w µ.
As a further result, here is a statement that gives an appealing sufficient condition for weak convergence of a sequence of random variables, when the random
variables involved admit densities.
Theorem 3. Consider distributions µ, µ1, µ2, . . . having densities f, f1, f2, . . . w.r.t. Lebesgue measure λ on (R, B). Suppose that fn → f λ-a.e. Then µn →w µ.

Proof. We apply Scheffé's lemma to conclude that fn → f in L1(R, B, λ). Let g ∈ Cb(R). Since g is bounded, we also have fn g → f g in L1(R, B, λ), and hence µn(g) = ∫_R g fn dλ → ∫_R g f dλ = µ(g), i.e. µn →w µ.
One could naively think of another definition of convergence of probability
measures, for instance by requiring that µn (B) → µ(B) for every B ∈ B, which
is the same as to require that the class of test functions f consists of indicators of
Borel sets, or even by requiring that the integrals µn (f ) converge to µ(f ) for every
bounded measurable function. It turns out that each of these requirements is too
strong to get a useful convergence concept. One drawback of such a definition is
demonstrated by the following example with Dirac measures.
Example 4. Assume the same setup as in Example 2 and take for concreteness
xn = 1/n. Let B = (−∞, x] for some x > 0. Then for all xn < x, we have
µn (B) = 1B (xn ) = 1 and µ(B) = 1B (0) = 1, so that µn (B) → µ(B). For x < 0 we
get that µn (B) = µ(B) = 0, and thus µn (B) → µ(B). But for B = (−∞, 0] we have
µn (B) = 0 for all n, whereas µ(B) = 1. Hence convergence of µn (B) → µ(B) does
not hold for this last choice of B, although it is quite natural in this case to say
that µn →w µ. For future reference note the following: if Fn is the distribution
function of µn and F that of µ, then we have seen that Fn (x) → F (x), for all x ∈ R,
except for x = 0.
1.2. Criteria for weak convergence
In this section we give several criteria for weak convergence. These are primarily
useful in the proofs.
Theorem 5. The following are equivalent:
(i) µn →w µ;
(ii) every subsequence {µnj} of {µn} has a further subsequence {µnjk}, such that µnjk →w µ as k → ∞.
Proof. That (i) implies (ii) is obvious. We prove the reverse implication. Assume the convergence µn →w µ fails. This means there exists a bounded continuous function f, a subsequence {nj} of {n} and a constant ε > 0, such that
|µnj (f ) − µ(f )| ≥ ε
for all j. But then
|µnjk (f ) − µ(f )| ≥ ε
for any subsequence {njk } of {nj } as well. Hence {µnj } has no further subsequence
converging weakly to µ, a contradiction.
Recall that the boundary ∂E of a set E ∈ B is ∂E = Ē \ E°, where Ē is the closure and E° is the interior of E. The distance from a point x to a set E is
d(x, E) = inf{|x − y| : y ∈ E}.
The δ-neighbourhood of E (here δ > 0) is the set E^δ = {x : d(x, E) < δ}.
The following result is known as the portmanteau lemma.
Theorem 6 (Portmanteau lemma). Let µ, µ1, µ2, . . . be probability measures on (R, B). The following statements are equivalent.
(i) µn →w µ.
(ii) lim sup_{n→∞} µn(F) ≤ µ(F) for all closed sets F.
(iii) lim inf_{n→∞} µn(G) ≥ µ(G) for all open sets G.
(iv) lim_{n→∞} µn(E) = µ(E) for all sets E with µ(∂E) = 0 (all µ-continuity sets).
Proof. We start with (i)⇒(ii). Given ε > 0, choose m ∈ N, such that for δ = 1/m > 0, µ(F^δ) < µ(F) + ε. This is possible, because F is closed and hence F^δ ↓ F as m → ∞. Let

ϕ(x) = 1 for x ≤ 0,   ϕ(x) = 1 − x for 0 < x < 1,   ϕ(x) = 0 for x ≥ 1,

and define

f(x) = ϕ(d(x, F)/δ).
Note that f is continuous, nonnegative, bounded by 1 on R, equals 1 on F and vanishes outside F^δ. Therefore,

µn(F) = ∫_F f dµn ≤ ∫_R f dµn,

and

∫_R f dµ = ∫_{F^δ} f dµ ≤ µ(F^δ).

We also have

lim_{n→∞} ∫_R f dµn = ∫_R f dµ.

Combining the above,

lim sup_{n→∞} µn(F) ≤ lim_{n→∞} ∫_R f dµn = ∫_R f dµ ≤ µ(F^δ) < µ(F) + ε.
Since ε is arbitrary, the implication follows.
(ii)⇔(iii) follows by a simple complementation argument.
(ii) and (iii) together imply (iv): writing Ē for the closure and E° for the interior of E, we have

µ(Ē) ≥ lim sup_{n→∞} µn(Ē) ≥ lim sup_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E°) ≥ µ(E°).

Because µ(∂E) = 0 implies that the extreme terms are equal, the inequalities are in fact equalities, and so lim_{n→∞} µn(E) = µ(E).
(iv)⇒(i). Let ε > 0, g ∈ Cb(R) and choose two constants C1, C2, such that C1 < g < C2. Let D = {x ∈ R : µ({g = x}) > 0}. So D is the set of atoms of the law of g under µ and hence it is at most countable (if not, µ would have infinite total mass).
Let C1 = x0 < . . . < xm = C2 be a finite set of points not in D, such that
max{xk − xk−1 : k = 1, . . . , m} < ε. Let Ik = (xk−1 , xk ]. The continuity of g
implies that if y is a boundary point of a set
{x : xk−1 < g(x) ≤ xk },
then g(y) is either xk−1 or xk . Hence the sets in the above display are µ-continuity
sets. We have
(1.1)   Σ_{k=1}^m x_{k−1} µn(x : x_{k−1} < g(x) ≤ x_k) ≤ ∫_R g dµn ≤ Σ_{k=1}^m x_k µn(x : x_{k−1} < g(x) ≤ x_k),

and likewise,

(1.2)   Σ_{k=1}^m x_{k−1} µ(x : x_{k−1} < g(x) ≤ x_k) ≤ ∫_R g dµ ≤ Σ_{k=1}^m x_k µ(x : x_{k−1} < g(x) ≤ x_k).

Now note that the extreme terms in (1.1) converge to the respective extreme terms in (1.2). The latter differ by at most ε. Hence both the limit superior and limit inferior of ∫_R g dµn are within distance ε of ∫_R g dµ. Since ε is arbitrary, the result follows. This finishes the proof of the theorem.
Part (iv) of the portmanteau lemma is quite illustrative for understanding the definition of weak convergence and the way in which it differs from the requirement
µn (B) → µ(B) for every set B in the case of another would-be definition of weak
convergence (cf. Section 1.1).
1.3. Convergence of distribution functions
In this section we give an appealing characterisation of weak convergence (convergence in distribution) in terms of distribution functions, which makes the definition of weak convergence look less abstract.
Definition 7. We shall say that a sequence of distribution functions {Fn} on R converges weakly to a limit distribution function F, and shall write Fn ⇝ F, if Fn(x) → F(x) for all x ∈ CF, where CF is the set of all those points at which F is continuous.
Theorem 8. Let µ, µ1, µ2, . . . be probability measures on the real line and denote by F, F1, F2, . . . the corresponding distribution functions. The following statements are equivalent:
(i) µn →w µ;
(ii) Fn ⇝ F.
Proof. Assume (i). If x is a continuity point of F, the set (−∞, x], the boundary of which is {x}, is a µ-continuity set. Hence
Fn (x) = µn ((−∞, x]) → µ((−∞, x]) = F (x)
by the portmanteau lemma and thus (ii) holds.
Conversely, let (ii) hold. Fix an arbitrary 0 < ε < 1 and pick two continuity points a and b of F in such a way that F(a) < ε and F(b) > 1 − ε. Next, given f ∈ Cb(R), choose continuity points xi of F, such that a = x0 < x1 < . . . < xk = b and |f(x) − f(xi)| < ε for xi−1 ≤ x ≤ xi (this is possible by the uniform continuity of f on [a, b]). Define
S = Σ_{i=1}^k f(xi)[F(xi) − F(xi−1)],   Sn = Σ_{i=1}^k f(xi)[Fn(xi) − Fn(xi−1)].

By assumption, Sn → S as n → ∞. Let M = sup_{x∈R} |f(x)|. We have

|∫_R f dµ − S| < (2M + 1)ε.

Likewise,

|∫_R f dµn − Sn| ≤ ε + M Fn(a) + M(1 − Fn(b)) → ε + M F(a) + M(1 − F(b)) < (2M + 1)ε.

As a result,

lim sup_{n→∞} |∫_R f dµn − ∫_R f dµ| ≤ lim sup_{n→∞} |∫_R f dµn − Sn| + |∫_R f dµ − S| + lim_{n→∞} |Sn − S| ≤ 2(2M + 1)ε.
Since ε is arbitrary, the limit superior on the left-hand side of the first inequality
is in fact zero and the result follows.
As shown in the next result, when the limit distribution function F is continuous
everywhere, i.e. CF = R, the convergence Fn (t) → F (t) is in fact uniform in t ∈ R.
Theorem 9. Suppose Fn ⇝ F and F is continuous. Then

lim_{n→∞} sup_{t∈R} |Fn(t) − F(t)| = 0.
Proof. Let k ∈ N be fixed. By continuity of F and the intermediate value
theorem, there exist points −∞ = x0 < x1 < . . . < xk = ∞, such that F (xi ) = i/k.
Therefore, for xi−1 ≤ x ≤ xi ,
Fn (x) − F (x) ≤ Fn (xi ) − F (xi−1 ) = Fn (xi ) − F (xi ) + 1/k,
Fn (x) − F (x) ≥ Fn (xi−1 ) − F (xi ) = Fn (xi−1 ) − F (xi−1 ) − 1/k.
Thus

|Fn(x) − F(x)| ≤ sup_{0≤i≤k} |Fn(xi) − F(xi)| + 1/k,   x ∈ R.
For any ε > 0, choose k so large that 1/k ≤ ε/2. Next note that with this k, by the convergence Fn(x) → F(x) at all x ∈ R, the supremum sup_{0≤i≤k} |Fn(xi) − F(xi)| can be made arbitrarily small, in particular smaller than ε/2, by taking n large enough. Conclude that sup_{x∈R} |Fn(x) − F(x)| ≤ ε for all n large enough. Since ε
is arbitrary, the result follows.
1.4. Sequential compactness
In the previous sections we studied several alternative characterisations of weak
convergence. In this section we will take a more abstract stance and study a condition guaranteeing that a sequence of probability measures has at least one weakly
convergent subsequence. We first introduce the notion of sequential compactness
of a sequence of probability measures.
Definition 10. A sequence of probability measures {µn } on (R, B) is called
sequentially compact, if every subsequence {µnk } of {µn } has a further weakly convergent subsequence.
A general answer to the question whether a sequence {µn } is sequentially compact or not is given by Prokhorov’s theorem. In its proof we need one auxiliary
result, known as Helly’s theorem.
The Bolzano-Weierstraß theorem states that every bounded sequence of real
numbers has a convergent subsequence. The theorem easily generalises to sequences
in Rd , but fails to hold for uniformly bounded sequences in general metric spaces.
But if extra properties are imposed, there can still be an affirmative answer. Something like that happens in Helly’s theorem. At this point it is convenient to introduce the notion of a defective distribution function. Such a function, F say,
has values in [0, 1], is right-continuous and increasing, but at least one of the two
properties limx→∞ F (x) = 1 and limx→−∞ F (x) = 0 fails to hold. The measure µ
corresponding to F on (R, B) will then be a subprobability measure, µ(R) < 1.
Theorem 11 (Helly’s theorem). Let {Fn } be a sequence of distribution functions. Then there exists a possibly defective distribution function F and a subsequence {Fnk }, such that Fnk (x) → F (x), for all x ∈ CF .
Proof. The main ingredient of the proof is an infinite repetition of the Bolzano-Weierstraß theorem combined with the Cantor diagonalisation. First we restrict
ourselves to working on Q instead of R, and exploit the countability of Q. Write
Q = {q1 , q2 , . . .} and consider restrictions of Fn to Q. Then the sequence {Fn (q1 )}
is bounded and along some subsequence {n1k} it has a limit, ℓ(q1) say. Look then at the sequence {Fn1k(q2)}. Again, along some subsequence of {n1k}, call it {n2k}, we have a limit, ℓ(q2) say. Note that along the thinned subsequence, we still have limk→∞ Fn2k(q1) = ℓ(q1). Continue like this to construct a nested sequence of subsequences {njk} for which we have that limk→∞ Fnjk(qi) = ℓ(qi) holds for every i ≤ j. Define a diagonal sequence {nk} by nk = nkk. For an arbitrary i, along this sequence one has limk→∞ Fnk(qi) = ℓ(qi). In this way we have constructed a function ℓ : Q → [0, 1], and by the monotonicity of Fn(t) in t this function is increasing.
In the next step we extend this function to a function F on R that is right-continuous, and still increasing. We put

F(x) = inf{ℓ(q) : q ∈ Q, q > x}.

Obviously, F is an increasing function. It is also right-continuous: let x ∈ R and ε > 0. There is q ∈ Q with q > x such that ℓ(q) < F(x) + ε. Pick y ∈ (x, q). Then F(y) ≤ ℓ(q) and we have F(y) − F(x) < ε, which shows that F is right-continuous.
However, limx→∞ F (x) = 1 or limx→−∞ F (x) = 0 do not necessarily hold true.
Thus F is a possibly defective distribution function.
We now show that Fnk(x) → F(x) if x ∈ CF. Fix x ∈ CF and let ε > 0. Pick q as above. By left-continuity of F at x, there is y < x such that F(x) < F(y) + ε. Take now r ∈ (y, x) ∩ Q. Then F(y) ≤ ℓ(r), and hence F(x) < ℓ(r) + ε. So we have the inequalities

ℓ(q) − ε < F(x) < ℓ(r) + ε.

Then

lim sup_{k→∞} Fnk(x) ≤ lim_{k→∞} Fnk(q) = ℓ(q) < F(x) + ε,
lim inf_{k→∞} Fnk(x) ≥ lim inf_{k→∞} Fnk(r) = ℓ(r) > F(x) − ε.

Since ε is arbitrary, the result follows.
Here is an example, for which the limit in Theorem 11 is not a true distribution
function.
Example 12. Let µn be the Dirac measure concentrated on {n}. Then its
distribution function is given by Fn (x) = 1[n,∞) (x) and hence limn→∞ Fn (x) = 0.
Hence the limit function F in Theorem 11 has to be the zero function, which
is clearly defective. One colloquially says that in the limit the probability mass
escapes to infinity.
Translated in terms of probability laws, Helly’s theorem says that every sequence of probability measures {µn } has a (weakly) convergent subsequence, but
that the limit law in general is a subprobability measure only. We are now interested in finding a condition that would guarantee that the limit is a bona fide
probability measure. A possible path is to require that all probability measures
involved have probability one on a fixed bounded set. That would prevent the
phenomenon described in Example 12. However, this assumption is too stringent, because it rules out many useful distributions. Fortunately, a considerably
weaker assumption suffices. For any probability measure µ on (R, B) it holds that
limM →∞ µ([−M, M ]) = 1. The next condition, tightness, gives a uniform version
of this.
Definition 13. A sequence of probability measures {µn } on (R, B) is called
tight, if limM →∞ inf n µn ([−M, M ]) = 1.
Remark 14. Note that a sequence {µn } is tight iff every tail sequence {µn }n≥N
is tight. In order to show that a sequence is tight it is thus sufficient to show
tightness from a certain suitably chosen index on.
Theorem 15 (Prokhorov’s theorem1). A sequence {µn } of probability measures
on (R, B) is tight if and only if it is sequentially compact.
Proof. Suppose {µn} is sequentially compact, but not tight. Then there exists ε > 0, such that for any M > 0 one has infn µn([−M, M]) < 1 − ε. It follows that for any j ∈ N and Ij = (−j, j), one can find an index nj, such that µnj(Ijᶜ) > ε. Extract
from the sequence {µnj } a weakly convergent subsequence {µnjk }, and denote its
weak limit by µ. By the portmanteau lemma, for every fixed j ∈ N,

lim sup_{k→∞} µnjk(Ijᶜ) ≤ µ(Ijᶜ).

Letting j → ∞, we see that the right-hand side converges to zero, while the left-hand side stays bounded by ε > 0 from below. This contradiction proves the first
implication.
We now prove the second implication. Let Fn be the distribution function of µn. By Helly's theorem, there exists a subsequence {Fnj} of the sequence of distribution functions {Fn}, such that Fnj ⇝ F as j → ∞, for some, possibly defective, distribution function F. We will show that in fact

(1.3)   lim_{x→∞} F(x) = 1,   lim_{x→−∞} F(x) = 0,

so that F is a proper distribution function. By tightness of {µn}, for any constant 0 < ε < 1 there exists a constant Mε > 0, such that Fn(Mε) > 1 − ε for all n ∈ N. Without loss of generality, Mε can be taken to be a continuity point of F. Then

F(Mε) = lim_{j→∞} Fnj(Mε) ≥ 1 − ε.

Since ε is arbitrary, the above display and monotonicity of F imply the first equality in (1.3). The second one can be proved in a similar manner. This completes the proof.
Theorem 15 has a simple corollary.

Corollary 16. If µn →w µ for some probability measure µ, then the sequence {µn} is tight.
We also remark that tightness of a sequence {µn } in general is not sufficient for
its weak convergence. Here is a simple counterexample: let µn = N (0, 1) for n odd
and µn = N (0, 2) for n even. Then {µn } is tight, but does not converge weakly.
1The name Prokhorov is alternatively spelled as Prohorov, but Prokhorov is the way it
appears in the English translation of the original paper containing (a much more general version
of) the theorem. See Prokhorov (1956).
1.5. Continuous mapping theorem
The continuous mapping theorem is a result asserting that if a sequence of
random variables {Xn } converges in a suitable sense to a random variable X, then
for a continuous function g the transformed sequence {g(Xn )} converges to g(X).
We will prove a slightly more general result, that allows g to be discontinuous on a
negligible set. Such a refinement does not require much additional technical effort,
while occasionally being useful.
Theorem 17 (Continuous mapping theorem). Let g : R → R be continuous at every point of a set C, such that P(X ∈ C) = 1.
(i) If Xn →a.s. X, then g(Xn) →a.s. g(X).
(ii) If Xn ⇝ X, then g(Xn) ⇝ g(X).
(iii) If Xn →P X, then g(Xn) →P g(X).
Proof. Part (i) is trivial.
We prove part (ii). Let F be an arbitrary closed set. We have {g(Xn) ∈ F} = {Xn ∈ g−1(F)}. Trivially, g−1(F) ⊂ cl(g−1(F)), the closure of g−1(F). Take an arbitrary x ∈ cl(g−1(F)). By definition, there exists a sequence {xm}, such that xm → x and g(xm) ∈ F. If x ∈ C, then g(xm) → g(x), and g(x) ∈ F, because F is closed. Otherwise x ∈ Cᶜ. Hence cl(g−1(F)) ⊂ g−1(F) ∪ Cᶜ. Then by the portmanteau lemma and the fact that P(X ∈ C) = 1,

lim sup_{n→∞} P(g(Xn) ∈ F) ≤ lim sup_{n→∞} P(Xn ∈ cl(g−1(F)))
≤ P(X ∈ cl(g−1(F)))
≤ P(X ∈ g−1(F)) + P(X ∈ Cᶜ)
= P(g(X) ∈ F).

By another application of the portmanteau lemma we conclude that g(Xn) ⇝ g(X).
We move to part (iii). Assume that g(Xn) →P g(X) fails. Then there exist ε > 0, δ > 0 and a subsequence {nj} of {n}, such that

(1.4)   P(|g(Xnj) − g(X)| > ε) > δ.

Extract from {nj} a further subsequence {njk}, such that Xnjk →a.s. X. By part (i), g(Xnjk) →a.s. g(X). But this contradicts (1.4). The proof of the theorem is completed.
It would be more appropriate, albeit clumsier, to call Theorem 17 the almost
surely continuous mapping theorem.
Example 18. Here is a simple illustration of Theorem 17. Let Y1, . . . , Yn be an i.i.d. sample from the normal distribution with mean zero and unknown variance σ². By the strong law of large numbers,

σ̂n² = (1/n) Σ_{i=1}^n Yi² →a.s. σ²,

and hence σ̂n² is a reasonable estimator of σ². Since the function g(x) = √x is continuous, σ̂n is then a reasonable estimator of the standard deviation σ: we have σ̂n →a.s. σ.
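To see Example 18 numerically, here is a minimal simulation sketch; the true value σ = 2, the seed and the sample sizes are arbitrary illustrative choices.

    import numpy as np

    # Example 18 numerically: sigma_hat_n^2 = (1/n) * sum(Y_i^2) -> sigma^2 a.s.,
    # and hence sigma_hat_n -> sigma by the continuous mapping theorem.
    rng = np.random.default_rng(0)
    sigma = 2.0

    for n in (10, 100, 10_000, 1_000_000):
        y = rng.normal(loc=0.0, scale=sigma, size=n)
        sigma_hat = np.sqrt(np.mean(y ** 2))   # g(x) = sqrt(x) applied to sigma_hat_n^2
        print(n, sigma_hat)                    # should approach sigma = 2 as n grows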
[Figure 1. Distribution and quantile functions of the discrete uniform distribution on the integers 1, 2: panel (a) shows the distribution function, panel (b) the quantile function.]
1.6. Almost sure representation theorem
Suppose we want to prove some distributional property of a sequence {Xn} of random variables, knowing that Xn ⇝ X. In general this might be difficult, but it is perhaps easier if we knew that Xn →a.s. X. Unfortunately, the latter almost sure convergence might be difficult to establish, or perhaps is even false. However, the situation is not hopeless. The almost sure representation theorem, proved below, tells us that there exists a probability space (Ω̃, F̃, P̃) that supports random variables {X̃n}, X̃, such that for all n ∈ N, X̃n =d Xn, X̃ =d X, and X̃n →a.s. X̃. We then prove the distributional property we are interested in for the sequence X̃n on the space (Ω̃, F̃, P̃). The result automatically carries over to the original sequence {Xn}.
We will need a number of results on quantile functions, which are of independent
interest as well.
A distribution function in general is only non-decreasing, but not necessarily
strictly increasing. Therefore, it typically does not admit the inverse function in
the usual sense. Nevertheless, a kind of inverse, the quantile function, can still be
defined. The quantile function of a distribution function F is a generalised inverse F−1 : (0, 1) → R given by

F−1(p) = inf{x : F(x) ≥ p}.
For an illustration see Figure 1. The quantile function is left-continuous. Its range is
equal to the support of F (or rather to the support of the corresponding probability
measure µ; the support of a probability measure on R is defined as the set of all
those points x, such that any open neighbourhood Ux of x has strictly positive
measure: µ(Ux ) > 0. Intuitively, this is the smallest closed subset of R that receives
measure 1 under µ (although you might be wondering at this point, this latter
explanation is valid even for probability measures on separable metric spaces; see
e.g. Theorem 2.1 on pp. 27–28 in Parthasarathy (2005)). As one example, the
support of the standard normal distribution is the whole R) and therefore, F −1 is
often unbounded. An evident fact that the quantile function is monotone implies
that it might have at most a countable number of discontinuity points only. The
following lemma lists some other properties of F −1 . Of these we will only make
partial use of (i)–(iv).
[Figure 2. Distribution function (red line).]
Lemma 19. For every 0 < p < 1 and x ∈ R,
(i) F −1 (p) ≤ x if and only if p ≤ F (x);
(ii) F ◦ F −1 (p) ≥ p, with equality holding if and only if p is in the range of
F ; the equality can fail if and only if F is discontinuous at F −1 (p);
(iii) F− ◦ F −1 (p) ≤ p, with F− (x) = F (x−);
(iv) F −1 ◦ F (x) ≤ x; the equality fails if and only if x is in the interior or at
the right endpoint of a flat part of F ;
(v) F ◦ F −1 ◦ F = F ; F −1 ◦ F ◦ F −1 = F −1 ;
(vi) (F ◦ G)−1 = G−1 ◦ F −1 .
Proof. (i) through (iv) can be proved either directly, by appealing to the
definitions, or through a picture, such as the one given in Figure 2. To prove the
first equality in (v), note that by (ii), the monotonicity of F and (iv),
F (x) = p ≤ F ◦ F −1 (p) = F ◦ F −1 ◦ F (x) ≤ F (x).
The second equality in (v) follows from (ii), the monotonicity of F −1 and (iv) by
F −1 (q) ≤ F −1 ◦F ◦F −1 (q) ≤ F −1 (q). Finally, (vi) is a consequence of the definition
of (F ◦ G)−1 and (i).
As a consequence of (ii) and (iv), F ◦ F−1(p) ≡ p on (0, 1) and F−1 ◦ F(x) ≡ x on R if and only if F is continuous and strictly increasing. In that case F−1 is a proper inverse of F, as it should be.
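As a concrete illustration of these properties, consider the discrete uniform distribution on {1, 2} pictured in Figure 1; the following small computation is a sketch spelling out F and F−1 and showing how equality can fail in items (ii) and (iv) of Lemma 19.

    % Discrete uniform distribution on {1,2}: P(X = 1) = P(X = 2) = 1/2.
    F(x) = \begin{cases} 0, & x < 1,\\ 1/2, & 1 \le x < 2,\\ 1, & x \ge 2, \end{cases}
    \qquad
    F^{-1}(p) = \begin{cases} 1, & 0 < p \le 1/2,\\ 2, & 1/2 < p < 1. \end{cases}
    % Lemma 19 (ii): p = 1/4 is not in the range of F and
    %   F(F^{-1}(1/4)) = F(1) = 1/2 > 1/4, so the inequality is strict.
    % Lemma 19 (iv): x = 3/2 lies in the interior of a flat part of F and
    %   F^{-1}(F(3/2)) = F^{-1}(1/2) = 1 < 3/2, so equality fails.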
Corollary 20. Let F be an arbitrary distribution function and U a uniform
random variable on [0, 1]. Then F −1 (U ) ∼ F.
This follows from Lemma 19 (i). The transformation F −1 (U ) is called the
quantile transformation.
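The quantile transformation is also the standard recipe for simulating from a given distribution; a minimal sketch follows, with the exponential distribution, the seed and the sample size as arbitrary illustrative choices.

    import numpy as np

    # Quantile transformation (Corollary 20): if U ~ Uniform(0, 1), then F^{-1}(U) ~ F.
    # Illustration with F(x) = 1 - exp(-x) for x >= 0, whose quantile function is
    # F^{-1}(p) = -log(1 - p).
    rng = np.random.default_rng(1)
    u = rng.uniform(size=100_000)
    x = -np.log(1.0 - u)        # samples from the standard exponential distribution

    print(x.mean(), x.var())    # both should be close to 1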
Corollary 21. Let X ∼ F for a continuous distribution function F. Then
F (X) is uniformly distributed on [0, 1].
Again, this follows from Lemma 19 (i) and (ii) by
P(F (X) ≤ x) = P(F (X) < x) = 1 − P(F (X) ≥ x) = 1 − P(X ≥ F −1 (x))
= P(X < F −1 (x)) = P(X ≤ F −1 (x)) = F ◦ F −1 (x) = x,
where x ∈ (0, 1). The transformation F (X) for X ∼ F is called the probability
integral transformation.
Quantile functions are occasionally useful when studying weak convergence of
a sequence of random variables. In the next definition we introduce the notion of
the weak convergence of a sequence of quantile functions.
Definition 22. We shall say that a sequence of quantile functions Fn−1 converges weakly to a limit quantile function F−1, and denote this by Fn−1 ⇝ F−1, if Fn−1(t) → F−1(t) at every point 0 < t < 1 at which F−1 is continuous.

Both the terminology and the notation for the weak convergence of quantile functions are reminiscent of those for the weak convergence of distribution functions.
In fact, as shown in the next lemma, the two types of convergence are equivalent.
Lemma 23. For any sequence of distribution functions Fn, Fn ⇝ F if and only if Fn−1 ⇝ F−1.
Proof. Let U be a standard uniform random variable on some probability space, for instance on ([0, 1], B[0, 1], λ). Since F−1 has at most a countable number of discontinuity points and the distribution of U is absolutely continuous, Fn−1 ⇝ F−1 implies that Fn−1(U) →a.s. F−1(U). Therefore, Fn−1(U) ⇝ F−1(U). By Corollary 20, this is exactly the weak convergence Fn ⇝ F.
Now we prove the reverse implication. Let V be a standard normal random variable on some probability space, for instance on ([0, 1], B[0, 1], λ), on which it can be obtained through the quantile transformation Φ−1(U) for U a standard uniform random variable, see Corollary 20. Since the convergence Fn(t) → F(t) can fail only at a countable number of points t, and since the distribution of V is continuous, we have Fn(V) →a.s. F(V) (and of course Fn(V) ⇝ F(V)). By Lemma 19 (i),

Φ(Fn−1(t)) = P(V < Fn−1(t))
= 1 − P(V ≥ Fn−1(t))
= 1 − P(Fn(V) ≥ t)
= P(Fn(V) < t),
and similarly, P(F(V) < t) = Φ(F−1(t)). By the portmanteau lemma,

lim inf_{n→∞} P(Fn(V) < t) ≥ P(F(V) < t).

On the other hand, by elementary properties of the limits inferior and superior and the portmanteau lemma again,

lim inf_{n→∞} P(Fn(V) < t) ≤ lim sup_{n→∞} P(Fn(V) < t)
≤ lim sup_{n→∞} P(Fn(V) ≤ t)
= 1 − lim inf_{n→∞} P(Fn(V) > t)
≤ 1 − P(F(V) > t)
= P(F(V) ≤ t).
If the function t ↦ P(F(V) ≤ t) is continuous at t, then

P(F(V) ≤ t) = P(F(V) < t) = Φ(F−1(t)),

and in this case

lim inf_{n→∞} P(Fn(V) < t) = lim sup_{n→∞} P(Fn(V) < t) = lim_{n→∞} P(Fn(V) < t) = P(F(V) < t) = Φ(F−1(t)).

The function Φ(F−1(·)) is certainly continuous at every point t at which F−1 is. Since Φ−1 is a continuous function as well (cf. Lemma 19), from this it follows that Fn−1(t) → F−1(t) at every point t at which F−1 is continuous. Thus Fn−1 ⇝ F−1.
The work we put into the previous results allows us to give a short proof of the
almost sure representation theorem.
Theorem 24 (Almost sure representation). Let Xn ⇝ X. Then there exists a probability space (Ω̃, F̃, P̃) and random variables X̃n, X̃ defined on it, such that for all n ≥ 1, X̃n =d Xn, X̃ =d X, and X̃n →a.s. X̃.
Proof. Let Fn and F be the distribution functions of Xn and X, respectively. Consider the probability space (Ω̃, F̃, P̃) = ([0, 1], B[0, 1], λ) and let U be a random variable on it with a standard uniform distribution. Define X̃n = Fn−1(U) and X̃ = F−1(U). By Corollary 20, X̃n =d Xn and X̃ =d X. By Lemma 23, the convergence Fn ⇝ F implies that Fn−1 ⇝ F−1. By definition the latter means that Fn−1(t) → F−1(t) at all points t at which F−1 is continuous. Note that F−1 has at most a countable number of discontinuity points, and hence the convergence Fn−1(t) → F−1(t) can perhaps fail only on a set with Lebesgue measure zero. Since U has a continuous distribution, this implies that Fn−1(U) →a.s. F−1(U), i.e. X̃n →a.s. X̃.

Several applications of the almost sure representation theorem will be given in the next section.
1.7. Relation to other modes of convergence
Firstly, we show that convergence in probability implies convergence in distribution.
Theorem 25. Suppose that a sequence {Xn} of random variables and a random variable X are defined on the same probability space. Assume that Xn →P X. Then Xn ⇝ X.
Proof. Suppose the convergence Xn ⇝ X fails. By definition this means that there exists f ∈ Cb(R), such that the convergence µn(f) → µ(f) fails. Thus there exists ε > 0 and a subsequence {nk} of {n}, such that |µnk(f) − µ(f)| ≥ ε for all nk. This is obviously true for any further subsequence of {nk} as well. Pick a subsequence {nkℓ} of {nk}, such that Xnkℓ →a.s. X (this is possible, because Xn →P X). Then µnkℓ(f) → µ(f) by the dominated convergence theorem. But this leads to a contradiction that proves the theorem.
Corollary 26. Suppose that a sequence {Xn} of random variables and a random variable X are defined on the same probability space. Assume that Xn →a.s. X. Then Xn ⇝ X.
This follows from Theorem 25 and the fact that almost sure convergence implies
convergence in probability.
The converse to Theorem 25 (and Corollary 26) is in general false.
Example 27. Let X ∼ N(0, 1) and Xn = −X for all n ∈ N. Then P(|Xn − X| > ε) = P(|X| > ε/2) > 0 for all n ∈ N, and thus convergence in probability fails. Obviously, so does the almost sure convergence. On the other hand, by the symmetry of the standard normal distribution, Xn =d X, and hence Xn ⇝ X.
There is one notable exception, however.
Theorem 28. Let the random variables X, X1, X2, . . . be defined on the same probability space. If Xn ⇝ X, where P(X = x) = 1 for some x ∈ R, then also Xn →P X.
Proof. The distribution µ of X is the Dirac measure at x. For any ε > 0, the
sets (x + ε, ∞) and (−∞, x − ε) are µ-continuity sets. Note that
P(|Xn − X| > ε) = P(Xn > x + ε) + P(Xn < x − ε).
The right-hand side of the above display tends to zero as n → ∞ by the portmanteau
lemma. This completes the proof.
Next we move to convergence of the first moments. Since weak convergence in general does not imply convergence in probability, neither does it in general imply convergence of means. But when the collection {Xn} is uniformly integrable, the weak convergence Xn ⇝ X can be strengthened to convergence of means: E[Xn] → E[X]. The proof is a simple application of the almost sure representation theorem.
Theorem 29. Assume that Xn ⇝ X. If the sequence {Xn} is uniformly integrable, then E[Xn] → E[X] as n → ∞.
Proof. By the almost sure representation theorem, there exists a probability space (Ω̃, F̃, P̃) with random variables X̃, X̃1, X̃2, . . . , such that X =d X̃, Xn =d X̃n for all n ∈ N, and X̃n →a.s. X̃. By the uniform integrability of the family {Xn}, the family {X̃n} is also uniformly integrable. Therefore E[X̃n] → E[X̃], and since this latter convergence depends only on the laws of the random variables involved, the result follows.
Remark 30. Assume that {Xn } and X are defined on the same probability
space. Inspecting the proof of the previous theorem, one could have thought that
not only do the means converge, but that we also have the L1 -convergence: E[|Xn −
X|] → 0. However, this in general is false and here is a simple counterexample: take
X ∼ N (0, 1) and Xn = −X for all n ∈ N. Then the conditions of Theorem 29 are
satisfied, but E[|Xn − X|] = 2E[|X|], which does not tend to zero. The point is that
E[|Xn − X|] depends on the bivariate law of (Xn, X), and this need not be the same as that of (X̃n, X̃) (marginals do not determine joint distributions uniquely). This serves as a warning about when the almost sure representation theorem is applicable and when it is not: the representation does not in general preserve the dependence structure of {Xn} and X, and hence typically cannot be used for statements dealing with multivariate vectors obtained from {Xn} and X.
The following is what we can obtain without the uniform integrability assumption in Theorem 29. Again, the proof is an application of the almost sure representation theorem.
Theorem 31. If Xn ⇝ X, then lim inf_{n→∞} E[|Xn|] ≥ E[|X|].
Proof. By the almost sure representation theorem, there exists a probability space (Ω̃, F̃, P̃) with random variables X̃, X̃1, X̃2, . . . , such that X =d X̃, Xn =d X̃n for all n ∈ N, and X̃n →a.s. X̃. Fatou's lemma implies that E[|X̃|] ≤ lim inf_{n→∞} E[|X̃n|], and the statement follows.
1.8. Slutsky’s lemma
Suppose Xn ⇝ X and the sequence {Yn} is close in some sense to {Xn}. What can be said about the weak limit of {Yn}? Or suppose that {Xn} and {Yn} are weakly convergent. What can be said about the weak convergence of the sequence {XnYn}? The following result, known as Slutsky's lemma², gives an answer to these questions.
Theorem 32. Let {Xn} and {Yn} be two sequences of random variables defined on the same probability space.
(i) If Xn ⇝ X and |Xn − Yn| →P 0, then Yn ⇝ X.
(ii) If Xn ⇝ X and Yn ⇝ c for a constant c, then XnYn ⇝ cX.
Proof. We first prove (i). Let F be closed and δ = 1/m for m ∈ N. We have

P(Yn ∈ F) = P(Xn + Yn − Xn ∈ F)
= P(Xn + Yn − Xn ∈ F; |Xn − Yn| < δ) + P(Xn + Yn − Xn ∈ F; |Xn − Yn| ≥ δ)
≤ P(Xn ∈ F^δ) + P(|Xn − Yn| ≥ δ).

Letting n → ∞ and using the assumption |Xn − Yn| →P 0 and the portmanteau lemma, we obtain that

lim sup_{n→∞} P(Yn ∈ F) ≤ P(X ∈ F^δ).

Since F^δ ↓ F as m → ∞, the result follows by another application of the portmanteau lemma.
Now we prove (ii). Write

(1.5)   XnYn = Xn(Yn − c) + cXn.
²An alternative, but less common, spelling of Slutsky's name is Slutzky. Also, the result is at times called a theorem rather than a lemma.
An elementary argument shows that for any ε > 0 and δ > 0,

(1.6)   P(|Xn(Yn − c)| > ε) ≤ P(|Xn| > ε/δ) + P(|Yn − c| > δ).

Fix ε and pick δ such that ε/δ and −ε/δ are continuity points of the distribution of X. Then the first term on the right-hand side of the above display converges to P(|X| > ε/δ). The latter can be made arbitrarily small by taking δ small enough. As far as the second term in (1.6) is concerned, for every fixed δ it converges to zero. Hence Xn(Yn − c) →P 0. It is also easy to see that cXn ⇝ cX (this can be done in a variety of ways; for instance, the almost sure representation theorem and the dominated convergence theorem give for f ∈ Cb(R) that E[f(cXn)] → E[f(cX)]). Now apply part (i) to the right-hand side of (1.5).
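Here is a minimal simulation sketch of part (ii), with all distributional choices being arbitrary illustrations: with Xn ⇝ X ∼ N(0, 1) and Yn ⇝ 2, the products XnYn should behave like 2X for large n.

    import numpy as np

    # Slutsky's lemma, part (ii): X_n ~ N(0, 1 + 1/n) converges weakly to X ~ N(0, 1),
    # Y_n = 2 + Z_n / n with Z_n ~ N(0, 1) converges weakly to the constant 2,
    # so X_n * Y_n should converge weakly to 2 X ~ N(0, 4).
    rng = np.random.default_rng(2)
    n, m = 1000, 200_000                      # index n and Monte Carlo sample size

    x_n = rng.normal(0.0, np.sqrt(1.0 + 1.0 / n), size=m)
    y_n = 2.0 + rng.normal(size=m) / n
    prod = x_n * y_n
    target = 2.0 * rng.normal(size=m)         # draws from the limit law N(0, 4)

    for q in (0.1, 0.5, 0.9):                 # compare a few empirical quantiles
        print(q, np.quantile(prod, q), np.quantile(target, q))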
Slutsky’s lemma finds numerous applications in asymptotic theorems of mathematical statistics.
Exercises
1 Show that Xn ⇝ X iff E[f(Xn)] → E[f(X)] for all bounded uniformly continuous functions f.
2 Show the implication Fn(x) → F(x) for all x ∈ CF ⇒ µn →w µ without referring to the almost sure representation theorem. Hint: first, for given ε > 0, take K > 0 such that F(K) − F(−K) > 1 − ε. Approximate a function f ∈ Cb(R) on
the interval [−K, K] by a piecewise constant function, compute the integrals of
this approximating function and use the convergence of {Fn } at continuity points
of F etc.
3 Let {µn } be a sequence of discrete uniform distributions on [0, 1]: µn (i/n) = 1/n,
i = 1, . . . , n. Show that {µn } is weakly convergent and identify the weak limit.
4 Let {Xn} be an i.i.d. sequence of exponentially distributed random variables: FXn(x) = P(Xn ≤ x) = 1 − e^{−x} for x ≥ 0 and FXn(x) = 0 for x < 0. Let Mn = −log n + max_{1≤i≤n} Xi. Show that FMn ⇝ FM, where FM(x) = P(M ≤ x) = exp(−e^{−x}), x ∈ R. The latter distribution is known as the Gumbel distribution (or the extreme value distribution).
5 Consider the N(µn, σn²) distributions, where the µn are real numbers and the σn² nonnegative. Show that this family is tight iff the sequences (µn) and (σn²) are bounded. Under what condition do we have that the N(µn, σn²) distributions converge to a (weak) limit? What is this limit?
6 Let random variables X and Xn possess discrete distributions supported on N. Show that Xn ⇝ X if and only if P(Xn = m) → P(X = m) for every m ∈ N.
7 Give an example of distribution functions F and Fn on the real line, such that Fn ⇝ F, but sup_x |Fn(x) − F(x)| → 0 fails.
8 For a distribution function G on the real line the median is defined by G−1(1/2). Assume that Fn ⇝ F and let m = med(F) and mn = med(Fn) be the medians of F and Fn, respectively. Find suitable assumptions, under which mn → m as n → ∞.
9 Let F and G be two distribution functions on R and let

L(F, G) = inf{h > 0 : F(x − h) − h ≤ G(x) ≤ F(x + h) + h for all x ∈ R}

be the Lévy distance between them (accept as a fact, or prove for yourself, that L(F, G) defines a distance). Show that the weak convergence Fn ⇝ F is equivalent to convergence in the Lévy metric: L(Fn, F) → 0. Hint: the implication L(Fn, F) → 0 ⇒ Fn ⇝ F follows from the definition. The other one can be established by contradiction.
10 Prove uniqueness of the weak limit µ of a weakly convergent sequence of probability measures µn .
CHAPTER 2
Characteristic functions
2.1. Definition and first properties
Let X be a random variable defined on (Ω, F, P). X induces a probability measure on (R, B), the law or distribution of X, denoted by PX or µ. This probability
measure, in turn, determines the distribution function F of X. Conversely, F also
determines PX . Hence distribution functions on R and probability measures on
(R, B) are in bijective correspondence. In this chapter we develop another such
correspondence. We start with a definition.
Definition 33. Let µ be a probability measure on (R, B). Its characteristic function φ : R → C is defined by

(2.1)   φ(u) = ∫_R e^{ıux} µ(dx).
Whenever needed, we write φµ instead of φ to express the dependence on µ.
Note that in this definition we integrate a complex-valued function. By splitting a complex-valued function f = g + ıh into its real part g and imaginary part h, we define ∫ f dµ := ∫ g dµ + ı ∫ h dµ. For integrals of complex-valued functions, previously shown theorems are, mutatis mutandis, true. For instance, one has |∫ f dµ| ≤ ∫ |f| dµ, where | · | denotes the norm of a complex number.
If X is a random variable with distribution µ, then φµ can alternatively be
expressed as φ(u) = E[exp(ıuX)]. There are many random variables with distribution µ. They all have the same characteristic function. We also adopt the notation
φX to indicate that we are dealing with the characteristic function of the random
variable X.
Before we give some examples and elementary properties of characteristic functions, we look at a special case. Suppose that X admits a density f with respect
to Lebesgue measure. Then
Z
(2.2)
φX (u) =
eıux f (x) dx.
R
Analysts define for f ∈ L1(R, B, λ) the Fourier transform f̂ by

f̂(u) = ∫_R e^{−ıux} f(x) dx.

What we thus see is the equality φX(u) = f̂(−u). Given the usefulness of Fourier transforms in various branches of mathematics, we then get a feeling that characteristic functions will be important in probability theory as well.
Computation of a characteristic function (if it is explicitly computable) is typically a clever exercise in integration.
Example 34. Let X ∼ N(0, 1). Then

φX(u) = E[e^{ıuX}] = (1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = e^{−u²/2}.

In fact,

(1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = (1/√(2π)) ∫_R Σ_{n=0}^∞ ((ıux)^n / n!) e^{−x²/2} dx
= Σ_{n=0}^∞ ((ıu)^n / n!) (1/√(2π)) ∫_R x^n e^{−x²/2} dx
= Σ_{n=0}^∞ ((ıu)^n / n!) E[X^n].
For n odd, E[X n ] = 0, while by Stein’s lemma, see Lemma 35 ahead, for n even,
E[X n ] = (n − 1)!!. Hence the above chain of equalities can be continued as
Σ_{n=0}^∞ ((ıu)^n / n!) E[X^n] = Σ_{n=0}^∞ ((ıu)^{2n} / (2n)!) (2n − 1)!!
= Σ_{n=0}^∞ ((ıu)^{2n} / (2n)!) · (2n)! / (2^n n!)
= Σ_{n=0}^∞ (1/n!) (−u²/2)^n
= e^{−u²/2}.
Here we used the fact that
(2n − 1)!! = Π_{i=1}^n (2i − 1) = {Π_{i=1}^n (2i − 1)} {Π_{i=1}^n (2i)} · (1 / Π_{i=1}^n (2i)) = (2n)! / (2^n n!).
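A quick Monte Carlo check of this computation (the seed, sample size and values of u are arbitrary choices) estimates E[e^{ıuX}] from simulated standard normal draws and compares it with e^{−u²/2}:

    import numpy as np

    # Monte Carlo check of Example 34: for X ~ N(0, 1) one has E[exp(iuX)] = exp(-u^2/2).
    rng = np.random.default_rng(3)
    x = rng.normal(size=1_000_000)

    for u in (0.5, 1.0, 2.0):
        empirical = np.mean(np.exp(1j * u * x))          # estimate of E[exp(iuX)]
        print(u, empirical.real, np.exp(-u ** 2 / 2))    # imaginary part is close to 0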
Lemma 35 (Stein’s lemma). Let X ∼ N (0, 1) and let g be a differentiable
function satisfying E[g(X)X] < ∞ and E[g 0 (X)] < ∞. Then E[g(X)X] = E[g 0 (X)].
Proof. We have
1
E[g(X)X] = √
2π
Z
g(x)xe−x
2
/2
dx.
R
By partial integration the right-hand side is equal to
Z
2
1
1
−x2 /2 ∞
− √ g(x)e
|−∞ + √
g 0 (x)e−x /2 dx = E[g 0 (X)].
2π
2π R
This completes the proof.
Here is another illustrative example.
Example 36. Let X have a standard Cauchy distribution. Directly from the definition, when u = 0, φX(u) = 1. Now assume u ≠ 0. Then

φX(u) = (1/π) ∫_R e^{ıux} · 1/(1 + x²) dx = (1/π) ∫_R cos(ux)/(1 + x²) dx = (1/π) |u| ∫_R cos(y)/(u² + y²) dy.

The integral in the last equality is best evaluated through contour integration techniques. Let CR be a closed contour consisting of the real line segment from −R to R and the upper semi-circle ΓR centred at the origin and of radius R. It can be shown that

∫_{ΓR} e^{ız}/(u² + z²) dz → 0

as R → ∞, see pp. 145–146 in Bak and Newman (2010). Therefore,

∫_{CR} e^{ız}/(u² + z²) dz → ∫_R e^{ıy}/(u² + y²) dy.

Taking real parts on both sides, since z0 = ı|u| is the only pole of the function e^{ız}/(u² + z²) in the upper half plane, by the residue theorem we get that

∫_R cos(y)/(u² + y²) dy = Re[2πı Res(e^{ız}/(u² + z²), z0)].

Now, since z0 is a simple pole, it follows by (ii) on p. 130 in Bak and Newman (2010) that

Res(e^{ız}/(u² + z²), z0) = lim_{z→z0} (z − z0) e^{ız}/(u² + z²) = e^{−|u|}/(2ı|u|).

Thus φX(u) = e^{−|u|} for u ≠ 0. We conclude that φX(u) = e^{−|u|} for all u ∈ R.
The following proposition lists some simple properties of characteristic functions.
Proposition 37. Let φ = φX be the characteristic function of some random
variable X. The following hold true:
(i) φ(0) = 1 and |φ(u)| ≤ 1 for all u ∈ R.
(ii) φ is uniformly continuous on R.
(iii) φ_{aX+b}(u) = φX(au) e^{ıub}.
(iv) φ is real-valued and symmetric around zero, if X and −X have the same distribution.
(v) If X and Y are independent, then φ_{X+Y}(u) = φX(u)φY(u).
(vi) If E|X|^k < ∞, then φ ∈ C^k(R) and φ^{(k)}(0) = ı^k E[X^k].
Proof. Properties (i), (iii) and (iv) are trivial. Consider (ii). Fixing u ∈ R, we consider φ(u + t) − φ(u) for t → 0. We have

|φ(u + t) − φ(u)| = |∫ (exp(ı(u + t)x) − exp(ıux)) µ(dx)| ≤ ∫ |exp(ıtx) − 1| µ(dx).

The functions x ↦ exp(ıtx) − 1 converge to zero pointwise for t → 0 and are bounded by 2. The result thus follows from dominated convergence.
Property (v) follows from the product rule for expectations of independent
random variables.
Finally, property (vi) for k = 1 follows by an application of the dominated
convergence theorem and the inequality |eıx − 1| ≤ |x|, for x ∈ R. The other cases
can be treated similarly.
Remark 38. Here is a simple application of Proposition 37: if X ∼ N(m, σ²), then φX(u) = e^{ıum − σ²u²/2}.
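The computation behind Remark 38 is a one-liner: writing X = m + σZ with Z ∼ N(0, 1) and combining Proposition 37 (iii) with Example 34,

    % X = m + sigma Z with Z ~ N(0,1); Proposition 37 (iii) and Example 34 give
    \varphi_X(u) = \varphi_{m+\sigma Z}(u)
                 = e^{\imath u m}\,\varphi_Z(\sigma u)
                 = e^{\imath u m}\,e^{-\sigma^2 u^2/2}
                 = e^{\imath u m-\sigma^2 u^2/2}.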
Remark 39. Warning: the converse to Proposition 37 (v) is typically false, i.e.
from the equality
φX+Y (u) = φX (u)φY (u), u ∈ R,
it cannot be concluded that X and Y are independent. Here is a counterexample:
let X have a standard Cauchy distribution and let Y = X. Then
e−2|u| = φ2X (u) = φX+Y (u) = e−|u| e−|u| = φX (u)φY (u),
although X and Y are obviously dependent in this case. More on this example
later.
2.2. Inversion formula and uniqueness
Given a characteristic function φ, how can we find the corresponding distribution function F, or the corresponding law µ? As we will see, an answer to this question is given by the inversion formula below. Note that the integration interval in formula (2.3) is symmetric around zero. This is essential: an improper integral

∫_{−∞}^{∞} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du

typically does not exist (in the Lebesgue sense). That the limit in (2.3), called the Cauchy limit, is finite, is actually part of the assertion of the theorem.
Theorem 40. Let µ be a probability law and φ its characteristic function. Then for all a < b,

(2.3)   lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du = µ((a, b)) + (1/2) µ({a, b}).
Proof. We compute, using Fubini's theorem, which we will justify below,

(2.4)   ΦT := (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du
= (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) ∫_R e^{ıux} µ(dx) du
= (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) e^{ıux} du µ(dx)
(2.5)   = (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{ı(x−a)u} − e^{ı(x−b)u})/(ıu)) du µ(dx)
=: ∫_R ET(x) µ(dx).

Application of Fubini's theorem is justified as follows. First, the integrand in (2.5) is bounded by b − a, because |e^{ıx} − e^{ıy}| ≤ |x − y| for all x, y ∈ R. Second, the product measure λ × µ on [−T, T] × R is finite.
By splitting the integrand of ET(x) into its real and imaginary part, we see that the imaginary part vanishes and we are left with

ET(x) = (1/(2π)) ∫_{−T}^{T} ((sin(x − a)u − sin(x − b)u)/u) du
= (1/(2π)) ∫_{−T}^{T} (sin(x − a)u / u) du − (1/(2π)) ∫_{−T}^{T} (sin(x − b)u / u) du
= (1/(2π)) ∫_{−T(x−a)}^{T(x−a)} (sin v / v) dv − (1/(2π)) ∫_{−T(x−b)}^{T(x−b)} (sin v / v) dv.

The function g given by g(s, t) = ∫_s^t (sin y / y) dy is continuous in (s, t). Hence it is bounded on any compact subset of R². Moreover, g(s, t) → π as s → −∞ and t → ∞ (this can be shown by contour integration techniques; see e.g. pp. 146–147 in Bak and Newman (2010)¹). Hence g, as a function on R², is bounded in s, t. We conclude that also ET(x) is bounded as a function of T and x, the first ingredient to apply the dominated convergence theorem to (2.5), since µ is a finite measure. The second ingredient is to identify E(x) := lim_{T→∞} ET(x). For an arbitrary α, a change of the integration variable gives

∫_0^∞ (sin(αy)/y) dy = sgn(α) π/2.

Here sgn(α) denotes 1, 0 or −1 according to whether α > 0, α = 0 or α < 0. By comparing the location of x relative to a and b, we use the value of the latter integral to obtain

E(x) = 1 if a < x < b,   E(x) = 1/2 if x = a or x = b,   E(x) = 0 else.

We thus get, using the dominated convergence theorem again, that

ΦT → µ((a, b)) + (1/2) µ({a, b})

as T → ∞. This completes the proof.
Remark 41. If a and b are continuity points of F, then the right-hand side of
(2.3) is F (b) − F (a). Thus φ determines F at all continuity points of F. But due to
right-continuity of F, the latter completely determines F. F in turn determines µ,
and so φ determines µ.
Let us now give another version of the inversion formula.
Theorem 42. If the characteristic function φ of a probability measure µ on
(R, B) belongs to L1 (R, B, λ), then µ admits a density f w.r.t. the Lebesgue measure
λ. Moreover, f is continuous.
Proof. Define

(2.6)   f(x) = (1/(2π)) ∫_R e^{−ıux} φ(u) du.

¹An alternative derivation is given here: http://staff.science.uva.nl/~hvzanten/ex_5_9.pdf
Since |φ| has a finite integral, f is well defined for every x. Observe that f is real-valued, because φ(−u) is the complex conjugate of φ(u). An easy application of the dominated convergence theorem shows that f is continuous. Now note first that the limit of the integral in (2.3) is equal to the (Lebesgue) integral (1/(2π)) ∫_R ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du, again because of dominated convergence. Next we use Fubini's theorem to compute for any continuity points a < b of F that

∫_a^b f(x) dx = ∫_a^b (1/(2π)) ∫_R e^{−ıux} φ(u) du dx
= (1/(2π)) ∫_R φ(u) ∫_a^b e^{−ıux} dx du
= (1/(2π)) ∫_R φ(u) ((e^{−ıua} − e^{−ıub})/(ıu)) du
= F(b) − F(a),

where we also employed Theorem 40. Next, by continuity of ∫_a^b f(x) dx in a and b, the relationship

∫_a^b f(x) dx = F(b) − F(a)

in fact holds for any a, b ∈ R. By continuity of f, for any y ∈ [a, b] the Lebesgue integral ∫_a^y f(x) dx equals the Riemann integral. By the fundamental theorem of calculus it follows that F′(y) = f(y) for all y ∈ (a, b) and so for all y ∈ R. Since F is non-decreasing, f must be nonnegative, and hence it is a probability density.

Remark 43. Note the duality between the expressions (2.2) and (2.6). Apart
from the presence of the minus sign in the integral and the factor 2π in the denominator in (2.6), the transformations f ↦ φ and φ ↦ f are similar.
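As a numerical illustration of (2.6), one can approximately recover the standard Cauchy density 1/(π(1 + x²)) from the characteristic function φ(u) = e^{−|u|} of Example 36 by truncating the integral and using a Riemann sum; the truncation level and grid below are arbitrary choices.

    import numpy as np

    # Numerical check of the inversion formula (2.6) with phi(u) = exp(-|u|):
    # the recovered density should match the standard Cauchy density 1/(pi (1 + x^2)).
    u, du = np.linspace(-50.0, 50.0, 200_001, retstep=True)   # truncated grid for u
    phi = np.exp(-np.abs(u))

    for x in (0.0, 1.0, 3.0):
        integrand = np.exp(-1j * u * x) * phi
        f_x = (integrand.sum() * du).real / (2.0 * np.pi)      # Riemann sum for (2.6)
        print(x, f_x, 1.0 / (np.pi * (1.0 + x ** 2)))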
The inversion theorem entails one very important result.
Theorem 44. Random variables X and Y are equal in distribution if and only
if their characteristic functions are the same: φX (t) = φY (t) for all t ∈ R.
Proof. One side of the theorem is trivial. For the other side we argue as follows: suppose φX(t) = φY(t) for all t ∈ R. By Fubini's theorem and the inversion formula for characteristic functions, for every σn > 0 and y ∈ R we have

∫_R e^{−ıty} e^{−σn²t²/2} φX(t) dt = ∫_R e^{−ıty} e^{−σn²t²/2} E[e^{ıtX}] dt
= E[ ∫_R e^{−ıt(y−X)} e^{−σn²t²/2} dt ]
= (√(2π)/σn) E[ e^{−(y−X)²/(2σn²)} ]
= (√(2π)/σn) ∫_R e^{−(y−x)²/(2σn²)} dFX(x)
= 2π f_{X+σnZ}(y).

Here Z is a standard normal random variable independent of X and f_{X+σnZ} is the density of X + σnZ with respect to the Lebesgue measure. Replace φX with φY in the above argument to see that f_{X+σnZ}(y) = f_{Y+σnZ}(y). This implies that for
every σn > 0, X + σnZ =d Y + σnZ. Letting σn → 0 as n → ∞, Slutsky's lemma gives that X + σnZ ⇝ X. Likewise, X + σnZ ⇝ Y. Due to the uniqueness of the weak limit, we then obtain that X =d Y.
Put another way, Theorem 44 implies that there is a one-to-one correspondence
between probability measures and characteristic functions.
2.3. Necessary conditions
In the previous sections we have derived some properties a characteristic function possesses. Equally interesting is finding general conditions for a function φ to
be a characteristic function. We will formulate two results in that direction. Their
proofs can be found e.g. in Chung (2001) (see Theorems 6.5.2 and 6.5.3 there). The
first result gives a necessary and sufficient condition, but is not easily verifiable.
The second one is only sufficient, but its conditions are simpler.
Recall that a complex-valued function φ on R is called positive definite, if for any finite set of real numbers tj's and complex numbers zj's, 1 ≤ j ≤ n, n = 1, 2, . . . , we have

Σ_{j=1}^n Σ_{k=1}^n φ(tj − tk) zj z̄k ≥ 0,

where z̄k is the complex conjugate of zk.
Theorem 45 (Bochner-Khinchin theorem). A function φ is a characteristic
function if and only if it is positive definite, continuous at 0, and φ(0) = 1.
Theorem 46 (Pólya’s theorem). Let φ satisfy the following conditions: φ(0) =
1, φ is nonnegative, symmetric around zero, and decreasing, continuous and convex
on [0, ∞). Then φ is a characteristic function.
Example 47. Let 0 < α ≤ 1. An application of Pólya's theorem gives that the function

φα(u) = e^{−|u|^α}

is a characteristic function (check this). No such luck when 1 < α < 2, but via an alternative route φα can nevertheless be shown to be a characteristic function in that case as well (see e.g. pp. 192–193 in Chung (2001)). When α = 2, we know that φα corresponds to the normal distribution. A probability distribution that has φα as a characteristic function is called a stable distribution with index α. We finally remark that it can be shown that φα with α > 2 is not a characteristic function (in this case φα is twice differentiable at zero and φα′(0) = φα″(0) = 0. Assume φα is a characteristic function. By Theorem 6.4.1 in Chung (2001) the first and second moments of the corresponding probability law are zero. But then µ must be the Dirac measure at zero, so that φα(u) = 1 for all u ∈ R. This is a contradiction).
2.4. Multidimensional case
Our treatment in this section is cursory and we omit most details.
The characteristic function φ of a probability measure µ on (Rᵏ, B(Rᵏ)) is defined by the k-dimensional analogue of (2.1). We have, with u, x ∈ Rᵏ and ⟨·, ·⟩ the standard inner product,

φ(u) = ∫_{Rᵏ} e^{ı⟨u,x⟩} µ(dx).
As in the real case, probability measures here too are uniquely determined by their
characteristic functions. As a consequence we have the following characterization
of independent random variables.
Proposition 48. Let X = (X1, . . . , Xk) be a k-dimensional random vector. Then X1, . . . , Xk are independent random variables iff φX(u) = Π_{i=1}^k φXi(ui) for all u = (u1, . . . , uk) ∈ Rᵏ.
Proof. If the Xi are independent, the statement about the characteristic functions is proved in the same way as Proposition 37 (v). If the characteristic function
φX factorizes as stated, the result follows by the uniqueness property of characteristic functions.
Remark 49. Let k = 2 in the above proposition. If X1 = X2 as in Remark 39,
then we do not have φX (u) = φX1 (u1 )φX2 (u2 ) for every u1 , u2 (you check!), in
agreement with the fact that X1 and X2 are not independent. But for the special
choice u1 = u2 this product relation holds true.
Example 50. Let X and Y be independent standard normal random variables.
Then somewhat unexpectedly, the random variables X − Y and X + Y are also
independent, which can be shown using Proposition 48.
Exercises
1 Let φ be a characteristic function. Show that so is |φ|².
2 If F and G are distribution functions, such that F = Σ_{j=1}^m bj δaj and G has a density, say g, show that the convolution F ∗ G has a density and find it.
3 Show that for any characteristic function φ,

Re[1 − φ(u)] ≥ (1/4) Re[1 − φ(2u)].
4 A random variable X with the characteristic function φ is symmetric, if and only
if φ(u) is real for all u ∈ R.
5 Let X1, X2, . . . be a sequence of i.i.d. random variables and N a Poisson(λ) distributed random variable, independent of the Xn. Put Y = Σ_{n=1}^N Xn. Let φ be the characteristic function of the Xn and ψ the characteristic function of Y. Show that ψ = exp(λφ − λ).
6 If X has an exponential distribution with parameter λ, then φX (u) = λ/(λ − iu).
7 Let φ be a real characteristic function with the property that φ(nu) = φ(u)^n for all u ∈ R and n ∈ N. Show that for some α ≥ 0 it holds that φ(u) = exp(−α|u|). Let X have characteristic function φ(u) = exp(−α|u|). If α > 0, show that X admits the density

x ↦ (1/π) · α/(α² + x²).

What is the distribution of X if α = 0?
8 Prove the statement made in Example 50. Also verify that the function φα from
Example 47 is indeed a characteristic function for 0 < α ≤ 1.
9 Let µ be a probability law on (R, B(R)) and let φ be the corresponding characteristic function. Show that for any fixed x ∈ R,

lim_{T→∞} (1/(2T)) ∫_{−T}^{T} e^{−ıux} φ(u) du = µ({x}).

Hint: reduce the question to studying

∫_{R\{x}} (sin(T(y − x))/(T(y − x))) µ(dy) + ∫_{{x}} µ(dy).
10 Let the distribution function F on R have a density f with respect to the Lebesgue
measure. Prove that for the corresponding characteristic function φ one has
φ(u) → 0 as |u| → ∞. This result is known as the Riemann-Lebesgue lemma and
its ‘analytic counterpart’ is of importance in harmonic analysis. You may assume
additionally that f is continuous Lebesgue a.e. You will get a bonus point, if you
prove the result for a general f (not necessarily continuous).
CHAPTER 3
Limit theorems
This chapter deals with a number of important limit theorems in probability
theory. Their proofs are to a considerable extent based on characteristic function
techniques.
3.1. Characteristic functions and weak convergence
In this section we study how characteristic functions relate to weak convergence.
Our first result says that weak convergence of probability measures implies
pointwise convergence of their characteristic functions.
Proposition 51. Let µ, µ1, µ2, . . . be probability measures on (R, B) and let φ, φ1, φ2, . . . be their characteristic functions. If µn →w µ, then φn(u) → φ(u) for every u ∈ R.

Proof. Consider for fixed u the function f(x) = e^{ıux}. It is obviously bounded and continuous and we obtain straight from the definition of weak convergence that µn(f) → µ(f). But µn(f) = φn(u).
Proposition 52. Let µ1, µ2, . . . be probability measures on (R, B). Let φ1, φ2, . . . be the corresponding characteristic functions. Assume that the sequence (µn) is tight and that for all u ∈ R the limit φ(u) := limn→∞ φn(u) exists. Then there exists a probability measure µ on (R, B), such that φ = φµ and µn →w µ.
Proof. Since (µn ) is tight we use Prokhorov’s theorem to deduce that there
exists a weakly converging subsequence (µnk ) with a probability measure as limit.
Call this limit µ. From Proposition 51 we know that φnk (u) → φµ (u) for all u.
Hence we must have φµ = φ. We will now show that any convergent subsequence
of (µn) has µ as a limit. Suppose that there exists a subsequence (µ_{n′_k}) with limit µ′. Then φ_{n′_k}(u) converges to φ_{µ′}(u) for all u. But, since (µ_{n′_k}) is a subsequence of the original sequence, by assumption the corresponding φ_{n′_k}(u) must converge to φ(u) for all u. Hence we conclude that φ_{µ′} = φµ and then µ′ = µ.
Suppose that the whole sequence (µn) does not converge to µ. Then there must exist a function f ∈ Cb(R), such that µn(f) does not converge to µ(f). So there is ε > 0, such that for some subsequence (n′_k) we have
(3.1)  |µ_{n′_k}(f) − µ(f)| > ε.
Using Prokhorov’s theorem, the sequence (µ_{n′_k}) has a further subsequence (µ_{n″_k}) that has a limit probability measure µ″. By the same argument as above (convergence of the characteristic functions) we conclude that µ″(f) = µ(f). Therefore µ_{n″_k}(f) → µ(f), which contradicts (3.1).
Characteristic functions are a tool to obtain rough estimates of the tail probabilities of a random variable, which is useful for establishing tightness of a sequence of probability measures. To that end we will use the following lemma. By taking the complex conjugate, check first that ∫_{−a}^{a} (1 − φ(u)) du ∈ R for every a > 0.
Lemma 53. Let a random variable X have distribution µ and characteristic function φ. Then for every K > 0
(3.2)  P(|X| > 2K) ≤ K ∫_{−1/K}^{1/K} (1 − φ(u)) du.
Proof. It follows from Fubini’s theorem and
∫_{−a}^{a} e^{iux} du = 2 sin(ax)/x
that
K ∫_{−1/K}^{1/K} (1 − φ(u)) du = K ∫_{−1/K}^{1/K} ∫ (1 − e^{iux}) µ(dx) du
  = ∫ K ∫_{−1/K}^{1/K} (1 − e^{iux}) du µ(dx)
  = 2 ∫ (1 − sin(x/K)/(x/K)) µ(dx)
  ≥ 2 ∫_{|x/K|>2} (1 − sin(x/K)/(x/K)) µ(dx)
  ≥ µ([−2K, 2K]^c),
since (sin x)/x ≤ 1/2 for x > 2 (the function g(x) = (sin x)/x is called the cardinal sine, or simply the sinc function).
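As a quick sanity check of (3.2), here is a minimal numerical sketch in Python (using only the standard library): for the standard Cauchy law both sides of the bound are available in closed form, since φ(u) = exp(−|u|) and P(|X| > a) = 1 − (2/π) arctan(a); the distribution and the values of K are arbitrary choices made for illustration.

```python
# Illustrative sketch: the tail bound (3.2) for the standard Cauchy distribution,
# with phi(u) = exp(-|u|) and P(|X| > a) = 1 - (2/pi) arctan(a).
import math

for K in [0.5, 1.0, 2.0, 5.0]:
    tail = 1 - (2 / math.pi) * math.atan(2 * K)      # P(|X| > 2K)
    # K * integral_{-1/K}^{1/K} (1 - exp(-|u|)) du, evaluated in closed form
    bound = 2 - 2 * K * (1 - math.exp(-1 / K))
    print(f"K = {K:4.1f}:  P(|X| > 2K) = {tail:.4f}  <=  K*integral = {bound:.4f}")
```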
The following theorem is known as Lévy’s continuity theorem.
Theorem 54 (Lévy’s continuity theorem). Let µ1 , µ2 , . . . be a sequence of probability measures on (R, B) and φ1 , φ2 , . . . the corresponding characteristic functions.
Assume that for all u ∈ R the limit φ(u) := limn→∞ φn (u) exists. If φ is continuous
at zero, then there exists a probability measure µ on (R, B), such that φ = φµ and µn →w µ.
Proof. We will show that under the present assumptions, the sequence (µn )
is tight. To this end we will use Lemma 53. Let ε > 0. Since φ is continuous at
zero, the same holds for the function u ↦ φ(u) + φ(−u), and there is δ > 0 such that |φ(u) + φ(−u) − 2| < ε
if |u| < δ. Notice that φ(u) + φ(−u) is real-valued and bounded from above by 2.
Hence
2 ∫_{−δ}^{δ} (1 − φ(u)) du = ∫_{−δ}^{δ} (2 − φ(u) − φ(−u)) du ∈ [0, 2δε).
By the convergence of the characteristic functions (which are bounded), the dominated convergence theorem implies that
∫_{−δ}^{δ} (1 − φn(u)) du → ∫_{−δ}^{δ} (1 − φ(u)) du.
Hence, for all n ≥ N with N chosen large enough, we have
∫_{−δ}^{δ} (1 − φn(u)) du < 2δε.
It now follows from Lemma 53 that for n ≥ N and K = 1/δ
µn([−2K, 2K]^c) ≤ (1/δ) ∫_{−δ}^{δ} (1 − φn(u)) du < 2ε.
We conclude that (µn)_{n≥N} is tight, and then so is the sequence (µn)_{n∈N} as well. Apply Proposition 52 to conclude.
Corollary 55. Let µ, µ1, µ2, . . . be probability measures on (R, B) and φ, φ1, φ2, . . . be their corresponding characteristic functions. Then µn →w µ if and only if φn(u) → φ(u) for all u ∈ R.
Proof. If φn(u) → φ(u) for all u ∈ R, then we can apply Theorem 54. The function φ, being a characteristic function, is continuous at zero. Hence there is a probability measure to which the µn weakly converge. But since the φn(u) converge to φ(u), the limiting probability measure must be µ. The converse statement we have already encountered as Proposition 51.
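For illustration, here is a minimal numerical sketch in Python (using numpy; the value λ = 3 and the grid of points u are arbitrary choices): the characteristic functions of Bin(n, λ/n) converge pointwise to the characteristic function of Poisson(λ), so by Corollary 55 the corresponding laws converge weakly (compare with Exercise 7 at the end of this chapter).

```python
# Illustrative sketch: pointwise convergence of the characteristic functions of
# Bin(n, lam/n) to that of Poisson(lam), which by Corollary 55 is equivalent to
# weak convergence of the corresponding laws.
import numpy as np

lam = 3.0
u = np.linspace(-5.0, 5.0, 11)                      # a few points u on the real line

def phi_binomial(u, n, p):
    return (1 - p + p * np.exp(1j * u)) ** n        # characteristic function of Bin(n, p)

def phi_poisson(u, lam):
    return np.exp(lam * (np.exp(1j * u) - 1))       # characteristic function of Poisson(lam)

for n in [10, 100, 1000, 10000]:
    err = np.max(np.abs(phi_binomial(u, n, lam / n) - phi_poisson(u, lam)))
    print(f"n = {n:6d}:  max_u |phi_n(u) - phi(u)| = {err:.2e}")
```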
3.2. Weak law of large numbers
In this section we present the weak law of large numbers for a sequence of i.i.d.
random variables. In its proof we will need the following elementary result from
calculus.
Lemma 56. Let z be a complex number, such that |z| ≤ 1/2. Then there exists
a complex number θ depending on z, such that |θ| ≤ 1, and log(1 + z) = z + θ|z|2 .
Proof. Without loss of generality, assume that z ≠ 0 (when z = 0, log(1 + z) = 0 = z, and hence θ = 0). We have
log(1 + z) = z − z²/2 + z³/3 − z⁴/4 + . . .
  = z + z² (−1/2 + z/3 − z²/4 + . . .)
  = z + |z|² · (z²/|z|²) (−1/2 + z/3 − z²/4 + . . .).
We claim that
θ = (z²/|z|²) (−1/2 + z/3 − z²/4 + . . .).
To verify the claim, we need to check that |θ| ≤ 1. This, however, is easy:
|θ| = |−1/2 + z/3 − z²/4 + . . .| ≤ 1/2 + (1/3)(1/2) + (1/4)(1/2)² + . . . ≤ ∑_{k=1}^{∞} (1/2)^k = 1.
Corollary 57. If a sequence of complex numbers {cn} converges to the limit c, then
lim_{n→∞} (1 + cn/n)^n = e^c.
Proof. It is sufficient to prove that
lim_{n→∞} log{(1 + cn/n)^n} = lim_{n→∞} n log(1 + cn/n) = c.
Since the sequence {cn} converges, it is bounded, and furthermore, |cn/n| ≤ 1/2 for all n large enough. Then from Lemma 56,
n log(1 + cn/n) = cn + o(1).
Because the right-hand side tends to c as n → ∞, the result follows.
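A small numerical sketch of Corollary 57 in Python (the complex number c and the sequence cn = c + 1/n are arbitrary choices made for illustration):

```python
# Illustrative sketch: (1 + c_n/n)^n -> e^c for the arbitrarily chosen sequence
# c_n = c + 1/n, with c = 0.3 + 1.2i.
import cmath

c = 0.3 + 1.2j
for n in [10, 100, 1000, 10000]:
    cn = c + 1 / n
    approx = (1 + cn / n) ** n
    print(f"n = {n:6d}:  |(1 + c_n/n)^n - e^c| = {abs(approx - cmath.exp(c)):.2e}")
```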
Theorem 58 (Weak law of large numbers). Let X1, X2, . . . be i.i.d. random variables with characteristic function φ. Assume that φ is differentiable at zero and φ′(0) = ıµ. Then
X̄n = (1/n) ∑_{i=1}^n Xi →P µ.
Proof. By differentiability of φ at zero, we have
φ(t) = φ(0) + φ′(0)t + o(t) = 1 + ıµt + o(t).
By independence of the Xi, for every fixed t,
E[e^{ıtX̄n}] = φ(t/n)^n = (1 + ıµt/n + o(1/n))^n.
As n → ∞, by Corollary 57 the right-hand side converges to e^{ıtµ}. Now φ(t) = e^{ıtµ} is the characteristic function of a constant random variable µ. By Lévy’s continuity theorem, X̄n ⇝ µ. Since convergence in distribution and convergence in probability are equivalent for constant limits, it follows that X̄n →P µ.
Remark 59. If E[|X1|] < ∞, then the dominated convergence theorem allows one to interchange the order of differentiation and expectation to obtain
(3.3)  φ′(t) = (d/dt) E[e^{itX1}] = E[(d/dt) e^{itX1}] = ıE[X1 e^{itX1}].
For t = 0 this yields φ′(0) = ıE[X1] = ıµ and X̄n →P E[X1], which is hardly surprising in light of the strong law of large numbers. However, integrability of X1 is a sufficient, but not a necessary, condition for differentiability of φ at zero. Hence the weak law of large numbers holds under a weaker condition than the strong law.
Remark 60. The condition φ′(0) = ıµ is also necessary for the convergence X̄n →P µ. We will not prove this fact. For the proof see e.g. Theorem 2.5.5 in Révész (1968).
An alternative necessary and sufficient condition for the weak law of large numbers,
that does not employ characteristic functions, is also known (see e.g. Chung (2001),
pp. 116–118). Furthermore, Chung (2001), pp. 118–119, contains an example, in
which the weak law of large numbers holds, while the strong law fails.
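The following Monte Carlo sketch in Python (using numpy; the distributions, sample sizes and seed are arbitrary choices) illustrates the discussion: for Exp(1) variables the sample mean settles near µ = 1, while for standard Cauchy variables, whose characteristic function exp(−|u|) is not differentiable at zero, the sample mean does not stabilize (indeed, the sample mean of i.i.d. standard Cauchy variables is again standard Cauchy).

```python
# Illustrative sketch: sample means of Exp(1) variables (phi differentiable at zero,
# mu = 1) versus sample means of standard Cauchy variables (phi(u) = exp(-|u|),
# not differentiable at zero).
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000, 100000]:
    exp_mean = rng.exponential(1.0, size=n).mean()
    cauchy_mean = rng.standard_cauchy(size=n).mean()
    print(f"n = {n:6d}:  Exp(1) sample mean = {exp_mean:7.4f},  "
          f"Cauchy sample mean = {cauchy_mean:10.4f}")
```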
3.3. Probabilities of large deviations
The weak law of large numbers does not provide information on the probabilities
of large deviations of X̄n from µ. Derivation of results in this setting is the task of an
important and deep branch of probability theory, the large deviations theory. The
latter is beyond the scope of the present course. We only remark that treatment
of the case when a sequence of i.i.d. random variables {Xn} satisfies Cramér’s condition,
(3.4)  ∃ λ > 0 such that ϕ(λ) = E[e^{λX1}] < ∞,
is relatively elementary and refer the reader to pp. 400–403 in Shiryaev (1996) for details. Under (3.4), E[X1] = µ < ∞. The function ϕ is called the moment-generating function of X1, or the Laplace transform (of the law) of X1, as it is often called in the nonprobabilistic literature. It is obtained by replacing the argument of the characteristic function of X1 with −ıλ. In light of this the moment-generating function possesses many properties similar to those of a characteristic function, but unlike the latter it does not always exist. Define the function ψ by ψ(λ) = log ϕ(λ) (this is the cumulant-generating function of X1). The inequality one gets is
(3.5)  P(|X̄n − µ| ≥ ε) ≤ 2 exp(−n · min(H(µ − ε), H(µ + ε))),
where the function
H(a) = sup_{λ∈R} [aλ − ψ(λ)]
is called the Cramér transform of X1 (in terminology of convex analysis this is the
Legendre transform of the cumulant-generating function ψ). The Cramér transform
can be computed explicitly for a number of distributions, which yields explicit
bounds on large deviations probabilities.
Example 61. Let {Xn } be a sequence of i.i.d. Bernoulli random variables with
probability of success 0 < p < 1. Straightforward computations give that
H(a) = a log(a/p) + (1 − a) log((1 − a)/(1 − p))   if a ∈ [0, 1],
H(a) = ∞   otherwise.
Insert this expression in the right-hand side of (3.5) to obtain a bound on the
probabilities of large deviations.
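As an illustration, here is a minimal numerical sketch in Python (using numpy; the values p = 0.3, ε = 0.05, n = 500 and the seed are arbitrary choices): the bound (3.5) with H(a) as in Example 61 is compared with a Monte Carlo estimate of P(|X̄n − p| ≥ ε).

```python
# Illustrative sketch: the large-deviation bound (3.5) for Bernoulli(p) variables
# versus a Monte Carlo estimate of P(|bar X_n - p| >= eps).
import math
import numpy as np

def H(a, p):
    # Cramer transform of a Bernoulli(p) variable, for a strictly between 0 and 1
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

p, eps, n = 0.3, 0.05, 500
bound = 2 * math.exp(-n * min(H(p - eps, p), H(p + eps, p)))

rng = np.random.default_rng(1)
xbar = rng.binomial(n, p, size=200_000) / n          # 200000 realisations of bar X_n
mc = np.mean(np.abs(xbar - p) >= eps)

print(f"bound (3.5):          {bound:.4f}")
print(f"Monte Carlo estimate: {mc:.4f}")
```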
A much cruder bound on the probabilities of large deviations is obtained by applying Chebyshev’s inequality. If V[X1] = σ², then
P(|X̄n − E[X1]| ≥ ε) ≤ V[X̄n]/ε² = σ²/(nε²).
In particular, when {Xn } is an i.i.d. sequence of Bernoulli random variables with
probability of success p,
(3.6)  P(|X̄n − E[X1]| ≥ ε) ≤ p(1 − p)/(nε²) ≤ 1/(4nε²).
If we denote
pn(k) = C(n, k) p^k (1 − p)^{n−k},
where C(n, k) is the binomial coefficient,
the inequality (3.6) can be rewritten as
∑_{k : |k/n − p| ≥ ε} pn(k) ≤ 1/(4nε²).
We will use this fact to give a probabilistic proof of the Weierstraß theorem, which
asserts that for any continuous function u : [0, 1] → R there exists a sequence of
polynomials un , such that
(3.7)  lim_{n→∞} sup_{p∈[0,1]} |un(p) − u(p)| = 0,
see Theorem 7.26 in Rudin (1976). Take
un(p) = ∑_{k=0}^{n} u(k/n) C(n, k) p^k (1 − p)^{n−k}.
These are called Bernstein polynomials. We have E[u(X̄n)] = un(p).
Since the function u, being continuous on [0, 1], is uniformly continuous on that
interval, for every ε > 0 one can find δ > 0, such that |u(x) − u(y)| ≤ ε, whenever
|x − y| ≤ δ. Also note that u is bounded on [0, 1]. We then get
|un(p) − u(p)| = |∑_{k=0}^{n} (u(k/n) − u(p)) C(n, k) p^k (1 − p)^{n−k}|
  ≤ ∑_{k : |k/n−p| ≤ δ} |u(k/n) − u(p)| pn(k) + ∑_{k : |k/n−p| ≥ δ} |u(k/n) − u(p)| pn(k)
  ≤ ε + ||u||∞/(nδ²).
The bound on the right-hand side is independent of p. Let n → ∞ to obtain that lim sup_{n→∞} sup_{p∈[0,1]} |un(p) − u(p)| ≤ ε. Since ε is arbitrary, (3.7) follows.
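A small numerical sketch in Python (using numpy; the test function u(x) = |x − 1/2| and the grid are arbitrary choices) illustrates the uniform convergence of the Bernstein polynomials:

```python
# Illustrative sketch: uniform error of the Bernstein polynomial u_n for the
# arbitrarily chosen continuous function u(x) = |x - 1/2|.
import math
import numpy as np

u = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 201)                    # points p over which the sup is taken

for n in [10, 50, 250, 1000]:
    k = np.arange(n + 1)
    weights = np.array([math.comb(n, j) for j in k], dtype=float)   # binomial coefficients C(n, k)
    uk = u(k / n)
    err = max(abs(np.sum(uk * weights * p**k * (1 - p)**(n - k)) - u(p)) for p in grid)
    print(f"n = {n:5d}:  sup over the grid of |u_n(p) - u(p)| ~ {err:.4f}")
```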
3.4. Central limit theorem
Let {Xi} be a sequence of random variables. In general the distribution of the sum Sn = ∑_{i=1}^n Xi might have a complicated form and hence be difficult to compute. The central limit theorem provides a simple approximation to it that is very useful in practice.
Although the result holds in much greater generality, we will prove the central limit theorem only for a sequence of i.i.d. random variables with finite second
moments. The proof will yet again demonstrate the power of the method of characteristic functions.
Theorem 62 (Central limit theorem). Let {Xn} be a sequence of i.i.d. random variables with mean E[Xi] = µ and variance 0 < Var[Xi] = σ² < ∞. Let Sn = ∑_{i=1}^n Xi. Then
(Sn − nµ)/(σ√n) ⇝ N(0, 1).
Proof. Without loss of generality, we may suppose that E[Xi ] = 0 and Var[Xi ] =
1 (otherwise replace Xi with (Xi −µ)/σ, and note that this has mean 0 and variance
1). Let φ be the characteristic function of Xi . Since by assumption E[Xi2 ] = 1, the
characteristic function φ is twice differentiable and (cf. p. 290 in Hardy (1967) and
Proposition 37 (vi))
φ(u) = φ(0) + φ′(0)u + φ″(0)u²/2 + o(u²) = 1 − u²/2 + o(u²).
By independence of Xi ’s and Corollary 57 we then get for every fixed t ∈ R that
E[e^{itSn/√n}] = φ(t/√n)^n = (1 − (1/2)(t/√n)² + o(t²/n))^n = (1 − t²/(2n) + o(1/n))^n → e^{−t²/2}.
The limit being the characteristic function of a standard normal random variable
Z, the proof is completed upon invoking Lévy’s continuity theorem.
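The following Monte Carlo sketch in Python (using numpy; the choice of Exp(1) summands, the sample sizes and the seed are arbitrary) illustrates the theorem; it uses that the sum of n i.i.d. Exp(1) variables has a Gamma(n, 1) distribution, so the sums can be generated directly.

```python
# Illustrative sketch: the distribution of (S_n - n*mu)/(sigma*sqrt(n)) for Exp(1)
# summands (mu = sigma = 1), compared with the standard normal distribution.
import math
import numpy as np

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

rng = np.random.default_rng(2)
reps = 200_000
for n in [5, 30, 200]:
    s = rng.gamma(n, 1.0, size=reps)                 # reps independent copies of S_n
    z = (s - n) / math.sqrt(n)                       # standardized sums
    for x in (-1.0, 0.0, 1.0):
        print(f"n = {n:3d}, x = {x:+.1f}:  P(Z_n <= x) ~ {np.mean(z <= x):.4f}, "
              f"Phi(x) = {std_normal_cdf(x):.4f}")
```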
Example 63. Suppose we have an i.i.d. sample X1 , . . . , Xn from the Bernoulli
distribution with probability of success p, but we do not know p. The parameter p can be estimated by the sample mean p̂n = n⁻¹ ∑_{i=1}^n Xi. By the strong law of large numbers p̂n →a.s. p, and by the central limit theorem
√n (p̂n − p)/√(p(1 − p)) ⇝ N(0, 1).
Thus for large n the estimator p̂n has approximately the normal distribution with
mean p and variance p(1 − p)/n, which gives an idea of the precision with which p is recovered as n → ∞. The asymptotic variance p(1 − p)/n of the estimator can be estimated by p̂n(1 − p̂n)/n, and by Slutsky’s lemma
√n (p̂n − p)/√(p̂n(1 − p̂n)) ⇝ N(0, 1),
so that, roughly speaking, we do not need to know the value of p in order to
determine the precision with which it is recovered by p̂n : by a somewhat circular
argument the latter can again be estimated by using p̂n.
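A numerical sketch of this in Python (using numpy; p = 0.3, n = 200, the seed and the number of repetitions are arbitrary choices): the approximate 95% confidence interval p̂n ± 1.96 √(p̂n(1 − p̂n)/n), where 1.96 is the 97.5% quantile of the standard normal distribution, covers the true p in roughly 95% of repeated samples.

```python
# Illustrative sketch: coverage of the approximate 95% confidence interval
# p_hat +/- 1.96*sqrt(p_hat*(1 - p_hat)/n) suggested by Example 63.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 200, 100_000
p_hat = rng.binomial(n, p, size=reps) / n            # reps independent copies of p_hat
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)       # half-width of the interval
coverage = np.mean((p_hat - half <= p) & (p <= p_hat + half))
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```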
3.5. Delta method
Let {Xn } be a sequence of i.i.d. random variables with mean E[Xi ] = µ and
variance 0 < Var[Xi ] = σ 2 < ∞. By the central limit theorem,
(Sn − nµ)/(σ√n) ⇝ N(0, 1).
Can we say something about the weak convergence of a sequence {g(X n )}, where
g : R → R is some fixed function? Such questions often arise in statistics. When
g is differentiable, the answer is given by the following result, known as the delta
method.
Theorem 64 (Delta method). Assume that the conditions of the central limit theorem (Theorem 62) hold. Let g be differentiable at µ and g′(µ) ≠ 0. Then
√n (g(X̄n) − g(µ))/(σ g′(µ)) ⇝ N(0, 1).
Proof. The proof is an instance of an elegant application of the almost sure representation theorem. On some probability space (Ω̃, F̃, P̃) there exist random variables
Z̃n =d (Sn − nµ)/(σ√n),   Z̃ ∼ N(0, 1),
such that Z̃n →a.s. Z̃ (under P̃). By the foregoing, the definition of a derivative, the facts that σZ̃n/√n →a.s. 0 and P̃(Z̃ ≠ 0) = 1, and the continuous mapping theorem we have
√n (g(X̄n) − g(µ))/(σg′(µ)) =d √n (g(µ + σZ̃n/√n) − g(µ))/(σg′(µ)) · 1[Z̃n ≠ 0]
  = [(g(µ + σZ̃n/√n) − g(µ))/(σZ̃n/√n)] · [σZ̃n/(σg′(µ))] · 1[Z̃n ≠ 0]
  →a.s. g′(µ) · σZ̃/(σg′(µ)) · 1[Z̃ ≠ 0].
The last term is equal to Z̃ (P̃-almost surely), whence it follows that
√n (g(X̄n) − g(µ))/(σg′(µ)) ⇝ N(0, 1)
on the original probability space.
Example 65. This is a continuation of Example 63. Suppose we want to
estimate the odds r = p/(1 − p). For example, if the data X1 , . . . , Xn are the
outcomes of a medical treatment with p = 3/4, then a patient has odds 3 : 1 of
getting better. A natural estimator of r is r̂n = p̂n /(1 − p̂n ), but how good is this
estimator? Assume 0 < p < 1. Firstly, by the strong law of large numbers and
the continuous mapping theorem, r̂n →a.s. r. Secondly, by the delta method (take g(p) = p/(1 − p) in Theorem 64)
√(n(1 − p)³/p) (r̂n − r) ⇝ N(0, 1),
so that for large n the estimator r̂n is approximately normally distributed with
mean r and variance p/[n(1 − p)3 ]. The latter can be estimated by p̂n /[n(1 − p̂n )3 ]
and an application of Slutsky’s lemma yields
√(n(1 − p̂n)³/p̂n) (r̂n − r) ⇝ N(0, 1).
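A Monte Carlo sketch of this in Python (using numpy; p = 0.75, so r = 3, n = 400 and the seed are arbitrary choices): after the delta-method normalization the odds estimator should be approximately standard normal.

```python
# Illustrative sketch: the delta-method normalization of the odds estimator
# r_hat = p_hat/(1 - p_hat) from Example 65.
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.75, 400, 200_000
r = p / (1 - p)
p_hat = rng.binomial(n, p, size=reps) / n            # P(p_hat = 1) = 0.75^400 is negligible
r_hat = p_hat / (1 - p_hat)
z = np.sqrt(n * (1 - p) ** 3 / p) * (r_hat - r)      # should be approximately N(0, 1)
print(f"sample mean of z: {z.mean():+.3f}  (target 0)")
print(f"sample std  of z: {z.std():.3f}   (target 1)")
```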
3.6. Berry-Esseen theorem
Convergence of some quantity to a limit inevitably leads to the question of the
rate of convergence. In the setting of the central limit theorem proved above, the question is this: let Fn be the distribution function of (Sn − nµ)/(σ√n). Theorem 62 implies that for all x ∈ R, Fn(x) → Φ(x). Can we say something about the rate at which the difference |Fn(x) − Φ(x)| converges to zero? A good estimate
on this quantity might be necessary in applications, in particular for numerical
computations. One result in this direction is the Berry-Esseen theorem.
Theorem 66 (Berry-Esseen theorem). Let {Xn } be a sequence of i.i.d. random
variables with mean zero, variance σ 2 and the third absolute moment γ = E[|Xi |3 ] <
∞. Then there exists a universal constant A0 , such that
sup_{x∈R} |Fn(x) − Φ(x)| ≤ (A0 γ)/(σ³√n).
We will not prove this theorem. For the proof see e.g. Chung (2001), Section
7.4 (that particular proof is based on the method of characteristic functions). The
exact value of the constant A0 is not known, but there exist good estimates on it
(the latest (?) one is A0 ≤ 0.5129; this is quite sharp, because there also holds a lower bound proved by Esseen: A0 ≥ (√10 + 3)/(6√(2π)) ≈ 0.40973). The estimate
in Theorem 66 is ‘generic’. For specific distributions, tighter bounds might hold.
For instance, let X1 , . . . , Xn be jointly normal and i.i.d. with Xi ∼ N (0, 1). Then
Fn = Φ and supx∈R |Fn (x) − Φ(x)| is in fact zero.
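A numerical sketch in Python (standard library only; the value p = 0.2 and the use of the quoted constant A0 = 0.5129 are choices made for illustration): for centered Bernoulli(p) summands the Kolmogorov distance sup_x |Fn(x) − Φ(x)| can be computed exactly from the binomial distribution and compared with the Berry-Esseen bound.

```python
# Illustrative sketch: exact Kolmogorov distance between the law of
# (S_n - np)/sqrt(n p (1-p)), S_n ~ Bin(n, p), and the standard normal law,
# compared with the Berry-Esseen bound A0*gamma/(sigma^3*sqrt(n)) with A0 = 0.5129.
import math

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p = 0.2
sigma = math.sqrt(p * (1 - p))
gamma = p * (1 - p) ** 3 + (1 - p) * p ** 3          # E|X_i|^3 for X_i = B_i - p, B_i ~ Bernoulli(p)

for n in [10, 100, 1000]:
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    dist, cdf = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))     # location of the k-th jump of F_n
        phi = std_normal_cdf(x)
        dist = max(dist, abs(cdf - phi))             # just to the left of the jump
        cdf += pmf[k]
        dist = max(dist, abs(cdf - phi))             # at the jump
    bound = 0.5129 * gamma / (sigma ** 3 * math.sqrt(n))
    print(f"n = {n:5d}:  sup_x |F_n(x) - Phi(x)| = {dist:.4f},  bound = {bound:.4f}")
```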
Exercises
1 Let {Xn } be a sequence of random variables with E[|Xn |] < ∞ and V[Xn ] < ∞.
Assume that the covariances Cov[Xi , Xj ] → 0 as |i − j| → ∞. Prove the following
version of the law of large numbers (due to Bernstein):
P(|(1/n) ∑_{i=1}^{n} Xi − (1/n) ∑_{i=1}^{n} E[Xi]| > ε) → 0
as n → ∞. Hint: a sequence of random variables {ξn } converges to zero in
probability, when both the mean E[ξn ] and the variance V[ξn ] converge to zero
as n → ∞ (show this).
2 Let {Xn} be a sequence of i.i.d. random variables. Prove that Sn = n^{−1/2} ∑_{i=1}^n Xi converges in probability as n → ∞ if and only if P(X1 = 0) = 1.
3 Let {Xn} be a sequence of i.i.d. random variables with E[X1²] < ∞. Prove that
max(|X1|, . . . , |Xn|)/√n ⇝ 0
as n → ∞. Hint: for any ε > 0,
P(max(|X1|, . . . , |Xn|)/√n ≤ ε) = P(X1² ≤ nε²)^n
and nε² P(X1² > nε²) → 0 as n → ∞.
4 Let {Xn } be a sequence of i.i.d. random variables with mean zero and variance
one, and let {dn} be a sequence of nonnegative numbers, such that dn = o(Dn) for Dn² = ∑_{i=1}^n di². Prove that the sequence {dn Xn} satisfies the central limit theorem:
(1/Dn) ∑_{i=1}^n di Xi ⇝ Z
for Z ∼ N(0, 1).
5 Let {Xn } be a sequence of i.i.d. random variables, such that P(X1 = ±1) = 1/2.
Set Si = ∑_{k=1}^i Xk. Show that
(1/√n) max_{1≤i≤n} Si ⇝ |Z|,
where Z ∼ N (0, 1).
6 Let {Xn} be a sequence of i.i.d. random variables with E[X1] = 0 and E[X1²] = 1. Show that
√n X̄n/σn ⇝ N(0, 1),
where
X̄n = (1/n) ∑_{i=1}^n Xi,   σn² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)².
Incidentally, the result of this exercise also shows that if Yn possesses a t-distribution with n degrees of freedom, then Yn ⇝ N(0, 1). Explain why.
7 Let Xn have a Bin(n, λ/n) distribution (for n > λ). Show that Xn ⇝ X, where X has a Poisson(λ) distribution. This result is known as the Poisson theorem.
8 Let X, X1 , X2 , . . . be a sequence of random variables and Y a N (0, 1)-distributed
random variable independent of that sequence. Let φn be the characteristic
function of Xn and φ that of X. Let pn be the density of Xn + σY and p the
density of X + σY .
(i) If φn → φ pointwise, then pn → p pointwise.
(ii) Let f : R → R be bounded by B. Show that |Ef(Xn + σY) − Ef(X + σY)| ≤ 2B ∫ (p(z) − pn(z))⁺ dz.
(iii) Show that |Ef (Xn + σY ) − Ef (X + σY )| → 0 (with f bounded) if φn → φ
pointwise.
(iv) Prove without referring to Corollary 55 that Xn ⇝ X iff φn → φ pointwise (hint: one implication is straightforward, for the other the result of Exercise
(hint: one implication is straightforward, for the other the result of Exercise
1.1 is useful).
Bibliography
J. Bak and D. J. Newman. Complex analysis. Third edition. Undergraduate Texts
in Mathematics. Springer, New York, 2010.
P. Billingsley. Weak Convergence of Measures: Applications in Probability. Conference Board of the Mathematical Sciences Regional Conference Series in Applied
Mathematics, No. 5. Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1971.
K. L. Chung. A Course in Probability Theory. Third edition. Academic Press, Inc.,
San Diego, CA, 2001.
G. H. Hardy. A Course of Pure Mathematics. Tenth edition. Cambridge University
Press, Cambridge, 1967.
K. R. Parthasarathy. Probability measures on metric spaces. Reprint of the 1967
original. AMS Chelsea Publishing, Providence, RI, 2005.
Yu. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl., 1(2), 157–214, 1956.
S. I. Resnick. A Probability Path. Birkhäuser Boston, Inc., Boston, MA, 1999.
P. Révész. The Laws of Large Numbers. Academic Press, New York, 1968.
W. Rudin. Principles of Mathematical Analysis. Third edition. International Series
in Pure and Applied Mathematics. McGraw-Hill Book Co., New York-Auckland-Düsseldorf, 1976.
A. N. Shiryaev. Probability. Translated from the first (1980) Russian edition by R.
P. Boas. Second edition. Graduate Texts in Mathematics, 95. Springer-Verlag,
New York, 1996.
A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and
Probabilistic Mathematics, 3. Cambridge University Press, Cambridge, 1998.