Notes on weak convergence and related topics
Shota Gugushvili
Mathematical Institute, Faculty of Science, Leiden University,
P.O. Box 9512, 2300 RA Leiden, The Netherlands
E-mail address: [email protected]
2010 Mathematics Subject Classification. 60-01
Key words and phrases. Central limit theorem, sequential compactness, tightness,
weak convergence, weak law of large numbers
Abstract. These notes deal with weak convergence of probability measures
on the real line. They are largely based on the lecture notes written by Peter
Spreij to accompany the Measure Theoretic Probability course.
Contents

Preface

Chapter 1. Weak convergence
1.1. Generalities
1.2. Criteria for weak convergence
1.3. Convergence of distribution functions
1.4. Sequential compactness
1.5. Continuous mapping theorem
1.6. Almost sure representation theorem
1.7. Relation to other modes of convergence
1.8. Slutsky's lemma
Exercises

Chapter 2. Characteristic functions
2.1. Definition and first properties
2.2. Inversion formula and uniqueness
2.3. Necessary conditions
2.4. Multidimensional case
Exercises

Chapter 3. Limit theorems
3.1. Characteristic functions and weak convergence
3.2. Weak law of large numbers
3.3. Probabilities of large deviations
3.4. Central limit theorem
3.5. Delta method
3.6. Berry-Esseen theorem
Exercises

Bibliography
Preface
These notes deal with weak convergence of probability measures on the real line
and related topics. They are to a large extent based on the lecture notes written by
Peter Spreij to accompany the Measure Theoretic Probability course. Other sources
we used are listed in the bibliography.
Shota Gugushvili
CHAPTER 1
Weak convergence
1.1. Generalities
We start with the definition of weak convergence of probability measures on
(R, B), and that of a sequence of random variables.
Definition 1. Let µ, µ1, µ2, . . . be probability measures on (R, B). It is said that µn converges weakly to µ, and we then write µn →w µ, if µn(f) → µ(f) for all f ∈ Cb(R). If X, X1, X2, . . . are random variables (possibly defined on different probability spaces) with distributions µ, µ1, µ2, . . . , then we say that Xn converges weakly to X, and write Xn ⇝ X, if it holds that µn →w µ.
Another accepted notation for weak convergence of a sequence of random variables is Xn →d X, and one says that Xn converges to X in distribution.
Consider the following example, which illustrates in a special case that Definition 1 is a reasonable one.
Example 2. Let {xn} be a convergent sequence of real numbers with limn→∞ xn = 0. Then for every f ∈ Cb(R) one has f(xn) → f(0). Let µn be the Dirac measure concentrated at xn and µ the Dirac measure concentrated at the origin. Since µn(f) = f(xn), we see that µn →w µ.
As a further result, here is a statement that gives an appealing sufficient condition for weak convergence of a sequence of random variables, when the random
variables involved admit densities.
Theorem 3. Consider distributions µ, µ1, µ2, . . . having densities f, f1, f2, . . . w.r.t. Lebesgue measure λ on (R, B). Suppose that fn → f λ-a.e. Then µn →w µ.

Proof. We apply Scheffé's lemma to conclude that fn → f in L1(R, B, λ). Let g ∈ Cb(R). Since g is bounded, we also have fn g → f g in L1(R, B, λ), and hence µn(g) = ∫_R g fn dλ → ∫_R g f dλ = µ(g), i.e. µn →w µ.
One could naively think of another definition of convergence of probability
measures, for instance by requiring that µn (B) → µ(B) for every B ∈ B, which
is the same as to require that the class of test functions f consists of indicators of
Borel sets, or even by requiring that the integrals µn (f ) converge to µ(f ) for every
bounded measurable function. It turns out that each of these requirements is too
strong to get a useful convergence concept. One drawback of such a definition is
demonstrated by the following example with Dirac measures.
Example 4. Assume the same setup as in Example 2 and take for concreteness
xn = 1/n. Let B = (−∞, x] for some x > 0. Then for all xn < x, we have
µn (B) = 1B (xn ) = 1 and µ(B) = 1B (0) = 1, so that µn (B) → µ(B). For x < 0 we
get that µn (B) = µ(B) = 0, and thus µn (B) → µ(B). But for B = (−∞, 0] we have
µn (B) = 0 for all n, whereas µ(B) = 1. Hence convergence of µn (B) → µ(B) does
not hold for this last choice of B, although it is quite natural in this case to say
that µn →w µ. For future reference note the following: if Fn is the distribution
function of µn and F that of µ, then we have seen that Fn (x) → F (x), for all x ∈ R,
except for x = 0.
1.2. Criteria for weak convergence
In this section we give several criteria for weak convergence. These are primarily
useful in the proofs.
Theorem 5. The following are equivalent:
(i) µn →w µ;
(ii) every subsequence {µnj} of {µn} has a further subsequence {µnjk}, such that µnjk →w µ as k → ∞.
Proof. That (i) implies (ii) is obvious. We prove the reverse implication. Assume the convergence µn →w µ fails. This means there exists a bounded continuous function f, a subsequence {nj} of {n} and a constant ε > 0, such that
|µnj (f ) − µ(f )| ≥ ε
for all j. But then
|µnjk (f ) − µ(f )| ≥ ε
for any subsequence {njk } of {nj } as well. Hence {µnj } has no further subsequence
converging weakly to µ, a contradiction.
Recall that the boundary ∂E of a set E ∈ B is ∂E = Ē \ E°, where Ē is the closure and E° is the interior of E. The distance from a point x to a set E is
d(x, E) = inf{|x − y| : y ∈ E}.
The δ-neighbourhood of E (here δ > 0) is the set E^δ = {x : d(x, E) < δ}.
The following result is known as the portmanteau lemma.
Theorem 6 (Portmanteau lemma). Let µ, µ1, µ2, . . . be probability measures on (R, B). The following statements are equivalent.
(i) µn →w µ.
(ii) lim sup_{n→∞} µn(F) ≤ µ(F) for all closed sets F.
(iii) lim inf_{n→∞} µn(G) ≥ µ(G) for all open sets G.
(iv) lim_{n→∞} µn(E) = µ(E) for all sets E with µ(∂E) = 0 (all µ-continuity sets).
Proof. We start with (i)⇒(ii). Given ε > 0, choose m ∈ N, such that for δ = 1/m > 0, µ(F^δ) < µ(F) + ε. This is possible, because F is closed and hence F^δ ↓ F as m → ∞. Let

ϕ(x) = 1 for x ≤ 0,   ϕ(x) = 1 − x for 0 < x < 1,   ϕ(x) = 0 for x ≥ 1,

and define

f(x) = ϕ(d(x, F)/δ).
Note that f is continuous, nonnegative, bounded by 1 on R, equals 1 on F and vanishes outside F^δ. Therefore,

µn(F) = ∫_F f dµn ≤ ∫_R f dµn,

and

∫_R f dµ = ∫_{F^δ} f dµ ≤ µ(F^δ).

We also have

lim_{n→∞} ∫_R f dµn = ∫_R f dµ.

Combining the above,

lim sup_{n→∞} µn(F) ≤ lim_{n→∞} ∫_R f dµn = ∫_R f dµ ≤ µ(F^δ) < µ(F) + ε.
Since ε is arbitrary, the implication follows.
(ii)⇔(iii) follows by a simple complementation argument.
(ii) and (iii) together imply (iv): writing Ē for the closure and E° for the interior of E, we have

µ(Ē) ≥ lim sup_{n→∞} µn(Ē) ≥ lim sup_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E°) ≥ µ(E°).

Because µ(∂E) = 0 implies that the extreme terms are equal, the inequalities are in fact equalities, and so lim_{n→∞} µn(E) = µ(E).
(iv)⇒(i). Let ε > 0, g ∈ Cb(R) and choose two constants C1, C2, such that C1 < g < C2. Let D = {x ∈ R : µ({g = x}) > 0}. So D is the set of atoms of the law of g under µ and hence it is at most countable (if not, µ would have infinite total mass).
Let C1 = x0 < . . . < xm = C2 be a finite set of points not in D, such that
max{xk − xk−1 : k = 1, . . . , m} < ε. Let Ik = (xk−1 , xk ]. The continuity of g
implies that if y is a boundary point of a set
{x : xk−1 < g(x) ≤ xk },
then g(y) is either xk−1 or xk . Hence the sets in the above display are µ-continuity
sets. We have
(1.1)   Σ_{k=1}^m x_{k−1} µn(x : x_{k−1} < g(x) ≤ x_k) ≤ ∫_R g dµn ≤ Σ_{k=1}^m x_k µn(x : x_{k−1} < g(x) ≤ x_k),

and likewise,

(1.2)   Σ_{k=1}^m x_{k−1} µ(x : x_{k−1} < g(x) ≤ x_k) ≤ ∫_R g dµ ≤ Σ_{k=1}^m x_k µ(x : x_{k−1} < g(x) ≤ x_k).

Now note that the extreme terms in (1.1) converge to the respective extreme terms in (1.2). The latter differ by at most ε. Hence both the limit superior and limit inferior of ∫_R g dµn are within distance ε of ∫_R g dµ. Since ε is arbitrary, the result follows. This finishes the proof of the theorem.
Part (iv) of the portmanteau lemma is quite illustrative for understanding the definition of weak convergence and the way in which it differs from the requirement
µn (B) → µ(B) for every set B in the case of another would-be definition of weak
convergence (cf. Section 1.1).
1.3. Convergence of distribution functions
In this section we give an appealing characterisation of weak convergence (convergence in distribution) in terms of distribution functions, which makes the definition of weak convergence look less abstract.
Definition 7. We shall say that a sequence of distribution functions {Fn} on R converges weakly to a limit distribution function F, and shall write Fn ⇝ F, if Fn(x) → F(x) for all x ∈ CF, where CF is the set of all those points at which F is continuous.
Theorem 8. Let µ, µ1, µ2, . . . be probability measures on the real line and denote by F, F1, F2, . . . the corresponding distribution functions. The following statements are equivalent:
(i) µn →w µ;
(ii) Fn ⇝ F.
Proof. Assume (i). If x is a continuity point of F, the set (−∞, x], the boundary of which is {x}, is a µ-continuity set. Hence
Fn (x) = µn ((−∞, x]) → µ((−∞, x]) = F (x)
by the portmanteau lemma and thus (ii) holds.
Conversely, let (ii) hold. Fix an arbitrary 0 < ε < 1 and pick two continuity points a and b of F in such a way that F(a) < ε and F(b) > 1 − ε. Next, given f ∈ Cb(R), choose continuity points xi of F, such that a = x0 < x1 < . . . < xk = b and |f(x) − f(xi)| < ε for xi−1 ≤ x ≤ xi (this is possible by the uniform continuity of f on [a, b]). Define
S = Σ_{i=1}^k f(xi)[F(xi) − F(xi−1)],   Sn = Σ_{i=1}^k f(xi)[Fn(xi) − Fn(xi−1)].

By assumption, Sn → S as n → ∞. Let M = sup_{x∈R} |f(x)|. We have

|∫_R f dµ − S| < (2M + 1)ε.

Likewise,

|∫_R f dµn − Sn| ≤ ε + M Fn(a) + M(1 − Fn(b)) → ε + M F(a) + M(1 − F(b)) < (2M + 1)ε.

As a result,

lim sup_{n→∞} |∫_R f dµn − ∫_R f dµ| ≤ lim sup_{n→∞} |∫_R f dµn − Sn| + |∫_R f dµ − S| + lim_{n→∞} |Sn − S| ≤ 2(2M + 1)ε.
Since ε is arbitrary, the limit superior on the left-hand side of the first inequality
is in fact zero and the result follows.
As shown in the next result, when the limit distribution function F is continuous
everywhere, i.e. CF = R, the convergence Fn (t) → F (t) is in fact uniform in t ∈ R.
Theorem 9. Suppose Fn ⇝ F and F is continuous. Then

lim_{n→∞} sup_{t∈R} |Fn(t) − F(t)| = 0.
Proof. Let k ∈ N be fixed. By continuity of F and the intermediate value
theorem, there exist points −∞ = x0 < x1 < . . . < xk = ∞, such that F (xi ) = i/k.
Therefore, for xi−1 ≤ x ≤ xi ,
Fn (x) − F (x) ≤ Fn (xi ) − F (xi−1 ) = Fn (xi ) − F (xi ) + 1/k,
Fn (x) − F (x) ≥ Fn (xi−1 ) − F (xi ) = Fn (xi−1 ) − F (xi−1 ) − 1/k.
Thus

|Fn(x) − F(x)| ≤ sup_{0≤i≤k} |Fn(xi) − F(xi)| + 1/k,   x ∈ R.
For any ε > 0, choose k so large that 1/k ≤ ε/2. Next note that with this k, by the convergence Fn(x) → F(x) at all x ∈ R, the supremum sup_{0≤i≤k} |Fn(xi) − F(xi)| can be made arbitrarily small, in particular smaller than ε/2, by taking n large enough. Conclude that sup_{x∈R} |Fn(x) − F(x)| ≤ ε for all n large enough. Since ε
is arbitrary, the result follows.
1.4. Sequential compactness
In the previous sections we studied several alternative characterisations of weak
convergence. In this section we will take a more abstract stance and study a condition guaranteeing that a sequence of probability measures has at least one weakly
convergent subsequence. We first introduce the notion of sequential compactness
of a sequence of probability measures.
Definition 10. A sequence of probability measures {µn } on (R, B) is called
sequentially compact, if every subsequence {µnk } of {µn } has a further weakly convergent subsequence.
A general answer to the question whether a sequence {µn } is sequentially compact or not is given by Prokhorov’s theorem. In its proof we need one auxiliary
result, known as Helly’s theorem.
The Bolzano-Weierstraß theorem states that every bounded sequence of real
numbers has a convergent subsequence. The theorem easily generalises to sequences
in Rd , but fails to hold for uniformly bounded sequences in general metric spaces.
But if extra properties are imposed, there can still be an affirmative answer. Something like that happens in Helly’s theorem. At this point it is convenient to introduce the notion of a defective distribution function. Such a function, F say,
has values in [0, 1], is right-continuous and increasing, but at least one of the two
properties limx→∞ F (x) = 1 and limx→−∞ F (x) = 0 fails to hold. The measure µ
corresponding to F on (R, B) will then be a subprobability measure, µ(R) < 1.
Theorem 11 (Helly’s theorem). Let {Fn } be a sequence of distribution functions. Then there exists a possibly defective distribution function F and a subsequence {Fnk }, such that Fnk (x) → F (x), for all x ∈ CF .
Proof. The main ingredient of the proof is an infinite repetition of the Bolzano-Weierstraß theorem combined with the Cantor diagonalisation. First we restrict
ourselves to working on Q instead of R, and exploit the countability of Q. Write
Q = {q1 , q2 , . . .} and consider restrictions of Fn to Q. Then the sequence {Fn (q1 )}
is bounded and along some subsequence {n1k} it has a limit, ℓ(q1) say. Look then at the sequence {Fn1k(q2)}. Again, along some subsequence of {n1k}, call it {n2k}, we have a limit, ℓ(q2) say. Note that along the thinned subsequence, we still have limk→∞ Fn2k(q1) = ℓ(q1). Continue like this to construct a nested sequence of subsequences {njk} for which we have that limk→∞ Fnjk(qi) = ℓ(qi) holds for every i ≤ j. Define a diagonal sequence {nk} by nk = nkk. For an arbitrary i, along this sequence one has limk→∞ Fnk(qi) = ℓ(qi). In this way we have constructed a function ℓ : Q → [0, 1], and by the monotonicity of Fn(t) in t this function is increasing.
In the next step we extend this function to a function F on R that is right-continuous, and still increasing. We put

F(x) = inf{ℓ(q) : q ∈ Q, q > x}.

Obviously, F is an increasing function. It is also right-continuous: let x ∈ R and ε > 0. There is q ∈ Q with q > x such that ℓ(q) < F(x) + ε. Pick y ∈ (x, q). Then F(y) ≤ ℓ(q) and we have F(y) − F(x) < ε, which shows that F is right-continuous.
However, limx→∞ F (x) = 1 or limx→−∞ F (x) = 0 do not necessarily hold true.
Thus F is a possibly defective distribution function.
We now show that Fnk(x) → F(x) if x ∈ CF. Fix x ∈ CF and let ε > 0. Pick q as above. By left-continuity of F at x, there is y < x such that F(x) < F(y) + ε. Take now r ∈ (y, x) ∩ Q. Then F(y) ≤ ℓ(r), and hence F(x) < ℓ(r) + ε. So we have the inequalities

ℓ(q) − ε < F(x) < ℓ(r) + ε.

Then

lim sup_{k→∞} Fnk(x) ≤ lim_{k→∞} Fnk(q) = ℓ(q) < F(x) + ε,
lim inf_{k→∞} Fnk(x) ≥ lim inf_{k→∞} Fnk(r) = ℓ(r) > F(x) − ε.

Since ε is arbitrary, the result follows.
Here is an example, for which the limit in Theorem 11 is not a true distribution
function.
Example 12. Let µn be the Dirac measure concentrated on {n}. Then its
distribution function is given by Fn (x) = 1[n,∞) (x) and hence limn→∞ Fn (x) = 0.
Hence the limit function F in Theorem 11 has to be the zero function, which
is clearly defective. One colloquially says that in the limit the probability mass
escapes to infinity.
Translated in terms of probability laws, Helly’s theorem says that every sequence of probability measures {µn } has a (weakly) convergent subsequence, but
that the limit law in general is a subprobability measure only. We are now interested in finding a condition that would guarantee that the limit is a bona fide
probability measure. A possible path is to require that all probability measures
involved have probability one on a fixed bounded set. That would prevent the
phenomenon described in Example 12. However, this assumption is too stringent, because it rules out many useful distributions. Fortunately, a considerably
weaker assumption suffices. For any probability measure µ on (R, B) it holds that
limM →∞ µ([−M, M ]) = 1. The next condition, tightness, gives a uniform version
of this.
Definition 13. A sequence of probability measures {µn } on (R, B) is called
tight, if limM →∞ inf n µn ([−M, M ]) = 1.
Remark 14. Note that a sequence {µn } is tight iff every tail sequence {µn }n≥N
is tight. In order to show that a sequence is tight it is thus sufficient to show
tightness from a certain suitably chosen index on.
Theorem 15 (Prokhorov’s theorem1). A sequence {µn } of probability measures
on (R, B) is tight if and only if it is sequentially compact.
Proof. Suppose {µn} is sequentially compact, but not tight. Then there exists ε > 0, such that for any M > 0 one has infn µn([−M, M]) < 1 − ε. It follows that for any j ∈ N and Ij = (−j, j), one can find an index nj, such that µnj(Ijᶜ) > ε. Extract
from the sequence {µnj } a weakly convergent subsequence {µnjk }, and denote its
weak limit by µ. By the portmanteau lemma, for every fixed j ∈ N,

lim sup_{k→∞} µnjk(Ijᶜ) ≤ µ(Ijᶜ).

Letting j → ∞, we see that the right-hand side converges to zero, while the left-hand side stays bounded by ε > 0 from below. This contradiction proves the first
implication.
We now prove the second implication. Let Fn be the distribution function of µn. By Helly's theorem, there exists a subsequence {Fnj} of the sequence of distribution functions {Fn}, such that Fnj ⇝ F as j → ∞, for some, possibly defective, distribution function F. We will show that in fact

(1.3)   lim_{x→∞} F(x) = 1,   lim_{x→−∞} F(x) = 0,

so that F is a proper distribution function. By tightness of {µn}, for any constant 0 < ε < 1 there exists a constant Mε > 0, such that Fn(Mε) > 1 − ε for all n ∈ N. Without loss of generality, Mε can be taken to be a continuity point of F. Then

F(Mε) = lim_{j→∞} Fnj(Mε) ≥ 1 − ε.

Since ε is arbitrary, the above display and monotonicity of F imply the first equality in (1.3). The second one can be proved in a similar manner. This completes the proof.
Theorem 15 has a simple corollary.

Corollary 16. If µn →w µ for some probability measure µ, then the sequence {µn} is tight.
We also remark that tightness of a sequence {µn } in general is not sufficient for
its weak convergence. Here is a simple counterexample: let µn = N (0, 1) for n odd
and µn = N (0, 2) for n even. Then {µn } is tight, but does not converge weakly.
1The name Prokhorov is alternatively spelled as Prohorov, but Prokhorov is the way it
appears in the English translation of the original paper containing (a much more general version
of) the theorem. See Prokhorov (1956).
1.5. Continuous mapping theorem
The continuous mapping theorem is a result asserting that if a sequence of
random variables {Xn } converges in a suitable sense to a random variable X, then
for a continuous function g the transformed sequence {g(Xn )} converges to g(X).
We will prove a slightly more general result, that allows g to be discontinuous on a
negligible set. Such a refinement does not require much additional technical effort,
while occasionally being useful.
Theorem 17 (Continuous mapping theorem). Let g : R → R be continuous at every point of a set C, such that P(X ∈ C) = 1.
(i) If Xn →a.s. X, then g(Xn) →a.s. g(X).
(ii) If Xn ⇝ X, then g(Xn) ⇝ g(X).
(iii) If Xn →P X, then g(Xn) →P g(X).
Proof. Part (i) is trivial.
We prove part (ii). Let F be an arbitrary closed set. We have {g(Xn) ∈ F} = {Xn ∈ g−1(F)}. Trivially, g−1(F) ⊂ cl(g−1(F)), the closure of g−1(F). Take an arbitrary x ∈ cl(g−1(F)). By definition, there exists a sequence {xm}, such that xm → x and g(xm) ∈ F. If x ∈ C, then g(xm) → g(x), and g(x) ∈ F, because F is closed. Otherwise x ∈ Cᶜ. Hence cl(g−1(F)) ⊂ g−1(F) ∪ Cᶜ. Then by the portmanteau lemma and the fact that P(X ∈ C) = 1,

lim sup_{n→∞} P(g(Xn) ∈ F) ≤ lim sup_{n→∞} P(Xn ∈ cl(g−1(F)))
≤ P(X ∈ cl(g−1(F)))
≤ P(X ∈ g−1(F)) + P(X ∈ Cᶜ)
= P(g(X) ∈ F).

By another application of the portmanteau lemma we conclude that g(Xn) ⇝ g(X).
We move to part (iii). Assume that g(Xn) →P g(X) fails. Then there exist ε > 0, δ > 0 and a subsequence {nj} of {n}, such that

(1.4)   P(|g(Xnj) − g(X)| > ε) > δ.

Extract from {nj} a further subsequence {njk}, such that Xnjk →a.s. X. By part (i), g(Xnjk) →a.s. g(X). But this contradicts (1.4). The proof of the theorem is completed.
It would be more appropriate, albeit clumsier, to call Theorem 17 the almost
surely continuous mapping theorem.
Example 18. Here is a simple illustration of Theorem 17. Let Y1, . . . , Yn be an i.i.d. sample from the normal distribution with mean zero and unknown variance σ². By the strong law of large numbers,

σ̂n² = (1/n) Σ_{i=1}^n Yi² →a.s. σ²,

and hence σ̂n² is a reasonable estimator of σ². Since the function g(x) = √x is continuous, σ̂n is then a reasonable estimator of the standard deviation σ: we have σ̂n →a.s. σ.
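To see Example 18 numerically, here is a minimal simulation sketch; the true value σ = 2, the seed and the sample sizes are arbitrary illustrative choices.

    import numpy as np

    # Example 18 numerically: sigma_hat_n^2 = (1/n) * sum(Y_i^2) -> sigma^2 a.s.,
    # and hence sigma_hat_n -> sigma by the continuous mapping theorem.
    rng = np.random.default_rng(0)
    sigma = 2.0

    for n in (10, 100, 10_000, 1_000_000):
        y = rng.normal(loc=0.0, scale=sigma, size=n)
        sigma_hat = np.sqrt(np.mean(y ** 2))   # g(x) = sqrt(x) applied to sigma_hat_n^2
        print(n, sigma_hat)                    # should approach sigma = 2 as n grows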
[Figure 1. Distribution and quantile functions of the discrete uniform distribution on the integers 1, 2: panel (a) shows the distribution function, panel (b) the quantile function.]
1.6. Almost sure representation theorem
Suppose we want to prove some distributional property of a sequence {Xn} of random variables, knowing that Xn ⇝ X. In general this might be difficult, but it is perhaps easier if we knew that Xn →a.s. X. Unfortunately, the latter almost sure convergence might be difficult to establish, or perhaps is even false. However, the situation is not hopeless. The almost sure representation theorem, proved below, tells us that there exists a probability space (Ω̃, F̃, P̃) that supports random variables {X̃n}, X̃, such that for all n ∈ N, X̃n =d Xn, X̃ =d X, and X̃n →a.s. X̃. We then prove the distributional property we are interested in for the sequence X̃n on the space (Ω̃, F̃, P̃). The result automatically carries over to the original sequence {Xn}.
We will need a number of results on quantile functions, which are of independent
interest as well.
A distribution function in general is only non-decreasing, but not necessarily
strictly increasing. Therefore, it typically does not admit the inverse function in
the usual sense. Nevertheless, a kind of inverse, the quantile function, can still be
defined. The quantile function of a distribution function F is a generalised inverse F−1 : (0, 1) → R given by

F−1(p) = inf{x : F(x) ≥ p}.
For an illustration see Figure 1. The quantile function is left-continuous. Its range is
equal to the support of F (or rather to the support of the corresponding probability
measure µ; the support of a probability measure on R is defined as the set of all
those points x, such that any open neighbourhood Ux of x has strictly positive
measure: µ(Ux ) > 0. Intuitively, this is the smallest closed subset of R that receives
measure 1 under µ (although you might be wondering at this point, this latter
explanation is valid even for probability measures on separable metric spaces; see
e.g. Theorem 2.1 on pp. 27–28 in Parthasarathy (2005)). As one example, the
support of the standard normal distribution is the whole R) and therefore, F −1 is
often unbounded. An evident fact that the quantile function is monotone implies
that it might have at most a countable number of discontinuity points only. The
following lemma lists some other properties of F −1 . Of these we will only make
partial use of (i)–(iv).
[Figure 2. Distribution function (red line).]
Lemma 19. For every 0 < p < 1 and x ∈ R,
(i) F −1 (p) ≤ x if and only if p ≤ F (x);
(ii) F ◦ F −1 (p) ≥ p, with equality holding if and only if p is in the range of
F ; the equality can fail if and only if F is discontinuous at F −1 (p);
(iii) F− ◦ F −1 (p) ≤ p, with F− (x) = F (x−);
(iv) F −1 ◦ F (x) ≤ x; the equality fails if and only if x is in the interior or at
the right endpoint of a flat part of F ;
(v) F ◦ F −1 ◦ F = F ; F −1 ◦ F ◦ F −1 = F −1 ;
(vi) (F ◦ G)−1 = G−1 ◦ F −1 .
Proof. (i) through (iv) can be proved either directly, by appealing to the
definitions, or through a picture, such as the one given in Figure 2. To prove the
first equality in (v), note that by (ii), the monotonicity of F and (iv),
F (x) = p ≤ F ◦ F −1 (p) = F ◦ F −1 ◦ F (x) ≤ F (x).
The second equality in (v) follows from (ii), the monotonicity of F −1 and (iv) by
F −1 (q) ≤ F −1 ◦F ◦F −1 (q) ≤ F −1 (q). Finally, (vi) is a consequence of the definition
of (F ◦ G)−1 and (i).
As a consequence of (ii) and (iv), F ◦ F−1(p) ≡ p on (0, 1) and F−1 ◦ F(x) ≡ x on R if and only if F is continuous and strictly increasing. In that case F−1 is a proper inverse of F, as it should be.
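As a concrete illustration of these properties, consider the discrete uniform distribution on {1, 2} pictured in Figure 1; the following small computation is a sketch spelling out F and F−1 and showing how equality can fail in items (ii) and (iv) of Lemma 19.

    % Discrete uniform distribution on {1,2}: P(X = 1) = P(X = 2) = 1/2.
    F(x) = \begin{cases} 0, & x < 1,\\ 1/2, & 1 \le x < 2,\\ 1, & x \ge 2, \end{cases}
    \qquad
    F^{-1}(p) = \begin{cases} 1, & 0 < p \le 1/2,\\ 2, & 1/2 < p < 1. \end{cases}
    % Lemma 19 (ii): p = 1/4 is not in the range of F and
    %   F(F^{-1}(1/4)) = F(1) = 1/2 > 1/4, so the inequality is strict.
    % Lemma 19 (iv): x = 3/2 lies in the interior of a flat part of F and
    %   F^{-1}(F(3/2)) = F^{-1}(1/2) = 1 < 3/2, so equality fails.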
Corollary 20. Let F be an arbitrary distribution function and U a uniform
random variable on [0, 1]. Then F −1 (U ) ∼ F.
This follows from Lemma 19 (i). The transformation F −1 (U ) is called the
quantile transformation.
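The quantile transformation is also the standard recipe for simulating from a given distribution; a minimal sketch follows, with the exponential distribution, the seed and the sample size as arbitrary illustrative choices.

    import numpy as np

    # Quantile transformation (Corollary 20): if U ~ Uniform(0, 1), then F^{-1}(U) ~ F.
    # Illustration with F(x) = 1 - exp(-x) for x >= 0, whose quantile function is
    # F^{-1}(p) = -log(1 - p).
    rng = np.random.default_rng(1)
    u = rng.uniform(size=100_000)
    x = -np.log(1.0 - u)        # samples from the standard exponential distribution

    print(x.mean(), x.var())    # both should be close to 1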
Corollary 21. Let X ∼ F for a continuous distribution function F. Then
F (X) is uniformly distributed on [0, 1].
Again, this follows from Lemma 19 (i) and (ii) by
P(F (X) ≤ x) = P(F (X) < x) = 1 − P(F (X) ≥ x) = 1 − P(X ≥ F −1 (x))
= P(X < F −1 (x)) = P(X ≤ F −1 (x)) = F ◦ F −1 (x) = x,
where x ∈ (0, 1). The transformation F (X) for X ∼ F is called the probability
integral transformation.
Quantile functions are occasionally useful when studying weak convergence of
a sequence of random variables. In the next definition we introduce the notion of
the weak convergence of a sequence of quantile functions.
Definition 22. We shall say that a sequence of quantile functions Fn−1 converges weakly to a limit quantile function F−1, and denote this by Fn−1 ⇝ F−1, if Fn−1(t) → F−1(t) at every point 0 < t < 1 at which F−1 is continuous.

Both the terminology and the notation for the weak convergence of quantile functions are reminiscent of those for the weak convergence of distribution functions.
In fact, as shown in the next lemma, the two types of convergence are equivalent.
Lemma 23. For any sequence of distribution functions Fn, Fn ⇝ F if and only if Fn−1 ⇝ F−1.
Proof. Let U be a standard uniform random variable on some probability space, for instance on ([0, 1], B[0, 1], λ). Since F−1 has at most a countable number of discontinuity points and the distribution of U is absolutely continuous, Fn−1 ⇝ F−1 implies that Fn−1(U) →a.s. F−1(U). Therefore, Fn−1(U) ⇝ F−1(U). By Corollary 20, this is exactly the weak convergence Fn ⇝ F.
Now we prove the reverse implication. Let V be a standard normal random variable on some probability space, for instance on ([0, 1], B[0, 1], λ), on which it can be obtained through the quantile transformation Φ−1(U) for U a standard uniform random variable, see Corollary 20. Since the convergence Fn(t) → F(t) can fail only at a countable number of points t, and since the distribution of V is continuous, we have Fn(V) →a.s. F(V) (and of course Fn(V) ⇝ F(V)). By Lemma 19 (i),

Φ(Fn−1(t)) = P(V < Fn−1(t))
= 1 − P(V ≥ Fn−1(t))
= 1 − P(Fn(V) ≥ t)
= P(Fn(V) < t),
and similarly, P(F(V) < t) = Φ(F−1(t)). By the portmanteau lemma,

lim inf_{n→∞} P(Fn(V) < t) ≥ P(F(V) < t).

On the other hand, by elementary properties of the limits inferior and superior and the portmanteau lemma again,

lim inf_{n→∞} P(Fn(V) < t) ≤ lim sup_{n→∞} P(Fn(V) < t)
≤ lim sup_{n→∞} P(Fn(V) ≤ t)
= 1 − lim inf_{n→∞} P(Fn(V) > t)
≤ 1 − P(F(V) > t)
= P(F(V) ≤ t).
If the function t ↦ P(F(V) ≤ t) is continuous at t, then

P(F(V) ≤ t) = P(F(V) < t) = Φ(F−1(t)),

and in this case

lim inf_{n→∞} P(Fn(V) < t) = lim sup_{n→∞} P(Fn(V) < t) = lim_{n→∞} P(Fn(V) < t) = P(F(V) < t) = Φ(F−1(t)).

The function Φ(F−1(·)) is certainly continuous at every point t at which F−1 is. Since Φ−1 is a continuous function as well (cf. Lemma 19), from this it follows that Fn−1(t) → F−1(t) at every point t at which F−1 is continuous. Thus Fn−1 ⇝ F−1.
The work we put into the previous results allows us to give a short proof of the
almost sure representation theorem.
Theorem 24 (Almost sure representation). Let Xn ⇝ X. Then there exists a probability space (Ω̃, F̃, P̃) and random variables X̃n, X̃ defined on it, such that for all n ≥ 1, X̃n =d Xn, X̃ =d X, and X̃n →a.s. X̃.
Proof. Let Fn and F be the distribution functions of Xn and X, respectively. Consider the probability space (Ω̃, F̃, P̃) = ([0, 1], B[0, 1], λ) and let U be a random variable on it with a standard uniform distribution. Define X̃n = Fn−1(U) and X̃ = F−1(U). By Corollary 20, X̃n =d Xn and X̃ =d X. By Lemma 23, the convergence Fn ⇝ F implies that Fn−1 ⇝ F−1. By definition the latter means that Fn−1(t) → F−1(t) at all points t at which F−1 is continuous. Note that F−1 has at most a countable number of discontinuity points, and hence the convergence Fn−1(t) → F−1(t) can perhaps fail only on a set with Lebesgue measure zero. Since U has a continuous distribution, this implies that Fn−1(U) →a.s. F−1(U), i.e. X̃n →a.s. X̃.

Several applications of the almost sure representation theorem will be given in the next section.
1.7. Relation to other modes of convergence
Firstly, we show that convergence in probability implies convergence in distribution.
Theorem 25. Suppose that a sequence {Xn} of random variables and a random variable X are defined on the same probability space. Assume that Xn →P X. Then Xn ⇝ X.
Proof. Suppose the convergence Xn ⇝ X fails. By definition this means that there exists f ∈ Cb(R), such that the convergence µn(f) → µ(f) fails. Thus there exists ε > 0 and a subsequence {nk} of {n}, such that |µnk(f) − µ(f)| ≥ ε for all nk. This is obviously true for any further subsequence of {nk} as well. Pick a subsequence {nkℓ} of {nk}, such that Xnkℓ →a.s. X (this is possible, because Xn →P X). Then µnkℓ(f) → µ(f) by the dominated convergence theorem. But this leads to a contradiction that proves the theorem.
Corollary 26. Suppose that a sequence {Xn} of random variables and a random variable X are defined on the same probability space. Assume that Xn →a.s. X. Then Xn ⇝ X.
This follows from Theorem 25 and the fact that almost sure convergence implies
convergence in probability.
The converse to Theorem 25 (and Corollary 26) is in general false.
Example 27. Let X ∼ N(0, 1) and Xn = −X for all n ∈ N. Then P(|Xn − X| > ε) = P(|X| > ε/2) > 0 for all n ∈ N, and thus convergence in probability fails. Obviously, so does the almost sure convergence. On the other hand, by the symmetry of the standard normal distribution, Xn =d X, and hence Xn ⇝ X.
There is one notable exception, however.
Theorem 28. Let the random variables X, X1, X2, . . . be defined on the same probability space. If Xn ⇝ X, where P(X = x) = 1 for some x ∈ R, then also Xn →P X.
Proof. The distribution µ of X is the Dirac measure at x. For any ε > 0, the
sets (x + ε, ∞) and (−∞, x − ε) are µ-continuity sets. Note that
P(|Xn − X| > ε) = P(Xn > x + ε) + P(Xn < x − ε).
The right-hand side of the above display tends to zero as n → ∞ by the portmanteau
lemma. This completes the proof.
Next we move to convergence of the first moments. Since weak convergence in general does not imply convergence in probability, neither does it in general imply convergence of means. But when the collection {Xn} is uniformly integrable, the weak convergence Xn ⇝ X can be strengthened to convergence of means: E[Xn] → E[X]. The proof is a simple application of the almost sure representation theorem.
Theorem 29. Assume that Xn ⇝ X. If the sequence {Xn} is uniformly integrable, then E[Xn] → E[X] as n → ∞.
Proof. By the almost sure representation theorem, there exists a probability space (Ω̃, F̃, P̃) with random variables X̃, X̃1, X̃2, . . . , such that X =d X̃, Xn =d X̃n for all n ∈ N, and X̃n →a.s. X̃. By the uniform integrability of the family {Xn}, the family {X̃n} is also uniformly integrable. Therefore E[X̃n] → E[X̃], and since this latter convergence depends only on the laws of the random variables involved, the result follows.
Remark 30. Assume that {Xn } and X are defined on the same probability
space. Inspecting the proof of the previous theorem, one could have thought that
not only do the means converge, but that we also have the L1 -convergence: E[|Xn −
X|] → 0. However, this in general is false and here is a simple counterexample: take
X ∼ N (0, 1) and Xn = −X for all n ∈ N. Then the conditions of Theorem 29 are
satisfied, but E[|Xn − X|] = 2E[|X|], which does not tend to zero. The point is that
E[|Xn − X|] depends on the bivariate law of (Xn, X), and this need not be the same as that of (X̃n, X̃) (marginals do not determine joint distributions uniquely). This serves as a warning about when the almost sure representation theorem is applicable and when it is not: the representation does not in general preserve the dependence structure of {Xn} and X, and hence typically cannot be used for statements dealing with multivariate vectors obtained from {Xn} and X.
The following is what we can obtain without the uniform integrability assumption in Theorem 29. Again, the proof is an application of the almost sure representation theorem.
Theorem 31. If Xn ⇝ X, then lim inf_{n→∞} E[|Xn|] ≥ E[|X|].
Proof. By the almost sure representation theorem, there exists a probability space (Ω̃, F̃, P̃) with random variables X̃, X̃1, X̃2, . . . , such that X =d X̃, Xn =d X̃n for all n ∈ N, and X̃n →a.s. X̃. Fatou's lemma implies that E[|X̃|] ≤ lim inf_{n→∞} E[|X̃n|], and the statement follows.
1.8. Slutsky’s lemma
Suppose Xn ⇝ X and the sequence {Yn} is close in some sense to {Xn}. What can be said about the weak limit of {Yn}? Or suppose that {Xn} and {Yn} are weakly convergent. What can be said about the weak convergence of the sequence {XnYn}? The following result, known as Slutsky's lemma², gives an answer to these questions.
Theorem 32. Let {Xn} and {Yn} be two sequences of random variables defined on the same probability space.
(i) If Xn ⇝ X and |Xn − Yn| →P 0, then Yn ⇝ X.
(ii) If Xn ⇝ X and Yn ⇝ c for a constant c, then XnYn ⇝ cX.
Proof. We first prove (i). Let F be closed and δ = 1/m for m ∈ N. We have

P(Yn ∈ F) = P(Xn + Yn − Xn ∈ F)
= P(Xn + Yn − Xn ∈ F; |Xn − Yn| < δ) + P(Xn + Yn − Xn ∈ F; |Xn − Yn| ≥ δ)
≤ P(Xn ∈ F^δ) + P(|Xn − Yn| ≥ δ).

Letting n → ∞ and using the assumption |Xn − Yn| →P 0 and the portmanteau lemma, we obtain that

lim sup_{n→∞} P(Yn ∈ F) ≤ P(X ∈ F^δ).

Since F^δ ↓ F as m → ∞, the result follows by another application of the portmanteau lemma.
Now we prove (ii). Write

(1.5)   XnYn = Xn(Yn − c) + cXn.
²An alternative, but less common, spelling of Slutsky's name is Slutzky. Also, the result is at times called a theorem rather than a lemma.
An elementary argument shows that for any ε > 0 and δ > 0,

(1.6)   P(|Xn(Yn − c)| > ε) ≤ P(|Xn| > ε/δ) + P(|Yn − c| > δ).

Fix ε and pick δ such that ε/δ and −ε/δ are continuity points of the distribution of X. Then the first term on the right-hand side of the above display converges to P(|X| > ε/δ). The latter can be made arbitrarily small by taking δ small enough. As far as the second term in (1.6) is concerned, for every fixed δ it converges to zero. Hence Xn(Yn − c) →P 0. It is also easy to see that cXn ⇝ cX (this can be done in a variety of ways; for instance, the almost sure representation theorem and the dominated convergence theorem give for f ∈ Cb(R) that E[f(cXn)] → E[f(cX)]). Now apply part (i) to the right-hand side of (1.5).
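Here is a minimal simulation sketch of part (ii), with all distributional choices being arbitrary illustrations: with Xn ⇝ X ∼ N(0, 1) and Yn ⇝ 2, the products XnYn should behave like 2X for large n.

    import numpy as np

    # Slutsky's lemma, part (ii): X_n ~ N(0, 1 + 1/n) converges weakly to X ~ N(0, 1),
    # Y_n = 2 + Z_n / n with Z_n ~ N(0, 1) converges weakly to the constant 2,
    # so X_n * Y_n should converge weakly to 2 X ~ N(0, 4).
    rng = np.random.default_rng(2)
    n, m = 1000, 200_000                      # index n and Monte Carlo sample size

    x_n = rng.normal(0.0, np.sqrt(1.0 + 1.0 / n), size=m)
    y_n = 2.0 + rng.normal(size=m) / n
    prod = x_n * y_n
    target = 2.0 * rng.normal(size=m)         # draws from the limit law N(0, 4)

    for q in (0.1, 0.5, 0.9):                 # compare a few empirical quantiles
        print(q, np.quantile(prod, q), np.quantile(target, q))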
Slutsky’s lemma finds numerous applications in asymptotic theorems of mathematical statistics.
Exercises
1 Show that Xn ⇝ X iff E[f(Xn)] → E[f(X)] for all bounded uniformly continuous functions f.
2 Show the implication Fn(x) → F(x) for all x ∈ CF ⇒ µn →w µ without referring to the almost sure representation theorem. Hint: first, for given ε > 0, take K > 0 such that F(K) − F(−K) > 1 − ε. Approximate a function f ∈ Cb(R) on
the interval [−K, K] by a piecewise constant function, compute the integrals of
this approximating function and use the convergence of {Fn } at continuity points
of F etc.
3 Let {µn } be a sequence of discrete uniform distributions on [0, 1]: µn (i/n) = 1/n,
i = 1, . . . , n. Show that {µn } is weakly convergent and identify the weak limit.
4 Let {Xn} be an i.i.d. sequence of exponentially distributed random variables: FXn(x) = P(Xn ≤ x) = 1 − e^{−x} for x ≥ 0 and FXn(x) = 0 for x < 0. Let Mn = −log n + max_{1≤i≤n} Xi. Show that FMn ⇝ FM, where FM(x) = P(M ≤ x) = exp(−e^{−x}), x ∈ R. The latter distribution is known as the Gumbel distribution (or the extreme value distribution).
5 Consider the N(µn, σn²) distributions, where the µn are real numbers and the σn² nonnegative. Show that this family is tight iff the sequences (µn) and (σn²) are bounded. Under what condition do we have that the N(µn, σn²) distributions converge to a (weak) limit? What is this limit?
6 Let random variables X and Xn possess discrete distributions supported on N. Show that Xn ⇝ X if and only if P(Xn = m) → P(X = m) for every m ∈ N.
7 Give an example of distribution functions F and Fn on the real line, such that Fn ⇝ F, but sup_x |Fn(x) − F(x)| → 0 fails.
8 For a distribution function G on the real line the median is defined by G−1(1/2). Assume that Fn ⇝ F and let m = med(F) and mn = med(Fn) be the medians of F and Fn, respectively. Find suitable assumptions, under which mn → m as n → ∞.
9 Let F and G be two distribution functions on R and let

L(F, G) = inf{h > 0 : F(x − h) − h ≤ G(x) ≤ F(x + h) + h for all x ∈ R}

be the Lévy distance between them (accept as a fact, or prove for yourself, that L(F, G) defines a distance). Show that the weak convergence Fn ⇝ F is equivalent to convergence in the Lévy metric: L(Fn, F) → 0. Hint: the implication L(Fn, F) → 0 ⇒ Fn ⇝ F follows from the definition. The other one can be established by contradiction.
10 Prove uniqueness of the weak limit µ of a weakly convergent sequence of probability measures µn .
CHAPTER 2
Characteristic functions
2.1. Definition and first properties
Let X be a random variable defined on (Ω, F, P). X induces a probability measure on (R, B), the law or distribution of X, denoted by PX or µ. This probability
measure, in turn, determines the distribution function F of X. Conversely, F also
determines PX . Hence distribution functions on R and probability measures on
(R, B) are in bijective correspondence. In this chapter we develop another such
correspondence. We start with a definition.
Definition 33. Let µ be a probability measure on (R, B). Its characteristic function φ : R → C is defined by

(2.1)   φ(u) = ∫_R e^{ıux} µ(dx).
Whenever needed, we write φµ instead of φ to express the dependence on µ.
Note that in this definition we integrate a complex-valued function. By splitting a complex-valued function f = g + ıh into its real part g and imaginary part h, we define ∫ f dµ := ∫ g dµ + ı ∫ h dµ. For integrals of complex-valued functions, previously shown theorems are, mutatis mutandis, true. For instance, one has |∫ f dµ| ≤ ∫ |f| dµ, where | · | denotes the norm of a complex number.
If X is a random variable with distribution µ, then φµ can alternatively be
expressed as φ(u) = E[exp(ıuX)]. There are many random variables with distribution µ. They all have the same characteristic function. We also adopt the notation
φX to indicate that we are dealing with the characteristic function of the random
variable X.
Before we give some examples and elementary properties of characteristic functions, we look at a special case. Suppose that X admits a density f with respect
to Lebesgue measure. Then
Z
(2.2)
φX (u) =
eıux f (x) dx.
R
Analysts define for f ∈ L1(R, B, λ) the Fourier transform f̂ by

f̂(u) = ∫_R e^{−ıux} f(x) dx.

What we thus see is the equality φX(u) = f̂(−u). Given the usefulness of Fourier transforms in various branches of mathematics, we then get a feeling that characteristic functions will be important in probability theory as well.
Computation of a characteristic function (if it is explicitly computable) is typically a clever exercise in integration.
Example 34. Let X ∼ N(0, 1). Then

φX(u) = E[e^{ıuX}] = (1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = e^{−u²/2}.

In fact,

(1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = (1/√(2π)) ∫_R Σ_{n=0}^∞ ((ıux)^n / n!) e^{−x²/2} dx
= Σ_{n=0}^∞ ((ıu)^n / n!) (1/√(2π)) ∫_R x^n e^{−x²/2} dx
= Σ_{n=0}^∞ ((ıu)^n / n!) E[X^n].
For n odd, E[X n ] = 0, while by Stein’s lemma, see Lemma 35 ahead, for n even,
E[X n ] = (n − 1)!!. Hence the above chain of equalities can be continued as
Σ_{n=0}^∞ ((ıu)^n / n!) E[X^n] = Σ_{n=0}^∞ ((ıu)^{2n} / (2n)!) (2n − 1)!!
= Σ_{n=0}^∞ ((ıu)^{2n} / (2n)!) · (2n)! / (2^n n!)
= Σ_{n=0}^∞ (1/n!) (−u²/2)^n
= e^{−u²/2}.
Here we used the fact that
(2n − 1)!! = Π_{i=1}^n (2i − 1) = {Π_{i=1}^n (2i − 1)} {Π_{i=1}^n (2i)} · (1 / Π_{i=1}^n (2i)) = (2n)! / (2^n n!).
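A quick Monte Carlo check of this computation (the seed, sample size and values of u are arbitrary choices) estimates E[e^{ıuX}] from simulated standard normal draws and compares it with e^{−u²/2}:

    import numpy as np

    # Monte Carlo check of Example 34: for X ~ N(0, 1) one has E[exp(iuX)] = exp(-u^2/2).
    rng = np.random.default_rng(3)
    x = rng.normal(size=1_000_000)

    for u in (0.5, 1.0, 2.0):
        empirical = np.mean(np.exp(1j * u * x))          # estimate of E[exp(iuX)]
        print(u, empirical.real, np.exp(-u ** 2 / 2))    # imaginary part is close to 0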
Lemma 35 (Stein’s lemma). Let X ∼ N (0, 1) and let g be a differentiable
function satisfying E[g(X)X] < ∞ and E[g 0 (X)] < ∞. Then E[g(X)X] = E[g 0 (X)].
Proof. We have
1
E[g(X)X] = √
2π
Z
g(x)xe−x
2
/2
dx.
R
By partial integration the right-hand side is equal to
Z
2
1
1
−x2 /2 ∞
− √ g(x)e
|−∞ + √
g 0 (x)e−x /2 dx = E[g 0 (X)].
2π
2π R
This completes the proof.
Here is another illustrative example.
Example 36. Let X have a standard Cauchy distribution. Directly from the definition, when u = 0, φX(u) = 1. Now assume u ≠ 0. Then

φX(u) = (1/π) ∫_R e^{ıux} · 1/(1 + x²) dx = (1/π) ∫_R cos(ux)/(1 + x²) dx = (1/π) |u| ∫_R cos(y)/(u² + y²) dy.

The integral in the last equality is best evaluated through contour integration techniques. Let CR be a closed contour consisting of the real line segment from −R to R and the upper semi-circle ΓR centred at the origin and of radius R. It can be shown that

∫_{ΓR} e^{ız}/(u² + z²) dz → 0

as R → ∞, see pp. 145–146 in Bak and Newman (2010). Therefore,

∫_{CR} e^{ız}/(u² + z²) dz → ∫_R e^{ıy}/(u² + y²) dy.

Taking real parts on both sides, since z0 = ı|u| is the only pole of the function e^{ız}/(u² + z²) in the upper half plane, by the residue theorem we get that

∫_R cos(y)/(u² + y²) dy = Re[2πı Res(e^{ız}/(u² + z²), z0)].

Now, since z0 is a simple pole, it follows by (ii) on p. 130 in Bak and Newman (2010) that

Res(e^{ız}/(u² + z²), z0) = lim_{z→z0} (z − z0) e^{ız}/(u² + z²) = e^{−|u|}/(2ı|u|).

Thus φX(u) = e^{−|u|} for u ≠ 0. We conclude that φX(u) = e^{−|u|} for all u ∈ R.
The following proposition lists some simple properties of characteristic functions.
Proposition 37. Let φ = φX be the characteristic function of some random
variable X. The following hold true:
(i) φ(0) = 1 and |φ(u)| ≤ 1 for all u ∈ R.
(ii) φ is uniformly continuous on R.
(iii) φ_{aX+b}(u) = φX(au) e^{ıub}.
(iv) φ is real-valued and symmetric around zero, if X and −X have the same distribution.
(v) If X and Y are independent, then φ_{X+Y}(u) = φX(u)φY(u).
(vi) If E|X|^k < ∞, then φ ∈ C^k(R) and φ^{(k)}(0) = ı^k E[X^k].
Proof. Properties (i), (iii) and (iv) are trivial. Consider (ii). Fixing u ∈ R, we consider φ(u + t) − φ(u) for t → 0. We have

|φ(u + t) − φ(u)| = |∫ (exp(ı(u + t)x) − exp(ıux)) µ(dx)| ≤ ∫ |exp(ıtx) − 1| µ(dx).

The functions x ↦ exp(ıtx) − 1 converge to zero pointwise for t → 0 and are bounded by 2. The result thus follows from dominated convergence.
Property (v) follows from the product rule for expectations of independent
random variables.
Finally, property (vi) for k = 1 follows by an application of the dominated
convergence theorem and the inequality |eıx − 1| ≤ |x|, for x ∈ R. The other cases
can be treated similarly.
Remark 38. Here is a simple application of Proposition 37: if X ∼ N(m, σ²), then φX(u) = e^{ıum − σ²u²/2}.
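The computation behind Remark 38 is a one-liner: writing X = m + σZ with Z ∼ N(0, 1) and combining Proposition 37 (iii) with Example 34,

    % X = m + sigma Z with Z ~ N(0,1); Proposition 37 (iii) and Example 34 give
    \varphi_X(u) = \varphi_{m+\sigma Z}(u)
                 = e^{\imath u m}\,\varphi_Z(\sigma u)
                 = e^{\imath u m}\,e^{-\sigma^2 u^2/2}
                 = e^{\imath u m-\sigma^2 u^2/2}.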
Remark 39. Warning: the converse to Proposition 37 (v) is typically false, i.e.
from the equality
φX+Y (u) = φX (u)φY (u), u ∈ R,
it cannot be concluded that X and Y are independent. Here is a counterexample:
let X have a standard Cauchy distribution and let Y = X. Then
e−2|u| = φ2X (u) = φX+Y (u) = e−|u| e−|u| = φX (u)φY (u),
although X and Y are obviously dependent in this case. More on this example
later.
2.2. Inversion formula and uniqueness
Given a characteristic function φ, how can we find the corresponding distribution function F, or the corresponding law µ? As we will see, an answer to this question is given by the inversion formula below. Note that the integration interval in formula (2.3) is symmetric around zero. This is essential: an improper integral

∫_{−∞}^{∞} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du

typically does not exist (in the Lebesgue sense). That the limit in (2.3), called the Cauchy limit, is finite, is actually part of the assertion of the theorem.
Theorem 40. Let µ be a probability law and φ its characteristic function. Then for all a < b,

(2.3)   lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du = µ((a, b)) + (1/2) µ({a, b}).
Proof. We compute, using Fubini's theorem, which we will justify below,

(2.4)   ΦT := (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du
= (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) ∫_R e^{ıux} µ(dx) du
= (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) e^{ıux} du µ(dx)
(2.5)   = (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{ı(x−a)u} − e^{ı(x−b)u})/(ıu)) du µ(dx)
=: ∫_R ET(x) µ(dx).

Application of Fubini's theorem is justified as follows. First, the integrand in (2.5) is bounded by b − a, because |e^{ıx} − e^{ıy}| ≤ |x − y| for all x, y ∈ R. Second, the product measure λ × µ on [−T, T] × R is finite.
By splitting the integrand of ET(x) into its real and imaginary part, we see that the imaginary part vanishes and we are left with

ET(x) = (1/(2π)) ∫_{−T}^{T} ((sin(x − a)u − sin(x − b)u)/u) du
= (1/(2π)) ∫_{−T}^{T} (sin(x − a)u / u) du − (1/(2π)) ∫_{−T}^{T} (sin(x − b)u / u) du
= (1/(2π)) ∫_{−T(x−a)}^{T(x−a)} (sin v / v) dv − (1/(2π)) ∫_{−T(x−b)}^{T(x−b)} (sin v / v) dv.

The function g given by g(s, t) = ∫_s^t (sin y / y) dy is continuous in (s, t). Hence it is bounded on any compact subset of R². Moreover, g(s, t) → π as s → −∞ and t → ∞ (this can be shown by contour integration techniques; see e.g. pp. 146–147 in Bak and Newman (2010)¹). Hence g, as a function on R², is bounded in s, t. We conclude that also ET(x) is bounded as a function of T and x, the first ingredient to apply the dominated convergence theorem to (2.5), since µ is a finite measure. The second ingredient is to identify E(x) := lim_{T→∞} ET(x). For an arbitrary α, a change of the integration variable gives

∫_0^∞ (sin(αy)/y) dy = sgn(α) π/2.

Here sgn(α) denotes 1, 0 or −1 according to whether α > 0, α = 0 or α < 0. By comparing the location of x relative to a and b, we use the value of the latter integral to obtain

E(x) = 1 if a < x < b,   E(x) = 1/2 if x = a or x = b,   E(x) = 0 else.

We thus get, using the dominated convergence theorem again, that

ΦT → µ((a, b)) + (1/2) µ({a, b})

as T → ∞. This completes the proof.
Remark 41. If a and b are continuity points of F, then the right-hand side of
(2.3) is F (b) − F (a). Thus φ determines F at all continuity points of F. But due to
right-continuity of F, the latter completely determines F. F in turn determines µ,
and so φ determines µ.
Let us now give another version of the inversion formula.
Theorem 42. If the characteristic function φ of a probability measure µ on
(R, B) belongs to L1 (R, B, λ), then µ admits a density f w.r.t. the Lebesgue measure
λ. Moreover, f is continuous.
Proof. Define

(2.6)   f(x) = (1/(2π)) ∫_R e^{−ıux} φ(u) du.

¹An alternative derivation is given here: http://staff.science.uva.nl/~hvzanten/ex_5_9.pdf
Since |φ| has a finite integral, f is well defined for every x. Observe that f is real-valued, because φ(−u) is the complex conjugate of φ(u). An easy application of the dominated convergence theorem shows that f is continuous. Now note first that the limit of the integral in (2.3) is equal to the (Lebesgue) integral (1/(2π)) ∫_R ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du, again because of dominated convergence. Next we use Fubini's theorem to compute for any continuity points a < b of F that

∫_a^b f(x) dx = ∫_a^b (1/(2π)) ∫_R e^{−ıux} φ(u) du dx
= (1/(2π)) ∫_R φ(u) ∫_a^b e^{−ıux} dx du
= (1/(2π)) ∫_R φ(u) ((e^{−ıua} − e^{−ıub})/(ıu)) du
= F(b) − F(a),

where we also employed Theorem 40. Next, by continuity of ∫_a^b f(x) dx in a and b, the relationship

∫_a^b f(x) dx = F(b) − F(a)

in fact holds for any a, b ∈ R. By continuity of f, for any y ∈ [a, b] the Lebesgue integral ∫_a^y f(x) dx equals the Riemann integral. By the fundamental theorem of calculus it follows that F′(y) = f(y) for all y ∈ (a, b) and so for all y ∈ R. Since F is non-decreasing, f must be nonnegative, and hence it is a probability density.

Remark 43. Note the duality between the expressions (2.2) and (2.6). Apart
from the presence of the minus sign in the integral and the factor 2π in the denominator in (2.6), the transformations f ↦ φ and φ ↦ f are similar.
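As a numerical illustration of (2.6), one can approximately recover the standard Cauchy density 1/(π(1 + x²)) from the characteristic function φ(u) = e^{−|u|} of Example 36 by truncating the integral and using a Riemann sum; the truncation level and grid below are arbitrary choices.

    import numpy as np

    # Numerical check of the inversion formula (2.6) with phi(u) = exp(-|u|):
    # the recovered density should match the standard Cauchy density 1/(pi (1 + x^2)).
    u, du = np.linspace(-50.0, 50.0, 200_001, retstep=True)   # truncated grid for u
    phi = np.exp(-np.abs(u))

    for x in (0.0, 1.0, 3.0):
        integrand = np.exp(-1j * u * x) * phi
        f_x = (integrand.sum() * du).real / (2.0 * np.pi)      # Riemann sum for (2.6)
        print(x, f_x, 1.0 / (np.pi * (1.0 + x ** 2)))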
The inversion theorem entails one very important result.
Theorem 44. Random variables X and Y are equal in distribution if and only
if their characteristic functions are the same: φX (t) = φY (t) for all t ∈ R.
Proof. One side of the theorem is trivial. For the other side we argue as follows: suppose φX(t) = φY(t) for all t ∈ R. By Fubini's theorem and the inversion formula for characteristic functions, for every σn > 0 and y ∈ R we have

∫_R e^{−ıty} e^{−σn²t²/2} φX(t) dt = ∫_R e^{−ıty} e^{−σn²t²/2} E[e^{ıtX}] dt
= E[ ∫_R e^{−ıt(y−X)} e^{−σn²t²/2} dt ]
= (√(2π)/σn) E[ e^{−(y−X)²/(2σn²)} ]
= (√(2π)/σn) ∫_R e^{−(y−x)²/(2σn²)} dFX(x)
= 2π f_{X+σnZ}(y).

Here Z is a standard normal random variable independent of X and f_{X+σnZ} is the density of X + σnZ with respect to the Lebesgue measure. Replace φX with φY in the above argument to see that f_{X+σnZ}(y) = f_{Y+σnZ}(y). This implies that for
every σn > 0, X + σnZ =d Y + σnZ. Letting σn → 0 as n → ∞, Slutsky's lemma gives that X + σnZ ⇝ X. Likewise, X + σnZ ⇝ Y. Due to the uniqueness of the weak limit, we then obtain that X =d Y.
Put another way, Theorem 44 implies that there is a one-to-one correspondence
between probability measures and characteristic functions.
2.3. Necessary conditions
In the previous sections we have derived some properties a characteristic function possesses. Equally interesting is finding general conditions for a function φ to
be a characteristic function. We will formulate two results in that direction. Their
proofs can be found e.g. in Chung (2001) (see Theorems 6.5.2 and 6.5.3 there). The
first result gives a necessary and sufficient condition, but is not easily verifiable.
The second one is only sufficient, but its conditions are simpler.
Recall that a complex-valued function φ on R is called positive definite, if for any finite set of real numbers tj's and complex numbers zj's, 1 ≤ j ≤ n, n = 1, 2, . . . , we have

Σ_{j=1}^n Σ_{k=1}^n φ(tj − tk) zj z̄k ≥ 0,

where z̄k is the complex conjugate of zk.
Theorem 45 (Bochner-Khinchin theorem). A function φ is a characteristic
function if and only if it is positive definite, continuous at 0, and φ(0) = 1.
Theorem 46 (Pólya’s theorem). Let φ satisfy the following conditions: φ(0) =
1, φ is nonnegative, symmetric around zero, and decreasing, continuous and convex
on [0, ∞). Then φ is a characteristic function.
Example 47. Let 0 < α ≤ 1. An application of Pólya's theorem gives that the function

φα(u) = e^{−|u|^α}

is a characteristic function (check this). No such luck when 1 < α < 2, but via an alternative route φα can nevertheless be shown to be a characteristic function in that case as well (see e.g. pp. 192–193 in Chung (2001)). When α = 2, we know that φα corresponds to the normal distribution. A probability distribution that has φα as a characteristic function is called a stable distribution with index α. We finally remark that it can be shown that φα with α > 2 is not a characteristic function (in this case φα is twice differentiable at zero and φα′(0) = φα″(0) = 0. Assume φα is a characteristic function. By Theorem 6.4.1 in Chung (2001) the first and second moments of the corresponding probability law are zero. But then µ must be the Dirac measure at zero, so that φα(u) = 1 for all u ∈ R. This is a contradiction).
2.4. Multidimensional case
Our treatment in this section is cursory and we omit most details.
The characteristic function φ of a probability measure µ on (Rᵏ, B(Rᵏ)) is defined by the k-dimensional analogue of (2.1). We have, with u, x ∈ Rᵏ and ⟨·, ·⟩ the standard inner product,

φ(u) = ∫_{Rᵏ} e^{ı⟨u,x⟩} µ(dx).
As in the real case, probability measures here too are uniquely determined by their
characteristic functions. As a consequence we have the following characterization
of independent random variables.
Proposition 48. Let X = (X1, . . . , Xk) be a k-dimensional random vector. Then X1, . . . , Xk are independent random variables iff φX(u) = Π_{i=1}^k φXi(ui) for all u = (u1, . . . , uk) ∈ Rᵏ.
Proof. If the Xi are independent, the statement about the characteristic functions is proved in the same way as Proposition 37 (v). If the characteristic function
φX factorizes as stated, the result follows by the uniqueness property of characteristic functions.
Remark 49. Let k = 2 in the above proposition. If X1 = X2 as in Remark 39,
then we do not have φX (u) = φX1 (u1 )φX2 (u2 ) for every u1 , u2 (you check!), in
agreement with the fact that X1 and X2 are not independent. But for the special
choice u1 = u2 this product relation holds true.
Example 50. Let X and Y be independent standard normal random variables.
Then somewhat unexpectedly, the random variables X − Y and X + Y are also
independent, which can be shown using Proposition 48.
Exercises
1 Let φ be a characteristic function. Show that so is |φ|².
2 If F and G are distribution functions, such that F = Σ_{j=1}^m bj δaj and G has a density, say g, show that the convolution F ∗ G has a density and find it.
3 Show that for any characteristic function φ,

Re[1 − φ(u)] ≥ (1/4) Re[1 − φ(2u)].
4 A random variable X with the characteristic function φ is symmetric, if and only
if φ(u) is real for all u ∈ R.
5 Let X1, X2, . . . be a sequence of i.i.d. random variables and N a Poisson(λ) distributed random variable, independent of the Xn. Put Y = Σ_{n=1}^N Xn. Let φ be the characteristic function of the Xn and ψ the characteristic function of Y. Show that ψ = exp(λφ − λ).
6 If X has an exponential distribution with parameter λ, then φX (u) = λ/(λ − iu).
7 Let φ be a real characteristic function with the property that φ(nu) = φ(u)^n for all u ∈ R and n ∈ N. Show that for some α ≥ 0 it holds that φ(u) = exp(−α|u|). Let X have characteristic function φ(u) = exp(−α|u|). If α > 0, show that X admits the density

x ↦ (1/π) · α/(α² + x²).

What is the distribution of X if α = 0?
8 Prove the statement made in Example 50. Also verify that the function φα from
Example 47 is indeed a characteristic function for 0 < α ≤ 1.
9 Let µ be a probability law on (R, B(R)) and let φ be the corresponding characteristic function. Show that for any fixed x ∈ R,

lim_{T→∞} (1/(2T)) ∫_{−T}^{T} e^{−ıux} φ(u) du = µ({x}).

Hint: reduce the question to studying

∫_{R\{x}} (sin(T(y − x))/(T(y − x))) µ(dy) + ∫_{{x}} µ(dy).
10 Let the distribution function F on R have a density f with respect to the Lebesgue
measure. Prove that for the corresponding characteristic function φ one has
φ(u) → 0 as |u| → ∞. This result is known as the Riemann-Lebesgue lemma and
its ‘analytic counterpart’ is of importance in harmonic analysis. You may assume
additionally that f is continuous Lebesgue a.e. You will get a bonus point, if you
prove the result for a general f (not necessarily continuous).
CHAPTER 3
Limit theorems
This chapter deals with a number of important limit theorems in probability
theory. Their proofs are to a considerable extent based on characteristic function
techniques.
3.1. Characteristic functions and weak convergence
In this section we study how characteristic functions relate to weak convergence.
Our first result says that weak convergence of probability measures implies
pointwise convergence of their characteristic functions.
Proposition 51. Let µ, µ1, µ2, . . . be probability measures on (R, B) and let φ, φ1, φ2, . . . be their characteristic functions. If µn →w µ, then φn(u) → φ(u) for every u ∈ R.

Proof. Consider for fixed u the function f(x) = e^{ıux}. It is obviously bounded and continuous and we obtain straight from the definition of weak convergence that µn(f) → µ(f). But µn(f) = φn(u).
Proposition 52. Let µ1, µ2, . . . be probability measures on (R, B). Let φ1, φ2, . . . be the corresponding characteristic functions. Assume that the sequence (µn) is tight and that for all u ∈ R the limit φ(u) := limn→∞ φn(u) exists. Then there exists a probability measure µ on (R, B), such that φ = φµ and µn →w µ.
Proof. Since (µn ) is tight we use Prokhorov’s theorem to deduce that there
exists a weakly converging subsequence (µnk ) with a probability measure as limit.
Call this limit µ. From Proposition 51 we know that φnk (u) → φµ (u) for all u.
Hence we must have φµ = φ. We will now show that any convergent subsequence
of (µn) has µ as a limit. Suppose that there exists a subsequence (µ_{n′_k}) with limit µ′. Then φ_{n′_k}(u) converges to φ_{µ′}(u) for all u. But, since (µ_{n′_k}) is a subsequence of the original sequence, by assumption the corresponding φ_{n′_k}(u) must converge to φ(u) for all u. Hence we conclude that φ_{µ′} = φµ and then µ′ = µ.
Suppose that the whole sequence (µn) does not converge to µ. Then there must exist a function f ∈ Cb(R), such that µn(f) does not converge to µ(f). So there is ε > 0, such that for some subsequence (n′_k) we have
(3.1)  |µ_{n′_k}(f) − µ(f)| > ε.
Using Prokhorov’s theorem, the sequence (µ_{n′_k}) has a further subsequence (µ_{n″_k}) that has a limit probability measure µ″. By the same argument as above (convergence of the characteristic functions) we conclude that µ″(f) = µ(f). Therefore µ_{n″_k}(f) → µ(f), which contradicts (3.1).
Characteristic functions are a tool to obtain rough estimates of the tail probabilities of a random variable, which is useful for establishing tightness of a sequence of probability measures. To that end we will use the following lemma. By taking the complex conjugate, check first that ∫_{−a}^{a} (1 − φ(u)) du ∈ R for every a > 0.
Lemma 53. Let a random variable X have distribution µ and characteristic function φ. Then for every K > 0
(3.2)  P(|X| > 2K) ≤ K ∫_{−1/K}^{1/K} (1 − φ(u)) du.
Proof. It follows from Fubini’s theorem and
∫_{−a}^{a} e^{iux} du = 2 sin(ax)/x
that
K ∫_{−1/K}^{1/K} (1 − φ(u)) du = K ∫_{−1/K}^{1/K} ∫ (1 − e^{iux}) µ(dx) du
  = ∫ K ∫_{−1/K}^{1/K} (1 − e^{iux}) du µ(dx)
  = 2 ∫ (1 − sin(x/K)/(x/K)) µ(dx)
  ≥ 2 ∫_{|x/K|>2} (1 − sin(x/K)/(x/K)) µ(dx)
  ≥ µ([−2K, 2K]^c),
since (sin x)/x ≤ 1/2 for x > 2 (the function g(x) = (sin x)/x is called the cardinal sine, or simply the sinc function).
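As a quick sanity check of (3.2), here is a minimal numerical sketch in Python (using only the standard library): for the standard Cauchy law both sides of the bound are available in closed form, since φ(u) = exp(−|u|) and P(|X| > a) = 1 − (2/π) arctan(a); the distribution and the values of K are arbitrary choices made for illustration.

```python
# Illustrative sketch: the tail bound (3.2) for the standard Cauchy distribution,
# with phi(u) = exp(-|u|) and P(|X| > a) = 1 - (2/pi) arctan(a).
import math

for K in [0.5, 1.0, 2.0, 5.0]:
    tail = 1 - (2 / math.pi) * math.atan(2 * K)      # P(|X| > 2K)
    # K * integral_{-1/K}^{1/K} (1 - exp(-|u|)) du, evaluated in closed form
    bound = 2 - 2 * K * (1 - math.exp(-1 / K))
    print(f"K = {K:4.1f}:  P(|X| > 2K) = {tail:.4f}  <=  K*integral = {bound:.4f}")
```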
The following theorem is known as Lévy’s continuity theorem.
Theorem 54 (Lévy’s continuity theorem). Let µ1 , µ2 , . . . be a sequence of probability measures on (R, B) and φ1 , φ2 , . . . the corresponding characteristic functions.
Assume that for all u ∈ R the limit φ(u) := limn→∞ φn (u) exists. If φ is continuous
at zero, then there exists a probability measure µ on (R, B), such that φ = φµ and µn →w µ.
Proof. We will show that under the present assumptions, the sequence (µn )
is tight. To this end we will use Lemma 53. Let ε > 0. Since φ is continuous at
zero, the same holds for the function u ↦ φ(u) + φ(−u), and there is δ > 0 such that |φ(u) + φ(−u) − 2| < ε
if |u| < δ. Notice that φ(u) + φ(−u) is real-valued and bounded from above by 2.
Hence
2 ∫_{−δ}^{δ} (1 − φ(u)) du = ∫_{−δ}^{δ} (2 − φ(u) − φ(−u)) du ∈ [0, 2δε).
By the convergence of the characteristic functions (which are bounded), the dominated convergence theorem implies that
∫_{−δ}^{δ} (1 − φn(u)) du → ∫_{−δ}^{δ} (1 − φ(u)) du.
Hence, for all n ≥ N with N chosen large enough, we have
∫_{−δ}^{δ} (1 − φn(u)) du < 2δε.
It now follows from Lemma 53 that for n ≥ N and K = 1/δ
µn([−2K, 2K]^c) ≤ (1/δ) ∫_{−δ}^{δ} (1 − φn(u)) du < 2ε.
We conclude that (µn)_{n≥N} is tight, and then so is the sequence (µn)_{n∈N} as well. Apply Proposition 52 to conclude.
Corollary 55. Let µ, µ1, µ2, . . . be probability measures on (R, B) and φ, φ1, φ2, . . . be their corresponding characteristic functions. Then µn →w µ if and only if φn(u) → φ(u) for all u ∈ R.
Proof. If φn(u) → φ(u) for all u ∈ R, then we can apply Theorem 54. The function φ, being a characteristic function, is continuous at zero. Hence there is a probability measure to which the µn weakly converge. But since the φn(u) converge to φ(u), the limiting probability measure must be µ. The converse statement we have already encountered as Proposition 51.
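For illustration, here is a minimal numerical sketch in Python (using numpy; the value λ = 3 and the grid of points u are arbitrary choices): the characteristic functions of Bin(n, λ/n) converge pointwise to the characteristic function of Poisson(λ), so by Corollary 55 the corresponding laws converge weakly (compare with Exercise 7 at the end of this chapter).

```python
# Illustrative sketch: pointwise convergence of the characteristic functions of
# Bin(n, lam/n) to that of Poisson(lam), which by Corollary 55 is equivalent to
# weak convergence of the corresponding laws.
import numpy as np

lam = 3.0
u = np.linspace(-5.0, 5.0, 11)                      # a few points u on the real line

def phi_binomial(u, n, p):
    return (1 - p + p * np.exp(1j * u)) ** n        # characteristic function of Bin(n, p)

def phi_poisson(u, lam):
    return np.exp(lam * (np.exp(1j * u) - 1))       # characteristic function of Poisson(lam)

for n in [10, 100, 1000, 10000]:
    err = np.max(np.abs(phi_binomial(u, n, lam / n) - phi_poisson(u, lam)))
    print(f"n = {n:6d}:  max_u |phi_n(u) - phi(u)| = {err:.2e}")
```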
3.2. Weak law of large numbers
In this section we present the weak law of large numbers for a sequence of i.i.d.
random variables. In its proof we will need the following elementary result from
calculus.
Lemma 56. Let z be a complex number, such that |z| ≤ 1/2. Then there exists
a complex number θ depending on z, such that |θ| ≤ 1, and log(1 + z) = z + θ|z|2 .
Proof. Without loss of generality, assume that z ≠ 0 (when z = 0, log(1 + z) = 0 = z, and hence θ = 0). We have
log(1 + z) = z − z²/2 + z³/3 − z⁴/4 + . . .
  = z + z² (−1/2 + z/3 − z²/4 + . . .)
  = z + |z|² · (z²/|z|²) (−1/2 + z/3 − z²/4 + . . .).
We claim that
θ = (z²/|z|²) (−1/2 + z/3 − z²/4 + . . .).
To verify the claim, we need to check that |θ| ≤ 1. This, however, is easy:
|θ| = |−1/2 + z/3 − z²/4 + . . .| ≤ 1/2 + (1/3)(1/2) + (1/4)(1/2)² + . . . ≤ ∑_{k=1}^{∞} (1/2)^k = 1.
Corollary 57. If a sequence of complex numbers {cn} converges to the limit c, then
lim_{n→∞} (1 + cn/n)^n = e^c.
Proof. It is sufficient to prove that
lim_{n→∞} log{(1 + cn/n)^n} = lim_{n→∞} n log(1 + cn/n) = c.
Since the sequence {cn} converges, it is bounded, and furthermore, |cn/n| ≤ 1/2 for all n large enough. Then from Lemma 56,
n log(1 + cn/n) = cn + o(1).
Because the right-hand side tends to c as n → ∞, the result follows.
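A small numerical sketch of Corollary 57 in Python (the complex number c and the sequence cn = c + 1/n are arbitrary choices made for illustration):

```python
# Illustrative sketch: (1 + c_n/n)^n -> e^c for the arbitrarily chosen sequence
# c_n = c + 1/n, with c = 0.3 + 1.2i.
import cmath

c = 0.3 + 1.2j
for n in [10, 100, 1000, 10000]:
    cn = c + 1 / n
    approx = (1 + cn / n) ** n
    print(f"n = {n:6d}:  |(1 + c_n/n)^n - e^c| = {abs(approx - cmath.exp(c)):.2e}")
```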
Theorem 58 (Weak law of large numbers). Let X1, X2, . . . be i.i.d. random variables with characteristic function φ. Assume that φ is differentiable at zero and φ′(0) = ıµ. Then
X̄n = (1/n) ∑_{i=1}^n Xi →P µ.
Proof. By differentiability of φ at zero, we have
φ(t) = φ(0) + φ′(0)t + o(t) = 1 + ıµt + o(t).
By independence of the Xi, for every fixed t,
E[e^{ıtX̄n}] = φ(t/n)^n = (1 + ıµt/n + o(1/n))^n.
As n → ∞, by Corollary 57 the right-hand side converges to e^{ıtµ}. Now φ(t) = e^{ıtµ} is the characteristic function of a constant random variable µ. By Lévy’s continuity theorem, X̄n ⇝ µ. Since convergence in distribution and convergence in probability are equivalent for constant limits, it follows that X̄n →P µ.
Remark 59. If E[|X1|] < ∞, then the dominated convergence theorem allows one to interchange the order of differentiation and expectation to obtain
(3.3)  φ′(t) = (d/dt) E[e^{itX1}] = E[(d/dt) e^{itX1}] = ıE[X1 e^{itX1}].
For t = 0 this yields φ′(0) = ıE[X1] = ıµ and X̄n →P E[X1], which is hardly surprising in light of the strong law of large numbers. However, integrability of X1 is a sufficient, but not a necessary, condition for differentiability of φ at zero. Hence the weak law of large numbers holds under a weaker condition than the strong law.
Remark 60. The condition φ′(0) = ıµ is also necessary for the convergence X̄n →P µ. We will not prove this fact. For the proof see e.g. Theorem 2.5.5 in Révész (1968).
An alternative necessary and sufficient condition for the weak law of large numbers,
that does not employ characteristic functions, is also known (see e.g. Chung (2001),
pp. 116–118). Furthermore, Chung (2001), pp. 118–119, contains an example, in
which the weak law of large numbers holds, while the strong law fails.
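The following Monte Carlo sketch in Python (using numpy; the distributions, sample sizes and seed are arbitrary choices) illustrates the discussion: for Exp(1) variables the sample mean settles near µ = 1, while for standard Cauchy variables, whose characteristic function exp(−|u|) is not differentiable at zero, the sample mean does not stabilize (indeed, the sample mean of i.i.d. standard Cauchy variables is again standard Cauchy).

```python
# Illustrative sketch: sample means of Exp(1) variables (phi differentiable at zero,
# mu = 1) versus sample means of standard Cauchy variables (phi(u) = exp(-|u|),
# not differentiable at zero).
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000, 100000]:
    exp_mean = rng.exponential(1.0, size=n).mean()
    cauchy_mean = rng.standard_cauchy(size=n).mean()
    print(f"n = {n:6d}:  Exp(1) sample mean = {exp_mean:7.4f},  "
          f"Cauchy sample mean = {cauchy_mean:10.4f}")
```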
3.3. Probabilities of large deviations
The weak law of large numbers does not provide information on the probabilities
of large deviations of X̄n from µ. Derivation of results in this setting is the task of an
important and deep branch of probability theory, the large deviations theory. The
latter is beyond the scope of the present course. We only remark that treatment
of the case when a sequence of i.i.d. random variables {Xn} satisfies Cramér’s condition,
(3.4)  ∃ λ > 0 such that ϕ(λ) = E[e^{λX1}] < ∞,
is relatively elementary and refer the reader to pp. 400–403 in Shiryaev (1996) for details. Under (3.4), E[X1] = µ < ∞. The function ϕ is called the moment-generating function of X1, or the Laplace transform (of the law) of X1, as it is often called in the nonprobabilistic literature. It is obtained by replacing the argument of the characteristic function of X1 with −ıλ. In light of this the moment-generating function possesses many properties similar to those of a characteristic function, but unlike the latter it does not always exist. Define the function ψ by ψ(λ) = log ϕ(λ) (this is the cumulant-generating function of X1). The inequality one gets is
(3.5)  P(|X̄n − µ| ≥ ε) ≤ 2 exp(−n · min(H(µ − ε), H(µ + ε))),
where the function
H(a) = sup_{λ∈R} [aλ − ψ(λ)]
is called the Cramér transform of X1 (in terminology of convex analysis this is the
Legendre transform of the cumulant-generating function ψ). The Cramér transform
can be computed explicitly for a number of distributions, which yields explicit
bounds on large deviations probabilities.
Example 61. Let {Xn } be a sequence of i.i.d. Bernoulli random variables with
probability of success 0 < p < 1. Straightforward computations give that
H(a) = a log(a/p) + (1 − a) log((1 − a)/(1 − p))   if a ∈ [0, 1],
H(a) = ∞   otherwise.
Insert this expression in the right-hand side of (3.5) to obtain a bound on the
probabilities of large deviations.
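As an illustration, here is a minimal numerical sketch in Python (using numpy; the values p = 0.3, ε = 0.05, n = 500 and the seed are arbitrary choices): the bound (3.5) with H(a) as in Example 61 is compared with a Monte Carlo estimate of P(|X̄n − p| ≥ ε).

```python
# Illustrative sketch: the large-deviation bound (3.5) for Bernoulli(p) variables
# versus a Monte Carlo estimate of P(|bar X_n - p| >= eps).
import math
import numpy as np

def H(a, p):
    # Cramer transform of a Bernoulli(p) variable, for a strictly between 0 and 1
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

p, eps, n = 0.3, 0.05, 500
bound = 2 * math.exp(-n * min(H(p - eps, p), H(p + eps, p)))

rng = np.random.default_rng(1)
xbar = rng.binomial(n, p, size=200_000) / n          # 200000 realisations of bar X_n
mc = np.mean(np.abs(xbar - p) >= eps)

print(f"bound (3.5):          {bound:.4f}")
print(f"Monte Carlo estimate: {mc:.4f}")
```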
A much cruder bound on the probabilities of large deviations is obtained by applying Chebyshev’s inequality. If V[X1] = σ², then
P(|X̄n − E[X1]| ≥ ε) ≤ V[X̄n]/ε² = σ²/(nε²).
In particular, when {Xn } is an i.i.d. sequence of Bernoulli random variables with
probability of success p,
(3.6)  P(|X̄n − E[X1]| ≥ ε) ≤ p(1 − p)/(nε²) ≤ 1/(4nε²).
If we denote
pn(k) = C(n, k) p^k (1 − p)^{n−k},
where C(n, k) is the binomial coefficient,
the inequality (3.6) can be rewritten as
∑_{k : |k/n − p| ≥ ε} pn(k) ≤ 1/(4nε²).
We will use this fact to give a probabilistic proof of the Weierstraß theorem, which
asserts that for any continuous function u : [0, 1] → R there exists a sequence of
polynomials un , such that
(3.7)  lim_{n→∞} sup_{p∈[0,1]} |un(p) − u(p)| = 0,
see Theorem 7.26 in Rudin (1976). Take
un(p) = ∑_{k=0}^{n} u(k/n) C(n, k) p^k (1 − p)^{n−k}.
These are called Bernstein polynomials. We have E[u(X̄n)] = un(p).
Since the function u, being continuous on [0, 1], is uniformly continuous on that
interval, for every ε > 0 one can find δ > 0, such that |u(x) − u(y)| ≤ ε, whenever
|x − y| ≤ δ. Also note that u is bounded on [0, 1]. We then get
|un(p) − u(p)| = |∑_{k=0}^{n} (u(k/n) − u(p)) C(n, k) p^k (1 − p)^{n−k}|
  ≤ ∑_{k : |k/n−p| ≤ δ} |u(k/n) − u(p)| pn(k) + ∑_{k : |k/n−p| ≥ δ} |u(k/n) − u(p)| pn(k)
  ≤ ε + ||u||∞/(nδ²).
The bound on the right-hand side is independent of p. Let n → ∞ to obtain that lim sup_{n→∞} sup_{p∈[0,1]} |un(p) − u(p)| ≤ ε. Since ε is arbitrary, (3.7) follows.
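A small numerical sketch in Python (using numpy; the test function u(x) = |x − 1/2| and the grid are arbitrary choices) illustrates the uniform convergence of the Bernstein polynomials:

```python
# Illustrative sketch: uniform error of the Bernstein polynomial u_n for the
# arbitrarily chosen continuous function u(x) = |x - 1/2|.
import math
import numpy as np

u = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 201)                    # points p over which the sup is taken

for n in [10, 50, 250, 1000]:
    k = np.arange(n + 1)
    weights = np.array([math.comb(n, j) for j in k], dtype=float)   # binomial coefficients C(n, k)
    uk = u(k / n)
    err = max(abs(np.sum(uk * weights * p**k * (1 - p)**(n - k)) - u(p)) for p in grid)
    print(f"n = {n:5d}:  sup over the grid of |u_n(p) - u(p)| ~ {err:.4f}")
```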
3.4. Central limit theorem
Let {Xi} be a sequence of random variables. In general the distribution of the sum Sn = ∑_{i=1}^n Xi might have a complicated form and hence be difficult to compute. The central limit theorem provides a simple approximation to it that is very useful in practice.
Although the result holds in much greater generality, we will prove the central limit theorem only for a sequence of i.i.d. random variables with finite second
moments. The proof will yet again demonstrate the power of the method of characteristic functions.
Theorem 62 (Central limit theorem). Let {Xn} be a sequence of i.i.d. random variables with mean E[Xi] = µ and variance 0 < Var[Xi] = σ² < ∞. Let Sn = ∑_{i=1}^n Xi. Then
(Sn − nµ)/(σ√n) ⇝ N(0, 1).
Proof. Without loss of generality, we may suppose that E[Xi ] = 0 and Var[Xi ] =
1 (otherwise replace Xi with (Xi −µ)/σ, and note that this has mean 0 and variance
1). Let φ be the characteristic function of Xi . Since by assumption E[Xi2 ] = 1, the
characteristic function φ is twice differentiable and (cf. p. 290 in Hardy (1967) and
Proposition 37 (vi))
φ(u) = φ(0) + φ′(0)u + φ″(0)u²/2 + o(u²) = 1 − u²/2 + o(u²).
By independence of Xi ’s and Corollary 57 we then get for every fixed t ∈ R that
E[e^{itSn/√n}] = φ(t/√n)^n = (1 − (1/2)(t/√n)² + o(t²/n))^n = (1 − t²/(2n) + o(1/n))^n → e^{−t²/2}.
The limit being the characteristic function of a standard normal random variable
Z, the proof is completed upon invoking Lévy’s continuity theorem.
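The following Monte Carlo sketch in Python (using numpy; the choice of Exp(1) summands, the sample sizes and the seed are arbitrary) illustrates the theorem; it uses that the sum of n i.i.d. Exp(1) variables has a Gamma(n, 1) distribution, so the sums can be generated directly.

```python
# Illustrative sketch: the distribution of (S_n - n*mu)/(sigma*sqrt(n)) for Exp(1)
# summands (mu = sigma = 1), compared with the standard normal distribution.
import math
import numpy as np

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

rng = np.random.default_rng(2)
reps = 200_000
for n in [5, 30, 200]:
    s = rng.gamma(n, 1.0, size=reps)                 # reps independent copies of S_n
    z = (s - n) / math.sqrt(n)                       # standardized sums
    for x in (-1.0, 0.0, 1.0):
        print(f"n = {n:3d}, x = {x:+.1f}:  P(Z_n <= x) ~ {np.mean(z <= x):.4f}, "
              f"Phi(x) = {std_normal_cdf(x):.4f}")
```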
Example 63. Suppose we have an i.i.d. sample X1 , . . . , Xn from the Bernoulli
distribution with probability of success p, but we do not know p. The parameter p can be estimated by the sample mean p̂n = n⁻¹ ∑_{i=1}^n Xi. By the strong law of large numbers p̂n →a.s. p, and by the central limit theorem
√n (p̂n − p)/√(p(1 − p)) ⇝ N(0, 1).
Thus for large n the estimator p̂n has approximately the normal distribution with
mean p and variance p(1 − p)/n, which gives an idea of the precision with which p is recovered as n → ∞. The asymptotic variance p(1 − p)/n of the estimator can be estimated by p̂n(1 − p̂n)/n, and by Slutsky’s lemma
√n (p̂n − p)/√(p̂n(1 − p̂n)) ⇝ N(0, 1),
so that, roughly speaking, we do not need to know the value of p in order to
determine the precision with which it is recovered by p̂n : by a somewhat circular
argument the latter can again be estimated by using p̂n.
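A numerical sketch of this in Python (using numpy; p = 0.3, n = 200, the seed and the number of repetitions are arbitrary choices): the approximate 95% confidence interval p̂n ± 1.96 √(p̂n(1 − p̂n)/n), where 1.96 is the 97.5% quantile of the standard normal distribution, covers the true p in roughly 95% of repeated samples.

```python
# Illustrative sketch: coverage of the approximate 95% confidence interval
# p_hat +/- 1.96*sqrt(p_hat*(1 - p_hat)/n) suggested by Example 63.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 200, 100_000
p_hat = rng.binomial(n, p, size=reps) / n            # reps independent copies of p_hat
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)       # half-width of the interval
coverage = np.mean((p_hat - half <= p) & (p <= p_hat + half))
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```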
3.5. Delta method
Let {Xn } be a sequence of i.i.d. random variables with mean E[Xi ] = µ and
variance 0 < Var[Xi ] = σ 2 < ∞. By the central limit theorem,
(Sn − nµ)/(σ√n) ⇝ N(0, 1).
Can we say something about the weak convergence of a sequence {g(X n )}, where
g : R → R is some fixed function? Such questions often arise in statistics. When
g is differentiable, the answer is given by the following result, known as the delta
method.
Theorem 64 (Delta method). Assume that the conditions of the central limit theorem (Theorem 62) hold. Let g be differentiable at µ and g′(µ) ≠ 0. Then
√n (g(X̄n) − g(µ))/(σ g′(µ)) ⇝ N(0, 1).
Proof. The proof is an instance of an elegant application of the almost sure representation theorem. On some probability space (Ω̃, F̃, P̃) there exist random variables
Z̃n =d (Sn − nµ)/(σ√n),   Z̃ ∼ N(0, 1),
such that Z̃n →a.s. Z̃ (under P̃). By the foregoing, the definition of a derivative, the facts that σZ̃n/√n →a.s. 0 and P̃(Z̃ ≠ 0) = 1, and the continuous mapping theorem we have
√n (g(X̄n) − g(µ))/(σg′(µ)) =d √n (g(µ + σZ̃n/√n) − g(µ))/(σg′(µ)) · 1[Z̃n ≠ 0]
  = [(g(µ + σZ̃n/√n) − g(µ))/(σZ̃n/√n)] · [σZ̃n/(σg′(µ))] · 1[Z̃n ≠ 0]
  →a.s. g′(µ) · σZ̃/(σg′(µ)) · 1[Z̃ ≠ 0].
The last term is equal to Z̃ (P̃-almost surely), whence it follows that
√n (g(X̄n) − g(µ))/(σg′(µ)) ⇝ N(0, 1)
on the original probability space.
Example 65. This is a continuation of Example 63. Suppose we want to
estimate the odds r = p/(1 − p). For example, if the data X1 , . . . , Xn are the
outcomes of a medical treatment with p = 3/4, then a patient has odds 3 : 1 of
getting better. A natural estimator of r is r̂n = p̂n /(1 − p̂n ), but how good is this
estimator? Assume 0 < p < 1. Firstly, by the strong law of large numbers and
the continuous mapping theorem, r̂n →a.s. r. Secondly, by the delta method (take g(p) = p/(1 − p) in Theorem 64)
√(n(1 − p)³/p) (r̂n − r) ⇝ N(0, 1),
so that for large n the estimator r̂n is approximately normally distributed with
mean r and variance p/[n(1 − p)3 ]. The latter can be estimated by p̂n /[n(1 − p̂n )3 ]
and an application of Slutsky’s lemma yields
√(n(1 − p̂n)³/p̂n) (r̂n − r) ⇝ N(0, 1).
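A Monte Carlo sketch of this in Python (using numpy; p = 0.75, so r = 3, n = 400 and the seed are arbitrary choices): after the delta-method normalization the odds estimator should be approximately standard normal.

```python
# Illustrative sketch: the delta-method normalization of the odds estimator
# r_hat = p_hat/(1 - p_hat) from Example 65.
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.75, 400, 200_000
r = p / (1 - p)
p_hat = rng.binomial(n, p, size=reps) / n            # P(p_hat = 1) = 0.75^400 is negligible
r_hat = p_hat / (1 - p_hat)
z = np.sqrt(n * (1 - p) ** 3 / p) * (r_hat - r)      # should be approximately N(0, 1)
print(f"sample mean of z: {z.mean():+.3f}  (target 0)")
print(f"sample std  of z: {z.std():.3f}   (target 1)")
```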
3.6. Berry-Esseen theorem
Convergence of some quantity to a limit inevitably leads to the question of the
rate of convergence. In the setting of the central limit theorem proved above, the question is this: let Fn be the distribution function of (Sn − nµ)/(σ√n). Theorem 62 implies that for all x ∈ R, Fn(x) → Φ(x). Can we say something about the rate at which the difference |Fn(x) − Φ(x)| converges to zero? A good estimate
on this quantity might be necessary in applications, in particular for numerical
computations. One result in this direction is the Berry-Esseen theorem.
Theorem 66 (Berry-Esseen theorem). Let {Xn } be a sequence of i.i.d. random
variables with mean zero, variance σ 2 and the third absolute moment γ = E[|Xi |3 ] <
∞. Then there exists a universal constant A0 , such that
sup_{x∈R} |Fn(x) − Φ(x)| ≤ (A0 γ)/(σ³√n).
We will not prove this theorem. For the proof see e.g. Chung (2001), Section
7.4 (that particular proof is based on the method of characteristic functions). The
exact value of the constant A0 is not known, but there exist good estimates on it
(the latest (?) one is A0 ≤ 0.5129; this is quite sharp, because there also holds a lower bound proved by Esseen: A0 ≥ (√10 + 3)/(6√(2π)) ≈ 0.40973). The estimate
in Theorem 66 is ‘generic’. For specific distributions, tighter bounds might hold.
For instance, let X1 , . . . , Xn be jointly normal and i.i.d. with Xi ∼ N (0, 1). Then
Fn = Φ and supx∈R |Fn (x) − Φ(x)| is in fact zero.
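A numerical sketch in Python (standard library only; the value p = 0.2 and the use of the quoted constant A0 = 0.5129 are choices made for illustration): for centered Bernoulli(p) summands the Kolmogorov distance sup_x |Fn(x) − Φ(x)| can be computed exactly from the binomial distribution and compared with the Berry-Esseen bound.

```python
# Illustrative sketch: exact Kolmogorov distance between the law of
# (S_n - np)/sqrt(n p (1-p)), S_n ~ Bin(n, p), and the standard normal law,
# compared with the Berry-Esseen bound A0*gamma/(sigma^3*sqrt(n)) with A0 = 0.5129.
import math

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p = 0.2
sigma = math.sqrt(p * (1 - p))
gamma = p * (1 - p) ** 3 + (1 - p) * p ** 3          # E|X_i|^3 for X_i = B_i - p, B_i ~ Bernoulli(p)

for n in [10, 100, 1000]:
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    dist, cdf = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))     # location of the k-th jump of F_n
        phi = std_normal_cdf(x)
        dist = max(dist, abs(cdf - phi))             # just to the left of the jump
        cdf += pmf[k]
        dist = max(dist, abs(cdf - phi))             # at the jump
    bound = 0.5129 * gamma / (sigma ** 3 * math.sqrt(n))
    print(f"n = {n:5d}:  sup_x |F_n(x) - Phi(x)| = {dist:.4f},  bound = {bound:.4f}")
```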
Exercises
1 Let {Xn } be a sequence of random variables with E[|Xn |] < ∞ and V[Xn ] < ∞.
Assume that the covariances Cov[Xi , Xj ] → 0 as |i − j| → ∞. Prove the following
version of the law of large numbers (due to Bernstein):
P(|(1/n) ∑_{i=1}^{n} Xi − (1/n) ∑_{i=1}^{n} E[Xi]| > ε) → 0
as n → ∞. Hint: a sequence of random variables {ξn } converges to zero in
probability, when both the mean E[ξn ] and the variance V[ξn ] converge to zero
as n → ∞ (show this).
2 Let {Xn} be a sequence of i.i.d. random variables. Prove that Sn = n^{−1/2} ∑_{i=1}^n Xi converges in probability as n → ∞ if and only if P(X1 = 0) = 1.
3 Let {Xn} be a sequence of i.i.d. random variables with E[X1²] < ∞. Prove that
max(|X1|, . . . , |Xn|)/√n ⇝ 0
as n → ∞. Hint: for any ε > 0,
P(max(|X1|, . . . , |Xn|)/√n ≤ ε) = P(X1² ≤ nε²)^n
and nε² P(X1² > nε²) → 0 as n → ∞.
4 Let {Xn } be a sequence of i.i.d. random variables with mean zero and variance
one, and let {dn} be a sequence of nonnegative numbers, such that dn = o(Dn) for Dn² = ∑_{i=1}^n di². Prove that the sequence {dn Xn} satisfies the central limit theorem:
(1/Dn) ∑_{i=1}^n di Xi ⇝ Z
for Z ∼ N(0, 1).
5 Let {Xn } be a sequence of i.i.d. random variables, such that P(X1 = ±1) = 1/2.
Set Si = ∑_{k=1}^i Xk. Show that
(1/√n) max_{1≤i≤n} Si ⇝ |Z|,
where Z ∼ N (0, 1).
6 Let {Xn} be a sequence of i.i.d. random variables with E[X1] = 0 and E[X1²] = 1. Show that
√n X̄n/σn ⇝ N(0, 1),
where
X̄n = (1/n) ∑_{i=1}^n Xi,   σn² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)².
Incidentally, the result of this exercise also shows that if Yn possesses a t-distribution with n degrees of freedom, then Yn ⇝ N(0, 1). Explain why.
7 Let Xn have a Bin(n, λ/n) distribution (for n > λ). Show that Xn ⇝ X, where X has a Poisson(λ) distribution. This result is known as the Poisson theorem.
8 Let X, X1 , X2 , . . . be a sequence of random variables and Y a N (0, 1)-distributed
random variable independent of that sequence. Let φn be the characteristic
function of Xn and φ that of X. Let pn be the density of Xn + σY and p the
density of X + σY .
(i) If φn → φ pointwise, then pn → p pointwise.
(ii) Let f : R → R be bounded by B. Show that |Ef(Xn + σY) − Ef(X + σY)| ≤ 2B ∫ (p(z) − pn(z))⁺ dz.
(iii) Show that |Ef (Xn + σY ) − Ef (X + σY )| → 0 (with f bounded) if φn → φ
pointwise.
(iv) Prove without referring to Corollary 55 that Xn ⇝ X iff φn → φ pointwise (hint: one implication is straightforward, for the other the result of Exercise
(hint: one implication is straightforward, for the other the result of Exercise
1.1 is useful).
Bibliography
J. Bak and D. J. Newman. Complex analysis. Third edition. Undergraduate Texts
in Mathematics. Springer, New York, 2010.
P. Billingsley. Weak Convergence of Measures: Applications in Probability. Conference Board of the Mathematical Sciences Regional Conference Series in Applied
Mathematics, No. 5. Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1971.
K. L. Chung. A Course in Probability Theory. Third edition. Academic Press, Inc.,
San Diego, CA, 2001.
G. H. Hardy. A Course of Pure Mathematics. Tenth edition. Cambridge University
Press, Cambridge, 1967.
K. R. Parthasarathy. Probability measures on metric spaces. Reprint of the 1967
original. AMS Chelsea Publishing, Providence, RI, 2005.
Yu. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl., 1(2), 157–214, 1956.
S. I. Resnick. A Probability Path. Birkhäuser Boston, Inc., Boston, MA, 1999.
P. Révész. The Laws of Large Numbers. Academic Press, New York, 1968.
W. Rudin. Principles of Mathematical Analysis. Third edition. International Series
in Pure and Applied Mathematics. McGraw-Hill Book Co., New York-Auckland-Düsseldorf, 1976.
A. N. Shiryaev. Probability. Translated from the first (1980) Russian edition by R.
P. Boas. Second edition. Graduate Texts in Mathematics, 95. Springer-Verlag,
New York, 1996.
A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and
Probabilistic Mathematics, 3. Cambridge University Press, Cambridge, 1998.