A note on “infinitely often,” “probability 1,” and the
law of large numbers
Michael J. Neely
University of Southern California
http://www-bcf.usc.edu/~mjneely
Abstract
These notes give details on the probability concepts of “infinitely often” and “with probability 1.” This is useful for
understanding the Borel-Cantelli lemma and the strong law of large numbers.
I. SEQUENCES OF EVENTS
A. Events
Consider a general sample space S. Recall that an event is a subset of S that has a well defined probability. That is, a set
A is an event if and only if A ⊆ S and P [A] exists. Formally, a probability experiment introduces both a sample space S
and a sigma algebra F. The sigma algebra F contains all events, that is, it is the collection of all subsets of S for which
probabilities are defined.
B. Can an event be “true”?
It is common to talk about events as being either “true” or “false.” However, if an event is just a subset of S, does it
make sense to say that an event can be “true”? When we say that an event A is “true” (or that an event A “occurs”), we are
imagining a probability experiment that produces an outcome ω ∈ S for which ω ∈ A. The event is “false” if the outcome satisfies ω ∉ A.
C. Shrinking a sequence of events
Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events. Let $A$ be another event. We say that $A_n \searrow A$ if:
• $A_n \supseteq A_{n+1}$ for all $n \in \{1, 2, 3, \ldots\}$.
• $\cap_{n=1}^{\infty} A_n = A$.
We know that if $A_n \searrow A$ then $P[A_n] \searrow P[A]$.
D. Infinitely often
Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events. We say that events in the sequence occur “infinitely often” if $A_i$ holds true for an infinite number of indices $i \in \{1, 2, 3, \ldots\}$. Define $\{A_i \text{ i.o.}\}$ as the event that an infinite number of the events $A_i$ occur. Formally, this new event needs to be a subset of $S$. Hence:
$$\{A_i \text{ i.o.}\} = \{\omega \in S : \omega \in A_i \text{ for an infinite number of indices } i \in \{1, 2, 3, \ldots\}\} \tag{1}$$
Lemma 1. For any sequence of events $\{A_n\}_{n=1}^{\infty}$ we have:
$$\cup_{i=n}^{\infty} A_i \searrow \{A_i \text{ i.o.}\} \tag{2}$$
and so
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} A_i] = P[A_i \text{ i.o.}]$$
Proof. To prove (2), note that:
$$\cup_{i=n}^{\infty} A_i \supseteq \cup_{i=n+1}^{\infty} A_i \quad \forall n \in \{1, 2, 3, \ldots\}$$
It remains to show that:
$$\cap_{n=1}^{\infty} \cup_{i=n}^{\infty} A_i = \{A_i \text{ i.o.}\} \tag{3}$$
This can be done by considering two cases (left as an exercise):
1) Case 1: Suppose $\{A_i \text{ i.o.}\}$ is true. Show that $\cup_{i=n}^{\infty} A_i$ must be true for all $n \in \{1, 2, 3, \ldots\}$.
2) Case 2: Suppose $\{A_i \text{ i.o.}\}$ is false. Show that $\cup_{i=n}^{\infty} A_i$ must be false for some $n \in \{1, 2, 3, \ldots\}$.
Note that equation (3) in the above proof shows that the event {Ai i.o.} can be written in terms of unions and intersections
of the original events Ai . Formally, this means that the set {Ai i.o.} defined in (1) is indeed in the sigma algebra of events.
E. Finitely often
We say that $A_i$ occurs “finitely often” if $A_i$ holds true for an at most finite number of indices $i \in \{1, 2, 3, \ldots\}$. Specifically:
$$\{A_i \text{ f.o.}\} = \{A_i \text{ i.o.}\}^c$$
By taking complements in Lemma 1 we obtain:
$$\cap_{i=n}^{\infty} A_i^c \nearrow \{A_i \text{ f.o.}\} \implies \lim_{n\to\infty} P[\cap_{i=n}^{\infty} A_i^c] = P[A_i \text{ f.o.}]$$
Taking complements of (3) gives:
$$\cup_{n=1}^{\infty} \cap_{i=n}^{\infty} A_i^c = \{A_i \text{ f.o.}\}$$
F. Preliminaries for Borel-Cantelli lemma
We need two facts about real numbers:
1) $\log(1+x) \le x$ for all $x \in (-1, \infty)$.
2) If $\{x_i\}_{i=1}^{\infty}$ is a sequence of non-negative real numbers such that $\sum_{i=1}^{\infty} x_i < \infty$, then $\lim_{n\to\infty} \sum_{i=n}^{\infty} x_i = 0$.
The second fact is proven as follows: For all positive integers $n$ we have:
$$\sum_{i=1}^{\infty} x_i = \sum_{i=1}^{n-1} x_i + \sum_{i=n}^{\infty} x_i$$
Taking a limit as $n \to \infty$ gives:
$$\sum_{i=1}^{\infty} x_i = \sum_{i=1}^{\infty} x_i + \lim_{n\to\infty} \sum_{i=n}^{\infty} x_i \tag{4}$$
Now, the equation $\infty = \infty + y$ is true for all $y \in \mathbb{R}$ (we cannot cancel $\infty$ from both sides to conclude $y = 0$). However, if $\sum_{i=1}^{\infty} x_i < \infty$, we can cancel it from both sides of (4) to conclude:
$$0 = \lim_{n\to\infty} \sum_{i=n}^{\infty} x_i$$
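A quick numerical sanity check of fact 2, as a minimal Python sketch (the notes themselves contain no code; the script and its truncation length are illustrative assumptions): the tails of the convergent series $\sum 1/i^2$ shrink toward 0.

```python
# Numerical check of fact 2: a convergent series has vanishing tails.
# We approximate the infinite tail sum_{i=n}^infinity 1/i^2 by truncating
# after a large (assumed) number of terms.

def tail_sum(n, terms=10**6):
    """Approximate sum of 1/i^2 for i = n, n+1, ..., n+terms-1."""
    return sum(1.0 / (i * i) for i in range(n, n + terms))

for n in [1, 10, 100, 1000]:
    print(n, tail_sum(n))
# The printed tails decrease roughly like 1/n: ~1.64, ~0.105, ~0.0100, ~0.0010
```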
G. Borel-Cantelli lemma
Lemma 2. (Borel-Cantelli) Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events.
a) If $\sum_{i=1}^{\infty} P[A_i] < \infty$, then $P[A_i \text{ f.o.}] = 1$.
b) If $\sum_{i=1}^{\infty} P[A_i] = \infty$ and if the events are mutually independent, then $P[A_i \text{ i.o.}] = 1$.
Proof. (Part (a)) Suppose $\sum_{i=1}^{\infty} P[A_i] < \infty$. For all positive integers $n$:
$$\{A_i \text{ i.o.}\} \subseteq \cup_{i=n}^{\infty} A_i$$
Thus:
$$P[A_i \text{ i.o.}] \le P[\cup_{i=n}^{\infty} A_i] \le \sum_{i=n}^{\infty} P[A_i]$$
where the final inequality is the union bound. Taking a limit as $n \to \infty$ gives:
$$P[A_i \text{ i.o.}] \le \lim_{n\to\infty} \sum_{i=n}^{\infty} P[A_i] = 0 \tag{5}$$
where the final limit holds because $\sum_{i=1}^{\infty} P[A_i] < \infty$ (recall fact 2 in the previous subsection). Hence $P[A_i \text{ i.o.}] = 0$ and so $P[A_i \text{ f.o.}] = 1$.
Proof. (Part (b)) Suppose $\sum_{i=1}^{\infty} P[A_i] = \infty$ and that the events are mutually independent. Fix $n$ as a positive integer. We have:
$$P[\cap_{i=n}^{\infty} A_i^c] = \prod_{i=n}^{\infty} P[A_i^c] = \prod_{i=n}^{\infty} (1 - P[A_i])$$
Taking the log of both sides and using the fact that $\log(1+x) \le x$ gives:
$$\log(P[\cap_{i=n}^{\infty} A_i^c]) = \sum_{i=n}^{\infty} \log(1 - P[A_i]) \le -\sum_{i=n}^{\infty} P[A_i] = -\infty$$
Thus, $P[\cap_{i=n}^{\infty} A_i^c] = 0$. This holds for all $n$. Thus:
$$0 = P[\cap_{i=n}^{\infty} A_i^c] \nearrow P[A_i \text{ f.o.}]$$
So $P[A_i \text{ f.o.}] = 0$. Hence, $P[A_i \text{ i.o.}] = 1$.
Notice that $\sum_{i=1}^{\infty} P[A_i]$ represents the expected number of events that occur. Indeed, one can define indicator functions for each $i \in \{1, 2, 3, \ldots\}$ by:
$$I_i = \begin{cases} 1 & \text{if event } A_i \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$
The number of events that occur is then $N = \sum_{i=1}^{\infty} I_i$. It can be shown that the expectation of a countably infinite sum of nonnegative random variables is equal to the sum of the expectations. Hence:
$$E[N] = \sum_{i=1}^{\infty} E[I_i] = \sum_{i=1}^{\infty} P[A_i]$$
H. Coin flipping example
Suppose we flip a sequence of coins. Let $A_i$ be the event that coin flip $i$ produces “heads.” Suppose that:
$$P[A_i] = 1/i \quad \forall i \in \{1, 2, 3, \ldots\}$$
Then:
• We know $\sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} 1/i = \infty$. Thus, if the coin flips are mutually independent, the Borel-Cantelli lemma ensures that, with probability 1, there will be an infinite number of heads. This may be surprising because the probability of heads shrinks down to 0. However, the probabilities do not shrink rapidly enough for their sum to be finite.
• Consider the sparsely sampled subsequence of events, sampled at indices that are perfect squares:
$$\{A_{n^2}\}_{n=1}^{\infty} = \{A_1, A_4, A_9, A_{16}, \ldots\}$$
We know $\sum_{n=1}^{\infty} P[A_{n^2}] = \sum_{n=1}^{\infty} 1/n^2 < \infty$. Thus, the Borel-Cantelli lemma ensures that, with probability 1, only a finite number of perfect square indices produce heads. This holds regardless of whether or not the flips are mutually independent. Indeed, the probabilities $P[A_{n^2}]$ shrink rapidly enough for their sum to be finite. A small simulation sketch follows.
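To make the example concrete, here is a minimal Monte Carlo sketch in Python (the seed, the $10^6$-flip horizon, and the use of numpy are illustrative assumptions, not part of the original notes). It simulates independent flips with $P[A_i] = 1/i$ and counts heads overall and at perfect-square indices.

```python
import numpy as np

# Simulate independent coin flips where flip i lands heads with probability 1/i.
rng = np.random.default_rng(0)
N = 10**6
i = np.arange(1, N + 1)
heads = rng.random(N) < 1.0 / i                   # True where flip i is heads

# Heads keep accumulating: sum(1/i) diverges (the count grows like log N).
print("heads among first 10^6 flips:", heads.sum())
print("expected number:", (1.0 / i).sum())        # harmonic sum, about 14.39

# Heads at perfect-square indices are rare: sum(1/n^2) converges.
sq = np.arange(1, int(N**0.5) + 1) ** 2           # indices 1, 4, 9, ..., 10^6
print("heads at perfect-square indices:", heads[sq - 1].sum())
print("expected number:", (1.0 / sq).sum())       # partial sum of pi^2/6, about 1.64
```

In a finite simulation the total head count is of course finite; the point is that it keeps growing without bound as the horizon increases, while the perfect-square head count stabilizes.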
II. SEQUENCES OF RANDOM VARIABLES
Let S be a sample space.
A. Markov and Chebyshev inequalities
Recall that:
Lemma 3. (Markov inequality) If $X$ is a non-negative random variable, then for all $\epsilon > 0$:
$$P[X \ge \epsilon] \le \frac{E[X]}{\epsilon}$$
Lemma 4. (Chebyshev inequality) If $X$ is a random variable (possibly negative) with mean $m$ and variance $\sigma^2$, then for all $\epsilon > 0$:
$$P[|X - m| \ge \epsilon] \le \frac{Var(X)}{\epsilon^2}$$
The Chebyshev inequality can be proven from the Markov inequality by defining the non-negative random variable $Y = (X - m)^2$ and noting that $\{|X - m| \ge \epsilon\} = \{(X - m)^2 \ge \epsilon^2\}$.
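As a quick illustration, the following Python sketch (the Exponential(1) distribution, seed, and sample size are assumptions chosen for demonstration) compares the empirical deviation probability against the Chebyshev bound. For Exponential(1) we have $m = 1$ and $\sigma^2 = 1$, so the bound is $1/\epsilon^2$.

```python
import numpy as np

# Monte Carlo check of the Chebyshev inequality for X ~ Exponential(1):
# mean m = 1 and variance sigma^2 = 1, so P[|X - 1| >= eps] <= 1/eps^2.
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=10**6)

for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - 1.0) >= eps)   # fraction of large deviations
    bound = 1.0 / eps**2                           # Chebyshev upper bound
    print(f"eps={eps}: empirical {empirical:.4f} <= bound {bound:.4f}")
```

The bound is loose (as Chebyshev typically is), but it always sits above the empirical frequency.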
B. Random sequences defined by outcomes in S
We want to treat infinite sequences of random variables {X1 , X2 , X3 , . . .}. When working with random sequences, we
know that the entire sequence is determined by the particular outcome ω in the sample space S. That is, the sequence
{X1 , X2 , X3 , . . .} can be written more formally as {X1 (ω), X2 (ω), X3 (ω), . . .}.
Suppose X is another random variable on the same probability space, so that X can be written as X(ω). We say that
limn→∞ Xn = X is true if the probability experiment produces an outcome ω ∈ S such that limn→∞ Xn (ω) = X(ω).
Formally:
$$\{\lim_{n\to\infty} X_n = X\} = \{\omega \in S : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$$
The complement event $\{\lim_{n\to\infty} X_n = X\}^c$ is the set of all outcomes ω for which either the limit does not exist or the limit exists but is not equal to X(ω).
We say that a sequence of random variables $\{X_n\}_{n=1}^{\infty}$ converges to a random variable $X$ with probability 1 if:
$$P[\lim_{n\to\infty} X_n = X] = 1$$
Probability 1 convergence is also called “almost sure” convergence. It is useful to remember the standard definition of a limit:
Definition 1. $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ if for all $\epsilon > 0$, there exists an integer $k$ such that:
$$|X_n(\omega) - X(\omega)| \le \epsilon \quad \forall n \ge k$$
C. Different types of convergence
Here are three useful types of convergence: (i) mean square convergence, (ii) convergence in probability, (iii) convergence
with probability 1.
Definition 2. $X_n \to X$ in mean square if:
$$\lim_{n\to\infty} E[(X_n - X)^2] = 0$$
Definition 3. $X_n \to X$ in probability if for all $\epsilon > 0$:
$$\lim_{n\to\infty} P[|X_n - X| > \epsilon] = 0$$
Definition 4. $X_n \to X$ with probability 1 if for all $\epsilon > 0$:
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = 0$$
equivalently, if $P[\lim_{n\to\infty} X_n = X] = 1$.
Subsection II-E shows why the two definitions given for “probability 1 convergence” are equivalent. Notice that probability 1 convergence implies convergence in probability. That is because, for all positive integers $n$,
$$\{|X_n - X| > \epsilon\} \subseteq \cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}$$
and so
$$P[|X_n - X| > \epsilon] \le P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}]$$
It can also be shown (via the Markov inequality) that mean square convergence implies convergence in probability. If the
random variables Xn are deterministically bounded, say, always taking values in [−M, M ] for some positive integer M , then
it can be shown that convergence in probability is equivalent to convergence in mean square.
There are examples where bounded random variables converge in probability, but not with probability 1 (try constructing
such an example using the coin flips of Section I-H). There are also examples where (unbounded) random variables converge
with probability 1, but not in mean square.
D. A useful fact about limits
Let $\{X_n\}_{n=1}^{\infty}$ be random variables and let $X$ be another random variable, all on the same probability space $S$. It can be shown that:
$$\{\lim_{n\to\infty} X_n = X\} = \cap_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ f.o.}\} \tag{6}$$
The right-hand side is the event that, for each positive integer $k$, the inequality $|X_i - X| > 1/k$ holds for only a finite number of indices $i$, so that there is an $n$ for which $|X_i - X| \le 1/k$ for all $i \ge n$. Taking complements of (6) gives:
$$\{\lim_{n\to\infty} X_n = X\}^c = \cup_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ i.o.}\} \tag{7}$$
As a minor detail: There is nothing special about the $1/k$ function used in (6). It can equally be replaced by any positive function $g(k)$ that satisfies $\lim_{k\to\infty} g(k) = 0$.
E. Equivalent definitions of “with probability 1”
Lemma 5. The following are equivalent statements, all meaning that $X_n \to X$ with probability 1.
1) Statement 1:
$$P[\lim_{n\to\infty} X_n = X] = 1$$
2) Statement 2: For all $\epsilon > 0$ we have:
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = 0$$
3) Statement 3: For all $\epsilon > 0$ we have:
$$P[|X_i - X| > \epsilon \text{ i.o.}] = 0$$
Proof. (2 ⟺ 3) To understand the equivalence between statements 2 and 3, fix $\epsilon > 0$ and observe that
$$\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\} \searrow \{|X_i - X| > \epsilon \text{ i.o.}\}$$
Hence
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = P[|X_i - X| > \epsilon \text{ i.o.}]$$
Thus, the probabilities in statements 2 and 3 are the same, so the two statements are equivalent.
Proof. (3 ⟹ 1) Suppose statement 3 holds, so for all $\epsilon > 0$ we have $P[|X_i - X| > \epsilon \text{ i.o.}] = 0$. We want to show that $P[\lim_{n\to\infty} X_n = X] = 1$. It suffices to show that $P[(\lim_{n\to\infty} X_n = X)^c] = 0$. By (7):
$$P[(\lim_{n\to\infty} X_n = X)^c] = P[\cup_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ i.o.}\}] \overset{(a)}{\le} \sum_{k=1}^{\infty} P[|X_i - X| > 1/k \text{ i.o.}] = \sum_{k=1}^{\infty} 0 = 0$$
where (a) holds by the union bound.
Proof. (1 ⟹ 3) Suppose statement 1 holds, so that $P[\lim_{n\to\infty} X_n = X] = 1$. We want to show statement 3 holds. It suffices to show that for all positive integers $m$ we have $P[|X_i - X| > 1/m \text{ i.o.}] = 0$. Fix a positive integer $m$. Then:
$$1 = P[\lim_{n\to\infty} X_n = X] \overset{(a)}{=} P[\cap_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ f.o.}\}] \le P[|X_i - X| > 1/m \text{ f.o.}]$$
where (a) holds by (6). It follows that $P[|X_i - X| > 1/m \text{ f.o.}] = 1$, and so the probability of its complement is 0.
III. LAW OF LARGE NUMBERS
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of mutually independent and identically distributed (i.i.d.) random variables with mean $m$ and variance $\sigma^2$. For now, assume both the mean and variance are finite. For positive integers $n$, define $L_n$ as the average over the first $n$ random variables:
$$L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Notice that:
$$E[L_n] = m, \quad Var(L_n) = \frac{\sigma^2}{n} \quad \forall n \in \{1, 2, 3, \ldots\}$$
The fact that the mean is always $m$ and the variance shrinks to 0 suggests that the random variables $L_n$ are converging to the mean $m$. In what sense does this convergence hold? The variance condition directly implies that the averages $L_n$ converge to $m$ in the mean square sense. The Chebyshev inequality can be used to further show that convergence in probability holds. The strong law of large numbers makes the much deeper claim that convergence in fact holds with probability 1.
Theorem 1. (Strong law of large numbers) Suppose $\{X_i\}_{i=1}^{\infty}$ are i.i.d. with finite mean $m$ and finite variance $\sigma^2$. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then:
a) $L_n \to m$ in probability.
b) $L_n \to m$ with probability 1.
c) Both (a) and (b) hold even if $\sigma^2 = \infty$.
d) Parts (a)-(c) hold if “i.i.d.” is replaced by “identically distributed and pairwise independent.”
The statement in part (a) is referred to as the “weak law of large numbers.” It is included for pedagogical reasons: Its proof
is very simple and gives intuition behind the deeper proof for the strong law.
Proof. (LLN part (a)) Fix $\epsilon > 0$. By the Chebyshev inequality:
$$P[|L_n - m| > \epsilon] \le \frac{Var(L_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$
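The Chebyshev bound in this proof can be checked numerically. The following Python sketch is illustrative only (the Uniform(0,1) distribution, $\epsilon = 0.05$, the seed, and the trial count are assumptions): it estimates $P[|L_n - m| > \epsilon]$ by simulation and compares it with $\sigma^2/(n\epsilon^2)$.

```python
import numpy as np

# Estimate P[|L_n - m| > eps] for i.i.d. Uniform(0,1) samples, where
# m = 1/2 and sigma^2 = 1/12, and compare with the Chebyshev bound.
rng = np.random.default_rng(2)
eps, trials = 0.05, 10_000

for n in [10, 100, 1000]:
    L = rng.random((trials, n)).mean(axis=1)        # one sample mean per trial
    empirical = np.mean(np.abs(L - 0.5) > eps)
    bound = (1.0 / 12.0) / (n * eps**2)
    print(f"n={n}: empirical {empirical:.4f}, bound {min(bound, 1.0):.4f}")
```

Both the empirical probability and the bound shrink as $n$ grows; the bound is loose but is all the proof needs.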
Proof. (LLN part (b)) Suppose the result of part (b) holds whenever the random variables $\{X_n\}_{n=1}^{\infty}$ are nonnegative (the next subsection shows it is indeed true in the nonnegative case). For general (possibly negative) random variables $\{X_n\}_{n=1}^{\infty}$ define:
$$X_n^+ = \max[X_n, 0], \quad X_n^- = \max[-X_n, 0]$$
Then $X_n = X_n^+ - X_n^-$, and $X_n^+$ and $X_n^-$ are nonnegative. Because $\{X_n\}_{n=1}^{\infty}$ are i.i.d. with finite means and variances, the random variables $\{X_n^+\}_{n=1}^{\infty}$ are also i.i.d. with finite means and variances, as are the variables $\{X_n^-\}_{n=1}^{\infty}$. Define $m^+ = E[X_1^+]$ and $m^- = E[X_1^-]$. Then:
$$m = E[X_1] = E[X_1^+] - E[X_1^-] = m^+ - m^-$$
Thus, with probability 1:
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^- = m^+ - m^- = m$$
For a proof of parts (c)-(d), see [1].
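Probability-1 convergence is a statement about individual sample paths: along a single realized sequence, the running average settles down to $m$. A minimal Python sketch of one such path (the Exponential distribution with mean 0.5, the seed, and the horizon are assumptions for illustration):

```python
import numpy as np

# One sample path of the running average L_n for i.i.d. Exponential
# variables with mean m = 0.5. The strong law says this path converges
# to 0.5 with probability 1.
rng = np.random.default_rng(3)
X = rng.exponential(scale=0.5, size=10**6)
L = np.cumsum(X) / np.arange(1, X.size + 1)       # L_n for n = 1, ..., 10^6

for n in [10, 1000, 10**5, 10**6]:
    print(f"L_{n} = {L[n-1]:.5f}")                # drifts toward 0.5
```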
A. Extensions
The following exercise extends the weak law of large numbers to treat random variables with possible dependencies and
with different distributions.
Exercise 1. Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of random variables with finite means and variances. Suppose:
• $E[X_i] = m$ for all $i \in \{1, 2, 3, \ldots\}$.
• $Var(X_i) = \sigma_i^2$ for all $i \in \{1, 2, 3, \ldots\}$.
• The random variables are pairwise uncorrelated, so that $E[X_i X_j] = m^2$ whenever $i \neq j$.
• $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$.
a) Prove that $\frac{1}{n} \sum_{i=1}^{n} X_i \to m$ in probability. Hint: Just follow the standard proof of the weak law.
b) Prove that $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$ holds whenever $\sigma_i^2 \le \sigma_{max}^2$ for all $i \in \{1, 2, 3, \ldots\}$, where $\sigma_{max}^2$ is a finite constant.
Exercise 2. Consider the same scenario as the previous exercise, but replace the condition $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$ with the more stringent condition $\sigma_i^2 \le \sigma_{max}^2$ for all $i \in \{1, 2, 3, \ldots\}$, where $\sigma_{max}^2$ is a finite constant. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Prove that $L_{n^2}$ converges to $m$ with probability 1. Hint: Prove for all $\epsilon > 0$ that $P[|L_{n^2} - m| > \epsilon \text{ i.o.}] = 0$.
If the average of a sequence $X_n$ converges to a constant $m$ when sampled over a sparse subsequence of times that correspond to perfect squares, does that mean the average converges to $m$? It does in the special case when the sequence $X_n$ is non-negative. Specifically, suppose $\{X_1, X_2, X_3, \ldots\}$ is a non-negative sequence. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Suppose we know that $L_{n^2} \to m$ for some finite constant $m$. We want to show that $L_n \to m$. For any positive integer $k$, define $n[k]$ as the largest integer $n$ such that $n^2 \le k$. Thus, for all positive integers $k$ we have
$$n[k]^2 \le k < (n[k]+1)^2$$
Then for any positive integer $k$ we have (because the $X_i$ values are non-negative):
$$\frac{1}{(n[k]+1)^2} \sum_{i=1}^{n[k]^2} X_i \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{1}{n[k]^2} \sum_{i=1}^{(n[k]+1)^2} X_i$$
Thus:
$$\frac{n[k]^2}{(n[k]+1)^2} \underbrace{\frac{1}{n[k]^2} \sum_{i=1}^{n[k]^2} X_i}_{L_{n[k]^2}} \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{(n[k]+1)^2}{n[k]^2} \underbrace{\frac{1}{(n[k]+1)^2} \sum_{i=1}^{(n[k]+1)^2} X_i}_{L_{(n[k]+1)^2}}$$
Hence, for all positive integers $k$:
$$\frac{n[k]^2}{(n[k]+1)^2} L_{n[k]^2} \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{(n[k]+1)^2}{n[k]^2} L_{(n[k]+1)^2}$$
Note that $\lim_{k\to\infty} n[k] = \infty$. Taking a limit as $k \to \infty$ and using the fact that $L_{n^2} \to m$ gives:
$$m \le \lim_{k\to\infty} \frac{1}{k} \sum_{i=1}^{k} X_i \le m$$
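The pinching between consecutive perfect squares can also be seen numerically. This Python sketch (the nonnegative Exponential(1) samples, seed, and checkpoints are assumptions for illustration) verifies that every $L_k$ with $n^2 \le k < (n+1)^2$ lies between the two bounds derived above.

```python
import numpy as np

# For nonnegative X_i, every running average L_k with n^2 <= k < (n+1)^2
# is squeezed between (n/(n+1))^2 * L_{n^2} and ((n+1)/n)^2 * L_{(n+1)^2}.
rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=200_000)      # nonnegative, mean m = 1
L = np.cumsum(X) / np.arange(1, X.size + 1)       # L_n for n = 1, 2, ...

for n in [10, 100, 400]:
    k = np.arange(n * n, (n + 1) ** 2)            # the block between squares
    lo = (n / (n + 1)) ** 2 * L[n * n - 1]
    hi = ((n + 1) / n) ** 2 * L[(n + 1) ** 2 - 1]
    block = L[k - 1]
    print(f"n={n}: L_k in [{block.min():.4f}, {block.max():.4f}],"
          f" bounds [{lo:.4f}, {hi:.4f}]")
```

As $n$ grows, the two bounds squeeze toward each other, which is exactly why convergence along the perfect squares forces convergence of the whole sequence.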
B. Where is independence used?
The main idea behind the law of large numbers is that the variance of a sum of pairwise uncorrelated random variables is
equal to the sum of the variances. So, why is “independence” so important?
The discussion after Exercise 2 shows that, in the special case when the random variables are non-negative and have finite
variance, the assumption that the variables are independent can be replaced by the weaker assumption that they are pairwise
uncorrelated. However, independence is used when extending this result to general (possibly negative) random variables. Why?
If X1 and X2 are uncorrelated, their positive parts X1+ and X2+ are not necessarily uncorrelated. However, if X1 and X2
are independent, then indeed their positive parts X1+ and X2+ are also independent (and hence uncorrelated). Likewise, their
negative parts X1− and X2− are independent (and hence uncorrelated).
Both “independence” and “identically distributed” are used in the proof of [1] that extends to the case of infinite variance.
The strong law of large numbers only requires the random variables to be pairwise independent. Mutual independence is not
needed. However, most textbooks state the result in terms of “i.i.d.” random variables, which assumes mutual independence.
This is because several other important results use i.i.d. random variables (including the central limit theorem). It is easiest to
consistently use “i.i.d.” rather than changing the assumptions to fit each different result.
IV. WHAT ARE THE MOST IMPORTANT THINGS TO REMEMBER?
The mean of a random variable $X$ is denoted $E[X]$. If the mean is finite, the variance is defined by:
$$Var(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$$
A. Mean and variance facts
• $E[X_1 + \ldots + X_n] = E[X_1] + \ldots + E[X_n]$.
• $Var(aX) = a^2 Var(X)$.
• If $\{X_1, \ldots, X_n\}$ are pairwise uncorrelated, then $Var(X_1 + \ldots + X_n) = Var(X_1) + \ldots + Var(X_n)$.
• Since “i.i.d.” implies “pairwise uncorrelated,” the same variance-sum result holds for i.i.d. random variables.
B. Distinctions between the law of large numbers and the central limit theorem
Let $\{X_1, X_2, X_3, \ldots\}$ be i.i.d. with finite mean and variance given by $E[X_1] = m$, $Var(X_1) = \sigma^2$. Assume $\sigma^2 > 0$. Define:¹
$$C_n = \frac{1}{\sqrt{n\sigma^2}} \sum_{i=1}^{n} (X_i - m), \quad L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
It is important to know how to derive the following facts:
$$E[C_n] = 0, \quad Var(C_n) = 1 \quad \forall n \in \{1, 2, 3, \ldots\}$$
$$E[L_n] = m, \quad Var(L_n) = \frac{\sigma^2}{n} \quad \forall n \in \{1, 2, 3, \ldots\}$$
Since $Var(L_n)$ converges to 0, it is easy to remember that $L_n$ converges to the constant $m$ (with probability 1). Since $Var(C_n) = 1$ for all $n$, it is easy to see that $C_n$ does not converge to a constant. In fact, $C_n$ does not converge to a random variable either: it oscillates without converging to anything as $n$ increases. However, the central limit theorem says that the distribution of $C_n$ converges to a Gaussian distribution with zero mean and unit variance:
$$\lim_{n\to\infty} P[C_n > x] = Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$$
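A hedged Python sketch of this statement (the Uniform(0,1) choice, $n = 100$, the seed, and the trial count are illustrative assumptions): it compares the empirical tail probability of $C_n$ with $Q(x)$.

```python
import numpy as np
from math import erf, sqrt

def Q(x):
    """Standard normal tail probability Q(x) = 1 - Phi(x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

# Build C_n from i.i.d. Uniform(0,1) variables: m = 1/2, sigma^2 = 1/12.
rng = np.random.default_rng(5)
n, trials = 100, 100_000
U = rng.random((trials, n))
C = (U.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)   # one C_n per trial

for x in [0.0, 1.0, 2.0]:
    print(f"x={x}: empirical P[C_n > x] = {np.mean(C > x):.4f}, Q(x) = {Q(x):.4f}")
```

Already at $n = 100$ the empirical tail probabilities land close to $Q(0) = 0.5$, $Q(1) \approx 0.159$, and $Q(2) \approx 0.023$.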
REFERENCES
[1] N. Etemadi. An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 55, pp. 119-122, 1981.
¹The letter C is used to remind you of the Central limit theorem. The letter L is used to remind you of the Law of large numbers.