A note on “infinitely often,” “probability 1,” and the
law of large numbers
Michael J. Neely
University of Southern California
http://www-bcf.usc.edu/~mjneely
Abstract
These notes give details on the probability concepts of “infinitely often” and “with probability 1.” This is useful for
understanding the Borel-Cantelli lemma and the strong law of large numbers.
I. SEQUENCES OF EVENTS
A. Events
Consider a general sample space S. Recall that an event is a subset of S that has a well defined probability. That is, a set
A is an event if and only if A ⊆ S and P [A] exists. Formally, a probability experiment introduces both a sample space S
and a sigma algebra F. The sigma algebra F contains all events, that is, it is the collection of all subsets of S for which
probabilities are defined.
B. Can an event be “true”?
It is common to talk about events as being either “true” or “false.” However, if an event is just a subset of S, does it
make sense to say that an event can be “true”? When we say that an event A is “true” (or that an event A “occurs”), we are
imagining a probability experiment that produces an outcome ω ∈ S for which ω ∈ A. The event is “false” if the outcome satisfies ω ∉ A.
C. Shrinking a sequence of events
Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events. Let $A$ be another event. We say that $A_n \searrow A$ if:
• $A_n \supseteq A_{n+1}$ for all $n \in \{1, 2, 3, \ldots\}$.
• $\cap_{n=1}^{\infty} A_n = A$.
We know that if $A_n \searrow A$ then $P[A_n] \searrow P[A]$.
D. Infinitely often
Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events. We say that events in the sequence occur “infinitely often” if $A_i$ holds true for an infinite number of indices $i \in \{1, 2, 3, \ldots\}$. Define $\{A_i \text{ i.o.}\}$ as the event that an infinite number of the events $A_i$ occur. Formally, this new event needs to be a subset of $S$. Hence:
$$\{A_i \text{ i.o.}\} = \{\omega \in S : \omega \in A_i \text{ for an infinite number of indices } i \in \{1, 2, 3, \ldots\}\} \tag{1}$$
Lemma 1. For any sequence of events $\{A_n\}_{n=1}^{\infty}$ we have:
$$\cup_{i=n}^{\infty} A_i \searrow \{A_i \text{ i.o.}\} \tag{2}$$
and so
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} A_i] = P[A_i \text{ i.o.}]$$
Proof. To prove (2), note that:
$$\cup_{i=n}^{\infty} A_i \supseteq \cup_{i=n+1}^{\infty} A_i \quad \forall n \in \{1, 2, 3, \ldots\}$$
It remains to show that:
$$\cap_{n=1}^{\infty} \cup_{i=n}^{\infty} A_i = \{A_i \text{ i.o.}\} \tag{3}$$
This can be done by considering two cases (left as an exercise):
1) Case 1: Suppose $\{A_i \text{ i.o.}\}$ is true. Show that $\cup_{i=n}^{\infty} A_i$ must be true for all $n \in \{1, 2, 3, \ldots\}$.
2) Case 2: Suppose $\{A_i \text{ i.o.}\}$ is false. Show that $\cup_{i=n}^{\infty} A_i$ must be false for some $n \in \{1, 2, 3, \ldots\}$.
Note that equation (3) in the above proof shows that the event {Ai i.o.} can be written in terms of unions and intersections
of the original events Ai . Formally, this means that the set {Ai i.o.} defined in (1) is indeed in the sigma algebra of events.
E. Finitely often
We say that $A_i$ occurs “finitely often” if $A_i$ holds true for an at most finite number of indices $i \in \{1, 2, 3, \ldots\}$. Specifically:
$$\{A_i \text{ f.o.}\} = \{A_i \text{ i.o.}\}^c$$
By taking complements in Lemma 1 we obtain:
$$\cap_{i=n}^{\infty} A_i^c \nearrow \{A_i \text{ f.o.}\} \implies \lim_{n\to\infty} P[\cap_{i=n}^{\infty} A_i^c] = P[A_i \text{ f.o.}]$$
Taking complements of (3) gives:
$$\cup_{n=1}^{\infty} \cap_{i=n}^{\infty} A_i^c = \{A_i \text{ f.o.}\}$$
F. Preliminaries for Borel-Cantelli lemma
We need two facts about real numbers:
1) $\log(1+x) \le x$ for all $x \in (-1, \infty)$.
2) If $\{x_i\}_{i=1}^{\infty}$ is a sequence of non-negative real numbers such that $\sum_{i=1}^{\infty} x_i < \infty$, then $\lim_{n\to\infty} \sum_{i=n}^{\infty} x_i = 0$.
The second fact is proven as follows: For all positive integers $n$ we have:
$$\sum_{i=1}^{\infty} x_i = \sum_{i=1}^{n-1} x_i + \sum_{i=n}^{\infty} x_i$$
Taking a limit as $n \to \infty$ gives:
$$\sum_{i=1}^{\infty} x_i = \sum_{i=1}^{\infty} x_i + \lim_{n\to\infty} \sum_{i=n}^{\infty} x_i \tag{4}$$
Now, the equation $\infty = \infty + y$ is true for all $y \in \mathbb{R}$ (we cannot cancel $\infty$ from both sides to conclude $y = 0$). However, if $\sum_{i=1}^{\infty} x_i < \infty$, we can cancel it from both sides of (4) to conclude:
$$0 = \lim_{n\to\infty} \sum_{i=n}^{\infty} x_i$$
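A quick numerical sanity check of fact 2, as a minimal Python sketch (the notes themselves contain no code; the script and its truncation length are illustrative assumptions): the tails of the convergent series $\sum 1/i^2$ shrink toward 0.

```python
# Numerical check of fact 2: a convergent series has vanishing tails.
# We approximate the infinite tail sum_{i=n}^infinity 1/i^2 by truncating
# after a large (assumed) number of terms.

def tail_sum(n, terms=10**6):
    """Approximate sum of 1/i^2 for i = n, n+1, ..., n+terms-1."""
    return sum(1.0 / (i * i) for i in range(n, n + terms))

for n in [1, 10, 100, 1000]:
    print(n, tail_sum(n))
# The printed tails decrease roughly like 1/n: ~1.64, ~0.105, ~0.0100, ~0.0010
```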
G. Borel-Cantelli lemma
Lemma 2. (Borel-Cantelli) Let $\{A_n\}_{n=1}^{\infty}$ be an infinite sequence of events.
a) If $\sum_{i=1}^{\infty} P[A_i] < \infty$, then $P[A_i \text{ f.o.}] = 1$.
b) If $\sum_{i=1}^{\infty} P[A_i] = \infty$ and if the events are mutually independent, then $P[A_i \text{ i.o.}] = 1$.
Proof. (Part (a)) Suppose $\sum_{i=1}^{\infty} P[A_i] < \infty$. For all positive integers $n$:
$$\{A_i \text{ i.o.}\} \subseteq \cup_{i=n}^{\infty} A_i$$
Thus:
$$P[A_i \text{ i.o.}] \le P[\cup_{i=n}^{\infty} A_i] \le \sum_{i=n}^{\infty} P[A_i]$$
where the final inequality is the union bound. Taking a limit as $n \to \infty$ gives:
$$P[A_i \text{ i.o.}] \le \lim_{n\to\infty} \sum_{i=n}^{\infty} P[A_i] = 0 \tag{5}$$
where the final limit holds because $\sum_{i=1}^{\infty} P[A_i] < \infty$ (recall fact 2 in the previous subsection). Hence $P[A_i \text{ i.o.}] = 0$ and so $P[A_i \text{ f.o.}] = 1$.
Proof. (Part (b)) Suppose $\sum_{i=1}^{\infty} P[A_i] = \infty$ and that the events are mutually independent. Fix $n$ as a positive integer. We have:
$$P[\cap_{i=n}^{\infty} A_i^c] = \prod_{i=n}^{\infty} P[A_i^c] = \prod_{i=n}^{\infty} (1 - P[A_i])$$
Taking the log of both sides and using the fact that $\log(1+x) \le x$ gives:
$$\log(P[\cap_{i=n}^{\infty} A_i^c]) = \sum_{i=n}^{\infty} \log(1 - P[A_i]) \le -\sum_{i=n}^{\infty} P[A_i] = -\infty$$
Thus, $P[\cap_{i=n}^{\infty} A_i^c] = 0$. This holds for all $n$. Thus:
$$0 = P[\cap_{i=n}^{\infty} A_i^c] \nearrow P[A_i \text{ f.o.}]$$
So $P[A_i \text{ f.o.}] = 0$. Hence, $P[A_i \text{ i.o.}] = 1$.
Notice that $\sum_{i=1}^{\infty} P[A_i]$ represents the expected number of events that occur. Indeed, one can define indicator functions for each $i \in \{1, 2, 3, \ldots\}$ by:
$$I_i = \begin{cases} 1 & \text{if event } A_i \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$
The number of events that occur is then $N = \sum_{i=1}^{\infty} I_i$. It can be shown that the expectation of a countably infinite sum of nonnegative random variables is equal to the sum of the expectations. Hence:
$$E[N] = \sum_{i=1}^{\infty} E[I_i] = \sum_{i=1}^{\infty} P[A_i]$$
H. Coin flipping example
Suppose we flip a sequence of coins. Let $A_i$ be the event that coin flip $i$ produces “heads.” Suppose that:
$$P[A_i] = 1/i \quad \forall i \in \{1, 2, 3, \ldots\}$$
Then:
• We know $\sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} 1/i = \infty$. Thus, if the coin flips are mutually independent, the Borel-Cantelli lemma ensures that, with probability 1, there will be an infinite number of heads. This may be surprising because the probability of heads shrinks down to 0. However, the probabilities do not shrink rapidly enough for their sum to be finite.
• Consider the sparsely sampled subsequence of events, sampled at indices that are perfect squares:
$$\{A_{n^2}\}_{n=1}^{\infty} = \{A_1, A_4, A_9, A_{16}, \ldots\}$$
We know $\sum_{n=1}^{\infty} P[A_{n^2}] = \sum_{n=1}^{\infty} 1/n^2 < \infty$. Thus, the Borel-Cantelli lemma ensures that, with probability 1, only a finite number of perfect square indices produce heads. This holds regardless of whether or not the flips are mutually independent. Indeed, the probabilities $P[A_{n^2}]$ shrink rapidly enough for their sum to be finite. A small simulation sketch follows.
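To make the example concrete, here is a minimal Monte Carlo sketch in Python (the seed, the $10^6$-flip horizon, and the use of numpy are illustrative assumptions, not part of the original notes). It simulates independent flips with $P[A_i] = 1/i$ and counts heads overall and at perfect-square indices.

```python
import numpy as np

# Simulate independent coin flips where flip i lands heads with probability 1/i.
rng = np.random.default_rng(0)
N = 10**6
i = np.arange(1, N + 1)
heads = rng.random(N) < 1.0 / i                   # True where flip i is heads

# Heads keep accumulating: sum(1/i) diverges (the count grows like log N).
print("heads among first 10^6 flips:", heads.sum())
print("expected number:", (1.0 / i).sum())        # harmonic sum, about 14.39

# Heads at perfect-square indices are rare: sum(1/n^2) converges.
sq = np.arange(1, int(N**0.5) + 1) ** 2           # indices 1, 4, 9, ..., 10^6
print("heads at perfect-square indices:", heads[sq - 1].sum())
print("expected number:", (1.0 / sq).sum())       # partial sum of pi^2/6, about 1.64
```

In a finite simulation the total head count is of course finite; the point is that it keeps growing without bound as the horizon increases, while the perfect-square head count stabilizes.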
II. SEQUENCES OF RANDOM VARIABLES
Let S be a sample space.
A. Markov and Chebyshev inequalities
Recall that:
Lemma 3. (Markov inequality) If $X$ is a non-negative random variable, then for all $\epsilon > 0$:
$$P[X \ge \epsilon] \le \frac{E[X]}{\epsilon}$$
Lemma 4. (Chebyshev inequality) If $X$ is a random variable (possibly negative) with mean $m$ and variance $\sigma^2$, then for all $\epsilon > 0$:
$$P[|X - m| \ge \epsilon] \le \frac{Var(X)}{\epsilon^2}$$
The Chebyshev inequality can be proven from the Markov inequality by defining the non-negative random variable $Y = (X - m)^2$ and noting that $\{|X - m| \ge \epsilon\} = \{(X - m)^2 \ge \epsilon^2\}$.
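As a quick illustration, the following Python sketch (the Exponential(1) distribution, seed, and sample size are assumptions chosen for demonstration) compares the empirical deviation probability against the Chebyshev bound. For Exponential(1) we have $m = 1$ and $\sigma^2 = 1$, so the bound is $1/\epsilon^2$.

```python
import numpy as np

# Monte Carlo check of the Chebyshev inequality for X ~ Exponential(1):
# mean m = 1 and variance sigma^2 = 1, so P[|X - 1| >= eps] <= 1/eps^2.
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=10**6)

for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - 1.0) >= eps)   # fraction of large deviations
    bound = 1.0 / eps**2                           # Chebyshev upper bound
    print(f"eps={eps}: empirical {empirical:.4f} <= bound {bound:.4f}")
```

The bound is loose (as Chebyshev typically is), but it always sits above the empirical frequency.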
B. Random sequences defined by outcomes in S
We want to treat infinite sequences of random variables {X1 , X2 , X3 , . . .}. When working with random sequences, we
know that the entire sequence is determined by the particular outcome ω in the sample space S. That is, the sequence
{X1 , X2 , X3 , . . .} can be written more formally as {X1 (ω), X2 (ω), X3 (ω), . . .}.
Suppose X is another random variable on the same probability space, so that X can be written as X(ω). We say that
limn→∞ Xn = X is true if the probability experiment produces an outcome ω ∈ S such that limn→∞ Xn (ω) = X(ω).
Formally:
$$\{\lim_{n\to\infty} X_n = X\} = \{\omega \in S : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$$
The complement event $\{\lim_{n\to\infty} X_n = X\}^c$ is the set of all outcomes ω for which either the limit does not exist or the limit exists but is not equal to X(ω).
We say that a sequence of random variables $\{X_n\}_{n=1}^{\infty}$ converges to a random variable $X$ with probability 1 if:
$$P[\lim_{n\to\infty} X_n = X] = 1$$
Probability 1 convergence is also called “almost sure” convergence. It is useful to remember the standard definition of a limit:
Definition 1. $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ if for all $\epsilon > 0$, there exists an integer $k$ such that:
$$|X_n(\omega) - X(\omega)| \le \epsilon \quad \forall n \ge k$$
C. Different types of convergence
Here are three useful types of convergence: (i) mean square convergence, (ii) convergence in probability, (iii) convergence
with probability 1.
Definition 2. $X_n \to X$ in mean square if:
$$\lim_{n\to\infty} E[(X_n - X)^2] = 0$$
Definition 3. $X_n \to X$ in probability if for all $\epsilon > 0$:
$$\lim_{n\to\infty} P[|X_n - X| > \epsilon] = 0$$
Definition 4. $X_n \to X$ with probability 1 if for all $\epsilon > 0$:
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = 0$$
equivalently, if $P[\lim_{n\to\infty} X_n = X] = 1$.
Subsection II-E shows why the two definitions given for “probability 1 convergence” are equivalent. Notice that probability 1 convergence implies convergence in probability. That is because, for all positive integers $n$,
$$\{|X_n - X| > \epsilon\} \subseteq \cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}$$
and so
$$P[|X_n - X| > \epsilon] \le P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}]$$
It can also be shown (via the Markov inequality) that mean square convergence implies convergence in probability. If the
random variables Xn are deterministically bounded, say, always taking values in [−M, M ] for some positive integer M , then
it can be shown that convergence in probability is equivalent to convergence in mean square.
There are examples where bounded random variables converge in probability, but not with probability 1 (try constructing
such an example using the coin flips of Section I-H). There are also examples where (unbounded) random variables converge
with probability 1, but not in mean square.
D. A useful fact about limits
Let $\{X_n\}_{n=1}^{\infty}$ be random variables and let $X$ be another random variable, all on the same probability space $S$. It can be shown that:
$$\{\lim_{n\to\infty} X_n = X\} = \cap_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ f.o.}\} \tag{6}$$
The right-hand side is the event that, for each positive integer $k$, the inequality $|X_i - X| > 1/k$ holds for only a finite number of indices $i$, so that there is an $n$ for which $|X_i - X| \le 1/k$ for all $i \ge n$. Taking complements of (6) gives:
$$\{\lim_{n\to\infty} X_n = X\}^c = \cup_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ i.o.}\} \tag{7}$$
As a minor detail: There is nothing special about the $1/k$ function used in (6). It can equally be replaced by any positive function $g(k)$ that satisfies $\lim_{k\to\infty} g(k) = 0$.
E. Equivalent definitions of “with probability 1”
Lemma 5. The following are equivalent statements, all meaning that $X_n \to X$ with probability 1.
1) Statement 1:
$$P[\lim_{n\to\infty} X_n = X] = 1$$
2) Statement 2: For all $\epsilon > 0$ we have:
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = 0$$
3) Statement 3: For all $\epsilon > 0$ we have:
$$P[|X_i - X| > \epsilon \text{ i.o.}] = 0$$
Proof. (2 ⟺ 3) To understand the equivalence between statements 2 and 3, fix $\epsilon > 0$ and observe that
$$\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\} \searrow \{|X_i - X| > \epsilon \text{ i.o.}\}$$
Hence
$$\lim_{n\to\infty} P[\cup_{i=n}^{\infty} \{|X_i - X| > \epsilon\}] = P[|X_i - X| > \epsilon \text{ i.o.}]$$
Thus, the probabilities in statements 2 and 3 are the same, so the two statements are equivalent.
Proof. (3 ⟹ 1) Suppose statement 3 holds, so for all $\epsilon > 0$ we have $P[|X_i - X| > \epsilon \text{ i.o.}] = 0$. We want to show that $P[\lim_{n\to\infty} X_n = X] = 1$. It suffices to show that $P[(\lim_{n\to\infty} X_n = X)^c] = 0$. By (7):
$$P[(\lim_{n\to\infty} X_n = X)^c] = P[\cup_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ i.o.}\}] \overset{(a)}{\le} \sum_{k=1}^{\infty} P[|X_i - X| > 1/k \text{ i.o.}] = \sum_{k=1}^{\infty} 0 = 0$$
where (a) holds by the union bound.
Proof. (1 ⟹ 3) Suppose statement 1 holds, so that $P[\lim_{n\to\infty} X_n = X] = 1$. We want to show statement 3 holds. It suffices to show that for all positive integers $m$ we have $P[|X_i - X| > 1/m \text{ i.o.}] = 0$. Fix a positive integer $m$. Then:
$$1 = P[\lim_{n\to\infty} X_n = X] \overset{(a)}{=} P[\cap_{k=1}^{\infty} \{|X_i - X| > 1/k \text{ f.o.}\}] \le P[|X_i - X| > 1/m \text{ f.o.}]$$
where (a) holds by (6). It follows that $P[|X_i - X| > 1/m \text{ f.o.}] = 1$, and so the probability of its complement is 0.
III. LAW OF LARGE NUMBERS
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of mutually independent and identically distributed (i.i.d.) random variables with mean $m$ and variance $\sigma^2$. For now, assume both the mean and variance are finite. For positive integers $n$, define $L_n$ as the average over the first $n$ random variables:
$$L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Notice that:
$$E[L_n] = m, \quad Var(L_n) = \frac{\sigma^2}{n} \quad \forall n \in \{1, 2, 3, \ldots\}$$
The fact that the mean is always $m$ and the variance shrinks to 0 suggests that the random variables $L_n$ are converging to the mean $m$. In what sense does this convergence hold? The variance condition directly implies that the averages $L_n$ converge to $m$ in the mean square sense. The Chebyshev inequality can be used to further show that convergence in probability holds. The strong law of large numbers makes the much deeper claim that convergence in fact holds with probability 1.
Theorem 1. (Strong law of large numbers) Suppose $\{X_i\}_{i=1}^{\infty}$ are i.i.d. with finite mean $m$ and finite variance $\sigma^2$. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then:
a) $L_n \to m$ in probability.
b) $L_n \to m$ with probability 1.
c) Both (a) and (b) hold even if $\sigma^2 = \infty$.
d) Parts (a)-(c) hold if “i.i.d.” is replaced by “identically distributed and pairwise independent.”
The statement in part (a) is referred to as the “weak law of large numbers.” It is included for pedagogical reasons: Its proof
is very simple and gives intuition behind the deeper proof for the strong law.
Proof. (LLN part (a)) Fix $\epsilon > 0$. By the Chebyshev inequality:
$$P[|L_n - m| > \epsilon] \le \frac{Var(L_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$
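The Chebyshev bound in this proof can be checked numerically. The following Python sketch is illustrative only (the Uniform(0,1) distribution, $\epsilon = 0.05$, the seed, and the trial count are assumptions): it estimates $P[|L_n - m| > \epsilon]$ by simulation and compares it with $\sigma^2/(n\epsilon^2)$.

```python
import numpy as np

# Estimate P[|L_n - m| > eps] for i.i.d. Uniform(0,1) samples, where
# m = 1/2 and sigma^2 = 1/12, and compare with the Chebyshev bound.
rng = np.random.default_rng(2)
eps, trials = 0.05, 10_000

for n in [10, 100, 1000]:
    L = rng.random((trials, n)).mean(axis=1)        # one sample mean per trial
    empirical = np.mean(np.abs(L - 0.5) > eps)
    bound = (1.0 / 12.0) / (n * eps**2)
    print(f"n={n}: empirical {empirical:.4f}, bound {min(bound, 1.0):.4f}")
```

Both the empirical probability and the bound shrink as $n$ grows; the bound is loose but is all the proof needs.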
Proof. (LLN part (b)) Suppose the result of part (b) holds whenever the random variables $\{X_n\}_{n=1}^{\infty}$ are nonnegative (the next subsection shows it is indeed true in the nonnegative case). For general (possibly negative) random variables $\{X_n\}_{n=1}^{\infty}$ define:
$$X_n^+ = \max[X_n, 0], \quad X_n^- = \max[-X_n, 0]$$
Then $X_n = X_n^+ - X_n^-$, and $X_n^+$ and $X_n^-$ are nonnegative. Because $\{X_n\}_{n=1}^{\infty}$ are i.i.d. with finite means and variances, the random variables $\{X_n^+\}_{n=1}^{\infty}$ are also i.i.d. with finite means and variances, as are the variables $\{X_n^-\}_{n=1}^{\infty}$. Define $m^+ = E[X_1^+]$ and $m^- = E[X_1^-]$. Then:
$$m = E[X_1] = E[X_1^+] - E[X_1^-] = m^+ - m^-$$
Thus, with probability 1:
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i^- = m^+ - m^- = m$$
For a proof of parts (c)-(d), see [1].
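Probability-1 convergence is a statement about individual sample paths: along a single realized sequence, the running average settles down to $m$. A minimal Python sketch of one such path (the Exponential distribution with mean 0.5, the seed, and the horizon are assumptions for illustration):

```python
import numpy as np

# One sample path of the running average L_n for i.i.d. Exponential
# variables with mean m = 0.5. The strong law says this path converges
# to 0.5 with probability 1.
rng = np.random.default_rng(3)
X = rng.exponential(scale=0.5, size=10**6)
L = np.cumsum(X) / np.arange(1, X.size + 1)       # L_n for n = 1, ..., 10^6

for n in [10, 1000, 10**5, 10**6]:
    print(f"L_{n} = {L[n-1]:.5f}")                # drifts toward 0.5
```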
A. Extensions
The following exercise extends the weak law of large numbers to treat random variables with possible dependencies and
with different distributions.
Exercise 1. Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of random variables with finite means and variances. Suppose:
• $E[X_i] = m$ for all $i \in \{1, 2, 3, \ldots\}$.
• $Var(X_i) = \sigma_i^2$ for all $i \in \{1, 2, 3, \ldots\}$.
• The random variables are pairwise uncorrelated, so that $E[X_i X_j] = m^2$ whenever $i \neq j$.
• $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$.
a) Prove that $\frac{1}{n} \sum_{i=1}^{n} X_i \to m$ in probability. Hint: Just follow the standard proof of the weak law.
b) Prove that $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$ holds whenever $\sigma_i^2 \le \sigma_{max}^2$ for all $i \in \{1, 2, 3, \ldots\}$, where $\sigma_{max}^2$ is a finite constant.
Exercise 2. Consider the same scenario as the previous exercise, but replace the condition $\lim_{n\to\infty} \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 = 0$ with the more stringent condition $\sigma_i^2 \le \sigma_{max}^2$ for all $i \in \{1, 2, 3, \ldots\}$, where $\sigma_{max}^2$ is a finite constant. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Prove that $L_{n^2}$ converges to $m$ with probability 1. Hint: Prove for all $\epsilon > 0$ that $P[|L_{n^2} - m| > \epsilon \text{ i.o.}] = 0$.
If the average of a sequence $X_n$ converges to a constant $m$ when sampled over a sparse subsequence of times that correspond to perfect squares, does that mean the average converges to $m$? It does in the special case when the sequence $X_n$ is non-negative. Specifically, suppose $\{X_1, X_2, X_3, \ldots\}$ is a non-negative sequence. Define $L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Suppose we know that $L_{n^2} \to m$ for some finite constant $m$. We want to show that $L_n \to m$. For any positive integer $k$, define $n[k]$ as the largest integer $n$ such that $n^2 \le k$. Thus, for all positive integers $k$ we have
$$n[k]^2 \le k < (n[k]+1)^2$$
Then for any positive integer $k$ we have (because the $X_i$ values are non-negative):
$$\frac{1}{(n[k]+1)^2} \sum_{i=1}^{n[k]^2} X_i \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{1}{n[k]^2} \sum_{i=1}^{(n[k]+1)^2} X_i$$
Thus:
$$\frac{n[k]^2}{(n[k]+1)^2} \underbrace{\frac{1}{n[k]^2} \sum_{i=1}^{n[k]^2} X_i}_{L_{n[k]^2}} \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{(n[k]+1)^2}{n[k]^2} \underbrace{\frac{1}{(n[k]+1)^2} \sum_{i=1}^{(n[k]+1)^2} X_i}_{L_{(n[k]+1)^2}}$$
Hence, for all positive integers $k$:
$$\frac{n[k]^2}{(n[k]+1)^2} L_{n[k]^2} \le \frac{1}{k} \sum_{i=1}^{k} X_i \le \frac{(n[k]+1)^2}{n[k]^2} L_{(n[k]+1)^2}$$
Note that $\lim_{k\to\infty} n[k] = \infty$. Taking a limit as $k \to \infty$ and using the fact that $L_{n^2} \to m$ gives:
$$m \le \lim_{k\to\infty} \frac{1}{k} \sum_{i=1}^{k} X_i \le m$$
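The pinching between consecutive perfect squares can also be seen numerically. This Python sketch (the nonnegative Exponential(1) samples, seed, and checkpoints are assumptions for illustration) verifies that every $L_k$ with $n^2 \le k < (n+1)^2$ lies between the two bounds derived above.

```python
import numpy as np

# For nonnegative X_i, every running average L_k with n^2 <= k < (n+1)^2
# is squeezed between (n/(n+1))^2 * L_{n^2} and ((n+1)/n)^2 * L_{(n+1)^2}.
rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=200_000)      # nonnegative, mean m = 1
L = np.cumsum(X) / np.arange(1, X.size + 1)       # L_n for n = 1, 2, ...

for n in [10, 100, 400]:
    k = np.arange(n * n, (n + 1) ** 2)            # the block between squares
    lo = (n / (n + 1)) ** 2 * L[n * n - 1]
    hi = ((n + 1) / n) ** 2 * L[(n + 1) ** 2 - 1]
    block = L[k - 1]
    print(f"n={n}: L_k in [{block.min():.4f}, {block.max():.4f}],"
          f" bounds [{lo:.4f}, {hi:.4f}]")
```

As $n$ grows, the two bounds squeeze toward each other, which is exactly why convergence along the perfect squares forces convergence of the whole sequence.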
B. Where is independence used?
The main idea behind the law of large numbers is that the variance of a sum of pairwise uncorrelated random variables is
equal to the sum of the variances. So, why is “independence” so important?
The discussion after Exercise 2 shows that, in the special case when the random variables are non-negative and have finite
variance, the assumption that the variables are independent can be replaced by the weaker assumption that they are pairwise
uncorrelated. However, independence is used when extending this result to general (possibly negative) random variables. Why?
If X1 and X2 are uncorrelated, their positive parts X1+ and X2+ are not necessarily uncorrelated. However, if X1 and X2
are independent, then indeed their positive parts X1+ and X2+ are also independent (and hence uncorrelated). Likewise, their
negative parts X1− and X2− are independent (and hence uncorrelated).
Both “independence” and “identically distributed” are used in the proof of [1] that extends to the case of infinite variance.
The strong law of large numbers only requires the random variables to be pairwise independent. Mutual independence is not
needed. However, most textbooks state the result in terms of “i.i.d.” random variables, which assumes mutual independence.
This is because several other important results use i.i.d. random variables (including the central limit theorem). It is easiest to
consistently use “i.i.d.” rather than changing the assumptions to fit each different result.
IV. WHAT ARE THE MOST IMPORTANT THINGS TO REMEMBER?
The mean of a random variable $X$ is denoted $E[X]$. If the mean is finite, the variance is defined by:
$$Var(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$$
A. Mean and variance facts
• $E[X_1 + \ldots + X_n] = E[X_1] + \ldots + E[X_n]$.
• $Var(aX) = a^2 Var(X)$.
• If $\{X_1, \ldots, X_n\}$ are pairwise uncorrelated, then $Var(X_1 + \ldots + X_n) = Var(X_1) + \ldots + Var(X_n)$.
• Since “i.i.d.” implies “pairwise uncorrelated,” the same variance-sum result holds for i.i.d. random variables.
B. Distinctions between the law of large numbers and the central limit theorem
Let $\{X_1, X_2, X_3, \ldots\}$ be i.i.d. with finite mean and variance given by $E[X_1] = m$, $Var(X_1) = \sigma^2$. Assume $\sigma^2 > 0$. Define:¹
$$C_n = \frac{1}{\sqrt{n\sigma^2}} \sum_{i=1}^{n} (X_i - m), \quad L_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
It is important to know how to derive the following facts:
$$E[C_n] = 0, \quad Var(C_n) = 1 \quad \forall n \in \{1, 2, 3, \ldots\}$$
$$E[L_n] = m, \quad Var(L_n) = \frac{\sigma^2}{n} \quad \forall n \in \{1, 2, 3, \ldots\}$$
Since $Var(L_n)$ converges to 0, it is easy to remember that $L_n$ converges to the constant $m$ (with probability 1). Since $Var(C_n) = 1$ for all $n$, it is easy to see that $C_n$ does not converge to a constant. In fact, $C_n$ does not converge to a random variable either: it oscillates without converging to anything as $n$ increases. However, the central limit theorem says that the distribution of $C_n$ converges to a Gaussian distribution with zero mean and unit variance:
$$\lim_{n\to\infty} P[C_n > x] = Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$$
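A hedged Python sketch of this statement (the Uniform(0,1) choice, $n = 100$, the seed, and the trial count are illustrative assumptions): it compares the empirical tail probability of $C_n$ with $Q(x)$.

```python
import numpy as np
from math import erf, sqrt

def Q(x):
    """Standard normal tail probability Q(x) = 1 - Phi(x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

# Build C_n from i.i.d. Uniform(0,1) variables: m = 1/2, sigma^2 = 1/12.
rng = np.random.default_rng(5)
n, trials = 100, 100_000
U = rng.random((trials, n))
C = (U.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)   # one C_n per trial

for x in [0.0, 1.0, 2.0]:
    print(f"x={x}: empirical P[C_n > x] = {np.mean(C > x):.4f}, Q(x) = {Q(x):.4f}")
```

Already at $n = 100$ the empirical tail probabilities land close to $Q(0) = 0.5$, $Q(1) \approx 0.159$, and $Q(2) \approx 0.023$.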
REFERENCES
[1] N. Etemadi. An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 55, pp. 119-122, 1981.
¹The letter C is used to remind you of the Central limit theorem. The letter L is used to remind you of the Law of large numbers.