Lecture 1
1  Probability space and random variables
Let us recall Kolmogorov’s formulation of modern probability theory using measure theory.
Definition 1.1 [Probability space and random variables] A probability space is a triple
(Ω, F, P), where Ω is a set, F is a σ-algebra on Ω, and P is a probability measure on the measurable space (Ω, F). A real-valued random variable is a measurable map, say X : (Ω, F) →
(R, B) (B being the Borel σ-algebra on R), with distribution P ◦ X^{-1}, i.e.,

    P(X ∈ A) = P(ω : X(ω) ∈ A) = P(X^{-1}(A))    for all A ∈ B.

2  Conditional probabilities and expectations

2.1  Definition and properties
Let us recall how conditional probability and expectation are defined in the discrete setting.
Example 2.1 Let X and Y be two integer-valued random variables on a probability space
(Ω, F, P). If P(Y = y) > 0, then the conditional probability of X given Y = y is defined by
    P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)    for all x ∈ Z,
while for any f : Z → R, the conditional expectation of f (X) given Y = y is defined by
    E[f(X) | Y = y] = Σ_{x∈Z} f(x) P(X = x | Y = y).
Therefore the conditional distribution of X given Y is the family of probability distributions
P(X ∈ ·|Y = y), indexed by y ∈ Z with P(Y = y) > 0. For any f : Z → R, the conditional
expectation of f (X) given Y , denoted by E[f (X)|Y ], is a function of the random variable Y ,
which is again a random variable.
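For readers who want a concrete check of these formulas, here is a small computational sketch; the joint distribution and the function f used below are invented purely for illustration. It computes the conditional pmf P(X = x|Y = y) and the conditional expectation E[f(X)|Y = y] from a finite joint probability table.

    # Sketch: conditional pmf and conditional expectation in the discrete setting.
    # The joint pmf below is an arbitrary illustrative choice, not taken from the notes.
    joint = {  # (x, y) -> P(X = x, Y = y)
        (0, 0): 0.1, (1, 0): 0.3,
        (0, 1): 0.2, (1, 1): 0.4,
    }

    def cond_pmf(y):
        """Return the conditional pmf x -> P(X = x | Y = y)."""
        p_y = sum(p for (x, yy), p in joint.items() if yy == y)
        return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

    def cond_exp(f, y):
        """Return E[f(X) | Y = y] = sum over x of f(x) * P(X = x | Y = y)."""
        return sum(f(x) * p for x, p in cond_pmf(y).items())

    print(cond_pmf(0))               # {0: 0.25, 1: 0.75}
    print(cond_exp(lambda x: x, 0))  # E[X | Y = 0] = 0.75

Varying y over the values with P(Y = y) > 0 then produces E[f(X)|Y] as a function of Y, as described above.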
If Y in the above example is real-valued with a continuous distribution, i.e., P(Y = y) = 0
for all y ∈ R, then how can we define the conditional probability of X given Y ? The answer
lies in the observation that conditioning on the realization of Y effectively means conditioning
on a level set of Y , and the level sets of Y are contained in σ(Y ), the σ-algebra generated
by Y (i.e., the smallest σ-algebra on Ω which makes Y measurable). Therefore conditioning
w.r.t. a random variable Y could be thought of as conditioning w.r.t. the σ-algebra σ(Y ).
Example 2.2 Let Ω = [0, 1]^2, F = B be the Borel σ-algebra on [0, 1]^2, and P be Lebesgue
measure on [0, 1]^2. Let X, Y ∈ L1 (Ω, F, P) with Y : (x, y) → y for all (x, y) ∈ Ω. Then σ(Y )
consists of all sets of the form [0, 1] × A for Borel measurable A ⊂ [0, 1]. The conditional
expectation of X given Y is a random variable measurable w.r.t. σ(Y ), i.e., as a function on
[0, 1]^2, it only depends on the y-coordinate. We can also think of this conditional expectation
as being conditional upon σ(Y ). It is not difficult to see that E[X|Y ] = ∫_0^1 X(x, Y ) dx.
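A quick Monte Carlo sketch of this example, with the illustrative choice X(x, y) = xy (not specified in the notes): the candidate E[X|Y ](x, y) = ∫_0^1 X(u, y) du = y/2 can be tested against the averaging property ∫_A X dP = ∫_A E[X|Y ] dP over sets A ∈ σ(Y ), which is exactly the defining property (2.1) of the next definition.

    # Sketch: check E[X|Y](x, y) = ∫_0^1 X(u, y) du for the illustrative choice X(x, y) = x*y
    # by verifying ∫_A X dP = ∫_A E[X|Y] dP for A = [0,1] x [0,1/2] in σ(Y), via Monte Carlo.
    import random

    def X(x, y):                  # illustrative integrand, not from the notes
        return x * y

    def cond_exp_given_Y(y):      # candidate E[X|Y]: integrate out the x-coordinate
        return y / 2.0            # = ∫_0^1 u*y du

    random.seed(0)
    N = 200_000
    lhs = rhs = 0.0
    for _ in range(N):
        x, y = random.random(), random.random()   # a P-distributed point of Ω = [0,1]^2
        if y <= 0.5:                              # A = [0,1] x [0,1/2] lies in σ(Y)
            lhs += X(x, y)
            rhs += cond_exp_given_Y(y)
    print(lhs / N, rhs / N)   # both ≈ ∫_A X dP = ∫_0^{1/2} y/2 dy = 1/16 = 0.0625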
Definition 2.3 [Conditional expectation and probability]
Let (Ω, F, P) be a probability space. Let G ⊂ F be a sub σ-field, and let X ∈ L1 (Ω, F, P)
be a real-valued random variable. The conditional expectation of X given G, denoted by
E[X|G], is defined to be any random variable Y ∈ L1 (Ω, G, P) satisfying the property that
    ∫_A X dP = ∫_A Y dP    for all A ∈ G.    (2.1)
For X = 1B (ω) with B ∈ F, E[1B |G] is called the conditional probability of B given G.
Remark. If Z is another random variable on (Ω, F, P), then E[X|Z] is a real-valued function
of Z, defined by the composition E[X|σ(Z)] ◦ Z^{-1}, where σ(Z) is the smallest σ-algebra on Ω
which makes Z measurable.
The existence of a version of E[X|G] is established via the Radon-Nikodym Theorem. If
X ∈ L1 (Ω, F, P), µ(A) := ∫_A X dP for A ∈ G defines a finite signed measure on (Ω, G), which
is absolutely continuous w.r.t. the measure P on (Ω, G). Therefore by the Radon-Nikodym
Theorem, there exists a G-measurable function Y which is the density of µ w.r.t. P, and hence
µ(A) = ∫_A Y dP for all A ∈ G. For details, see Section 4.1 of Varadhan [2] and Section 4.1 of
Durrett [1]. Note that there may be multiple versions of E[X|G] which differ from each other
on sets of measure 0.
Proposition 2.4 [Basic properties of conditional expectation]
Let X ∈ L1 (Ω, F, P) and G ⊂ F be a sub σ-field. Then
(i) The conditional expectation E[X|G] is P a.s. uniquely defined.
(ii) If H ⊂ G, then E[ E[X|G] | H ] = E[X|H] almost surely.
(iii) If Y ∈ L1 (Ω, F, P) and X ≥ Y , then E[X|G] ≥ E[Y |G].
(iv) If Y ∈ L1 (Ω, F, P) and a, b are finite constants, then

    E[aX + bY |G] = aE[X|G] + bE[Y |G]    a.s.    (2.2)
(v) If Xn ≥ 0 and Xn ↑ X, then E[Xn |G] ↑ E[X|G].
(vi) If φ is a convex function and φ(X) ∈ L1 (Ω, F, P), then

    E[φ(X)|G] ≥ φ(E[X|G])    a.s.    (2.3)
(vii) If X, XY ∈ L1 (Ω, F, P) and Y is G-measurable, then

    E[XY |G] = Y E[X|G]    a.s.    (2.4)
(viii) If X ∈ L2 (Ω, F, P), then E[X|G] is the orthogonal projection of X onto the subspace
L2 (Ω, G, P) in the Hilbert space L2 (Ω, F, P) with inner product ⟨X, Y⟩ := E[XY].
Proof. (v) follows from the Monotone Convergence Theorem. For (vi), note that a convex
function φ can be written as φ(x) = sup_a (ax − ψ(a)) for some convex function ψ, and we may
even restrict to rational a to write φ(x) = sup_{a∈Q} (ax − ψ(a)). Therefore by (iii), a.s.

    E[φ(X)|G] = E[sup_{a∈Q}(aX − ψ(a))|G] ≥ sup_{a∈Q} E[aX − ψ(a)|G] = sup_{a∈Q}(aE[X|G] − ψ(a)) = φ(E[X|G]).
We restricted to a ∈ Q because conditional expectation is uniquely determined only up to a set
of measure 0, and the union of an uncountable number of sets of measure 0 may have positive
measure, or may even fail to be measurable. (vii) follows by approximating Y with bounded
functions. For (viii), note that if X ∈ L2 (Ω, F, P), then

    ⟨E[X|G], X − E[X|G]⟩ = E[X E[X|G]] − E[E[X|G]^2] = E[E[X E[X|G] | G]] − E[E[X|G]^2] = 0.

For more details, see Section 4.2 of Varadhan [2] and Section 4.1(b) of Durrett [1].
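For intuition about properties (ii) and (vi), one can also check them numerically on a finite probability space, where conditioning on the σ-field generated by a partition is just blockwise averaging. The sketch below is only illustrative: the sample space, weights, partitions and the random variable X are all invented.

    # Finite-Ω sketch: conditioning on a σ-field generated by a partition is blockwise averaging.
    # Check the tower property (ii) and Jensen's inequality (vi) on an invented example.
    import random

    random.seed(0)
    omega = list(range(12))
    p = {w: 1.0 / 12 for w in omega}              # uniform probability on Ω = {0, ..., 11}
    X = {w: random.gauss(0, 1) for w in omega}    # an arbitrary random variable

    def cond_exp(Z, blocks):
        """E[Z | G] where G is the σ-field generated by the partition 'blocks' of Ω."""
        out = {}
        for b in blocks:
            avg = sum(p[w] * Z[w] for w in b) / sum(p[w] for w in b)
            for w in b:
                out[w] = avg
        return out

    G = [range(0, 3), range(3, 6), range(6, 9), range(9, 12)]   # finer partition
    H = [range(0, 6), range(6, 12)]                             # coarser partition: σ(H) ⊂ σ(G)

    tower = cond_exp(cond_exp(X, G), H)                         # (ii): E[E[X|G] | H] = E[X|H]
    direct = cond_exp(X, H)
    print(all(abs(tower[w] - direct[w]) < 1e-12 for w in omega))   # True

    EX2_G = cond_exp({w: X[w] ** 2 for w in omega}, G)          # (vi): Jensen with φ(x) = x^2
    EX_G = cond_exp(X, G)
    print(all(EX2_G[w] >= EX_G[w] ** 2 for w in omega))            # True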
2.2  Regular conditional distributions and probabilities
We now deal with an important, but much more subtle issue. Let X : (Ω, F, P) → (S, S) be a
random variable taking values in a general space S with σ-field S. Then X has distribution
P ◦ X −1 on (S, S). Let G ⊂ F be a sub σ-field.
For any set A ∈ S, the conditional probability of X ∈ A given G is given by the conditional
expectation P(ω, A) := E[1A (X)|G]. As we vary A ∈ S, we obtain a map from Ω × S to [0, 1].
The question is: when we consider simultaneously all A ∈ S, can we ensure that for P almost
every ω ∈ Ω, P(ω, ·) is in fact a probability measure on (S, S)? The answer is non-trivial.
Note that for each A ∈ S, P(·, A) is almost surely uniquely defined by Proposition 2.4 (i),
and we are free to modify P(·, A) on a set of probability zero. In particular, for any countable
collection of disjoint sets An ∈ S, n ∈ N, we have by Proposition 2.4 (iv)
    P(ω, ∪n An) = Σn P(ω, An)    (2.5)
on a set of ω with probability 1. We desire a version of P(·, ·) such that for P-almost every
ω ∈ Ω, P(ω, ·) satisfies the above countable additivity property for every countable collection of
disjoint sets. However, such a version may not exist, because the subset of Ω on which (2.5) fails
depends on the collection {An}. Since there are uncountably many such collections of
sets, the corresponding exceptional sets of probability zero could add up to a set with positive
probability, or even fail to be measurable.
Definition 2.5 [Regular conditional distributions and probabilities]
Let (Ω, F, P), G and X : (Ω, F, P) → (S, S) be as above. A family of probability distributions
on (S, S), denoted by (µ(ω, ·))ω∈Ω , is called a regular conditional distribution of X given
G if for each A ∈ S, µ(·, A) = E[1A (X)|G] a.s. When (S, S) = (Ω, F) and X(ω) = ω,
(µ(ω, ·))ω∈Ω is called a regular conditional probability on F given G.
If X has a regular conditional distribution given G, then conditional expectations of functions of X given G can be expressed as integrals over the regular conditional distribution.
Proposition 2.6 Let (Ω, F, P), G, (S, S), X be as in Definition 2.5. Let (µ(ω, ·))ω∈Ω be
a regular conditional distribution of X given G. Then for any Borel-measurable function
f : (S, S) → (R, B) with E|f (X)| < ∞, we have
    E[f(X)|G] = ∫ f(x) µ(ω, dx)    a.s.    (2.6)
Proof. By writing f as the sum of its positive and negative parts, we may assume w.l.o.g.
that f ≥ 0. By definition, (2.6) holds when f is an indicator function, and hence also when f
is a simple function (finite linear combination of indicator functions). Since any non-negative
measurable function f is the increasing limit of a sequence of simple functions, (2.6) follows
from Proposition 2.4 (v) and the monotone convergence theorem.
When the space (S, S) in which the random variable X takes its value is sufficiently nice,
regular conditional distributions do exist.
Theorem 2.7 [Existence of regular conditional distributions]
Let (Ω, F, P), G and X : (Ω, F, P) → (S, S) be as in Definition 2.5. If S is a complete separable metric space with Borel σ-field S, then there exists a regular conditional distribution
(µ(ω, ·))ω∈Ω for X given G.
Proof. If S contains only a countable number of points, then the existence of a regular
conditional distribution is trivial. If S contains uncountably many points, then there is
a one-to-one measurable map φ with a measurable inverse φ^{-1} between (S, S) and ([0, 1], B),
where B is the Borel σ-field on [0, 1] (see Remark 4.6 in Varadhan [2]). Thus w.l.o.g. we may
assume (S, S) = ([0, 1], B).
Let us first construct the conditional probabilities for sets of the form (−∞, q] for q ∈ Q,
i.e., let G(ω, q) := E[1{X≤q} |G]. Since there are only countably many such q, by Proposition 2.4 (iii),
we can find Ω0 ⊂ Ω with P(Ω0 ) = 1 such that for all ω ∈ Ω0 ,
    G(ω, q) = 0    for all q ∈ Q ∩ (−∞, 0),
    G(ω, q) = 1    for all q ∈ Q ∩ (1, ∞),                          (2.7)
    G(ω, q1) ≤ G(ω, q2)    for all q1, q2 ∈ Q with q1 < q2.
For each ω ∈ Ω0 and x ∈ R, define
    F(ω, x) = lim_{q↓x, q∈Q} G(ω, q) = lim_{q↓x, q∈Q} E[1{X≤q} |G],    (2.8)
which defines the distribution function of a probability measure µ(ω, ·) on ([0, 1], B). By
Proposition 2.4 (v), for each x ∈ R, µ(ω, (−∞, x]) := F (ω, x) is a version of the conditional
expectation E[1{X≤x} |G]. It only remains to show that the same is true if we replace (−∞, x]
by any B ∈ B.
Note that the collection of sets Λ := {B ∈ B : µ(ω, B) = E[1{X∈B} |G] a.s.} is a λ-system:
i.e., [0, 1] ∈ Λ; if B1 ⊂ B2 and B1 , B2 ∈ Λ, then B2 \B1 ∈ Λ; if Bn ∈ Λ and Bn ↑ B, then
B ∈ Λ. On the other hand, Λ clearly contains finite disjoint unions of intervals of the form
(a, b], which is a π-system. Therefore by the π-λ theorem, Λ contains the σ-field generated by
the π-system, which is just B.
We used the following result, which is equivalent to the Monotone Class Theorem.
Theorem 2.8 [Dynkin’s π-λ Theorem] Let Π ⊂ Λ be two collections of subsets of Ω, where
Π is a π-system (i.e., A, B ∈ Π ⇒ A ∩ B ∈ Π), and Λ is a λ-system (i.e.: (i) Ω ∈ Λ, (ii)
A, B ∈ Λ and A ⊂ B ⇒ B\A ∈ Λ, (iii) An ∈ Λ and An ↑ A ⇒ A ∈ Λ). Then σ(Π) ⊂ Λ.
Dynkin’s π-λ Theorem is often used to prove that a certain property holds for all sets in a
σ-algebra. For a proof, see Section A.2 of Durrett [1].
3  Martingales
Martingales capture the notion of fair future returns given past information. The term originally
referred to a class of betting strategies popular in 18th-century France. We will focus on
discrete-time martingales. We will first recall the definition of a martingale and then collect
some essential results, including Doob’s inequality, the martingale convergence theorems, Doob’s
decomposition, the law of large numbers for martingales, the upcrossing inequality, the optional
stopping theorem, and concentration of measure for martingales with bounded increments. To
illustrate the use of martingales, we will study several models, including Polya’s urn, branching
processes, and birth-death chains.
3.1  Definition and basic properties
Definition 3.1 [Filtration]
Let (Ω, F, P) be a probability space. A filtration (Fn )n∈N is an increasing sequence of sub
σ-algebras of F, i.e.,
F1 ⊂ F2 ⊂ · · · ⊂ Fn ⊂ · · · ⊂ F.
We can think of Fn as information available up to time n.
Definition 3.2 [Martingale, super-martingale and sub-martingale]
Let (Ω, F, P) be a probability space equipped with a filtration (Fn )n∈N ⊂ F. A sequence of
random variables X := (Xn )n∈N is called a martingale adapted to the filtration (Fn )n∈N if
(i) Xn ∈ L1 (Ω, Fn , P ) for all n ∈ N.
(ii) E[Xn+1 |Fn ] = Xn a.s. for all n ∈ N.
If in (ii), = is replaced by ≤ (resp. ≥), then X is called a super-martingale (resp. sub-martingale).
When the filtration (Fn )n∈N is not specified explicitly, we take the canonical
filtration Fn = σ(X1 , · · · , Xn ), i.e., the σ-field generated by the random variables X1 , · · · , Xn .
Example 3.3 [Mean Zero Random Walk] If (ξn)n∈N are i.i.d. random variables with
E[ξ1] = 0, then Xn := Σ_{i=1}^n ξi is a martingale adapted to the filtration Fn := σ(ξ1, · · · , ξn),
n ∈ N. (Xn)n∈N records the position of a random walk on R.
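As a sanity check of this example, note that the martingale property E[Xn+1 |Fn ] = Xn is equivalent to E[Xn+1 1A ] = E[Xn 1A ] for Fn-measurable events A, and the latter is easy to test by simulation. The sketch below uses the illustrative choices ξi = ±1 with probability 1/2 each and A = {Xn > 0}.

    # Sketch: Monte Carlo check of the martingale property of a mean-zero (±1) random walk,
    # via the identity E[X_{n+1} 1_A] = E[X_n 1_A] for the F_n-measurable event A = {X_n > 0}.
    import random

    random.seed(1)
    n, trials = 20, 200_000
    lhs = rhs = 0.0
    for _ in range(trials):
        steps = [random.choice((-1, 1)) for _ in range(n + 1)]
        X_n = sum(steps[:n])
        X_next = X_n + steps[n]
        if X_n > 0:                    # the event A = {X_n > 0} depends only on F_n
            lhs += X_next
            rhs += X_n
    print(lhs / trials, rhs / trials)  # the two averages agree up to Monte Carlo error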
Example 3.4 If X ∈ L1 (Ω, F, P) and (Fn )n∈N ⊂ F is a filtration, then Xn := E[X|Fn ] is a
martingale adapted to (Fn )n∈N .
Example 3.5 [Martingale Transforms as Betting Strategies] If we think of the martingale difference Di = Xi − Xi−1 (with X0 = 0) as the reward/loss of the i-th game in a
sequence of (possibly dependent) games, then a martingale corresponds to a fair game since
E[Xn ] = X0 . A martingale transform is defined by
    X'_i = X'_{i−1} + h_{i−1} D_i ,    (3.9)

where h_{i−1} is F_{i−1}-measurable and such that h_{i−1} D_i is integrable. We can interpret h_{i−1} as
the size of the bet in the i-th game, and one is only allowed to choose h_{i−1} based on information
available prior to the i-th game. It is easy to verify that X'_i is still a martingale w.r.t. F_i. Thus
no matter which strategy one chooses, as long as one does not peek into the future, the game
remains fair, i.e., E[X'_n] = X'_0.
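The fairness of the transformed game is easy to see in simulation. In the sketch below the strategy (bet 1 after a win, 2 after a loss) is an arbitrary illustrative predictable rule; the average terminal wealth stays near X'_0 = 0.

    # Sketch: a martingale transform of a fair ±1 game under a predictable betting rule.
    # The rule "bet 1 after a win, 2 after a loss" is an invented illustrative strategy.
    import random

    random.seed(2)
    n, trials = 50, 100_000
    total = 0.0
    for _ in range(trials):
        wealth, bet = 0.0, 1.0
        for _ in range(n):
            D = random.choice((-1, 1))     # fair game increment: E[D | past] = 0
            wealth += bet * D              # X'_i = X'_{i-1} + h_{i-1} D_i
            bet = 1.0 if D > 0 else 2.0    # h_i uses only information up to game i
        total += wealth
    print(total / trials)   # ≈ 0 = X'_0: no predictable strategy changes the mean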
As immediate consequences of the properties of conditional expectation, we have
Proposition 3.6 If (Xn )n∈N is a martingale adapted to the filtration (Fn )n∈N , then
(i) E[Xn |Fm ] = Xm a.s. for all 1 ≤ m ≤ n, and E[Xn ] = c is independent of n ∈ N.
(ii) If φ is a convex (resp. concave) function and E[|φ(Xn )|] < ∞ for all n ∈ N, then
(φ(Xn ))n∈N is a sub- (resp. super-)martingale adapted to the filtration (Fn )n∈N .
Example 3.7 If (Xn )n∈N is a martingale adapted to the filtration (Fn )n∈N , then for c ∈ R,
Xn ∧c is a super-martingale while Xn ∨c is a sub-martingale. If for some p ≥ 1, E[|Xn |p ] < ∞
for all n ∈ N, then |Xn |p is a sub-martingale w.r.t. Fn . If (Xn )n∈N is a sub-martingale and φ
is a convex increasing function, then (φ(Xn ))n∈N is a sub-martingale provided they are all
integrable.
3.2  Martingale Decomposition
Let X ∈ L1 (Ω, F, P). A useful technique for bounding the variance of X or establishing
concentration properties of X is to perform a martingale decomposition. Namely, introduce a
filtration F0 := {∅, Ω} ⊂ F1 ⊂ · · · ⊂ Fn = F and write

    X = E[X] + Σ_{i=1}^n (Xi − Xi−1),    (3.10)
where Xi = E[X|Fi ]. Note that (Xi )1≤i≤n is a martingale. If X ∈ L2 (Ω, F, P), then by the
orthogonality of martingale increments, we have
    Var(X) = E[(X − E[X])^2] = Σ_{i=1}^n E[(Xi − Xi−1)^2] = Σ_{i=1}^n E[Var(Xi |Fi−1)].    (3.11)
The conditional variance Var(Xi |Fi−1) := E[Xi^2 |Fi−1] − (E[Xi |Fi−1])^2 can often be bounded
using coupling techniques.
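The decomposition (3.10)-(3.11) can also be checked numerically. In the sketch below, X = ξ1 ξ2 + ξ3 with ξi i.i.d. Uniform(0, 1) and Fi = σ(ξ1, . . . , ξi) is an invented example for which the Doob martingale Xi = E[X|Fi] has simple closed forms.

    # Sketch: numerical check of the variance decomposition (3.11) for the invented example
    # X = ξ1*ξ2 + ξ3, ξi i.i.d. Uniform(0,1), F_i = σ(ξ1, ..., ξi), X_i = E[X | F_i].
    import random

    random.seed(3)
    N = 200_000
    var_X = 0.0
    incr = [0.0, 0.0, 0.0]
    X0 = 0.25 + 0.5                        # X_0 = E[X] = E[ξ1]E[ξ2] + E[ξ3]
    for _ in range(N):
        u1, u2, u3 = random.random(), random.random(), random.random()
        X1 = u1 * 0.5 + 0.5                # E[X | ξ1]
        X2 = u1 * u2 + 0.5                 # E[X | ξ1, ξ2]
        X3 = u1 * u2 + u3                  # X itself
        var_X += (X3 - X0) ** 2
        for i, (a, b) in enumerate([(X1, X0), (X2, X1), (X3, X2)]):
            incr[i] += (a - b) ** 2
    print(var_X / N, sum(incr) / N)        # both ≈ Var(X) = 19/144 ≈ 0.132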
As an illustration of the martingale decomposition, we prove a concentration of measure
inequality for martingales with bounded increments.
Theorem 3.8 [Azuma-Hoeffding inequality]
Let (Xi )1≤i≤n be a martingale adapted to the filtration (Fi )1≤i≤n on a probability space (Ω, F, P).
Assume X0 := E[X1 ] = 0, and |Xi − Xi−1 | ≤ K for all 1 ≤ i ≤ n a.s. Then for all x ≥ 0,
    P(Xn/√n ≥ x) ≤ e^{−x²/(2K²)}.    (3.12)

The same bound holds for P(Xn/√n ≤ −x). Note that the bound on the tail probabilities of
Xn/√n is comparable to that of a Gaussian distribution.
Proof. Let Di := Xi − Xi−1 for 1 ≤ i ≤ n. By the exponential Markov inequality, for any
λ > 0 and y ≥ 0,
    P(Xn ≥ y) = P(e^{λXn} ≥ e^{λy}) ≤ e^{−λy} E[e^{λXn}] = e^{−λy} E[ e^{λX_{n−1}} E[e^{λDn} |Fn−1] ].    (3.13)

Note that by convexity, e^{λx} ≤ (e^{λK} + e^{−λK})/2 + ((e^{λK} − e^{−λK})/(2K)) x for all x ∈ [−K, K].
Since |Dn| ≤ K a.s. and E[Dn |Fn−1] = 0, we have

    E[e^{λDn} |Fn−1] ≤ (e^{λK} + e^{−λK})/2 ≤ e^{λ²K²/2}.

Substituting this bound into (3.13) and successively conditioning on Fn−2, . . . , F0 := {∅, Ω}
then yields

    P(Xn ≥ y) ≤ e^{−λy + nλ²K²/2}.

Since λ > 0 can be arbitrary, optimizing over λ > 0 then yields

    P(Xn ≥ y) ≤ e^{− sup_{λ>0}(λy − nλ²K²/2)} = e^{−y²/(2nK²)}.

Setting y = x√n then gives the desired bound. The bound for P(Xn/√n ≤ −x) is identical.
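As a sanity check, the bound (3.12) can be compared with simulation for the simplest bounded-increment martingale, the ±1 random walk (so K = 1). The values n = 100 and x = 1 below are arbitrary illustrative choices.

    # Sketch: empirical tail of X_n / sqrt(n) for a ±1 random walk (K = 1), compared with
    # the Azuma-Hoeffding bound exp(-x^2 / (2 K^2)) from (3.12).
    import math
    import random

    random.seed(4)
    n, trials, x, K = 100, 100_000, 1.0, 1.0
    hits = 0
    for _ in range(trials):
        X_n = sum(random.choice((-1, 1)) for _ in range(n))
        if X_n / math.sqrt(n) >= x:
            hits += 1
    print("empirical tail:", hits / trials)                 # ≈ 0.18, well below the bound
    print("Azuma bound:   ", math.exp(-x**2 / (2 * K**2)))  # ≈ 0.61, a valid upper bound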
References
[1] R. Durrett, Probability: Theory and Examples, 2nd edition, Duxbury Press, Belmont,
California, 1996.
[2] S.R.S. Varadhan, Probability Theory, Courant Lecture Notes 7, American Mathematical
Society, Providence, Rhode Island, 2001.