Lecture 1

1  Probability space and random variables

Let us recall Kolmogorov's formulation of modern probability theory using measure theory.

Definition 1.1 [Probability space and random variables] A probability space is a triple (Ω, F, P), where Ω is a set, F is a σ-algebra on Ω, and P is a probability measure on the measurable space (Ω, F). A real-valued random variable is a measurable map X : (Ω, F) → (R, B) (B being the Borel σ-algebra on R), with distribution P ◦ X^{-1}, i.e.,

    P(X ∈ A) = P(ω : X(ω) ∈ A) = P(X^{-1}(A))   for all A ∈ B.

2  Conditional probabilities and expectations

2.1  Definition and properties

Let us recall how conditional probability and expectation are defined in the discrete setting.

Example 2.1 Let X and Y be two integer-valued random variables on a probability space (Ω, F, P). If P(Y = y) > 0, then the conditional probability of X given Y = y is defined by

    P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)   for all x ∈ Z,

while for any f : Z → R, the conditional expectation of f(X) given Y = y is defined by

    E[f(X) | Y = y] = Σ_{x∈Z} f(x) P(X = x | Y = y).

The conditional distribution of X given Y is then the family of probability distributions P(X ∈ · | Y = y), indexed by those y ∈ Z with P(Y = y) > 0. For any f : Z → R, the conditional expectation of f(X) given Y, denoted by E[f(X)|Y], is a function of the random variable Y, which is again a random variable.

If Y in the above example is real-valued with a continuous distribution, i.e., P(Y = y) = 0 for all y ∈ R, then how can we define the conditional probability of X given Y? The answer lies in the observation that conditioning on the realization of Y effectively means conditioning on a level set of Y, and the level sets of Y are contained in σ(Y), the σ-algebra generated by Y (i.e., the smallest σ-algebra on Ω which makes Y measurable). Therefore conditioning w.r.t. a random variable Y can be thought of as conditioning w.r.t. the σ-algebra σ(Y).
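The discrete formulas in Example 2.1 are easy to check numerically. The following Python sketch (the toy joint distribution and helper names are invented for illustration, not part of the notes) computes P(X = x | Y = y) and E[f(X) | Y = y] from a joint probability table:

```python
# Joint distribution P(X = x, Y = y) of two integer-valued random variables,
# stored as {(x, y): probability}. The table is invented for illustration.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def cond_prob(joint, x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y), defined when P(Y = y) > 0."""
    p_y = sum(p for (_, yi), p in joint.items() if yi == y)
    return joint.get((x, y), 0.0) / p_y

def cond_exp(joint, f, y):
    """E[f(X) | Y = y] = sum over x of f(x) * P(X = x | Y = y)."""
    xs = {xi for (xi, _) in joint}
    return sum(f(x) * cond_prob(joint, x, y) for x in xs)

print(cond_prob(joint, 1, 0))            # P(X = 1 | Y = 0) = 0.3 / 0.4
print(cond_exp(joint, lambda x: x, 1))   # E[X | Y = 1] = 0.4 / 0.6 = 2/3
```

Note that cond_exp realizes the sum Σ_x f(x) P(X = x | Y = y) directly; for Y with a continuous distribution this recipe fails, which is exactly what motivates the σ-algebra formulation just discussed.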
Example 2.2 Let Ω = [0, 1]², F = B be the Borel σ-algebra on [0, 1]², and P be Lebesgue measure on [0, 1]². Let X, Y ∈ L^1(Ω, F, P) with Y : (x, y) → y for all (x, y) ∈ Ω. Then σ(Y) consists of all sets of the form [0, 1] × A for Borel measurable A ⊂ [0, 1]. The conditional expectation of X given Y is a random variable measurable w.r.t. σ(Y), i.e., as a function on [0, 1]², it only depends on the y-coordinate. We can also think of this conditional expectation as being conditional upon σ(Y). It is not difficult to see that E[X|Y] = ∫_0^1 X(x, Y) dx.

Definition 2.3 [Conditional expectation and probability] Let (Ω, F, P) be a probability space, let G ⊂ F be a sub-σ-field, and let X ∈ L^1(Ω, F, P) be a real-valued random variable. The conditional expectation of X given G, denoted by E[X|G], is defined to be any random variable Y ∈ L^1(Ω, G, P) satisfying the property that

    ∫_A X dP = ∫_A Y dP   for all A ∈ G.                (2.1)

For X = 1_B(ω) with B ∈ F, E[1_B|G] is called the conditional probability of B given G.

Remark. If Z is another random variable on (Ω, F, P), then E[X|Z] is a real-valued function of Z, defined by the composition E[X|σ(Z)] ◦ Z^{-1}, where σ(Z) is the smallest σ-algebra on Ω which makes Z measurable.

The existence of a version of E[X|G] is established via the Radon-Nikodym Theorem. If X ∈ L^1(Ω, F, P), then µ(A) := ∫_A X dP for A ∈ G defines a finite signed measure on (Ω, G), which is absolutely continuous w.r.t. the measure P on (Ω, G). Therefore by the Radon-Nikodym Theorem, there exists a G-measurable function Y which is the density of µ w.r.t. P, and hence µ(A) = ∫_A Y dP for all A ∈ G. For details, see Section 4.1 of Varadhan [2] and Section 4.1 of Durrett [1]. Note that there may be multiple versions of E[X|G], which differ from each other on sets of measure 0.

Proposition 2.4 [Basic properties of conditional expectation] Let X ∈ L^1(Ω, F, P) and let G ⊂ F be a sub-σ-field. Then

(i) The conditional expectation E[X|G] is P-a.s. uniquely defined.
(ii) If H ⊂ G, then E[ E[X|G] | H ] = E[X|H] almost surely.

(iii) If Y ∈ L^1(Ω, F, P) and X ≥ Y, then E[X|G] ≥ E[Y|G].

(iv) If Y ∈ L^1(Ω, F, P) and a, b are finite constants, then

    E[aX + bY | G] = a E[X|G] + b E[Y|G]   a.s.          (2.2)

(v) If X_n ≥ 0 and X_n ↑ X, then E[X_n|G] ↑ E[X|G].

(vi) If φ is a convex function and φ(X) ∈ L^1(Ω, F, P), then

    E[φ(X)|G] ≥ φ(E[X|G])   a.s.                         (2.3)

(vii) If X, XY ∈ L^1(Ω, F, P) and Y is G-measurable, then

    E[XY|G] = Y E[X|G]   a.s.                            (2.4)

(viii) If X ∈ L^2(Ω, F, P), then E[X|G] is the orthogonal projection of X onto the subspace L^2(Ω, G, P) in the Hilbert space L^2(Ω, F, P) with inner product ⟨X, Y⟩ := E[XY].

Proof. (v) follows from the Monotone Convergence Theorem. For (vi), note that a convex function φ can be written as φ(x) = sup_a (ax − ψ(a)) for some convex function ψ, and we may even restrict to rational a to write φ(x) = sup_{a∈Q} (ax − ψ(a)). Therefore by (iii), a.s.,

    E[φ(X)|G] = E[ sup_{a∈Q} (aX − ψ(a)) | G ] ≥ sup_{a∈Q} E[aX − ψ(a) | G] = sup_{a∈Q} (a E[X|G] − ψ(a)) = φ(E[X|G]).

We restricted to a ∈ Q because conditional expectation is uniquely determined only up to a set of measure 0, and the union of an uncountable number of sets of measure 0 may have positive measure, or even fail to be measurable. (vii) follows by approximating Y with bounded functions. For (viii), note that if X ∈ L^2(Ω, F, P), then

    ⟨E[X|G], X − E[X|G]⟩ = E[X E[X|G]] − E[E[X|G]²] = E[ E[X E[X|G] | G] ] − E[E[X|G]²] = E[E[X|G]²] − E[E[X|G]²] = 0,

where the third equality uses (vii) to pull the G-measurable factor E[X|G] out of the inner conditional expectation. For more details, see Section 4.2 of Varadhan [2] and Section 4.1(b) of Durrett [1].

2.2  Regular conditional distributions and probabilities

We now deal with an important, but much more subtle issue. Let X : (Ω, F, P) → (S, S) be a random variable taking values in a general space S with σ-field S. Then X has distribution P ◦ X^{-1} on (S, S). Let G ⊂ F be a sub-σ-field. For any set A ∈ S, the conditional probability of X ∈ A given G is given by the conditional expectation P(ω, A) := E[1_A(X)|G]. As we vary A ∈ S, we obtain a map from Ω × S to [0, 1].
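When G is generated by a finite partition of a discrete Ω, the map P(ω, A) can be computed explicitly: E[1_A(X)|G] is constant on each partition block and equals the P-weighted fraction of the block that X sends into A. A minimal Python sketch (the sample space, measure, and all names here are hypothetical, chosen only for illustration):

```python
# Omega = {0, ..., 5} with uniform measure P; X is a function on Omega;
# G is the sigma-algebra generated by a two-block partition of Omega.
omega = list(range(6))
prob = {w: 1 / 6 for w in omega}          # P({w})
X = {w: w % 3 for w in omega}             # an illustrative random variable
blocks = [{0, 1, 2}, {3, 4, 5}]           # partition generating G

def cond_prob_given_G(w, A):
    """Return P(w, A) = E[1_A(X) | G] evaluated at the sample point w:
    the P-weighted fraction of w's partition block that X maps into A."""
    block = next(b for b in blocks if w in b)
    p_block = sum(prob[v] for v in block)
    return sum(prob[v] for v in block if X[v] in A) / p_block

# X takes each value in {0, 1, 2} once per block, so P(w, {0}) = 1/3 for all w.
print(cond_prob_given_G(0, {0}))
```

In such a finite setting, P(ω, ·) is automatically a probability measure for every ω, since only finitely many exceptional null sets can ever arise.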
The question is: when we consider simultaneously all A ∈ S, can we ensure that for P-almost every ω ∈ Ω, P(ω, ·) is in fact a probability measure on (S, S)? The answer is non-trivial. Note that for each A ∈ S, P(·, A) is almost surely uniquely defined by Proposition 2.4 (i), and we are free to modify P(·, A) on a set of probability zero. In particular, for any countable collection of disjoint sets A_n ∈ S, n ∈ N, we have by Proposition 2.4 (iv)

    P(ω, ∪_n A_n) = Σ_n P(ω, A_n)                        (2.5)

on a set of ω with probability 1. We desire a version of P(·, ·) such that for P-almost every ω ∈ Ω, P(ω, ·) satisfies the above countable additivity property for every countable collection of disjoint sets. However, such a version may not exist, because the subset of Ω on which (2.5) fails depends on the collection {A_n}. Since there are uncountably many such collections of sets, the corresponding exceptional sets of probability zero could add up to a set with positive probability, or even fail to be measurable.

Definition 2.5 [Regular conditional distributions and probabilities] Let (Ω, F, P), G and X : (Ω, F, P) → (S, S) be as above. A family of probability distributions on (S, S), denoted by (µ(ω, ·))_{ω∈Ω}, is called a regular conditional distribution of X given G if for each A ∈ S, µ(·, A) = E[1_A(X)|G] a.s. When (S, S) = (Ω, F) and X(ω) = ω, (µ(ω, ·))_{ω∈Ω} is called a regular conditional probability on F given G.

If X has a regular conditional distribution given G, then conditional expectations of functions of X given G can be expressed as integrals over the regular conditional distribution.

Proposition 2.6 Let (Ω, F, P), G, (S, S), X be as in Definition 2.5, and let (µ(ω, ·))_{ω∈Ω} be a regular conditional distribution of X given G. Then for any Borel-measurable function f : (S, S) → (R, B) with E|f(X)| < ∞, we have

    E[f(X)|G] = ∫_S f(x) µ(ω, dx)   a.s.                 (2.6)

Proof. By writing f as the sum of its positive and negative parts, we may assume w.l.o.g. that f ≥ 0.
By definition, (2.6) holds when f is an indicator function, and hence also when f is a simple function (a finite linear combination of indicator functions). Since any non-negative measurable function f is the increasing limit of a sequence of simple functions, (2.6) follows from Proposition 2.4 (v) and the monotone convergence theorem.

When the space (S, S) in which the random variable X takes its values is sufficiently nice, regular conditional distributions do exist.

Theorem 2.7 [Existence of regular conditional distributions] Let (Ω, F, P), G and X : (Ω, F, P) → (S, S) be as in Definition 2.5. If S is a complete separable metric space with Borel σ-field S, then there exists a regular conditional distribution (µ(ω, ·))_{ω∈Ω} for X given G.

Proof. If S contains only a countable number of points, then the existence of a regular conditional distribution is trivial. If S contains an uncountable number of points, then there is a one-to-one measurable map φ with a measurable inverse φ^{-1} between (S, S) and ([0, 1], B), where B is the Borel σ-field on [0, 1] (see Remark 4.6 in Varadhan [2]). Thus w.l.o.g. we may assume (S, S) = ([0, 1], B).

Let us first construct the conditional probabilities for sets of the form (−∞, q] for q ∈ Q, i.e., let G(ω, q) := E[1_{X≤q}|G]. Since there are only countably many such q, by Proposition 2.4 (iii) we can find Ω_0 ⊂ Ω with P(Ω_0) = 1 such that for all ω ∈ Ω_0,

    G(ω, q) = 0   for all q ∈ Q ∩ (−∞, 0),
    G(ω, q) = 1   for all q ∈ Q ∩ (1, ∞),                (2.7)
    G(ω, q_1) ≤ G(ω, q_2)   for all q_1, q_2 ∈ Q with q_1 < q_2.

For each ω ∈ Ω_0 and x ∈ R, define

    F(ω, x) = lim_{q↓x, q∈Q} G(ω, q) = lim_{q↓x, q∈Q} E[1_{X≤q}|G],   (2.8)

which defines the distribution function of a probability measure µ(ω, ·) on ([0, 1], B). By Proposition 2.4 (v), for each x ∈ R, µ(ω, (−∞, x]) := F(ω, x) is a version of the conditional expectation E[1_{X≤x}|G]. It only remains to show that the same is true if we replace (−∞, x] by any B ∈ B.
Note that the collection of sets Λ := {B ∈ B : µ(ω, B) = E[1_{X∈B}|G] a.s.} is a λ-system: [0, 1] ∈ Λ; if B_1 ⊂ B_2 and B_1, B_2 ∈ Λ, then B_2\B_1 ∈ Λ; and if B_n ∈ Λ with B_n ↑ B, then B ∈ Λ. On the other hand, Λ clearly contains the finite disjoint unions of intervals of the form (a, b], which form a π-system. Therefore by the π-λ theorem, Λ contains the σ-field generated by this π-system, which is just B.

We used the following result, which is equivalent to the Monotone Class Theorem.

Theorem 2.8 [Dynkin's π-λ Theorem] Let Π ⊂ Λ be two collections of subsets of Ω, where Π is a π-system (i.e., A, B ∈ Π ⇒ A ∩ B ∈ Π), and Λ is a λ-system (i.e.: (i) Ω ∈ Λ; (ii) A, B ∈ Λ and A ⊂ B ⇒ B\A ∈ Λ; (iii) A_n ∈ Λ and A_n ↑ A ⇒ A ∈ Λ). Then σ(Π) ⊂ Λ.

Dynkin's π-λ Theorem is often used to prove that a certain property holds for all sets in a σ-algebra. For a proof, see Section A.2 of Durrett [1].

3  Martingales

Martingales capture the notion of fair future returns given past information. The term originally referred to a class of betting strategies popular in 18th century France. We will focus on discrete-time martingales. We will first recall the definition of a martingale and then collect some essential results, including Doob's inequality, the martingale convergence theorems, Doob's decomposition, the law of large numbers for martingales, the upcrossing inequality, the optional stopping theorem, and concentration of measure for martingales with bounded increments. To illustrate the use of martingales, we will study several models, including Polya's urn, branching processes, and birth-death chains.

3.1  Definition and basic properties

Definition 3.1 [Filtration] Let (Ω, F, P) be a probability space. A filtration (F_n)_{n∈N} is an increasing sequence of sub-σ-algebras of F, i.e., F_1 ⊂ F_2 ⊂ · · · ⊂ F_n ⊂ · · · ⊂ F. We can think of F_n as the information available up to time n.

Definition 3.2 [Martingale, super-martingale and sub-martingale] Let (Ω, F, P) be a probability space equipped with a filtration (F_n)_{n∈N} ⊂ F.
A sequence of random variables X := (X_n)_{n∈N} is called a martingale adapted to the filtration (F_n)_{n∈N} if

(i) X_n ∈ L^1(Ω, F_n, P) for all n ∈ N;

(ii) E[X_{n+1}|F_n] = X_n a.s. for all n ∈ N.

If in (ii), = is replaced by ≤ (resp. ≥), then X is called a super-martingale (resp. sub-martingale). When the filtration (F_n)_{n∈N} is not specified explicitly, we take the canonical filtration F_n = σ(X_1, · · · , X_n), i.e., the σ-field generated by the random variables X_1, · · · , X_n.

Example 3.3 [Mean Zero Random Walk] If (ξ_n)_{n∈N} are i.i.d. random variables with E[ξ_1] = 0, then X_n := Σ_{i=1}^n ξ_i is a martingale adapted to the filtration F_n := σ(ξ_1, · · · , ξ_n), n ∈ N. (X_n)_{n∈N} records the position of a random walk on R.

Example 3.4 If X ∈ L^1(Ω, F, P) and (F_n)_{n∈N} ⊂ F is a filtration, then X_n := E[X|F_n] is a martingale adapted to (F_n)_{n∈N}.

Example 3.5 [Martingale Transforms as Betting Strategies] If we think of the martingale difference D_i = X_i − X_{i−1} (with X_0 = 0) as the reward/loss of the i-th game in a sequence of (possibly dependent) games, then a martingale corresponds to a fair game, since E[X_n] = X_0. A martingale transform is defined by

    X'_i = X'_{i−1} + h_{i−1} D_i,                       (3.9)

where h_{i−1} is F_{i−1}-measurable and such that h_{i−1} D_i is integrable. We can interpret h_{i−1} as the size of the bet in the i-th game, and one is only allowed to choose h_{i−1} based on information available prior to the i-th game. It is easy to verify that X'_i is still a martingale w.r.t. F_i. Thus no matter which strategy one chooses, as long as one does not peek into the future, the game remains fair, i.e., E[X'_n] = X'_0.

As immediate consequences of the properties of conditional expectation, we have

Proposition 3.6 If (X_n)_{n∈N} is a martingale adapted to the filtration (F_n)_{n∈N}, then

(i) E[X_n|F_m] = X_m a.s. for all 1 ≤ m ≤ n, and E[X_n] = c is independent of n ∈ N.

(ii) If φ is a convex (resp. concave) function and E[|φ(X_n)|] < ∞ for all n ∈ N, then (φ(X_n))_{n∈N} is a sub- (resp. super-)martingale adapted to the filtration (F_n)_{n∈N}.

Example 3.7 If (X_n)_{n∈N} is a martingale adapted to the filtration (F_n)_{n∈N}, then for c ∈ R, X_n ∧ c is a super-martingale while X_n ∨ c is a sub-martingale. If for some p ≥ 1, E[|X_n|^p] < ∞ for all n ∈ N, then |X_n|^p is a sub-martingale w.r.t. F_n. More generally, if (X_n)_{n∈N} is a sub-martingale and φ is a convex increasing function, then (φ(X_n))_{n∈N} is a sub-martingale, provided the φ(X_n) are all integrable.

3.2  Martingale Decomposition

Let X ∈ L^1(Ω, F, P). A useful technique for bounding the variance of X, or for establishing concentration properties of X, is to perform a martingale decomposition. Namely, introduce a filtration F_0 := {∅, Ω} ⊂ F_1 ⊂ · · · ⊂ F_n = F and write

    X = E[X] + Σ_{i=1}^n (X_i − X_{i−1}),                (3.10)

where X_i = E[X|F_i]. Note that (X_i)_{1≤i≤n} is a martingale. If X ∈ L^2(Ω, F, P), then by the orthogonality of martingale increments, we have

    Var(X) = E[(X − E[X])²] = Σ_{i=1}^n E[(X_i − X_{i−1})²] = Σ_{i=1}^n E[Var(X_i|F_{i−1})].   (3.11)

The conditional variance Var(X_i|F_{i−1}) := E[X_i²|F_{i−1}] − E[X_i|F_{i−1}]² can often be bounded using coupling techniques.

As an illustration of the martingale decomposition, we prove a concentration of measure inequality for martingales with bounded increments.

Theorem 3.8 [Azuma-Hoeffding inequality] Let (X_i)_{1≤i≤n} be a martingale adapted to the filtration (F_i)_{1≤i≤n} on a probability space (Ω, F, P). Assume X_0 := E[X_1] = 0, and |X_i − X_{i−1}| ≤ K a.s. for all 1 ≤ i ≤ n. Then for all x ≥ 0,

    P(X_n/√n ≥ x) ≤ e^{−x²/(2K²)}.                       (3.12)

The same bound holds for P(X_n/√n ≤ −x). Note that this bound on the tail probabilities of X_n/√n is comparable to that of a Gaussian distribution.

Proof. Let D_i := X_i − X_{i−1} for 1 ≤ i ≤ n. By the exponential Markov inequality, for any λ > 0 and y ≥ 0,

    P(X_n ≥ y) = P(e^{λX_n} ≥ e^{λy}) ≤ e^{−λy} E[e^{λX_n}] = e^{−λy} E[ e^{λX_{n−1}} E[e^{λD_n}|F_{n−1}] ].   (3.13)

Note that by convexity, e^{λx} ≤ (e^{λK} + e^{−λK})/2 + x (e^{λK} − e^{−λK})/(2K) for all x ∈ [−K, K]. Since |D_n| ≤ K a.s. and E[D_n|F_{n−1}] = 0, we have

    E[e^{λD_n}|F_{n−1}] ≤ (e^{λK} + e^{−λK})/2 ≤ e^{λ²K²/2}.

Substituting this bound into (3.13) and successively conditioning on F_{n−2}, . . . , F_0 := {∅, Ω} then yields

    P(X_n ≥ y) ≤ e^{−λy + nλ²K²/2}.

Since λ > 0 is arbitrary, optimizing over λ > 0 then yields

    P(X_n ≥ y) ≤ e^{−sup_{λ>0} (λy − nλ²K²/2)} = e^{−y²/(2nK²)}.

Setting y = x√n then gives the desired bound. The bound for P(X_n/√n ≤ −x) is proved identically.

References

[1] R. Durrett, Probability: Theory and Examples, 2nd edition, Duxbury Press, Belmont, California, 1996.

[2] S.R.S. Varadhan, Probability Theory, Courant Lecture Notes 7, American Mathematical Society, Providence, Rhode Island, 2001.