EE5139R Lecture 1: Review of Necessary Mathematical Preliminaries for EE5139R

Vincent Y. F. Tan

August 4, 2016

This is a very brief review of probability theory adapted from the notes of EE7101 (by Prof. Wee Peng Tay of NTU). For more details, please consult the undergraduate-level text "Introduction to Probability" by Dimitri Bertsekas and John Tsitsiklis [1].

1 Probability Space

A probability space is represented by a tuple (Ω, F, P), where Ω is the sample space and F is a σ-algebra (think of it as a collection of events, i.e., legitimate subsets of Ω) with the following properties:

• Ω ∈ F
• If A ∈ F, then A^c = Ω \ A ∈ F
• If A_1, A_2, …, A_n, … ∈ F, then ∪_{i=1}^∞ A_i ∈ F

Think of F as the "information structure" of Ω. For example, let Ω = [0, 1], and suppose we are interested in the probability of subsets of Ω that are intervals of the form [a, b] with 0 ≤ a < b ≤ 1, but not of individual points in Ω. Then we should also be able to say something about the probability of unions, intersections, complements, and so on, of such intervals. This is captured by the definition of a σ-algebra.

The probability measure P is a function P : F → [0, 1] defined on the measurable space (Ω, F), and represents your "belief" about the events in F. In order for P to be called a probability measure, it must satisfy the following properties:

• P(Ω) = 1
• For A_1, A_2, … such that A_i ∩ A_j = ∅ for all i ≠ j, we have

  P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Some basic properties that can be derived from the above definition:

• P(A^c) = 1 − P(A)
• If A ⊂ B, then P(A) ≤ P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B); the latter inequality is called the union bound.

Clearly, by induction, the union bound works for finitely many sets A_i, i = 1, …, k. It is also true that

  P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

This is not straightforward. Can you prove it using only the axioms above?

The conditional probability of A given B is defined as

  P(A|B) = P(A ∩ B) / P(B),   if P(B) ≠ 0.
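To make these definitions concrete, here is a minimal sketch on a toy finite space, a fair six-sided die with F taken to be all subsets; the events `A` and `B` and the function name `P` are illustrative choices, not part of the notes.

```python
from fractions import Fraction

# Toy finite probability space: a fair six-sided die.
# Omega = {1,...,6}, F = all subsets, P(A) = |A| / |Omega|.
omega = set(range(1, 7))

def P(event):
    """Uniform probability measure on subsets of omega."""
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3}   # "roll at most 3"      (illustrative event)
B = {2, 4, 6}   # "roll an even number" (illustrative event)

assert P(omega) == 1                           # P(Omega) = 1
assert P(omega - A) == 1 - P(A)                # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)      # inclusion-exclusion (A | B is set union)
assert P(A | B) <= P(A) + P(B)                 # union bound
assert P(A & B) / P(B) == Fraction(1, 3)       # conditional prob. P(A given B)
```

Using `Fraction` keeps every probability exact, so the axioms can be checked with equality rather than floating-point tolerances.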
It is possible to define conditional probability even when P(B) = 0, but we will not need this for this course.

Two events A, B ∈ F are said to be independent if

  P(A ∩ B) = P(A)P(B).

In other words, P(A|B) = P(A), or P(B|A) = P(B).

2 Random Variables

A random variable X : Ω → 𝒳 is a mapping (function) from the space (Ω, F) to a measurable space (𝒳, B). For example, for a real-valued random variable, the space in which X takes values is typically chosen to be (ℝ, B), where B is the Borel σ-algebra.¹ In order for X to make sense, the mapping has to ensure that {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B, because we are restricted to the information structure imposed by F. Such a mapping is called a measurable function. A random variable is then more formally defined as a measurable mapping from (Ω, F) to (𝒳, B).

The probability measure P induces a probability measure P_X on (ℝ, B), given by

  P_X(B) = P({ω ∈ Ω : X(ω) ∈ B})   for all B ∈ B.

We often write P_X(B) as Pr(X ∈ B), and P_X is called the distribution of the random variable X. If 𝒳 = {a_1, …, a_d}, then we say that X is a discrete random variable. The distribution of X is then also known as the probability mass function (pmf) of X and is fully defined by (P_X(a_1), …, P_X(a_d)), where P_X(a) = Pr(X = a).

For a real-valued random variable X, if there exists a function f_X : ℝ → [0, ∞) such that for all A ∈ B we have

  Pr(X ∈ A) = ∫_A f_X(x) dx,

then we say that X is a continuous random variable. The function f_X is called the probability density function (pdf).

In this class, we deal mainly with discrete random variables, although we will also encounter Gaussian random variables, which are continuous, later on. Some additional notation and definitions for discrete random variables are given below; the counterparts for continuous random variables can be obtained by simply replacing pmfs with pdfs. Assume that X and Y are discrete random variables taking values in 𝒳 and 𝒴 respectively.
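Before introducing joint and conditional pmfs, here is a minimal sketch of a pmf and the induced distribution P_X it defines; the alphabet and masses below are illustrative, not from the notes.

```python
# Illustrative pmf of a discrete random variable X on the alphabet {0, 1, 2}.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

def P_X(B):
    """Induced distribution: P_X(B) = Pr(X in B) = sum of the pmf over B."""
    return sum(p for x, p in pmf.items() if x in B)

assert abs(sum(pmf.values()) - 1.0) < 1e-12   # masses sum to one
assert abs(P_X({1, 2}) - 0.5) < 1e-12         # Pr(X >= 1)
assert P_X(set()) == 0                        # the empty event has probability zero
```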
• Joint pmf: P_{X,Y}(x, y) = Pr(X = x, Y = y)
• Conditional pmf: P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y), for P_Y(y) > 0
• Bayes' rule:

  P_{X|Y}(x|y) = P_{Y|X}(y|x) P_X(x) / Σ_{x'} P_{Y|X}(y|x') P_X(x')

¹ The smallest σ-algebra containing all open intervals in ℝ.

If X and Y are independent random variables, then

  P_{X,Y}(x, y) = P_X(x) P_Y(y).

Furthermore, for any set A ⊂ 𝒳, we have two different ways of denoting the probability that X belongs to A, namely

  P_X(A) := Pr(X ∈ A) = Σ_{x∈A} P_X(x),

and similarly for the conditional distribution and the joint distribution. Note that it is incorrect to write Pr(A) for a set A ⊂ 𝒳. What one can do is to define an event 𝒜 := {X ∈ A} and then write Pr(𝒜) = Pr(X ∈ A).

3 Expectation and Variance

The expectation of a random variable X is defined to be

  E[X] = ∫_Ω X(ω) dP(ω).

This definition has a very precise mathematical meaning, which unfortunately is out of the scope of this review. Roughly speaking, we think of P(ω) as a "weight" imposed on X(ω) for each value of ω ∈ Ω, and we take the weighted sum of X(ω). If X is a discrete random variable with alphabet 𝒳 = {a_1, …, a_d}, this reduces to the familiar formula

  E[X] = Σ_{x∈𝒳} x P_X(x).

If X is a continuous random variable with pdf f_X(x), we have

  E[X] = ∫_ℝ x f_X(x) dx.

Note that the expectation is a statistical summary of the distribution of X, rather than a quantity depending on the realized value of X. Perhaps it would be more fitting to write it as E[P_X], but for legacy reasons we use the notation E[X] instead.

If g is a function, we can obtain the expectation of g(X) in the same way. It can be shown that

  E[g(X)] = ∫_ℝ g(x) f_X(x) dx.

In particular, if g(X) = aX + b, then E[g(X)] = aE[X] + b = g(E[X]).

The variance of X is the expectation of g(X) = (X − E[X])². Thus,

  Var(X) = E[(X − E[X])²] = ∫_ℝ (x − E[X])² f_X(x) dx.

Check from the above definition that the variance can also be expressed as

  Var(X) = E[X²] − E[X]².
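A quick numerical check of the discrete expectation formula, linearity, and the identity Var(X) = E[X²] − E[X]²; the pmf values below are illustrative.

```python
# Illustrative pmf; for discrete X, E[g(X)] = sum_x g(x) P_X(x).
pmf = {-1.0: 0.25, 0.0: 0.25, 2.0: 0.5}

def E(g):
    """Expectation of g(X) under the pmf above."""
    return sum(g(x) * p for x, p in pmf.items())

mean = E(lambda x: x)
var_def = E(lambda x: (x - mean) ** 2)        # Var(X) = E[(X - E[X])^2]
var_alt = E(lambda x: x * x) - mean ** 2      # Var(X) = E[X^2] - E[X]^2
assert abs(var_def - var_alt) < 1e-12

# Linearity: E[aX + b] = a E[X] + b
a, b = 3.0, -2.0
assert abs(E(lambda x: a * x + b) - (a * mean + b)) < 1e-12
```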
The conditional expectation E[X|Y] can be defined with respect to P_{X|Y}, but note that it is actually a random variable, as it depends on the value of Y (a random variable). To get some intuition, suppose that 𝒴 = {b_1, …, b_m}. Then E[X|Y] is a random variable taking the value E[X|Y = b_i], for i = 1, …, m, with corresponding probability Pr(Y = b_i). We see that Y "partitions" the sample space Ω into different regions, and E[X|Y = b_i] is the normalized expectation of X on the region corresponding to Y = b_i. We have

  E[X] = Σ_{i=1}^m E[X|Y = b_i] Pr(Y = b_i) = E_Y[E_X[X|Y]].

In some sense, E[X|Y] is the "best" estimator of X you can have, given knowledge of Y.

Example 1 (Binary symmetric channel). X ∼ Bern(p) is sent over a channel and corrupted by additive noise Z ∼ Bern(ε), with X ⊥⊥ Z. The output of the channel is Y = X ⊕ Z. Then

  P_{Y|X}(y|x) = Pr(X ⊕ Z = y | X = x) = Pr(Z = y ⊕ x | X = x) = Pr(Z = y ⊕ x) = P_Z(y ⊕ x).

  P_{Y|X}(y|x) | y = 0 | y = 1
  x = 0        | 1 − ε | ε
  x = 1        | ε     | 1 − ε

Table 1: Conditional probabilities for the binary symmetric channel.

4 Independence and Markov Chains

Recall that two random variables X and Z with joint distribution P_{XZ} are said to be independent if

  P_{XZ}(x, z) = P_X(x) P_Z(z)   for all (x, z) ∈ 𝒳 × 𝒵.

Another way of saying this is that the conditional distribution P_{X|Z}(x|z) does not depend on z, i.e.,

  P_{X|Z}(x|z) = P_X(x)   for all (x, z) ∈ 𝒳 × 𝒵.

The notion of a Markov chain is very similar to independence, but it involves three random variables instead of two. Start with three random variables X, Y and Z. They are said to form a Markov chain in the order X − Y − Z if their joint distribution P_{XYZ} satisfies

  P_{XYZ}(x, y, z) = P_X(x) P_{Y|X}(y|x) P_{Z|Y}(z|y)   for all (x, y, z) ∈ 𝒳 × 𝒴 × 𝒵.

This is the same as saying that X and Z are conditionally independent given Y.
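The equivalence just claimed can be checked numerically on a small example: build a joint pmf with the Markov factorization and verify that X and Z are conditionally independent given Y. All kernel values below are arbitrary illustrative numbers.

```python
# Sketch: build a joint pmf with the Markov factorization
# P_{XYZ}(x,y,z) = P_X(x) P_{Y|X}(y|x) P_{Z|Y}(z|y) and verify
# P_{XZ|Y}(x,z|y) = P_{X|Y}(x|y) P_{Z|Y}(z|y) for every (x, y, z).
P_X   = {0: 0.5, 1: 0.5}
P_Y_X = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}  # key: (y, x)
P_Z_Y = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key: (z, y)

P_XYZ = {(x, y, z): P_X[x] * P_Y_X[(y, x)] * P_Z_Y[(z, y)]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

for y in (0, 1):
    P_y = sum(P_XYZ[(x, y, z)] for x in (0, 1) for z in (0, 1))  # P_Y(y)
    for x in (0, 1):
        for z in (0, 1):
            lhs = P_XYZ[(x, y, z)] / P_y                               # P_{XZ|Y}
            P_x_y = sum(P_XYZ[(x, y, zz)] for zz in (0, 1)) / P_y      # P_{X|Y}
            P_z_y = sum(P_XYZ[(xx, y, z)] for xx in (0, 1)) / P_y      # P_{Z|Y}
            assert abs(lhs - P_x_y * P_z_y) < 1e-12
```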
Notice that if we do not assume anything about the joint distribution P_{XYZ}, then it factorizes (by repeated applications of Bayes' rule) as

  P_{XYZ}(x, y, z) = P_X(x) P_{Y|X}(y|x) P_{Z|XY}(z|x, y)   for all (x, y, z) ∈ 𝒳 × 𝒴 × 𝒵,

so what Markovity in the order X − Y − Z buys us is that P_{Z|XY}(z|x, y) = P_{Z|Y}(z|y), i.e., we can drop the conditioning on X. In essence, all the information we can learn about Z is already contained in Y; no further information about Z can be gleaned from knowing X if we already know Y. Another way of saying this is that the conditional distribution of X and Z given Y = y factorizes as

  P_{XZ|Y}(x, z|y) = P_{X|Y}(x|y) P_{Z|Y}(z|y)   for all (x, y, z) ∈ 𝒳 × 𝒴 × 𝒵.

Notice that this is in direct analogy to the situation where X and Z are (marginally) independent: simply set Y to be a deterministic random variable to recover the definition of independence.

Some exercises:

1. Assume X − Y − Z. Show that it is also true that Z − Y − X.
2. If Z is a deterministic function of Y, show that X − Y − Z holds.
3. If X and Z are conditionally independent given Y, this does not imply that X and Z are marginally independent (in general). Construct a counterexample.

5 Probability Bounds

In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a random variable X, we let E[X] and Var(X) be its expectation and variance respectively. To emphasize that the expectation is taken with respect to a random variable X with distribution P, we sometimes make this explicit by using a subscript, i.e., E_X or E_P.

5.1 Basic Bounds

We start with the familiar Markov and Chebyshev inequalities.

Proposition 1 (Markov's inequality). Let X be a real-valued non-negative random variable. Then for any a > 0, we have

  Pr(X > a) ≤ E[X] / a.

Proof. By the definition of the expectation, we have

  E[X] = ∫_0^∞ x f_X(x) dx ≥ ∫_a^∞ x f_X(x) dx ≥ a ∫_a^∞ f_X(x) dx = a Pr(X > a),

and we are done. In which step is the non-negativity of X used?
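Markov's inequality can be sanity-checked empirically; the sketch below uses X = U² with U uniform on [0, 2], an illustrative non-negative random variable with E[X] = 4/3, and checks the bound against the sample mean.

```python
import random

random.seed(0)

# Empirical sanity check of Markov's inequality Pr(X > a) <= E[X] / a
# for a non-negative random variable.  Here X = U^2 with U ~ Uniform[0, 2]
# (an illustrative choice; the true mean is E[X] = 4/3).
samples = [random.uniform(0.0, 2.0) ** 2 for _ in range(100_000)]
mean = sum(samples) / len(samples)

for a in (0.5, 1.0, 2.0, 3.0):
    frac = sum(s > a for s in samples) / len(samples)   # empirical Pr(X > a)
    assert frac <= mean / a    # Markov's bound holds (loosely) on this sample
```

As the proof suggests, the bound is far from tight: for a = 1 it gives Pr(X > 1) ≤ 4/3, which is vacuous, while the true probability is about 0.5.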
If we let X above be the non-negative random variable (X − E[X])², we obtain Chebyshev's inequality.

Proposition 2 (Chebyshev's inequality). Let X be a real-valued random variable with mean µ and variance σ². Then for any a > 0, we have

  Pr(|X − µ| > aσ) ≤ 1/a².   (1)

Proof. Let X in Markov's inequality be the random variable g(X) := (X − E[X])². This is clearly non-negative, and the expectation of g(X) is Var(X) = σ². Thus, by Markov's inequality, we have

  Pr(g(X) > a²σ²) ≤ σ² / (a²σ²) = 1/a².

Now, g(X) > a²σ² if and only if |X − µ| > aσ, so the claim is proved.

We now consider a collection of real-valued random variables that are independent and identically distributed (i.i.d.). In particular, let Xⁿ = (X_1, …, X_n) be a collection of independent random variables where each X_i has distribution P with zero mean and finite variance σ².

Proposition 3 (Weak Law of Large Numbers). For every ε > 0, we have

  lim_{n→∞} Pr(|(1/n) Σ_{i=1}^n X_i| > ε) = 0.

Consequently, the average (1/n) Σ_{i=1}^n X_i converges to 0 in probability.

Note that for a sequence of random variables {S_n}_{n=1}^∞, we say that the sequence converges to a number b ∈ ℝ in probability if for all ε > 0,

  lim_{n→∞} Pr(|S_n − b| > ε) = 0.

We also write this as S_n →^p b. Contrast this with convergence of numbers: we say that a sequence of numbers {s_n}_{n=1}^∞ converges to a number b ∈ ℝ if for all ε > 0, we have lim_{n→∞} |s_n − b| = 0.

Proof. Let (1/n) Σ_{i=1}^n X_i take the role of X in Chebyshev's inequality. Clearly, its mean is zero. Its variance is

  Var((1/n) Σ_{i=1}^n X_i) = (1/n²) Var(Σ_{i=1}^n X_i) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n.

Thus, we have

  Pr(|(1/n) Σ_{i=1}^n X_i| > ε) ≤ σ²/(nε²) → 0,

which proves the claim.

In fact, under mild conditions, the convergence to zero occurs exponentially fast; see, for example, Cramér's theorem in [2, Thm. 2.2.3]. There is also a theorem known as the "strong law of large numbers", but for the purposes of this course we will not need it.
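The Chebyshev bound σ²/(nε²) from the proof above can be observed directly by simulation; the Uniform(−1, 1) distribution (zero mean, σ² = 1/3) and the constants below are illustrative choices.

```python
import random

random.seed(1)

# Empirical sketch of the weak law of large numbers via Chebyshev:
# for i.i.d. Uniform(-1, 1) samples (zero mean, sigma^2 = 1/3),
# Pr(|(1/n) sum X_i| > eps) <= sigma^2 / (n * eps^2).
def sample_mean(n):
    return sum(random.uniform(-1.0, 1.0) for _ in range(n)) / n

eps, sigma2, trials = 0.1, 1.0 / 3.0, 1000
hits = {}
for n in (50, 500):
    hits[n] = sum(abs(sample_mean(n)) > eps for _ in range(trials)) / trials
    # Chebyshev bound, with a little slack for sampling noise:
    assert hits[n] <= sigma2 / (n * eps * eps) + 0.05

assert hits[500] <= hits[50]   # concentration improves as n grows
```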
5.2 Central Limit-Type Bounds

In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian as

  N(x; µ, σ²) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.   (2)

We will also denote this as N(µ, σ²) if the argument is unnecessary. For the univariate case, the cumulative distribution function (cdf) of a standard Gaussian is denoted as

  Φ(y) := ∫_{−∞}^y N(x; 0, 1) dx.   (3)

If the scaling in front of the sum in the statement of the law of large numbers (Proposition 3) is 1/√n instead of 1/n, the resultant random variable (1/√n) Σ_{i=1}^n X_i converges in distribution to a Gaussian random variable. As in Proposition 3, let Xⁿ be a collection of i.i.d. random variables where each X_i is zero-mean with finite variance σ².

Proposition 4 (Central Limit Theorem). For any a ∈ ℝ, we have

  lim_{n→∞} Pr((1/(σ√n)) Σ_{i=1}^n X_i < a) = Φ(a).   (4)

In other words,

  (1/(σ√n)) Σ_{i=1}^n X_i →^d Z,   (5)

where →^d means convergence in distribution and Z is a standard Gaussian random variable.

For a sequence of random variables {S_n}_{n=1}^∞, we say that the sequence converges in distribution to another random variable S̄ if

  lim_{n→∞} Pr(S_n < a) = Pr(S̄ < a)

for all a ∈ ℝ. So the distribution functions converge at all points a. (Strictly speaking, convergence is only required at the continuity points of the limiting cdf, but the statement above is enough for our purposes in this course.)

6 Jensen's Inequality and Convexity

A function f(x) is said to be convex if for all x, y and λ ∈ [0, 1], we have

  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).   (6)

The function f is strictly convex if equality in (6) holds iff λ = 0 or 1, or x = y.

Proposition 5. f(x) = sup_{l∈L} l(x), where L = {l : l(u) = au + b ≤ f(u) for all u, for some a, b}.

Proof. It suffices to show that for any given x, we can find a linear function l(·) such that l(x) = f(x) and l(u) ≤ f(u) for all u.
For any h > 0, convexity of f implies that 2f(x) ≤ f(x + h) + f(x − h), i.e.,

  (f(x) − f(x − h))/h ≤ (f(x + h) − f(x))/h,   (7)

so that by letting h ↓ 0, we have

  lim_{h↓0} (f(x) − f(x − h))/h ≤ lim_{h↓0} (f(x + h) − f(x))/h,

which means that the left derivative of f at x is no greater than the right derivative. We can then choose a constant a between these two limits and let l(u) = a(u − x) + f(x). We claim that this linear function is the one we are looking for. Indeed, note that l(x) = f(x), and for any u > x, letting h = u − x, we have

  l(u) = a(u − x) + f(x) ≤ ((f(x + h) − f(x))/h)(u − x) + f(x) = f(x + h) = f(u),

where the inequality follows from (7) and our choice of a. A similar argument holds for u < x, and the proof is complete.

Often it is hard to check convexity directly, but for twice-differentiable functions this is easy.

Proposition 6. Let f : [a, b] → ℝ be twice differentiable. Then f is convex if and only if f″(x) ≥ 0 for all x ∈ (a, b).

Proof. Assume f″(x) ≥ 0 for all x ∈ (a, b). By a Taylor expansion of f around x_0 ∈ (a, b), we have

  f(x) = f(x_0) + f′(x_0)(x − x_0) + (f″(x*)/2)(x − x_0)²,

where x* lies between x_0 and x. By assumption f″(x*) ≥ 0, so the quadratic term is non-negative. Now let x_0 = λx_1 + (1 − λ)x_2. First let x = x_1 (so that x_1 − x_0 = (1 − λ)(x_1 − x_2)); then we have

  f(x_1) ≥ f(x_0) + f′(x_0)(1 − λ)(x_1 − x_2).

Next let x = x_2; then we have

  f(x_2) ≥ f(x_0) + f′(x_0)λ(x_2 − x_1).

Multiplying the first inequality by λ and the second by 1 − λ and adding them up, we recover the definition of convexity.

In the other direction, let f be convex and twice differentiable on [a, b]. Choose a < x_1 < x_2 < x_3 < x_4 < b. By the definition of convexity (check!),

  (f(x_2) − f(x_1))/(x_2 − x_1) ≤ (f(x_4) − f(x_3))/(x_4 − x_3).

Now let x_2 ↓ x_1 and x_3 ↑ x_4. We see that f′(x_1) ≤ f′(x_4), and since these were arbitrary points, f′ is increasing on (a, b). Hence f″(x) ≥ 0 for all x ∈ (a, b).

Proposition 7 (Jensen's Inequality). If f is convex, E[|X|] < ∞, and E[|f(X)|] < ∞, then

  E[f(X)] ≥ f(E[X]).
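Jensen's inequality is easy to check numerically; the sketch below uses the strictly convex function f(x) = x² and an illustrative three-point pmf. For this particular f, the Jensen gap E[f(X)] − f(E[X]) is exactly Var(X), which ties back to Section 3.

```python
# Numerical check of Jensen's inequality E[f(X)] >= f(E[X]) for the
# strictly convex function f(x) = x^2 and an illustrative pmf.
pmf = {-1.0: 0.2, 0.5: 0.5, 3.0: 0.3}

def f(x):
    return x * x

EX = sum(x * p for x, p in pmf.items())        # E[X]    = 0.95
EfX = sum(f(x) * p for x, p in pmf.items())    # E[f(X)] = 3.025

assert EfX >= f(EX)       # Jensen's inequality
assert EfX - f(EX) > 0    # strict inequality: X is not a constant
                          # (here the gap equals Var(X))
```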
Furthermore, if f is strictly convex, then equality holds iff X is a constant.

Proof. Using Proposition 5,

  E[f(X)] = E[sup_{l∈L} l(X)]
          ≥ sup_{l∈L} E[l(X)]   (can you see why?)
          = sup_{l∈L} l(E[X])   (because l is linear)
          = f(E[X]).

For a simpler proof of Jensen's inequality for discrete distributions, we may use induction.

Proof. By convexity, we have p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2), which is the statement for two mass points. Suppose the statement E[f(X)] ≥ f(E[X]) is true for discrete distributions with k − 1 ≥ 2 mass points, and consider a distribution with k mass points {p_1, p_2, …, p_k} (with p_k < 1). Define

  p′_i := p_i / (1 − p_k),   i = 1, …, k − 1.

We then have

  Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p′_i f(x_i)
                       ≥ p_k f(x_k) + (1 − p_k) f(Σ_{i=1}^{k−1} p′_i x_i)
                       ≥ f(p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p′_i x_i)
                       = f(Σ_{i=1}^k p_i x_i),

where the first inequality is from the induction hypothesis and the second from convexity (applied to two points). By the definition of expectation, we conclude that E[f(X)] ≥ f(E[X]).

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.

[2] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, 2nd edition, 1998.