EE7101: Introduction to Information and Coding Theory
Handout 1 – Probability Theory Review
Tay Wee Peng

This is a very brief review of probability theory. For more details, please consult EE7401 or MAS713. A good reference at the undergraduate level is "Introduction to Probability" by Dimitri Bertsekas and John Tsitsiklis. A good introduction to measure-theoretic probability is "Probability with Martingales" by David Williams.

1 Probability Space

A probability space is represented by a tuple (Ω, F, P), where Ω is the sample space and F is a σ-algebra (think of it as a collection of events, i.e., subsets of Ω) with the following properties:

(i) Ω ∈ F.
(ii) If A ∈ F, then A^c ∈ F.
(iii) If A1, A2, . . . ∈ F, then ∪_{i=1}^∞ A_i ∈ F.

Think of F as the "information structure" of Ω. For example, let Ω = [0, 1], and suppose we are interested in the probabilities of subsets of Ω that are intervals of the form [a, b], where 0 ≤ a < b ≤ 1, but not of individual points in Ω. Then we should also be able to say something about the probabilities of unions, intersections, complements, and so on, of such intervals. This is what the definition of a σ-algebra captures.

The probability measure P is a function P : F → [0, 1] defined on the measurable space (Ω, F), and it represents your "belief" about the events in F. In order for P to be called a probability measure, it must satisfy the following properties:

(i) P(Ω) = 1.
(ii) For A1, A2, . . . such that A_i ∩ A_j = ∅ for all i ≠ j, we have

    P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Some basic properties that can be derived from the above definition:

(a) P(A^c) = 1 − P(A).
(b) If A ⊂ B, then P(A) ≤ P(B).
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B). The last inequality is called the union bound. Similarly,

    P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

(How would you prove this countable version? Note that induction alone does not suffice here; why?)

The conditional probability of A given B is defined as

    P(A | B) = P(A ∩ B) / P(B),

provided P(B) ≠ 0. Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B), or equivalently, P(A | B) = P(A) or P(B | A) = P(B), whenever these conditional probabilities are defined. (It is this notion of independence that sets probability theory apart from general measure theory.)
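To make these definitions concrete, here is a minimal Python sketch (an illustration added to this review, not part of the original handout) of a finite probability space: Ω is the set of ordered outcomes of two fair dice, F can be taken to be all subsets of Ω, and P is the uniform measure. The events A, B, C are arbitrary choices used to check property (c), the union bound, conditional probability, and independence numerically.

    from fractions import Fraction
    from itertools import product

    # Finite probability space: Omega = ordered outcomes of two fair dice,
    # F = all subsets of Omega, P = uniform measure (exact arithmetic).
    omega = set(product(range(1, 7), repeat=2))

    def P(event):
        return Fraction(len(event & omega), len(omega))

    A = {w for w in omega if w[0] == 6}           # event: first die shows 6
    B = {w for w in omega if w[0] + w[1] >= 10}   # event: sum is at least 10

    # Property (c): inclusion-exclusion, hence the union bound.
    assert P(A | B) == P(A) + P(B) - P(A & B)
    assert P(A | B) <= P(A) + P(B)

    # Conditional probability: P(A | B) = P(A ∩ B) / P(B) = 1/2 here.
    print(P(A & B) / P(B))

    # Independence: with C = {second die is even}, P(A ∩ C) = P(A) P(C).
    C = {w for w in omega if w[1] % 2 == 0}
    assert P(A & C) == P(A) * P(C)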
2 Random Variables

A random variable X : Ω → 𝒳 is a mapping from the space (Ω, F) to a measurable space (𝒳, B). For example, for a real-valued random variable, the space that X takes values in is typically chosen to be (R, B), where B is the Borel σ-algebra (the smallest σ-algebra containing all open intervals in R). In order for X to make any sense, the mapping has to ensure that {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B, because we are restricted to the information structure imposed by F. Such a mapping is called a measurable function. A random variable is thus more formally defined as a measurable mapping from (Ω, F) to (𝒳, B).

The probability measure P induces a probability measure on (R, B), given by

    P_X(B) = P({ω ∈ Ω : X(ω) ∈ B}) for all B ∈ B.

We will often write P_X(B) as P(X ∈ B); this induced measure is called the distribution of X.

If 𝒳 = {x1, . . . , xn}, then we say that X is a discrete random variable. The distribution of X is then also known as the probability mass function (pmf) of X, and it is fully specified by (p_X(x1), . . . , p_X(xn)), where p_X(·) = P(X = ·).

For a real-valued random variable X, if there exists a function f_X : R → [0, ∞) such that for all A ∈ B we have

    P(X ∈ A) = ∫_A f_X(x) dx,

then we say that X is a continuous random variable. The function f_X is called the probability density function (pdf). In this class we deal mainly with discrete random variables, although we will also encounter Gaussian random variables, which are continuous, later on.

Some additional notation and definitions for discrete random variables are given below; the counterparts for continuous random variables can be obtained by simply replacing pmfs with pdfs. Assume that X and Y are discrete random variables.

(a) Joint pmf: p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω : X(ω) = x, Y(ω) = y}).
(b) Conditional pmf: p_{X|Y}(x | y) = p_{X,Y}(x, y)/p_Y(y) for p_Y(y) ≠ 0. Bayes' rule:

    p_{X|Y}(x | y) = p_{Y|X}(y | x) p_X(x) / Σ_{x'} p_{Y|X}(y | x') p_X(x').

(c) We write X ⊥⊥ Y if X is independent of Y, i.e., p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y.

Example 1. Binary symmetric channel. A bit X ∼ Bern(p) sent over the channel is corrupted by additive noise Z ∼ Bern(ε), with X ⊥⊥ Z. The output of the channel is Y = X ⊕ Z, where ⊕ denotes modulo-2 addition. Then

    p_{Y|X}(y | x) = P(X ⊕ Z = y | X = x) = P(Z = y ⊕ x | X = x) = P(Z = y ⊕ x) = p_Z(y ⊕ x).

    x   y   p_{Y|X}(y | x)
    0   0   1 − ε
    1   0   ε
    0   1   ε
    1   1   1 − ε

    Table 1: Conditional probabilities for the binary symmetric channel.
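The Python sketch below (again an illustration, not from the handout) simulates the binary symmetric channel of Example 1 and estimates p_{Y|X} empirically; as the number of trials grows, the estimates should approach the entries of Table 1. It also computes the exact posterior p_{X|Y} via Bayes' rule from item (b). The parameter values p = 0.3 and ε = 0.1 are arbitrary choices.

    import random

    p, eps, n = 0.3, 0.1, 100_000   # arbitrary Bern(p) source, Bern(eps) noise
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for _ in range(n):
        x = 1 if random.random() < p else 0     # X ~ Bern(p)
        z = 1 if random.random() < eps else 0   # Z ~ Bern(eps), independent of X
        y = x ^ z                               # Y = X xor Z
        counts[(x, y)] += 1

    # Empirical p_{Y|X}(y | x); compare with Table 1.
    for x in (0, 1):
        nx = counts[(x, 0)] + counts[(x, 1)]
        for y in (0, 1):
            print(f"p(Y={y} | X={x}) ~ {counts[(x, y)] / nx:.3f}")

    # Exact posterior via Bayes' rule: p_{X|Y}(x | y).
    px = {0: 1 - p, 1: p}
    pyx = {(y, x): (1 - eps) if x == y else eps for x in (0, 1) for y in (0, 1)}
    for y in (0, 1):
        denom = sum(pyx[(y, xp)] * px[xp] for xp in (0, 1))
        for x in (0, 1):
            print(f"p(X={x} | Y={y}) = {pyx[(y, x)] * px[x] / denom:.3f}")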
3 Expectation and Variance

The expectation of a random variable X is defined to be

    E[X] = ∫_Ω X(ω) dP(ω).

This definition has a very precise mathematical meaning, which unfortunately is out of the scope of this review. Roughly speaking, we think of dP(ω) as a weight imposed on X(ω) for each value of ω ∈ Ω, and we take the weighted sum of X(ω). If X is a discrete random variable, this reduces to the familiar formula

    E[X] = Σ_{x∈𝒳} x p_X(x).

If X is a continuous random variable, we have "dP(ω) = f_X(X(ω)) dx", and

    E[X] = ∫_R x f_X(x) dx.

Note that the expectation is a statistical summary of the distribution of X, rather than a quantity that depends on the realized value of X. Perhaps it would be more fitting to write it as E[p_X], but for legacy reasons we use the notation E[X] instead.

For a function g, g(X) is again a random variable, so we can obtain the expectation of g(X) in the same way. It can be shown that

    E[g(X)] = ∫_R g(x) f_X(x) dx.

In particular, if g(X) = aX + b is a linear function of X, then E[g(X)] = aE[X] + b = g(E[X]). The variance of X is simply the expectation of g(X) = (X − E[X])².

The conditional expectation E[X | Y] can be defined with respect to p_{X|Y}, but take note that it is actually a random variable, as it depends on the value of Y. To get some intuition, suppose that Y ∈ {y1, . . . , yn}. Then E[X | Y] is a random variable taking the values E[X | Y = yi], i = 1, . . . , n, with corresponding probabilities P(Y = yi). We see that Y "partitions" the sample space Ω into different regions, and E[X | Y = yi] is the normalized expectation of X on the region corresponding to Y = yi. (For general random variables, E[X | Y] should be viewed as E[X | σ(Y)], where the conditioning is on the σ-algebra σ(Y) generated by Y.) We have

    E[X] = Σ_i E[X | Y = yi] P(Y = yi) = E[E[X | Y]].

In some sense, E[X | Y] is the "best" estimator of X you can have given knowledge of Y. Indeed, recall from your statistics classes that under the squared-error loss criterion, E[X | Y] is the optimal estimator of X, i.e., it minimizes E[(X − X̂(Y))²] over all functions X̂(Y).

4 Jensen's Inequality

A function f(x) is said to be convex if for all x, y and λ ∈ [0, 1], we have

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).    (1)

The function f is strictly convex if equality in (1) holds iff λ = 0 or 1, or x = y.

Proposition 1. f(x) = sup_{l∈L} l(x), where L = {l : l(u) = au + b ≤ f(u) for all u, for some a, b}.

Proof. It suffices to show that for any given x, we can find a linear function l(·) such that l(x) = f(x) and l(u) ≤ f(u) for all u. For any h > 0, convexity of f (take λ = 1/2 in (1)) implies that 2f(x) ≤ f(x + h) + f(x − h), i.e.,

    (f(x) − f(x − h))/h ≤ (f(x + h) − f(x))/h,    (2)

so that by letting h ↓ 0 (both limits exist because, by convexity, the difference quotients are monotone in h), we have

    lim_{h↓0} (f(x) − f(x − h))/h ≤ lim_{h↓0} (f(x + h) − f(x))/h,

which means that the left derivative of f at x is no greater than the right derivative. We can then choose a constant a between these two limits and let l(u) = a(u − x) + f(x). We claim that this linear function is the one we are looking for. Indeed, note that l(x) = f(x), and for any u > x, letting h = u − x, we have

    l(u) = a(u − x) + f(x)
         ≤ ((f(x + h) − f(x))/h)(u − x) + f(x)
         = f(x + h)
         = f(u),

where the inequality follows from (2) and our choice of a. A similar argument holds for u < x, and the proof is complete.

Proposition 2 (Jensen's inequality). If f is convex, E[|X|] < ∞, and E[|f(X)|] < ∞, then

    E[f(X)] ≥ f(E[X]).

Furthermore, if f is strictly convex, then equality holds iff X is a constant.

Proof.

    E[f(X)] = E[sup_{l∈L} l(X)]    (from Proposition 1)
            ≥ sup_{l∈L} E[l(X)]    (can you see why?)
            = sup_{l∈L} l(E[X])    (because l is linear)
            = f(E[X])              (from Proposition 1 again).
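As a quick numerical sanity check of Proposition 2 (my addition, not the handout's): with the strictly convex function f(x) = x² and an arbitrary discrete pmf, E[f(X)] should dominate f(E[X]), and the gap E[X²] − (E[X])² is exactly the variance of X.

    # Jensen's inequality for f(x) = x^2 on an arbitrary discrete pmf.
    xs = [-1.0, 0.0, 2.0, 5.0]
    pmf = [0.2, 0.3, 0.4, 0.1]
    assert abs(sum(pmf) - 1.0) < 1e-12

    EX = sum(p * x for p, x in zip(pmf, xs))
    EfX = sum(p * x**2 for p, x in zip(pmf, xs))

    print(EfX, EX**2)      # E[f(X)] >= f(E[X]): 4.3 vs 1.21 here
    assert EfX >= EX**2
    print(EfX - EX**2)     # the gap is Var(X) >= 0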
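Finally, returning to Section 3, the sketch below (again an illustration under an arbitrarily chosen joint pmf) checks the tower property E[E[X | Y]] = E[X] and compares the mean squared error of the estimator X̂(Y) = E[X | Y] against two competing functions of Y; the conditional expectation attains the smallest error.

    # Arbitrary joint pmf p_{X,Y}(x, y) on {0,1,2} x {0,1}.
    pxy = {(0, 0): 0.15, (0, 1): 0.05,
           (1, 0): 0.10, (1, 1): 0.30,
           (2, 0): 0.05, (2, 1): 0.35}
    xs, ys = (0, 1, 2), (0, 1)

    py = {y: sum(pxy[(x, y)] for x in xs) for y in ys}
    cond_exp = {y: sum(x * pxy[(x, y)] for x in xs) / py[y] for y in ys}

    EX = sum(x * pxy[(x, y)] for x in xs for y in ys)
    tower = sum(cond_exp[y] * py[y] for y in ys)
    assert abs(EX - tower) < 1e-12          # E[E[X | Y]] = E[X]

    def mse(est):                           # E[(X - est(Y))^2]
        return sum((x - est(y)) ** 2 * pxy[(x, y)] for x in xs for y in ys)

    # E[X | Y] should beat any other function of Y under squared error.
    print(mse(lambda y: cond_exp[y]))       # optimal estimator
    print(mse(lambda y: EX))                # constant estimator
    print(mse(lambda y: float(2 * y)))      # an arbitrary competitor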