Harvard SEAS ES250 – Information Theory
Entropy, relative entropy, and mutual information
(Based on Cover & Thomas, Chapter 2)

1 Entropy

1.1 Entropy of a random variable

Definition The entropy of a discrete random variable $X$ with pmf $p_X(x)$ is
$$H(X) = -\sum_x p(x) \log p(x).$$
The entropy measures the expected uncertainty in $X$. It has the following properties:
• $H(X) \geq 0$: entropy is always non-negative, and $H(X) = 0$ iff $X$ is deterministic.
• Since $H_b(X) = \log_b(a)\, H_a(X)$, we don't need to specify the base of the logarithm.

1.2 Joint entropy and conditional entropy

Definition The joint entropy of two random variables $X$ and $Y$ is
$$H(X, Y) \triangleq -E_{p(x,y)}[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y).$$

Definition Given a random variable $X$, the conditional entropy of $Y$ (averaged over $X$) is
$$H(Y|X) \triangleq E_{p(x)}[H(Y|X = x)] = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = -E_{p(x)} E_{p(y|x)}[\log p(Y|X)] = -E_{p(x,y)}[\log p(Y|X)].$$

Note: $H(X|Y) \neq H(Y|X)$ in general.

1.3 Chain rule

Joint and conditional entropy provide a natural calculus:

Theorem (Chain rule) $H(X, Y) = H(X) + H(Y|X)$.

Corollary $H(X, Y|Z) = H(X|Z) + H(Y|X, Z)$.

2 Relative Entropy and Mutual Information

2.1 Entropy and Mutual Information

• Entropy $H(X)$ is the uncertainty ("self-information") of a single random variable.
• Conditional entropy $H(X|Y)$ is the entropy of one random variable conditional upon knowledge of another.
• We call the reduction in uncertainty the mutual information:
$$I(X; Y) = H(X) - H(X|Y).$$
• Eventually we will show that the maximum rate of transmission over a given channel $p(y|x)$, such that the error probability goes to zero, is given by the channel capacity
$$C = \max_{p(x)} I(X; Y).$$

Theorem (Relationship between mutual information and entropy)
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y),$$
$$I(X; Y) = I(Y; X) \quad \text{(symmetry)}, \qquad I(X; X) = H(X) \quad \text{("self-information")}.$$

2.2 Relative Entropy and Mutual Information

Definition The relative entropy (information divergence, or Kullback-Leibler divergence) is
$$D(p \,\|\, q) \triangleq E_p\!\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$

Definition Mutual information, in terms of divergence:
$$I(X; Y) \triangleq D(p(x, y) \,\|\, p(x) p(y)) = E_{p(x,y)}\!\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.$$

3 Chain Rules

3.1 Chain Rule for Entropy

The entropy of a collection of random variables is the sum of the conditional entropies:

Theorem (Chain rule for entropy) Let $(X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \,|\, X_{i-1}, \ldots, X_1).$$

3.2 Chain Rule for Mutual Information

Definition The conditional mutual information is
$$I(X; Y \,|\, Z) \triangleq H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)}\!\left[\log \frac{p(X, Y \,|\, Z)}{p(X|Z)\, p(Y|Z)}\right].$$

Theorem (Chain rule for mutual information)
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \,|\, X_{i-1}, X_{i-2}, \ldots, X_1).$$

3.3 Chain Rule for Relative Entropy

Definition The conditional relative entropy is
$$D(p(y|x) \,\|\, q(y|x)) \triangleq E_{p(x,y)}\!\left[\log \frac{p(Y|X)}{q(Y|X)}\right] = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}.$$

Theorem (Chain rule for relative entropy)
$$D(p(x, y) \,\|\, q(x, y)) = D(p(x) \,\|\, q(x)) + D(p(y|x) \,\|\, q(y|x)).$$
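To make these quantities concrete, here is a minimal numerical sketch in Python/NumPy. The $3 \times 3$ joint pmf and the helper functions `entropy` and `kl` are ad hoc illustrations, not taken from the notes; the sketch checks the chain rule $H(X, Y) = H(X) + H(Y|X)$ and computes $I(X; Y)$ both from entropies and as the divergence $D(p(x, y) \,\|\, p(x) p(y))$:

```python
import numpy as np

# Joint pmf p(x, y); rows index x, columns index y (an arbitrary example).
p_xy = np.array([[1/8,  1/16, 1/4 ],
                 [1/16, 1/8,  1/16],
                 [1/8,  1/8,  1/16]])
assert np.isclose(p_xy.sum(), 1.0)

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array; ignores zeros."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_x = p_xy.sum(axis=1)            # marginal of X
p_y = p_xy.sum(axis=0)            # marginal of Y

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
H_Y_given_X = H_XY - H_X          # chain rule: H(X,Y) = H(X) + H(Y|X)

# Mutual information two ways:
I_from_entropies = H_X + H_Y - H_XY
I_from_divergence = kl(p_xy, np.outer(p_x, p_y))
assert np.isclose(I_from_entropies, I_from_divergence)
assert I_from_divergence >= 0     # nonnegativity, proved in Section 4.1

print(f"H(X)={H_X:.4f}  H(Y|X)={H_Y_given_X:.4f}  I(X;Y)={I_from_entropies:.4f}")
```

Because $I(X; Y)$ is itself a divergence, its nonnegativity here is an instance of the information inequality proved in the next section.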
4 Jensen's Inequality

• Recall that a convex function on an interval is one for which every chord lies (on or) above the function on that interval.
• A function $f$ is concave if $-f$ is convex.

Theorem (Jensen's inequality) If $f$ is convex, then $E[f(X)] \geq f(E[X])$. If $f$ is strictly convex, equality implies $X = E[X]$ with probability 1.

4.1 Consequences

Theorem (Information inequality) $D(p \,\|\, q) \geq 0$, with equality iff $p = q$.

Corollary (Nonnegativity of mutual information) $I(X; Y) \geq 0$, with equality iff $X$ and $Y$ are independent.

Corollary (Conditional information inequality) $D(p(y|x) \,\|\, q(y|x)) \geq 0$, with equality iff $p(y|x) = q(y|x)$ for all $x, y$ such that $p(x) > 0$.

Corollary (Nonnegativity of conditional mutual information) $I(X; Y \,|\, Z) \geq 0$, with equality iff $X$ and $Y$ are conditionally independent given $Z$.

4.2 Some Inequalities

Theorem $H(X) \leq \log |\mathcal{X}|$, with equality iff $X$ has a uniform distribution over $\mathcal{X}$.

Theorem (Conditioning reduces entropy) $H(X|Y) \leq H(X)$, with equality iff $X$ and $Y$ are independent.

Theorem (Independence bound on entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i),$$
with equality iff the $X_i$ are independent.

5 Log Sum Inequality and its Applications

Theorem (Log sum inequality) For nonnegative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$
with equality iff $a_i / b_i$ is constant.

Theorem (Convexity of relative entropy) $D(p \,\|\, q)$ is convex in the pair $(p, q)$: for pmf pairs $(p_1, q_1)$ and $(p_2, q_2)$ and all $0 \leq \lambda \leq 1$,
$$D(\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D(p_1 \,\|\, q_1) + (1 - \lambda) D(p_2 \,\|\, q_2).$$

Theorem (Concavity of entropy) For $X \sim p(x)$, $H(p) := H_p(X)$ is a concave function of $p(x)$.

Theorem Let $(X, Y) \sim p(x, y) = p(x) p(y|x)$. Then $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$, and a convex function of $p(y|x)$ for fixed $p(x)$.

6 Data-Processing Inequality

6.1 Markov Chain

Definition $X, Y, Z$ form a Markov chain in that order ($X \to Y \to Z$) iff $p(x, y, z) = p(x)\, p(y|x)\, p(z|y)$.

Some consequences:
• $X \to Y \to Z$ iff $X$ and $Z$ are conditionally independent given $Y$.
• $X \to Y \to Z$ implies $Z \to Y \to X$; thus we can write $X \leftrightarrow Y \leftrightarrow Z$.
• If $Z = f(Y)$, then $X \to Y \to Z$.

6.2 Data-Processing Inequality

Theorem (Data-processing inequality) If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Z)$.

Corollary If $Z = g(Y)$, then $I(X; Y) \geq I(X; g(Y))$.

Corollary If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Y \,|\, Z)$.
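The data-processing inequality is easy to check numerically. The sketch below is a minimal example assuming a chain of two binary symmetric channels with arbitrarily chosen crossover probabilities 0.1 and 0.2 (none of these values come from the notes); it builds $X \to Y \to Z$ and verifies $I(X; Y) \geq I(X; Z)$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array; ignores zeros."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info(p_joint):
    """I(X;Y) in bits from a joint pmf matrix (rows = x, columns = y)."""
    p_x, p_y = p_joint.sum(axis=1), p_joint.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_joint)

def bsc(eps):
    """Transition matrix p(out|in) of a binary symmetric channel."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

p_x = np.array([0.5, 0.5])        # X ~ Bernoulli(1/2)
W1, W2 = bsc(0.1), bsc(0.2)       # X -> Y -> Z via two noisy channels

p_xy = p_x[:, None] * W1          # p(x, y) = p(x) p(y|x)
p_xz = p_xy @ W2                  # p(x, z) = sum_y p(x, y) p(z|y)

# Data-processing inequality: processing Y cannot increase information about X.
assert mutual_info(p_xy) >= mutual_info(p_xz)
print(f"I(X;Y) = {mutual_info(p_xy):.4f} >= I(X;Z) = {mutual_info(p_xz):.4f}")
```

Composing the two channels gives an effective crossover probability of $0.1 \cdot 0.8 + 0.9 \cdot 0.2 = 0.26$, so $I(X; Z) = 1 - H(0.26) < 1 - H(0.1) = I(X; Y)$, as the assertion confirms.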
7 Sufficient Statistics

7.1 Statistics and Mutual Information

• Consider a family of probability distributions $\{f_\theta(x)\}$ indexed by $\theta$.
• If $X \sim f(x \,|\, \theta)$ for fixed $\theta$ and $T(X)$ is any statistic (i.e., a function of the sample $X$), then $\theta \to X \to T(X)$.
• The data-processing inequality in turn implies $I(\theta; X) \geq I(\theta; T(X))$ for any distribution on $\theta$.
• Is it possible to choose a statistic that preserves all of the information in $X$ about $\theta$?

7.2 Sufficient Statistics and Compression

Definition (Sufficient statistic) A function $T(X)$ is a sufficient statistic relative to the family $\{f_\theta(x)\}$ if the conditional distribution of $X$, given $T(X) = t$, is independent of $\theta$ for any distribution on $\theta$ (Fisher-Neyman factorization):
$$f_\theta(x) = f(x \,|\, t)\, f_\theta(t) \;\Rightarrow\; \theta \to T(X) \to X \;\Rightarrow\; I(\theta; T(X)) \geq I(\theta; X).$$
Hence $I(\theta; X) = I(\theta; T(X))$ for a sufficient statistic.

Definition (Minimal sufficient statistic) A function $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it is a function of every other sufficient statistic $U$, in which case
$$\theta \to T(X) \to U(X) \to X$$
and the information about $\theta$ in the sample is maximally compressed.

8 Fano's Inequality

8.1 Fano's Inequality and Estimation Error

Fano's inequality relates the probability of estimation error to conditional entropy:

Theorem (Fano's inequality) For any estimator $\hat{X}$ with $X \to Y \to \hat{X}$ and $P_e = \Pr\{X \neq \hat{X}\}$, we have
$$H(P_e) + P_e \log |\mathcal{X}| \geq H(X \,|\, \hat{X}) \geq H(X \,|\, Y).$$
This implies $1 + P_e \log |\mathcal{X}| \geq H(X \,|\, Y)$, or
$$P_e \geq \frac{H(X \,|\, Y) - 1}{\log |\mathcal{X}|}.$$

8.2 Implications of Fano's Inequality

Some corollaries follow:

Corollary Let $p = \Pr\{X \neq Y\}$. Then $H(p) + p \log |\mathcal{X}| \geq H(X \,|\, Y)$.

Corollary Let $P_e = \Pr\{X \neq \hat{X}\}$, and constrain $\hat{X} : \mathcal{Y} \to \mathcal{X}$; then
$$H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X \,|\, Y).$$

8.3 Sharpness of Fano's Inequality

Suppose there is no observation $Y$, so that $X$ must simply be guessed, and order $X \in \{1, 2, \ldots, m\}$ such that $p_1 \geq p_2 \geq \cdots \geq p_m$. Then $\hat{X} = 1$ is the optimal estimate of $X$, with $P_e = 1 - p_1$, and Fano's inequality becomes
$$H(P_e) + P_e \log(m - 1) \geq H(X).$$
The pmf
$$(p_1, p_2, \ldots, p_m) = \left(1 - P_e, \frac{P_e}{m - 1}, \ldots, \frac{P_e}{m - 1}\right)$$
achieves this bound with equality.

8.4 Applications of Fano's Inequality

Lemma If $X$ and $X'$ are i.i.d. with entropy $H(X)$, then
$$\Pr\{X = X'\} \geq 2^{-H(X)},$$
with equality iff $X$ has a uniform distribution.

Corollary Let $X$ and $X'$ be independent with $X \sim p(x)$ and $X' \sim r(x)$, where $x, x' \in \mathcal{X}$. Then
$$\Pr\{X = X'\} \geq 2^{-H(p) - D(p \| r)} \quad \text{and} \quad \Pr\{X = X'\} \geq 2^{-H(r) - D(r \| p)}.$$
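As a closing numerical check (with $m = 5$ and $P_e = 0.4$ as arbitrary illustrative values, not from the notes), the sketch below verifies that the pmf of Section 8.3 meets Fano's inequality with equality, and that it satisfies the collision bound of Section 8.4:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array; ignores zeros."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

m, Pe = 5, 0.4                    # arbitrary illustrative choices
# The pmf from Section 8.3: (1 - Pe, Pe/(m-1), ..., Pe/(m-1)).
p = np.array([1 - Pe] + [Pe / (m - 1)] * (m - 1))

# With no observation Y, guessing X_hat = 1 gives error probability Pe,
# and Fano's inequality H(Pe) + Pe log(m-1) >= H(X) holds with equality:
H_Pe = entropy(np.array([Pe, 1 - Pe]))       # binary entropy H(Pe)
assert np.isclose(H_Pe + Pe * np.log2(m - 1), entropy(p))

# Collision bound for iid X, X': Pr{X = X'} = sum_x p(x)^2 >= 2^{-H(X)}.
collision = float(np.sum(p ** 2))
assert collision >= 2 ** (-entropy(p))
print(f"Pr{{X = X'}} = {collision:.4f} >= 2^(-H(X)) = {2 ** (-entropy(p)):.4f}")
```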