Chapter 2: Entropy and Mutual Information
University of Illinois at Chicago, ECE 534, Fall 2009, Natasha Devroye

Chapter 2 outline
• Definitions
• Entropy
• Joint entropy, conditional entropy
• Relative entropy, mutual information
• Chain rules
• Jensen's inequality
• Log-sum inequality
• Data processing inequality
• Fano's inequality

Definitions
A discrete random variable X takes on values x from the discrete alphabet X. The probability mass function (pmf) is described by p_X(x) = p(x) = Pr{X = x}, for x ∈ X.

[Excerpt from MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003), Chapter 2: Probability, Entropy, and Inference]

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles
An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values, A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i ≥ 0 and \sum_{a_i ∈ A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space character '-'.

[Figure 2.2 (not reproduced here): the probability distribution over the 27 × 27 possible bigrams xy in an English language document, The Frequently Asked Questions Manual for Linux.]

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).

Probability of a subset. If T is a subset of A_X then
  P(T) = P(x ∈ T) = \sum_{a_i ∈ T} P(x = a_i).   (2.1)
For example, if we define V to be the vowels from figure 2.1, V = {a, e, i, o, u}, then
  P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.   (2.2)

A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ A_X = {a_1, ..., a_I} and y ∈ A_Y = {b_1, ..., b_J}. We call P(x, y) the joint probability of x and y.

Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:
  P(x = a_i) ≡ \sum_{y ∈ A_Y} P(x = a_i, y).   (2.3)
Similarly, using briefer notation, the marginal probability of y is
  P(y) ≡ \sum_{x ∈ A_X} P(x, y).   (2.4)

Commas are optional when writing ordered pairs, so xy ⇔ x, y. N.B. In a joint ensemble XY the two variables are not necessarily independent.

Conditional probability
  P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j)   if P(y = b_j) ≠ 0.   (2.5)
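The marginal and conditional relations (2.3)-(2.5) are easy to check numerically. The short Python sketch below is not part of the original notes, and the 2x2 joint table in it is a made-up example used only for illustration.

# A minimal sketch illustrating equations (2.3)-(2.5): marginals by summation
# and conditioning by renormalization. The joint table is hypothetical.
joint = {  # P(x, y) for x in {0, 1}, y in {0, 1}  (made-up values)
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal_x(joint, x):
    """P(x) = sum_y P(x, y), equation (2.3)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(joint, y):
    """P(y) = sum_x P(x, y), equation (2.4)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def conditional_x_given_y(joint, x, y):
    """P(x | y) = P(x, y) / P(y), equation (2.5), defined when P(y) > 0."""
    py = marginal_y(joint, y)
    if py == 0:
        raise ValueError("P(y) = 0: conditional is undefined")
    return joint[(x, y)] / py

print(marginal_x(joint, 0))                # 0.5
print(marginal_y(joint, 1))                # 0.6
print(conditional_x_given_y(joint, 0, 1))  # 0.2 / 0.6 ≈ 0.333

The marginals come from summing out the other variable, and the conditional from renormalizing the joint probabilities that are consistent with the observed y.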
Figure 2.1 (MacKay): Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux). The picture shows the probabilities by the areas of white squares; the numerical values are listed in Table 2.9 below.

Definitions
The events X = x and Y = y are statistically independent if p(x, y) = p(x)p(y).
The variables X_1, X_2, ..., X_N are called independent if for all (x_1, x_2, ..., x_N) ∈ X_1 × X_2 × ... × X_N we have
  p(x_1, x_2, ..., x_N) = \prod_{i=1}^N p_{X_i}(x_i).
They are furthermore called identically distributed if all variables X_i have the same distribution p_X(x).

Entropy
• Intuitive notions?
• 2 ways of defining the entropy of a random variable:
  • axiomatic definition (want a measure with certain properties...)
  • just define it, and then justify the definition by showing it arises as the answer to a number of natural questions

Definition: The entropy H(X) of a discrete random variable X with pmf p_X(x) is given by
  H(X) = -\sum_x p_X(x) \log p_X(x) = -E_{p_X(x)}[\log p_X(X)].

Order these in terms of entropy
[Two slides of pictured distributions to be ranked by entropy; images not reproduced here.]

Entropy examples 1
• What's the entropy of a uniform discrete random variable taking on K values?
• What's the entropy of a random variable with X = [♣, ♦, ♥, ♠] and p_X = [1/2, 1/4, 1/8, 1/8]?
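As a quick numerical answer to the two questions above (a sketch added here, not part of the slides): the uniform distribution on K values has entropy \log_2 K, and the suit distribution [1/2, 1/4, 1/8, 1/8] has entropy 1.75 bits.

# A small sketch answering the "Entropy examples 1" questions numerically,
# using H(X) = sum_x p(x) log2(1/p(x)).
from math import log2

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities (zeros skipped)."""
    return sum(p * log2(1.0 / p) for p in pmf if p > 0)

K = 8
print(entropy([1.0 / K] * K))               # log2(K) = 3.0 bits for K = 8
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits for the card-suit example

The value 1.75 bits reappears below as the minimum expected number of binary questions needed to determine X for this same distribution.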
Entropy: example 2
• What's the entropy of a deterministic random variable?

[Excerpt from MacKay, Section 2.4: Definition of entropy and related functions]

The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions
The Shannon information content of an outcome x is defined to be
  h(x) = \log_2 (1 / P(x)).   (2.34)
It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]

In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i. At that point, we will shorten the name of this quantity to 'the information content'.

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:
  H(X) ≡ \sum_{x ∈ A_X} P(x) \log (1 / P(x)),   (2.35)
with the convention for P(x) = 0 that 0 × \log(1/0) ≡ 0, since \lim_{θ→0+} θ \log(1/θ) = 0.

Like the information content, entropy is measured in bits. When it is convenient, we may also write H(X) as H(p), where p is the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the uncertainty of X.

Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging \log(1/p_i) (shown in the fourth column) under the probability distribution p_i (shown in the third column).
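As a numerical check of Example 2.12 (again a sketch, not part of the slides), the entropy of the letter distribution can be recomputed from the probabilities in Table 2.9, reproduced below; with these rounded values the result should land close to the quoted 4.11 bits.

# Recomputing Example 2.12 from the rounded probabilities in Table 2.9.
from math import log2

letter_probs = [
    0.0575, 0.0128, 0.0263, 0.0285, 0.0913, 0.0173, 0.0133, 0.0313, 0.0599,  # a-i
    0.0006, 0.0084, 0.0335, 0.0235, 0.0596, 0.0689, 0.0192, 0.0008, 0.0508,  # j-r
    0.0567, 0.0706, 0.0334, 0.0069, 0.0119, 0.0073, 0.0164, 0.0007, 0.1928,  # s-z, space
]

H = -sum(p * log2(p) for p in letter_probs if p > 0)
print(round(H, 2))  # ≈ 4.11 bits, matching Example 2.12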
Table 2.9. Shannon information contents of the outcomes a–z.

   i   a_i    p_i     h(p_i)
   1    a    0.0575    4.1
   2    b    0.0128    6.3
   3    c    0.0263    5.2
   4    d    0.0285    5.1
   5    e    0.0913    3.5
   6    f    0.0173    5.9
   7    g    0.0133    6.2
   8    h    0.0313    5.0
   9    i    0.0599    4.1
  10    j    0.0006   10.7
  11    k    0.0084    6.9
  12    l    0.0335    4.9
  13    m    0.0235    5.4
  14    n    0.0596    4.1
  15    o    0.0689    3.9
  16    p    0.0192    5.7
  17    q    0.0008   10.3
  18    r    0.0508    4.3
  19    s    0.0567    4.1
  20    t    0.0706    3.8
  21    u    0.0334    4.9
  22    v    0.0069    7.2
  23    w    0.0119    6.4
  24    x    0.0073    7.1
  25    y    0.0164    5.9
  26    z    0.0007   10.4
  27    -    0.1928    2.4

  \sum_i p_i \log_2(1/p_i) ≈ 4.1

Entropy: example 3
• A Bernoulli random variable takes on heads (0) with probability p and tails with probability 1 - p. Its entropy is defined as
  H(p) := -p \log_2(p) - (1 - p) \log_2(1 - p).
[Figure 2.1 of Cover & Thomas, Entropy, Relative Entropy, and Mutual Information: H(p) vs. p, a concave curve that is 0 at p = 0 and p = 1 and peaks at 1 bit at p = 1/2.]

Entropy
Suppose that we wish to determine the value of X with the minimum number of binary questions. An efficient first question is "Is X = a?" This splits the probability in half. If the answer to the first question is no, the second question can be "Is X = b?" The third question can be "Is X = c?" The resulting expected number of binary questions required is 1.75. This turns out to be the minimum expected number of binary questions required to determine the value of X. In Chapter 5 we show that the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.

The entropy H(X) = -\sum_x p(x) \log p(x) has the following properties:
• H(X) ≥ 0: entropy is always non-negative. H(X) = 0 iff X is deterministic (with the convention 0 \log 0 = 0).
• H(X) ≤ \log(|X|), with H(X) = \log(|X|) iff X has a uniform distribution over X.
• Since H_b(X) = \log_b(a) H_a(X), we don't need to specify the base of the logarithm (bits vs. nats).

Moving on to multiple RVs

2.2 Joint entropy and conditional entropy [Cover & Thomas]
We defined the entropy of a single random variable in Section 2.1. We now extend the definition to a pair of random variables. There is nothing really new in this definition because (X, Y) can be considered to be a single vector-valued random variable.
Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as
  H(X, Y) = -\sum_{x ∈ X} \sum_{y ∈ Y} p(x, y) \log p(x, y).   (2.8)

Joint entropy and conditional entropy
Definition: The joint entropy of a pair of discrete random variables X and Y is
  H(X, Y) := -E_{p(x,y)}[\log p(X, Y)] = -\sum_{x ∈ X} \sum_{y ∈ Y} p(x, y) \log p(x, y).
The conditional entropy of Y given X is
  H(Y | X) := -E_{p(x,y)}[\log p(Y | X)] = \sum_{x ∈ X} p(x) H(Y | X = x).
Note: H(X | Y) ≠ H(Y | X) in general.
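A small sketch (mine, not from the slides) of the two definitions just given, computing H(X, Y) and H(Y | X) directly from a joint pmf stored as a nested list; the joint table used here is a made-up example.

# Joint and conditional entropy from a joint pmf, with joint[i][j] = p(X = i, Y = j).
from math import log2

def joint_entropy(joint):
    """H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y)."""
    return -sum(p * log2(p) for row in joint for p in row if p > 0)

def conditional_entropy_y_given_x(joint):
    """H(Y | X) = -sum_{x,y} p(x, y) log2 p(y | x)."""
    h = 0.0
    for row in joint:                    # one row per value of X
        px = sum(row)                    # marginal p(x)
        for pxy in row:
            if pxy > 0:
                h -= pxy * log2(pxy / px)
    return h

# Example usage with a made-up joint table (rows index x, columns index y):
joint = [[0.125, 0.375],
         [0.375, 0.125]]
print(joint_entropy(joint))                  # ≈ 1.811 bits
print(conditional_entropy_y_given_x(joint))  # ≈ 0.811 bits

Since H(X) = 1 bit for this table, the two printed values also satisfy H(X, Y) = H(X) + H(Y | X), which is exactly the chain rule stated next.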
Joint entropy and conditional entropy
• Natural definitions, since....
Theorem (Chain rule): H(X, Y) = H(X) + H(Y | X).
Corollary (three-term entropies): H(X, Y | Z) = H(X | Z) + H(Y | X, Z).

[Excerpt from MacKay, Chapter 8: Dependent Random Variables]
Figure 8.1. The relationship between joint information, marginal entropy, conditional entropy and mutual entropy: a bar diagram of H(X, Y), H(X), H(Y), H(X | Y), I(X; Y), and H(Y | X).
Figure 8.2. A misleading representation of the same entropies as a Venn diagram (contrast with figure 8.1).

8.2 Exercises
Exercise 8.1. Consider three independent random variables u, v, w with entropies H_u, H_v, H_w. Let X ≡ (U, V) and Y ≡ (V, W). What is H(X, Y)? What is H(X | Y)? What is I(X; Y)?
Exercise 8.2. Referring to the definitions of conditional entropy (8.3-8.4), confirm (with an example) that it is possible for H(X | y = b_k) to exceed H(X), but that the average, H(X | Y), is less than H(X). So data are helpful: they do not increase uncertainty, on average.
Exercise 8.3. Prove the chain rule for entropy, equation (8.7).
Exercise 8.7. Consider the ensemble XYZ in which A_X = A_Y = A_Z = {0, 1}, x and y are independent with P_X = {p, 1 - p} and P_Y = {q, 1 - q}, and
  z = (x + y) mod 2.   (8.13)
(a) If q = 1/2, what is P_Z? What is I(Z; X)?
(b) For general p and q, what is P_Z? What is I(Z; X)? Notice that this ensemble is related to the binary symmetric channel, with x = input, y = noise, and z = output.
Exercise 8.8. Many texts draw figure 8.1 in the form of a Venn diagram (figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble XYZ in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = x + y mod 2.

8.3 Further exercises
The data-processing theorem: data processing can only destroy information.

Joint/conditional entropy examples
  p(x, y)    x = 0    x = 1
  y = 0       1/2       0
  y = 1       1/4      1/4
H(X, Y) = ?   H(X | Y) = ?   H(Y | X) = ?   H(X) = ?   H(Y) = ?
(See the numerical sketch after the mutual information definition below.)

Entropy is central because...
(A) entropy is the measure of average uncertainty in the random variable
(B) entropy is the average number of bits needed to describe the random variable
(C) entropy is a lower bound on the average length of the shortest description of the random variable
(D) entropy is measured in bits?
(E) H(X) = -\sum_x p(x) \log_2 p(x)
(F) entropy of a deterministic value is 0

Mutual information
• Entropy H(X) is the uncertainty (``self-information'') of a single random variable.
• Conditional entropy H(X|Y) is the entropy of one random variable conditional upon knowledge of another.
• The average amount of decrease of the randomness of X by observing Y is the average information that Y gives us about X.

Definition: The mutual information I(X; Y) between the random variables X and Y is given by
  I(X; Y) = H(X) - H(X | Y)
          = \sum_{x ∈ X} \sum_{y ∈ Y} p(x, y) \log_2 [p(x, y) / (p(x)p(y))]
          = E_{p(x,y)}[\log_2 (p(X, Y) / (p(X)p(Y)))].
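The joint/conditional entropy example table above and the mutual information just defined can be checked with a few lines of Python (a sketch, not part of the slides):

# Numerical check for the example table p(0,0)=1/2, p(1,0)=0, p(0,1)=1/4, p(1,1)=1/4.
from math import log2

joint = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25}

def H(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]  # [3/4, 1/4]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]  # [1/2, 1/2]

H_XY = H(joint.values())      # 1.5 bits
H_X, H_Y = H(px), H(py)       # ≈ 0.811 bits, 1.0 bit
H_X_given_Y = H_XY - H_Y      # 0.5 bits   (chain rule)
H_Y_given_X = H_XY - H_X      # ≈ 0.689 bits
I_XY = H_X - H_X_given_Y      # ≈ 0.311 bits
print(H_XY, H_X, H_Y, H_X_given_Y, H_Y_given_X, I_XY)

Note that H(X | Y) = 0.5 bit while H(Y | X) ≈ 0.689 bit, confirming that the two conditional entropies differ in general; the same table is reused in the mutual information example on the next slide, where I(X; Y) ≈ 0.311 bit.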
At the heart of information theory because...
• Information channel capacity: for a channel p(y | x) with input X and output Y,
  C = \max_{p(x)} I(X; Y).
• Operational channel capacity: the highest rate (bits/channel use) at which one can communicate reliably.
• The channel coding theorem says: information capacity = operational capacity.
Examples of capacity expressions listed on the slide:
  C = (1/2) \log_2 (1 + |h|^2 P / P_N),
  E_h[(1/2) \log_2 (1 + |h|^2 P / P_N)],
  \max_{Q: \mathrm{Tr}(Q) = P} (1/2) \log_2 |I_{M_R} + H Q H^†|.

Mutual information example
  p(x, y)    x = 0    x = 1
  y = 0       1/2       0
  y = 1       1/4      1/4
with marginals p(x) = [3/4, 1/4] and p(y) = [1/2, 1/2]. I(X; Y) = ?

Divergence (relative entropy, K-L distance)
Definition: The relative entropy, divergence, or Kullback-Leibler distance between two distributions p and q on the same alphabet is
  D(p ‖ q) := E_p[\log (p(x)/q(x))] = \sum_{x ∈ X} p(x) \log (p(x)/q(x)).
(Note: we use the conventions 0 \log(0/0) = 0, 0 \log(0/q) = 0, and p \log(p/0) = ∞.)
• D(p ‖ q) is in a sense a measure of the "distance" between the two distributions.
• If p = q then D(p ‖ q) = 0.
• Note that D(p ‖ q) is not a true distance.
[Slide figure: two pairs of pictured distributions, with D( , ) = 0.2075 and D( , ) = 0.1887.]

K-L divergence example
• X = {1, 2, 3, 4, 5, 6}
• P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
• Q = [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]
• D(p ‖ q) = ? and D(q ‖ p) = ?  (see the sketch below)

Mutual information as divergence!
Definition: The mutual information I(X; Y) between the random variables X and Y is given by
  I(X; Y) = H(X) - H(X | Y)
          = \sum_{x ∈ X} \sum_{y ∈ Y} p(x, y) \log_2 [p(x, y) / (p(x)p(y))]
          = E_{p(x,y)}[\log_2 (p(X, Y) / (p(X)p(Y)))].
• Can we express mutual information in terms of the K-L divergence?
  I(X; Y) = D(p(x, y) ‖ p(x)p(y)).

Mutual information and entropy
Theorem (Relationship between mutual information and entropy):
  I(X; Y) = H(X) - H(X | Y)
  I(X; Y) = H(Y) - H(Y | X)
  I(X; Y) = H(X) + H(Y) - H(X, Y)
  I(X; Y) = I(Y; X)   (symmetry)
  I(X; X) = H(X)   ("self-information")
[Venn diagrams relating H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y).]
``Two's company, three's a crowd.''

Chain rule for entropy
Theorem (Chain rule for entropy): For (X_1, X_2, ..., X_n) ~ p(x_1, x_2, ..., x_n),
  H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1).
[Venn-diagram illustration for three variables: H(X_1, X_2, X_3) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_1, X_2).]

Conditional mutual information
[Venn-diagram illustration: I(X; Y | Z) = H(X | Z) - H(X | Y, Z).]

Chain rule for mutual information
Theorem (Chain rule for mutual information):
  I(X_1, X_2, ..., X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_{i-1}, X_{i-2}, ..., X_1).
[Venn-diagram illustration: I(X, Y; Z) = I(X; Z) + I(Y; Z | X).]
The chain rule for relative entropy is in the book, pg. 24.

What is the grey region?
[Venn-diagram quiz slide; images not reproduced here.]
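A short sketch (not part of the slides) answering the K-L divergence example above numerically, working in bits, and also checking the mutual-information-as-divergence identity on the running example table. It assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite by the convention stated earlier.

# K-L divergence in bits, plus a check that D(p(x,y) || p(x)p(y)) = I(X;Y).
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); terms with p(x) = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [1/6] * 6
Q = [1/10] * 5 + [1/2]
print(kl_divergence(P, Q))  # ≈ 0.35 bits
print(kl_divergence(Q, P))  # ≈ 0.42 bits

# Mutual information as a divergence, for the earlier example table:
joint = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25}
px, py = {0: 0.75, 1: 0.25}, {0: 0.5, 1: 0.5}
pairs = [(p, px[x] * py[y]) for (x, y), p in joint.items() if p > 0]
print(kl_divergence([a for a, _ in pairs], [b for _, b in pairs]))  # ≈ 0.311 bits = I(X;Y)

The two die-example values differ, which illustrates why D(p ‖ q) is not a true distance: it is not symmetric.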
Another disclaimer....
[Excerpt from MacKay's textbook, Chapter 8: Dependent Random Variables]

Figure 8.3. A misleading representation of entropies, continued. [A three-set Venn diagram with regions labelled H(X | Y, Z), H(Y | X, Z), H(Z | X, Y), H(Z | X), H(Z | Y), I(X; Y | Z), and a central area A.]

[...] that the random outcome (x, y) might correspond to a point in the diagram, and thus confuse entropies with probabilities.

Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of two random variables it is indeed true that H(X | Y), I(X; Y) and H(Y | X) are positive quantities. But as soon as we progress to three-variable ensembles, we obtain a diagram with positive-looking areas that may actually correspond to negative quantities. Figure 8.3 correctly shows relationships such as
  H(X) + H(Z | X) + H(Y | X, Z) = H(X, Y, Z).   (8.31)
But it gives the misleading impression that the conditional mutual information I(X; Y | Z) is less than the mutual information I(X; Y). In fact the area labelled A can correspond to a negative quantity. Consider the joint ensemble (X, Y, Z) in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = x + y mod 2. Then clearly H(X) = H(Y) = 1 bit. Also H(Z) = 1 bit. And H(Y | X) = H(Y) = 1 since the two variables are independent. So the mutual information between X and Y is zero: I(X; Y) = 0. However, if z is observed, X and Y become dependent: knowing x, given z, tells you what y is, namely y = z - x mod 2. So I(X; Y | Z) = 1 bit. Thus the area labelled A must correspond to -1 bits for the figure to give the correct answers.

The above example is not at all a capricious or exceptional illustration. The binary symmetric channel with input X, noise Y, and output Z is a situation in which I(X; Y) = 0 (input and noise are independent) but I(X; Y | Z) > 0 (once you see the output, the unknown input and the unknown noise are intimately related!).

The Venn diagram representation is therefore valid only if one is aware that positive areas may represent negative quantities. With this proviso kept in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 1991).

Solution to exercise 8.9 (p.141). For any joint ensemble XYZ, the following chain rule for mutual information holds:
  I(X; Y, Z) = I(X; Y) + I(X; Z | Y).   (8.32)
Now, in the case w → d → r, w and r are independent given d, so I(W; R | D) = 0. Using the chain rule twice, we have
  I(W; D, R) = I(W; D).   (8.33)

[MacKay's textbook]
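The XOR ensemble in the excerpt above is easy to verify numerically. The sketch below (my addition, not MacKay's) enumerates the joint distribution of (x, y, z) and computes I(X; Y) and I(X; Y | Z) from entropies; it should print 0 and 1 bit, matching the discussion of the area labelled A. The same example answers the Markov chain question that appears later: when X, Y, Z do not form a Markov chain, I(X; Y | Z) can exceed I(X; Y).

# x and y are independent fair bits and z = (x + y) mod 2.
from itertools import product
from math import log2

# Joint pmf over (x, y, z): each of the four (x, y) pairs has probability 1/4,
# and z is determined by x and y.
joint = {(x, y, (x + y) % 2): 0.25 for x, y in product((0, 1), repeat=2)}

def H(*idx):
    """Entropy in bits of the marginal over the coordinates listed in idx."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

I_XY = H(0) + H(1) - H(0, 1)                           # I(X;Y) = 0 bits
I_XY_given_Z = H(0, 2) + H(1, 2) - H(2) - H(0, 1, 2)   # I(X;Y|Z) = 1 bit
print(I_XY, I_XY_given_Z)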
Convex and concave functions
[Two slides of example plots of convex and concave functions; images not reproduced here.]

Jensen's inequality
Theorem (Jensen's inequality): If f is convex, then E[f(X)] ≥ f(E[X]). If f is strictly convex, equality implies X = E[X] with probability 1.

Jensen's inequality consequences
• Theorem (Information inequality): D(p ‖ q) ≥ 0, with equality iff p = q.
• Corollary (Non-negativity of mutual information): I(X; Y) ≥ 0, with equality iff X and Y are independent.
• Theorem (Conditioning reduces entropy): H(X | Y) ≤ H(X), with equality iff X and Y are independent.
• Theorem: H(X) ≤ \log |X|, with equality iff X has a uniform distribution over X.
• Theorem (Independence bound on entropy): H(X_1, X_2, ..., X_n) ≤ \sum_{i=1}^n H(X_i), with equality iff the X_i are independent.

Log-sum inequality
Theorem (Log-sum inequality): For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,
  \sum_{i=1}^n a_i \log (a_i / b_i) ≥ (\sum_{i=1}^n a_i) \log (\sum_{i=1}^n a_i / \sum_{i=1}^n b_i),
with equality iff a_i / b_i is constant.
Conventions: 0 \log 0 = 0, a \log(a/0) = ∞ if a > 0, and 0 \log(0/0) = 0.

Log-sum inequality consequences
• Theorem (Convexity of relative entropy): D(p ‖ q) is convex in the pair (p, q), so that for pmfs (p_1, q_1) and (p_2, q_2) we have, for all 0 ≤ λ ≤ 1,
  D(λp_1 + (1-λ)p_2 ‖ λq_1 + (1-λ)q_2) ≤ λ D(p_1 ‖ q_1) + (1-λ) D(p_2 ‖ q_2).
• Theorem (Concavity of entropy): For X ~ p(x), H(p) := H_p(X) is a concave function of p(x).
• Theorem (Concavity of the mutual information in p(x)): Let (X, Y) ~ p(x, y) = p(x)p(y|x). Then I(X; Y) is a concave function of p(x) for fixed p(y|x).
• Theorem (Convexity of the mutual information in p(y|x)): Let (X, Y) ~ p(x, y) = p(x)p(y|x). Then I(X; Y) is a convex function of p(y|x) for fixed p(x).

Markov chains
Definition: X, Y, Z form a Markov chain in that order (X → Y → Z) iff
  p(x, y, z) = p(x) p(y | x) p(z | y),   equivalently p(z | y, x) = p(z | y).
• X → Y → Z iff X and Z are conditionally independent given Y.
• X → Y → Z implies Z → Y → X. Thus, we can write X ↔ Y ↔ Z.

Data-processing inequality
[Slide diagrams: X passes through noise N_1 to give Y, which passes through noise N_2 (or a deterministic function f(·)) to give Z.]
If X → Y → Z, then I(X; Y) ≥ I(X; Z).

Markov chain questions
• If X → Y → Z, then I(X; Y) ≥ I(X; Y | Z).
• What if X, Y, Z do not form a Markov chain: can I(X; Y | Z) ≥ I(X; Y)?
• If X_1 → X_2 → X_3 → X_4 → X_5 → X_6, then mutual information increases as you get closer together:
  I(X_1; X_2) ≥ I(X_1; X_4) ≥ I(X_1; X_5) ≥ I(X_1; X_6).

Consequences for sufficient statistics
• Consider a family of probability distributions {f_θ(x)} indexed by θ. If X ~ f(x | θ) for fixed θ and T(X) is any statistic (i.e., function of the sample X), then we have θ → X → T(X).
• The data processing inequality in turn implies I(θ; X) ≥ I(θ; T(X)) for any distribution on θ.
• Is it possible to choose a statistic that preserves all of the information in X about θ?

Definition (Sufficient statistic): A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if the conditional distribution of X, given T(X) = t, is independent of θ for any distribution on θ (Fisher-Neyman):
  f_θ(x) = f(x | t) f_θ(t)  ⇒  θ → T(X) → X  ⇒  I(θ; T(X)) ≥ I(θ; X).

Example of a sufficient statistic
(See the sketch below.)
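The worked example on this slide did not survive the extraction, so the sketch below uses a standard stand-in of my own choosing (not necessarily the example used in class): for X_1, ..., X_n i.i.d. Bernoulli(θ), the number of ones T(X) = \sum_i X_i is a sufficient statistic, and the code checks the defining property that the conditional distribution of the full sequence given T does not depend on θ.

# Sufficiency check for T(X) = number of ones in an i.i.d. Bernoulli(theta) sequence.
from itertools import product

def conditional_given_T(theta, n=3):
    """Return {t: {sequence: P(sequence | T = t)}} for n i.i.d. Bernoulli(theta) bits."""
    seq_prob = {}
    for seq in product((0, 1), repeat=n):
        k = sum(seq)
        seq_prob[seq] = theta**k * (1 - theta)**(n - k)
    cond = {}
    for seq, p in seq_prob.items():
        t = sum(seq)
        total_t = sum(q for s, q in seq_prob.items() if sum(s) == t)
        cond.setdefault(t, {})[seq] = p / total_t
    return cond

# The conditional distributions agree for different theta (uniform over the sequences
# with a given number of ones), so T carries all the information in X about theta.
c1, c2 = conditional_given_T(0.3), conditional_given_T(0.8)
print(all(abs(c1[t][s] - c2[t][s]) < 1e-9 for t in c1 for s in c1[t]))  # True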
Fano's inequality
Theorem (Fano's inequality): Suppose we estimate X from Y with X̂ = g(Y), and let P_e = Pr{X̂ ≠ X}. Then
  H(P_e) + P_e \log(|X| - 1) ≥ H(X | X̂) ≥ H(X | Y).

Fano's inequality consequences
• A weaker but simpler bound: 1 + P_e \log |X| ≥ H(X | Y), so P_e ≥ (H(X | Y) - 1) / \log |X|.
• Whenever H(X | Y) is large, the probability of error in guessing X from Y is bounded away from zero; Fano's inequality is the key tool in proving converses to coding theorems.