ECE 534: Elements of Information Theory, Fall 2010
Homework 2 Solutions

Ex. 2.28 (Kenneth S. Palacio Baus)

Mixing increases entropy. Show that the entropy of the probability distribution (p_1, ..., p_i, ..., p_j, ..., p_m) is less than the entropy of the distribution (p_1, ..., (p_i + p_j)/2, ..., (p_i + p_j)/2, ..., p_m). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

Solution

We compare the entropies of the two distributions. Only two probabilities differ between them, and those two terms determine which distribution has the higher entropy. Let the probability distribution of X be p(x) = (p_1, ..., p_i, ..., p_j, ..., p_m) and the probability distribution of Y be p(y) = (p_1, ..., (p_i + p_j)/2, ..., (p_i + p_j)/2, ..., p_m). We then compare H(X) with H(Y):

H(X) = -p_1 \log_2 p_1 - \dots - p_i \log_2 p_i - \dots - p_j \log_2 p_j - \dots - p_m \log_2 p_m

H(Y) = -p_1 \log_2 p_1 - \dots - \frac{p_i + p_j}{2} \log_2 \frac{p_i + p_j}{2} - \dots - \frac{p_i + p_j}{2} \log_2 \frac{p_i + p_j}{2} - \dots - p_m \log_2 p_m

Keeping only the terms that differ between the two expressions, we obtain:

For H(X):  -p_i \log_2 p_i - p_j \log_2 p_j

For H(Y):  -\frac{p_i + p_j}{2} \log_2 \frac{p_i + p_j}{2} - \frac{p_i + p_j}{2} \log_2 \frac{p_i + p_j}{2} = -(p_i + p_j) \log_2 \frac{p_i + p_j}{2}

Applying the log sum inequality (Theorem 2.7.1 of the course textbook), with a_1 = p_i, a_2 = p_j and b_1 = b_2 = 1, to the terms from H(X):

p_i \log_2 p_i + p_j \log_2 p_j \geq (p_i + p_j) \log_2 \frac{p_i + p_j}{2}

which implies

-(p_i \log_2 p_i + p_j \log_2 p_j) \leq -(p_i + p_j) \log_2 \frac{p_i + p_j}{2}

Hence the entropy of the distribution of X is less than that of Y (with equality only when p_i = p_j).

In general, a distribution is made more uniform by averaging several of its entries. Write the distribution of X as p(x) = (p_1, ..., p_i, ..., p_j, ..., p_k, ..., p_m) and the distribution of Y as p(y) = (p_1, ..., (p_i + p_j + p_k)/3, ..., (p_i + p_j + p_k)/3, ..., (p_i + p_j + p_k)/3, ..., p_m). An analysis analogous to the one above, comparing H(X) with H(Y), leads to the same conclusion. Applying the log sum inequality (Theorem 2.7.1 of the course textbook) to the terms from H(X):

p_i \log_2 p_i + p_j \log_2 p_j + p_k \log_2 p_k \geq (p_i + p_j + p_k) \log_2 \frac{p_i + p_j + p_k}{3}

which implies

-(p_i \log_2 p_i + p_j \log_2 p_j + p_k \log_2 p_k) \leq -(p_i + p_j + p_k) \log_2 \frac{p_i + p_j + p_k}{3}

Hence the entropy of the distribution of X is less than that of Y (with equality only when p_i = p_j = p_k). In general, averaging up to n = m − 2 terms of the first distribution to produce the terms of the second (all terms except p_1 and p_m in this particular example) yields a more uniform distribution with higher entropy.
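The claim can be checked numerically. The following Python sketch is not part of the original solution; the example distribution and the indices being mixed are arbitrary illustrative choices. It computes the entropy before and after replacing two entries by their average and confirms that the entropy does not decrease.

```python
# Numerical sanity check for Ex. 2.28: averaging two entries of a pmf
# never decreases the entropy. The distribution below is an arbitrary example.
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with zero probability contribute zero."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

p = np.array([0.5, 0.3, 0.1, 0.1])    # original distribution p(x)
q = p.copy()
i, j = 1, 2                           # indices of the two entries to mix
q[i] = q[j] = (p[i] + p[j]) / 2       # replace p_i and p_j by their average

print(f"H(X) = {entropy(p):.4f} bits,  H(Y) = {entropy(q):.4f} bits")
assert entropy(q) >= entropy(p)       # mixing never decreases the entropy
```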
Ex. 2.29 (Davide Basilio Bartolini)

Text

Inequalities. Let X, Y and Z be jointly distributed random variables. Prove the following inequalities and find conditions for equality.

(a) H(X, Y |Z) ≥ H(X|Z)
(b) I(X, Y ; Z) ≥ I(X; Z)
(c) H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X)
(d) I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z)

Solution

(a) By the chain rule for entropy,

H(X, Y, Z) = H(Y, X, Z) = H(Y |X, Z) + H(X|Z) + H(Z) = H(X, Y |Z) + H(Z)

and since H(Y |X, Z) ≥ 0 it follows that H(X, Y |Z) ≥ H(X|Z). Equality H(X, Y |Z) = H(X|Z) holds iff H(Y |X, Z) = 0, which happens when Y = f(X, Z), i.e. Y is a deterministic function of (X, Z).

(b) I(X, Y ; Z) = I(Z; X, Y ) = I(Z; X) + I(Z; Y |X), and since I(Z; Y |X) ≥ 0, it follows that I(X, Y ; Z) ≥ I(X; Z). Equality I(X, Y ; Z) = I(X; Z) holds only when I(Z; Y |X) = 0, i.e. when Z is conditionally independent of Y given X, which is to say when Z → X → Y form a Markov chain.

(c) By the chain rule for entropy,

H(X, Y, Z) − H(X, Y ) = H(X, Y ) + H(Z|X, Y ) − H(X, Y ) = H(Z|X, Y )

H(X, Z) − H(X) = H(X) + H(Z|X) − H(X) = H(Z|X)

Since conditioning reduces entropy, H(Z|X, Y ) ≤ H(Z|X), and therefore H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X). Equality H(X, Y, Z) − H(X, Y ) = H(X, Z) − H(X) holds when Z is conditionally independent of Y given X, i.e. when Z → X → Y form a Markov chain.

(d) Expanding I(X, Y ; Z) with the chain rule in the two possible orders,

I(X, Y ; Z) = I(X; Z|Y ) + I(Y ; Z) = I(Y ; Z|X) + I(X; Z)

and since I(Y ; Z|X) = I(Z; Y |X), rearranging gives

I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z).

Hence I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z), and in fact equality always holds.

Ex. 2.37 (Matteo Carminati)

Text

Relative entropy. Let X, Y, Z be three random variables with joint probability mass function p(x, y, z). The relative entropy between the joint distribution and the product of the marginals is

D(p(x, y, z)||p(x)p(y)p(z)) = E \left[ \log \frac{p(X, Y, Z)}{p(X)p(Y)p(Z)} \right]

Expand this in terms of entropies. When is this quantity zero?

Solution

The definition of divergence and the properties of logarithms are used to expand the given formula:

D(p(x, y, z)||p(x)p(y)p(z)) = \sum_x \sum_y \sum_z p(x, y, z) \log_2 \frac{p(x, y, z)}{p(x)p(y)p(z)}    (by definition)

= \sum_x \sum_y \sum_z p(x, y, z) \log_2 p(x, y, z) − \sum_x \sum_y \sum_z p(x, y, z) \log_2 p(x) − \sum_x \sum_y \sum_z p(x, y, z) \log_2 p(y) − \sum_x \sum_y \sum_z p(x, y, z) \log_2 p(z)    (properties of logarithms)

= −H(X, Y, Z) − \sum_x p(x) \log_2 p(x) − \sum_y p(y) \log_2 p(y) − \sum_z p(z) \log_2 p(z)    (since \sum_x \sum_y \sum_z p(x, y, z) \log_2 p(x, y, z) = −H(X, Y, Z) by definition, and summing p(x, y, z) over the other two variables gives the marginals)

= −H(X, Y, Z) + H(X) + H(Y ) + H(Z)

Now we investigate when −H(X, Y, Z) + H(X) + H(Y ) + H(Z) = 0: the relation holds if and only if H(X, Y, Z) = H(X) + H(Y ) + H(Z). From the chain rule for entropy, H(X, Y, Z) = H(X) + H(Y |X) + H(Z|X, Y ). For both relations to hold, H(Y ) must equal H(Y |X) and H(Z) must equal H(Z|X, Y ); this is true if and only if X and Y are independent (H(Y ) = H(Y |X)) and Z is independent of both X and Y (H(Z) = H(Z|X, Y )). Thus the relative entropy is zero exactly when X, Y and Z are mutually independent, i.e. p(x, y, z) = p(x)p(y)p(z).
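The expansion just derived in Ex. 2.37 can be verified numerically. The sketch below is not part of the original solution; the joint pmf is randomly generated purely for illustration. It computes the relative entropy directly from its definition and compares it with H(X) + H(Y) + H(Z) − H(X, Y, Z).

```python
# Numerical check of Ex. 2.37:
# D(p(x,y,z) || p(x)p(y)p(z)) = H(X) + H(Y) + H(Z) - H(X,Y,Z).
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))
p /= p.sum()                                   # random joint pmf p(x, y, z)

def H(t):
    """Shannon entropy (bits) of an array of probabilities."""
    t = t[t > 0]
    return -np.sum(t * np.log2(t))

px = p.sum(axis=(1, 2))                        # marginal p(x)
py = p.sum(axis=(0, 2))                        # marginal p(y)
pz = p.sum(axis=(0, 1))                        # marginal p(z)

# Relative entropy computed directly from its definition
prod = px[:, None, None] * py[None, :, None] * pz[None, None, :]
D = np.sum(p * np.log2(p / prod))

# The same quantity via the entropy expansion
D_entropies = H(px) + H(py) + H(pz) - H(p.ravel())

print(D, D_entropies)
assert np.isclose(D, D_entropies)
```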
Ex. 2.42 (Davide Basilio Bartolini)

Text

Inequalities. Which of the following inequalities are generally ≥, =, ≤? Label each with ≥, = or ≤.

(a) H(5X) vs. H(X)
(b) I(g(X); Y ) vs. I(X; Y )
(c) H(X_0 |X_{−1}) vs. H(X_0 |X_{−1}, X_1)
(d) H(X, Y )/(H(X) + H(Y )) vs. 1

Solution

(a) Set Y = f(X) = 5X. Since f is a bijection, X = f^{−1}(Y ) = g(Y ). We can therefore apply the known property H(f(X)) ≤ H(X) twice: H(5X) = H(f(X)) ≤ H(X) and H(X) = H(g(Y )) ≤ H(Y ) = H(5X). Thus H(5X) = H(X).

(b) I(g(X); Y ) ≤ I(X; Y ), by the data processing inequality, which states that I(Y ; X) ≥ I(Y ; g(X)).

(c) H(X_0 |X_{−1}) ≥ H(X_0 |X_{−1}, X_1), simply applying the property that conditioning reduces entropy.

(d) H(X, Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y ), since H(Y |X) ≤ H(Y ); therefore H(X, Y )/(H(X) + H(Y )) ≤ 1.

Ex. 2.43 (Johnson Jonaris GadElkarim)

Text

Mutual information of heads and tails.

(a) Consider a fair coin flip. What is the mutual information between the top and bottom sides of the coin?

(b) A six-sided fair die is rolled. What is the mutual information between the top side and the front face (the side most facing you)?

Solution

(a) Let T denote the top side and B the bottom side of the coin, each taking the values heads and tails with probability 0.5 since the coin is fair, so T ∼ Bernoulli(1/2). Observe that the bottom B is a deterministic function of the top T (it is always the opposite face), so H(B|T ) = 0. Therefore

I(B; T ) = H(B) − H(B|T ) = H(B) = −2 · 0.5 \log_2 0.5 = 1 bit.

(b) The top face T of the die is uniformly distributed on {1, 2, 3, 4, 5, 6}, i.e. p(T = t) = 1/6 for each t. If we observe the front face F, four equally probable possibilities remain for the top face T (the front face and the face opposite to it are excluded), so H(T |F ) = \log_2 4. Therefore

I(T ; F ) = H(T ) − H(T |F ) = \log_2 6 − \log_2 4 ≈ 0.58 bits.
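Both mutual information values in Ex. 2.43 can also be obtained by brute force from the joint distribution of the faces. The sketch below is an illustrative check rather than part of the original solution; the helper mutual_information and the convention that opposite die faces sum to 7 are assumptions introduced here for the example.

```python
# Brute-force check of Ex. 2.43: mutual information between faces of a coin and a die.
import numpy as np
from collections import Counter

def mutual_information(pairs):
    """I(A; B) in bits, computed from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * np.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# (a) Fair coin: the bottom face is always the opposite of the top face.
coin = [("heads", "tails"), ("tails", "heads")]
print("coin:", mutual_information(coin))       # -> 1.0 bit

# (b) Fair die: for each top face, the front face is one of the four faces
# that are neither the top face nor its opposite (opposite faces sum to 7).
die = [(t, f) for t in range(1, 7) for f in range(1, 7) if f not in (t, 7 - t)]
print("die:", mutual_information(die))         # -> log2(6) - log2(4) ≈ 0.585 bits
```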