ECE 534: Elements of Information Theory, Fall 2010
Homework 2
Solutions
Ex. 2.28 (Kenneth S. Palacio Baus)
Mixing increases entropy.
Show that the entropy of the probability distribution (p1, . . . , pi, . . . , pj, . . . , pm) is less than the entropy of the distribution (p1, . . . , (pi + pj)/2, . . . , (pi + pj)/2, . . . , pm). Show that in general any transfer of
probability that makes the distribution more uniform increases the entropy.
Solution
We need to compare the entropies of the two distributions. Only two probabilities differ between them, and those terms determine which distribution has the higher entropy.
Let us define the probability distribution p(x) = (p1, ..., pi, ..., pj, ..., pm) for the variable X, and the probability distribution p(y) = (p1, ..., (pi + pj)/2, ..., (pi + pj)/2, ..., pm) for the variable Y. Then we need to compare the entropies H(X) vs. H(Y).
H(X) = −p1 log2 p1 − ... − pi log2 pi − ... − pj log2 pj − ... − pm log2 pm
H(Y) = −p1 log2 p1 − ... − ((pi + pj)/2) log2((pi + pj)/2) − ... − ((pi + pj)/2) log2((pi + pj)/2) − ... − pm log2 pm
Selecting the terms that differ between the two previous expressions, we obtain:
For H(X): −pi log2 pi − pj log2 pj
For H(Y): −((pi + pj)/2) log2((pi + pj)/2) − ((pi + pj)/2) log2((pi + pj)/2) = −(pi + pj) log2((pi + pj)/2)
Applying the log sum inequality (Theorem 2.7.1 of the textbook) to the terms of H(X):
pi log2 pi + pj log2 pj ≥ (pi + pj) log2((pi + pj)/2)
which implies:
−(pi log2 pi + pj log2 pj) ≤ −(pi + pj) log2((pi + pj)/2)
Hence, the entropy of the distribution of X is less than or equal to that of Y, with equality only when pi = pj.
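As a quick numerical sanity check of this two-term mixing step (a sketch in Python; the example distribution and the pair of indices are arbitrary choices, not part of the problem):

import math

def entropy(p):
    # Shannon entropy in bits of a probability vector.
    return -sum(x * math.log2(x) for x in p if x > 0)

p = [0.5, 0.3, 0.1, 0.1]          # arbitrary example distribution
i, j = 1, 2                       # the two indices whose masses are averaged

q = list(p)
q[i] = q[j] = (p[i] + p[j]) / 2   # the "mixed" distribution

print(entropy(p), entropy(q))     # entropy(q) >= entropy(p)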
In general, a distribution is made more uniform by replacing several of its terms with their average. Let us write the distribution of X as p(x) = (p1, ..., pi, ..., pj, ..., pk, ..., pm), and the probability distribution p(y) = (p1, ..., (pi + pj + pk)/3, ..., (pi + pj + pk)/3, ..., (pi + pj + pk)/3, ..., pm) for the variable Y. Then, by an analysis similar to the one above, comparing the entropies H(X) vs. H(Y) leads to the same conclusion.
Applying the log sum inequality (Theorem 2.7.1 of the textbook) to the terms of H(X):
pi log2 pi + pj log2 pj + pk log2 pk ≥ (pi + pj + pk) log2((pi + pj + pk)/3)
which implies:
−(pi log2 pi + pj log2 pj + pk log2 pk) ≤ −(pi + pj + pk) log2((pi + pj + pk)/3)
Hence, the entropy of the distribution of X is again less than or equal to that of Y. In general, averaging up to n = m − 2 terms of the first distribution to create the terms of the second (all terms except p1 and pm in this particular example) produces a more uniform distribution and therefore cannot decrease the entropy.
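The broader claim, that transferring probability mass from a larger probability to a smaller one (without overshooting the midpoint) makes the distribution more uniform and cannot decrease its entropy, can be checked in the same way (again a sketch; the distribution, the indices and the transfer amount are arbitrary):

import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

p = [0.5, 0.3, 0.1, 0.1]   # arbitrary example distribution
i, j = 0, 2                # transfer mass from the larger p[i] to the smaller p[j]
t = 0.15                   # transfer amount, 0 < t <= (p[i] - p[j]) / 2

q = list(p)
q[i] -= t
q[j] += t

print(entropy(p), entropy(q))   # entropy(q) >= entropy(p)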
Ex. 2.29 (Davide Basilio Bartolini)
Text
Inequalities. Let X, Y and Z be joint random variables. Prove the following inequalities and find
conditions for equality.
(a) H(X, Y |Z) ≥ H(X|Z)
(b) I(X, Y ; Z) ≥ I(X; Z)
(c) H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X)
(d) I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z)
Solution
(a) By applying the chain rule for the entropy:
H(X, Y, Z) = H(Y, X, Z) = H(Y|X, Z) + H(X|Z) + H(Z)
= H(X, Y|Z) + H(Z)
Since H(Y|X, Z) ≥ 0, it follows that
H(X, Y|Z) ≥ H(X|Z)
Equality H(X, Y|Z) = H(X|Z) holds iff H(Y|X, Z) = 0, which happens iff Y is a function of (X, Z).
(b) I(X, Y; Z) = I(Z; X, Y) = I(Z; X) + I(Z; Y|X), and since I(Z; Y|X) ≥ 0, it follows that I(X, Y; Z) ≥ I(X; Z).
Equality I(X, Y; Z) = I(X; Z) holds only when I(Z; Y|X) = 0, i.e. when Z is conditionally independent of Y given X, which is to say when Z → X → Y form a Markov chain.
(c) H(X, Y, Z) − H(X, Y) = H(X, Y) + H(Z|X, Y) − H(X, Y) = H(Z|X, Y)
H(X, Z) − H(X) = H(X) + H(Z|X) − H(X) = H(Z|X)
H(Z|X, Y) ≤ H(Z|X), because conditioning reduces entropy; so H(X, Y, Z) − H(X, Y) ≤ H(X, Z) − H(X).
Equality H(X, Y, Z) − H(X, Y) = H(X, Z) − H(X) holds when Z is conditionally independent of Y given X (i.e. when Z → X → Y form a Markov chain).
(d) Expanding I(X, Y; Z) with the chain rule in two different orders:
I(X, Y; Z) = I(X; Z|Y) + I(Y; Z)
= I(Y; Z|X) + I(X; Z)
Equating the two expansions, and noting that I(Y; Z|X) = I(Z; Y|X):
I(X; Z|Y) + I(Y; Z) = I(Z; Y|X) + I(X; Z)
⇕
I(X; Z|Y) = I(Z; Y|X) − I(Y; Z) + I(X; Z)
So I(X; Z|Y) ≥ I(Z; Y|X) − I(Z; Y) + I(X; Z), and in fact equality always holds.
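All four inequalities can be verified numerically on a small example (a sketch in Python; the randomly generated joint pmf on a 2x2x2 alphabet is purely illustrative, and the helper functions H and I are ad hoc):

import itertools, math, random

random.seed(0)
# Random joint pmf p(x, y, z) on a 2x2x2 alphabet (illustrative only).
keys = list(itertools.product(range(2), repeat=3))
weights = [random.random() for _ in keys]
p = {k: w / sum(weights) for k, w in zip(keys, weights)}

def H(*axes):
    # Joint entropy (bits) of the variables at the given axes: 0 = X, 1 = Y, 2 = Z.
    marg = {}
    for k, v in p.items():
        key = tuple(k[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + v
    return -sum(v * math.log2(v) for v in marg.values() if v > 0)

def I(a, b, cond=()):
    # Conditional mutual information I(A; B | C) written in terms of entropies.
    return H(*a, *cond) + H(*b, *cond) - H(*a, *b, *cond) - H(*cond)

X, Y, Z = (0,), (1,), (2,)
print(H(*X, *Y, *Z) - H(*Z) >= H(*X, *Z) - H(*Z) - 1e-12)          # (a) H(X,Y|Z) >= H(X|Z)
print(I(X + Y, Z) >= I(X, Z) - 1e-12)                              # (b) I(X,Y;Z) >= I(X;Z)
print(H(*X, *Y, *Z) - H(*X, *Y) <= H(*X, *Z) - H(*X) + 1e-12)      # (c)
print(abs(I(X, Z, Y) - (I(Z, Y, X) - I(Z, Y) + I(X, Z))) < 1e-9)   # (d) holds with equality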
Ex. 2.37 (Matteo Carminati)
Text
Relative entropy. Let X, Y , Z be three random variables with a joint probability mass function
p(x, y, z). The relative entropy between the joint distribution and the product of the marginals is
D(p(x, y, z) || p(x)p(y)p(z)) = E log [ p(x, y, z) / (p(x)p(y)p(z)) ]
Expand this in terms of entropies. When is this quantity zero?
Solution
The definition of divergence and the properties of logarithms are used to expand the given formula:
D(p(x, y, z) || p(x)p(y)p(z)) = Σ_x Σ_y Σ_z p(x, y, z) log2 [ p(x, y, z) / (p(x)p(y)p(z)) ]   (by definition)
= Σ_x Σ_y Σ_z p(x, y, z) log2 p(x, y, z) − Σ_x Σ_y Σ_z p(x, y, z) log2 p(x) − Σ_x Σ_y Σ_z p(x, y, z) log2 p(y) − Σ_x Σ_y Σ_z p(x, y, z) log2 p(z)   (properties of logarithms)
= −H(X, Y, Z) − Σ_x p(x) log2 p(x) − Σ_y p(y) log2 p(y) − Σ_z p(z) log2 p(z)   (since Σ_x Σ_y Σ_z p(x, y, z) log2 p(x, y, z) = −H(X, Y, Z) by definition, and the remaining sums reduce to the marginals)
= −H(X, Y, Z) + H(X) + H(Y) + H(Z)
Now we have to investigate when −H(X, Y, Z) + H(X) + H(Y) + H(Z) = 0: the relation is true if and only if H(X, Y, Z) = H(X) + H(Y) + H(Z). In general, from the chain rule for entropy we have H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y). For both relations to hold, H(Y) must equal H(Y|X) and H(Z) must equal H(Z|X, Y): this is true if and only if X and Y are independent (H(Y) = H(Y|X)) and Z is independent of both X and Y (H(Z) = H(Z|X, Y)). Thus, the divergence is zero if and only if X, Y and Z are mutually independent.
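The expansion, and the condition for the divergence to be zero, can be confirmed numerically (a sketch; the dependent joint pmf below is an arbitrary illustration, and the second pmf is deliberately built as a product of marginals):

import itertools, math

def divergence_vs_entropies(p):
    # Returns D(p(x,y,z) || p(x)p(y)p(z)) and H(X) + H(Y) + H(Z) - H(X,Y,Z).
    def marginal(axis):
        m = {}
        for k, v in p.items():
            m[k[axis]] = m.get(k[axis], 0.0) + v
        return m
    mx, my, mz = marginal(0), marginal(1), marginal(2)
    D = sum(v * math.log2(v / (mx[x] * my[y] * mz[z]))
            for (x, y, z), v in p.items() if v > 0)
    def Hm(m):
        return -sum(v * math.log2(v) for v in m.values() if v > 0)
    Hxyz = -sum(v * math.log2(v) for v in p.values() if v > 0)
    return D, Hm(mx) + Hm(my) + Hm(mz) - Hxyz

# Arbitrary dependent joint pmf on {0,1}^3: both returned numbers agree and are > 0.
p_dep = {(0, 0, 0): 0.3, (0, 1, 1): 0.2, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
print(divergence_vs_entropies(p_dep))

# Product pmf (X, Y, Z independent): the divergence is (numerically) zero.
px, py, pz = {0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}, {0: 0.7, 1: 0.3}
p_ind = {(x, y, z): px[x] * py[y] * pz[z]
         for x, y, z in itertools.product(range(2), repeat=3)}
print(divergence_vs_entropies(p_ind))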
Ex. 2.42 (Davide Basilio Bartolini)
Text
Inequalities. Which of the following inequalities are generally ≥, =, ≤? Label each with ≥, = or
≤.
(a) H(5X) vs. H(X)
(b) I(g(X); Y) vs. I(X; Y)
(c) H(X0|X−1) vs. H(X0|X−1, X1)
(d) H(X, Y)/(H(X) + H(Y)) vs. 1
Solution
(a) Setting Y = f(X) = 5X, f is a bijective function, so X = f−1(Y) = g(Y).
Hence we can apply the known property H(f(X)) ≤ H(X) twice: H(5X) = H(f(X)) ≤ H(X) and H(X) = H(g(Y)) ≤ H(Y) = H(5X).
Thus, H(5X) = H(X).
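A small numerical illustration of this bijection argument (a sketch; the pmf chosen for X is arbitrary):

import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

pX  = {1: 0.5, 2: 0.3, 3: 0.2}            # arbitrary pmf of X
p5X = {5 * x: p for x, p in pX.items()}   # pmf of 5X: same probabilities, relabelled support

print(entropy(pX), entropy(p5X))          # equal: H(5X) = H(X)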
(b) I(g(X); Y) ≤ I(X; Y), due to the data processing inequality, which states that I(Y; X) ≥ I(Y; g(X)).
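A numerical check of the data-processing step (a sketch; the joint pmf of (X, Y) and the non-injective function g are arbitrary choices):

import math

# Arbitrary joint pmf of (X, Y) on {0,1,2} x {0,1} (illustrative only).
pXY = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.05, (1, 1): 0.25, (2, 0): 0.3, (2, 1): 0.1}

def g(x):
    return min(x, 1)   # non-injective g, merging the symbols 1 and 2 of X

def mutual_information(pxy):
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items() if v > 0)

# Joint pmf of (g(X), Y), obtained by merging the corresponding rows of pXY.
pGY = {}
for (x, y), v in pXY.items():
    pGY[(g(x), y)] = pGY.get((g(x), y), 0.0) + v

print(mutual_information(pGY), "<=", mutual_information(pXY))   # I(g(X);Y) <= I(X;Y)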
(c) H(X0|X−1) ≥ H(X0|X−1, X1), simply applying the property that conditioning reduces entropy.
(d) H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y), since H(Y|X) ≤ H(Y); so H(X, Y)/(H(X) + H(Y)) ≤ 1.
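And a quick check of the ratio in part (d) (a sketch with an arbitrary joint pmf):

import math

def H(pmf):
    return -sum(v * math.log2(v) for v in pmf.values() if v > 0)

pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # arbitrary joint pmf
pX, pY = {}, {}
for (x, y), v in pXY.items():
    pX[x] = pX.get(x, 0.0) + v
    pY[y] = pY.get(y, 0.0) + v

print(H(pXY) / (H(pX) + H(pY)))   # at most 1; equal to 1 iff X and Y are independent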
Ex. 2.43 (Johnson Jonaris GadElkarim)
Text
Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top and bottom sides of
the coin?
(b) A six-sided fair die is rolled. What is the mutual information between the top side and the
front face (the side most facing you)?
Solution
a) The top side T of a fair coin takes the values heads and tails with p(heads) = p(tails) = 0.5, i.e. T ~ Bernoulli(1/2). We can observe here that the bottom side B is a function of the top side T (it is always the opposite face), and B is likewise Bernoulli(1/2).
I(B; T) = H(B) − H(B|T) = H(B) = −2 · 0.5 log2 0.5 = 1 bit
Note that H(B|T) = 0, since once the top side is known the bottom side is completely determined.
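A short numerical check of this value (a sketch; the face labels H and T are just for illustration):

import math

# Joint pmf of (bottom, top) for a fair coin: the bottom is always the opposite face.
p_joint  = {("T", "H"): 0.5, ("H", "T"): 0.5}
p_top    = {"H": 0.5, "T": 0.5}
p_bottom = {"H": 0.5, "T": 0.5}

def H(pmf):
    return -sum(v * math.log2(v) for v in pmf.values() if v > 0)

print(H(p_bottom) + H(p_top) - H(p_joint))   # I(B;T) = 1 + 1 - 1 = 1 bit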
b) The top side T is uniformly distributed on {1, 2, 3, 4, 5, 6}, with p(T = t) = 1/6 for each face.
If we observe the front face F, four equally likely possibilities remain for the top T (the top can be neither F itself nor the face opposite to F).
I(T; F) = H(T) − H(T|F) = log2 6 − log2 4 ≈ 0.58 bits
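A numerical check by enumerating the 24 equally likely (top, front) orientations of the die (a sketch in Python):

import math

# All (top, front) pairs of a fair die: the front face can be any face that is
# neither the top face nor its opposite (opposite faces of a die sum to 7).
pairs = [(t, f) for t in range(1, 7) for f in range(1, 7) if f != t and f != 7 - t]
p = {tf: 1.0 / len(pairs) for tf in pairs}   # 24 equally likely (top, front) pairs

def H(pmf):
    return -sum(v * math.log2(v) for v in pmf.values() if v > 0)

pT, pF = {}, {}
for (t, f), v in p.items():
    pT[t] = pT.get(t, 0.0) + v
    pF[f] = pF.get(f, 0.0) + v

print(H(pT) + H(pF) - H(p))          # I(T;F)
print(math.log2(6) - math.log2(4))   # log2(6) - log2(4) ≈ 0.585, matching the value above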