Chapter 2: Entropy and Mutual Information
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye
Chapter 2 outline
• Definitions
• Entropy
• Joint entropy, conditional entropy
• Relative entropy, mutual information
• Chain rules
• Jensen’s inequality
• Log-sum inequality
• Data processing inequality
• Fano’s inequality
Definitions
A discrete random variable X takes on values x from the discrete alphabet X .
The probability mass function (pmf) is described by
pX (x) = p(x) = Pr{X = x}, for x ∈ X .
Definitions
[From MacKay, Information Theory, Inference, and Learning Algorithms, Chapter 2: Probability, Entropy, and Inference]
This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is ‘true with probability 1’, I will usually simply say that it is ‘true’.
(Figure 2.2: the probability distribution over the 27×27 possible bigrams xy in an English language document, The Frequently Asked Questions Manual for Linux.)
2.1 Probabilities and ensembles
An ensemble X is a triple (x, AX, PX), where the outcome x is the value of a random variable, which takes on one of a set of possible values, AX = {a1, a2, ..., ai, ..., aI}, having probabilities PX = {p1, p2, ..., pI}, with P(x = ai) = pi, pi ≥ 0 and ∑_{ai ∈ AX} P(x = ai) = 1.
The name A is mnemonic for ‘alphabet’. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space character ‘–’.
Abbreviations. Briefer notation will sometimes be used. For example, P(x = ai) may be written as P(ai) or P(x).
Probability of a subset. If T is a subset of AX then
P(T) = P(x ∈ T) = ∑_{ai ∈ T} P(x = ai).    (2.1)
For example, if we define V to be the vowels from figure 2.1, V = {a, e, i, o, u}, then
P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.    (2.2)
A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ AX = {a1, ..., aI} and y ∈ AY = {b1, ..., bJ}. We call P(x, y) the joint probability of x and y. Commas are optional when writing ordered pairs, so xy ⇔ x, y. N.B. In a joint ensemble XY the two variables are not necessarily independent.
Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:
P(x = ai) ≡ ∑_{y ∈ AY} P(x = ai, y).    (2.3)
Similarly, using briefer notation, the marginal probability of y is
P(y) ≡ ∑_{x ∈ AX} P(x, y).    (2.4)
Conditional probability.
P(x = ai | y = bj) ≡ P(x = ai, y = bj) / P(y = bj), if P(y = bj) ≠ 0.    (2.5)
(Figure 2.1. Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux). The picture shows the probabilities by the areas of white squares; the numerical values pi are listed in Table 2.9 below.)
Definitions
The events X = x and Y = y are statistically independent if p(x, y) = p(x)p(y).
The variables X1, X2, ..., XN are called independent if for all (x1, x2, ..., xN) ∈ X1 × X2 × ... × XN we have
p(x1, x2, ..., xN) = ∏_{i=1}^{N} pXi(xi).
They are furthermore called identically distributed if all variables Xi have the same distribution pX(x).
Entropy
• Intuitive notions?
• 2 ways of defining entropy of a random variable:
• axiomatic definition (want a measure with certain properties...)
• just define and then justify definition by showing it arises as answer to a
number of natural questions
Definition: The entropy H(X) of a discrete random variable X with pmf pX(x) is given by
H(X) = − ∑_{x} pX(x) log pX(x) = −E_{pX(x)}[log pX(X)]
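A minimal numerical sketch of this definition (the helper name and example pmfs are illustrative choices, not from the slides):

# Python sketch: entropy of a discrete pmf, in bits
import math

def entropy(pmf):
    # H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([1.0]))                      # 0.0 bits: a deterministic variable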
Order these in terms of entropy
Entropy examples 1
• What's the entropy of a uniform discrete random variable taking on K values?
• What's the entropy of a random variable with X = [♣, ♦, ♥, ♠], pX = [1/2; 1/4; 1/8; 1/8]?
• What's the entropy of a deterministic random variable?
[From MacKay, Section 2.4: Definition of entropy and related functions]
The Shannon information content of an outcome x is defined to be
h(x) = log2 (1/P(x)).    (2.34)
It is measured in bits. [The word ‘bit’ is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]
In the next few chapters, we will establish that the Shannon information content h(ai) is indeed a natural measure of the information content of the event x = ai. At that point, we will shorten the name of this quantity to ‘the information content’.
The fourth column in Table 2.9 (reproduced below) shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.
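A quick check on these three questions, using log base 2: a uniform variable on K values has entropy log2 K, and a deterministic variable has entropy 0. For the four-suit distribution,
H(X) = 1/2·log2 2 + 1/4·log2 4 + 1/8·log2 8 + 1/8·log2 8 = 1/2 + 1/2 + 3/8 + 3/8 = 1.75 bits.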
Entropy: example 2
[From MacKay:] The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:
H(X) ≡ ∑_{x ∈ AX} P(x) log (1/P(x)),    (2.35)
with the convention for P(x) = 0 that 0 × log 1/0 ≡ 0, since limθ→0+ θ log 1/θ = 0.
Like the information content, entropy is measured in bits. When it is convenient, we may also write H(X) as H(p), where p is the vector (p1, p2, ..., pI). Another name for the entropy of X is the uncertainty of X.
Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in Table 2.9. We obtain this number by averaging log2 1/pi (shown in the fourth column) under the probability distribution pi (shown in the third column).
[Also visible on the same page, on inference: all that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses; we need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d1, all inferences and predictions should depend only on the function P(d1 | θ). In spite of the simplicity of this principle, many classical statistical methods violate it.]
Table 2.9. Shannon information contents of the outcomes a–z (and the space character ‘–’), with h(pi) = log2(1/pi):

 i   ai   pi      h(pi)
 1   a    .0575    4.1
 2   b    .0128    6.3
 3   c    .0263    5.2
 4   d    .0285    5.1
 5   e    .0913    3.5
 6   f    .0173    5.9
 7   g    .0133    6.2
 8   h    .0313    5.0
 9   i    .0599    4.1
10   j    .0006   10.7
11   k    .0084    6.9
12   l    .0335    4.9
13   m    .0235    5.4
14   n    .0596    4.1
15   o    .0689    3.9
16   p    .0192    5.7
17   q    .0008   10.3
18   r    .0508    4.3
19   s    .0567    4.1
20   t    .0706    3.8
21   u    .0334    4.9
22   v    .0069    7.2
23   w    .0119    6.4
24   x    .0073    7.1
25   y    .0164    5.9
26   z    .0007   10.4
27   –    .1928    2.4

∑_i pi log2 (1/pi) ≈ 4.1 bits
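One way to check Example 2.12 numerically from the table above (a sketch; the probabilities are those of Table 2.9):

# Python sketch: entropy of a randomly selected English character
import math

p = [0.0575, 0.0128, 0.0263, 0.0285, 0.0913, 0.0173, 0.0133, 0.0313, 0.0599,
     0.0006, 0.0084, 0.0335, 0.0235, 0.0596, 0.0689, 0.0192, 0.0008, 0.0508,
     0.0567, 0.0706, 0.0334, 0.0069, 0.0119, 0.0073, 0.0164, 0.0007, 0.1928]

H = sum(pi * math.log2(1 / pi) for pi in p)  # average Shannon information content
print(H)  # about 4.11 bits, matching Example 2.12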
Entropy: example 3
• Bernoulli random variable takes on heads (0) with probability p and tails with
probability 1-p. Its entropy is defined as
H(p) := −p log2 (p) − (1 − p) log2 (1 − p)
(Figure 2.1 of Cover & Thomas: H(p) vs. p. The binary entropy is 0 at p = 0 and p = 1 and attains its maximum of 1 bit at p = 1/2.)
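A small sketch evaluating this function at a few points (the function name is an illustrative choice):

# Python sketch: binary entropy H(p) = -p log2 p - (1-p) log2 (1-p)
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))  # maximum of 1 bit at p = 0.5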
Entropy
[From Cover & Thomas:] Suppose that we wish to determine the value of X with the minimum number of binary questions. An efficient first question is “Is X = a?” This splits the probability in half. If the answer to the first question is no, the second question can be “Is X = b?” The third question can be “Is X = c?” The resulting expected number of binary questions required is 1.75. This turns out to be the minimum expected number of binary questions required to determine the value of X. In Chapter 5 we show that the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.
The entropy H(X) = − ∑_x p(x) log p(x) has the following properties:
• H(X) ≥ 0, always non-negative. H(X) = 0 iff X is deterministic (0 log(0) = 0).
• H(X) ≤ log(|X|). H(X) = log(|X|) iff X has a uniform distribution over X.
• Since Hb(X) = logb(a) Ha(X), we don't need to specify the base of the logarithm (bits vs. nats).
Moving on to multiple RVs
[From Cover & Thomas, Section 2.2: Joint entropy and conditional entropy] We defined the entropy of a single random variable in Section 2.1. We now extend the definition to a pair of random variables. There is nothing really new in this definition because (X, Y) can be considered to be a single vector-valued random variable.
Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as
H(X, Y) = − ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log p(x, y).    (2.8)
Joint entropy and conditional entropy
Definition: The joint entropy of a pair of discrete random variables X and Y is
H(X, Y) := −E_{p(x,y)}[log p(X, Y)] = − ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log p(x, y)
Note: H(X|Y) ≠ H(Y|X).
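For completeness, the companion definition of conditional entropy (standard, as in Cover & Thomas):
H(Y|X) := −E_{p(x,y)}[log p(Y|X)] = − ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log p(y|x) = ∑_{x ∈ X} p(x) H(Y|X = x).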
Joint entropy and conditional entropy
• Natural definitions, since....
Theorem: (Chain rule) H(X, Y) = H(X) + H(Y|X)
Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
[From MacKay, Chapter 8: Dependent Random Variables]
Figure 8.1. The relationship between joint information, marginal entropy, conditional entropy and mutual entropy: H(X, Y) splits into H(X|Y), I(X; Y) and H(Y|X). Figure 8.2 shows a misleading Venn-diagram representation of the same quantities (contrast with figure 8.1).
Exercise 8.1. Consider three independent random variables u, v, w with entropies Hu, Hv, Hw. Let X ≡ (U, V) and Y ≡ (V, W). What is H(X, Y)? What is H(X | Y)? What is I(X; Y)?
Exercise 8.2. Referring to the definitions of conditional entropy (8.3–8.4), confirm (with an example) that it is possible for H(X | y = bk) to exceed H(X), but that the average, H(X | Y), is less than H(X). So data are helpful – they do not increase uncertainty, on average.
Exercise 8.3. Prove the chain rule for entropy, equation (8.7).
Exercise 8.7. Consider the ensemble XYZ in which AX = AY = AZ = {0, 1}, x and y are independent with PX = {p, 1 − p} and PY = {q, 1 − q}, and z = (x + y) mod 2.    (8.13)
(a) If q = 1/2, what is PZ? What is I(Z; X)?
(b) For general p and q, what is PZ? What is I(Z; X)? Notice that this ensemble is related to the binary symmetric channel, with x = input, y = noise, and z = output.
Exercise 8.8. Many texts draw figure 8.1 in the form of a Venn diagram (figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble XYZ in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = x + y mod 2.
The data-processing theorem
The data processing theorem states that data processing can only destroy information.
Joint/conditional entropy examples
p(x, y)    x = 0    x = 1
y = 0       1/2      0
y = 1       1/4      1/4

H(X,Y) = ?
H(X|Y) = ?
H(Y|X) = ?
H(X) = ?
H(Y) = ?
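A numerical check for this table (a sketch; the dictionary below encodes the joint pmf, and the helper names are illustrative):

# Python sketch: entropies for the joint pmf above
import math

joint = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25}  # keys are (x, y)

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

px = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

H_XY = H(joint)          # 1.5 bits
H_X, H_Y = H(px), H(py)  # about 0.811 bits and 1.0 bit
print(H_XY, H_X, H_Y)
print(H_XY - H_Y)        # H(X|Y) = 0.5 bits
print(H_XY - H_X)        # H(Y|X) is about 0.689 bits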
Entropy is central because...
(A) entropy is the measure of average uncertainty in the random variable
(B) entropy is the average number of bits needed to describe the random
variable
(C) entropy is a lower bound on the average length of the shortest description
of the random variable
(D) entropy is measured in bits?
(E) H(X) = − ∑_x p(x) log2 (p(x))
(F) entropy of a deterministic value is 0
Mutual information
• Entropy H(X) is the uncertainty (``self-information'') of a single random variable
• Conditional entropy H(X|Y) is the entropy of one random variable conditional
upon knowledge of another.
• The average amount of decrease of the randomness of X by observing Y is
the average information that Y gives us about X.
Definition: The mutual information I(X; Y) between the random variables X and Y is given by
I(X; Y) = H(X) − H(X|Y)
        = ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log2 [ p(x, y) / (p(x)p(y)) ]
        = E_{p(x,y)}[ log2 ( p(X, Y) / (p(X)p(Y)) ) ]
At the heart of information theory because...
• Information channel capacity: for a channel p(y|x) with input X and output Y,
C = max_{p(x)} I(X; Y)
• Operational channel capacity: the highest rate (bits/channel use) at which one can communicate reliably.
• The channel coding theorem says: information capacity = operational capacity.
Examples:
C = 1/2 log2 (1 + |h|² P/PN)
C = E_h[ 1/2 log2 (1 + |h|² P/PN) ]
C = max_{Q: Tr(Q) = P} 1/2 log2 | I_{MR} + H Q H† |
Mutual information example
p(x, y)    x = 0    x = 1
y = 0       1/2      0
y = 1       1/4      1/4

X or Y      0        1
p(x)       3/4      1/4
p(y)       1/2      1/2
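One way to evaluate I(X; Y) for this example numerically (a sketch, applying the sum form of the definition):

# Python sketch: mutual information for the joint pmf above
import math

joint = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25}
px = {0: 0.75, 1: 0.25}
py = {0: 0.5, 1: 0.5}

I = sum(p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)  # about 0.311 bits = H(X) - H(X|Y) = 0.811 - 0.5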
Divergence (relative entropy, K-L distance)
Definition: The relative entropy, divergence, or Kullback-Leibler distance between two distributions, P and Q, on the same alphabet, is
D(p ‖ q) := E_p[ log ( p(X)/q(X) ) ] = ∑_{x ∈ X} p(x) log ( p(x)/q(x) )
(Note: we use the conventions 0 log (0/0) = 0, 0 log (0/q) = 0, and p log (p/0) = ∞.)
• D(p ‖ q) is in a sense a measure of the “distance” between the two distributions.
• If P = Q then D(p ‖ q) = 0.
• Note D(p ‖ q) is not a true distance.
(Figure: two pairs of example distributions, with D = 0.2075 and D = 0.1887 respectively.)
K-L divergence example
• X = {1, 2, 3, 4, 5, 6}
• P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
• Q = [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]
• D(p ‖ q) = ? and D(q ‖ p) = ?
(Figure: bar plots of p(x) and q(x) versus x.)
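A quick numerical answer to these two questions (a sketch; the values follow directly from the definition):

# Python sketch: D(p||q) and D(q||p) for the two die distributions above
import math

p = [1/6] * 6
q = [1/10] * 5 + [1/2]

def D(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(D(p, q))  # about 0.35 bits
print(D(q, p))  # about 0.42 bits; note D(p||q) != D(q||p)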
Mutual information as divergence!
Definition: The mutual information I(X; Y) between the random variables X and Y is given by
I(X; Y) = H(X) − H(X|Y)
        = ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log2 [ p(x, y) / (p(x)p(y)) ]
        = E_{p(x,y)}[ log2 ( p(X, Y) / (p(X)p(Y)) ) ]
• Can we express mutual information in terms of the K-L divergence?
I(X; Y) = D( p(x, y) ‖ p(x)p(y) )
Mutual information and entropy
Theorem: Relationship between mutual information and entropy.
I(X; Y ) = H(X) − H(X|Y )
I(X; Y ) = H(Y ) − H(Y |X)
I(X; Y ) = H(X) + H(Y ) − H(X, Y )
I(X; Y ) = I(Y ; X) (symmetry)
I(X; X) = H(X) (“self-information”)
(Figure: Venn-diagram view: H(X) and H(Y) overlap in I(X; Y), and H(X|Y) is the part of H(X) outside H(Y).)
``Two’s company, three’s a crowd’’
Chain rule for entropy
Theorem: (Chain rule for entropy): (X1 , X2 , ..., Xn ) ∼ p(x1 , x2 , ..., xn )
H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi | Xi−1, ..., X1)
(Figure: for three variables, H(X1, X2, X3) = H(X1) + H(X2|X1) + H(X3|X1, X2).)
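A small numerical check of the chain rule on a three-variable example (the joint pmf is an arbitrary illustrative choice):

# Python sketch: verify H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2)
import itertools
import math
import random

random.seed(0)
outcomes = list(itertools.product((0, 1), repeat=3))
w = [random.random() for _ in outcomes]
joint = {o: wi / sum(w) for o, wi in zip(outcomes, w)}  # arbitrary joint pmf on {0,1}^3

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cond_entropy(target, given):
    # H(X_target | X_given); target and given are tuples of coordinate indices
    total = 0.0
    for cond in itertools.product((0, 1), repeat=len(given)):
        sub = {}
        for o, p in joint.items():
            if tuple(o[i] for i in given) == cond:
                key = tuple(o[i] for i in target)
                sub[key] = sub.get(key, 0.0) + p
        p_cond = sum(sub.values())
        if p_cond > 0:
            total += p_cond * H(p / p_cond for p in sub.values())
    return total

lhs = H(joint.values())
H1 = H(sum(p for o, p in joint.items() if o[0] == x) for x in (0, 1))
rhs = H1 + cond_entropy((1,), (0,)) + cond_entropy((2,), (0, 1))
print(abs(lhs - rhs) < 1e-9)  # True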
Conditional mutual information
I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
(Figure: Venn-diagram illustration of the conditional mutual information over H(X), H(Y), H(Z).)
Chain rule for mutual information
Theorem: (Chain rule for mutual information)
I(X1 , X2 , ..., Xn ; Y ) =
I(X1, X2, ..., Xn; Y) = ∑_{i=1}^{n} I(Xi; Y | Xi−1, Xi−2, ..., X1)
(Figure: for two variables, I(X, Y; Z) = I(X; Z) + I(Y; Z|X).)
Chain rule for relative entropy: in book, pg. 24
What is the grey region?
(Figure: Venn diagrams over H(X), H(Y), H(Z) with a shaded region to identify.)
Another disclaimer....
(Figure 8.3. A misleading representation of entropies, continued: a three-variable Venn diagram with regions labelled H(X|Y,Z), H(Y|X,Z), H(Z|X,Y), H(Z|X), H(Z|Y), H(X,Y|Z), I(X;Y), I(X;Y|Z), and a region A.)
A first problem is that one may imagine that the random outcome (x, y) corresponds to a point in the diagram, and thus confuse entropies with probabilities.
Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of
two random variables it is indeed true that H(X | Y ), I(X; Y ) and H(Y | X)
are positive quantities. But as soon as we progress to three-variable ensembles,
we obtain a diagram with positive-looking areas that may actually correspond
to negative quantities. Figure 8.3 correctly shows relationships such as
H(X) + H(Z | X) + H(Y | X, Z) = H(X, Y, Z).
(8.31)
But it gives the misleading impression that the conditional mutual information
I(X; Y | Z) is less than the mutual information I(X; Y ). In fact the area
labelled A can correspond to a negative quantity. Consider the joint ensemble
(X, Y, Z) in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables
and z ∈ {0, 1} is defined to be z = x + y mod 2. Then clearly H(X) =
H(Y ) = 1 bit. Also H(Z) = 1 bit. And H(Y | X) = H(Y ) = 1 since the two
variables are independent. So the mutual information between X and Y is
zero. I(X; Y ) = 0. However, if z is observed, X and Y become dependent —
knowing x, given z, tells you what y is: y = z − x mod 2. So I(X; Y | Z) = 1
bit. Thus the area labelled A must correspond to −1 bits for the figure to give
the correct answers.
The above example is not at all a capricious or exceptional illustration. The
binary symmetric channel with input X, noise Y , and output Z is a situation
in which I(X; Y ) = 0 (input and noise are independent) but I(X; Y | Z) > 0
(once you see the output, the unknown input and the unknown noise are
intimately related!).
The Venn diagram representation is therefore valid only if one is aware
that positive areas may represent negative quantities. With this proviso kept
in mind, the interpretation of entropies in terms of sets can be helpful (Yeung,
1991).
Solution to exercise 8.9 (p.141). For any joint ensemble XY Z, the following
chain rule for mutual information holds.
I(X; Y, Z) = I(X; Y) + I(X; Z | Y).    (8.32)
Now, in the case w → d → r, w and r are independent given d, so I(W; R | D) = 0. Using the chain rule twice, we have
I(W; D, R) = I(W; D).    (8.33)
[MacKay's textbook]
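A numerical version of this XOR example (a sketch; it reproduces I(X; Y) = 0 and I(X; Y | Z) = 1 bit):

# Python sketch: x, y independent fair bits, z = x XOR y
import math

joint = {}
for x in (0, 1):
    for y in (0, 1):
        joint[(x, y, (x + y) % 2)] = 0.25  # p(x, y, z)

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marg(idx):
    out = {}
    for o, p in joint.items():
        key = tuple(o[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

Hx, Hy, Hz = (H(marg([i]).values()) for i in (0, 1, 2))
Hxy, Hxz, Hyz = (H(marg(list(pair)).values()) for pair in ((0, 1), (0, 2), (1, 2)))
Hxyz = H(joint.values())

I_xy = Hx + Hy - Hxy                  # I(X;Y) = 0 bits
I_xy_given_z = Hxz + Hyz - Hz - Hxyz  # I(X;Y|Z) = 1 bit
print(I_xy, I_xy_given_z)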
Convex and concave functions
Jensen’s inequality
Theorem: (Jensen’s inequality) If f is convex, then
E[f (X)] ≥ f (E[X]).
If f is strictly convex, then equality implies X = E[X] with probability 1.
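A tiny numerical illustration with the convex function f(x) = x² and an arbitrary pmf:

# Python sketch: E[f(X)] >= f(E[X]) for convex f(x) = x^2
xs = [1, 2, 6]
ps = [0.5, 0.25, 0.25]

EX = sum(p * x for p, x in zip(ps, xs))      # E[X] = 2.5
EfX = sum(p * x**2 for p, x in zip(ps, xs))  # E[X^2] = 10.5
print(EfX, EX**2, EfX >= EX**2)              # 10.5 6.25 True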
Jensen’s inequality consequences
• Theorem: (Information inequality) D(p ! q) ≥ 0, with equality iff p = q.
• Corollary: (Nonnegativity of mutual information) I(X; Y ) ≥ 0 with equality
iff X and Y are independent.
• Theorem: (Conditioning reduces entropy) H(X|Y ) ≤ H(X) with equality iff
X and Y are independent.
• Theorem: H(X) ≤ log |X | with equality iff X has a uniform distribution over
X.
• Theorem: (Independence bound on entropy) H(X1, X2, ..., Xn) ≤ ∑_{i=1}^{n} H(Xi), with equality iff the Xi are independent.
Log-sum inequality
Theorem: (Log sum inequality) For nonnegative a1 , a2 , ..., an and b1 , b2 , ..., bn ,
∑_{i=1}^{n} ai log (ai/bi) ≥ ( ∑_{i=1}^{n} ai ) log ( ∑_{i=1}^{n} ai / ∑_{i=1}^{n} bi )
with equality iff ai/bi = const.
Convention: 0 log 0 = 0, a log (a/0) = ∞ if a > 0, and 0 log (0/0) = 0.
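A quick numerical check of the inequality on an arbitrary pair of nonnegative sequences:

# Python sketch: check the log-sum inequality for one choice of (a_i), (b_i)
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs, rhs, lhs >= rhs)  # True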
Log-sum inequality consequences
• Theorem: (Convexity of relative entropy) D(p ‖ q) is convex in the pair (p, q), so that for pmf's (p1, q1) and (p2, q2), we have for all 0 ≤ λ ≤ 1:
D(λp1 + (1 − λ)p2 ‖ λq1 + (1 − λ)q2) ≤ λD(p1 ‖ q1) + (1 − λ)D(p2 ‖ q2)
• Theorem: (Concavity of entropy) For X ∼ p(x), we have that H(p) := Hp(X) is a concave function of p(x).
• Theorem: (Concavity of the mutual information in p(x)) Let (X, Y) ∼ p(x, y) = p(x)p(y|x). Then, I(X; Y) is a concave function of p(x) for fixed p(y|x).
• Theorem: (Convexity of the mutual information in p(y|x)) Let (X, Y) ∼ p(x, y) = p(x)p(y|x). Then, I(X; Y) is a convex function of p(y|x) for fixed p(x).
Markov chains
Definition: X, Y, Z form a Markov chain in that order (X → Y → Z) iff
p(x, y, z) = p(x)p(y|x)p(z|y)
Equivalently, p(z|y, x) = p(z|y).
(Figure: X → Y → Z drawn as a cascade of two noisy channels with noise terms N1 and N2.)
• X → Y → Z iff X and Z are conditionally independent given Y
• X → Y → Z ⇒ Z → Y → X. Thus, we can write X ↔ Y ↔ Z.
Data-processing inequality
(Figures: X → Y → Z as a cascade of two noisy channels with noise terms N1 and N2; and X → Y → Z with Z = f(Y) a deterministic function of Y.)
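For reference, the standard statement (e.g. Cover & Thomas): if X → Y → Z, then I(X; Y) ≥ I(X; Z). A minimal numerical sketch with two cascaded binary symmetric channels (the crossover probabilities are illustrative choices):

# Python sketch: X -> Y -> Z as two cascaded binary symmetric channels
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(f):
    # I(input; output) of a BSC with crossover f and a uniform input
    return 1.0 - H2(f)

f1, f2 = 0.1, 0.2
f_total = f1 * (1 - f2) + (1 - f1) * f2  # effective crossover of the cascade
print(bsc_mi(f1), bsc_mi(f_total))       # I(X;Y) > I(X;Z), as the DPI requires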
Markov chain questions
If X → Y → Z, then I(X; Y ) ≥ I(X; Y |Z).
What if X, Y, Z do not form a Markov chain, can I(X; Y |Z) ≥ I(X; Y )?
If X1 → X2 → X3 → X4 → X5 → X6, then mutual information increases as the variables get closer together:
I(X1; X2) ≥ I(X1; X4) ≥ I(X1; X5) ≥ I(X1; X6).
Consequences on sufficient statistics
• Consider a family of probability distributions {fθ (x)} indexed by θ.
If X ∼ f (x | θ) for fixed θ and T (X) is any statistic (i.e., function of the
sample X), then we have
θ → X → T (X).
• The data processing inequality in turn implies
I(θ; X) ≥ I(θ; T (X))
for any distribution on θ.
• Is it possible to choose a statistic that preserves all of the information in X
about θ?
Definition: (Sufficient statistic) A function T(X) is said to be a sufficient statistic relative to the family {fθ(x)} if the conditional distribution of X, given T(X) = t, is independent of θ (for any distribution on θ). By the Fisher-Neyman factorization,
fθ(x) = f(x | t) fθ(t)   ⇒   θ → T(X) → X   ⇒   I(θ; T(X)) ≥ I(θ; X).
Combined with the data-processing inequality, this gives I(θ; X) = I(θ; T(X)): a sufficient statistic preserves all of the information in X about θ.
Example of a sufficient statistic
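One standard example (as in Cover & Thomas): for X1, ..., Xn i.i.d. Bernoulli(θ), the count T(X) = ∑ Xi is sufficient for θ, since given T = k every arrangement of the k ones is equally likely regardless of θ. A small numerical sketch of that claim:

# Python sketch: P(x | T(x) = k) does not depend on theta for i.i.d. Bernoulli samples
import itertools

def conditional_given_T(theta, n, k):
    # conditional probability of each length-n sequence given sum = k
    seqs = [s for s in itertools.product((0, 1), repeat=n) if sum(s) == k]
    probs = [theta**sum(s) * (1 - theta)**(n - sum(s)) for s in seqs]
    Z = sum(probs)
    return [p / Z for p in probs]

print(conditional_given_T(0.3, 4, 2))  # uniform over the C(4,2) = 6 sequences
print(conditional_given_T(0.8, 4, 2))  # same conditional distribution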
Fano’s inequality
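For reference, the standard statement (as in Cover & Thomas): if X is estimated from Y by any X̂ = g(Y), so that X → Y → X̂, and Pe = Pr{X̂ ≠ X}, then
H(Pe) + Pe log(|X| − 1) ≥ H(X|X̂) ≥ H(X|Y).
A weaker but convenient form: Pe ≥ (H(X|Y) − 1) / log |X|.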
Fano’s inequality consequences