Harvard SEAS
ES250 – Information Theory
Entropy, relative entropy, and mutual information
(Based on Cover & Thomas, Chapter 2)

1 Entropy
1.1 Entropy of a random variable
Definition The entropy of a discrete random variable X with pmf p_X(x) is
H(X) = -\sum_{x} p(x) \log p(x)
The entropy measures the expected uncertainty in X. It has the following properties:
• H(X) ≥ 0, entropy is always non-negative. H(X) = 0 iff X is deterministic.
• Since H_b(X) = log_b(a) H_a(X), we don’t need to specify the base of the logarithm.
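As a quick numerical illustration of the definition and of the base-change identity, here is a minimal Python sketch; the example pmfs are arbitrary choices, not values from the notes.

import math

def entropy(pmf, base=2):
    # H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))               # 1 bit for a fair coin
print(entropy([1.0, 0.0]))               # 0 for a deterministic X
print(entropy([0.25] * 4))               # 2 bits for a uniform pmf on 4 outcomes
print(entropy([0.5, 0.5], base=math.e))  # log_e(2) nats = 1 bit, i.e. H_b(X) = log_b(a) H_a(X)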
1.2 Joint entropy and conditional entropy
Definition The joint entropy of two random variables X and Y is
H(X, Y) \triangleq -E_{p(x,y)}[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)
Definition Given a random variable X, the conditional entropy of Y (averaged over X) is
H(Y|X) \triangleq E_{p(x)}[H(Y|X = x)]
= \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)
= -E_{p(x)} E_{p(y|x)}[\log p(Y|X)]
= -E_{p(x,y)}[\log p(Y|X)]
Note: H(X|Y) ≠ H(Y|X).
1.3 Chain rule
Joint and conditional entropy provide a natural calculus:
Theorem (Chain rule)
H(X, Y ) = H(X) + H(Y |X)
Corollary
H(X, Y |Z) = H(X|Z) + H(Y |X, Z)
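As a quick numerical check of the chain rule, here is a minimal Python sketch; the joint pmf p_xy below is an arbitrary example, not taken from the notes.

import math

def H(probs):
    # entropy in bits of a collection of probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

# example joint pmf p(x, y) on {0,1} x {0,1} (values chosen arbitrarily)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: p_xy[(0, 0)] + p_xy[(0, 1)], 1: p_xy[(1, 0)] + p_xy[(1, 1)]}

H_XY = H(p_xy.values())
H_X = H(p_x.values())
# H(Y|X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))

print(H_XY, H_X + H_Y_given_X)   # the two numbers agree: H(X,Y) = H(X) + H(Y|X)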
2 Relative Entropy and Mutual Information
2.1 Entropy and Mutual Information
• Entropy H(X) is the uncertainty (“self-information”) of a single random variable
• Conditional entropy H(X|Y ) is the entropy of one random variable conditional upon knowledge of another.
• We call the reduction in uncertainty mutual information:
I(X; Y ) = H(X) − H(X|Y )
• Eventually we will show that the maximum rate of transmission over a given channel p(Y|X), such that the error probability goes to zero, is given by the channel capacity:
C = \max_{p(X)} I(X; Y)
Theorem (Relationship between mutual information and entropy)
I(X; Y) = H(X) − H(X|Y)
I(X; Y) = H(Y) − H(Y|X)
I(X; Y) = H(X) + H(Y) − H(X, Y)
I(X; Y) = I(Y; X)   (symmetry)
I(X; X) = H(X)   (“self-information”)
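These identities are easy to verify numerically. A minimal Python sketch, using an arbitrary example joint pmf (not from the notes) whose marginals are written out explicitly:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # example joint pmf
p_x = {0: 0.5, 1: 0.5}                                        # its marginals
p_y = {0: 0.6, 1: 0.4}

H_X, H_Y, H_XY = H(p_x.values()), H(p_y.values()), H(p_xy.values())
H_X_given_Y = H_XY - H_Y        # chain rule
H_Y_given_X = H_XY - H_X

print(H_X - H_X_given_Y)        # I(X;Y) = H(X) - H(X|Y)
print(H_Y - H_Y_given_X)        # I(X;Y) = H(Y) - H(Y|X)
print(H_X + H_Y - H_XY)         # I(X;Y) = H(X) + H(Y) - H(X,Y); all three agree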
2.2 Relative Entropy and Mutual Information
Definition Relative entropy (information divergence, or Kullback-Leibler divergence)
D(p \| q) \triangleq E_p\left[ \log \frac{p(X)}{q(X)} \right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
Definition Mutual information (in terms of divergence)
I(X; Y) \triangleq D(p(x, y) \| p(x)p(y)) = E_{p(x,y)}\left[ \log \frac{p(X, Y)}{p(X)p(Y)} \right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}
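The two characterizations agree: the divergence between the joint pmf and the product of its marginals equals the entropy-difference form of I(X; Y). A small Python sketch, where the joint pmf and its marginals are arbitrary illustrative values:

import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # example joint pmf
p_x = {0: 0.5, 1: 0.5}                                        # marginals of p_xy
p_y = {0: 0.6, 1: 0.4}

def D(p, q):
    # relative entropy D(p || q) in bits; assumes q(k) > 0 wherever p(k) > 0
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

# product of the marginals, q(x, y) = p(x) p(y)
q_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

print(D(p_xy, q_xy))   # I(X;Y) = D(p(x,y) || p(x)p(y)); matches the entropy-based value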
3 Chain Rules

3.1 Chain Rule for Entropy
The entropy of a collection of random variables is the sum of the conditional entropies:
Theorem (Chain rule for entropy) Let (X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n). Then
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)
3.2 Chain Rule for Mutual Information
Definition Conditional mutual information
I(X; Y|Z) \triangleq H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)}\left[ \log \frac{p(X, Y|Z)}{p(X|Z)\, p(Y|Z)} \right]
Theorem (Chain rule for mutual information)
I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1)
3.3 Chain Rule for Relative Entropy
Definition Conditional relative entropy
D(p(y|x) \| q(y|x)) \triangleq E_{p(x,y)}\left[ \log \frac{p(Y|X)}{q(Y|X)} \right] = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}
Theorem (Chain rule for relative entropy)
D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))
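A minimal Python check of this chain rule; the two joint pmfs p and q below are arbitrary illustrative choices, not values from the notes.

import math

# two example joint pmfs on {0,1} x {0,1}
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def x_marginal(joint):
    m = {}
    for (x, y), pr in joint.items():
        m[x] = m.get(x, 0.0) + pr
    return m

p_x, q_x = x_marginal(p), x_marginal(q)

D_joint = sum(p[k] * math.log2(p[k] / q[k]) for k in p)
D_x = sum(p_x[x] * math.log2(p_x[x] / q_x[x]) for x in p_x)
# conditional relative entropy D(p(y|x) || q(y|x)), averaged over p(x)
D_cond = sum(p[(x, y)] * math.log2((p[(x, y)] / p_x[x]) / (q[(x, y)] / q_x[x]))
             for (x, y) in p)

print(D_joint, D_x + D_cond)   # equal: D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))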
4 Jensen’s Inequality
• Recall that a convex function on an interval is one for which every chord lies (on or) above the function on
that interval.
• A function f is concave if −f is convex.
Theorem (Jensen’s inequality) If f is convex, then
E[f (X)] ≥ f (E[X]).
If f is strictly convex, the equality implies X = E[X] with probability 1.
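A tiny numerical sanity check of Jensen’s inequality for the convex function f(t) = t²; the pmf and support values are arbitrary examples.

p = [0.2, 0.5, 0.3]                       # example pmf
x = [1.0, 2.0, 4.0]                       # example support
f = lambda t: t * t                       # a convex function
E_fX = sum(pi * f(xi) for pi, xi in zip(p, x))
f_EX = f(sum(pi * xi for pi, xi in zip(p, x)))
print(E_fX, f_EX)                         # E[f(X)] = 7.0 >= f(E[X]) = 5.76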
4.1 Consequences
Theorem (Information inequality)
D(p \| q) ≥ 0
with equality iff p = q.
Corollary (Nonnegativity of mutual information)
I(X; Y ) ≥ 0
with equality iff X and Y are independent.
Corollary (Information inequality)
D(p(y|x) \| q(y|x)) ≥ 0
with equality iff p(y|x) = q(y|x) for all x, y s.t. p(x) > 0.
Corollary (Nonnegativity of mutual information)
I(X; Y |Z) ≥ 0
with equality iff X and Y are conditionally independent given Z.
4.2 Some Inequalities
Theorem
H(X) ≤ log |X |
with equality iff X has a uniform distribution over X .
Theorem (Conditioning reduces entropy)
H(X|Y ) ≤ H(X)
with equality iff X and Y are independent.
Theorem (Independence bound on entropy)
H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i)
with equality iff the X_i are independent.
5 Log Sum Inequality and its Application
Theorem (Log sum inequality) For nonnegative numbers a_1, a_2, \ldots, a_n and b_1, b_2, \ldots, b_n,
\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}
with equality iff a_i / b_i is constant.
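A quick numerical check of the log sum inequality; the values of a_i and b_i below are arbitrary nonnegative examples.

import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs, rhs)          # lhs >= rhs; equality would require a_i / b_i constant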
Theorem (Convexity of relative entropy) D(p \| q) is convex in the pair (p, q); that is, for pmf pairs (p_1, q_1) and (p_2, q_2) and all 0 ≤ λ ≤ 1,
D(λp_1 + (1 − λ)p_2 \| λq_1 + (1 − λ)q_2) ≤ λD(p_1 \| q_1) + (1 − λ)D(p_2 \| q_2)
Theorem (Concavity of entropy) For X ∼ p(x), H(p) := H_p(X) is a concave function of p(x).
Theorem Let (X, Y ) ∼ p(x, y) = p(x)p(y|x).
Then, I(X; Y ) is a concave function of p(x) for fixed p(y|x), and a convex function of p(y|x) for fixed p(x).
6 Data-Processing Inequality

6.1 Markov Chain
Definition X, Y, Z form a Markov chain in that order (X → Y → Z) iff
p(x, y, z) = p(x)p(y|x)p(z|y).
Some consequences:
• X → Y → Z iff X and Z are conditionally independent given Y .
• X → Y → Z ⇒ Z → Y → X. Thus, we can write X ↔ Y ↔ Z.
• If Z = f (Y ), then X → Y → Z.
6.2 Data-Processing Inequality
Theorem (Data-processing inequality) If X → Y → Z, then I(X; Y) ≥ I(X; Z).
Corollary If Z = g(Y ), then I(X; Y ) ≥ I(X; g(Y )).
Corollary If X → Y → Z, then I(X; Y ) ≥ I(X; Y |Z).
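The data-processing inequality can be checked numerically on a small Markov chain X → Y → Z built from two binary channels; the channel parameters below are illustrative assumptions, not values from the notes.

import itertools, math

def mutual_info(joint):
    # I between the two coordinates of a joint pmf given as {(a, b): prob}
    pa, pb = {}, {}
    for (a, b), pr in joint.items():
        pa[a] = pa.get(a, 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return sum(pr * math.log2(pr / (pa[a] * pb[b]))
               for (a, b), pr in joint.items() if pr > 0)

p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

p_xy, p_xz = {}, {}
for x, y, z in itertools.product((0, 1), repeat=3):
    pr = p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]   # Markov factorization
    p_xy[(x, y)] = p_xy.get((x, y), 0.0) + pr
    p_xz[(x, z)] = p_xz.get((x, z), 0.0) + pr

print(mutual_info(p_xy), mutual_info(p_xz))   # I(X;Y) >= I(X;Z)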
7 Sufficient Statistics

7.1 Statistics and Mutual Information
• Consider a family of probability distributions {f_θ(x)} indexed by θ.
• If X ∼ f(x|θ) for fixed θ and T(X) is any statistic (i.e., a function of the sample X), then we have θ → X → T(X).
• The data-processing inequality in turn implies I(θ; X) ≥ I(θ; T(X)) for any distribution on θ.
• Is it possible to choose a statistic that preserves all of the information in X about θ?
7.2 Sufficient Statistics and Compression
Definition (Sufficient Statistic) A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if the conditional distribution of X, given T(X) = t, is independent of θ for any distribution on θ (Fisher-Neyman):
f_θ(x) = f(x|t) f_θ(t)  ⇒  θ → T(X) → X  ⇒  I(θ; T(X)) ≥ I(θ; X)
Hence, I(θ; X) = I(θ; T(X)) for a sufficient statistic.
Definition (Minimal Sufficient Statistic) A function T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U, in which case
θ → T(X) → U(X) → X
and information about θ in the sample is maximally compressed.
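As an illustration that a sufficient statistic loses no information about θ, here is a hedged Python sketch using a two-point prior on θ, two Bernoulli(θ) observations, and T(X) = X1 + X2 (sufficient for the Bernoulli family); all of these modeling choices are illustrative assumptions, not part of the notes.

import itertools, math

def mutual_info(joint):
    # I between the two coordinates of a joint pmf given as {(a, b): prob}
    pa, pb = {}, {}
    for (a, b), pr in joint.items():
        pa[a] = pa.get(a, 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return sum(pr * math.log2(pr / (pa[a] * pb[b]))
               for (a, b), pr in joint.items() if pr > 0)

prior = {0.3: 0.5, 0.7: 0.5}                         # discrete prior on theta
samples = list(itertools.product((0, 1), repeat=2))  # X = (X1, X2), each Bernoulli(theta)

joint_theta_x, joint_theta_t = {}, {}
for theta, w in prior.items():
    for x in samples:
        t = sum(x)                                   # T(X) = X1 + X2
        pr = w * theta**t * (1 - theta)**(2 - t)
        joint_theta_x[(theta, x)] = joint_theta_x.get((theta, x), 0.0) + pr
        joint_theta_t[(theta, t)] = joint_theta_t.get((theta, t), 0.0) + pr

print(mutual_info(joint_theta_x), mutual_info(joint_theta_t))   # equal: no information lost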
8 Fano’s Inequality

8.1 Fano’s Inequality and Estimation Error
Fano’s inequality relates the probability of estimation error to conditional entropy:
Theorem (Fano’s inequality) For any estimator X̂ : X → Y → X̂, with Pe = Pr{X ≠ X̂}, we have
H(Pe ) + Pe log |X | ≥ H(X|X̂) ≥ H(X|Y ).
This implies
1 + Pe log |X| ≥ H(X|Y)
or
Pe ≥ (H(X|Y) − 1) / log |X|.
8.2 Implications of Fano’s Inequality
Some corollaries follow:
Corollary Let p = Pr{X ≠ Y}. Then,
H(p) + p log |X | ≥ H(X|Y ).
Corollary Let Pe = Pr{X ≠ X̂}, and constrain X̂ : Y → X; then
H(Pe ) + Pe log(|X | − 1) ≥ H(X|Y ).
8.3 Sharpness of Fano’s inequality
Suppose there is no observation Y, so that X must simply be guessed, and order X ∈ {1, 2, . . . , m} such that p_1 ≥ p_2 ≥ · · · ≥ p_m. Then X̂ = 1 is the optimal estimate of X, with Pe = 1 − p_1, and Fano’s inequality becomes
H(Pe) + Pe log(m − 1) ≥ H(X).
The pmf
(p_1, p_2, . . . , p_m) = (1 − Pe, Pe/(m − 1), . . . , Pe/(m − 1))
achieves this bound with equality.
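A quick numerical confirmation that this pmf meets Fano’s bound with equality; m and Pe below are arbitrary example values.

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

m, Pe = 5, 0.4
pmf = [1 - Pe] + [Pe / (m - 1)] * (m - 1)         # the extremal pmf from the notes
bound = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)   # H(Pe) + Pe log(m - 1)
print(bound, H(pmf))                              # equal: the bound is achieved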
8.4 Applications of Fano’s inequality
Lemma If X and X′ are i.i.d. with entropy H(X), then
Pr{X = X′} ≥ 2^{-H(X)},
with equality iff X has a uniform distribution.
Corollary Let X and X′ be independent with X ∼ p(x), X′ ∼ r(x), and x, x′ ∈ X. Then
Pr{X = X′} ≥ 2^{-H(p)-D(p\|r)}
and Pr{X = X′} ≥ 2^{-H(r)-D(r\|p)}.
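A short numerical check of the lemma for an arbitrary example pmf: the collision probability of two i.i.d. copies is at least 2 raised to the negative entropy.

import math

p = [0.5, 0.25, 0.125, 0.125]                 # example pmf for X
collision = sum(px**2 for px in p)            # Pr{X = X'} for i.i.d. copies
H = -sum(px * math.log2(px) for px in p)
print(collision, 2 ** (-H))                   # 0.34375 >= 2^(-1.75) ≈ 0.297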