SUT Journal of Mathematics
Vol. 32, No. 2 (1996), 149–156
EXPECTED RELATIVE ENTROPY BETWEEN
A FINITE DISTRIBUTION AND ITS
EMPIRICAL DISTRIBUTION
Syuuji Abe
(Received September 13, 1996)
Abstract. The expected relative entropy (or the expected divergence) between a finite probability distribution Q on {1, 2, . . . , ℓ} and its empirical distribution obtained from a sample of size n drawn from Q is computed and is found to be given asymptotically by (ℓ − 1)(log e)/2n, which is independent of Q. A method to compute the entropy of the binomial distribution more accurately than before is also given.
AMS 1991 Mathematics Subject Classification. 62B10, 94A17, 94A15.
Key words and phrases. expected relative entropy, expected divergence, empirical distribution, entropy of the binomial distribution.
§1. Introduction
In information theory, the relative entropy (or divergence) $D[P\|Q] := \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}$ plays an important role as a kind of measure of distance between two probability distributions $P, Q$ on a discrete set $\mathcal{X}$ ($\log$ will always mean $\log_2$). It is known that
$$D[P\|Q] \ge \frac{1}{2\ln 2}\left(\sum_{x\in\mathcal{X}}|P(x)-Q(x)|\right)^2$$
holds (see for example [1]). The relative entropy is closely related to mathematical statistics. For example, the log-likelihood ratio can be written as the difference between two relative entropies, and the so-called Fisher information can be expressed in terms of the relative entropy. In this paper, we compute the expected relative entropy between a finite probability distribution and its empirical one. Let $X^n = (X_1, X_2, \ldots, X_n)$ be the sample of size $n$ drawn from the distribution $Q(x)$ on $\mathcal{X} = \{1, 2, \ldots, \ell\}$ and let $P_{X^n}(x)$ be the empirical (frequency) distribution corresponding to $X^n$. It is known that
$E[D[P_{X^n}\|Q]] \le E[D[P_{X^{n-1}}\|Q]]$ (see [1]). Actually, however, the following estimate will be found in §3 using a lemma in §2:
$$E[D[P_{X^n}\|Q]] = \frac{(\ell-1)\log e}{2n} + \frac{\log e}{12}\left(\sum_{x\in\mathcal{X}}\frac{1}{Q(x)} - 1\right)\frac{1}{n^2} + O\!\left(\frac{1}{n^3}\right).$$
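As a quick numerical illustration (added here; it is not part of the original argument, and the particular Q, sample sizes, and trial count below are arbitrary choices), the following Python sketch estimates $E[D[P_{X^n}\|Q]]$ by Monte Carlo and compares it with the leading term $(\ell-1)(\log e)/2n$:

import numpy as np

def expected_divergence_mc(Q, n, trials=100_000, seed=0):
    """Monte Carlo estimate of E[ D[P_{X^n} || Q] ] in bits."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, Q, size=trials)      # empirical counts, one row per sample
    P = counts / n                                   # empirical distributions
    ratio = np.where(P > 0, P / Q, 1.0)              # avoid log(0); convention 0 log 0 = 0
    divs = np.sum(np.where(P > 0, P * np.log2(ratio), 0.0), axis=1)
    return divs.mean()

Q = np.array([0.5, 0.3, 0.2])                        # an arbitrary distribution on {1, 2, 3}
for n in (50, 100, 200):
    print(n, expected_divergence_mc(Q, n), (len(Q) - 1) * np.log2(np.e) / (2 * n))

The Monte Carlo averages should track the leading term, with the discrepancy shrinking roughly like $1/n^2$, as the estimate above predicts.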
§2. A Lemma
We prove a lemma which is essential for the proof of the theorem given in
§3. The lemma states that for the random variable X obeying B(n, p) (the
binomial distribution with parameters n, p) and for sufficiently large n,
$$E\left[f\!\left(\frac{X}{n}\right)\right] \approx \sum_{i=0}^{2m-2} \frac{f^{(i)}(p)\,E[(X-np)^i]}{i!\,n^i},$$
where $f(x)$ is an arbitrary function such that $\max_{x\in[1/n,\,1]}|f^{(2m)}(x)| \le cn^s$ for some $m \ge 1$, for example $f(x) = -x\ln x$, which appears in the entropy $-\sum_{x\in\mathcal{X}} p(x)\log p(x)$.
Lemma. Let $f(x) \in C^{(2m)}(0,1]$ for some $m \ge 1$ and suppose there exist constants $c$ and $s$ such that $\max_{x\in[1/n,\,1]}|f^{(2m)}(x)| \le cn^s$ for any positive integer $n$. Then for $0 < p < 1$, we have
$$[g_2(p) - g_1(p)]\,n^m \to 0 \quad (n \to \infty)$$
and
$$g_1(p) = \sum_{i=0}^{2m-2} \frac{f^{(i)}(p)\,\mu_i}{i!\,n^i} + O(n^{-m}),$$
where
$$p_k = \binom{n}{k} p^k (1-p)^{n-k} \quad (k = 0, 1, \ldots, n),$$
$$g_1(p) = \sum_{k=1}^{n} p_k f\!\left(\frac{k}{n}\right), \qquad \mu_i = \sum_{k=0}^{n} p_k (k - np)^i, \qquad g_2(p) = \sum_{i=0}^{2m} \frac{f^{(i)}(p)}{i!}\,\frac{\mu_i}{n^i}.$$
Note: From the lemma, we have $g_1(p) \approx g_2(p)$, and it is easy to show
$$E\left[f\!\left(\frac{X}{n}\right)\right] = \sum_{k=0}^{n} p_k f\!\left(\frac{k}{n}\right) \approx \sum_{i=0}^{2m} \frac{f^{(i)}(p)}{i!}\,\frac{\mu_i}{n^i} \approx f(p) + \frac{f''(p)}{2}\,\frac{p(1-p)}{n} + \cdots.$$
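To illustrate the Note numerically (an added Python sketch; the choice $f(x) = \sqrt{x}$ and the values of $n$ and $p$ are arbitrary), one may compare the exact value of $E[f(X/n)]$ with the two-term approximation $f(p) + \frac{f''(p)}{2}\,\frac{p(1-p)}{n}$, where here $f''(x) = -1/(4x^{3/2})$:

from math import comb, sqrt

def exact_mean_f(n, p, f):
    """E[f(X/n)] for X ~ B(n, p), by direct summation over the support."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * f(k / n) for k in range(n + 1))

n, p = 100, 0.3
exact = exact_mean_f(n, p, sqrt)
approx = sqrt(p) - (1 - p) / (8 * sqrt(p) * n)   # f(p) + (f''(p)/2) * p(1-p)/n
print(exact, approx)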
Proof. Since
$$g_1(p) = \sum_{k=1}^{n} p_k f\!\left(\frac{k}{n}\right) = \sum_{k=1}^{n} p_k\left[f(p) + \frac{f'(p)}{1!}\left(\frac{k}{n}-p\right) + \cdots + \frac{f^{(2m-1)}(p)}{(2m-1)!}\left(\frac{k}{n}-p\right)^{2m-1} + \frac{f^{(2m)}(\theta_k)}{(2m)!}\left(\frac{k}{n}-p\right)^{2m}\right]$$
with $\theta_k$ lying between $k/n$ and $p$, we get with some manipulations
$$[g_2(p) - g_1(p)]\,n^m = p_0\left(f(p) + \frac{f'(p)}{1!}(-p) + \cdots + \frac{f^{(2m)}(p)}{(2m)!}(-p)^{2m}\right)n^m + \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{k=1}^{n} p_k\left(f^{(2m)}(p) - f^{(2m)}(\theta_k)\right)(k - np)^{2m}.$$
Since $p_0 n^m = \binom{n}{0} p^0 (1-p)^n n^m = (1-p)^n n^m$, the first part of the right hand side goes to $0$ as $n \to \infty$.
The continuity of $f^{(2m)}(x)$ implies
$$\forall \varepsilon > 0,\ \exists \delta > 0;\quad |p - p'| < \delta \Rightarrow |f^{(2m)}(p) - f^{(2m)}(p')| < \varepsilon.$$
Hence, in the second part,
$$\frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{k=1}^{n} p_k\left[f^{(2m)}(p) - f^{(2m)}(\theta_k)\right](k - np)^{2m} = \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{|k/n - p| < \delta}[\ \cdot\ ] + \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{|k/n - p| \ge \delta}[\ \cdot\ ] = A + B,$$
we first have
$$|A| \le \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{|k/n - p| < \delta} p_k\left|f^{(2m)}(p) - f^{(2m)}(\theta_k)\right|(k - np)^{2m} < \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{|k/n - p| < \delta} p_k\,\varepsilon\,(k - np)^{2m} \le \frac{\varepsilon}{(2m)!}\,\frac{1}{n^m}\sum_{k=0}^{n} p_k (k - np)^{2m} = \frac{\varepsilon}{(2m)!}\,\frac{\mu_{2m}}{n^m}.$$
We know from Riordan [4] that
$$\mu_{2m} = (2m-1)(2m-3)\cdots 3\cdot 1\,\bigl(p(1-p)n\bigr)^m + O(n^{m-1}),$$
and so we obtain $|A| < \varepsilon + O(1/n)$. Thus $A \to 0$ as $n \to \infty$.
To estimate $|B|$, we note that, in the case $|k/n - p| \ge \delta$, we have
$$D_k := D\left[\left(\frac{k}{n},\, 1 - \frac{k}{n}\right)\,\middle\|\,(p,\, 1-p)\right] \ge \frac{\log e}{2}\left(2\left|\frac{k}{n} - p\right|\right)^2 \quad (\text{see §1}),$$
hence $\sqrt{D_k/(2\log e)} \ge |k/n - p| \ge \delta$. Now for large $n$,
$$|B| \le \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{k:\,D_k \ge 2\delta^2\log e} p_k\left|f^{(2m)}(p) - f^{(2m)}(\theta_k)\right|(k - np)^{2m} \le \frac{1}{(2m)!}\,\frac{1}{n^m}\sum_{k:\,D_k \ge 2\delta^2\log e} p_k\left(|f^{(2m)}(p)| + |f^{(2m)}(\theta_k)|\right)(k - np)^{2m} \le \frac{2cn^s}{(2m)!\,n^m}\sum_{k:\,D_k \ge 2\delta^2\log e} p_k\, n^{2m} \le \frac{2c}{(2m)!}\,n^{s+m}\,(n+1)^2\, 2^{-2n\delta^2\log e}.$$
Here in the last inequality we used
$$\sum_{k:\,D_k \ge a} p_k \le (n+1)^2\, 2^{-an}$$
(see Theorem 12.2.1 in [1]). Thus $B \to 0$ as $n \to \infty$. And $[g_2(p) - g_1(p)]\,n^m \to 0$ $(n \to \infty)$, hence $g_1(p) = g_2(p) + o(n^{-m})$. Recalling $\mu_j = O(n^{\lfloor j/2\rfloor})$ ([4]), we can write $g_1(p) = \sum_{i=0}^{2m-2} \frac{f^{(i)}(p)\,\mu_i}{i!\,n^i} + O(n^{-m})$, completing the proof.
Example 1. Let $f(x) = x\ln x$ and $m = 3$. We can use the lemma since $\max_{x\in[1/n,\,1]}|f^{(6)}(x)| = 4!\,n^5$. Thus
$$\begin{aligned}
g_1(p) &= f(p) + \frac{f''(p)}{2!}\,\frac{p(1-p)}{n} + \frac{f^{(3)}(p)}{3!}\,\frac{\mu_3}{n^3} + \frac{f^{(4)}(p)}{4!}\,\frac{\mu_4}{n^4} + O(n^{-3})\\
&= p\ln p + \frac{1}{2p}\,\frac{p(1-p)}{n} + \frac{-1}{6p^2}\,\frac{p(1-p)(1-2p)}{n^2} + \frac{2}{24p^3}\,\frac{3p^2(1-p)^2}{n^2} + O(n^{-3})\\
&= p\ln p + \frac{1-p}{2n} + \frac{(1-p)(1+p)}{12pn^2} + O(n^{-3}).
\end{aligned}$$
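As a sanity check of Example 1 (an added Python sketch; the values of $n$ and $p$ below are arbitrary), the exact binomial expectation $g_1(p) = E[(X/n)\ln(X/n)]$ can be compared with the expansion above:

from math import comb, log

def exact_g1(n, p):
    """g1(p) = sum_{k=1}^{n} P(X = k) * (k/n) * ln(k/n) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * (k / n) * log(k / n)
               for k in range(1, n + 1))

n, p = 200, 0.3
expansion = p * log(p) + (1 - p) / (2 * n) + (1 - p) * (1 + p) / (12 * p * n**2)
print(exact_g1(n, p), expansion)   # the difference should be O(n**-3)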
Example 2 [entropy of the binomial distribution]. Frank and Öhrvik [3] computed the entropy of the binomial distribution. Here we examine it in more detail using the lemma.
$$\begin{aligned}
H(X) &= -\sum_{k=0}^{n} p_k \log p_k\\
&= -\sum_{k=0}^{n} p_k\left(\log\binom{n}{k} + k\log p + (n-k)\log(1-p)\right)\\
&= -\sum_{k=0}^{n} p_k\left(\log n! - \log k! - \log(n-k)! + k\log p + (n-k)\log(1-p)\right)\\
&= -\log n! - np\log p - n(1-p)\log(1-p) + \sum_{k=0}^{n} p_k\left(\log k! + \log(n-k)!\right)\\
&= -\log n! - np\log p - n(1-p)\log(1-p) + \sum_{k=1}^{n} p_k\log k! + \sum_{k=0}^{n-1} p_k\log(n-k)!.
\end{aligned}$$
In a similar way as in Feller [2, II.9], we may show that there exists $b_k$ with $0 \le b_k \le 1$ such that
$$\ln k! = \frac{1}{2}\ln 2\pi + \left(k + \frac{1}{2}\right)\ln k - k + \frac{1}{12k} - \frac{1 - b_k}{360k^3} \quad (k \ge 1).$$
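For a quick numerical check of this expansion (an added Python sketch; the values of $k$ are arbitrary), one can compare it with $\ln k!$ computed via the log-gamma function:

from math import lgamma, log, pi

def stirling(k):
    """Truncated expansion of ln k!, dropping the (1 - b_k)/(360 k^3) term."""
    return 0.5 * log(2 * pi) + (k + 0.5) * log(k) - k + 1 / (12 * k)

for k in (5, 20, 100):
    exact = lgamma(k + 1)                       # ln k!
    print(k, exact, stirling(k), exact - stirling(k))

The printed difference should be small, negative, and of order $1/(360k^3)$, consistent with $0 \le b_k \le 1$.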
Then letting $f(x) = \ln x,\ 1/x,\ 1/x^3$ in the lemma and using Example 1, we find with some computations that
$$H(X) = \frac{1}{2}\log\left[2\pi e\, np(1-p)\right] - (\log e)\left(\frac{(1-2p)^2}{12np(1-p)} + \frac{p^4 + (1-p)^4}{24n^2p^2(1-p)^2}\right) + O\!\left(\frac{1}{n^3}\right).$$
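The asymptotic expression above can be compared with the entropy computed directly from the binomial probabilities; the following Python sketch (added for illustration, with arbitrary $n$ and $p$) does this in bits:

from math import comb, log2, pi, e

def binomial_entropy(n, p):
    """H(X) in bits for X ~ B(n, p), by direct summation."""
    H = 0.0
    for k in range(n + 1):
        pk = comb(n, k) * p**k * (1 - p)**(n - k)
        if pk > 0:
            H -= pk * log2(pk)
    return H

n, p = 100, 0.3
approx = (0.5 * log2(2 * pi * e * n * p * (1 - p))
          - log2(e) * ((1 - 2 * p)**2 / (12 * n * p * (1 - p))
                       + (p**4 + (1 - p)**4) / (24 * n**2 * p**2 * (1 - p)**2)))
print(binomial_entropy(n, p), approx)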
§3. Expected Relative Entropy
We prove our main theorem below, using Example 1 (hence our lemma). This theorem states that, for large $n$, $E[D[P_{X^n}\|Q]]$ is essentially $\frac{(\ell-1)\log e}{2n}$, in inverse proportion to the sample size $n$ and not dependent on the true distribution.
Theorem. Let $X^n = (X_1, X_2, \ldots, X_n)$ be the sample of size $n$ drawn from the distribution $Q(x)$ on $\mathcal{X} = \{1, 2, \ldots, \ell\}$ and let $P_{X^n}(x)$ be the empirical (frequency) distribution corresponding to $X^n$. Then
$$E[D[P_{X^n}\|Q]] = \frac{(\ell-1)\log e}{2n} + \frac{\log e}{12}\left(\sum_{x\in\mathcal{X}}\frac{1}{Q(x)} - 1\right)\frac{1}{n^2} + O\!\left(\frac{1}{n^3}\right).$$

Proof.
The expectation to be computed is given by
$$E[D[P_{X^n}\|Q]] = \sum_{(x_1, x_2, \ldots, x_n)\in\mathcal{X}^n} Q^n(x_1, x_2, \ldots, x_n)\, D[P_{x^n}\|Q] = \sum_{P\in\mathcal{P}_n} Q^n(T(P))\, D[P\|Q],$$
where $Q^n(x_1, x_2, \ldots, x_n) = \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$, $\mathcal{P}_n$ is the set of all possible empirical distributions, and $Q^n(T(P))$ denotes the probability that the empirical distribution becomes exactly $P$. Since the empirical distribution $P$ is written as $\left(\frac{k_1}{n}, \frac{k_2}{n}, \ldots, \frac{k_\ell}{n}\right)$ and $Q^n(T(P)) = \binom{n}{k_1, k_2, \ldots, k_\ell} Q(1)^{k_1} Q(2)^{k_2}\cdots Q(\ell)^{k_\ell}$, we have
$$\begin{aligned}
E[D[P_{X^n}\|Q]] &= \sum_{P\in\mathcal{P}_n} Q^n(T(P))\left(\sum_{i\in\mathcal{X}} P(i)\log P(i) - \sum_{i\in\mathcal{X}} P(i)\log Q(i)\right)\\
&= -E\left[H\!\left(\frac{K_1}{n}, \frac{K_2}{n}, \ldots, \frac{K_\ell}{n}\right)\right] - \sum_{i\in\mathcal{X}}\sum_{P\in\mathcal{P}_n} Q^n(T(P))\, P(i)\log Q(i)\\
&= -E\left[H\!\left(\frac{K_1}{n}, \frac{K_2}{n}, \ldots, \frac{K_\ell}{n}\right)\right] - \sum_{i\in\mathcal{X}}\ \sum_{\substack{k_1, k_2, \ldots, k_\ell:\\ k_1 + k_2 + \cdots + k_\ell = n}} \binom{n}{k_1, k_2, \ldots, k_\ell} Q(1)^{k_1} Q(2)^{k_2}\cdots Q(\ell)^{k_\ell}\,\frac{k_i}{n}\log Q(i)\\
&= -E\left[H\!\left(\frac{K_1}{n}, \frac{K_2}{n}, \ldots, \frac{K_\ell}{n}\right)\right] + H(Q).
\end{aligned}$$
Note that $P(i) = K_i/n$, $i = 1, \ldots, \ell$, are random variables and $H(\Pi)$ denotes the entropy of the distribution $\Pi$. Since $K_i \sim B(n, Q(i))$, we see using Example 1 that
$$E\left[\frac{K_i}{n}\log\frac{K_i}{n}\right] = \sum_{k=0}^{n} p_k\,\frac{k}{n}\log\frac{k}{n} = Q(i)\log Q(i) + \frac{1 - Q(i)}{2n}\log e + \frac{1}{12n^2}\left(\frac{1}{Q(i)} - Q(i)\right)\log e + O\!\left(\frac{1}{n^3}\right).$$
Thus
$$\begin{aligned}
-E\left[H\!\left(\frac{K_1}{n}, \frac{K_2}{n}, \ldots, \frac{K_\ell}{n}\right)\right] &= E\left[\sum_{i=1}^{\ell} \frac{K_i}{n}\log\frac{K_i}{n}\right] = \sum_{i=1}^{\ell} E\left[\frac{K_i}{n}\log\frac{K_i}{n}\right]\\
&= \sum_{i=1}^{\ell}\left(Q(i)\log Q(i) + \frac{1 - Q(i)}{2n}\log e + \frac{1}{12n^2}\left(\frac{1}{Q(i)} - Q(i)\right)\log e\right) + O\!\left(\frac{1}{n^3}\right).
\end{aligned}$$
Therefore,
$$E[D[P_{X^n}\|Q]] = \frac{(\ell-1)\log e}{2n} + \frac{\log e}{12}\left(\sum_{x\in\mathcal{X}}\frac{1}{Q(x)} - 1\right)\frac{1}{n^2} + O\!\left(\frac{1}{n^3}\right),$$
finishing the proof.
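As an added numerical check of the theorem (not part of the original paper; $Q$ and the sample sizes below are arbitrary, and the enumeration is only practical for small $n$ and $\ell$), $E[D[P_{X^n}\|Q]]$ can be computed exactly by summing over all type classes and compared with the first- and second-order approximations:

from math import factorial, log2, e, prod
from itertools import product

def multinomial_prob(ks, Q):
    """Probability that the count vector of an i.i.d. sample from Q equals ks."""
    coef = factorial(sum(ks))
    for k in ks:
        coef //= factorial(k)
    return coef * prod(q**k for q, k in zip(Q, ks))

def divergence(P, Q):
    """D[P || Q] in bits, with the convention 0 log 0 = 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

def exact_expected_divergence(Q, n):
    total = 0.0
    for ks in product(range(n + 1), repeat=len(Q)):   # all count vectors summing to n
        if sum(ks) != n:
            continue
        total += multinomial_prob(ks, Q) * divergence([k / n for k in ks], Q)
    return total

Q = [0.5, 0.3, 0.2]
for n in (20, 40, 80):
    first = (len(Q) - 1) * log2(e) / (2 * n)
    second = first + log2(e) / 12 * (sum(1 / q for q in Q) - 1) / n**2
    print(n, exact_expected_divergence(Q, n), first, second)

The exact values should approach the second-order approximation noticeably faster than the leading term alone, reflecting the $O(1/n^3)$ error in the theorem.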
Acknowledgment
The author is grateful to Professor Yasuichi Horibe for his valuable remarks and helpful advice.
References
[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York: John Wiley & Sons, Inc., 1991.
[2] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. I, 2nd Ed., New York: John Wiley & Sons, Inc., 1968.
[3] O. Frank and J. Öhrvik, Entropy of sums of random digits, Computational Statistics & Data Analysis 17 (1994) 177-184.
[4] J. Riordan, Moment recurrence relations for binomial, Poisson and hypergeometric
frequency distributions, Ann. Math. Statist. 8 (1937) 103-111.
Syuuji Abe
Department of Applied Mathematics,
Science University of Tokyo
1-3 Kagurazaka, Shinjuku-ku, Tokyo 162, Japan