Harvard SEAS
ES250 – Information Theory
Asymptotic Equipartition Property (AEP) and Entropy Rates∗
1 Asymptotic Equipartition Property

1.1 Preliminaries
Definition (Convergence of random variables) We say that a sequence of random variables X1, X2, . . . converges to a random variable X:

1. In probability if for every ε > 0, Pr{|Xn − X| > ε} → 0;
2. In mean square if E[(X − Xn)^2] → 0;
3. With probability 1 (also called almost surely) if Pr{lim_{n→∞} Xn = X} = 1.
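For a concrete look at convergence in probability (an illustration, not part of the original notes), the Python sketch below takes Xn to be the sample mean of n i.i.d. Uniform(0,1) draws, so Xn converges in probability to X = 1/2 by the weak law of large numbers; the tolerance ε = 0.02 and the sample counts are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    eps = 0.02                                    # tolerance in the definition of convergence in probability
    for n in (10, 100, 1000, 5000):
        z = rng.random((2000, n))                 # 2000 independent realizations of Z1, ..., Zn ~ Uniform(0, 1)
        xn = z.mean(axis=1)                       # Xn = sample mean of the n draws, for each realization
        prob = np.mean(np.abs(xn - 0.5) > eps)    # Monte Carlo estimate of Pr{|Xn - 1/2| > eps}
        print(f"n = {n:5d}   Pr{{|Xn - 1/2| > {eps}}} ≈ {prob:.3f}")
    # The estimated probability tends to 0 as n grows, i.e., Xn -> 1/2 in probability.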
1.2 Asymptotic Equipartition Property Theorem
Theorem (AEP) If X1, X2, . . . are i.i.d. ∼ p(x), then

    −(1/n) log p(X1, X2, . . . , Xn) → H(X)    in probability.

Definition The typical set Aε^(n) with respect to p(x) is the set of sequences (x1, x2, . . . , xn) ∈ X^n with the property

    2^(−n(H(X)+ε)) ≤ p(x1, x2, . . . , xn) ≤ 2^(−n(H(X)−ε)).
Theorem

1. If (x1, x2, . . . , xn) ∈ Aε^(n), then H(X) − ε ≤ −(1/n) log p(x1, x2, . . . , xn) ≤ H(X) + ε.
2. Pr{Aε^(n)} > 1 − ε for n sufficiently large.
3. |Aε^(n)| ≤ 2^(n(H(X)+ε)).
4. |Aε^(n)| ≥ (1 − ε) 2^(n(H(X)−ε)) for n sufficiently large.

The typical set has probability nearly one, cardinality nearly 2^(nH), and nearly equiprobable elements.
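As a numerical illustration (not part of the original notes), the Python sketch below draws i.i.d. Bernoulli(0.3) sequences and estimates how often −(1/n) log2 p(X1, . . . , Xn) falls within ε of H(X), i.e., how often a sequence lands in Aε^(n). The source parameter, ε, and sample sizes are arbitrary choices for the demonstration.

    import numpy as np

    def bernoulli_entropy(p):
        """Entropy H(X) in bits of a Bernoulli(p) source."""
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    p, eps = 0.3, 0.05                 # illustrative source parameter and tolerance
    H = bernoulli_entropy(p)
    rng = np.random.default_rng(0)

    for n in (10, 100, 1000, 10000):
        x = rng.random((1000, n)) < p              # 1000 i.i.d. length-n sequences
        k = x.sum(axis=1)                          # number of ones in each sequence
        # For a Bernoulli source, -(1/n) log2 p(x^n) depends only on the count k.
        neg_log_p = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
        frac_typical = np.mean(np.abs(neg_log_p - H) <= eps)
        print(f"n = {n:5d}   fraction of sequences in the typical set: {frac_typical:.3f}")
    # The fraction approaches 1 as n grows, matching Pr{Aε^(n)} > 1 - ε for large n.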
1.3 Consequences of the AEP

A Coding Scheme
Recall our definition of Aε^(n) as the typical set with respect to p(x). Can we assign codewords such that sequences (x1, x2, . . . , xn) ∈ X^n can be represented using about nH bits on average? Let x^n denote (x1, x2, . . . , xn), and let l(x^n) be the length of the codeword corresponding to x^n. If n is sufficiently large so that Pr{Aε^(n)} ≥ 1 − ε, the expected codeword length is

    E[l(X^n)] = Σ_{x^n} p(x^n) l(x^n)
              ≤ n(H + ε) + ε n log |X| + 2
              = n(H + ε′),

where ε′ = ε + ε log |X| + 2/n can be made arbitrarily small by choosing ε small and n large.
∗ Based on Cover & Thomas, Chapters 3 and 4.
Data Compression
Theorem Let X^n be drawn i.i.d. ∼ p(x) and let ε > 0. Then there exists a code that maps sequences x^n of length n into binary strings such that the mapping is one-to-one (and therefore invertible) and

    E[(1/n) l(X^n)] ≤ H(X) + ε

for n sufficiently large.
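A minimal sketch of the two-class code behind these bounds, assuming a binary i.i.d. source: typical sequences are described by an index of at most n(H + ε) + 1 bits, all other sequences by their raw n log |X| = n bits, and one flag bit distinguishes the two cases. The Bernoulli parameter, ε, and n below are illustrative assumptions, not values from the notes.

    import math
    import numpy as np

    def code_length(xn, p, eps, H):
        """Codeword length in bits for x^n under the two-class typical-set code."""
        n = len(xn)
        k = int(np.sum(xn))
        neg_log_p = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(neg_log_p - H) <= eps:                  # typical: flag bit + short index
            return 1 + math.ceil(n * (H + eps)) + 1    # index needs at most n(H + eps) + 1 bits
        return 1 + n                                   # atypical: flag bit + raw description (|X| = 2)

    p, eps, n = 0.3, 0.05, 2000                        # illustrative parameters
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    rng = np.random.default_rng(1)
    samples = rng.random((2000, n)) < p                # 2000 sample sequences of length n
    avg = np.mean([code_length(x, p, eps, H) for x in samples])
    eps_prime = eps + eps * 1.0 + 2 / n                # eps' = eps + eps*log2|X| + 2/n, with log2|X| = 1
    print(f"E[l(X^n)]/n ≈ {avg / n:.3f},  H(X) = {H:.3f},  bound H + eps' = {H + eps_prime:.3f}")

The printed per-symbol rate sits between H(X) and H(X) + ε′, as the derivation above predicts.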
1.4 High-Probability Sets and the Typical Set
What can we say about the cardinality of the typical set?
Definition (Smallest set for a given probability) For each n = 1, 2, . . ., let Bδ^(n) ⊂ X^n be the smallest set with

    Pr{Bδ^(n)} ≥ 1 − δ.
Theorem (Problem 3.3.11) Let X1, X2, . . . , Xn be i.i.d. ∼ p(x). For δ < 1/2 and δ′ > 0, if Pr{Bδ^(n)} > 1 − δ, then for sufficiently large n we have

    (1/n) log |Bδ^(n)| > H − δ′.

So (to first order) Bδ^(n) has at least 2^(nH) elements.
The following notation expresses equality to first order in the exponent:

Definition The notation an ≐ bn means

    lim_{n→∞} (1/n) log (an / bn) = 0.
Bδ^(n) has order 2^(nH) elements (at least), and Aε^(n) has order 2^(n(H±ε)) elements. Thus, if δn → 0 and εn → 0, then

    |Bδn^(n)| ≐ |Aεn^(n)| ≐ 2^(nH).

Hence, as n grows large, the typical set Aε^(n) is about the same size as the smallest set Bδ^(n).

2 Entropy Rates of a Stochastic Process

2.1 Markov Chains
Definition A stochastic process {Xi} is an indexed sequence of random variables.

Definition A discrete-time stochastic process {Xi}, i ∈ I, is one for which we associate the discrete index set I = {1, 2, . . .} with time.
Definition A discrete-time stochastic process is said to be stationary if the joint distribution of any subset
of the sequence of random variables is invariant with respect to shifts in the time index; that is,
Pr{X1 = x1 , X2 = x2 , . . . , Xn = xn } = Pr{X1+l = x1 , X2+l = x2 , . . . , Xn+l = xn }
for every n and every shift l and for all x1 , x2 , . . . , xn ∈ X .
Definition A discrete stochastic process X1, X2, . . . is said to be a finite-state Markov chain or Markov process if, for n = 1, 2, . . ., we have

    Pr{Xn+1 = xn+1 | Xn = xn, Xn−1 = xn−1, . . . , X1 = x1} = Pr{Xn+1 = xn+1 | Xn = xn}

for all x1, x2, . . . , xn, xn+1 ∈ X.

In this case, the joint pmf can be (conveniently!) decomposed as

    p(x1, x2, . . . , xn) = p(x1) p(x2|x1) p(x3|x2) · · · p(xn|xn−1).
Definition The Markov chain is said to be homogeneous or time invariant if the conditional probability
p(xn+1 |xn ) does not depend on n; that is, for n = 1, 2, . . .,
Pr{Xn+1 = b|Xn = a} = Pr{X2 = b|X1 = a} for all a, b ∈ X .
Homogeneity of the chain will be assumed unless otherwise stated.
Definition (Aperiodicity and Irreducibility) If {Xn} is a Markov chain taking values in X, with |X| = m, then

• We define Xn as the state at time n;
• We require a transition matrix P = [Pij], for i, j ∈ {1, 2, . . . , m}, with Pij = Pr{Xn+1 = j | Xn = i}.

A homogeneous Markov chain is completely characterized by its initial state distribution and its probability transition matrix.

A Markov chain is said to be:

• Irreducible if it is possible to move with positive probability from any state to any other state in a finite number of steps;
• Aperiodic if the greatest common divisor of the lengths of the different paths from a state to itself is 1.
Definition (Stationary Distributions) Note that if Xn ∼ p(xn), then at time n + 1 we have

    p(xn+1) = Σ_{xn ∈ X} p(xn) P_{xn xn+1}.
Any distribution p(x), if it exists, such that if Xn ∼ p(x), then Xn+1 ∼ p(x), is called a stationary
distribution of the chain.
In the case of a finite-state Markov chain, irreducibility and aperiodicity are enough to imply the
existence of a unique stationary distribution.
Such a chain is called ergodic. Its stationary distribution is also the equilibrium distribution of the
chain. From any starting point, the distribution of Xn tends to the stationary distribution as n → ∞.
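To make the convergence statement concrete, here is a short Python sketch with an arbitrarily chosen two-state transition matrix (an assumption for the example): starting from two different initial distributions and iterating p ← pP, both runs approach the same stationary distribution.

    import numpy as np

    # Hypothetical two-state transition matrix P[i, j] = Pr{Xn+1 = j | Xn = i}.
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    for p0 in (np.array([1.0, 0.0]), np.array([0.2, 0.8])):   # two different initial distributions
        p = p0.copy()
        for _ in range(200):                                   # iterate the update p <- p P shown above
            p = p @ P
        print(p0, "->", np.round(p, 6))
    # Both runs converge to mu = (0.8, 0.2), the unique solution of mu = mu P for this chain.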
2.2 Entropy Rate
Definition The entropy rate of a stochastic process {Xi} is defined by

    H(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn),

when the limit exists. We can also define an alternative notion:

    H′(X) = lim_{n→∞} H(Xn | Xn−1, Xn−2, . . . , X1).
Lemma For a stationary stochastic process, H(Xn | Xn−1, Xn−2, . . . , X1) is nonincreasing in n and has a limit H′(X).

Lemma (Cesàro mean) If an → a and bn = (1/n) Σ_{i=1}^{n} ai, then bn → a.

Theorem For a stationary stochastic process, H(X) and H′(X) exist and are equal:

    H(X) = H′(X).
Theorem (Entropy Rate of Markov Chains) For a Markov chain with stationary distribution µ(x), the entropy rate is

    H(X) = H′(X) = lim_{n→∞} H(Xn | Xn−1, Xn−2, . . . , X1) = lim_{n→∞} H(Xn | Xn−1) = Hµ(X′ | X),

calculated using the stationary distribution µ(x).
Theorem Let {Xi} be a finite-state Markov chain having transition matrix P and stationary distribution µ such that

    µj = Σ_i µi Pij    for all j.

Let X1 ∼ µ. Then the entropy rate is

    H(X) = − Σ_{i,j} µi Pij log Pij.
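As a sanity check of this formula, the sketch below reuses the illustrative two-state chain from the previous example (the matrix and its stationary distribution are assumptions for the demonstration) and evaluates the entropy rate directly.

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])          # hypothetical transition matrix (same as above)
    mu = np.array([0.8, 0.2])           # its stationary distribution, satisfying mu = mu P

    # H(X) = - sum_{i,j} mu_i P_ij log2 P_ij; entries with P_ij = 0 contribute 0.
    safe_P = np.where(P > 0, P, 1.0)                    # avoid log2(0); those terms are zeroed below
    terms = np.where(P > 0, P * np.log2(safe_P), 0.0)
    entropy_rate = -np.sum(mu[:, None] * terms)
    print(f"Entropy rate H(X) ≈ {entropy_rate:.4f} bits per symbol")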
2.3 Functions of Markov Chains
Theorem Let X1, X2, . . . , Xn, . . . be a stationary Markov chain, and take Yi = φ(Xi). Then the process {Yi} is stationary, and thus the limit

    H(Y) = lim_{n→∞} H(Yn | Yn−1, Yn−2, . . . , Y1)

exists.
Lemma

    H(Yn | Yn−1, . . . , Y2, X1) ≤ H(Y).
Lemma

    H(Yn | Yn−1, . . . , Y1) − H(Yn | Yn−1, . . . , Y1, X1) → 0    as n → ∞.
Theorem (Bounds on Entropy Rate) We can use the preceding lemmas to bound the entropy rate H(Y), despite the fact that {Yi} may not be a Markov chain, as follows:

    H(Yn | Yn−1, . . . , Y1, X1) ≤ H(Y) ≤ H(Yn | Yn−1, . . . , Y1)

and

    lim_{n→∞} H(Yn | Yn−1, . . . , Y1, X1) = H(Y) = lim_{n→∞} H(Yn | Yn−1, . . . , Y1).
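For a small example, both bounds can be computed exactly by enumerating all length-n state paths. The sketch below uses a hypothetical three-state chain started from its stationary distribution and a function φ that merges two states; it prints the lower bound H(Yn | Yn−1, . . . , Y1, X1) and the upper bound H(Yn | Yn−1, . . . , Y1) for increasing n, and the gap between them shrinks as the last lemma asserts. All numerical parameters are illustrative assumptions.

    import itertools, math
    from collections import defaultdict

    # Hypothetical 3-state chain; phi merges states 0 and 1 into one output symbol.
    P = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.3, 0.3, 0.4]]
    phi = [0, 0, 1]

    # Stationary distribution mu = mu P, obtained by power iteration.
    mu = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(500):
        mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]

    def entropy(dist):
        """Entropy in bits of a distribution given as {outcome: probability}."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def bounds(n):
        """Return (H(Yn | Y^{n-1}, X1), H(Yn | Y^{n-1})) by exhaustive enumeration."""
        py, py_prev, pxy, pxy_prev = (defaultdict(float) for _ in range(4))
        for xs in itertools.product(range(3), repeat=n):
            p = mu[xs[0]]
            for a, b in zip(xs, xs[1:]):          # p(x^n) = mu(x1) * prod_t P[x_t][x_{t+1}]
                p *= P[a][b]
            ys = tuple(phi[x] for x in xs)        # observed sequence y^n = phi(x^n)
            py[ys] += p
            py_prev[ys[:-1]] += p
            pxy[(xs[0],) + ys] += p
            pxy_prev[(xs[0],) + ys[:-1]] += p
        upper = entropy(py) - entropy(py_prev)        # H(Yn | Yn-1, ..., Y1)
        lower = entropy(pxy) - entropy(pxy_prev)      # H(Yn | Yn-1, ..., Y1, X1)
        return lower, upper

    for n in range(2, 9):
        lo, hi = bounds(n)
        print(f"n = {n}:  {lo:.4f} <= H(Y) <= {hi:.4f}")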