Harvard SEAS ES250 – Information Theory

Asymptotic Equipartition Property (AEP) and Entropy Rates*

* Based on Cover & Thomas, Chapters 3 and 4.

1 Asymptotic Equipartition Property

1.1 Preliminaries

Definition (Convergence of random variables). We say that a sequence of random variables $X_1, X_2, \ldots$ converges to a random variable $X$:

1. In probability if for every $\epsilon > 0$, $\Pr\{|X_n - X| > \epsilon\} \to 0$;
2. In mean square if $E[(X_n - X)^2] \to 0$;
3. With probability 1 (also called almost surely) if $\Pr\{\lim_{n\to\infty} X_n = X\} = 1$.

1.2 Asymptotic Equipartition Property Theorem

Theorem (AEP). If $X_1, X_2, \ldots \sim$ i.i.d. $p(x)$, then
$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability.}$$

Definition. The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the property
$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$

Theorem.
1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon$.
2. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large.
3. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.
4. $|A_\epsilon^{(n)}| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.

The typical set has probability nearly one, cardinality nearly $2^{nH}$, and nearly equiprobable elements.

1.3 Consequences of the AEP

A Coding Scheme. Recall our definition of $A_\epsilon^{(n)}$ as the typical set with respect to $p(x)$. Can we assign codewords such that sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ can be represented using $nH$ bits on average? Let $x^n$ denote $(x_1, x_2, \ldots, x_n)$, and let $l(x^n)$ be the length of the codeword corresponding to $x^n$. If $n$ is sufficiently large so that $\Pr\{A_\epsilon^{(n)}\} \ge 1 - \epsilon$, the expected codeword length is
$$E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n) \le n(H+\epsilon) + \epsilon n \log|\mathcal{X}| + 2 = n(H + \epsilon'),$$
where $\epsilon' = \epsilon + \epsilon \log|\mathcal{X}| + 2/n$.

Data Compression.

Theorem. Let $X^n \sim$ i.i.d. $p(x)$ and let $\epsilon > 0$. Then there exists a code that maps sequences $x^n$ of length $n$ into binary strings such that the mapping is one-to-one (and therefore invertible) and
$$E\left[\frac{1}{n}\, l(X^n)\right] \le H(X) + \epsilon$$
for $n$ sufficiently large.

1.4 High-Probability Sets and the Typical Set

How does the cardinality of the typical set compare with that of the smallest high-probability set?

Definition (Smallest set for a given probability). For each $n = 1, 2, \ldots$, let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set with
$$\Pr\{B_\delta^{(n)}\} \ge 1 - \delta.$$

Theorem (Problem 3.3.11). Let $X_1, X_2, \ldots, X_n \sim$ i.i.d. $p(x)$. For $\delta < 1/2$ and $\delta' > 0$, if $\Pr\{B_\delta^{(n)}\} > 1 - \delta$, then for $n$ sufficiently large we have
$$\frac{1}{n} \log |B_\delta^{(n)}| > H - \delta'.$$

So (to first order in the exponent) $B_\delta^{(n)}$ has at least $2^{nH}$ elements. The following notation expresses equality to first order in the exponent.

Definition. The notation $a_n \doteq b_n$ means
$$\lim_{n\to\infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0.$$

$B_\delta^{(n)}$ has order (at least) $2^{nH}$ elements, and $A_\epsilon^{(n)}$ has order $2^{n(H \pm \epsilon)}$. Thus, if $\delta_n \to 0$ and $\epsilon_n \to 0$, then
$$|B_{\delta_n}^{(n)}| \doteq |A_{\epsilon_n}^{(n)}| \doteq 2^{nH}.$$

Hence, as $n$ grows large, the typical set $A_\epsilon^{(n)}$ is about the same size as the smallest high-probability set $B_\delta^{(n)}$.
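As a numerical illustration of the AEP and the typical set (not part of the original notes), the Python sketch below samples i.i.d. sequences from a Bernoulli source and checks two things: that $-\frac{1}{n}\log p(X^n)$ concentrates around $H(X)$, and that the empirical probability of $A_\epsilon^{(n)}$ approaches one as $n$ grows. The source parameter p = 0.3, the tolerance eps = 0.05, and the number of sampled sequences are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli(p) source; p and eps are illustrative choices.
p = 0.3
H_X = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # source entropy in bits
eps = 0.05

for n in (25, 100, 400, 1600):
    # Draw 20000 i.i.d. length-n sequences and evaluate -(1/n) log2 p(x^n),
    # which depends only on the number of ones k in each sequence.
    k = (rng.random((20000, n)) < p).sum(axis=1)
    neg_log = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    in_typical = np.abs(neg_log - H_X) <= eps        # membership in A_eps^(n)
    print(f"n={n:5d}: mean -(1/n)log p = {neg_log.mean():.4f}"
          f"  (H = {H_X:.4f}),  Pr{{A_eps}} ~ {in_typical.mean():.3f}")
```

For p = 0.3 the source entropy is about 0.881 bits, and the fraction of sampled sequences falling in the typical set climbs toward one as n increases, in line with property 2 of the theorem above.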
2 Entropy Rates of a Stochastic Process

2.1 Markov Chains

Definition. A stochastic process $\{X_i\}$ is an indexed sequence of random variables.

Definition. A discrete-time stochastic process $\{X_i\}_{i \in I}$ is one for which we associate the discrete index set $I = \{1, 2, \ldots\}$ with time.

Definition. A discrete-time stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index; that is,
$$\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\}$$
for every $n$, every shift $l$, and all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.

Definition. A discrete stochastic process $X_1, X_2, \ldots$ is said to be a finite-state Markov chain or Markov process if, for $n = 1, 2, \ldots$,
$$\Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\} = \Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n\}$$
for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$. In this case, the joint pmf decomposes conveniently as
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$

Definition. The Markov chain is said to be homogeneous or time-invariant if the conditional probability $p(x_{n+1} \mid x_n)$ does not depend on $n$; that is, for $n = 1, 2, \ldots$,
$$\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\} \quad \text{for all } a, b \in \mathcal{X}.$$
Homogeneity of the chain will be assumed unless otherwise stated.

Definition (Irreducibility and aperiodicity). If $\{X_n\}$ is a Markov chain taking values in $\mathcal{X}$ with $|\mathcal{X}| = m$, then
• we call $X_n$ the state at time $n$;
• the chain has a transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, with $P_{ij} = \Pr\{X_{n+1} = j \mid X_n = i\}$.
A homogeneous Markov chain is completely characterized by its initial state distribution and its probability transition matrix. A Markov chain is said to be:
• irreducible if it is possible to move with positive probability from any state to any other state in a finite number of steps;
• aperiodic if the greatest common divisor of the lengths of the different paths from a state back to itself is 1.

Definition (Stationary distribution). Note that if $X_n \sim p(x_n)$, then at time $n+1$ we have
$$p(x_{n+1}) = \sum_{x_n \in \mathcal{X}} p(x_n)\, P_{x_n x_{n+1}}.$$
Any distribution $p(x)$, if it exists, such that $X_n \sim p(x)$ implies $X_{n+1} \sim p(x)$ is called a stationary distribution of the chain. In the case of a finite-state Markov chain, irreducibility and aperiodicity are enough to imply the existence of a unique stationary distribution. Such a chain is called ergodic. Its stationary distribution is also the equilibrium distribution of the chain: from any starting point, the distribution of $X_n$ tends to the stationary distribution as $n \to \infty$.

2.2 Entropy Rate

Definition. The entropy rate of a stochastic process $\{X_i\}$ is defined by
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
when the limit exists. We can also define an alternative notion:
$$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1).$$

Lemma. For a stationary stochastic process, $H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$ is nonincreasing in $n$ and has a limit $H'(\mathcal{X})$.

Lemma (Cesàro mean). If $a_n \to a$ and $b_n = \frac{1}{n} \sum_{i=1}^n a_i$, then $b_n \to a$.

Theorem. For a stationary stochastic process, $H(\mathcal{X})$ and $H'(\mathcal{X})$ exist and are equal: $H(\mathcal{X}) = H'(\mathcal{X})$.

Theorem (Entropy rate of Markov chains). For a Markov chain with stationary distribution $\mu(x)$, the entropy rate is
$$H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1),$$
where the conditional entropy is calculated using the stationary distribution $\mu(x)$.

Theorem. Let $\{X_i\}$ be a finite-state Markov chain with transition matrix $P$ and stationary distribution $\mu$ such that
$$\mu_j = \sum_i \mu_i P_{ij} \quad \text{for all } j.$$
Let $X_1 \sim \mu$. Then the entropy rate is
$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}.$$
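As a quick sanity check on the closed-form entropy rate above, the following sketch (my own illustration, not from the notes) uses a hypothetical two-state chain with transition probabilities alpha = Pr{0→1} and beta = Pr{1→0}, computes $H(\mathcal{X}) = -\sum_{i,j}\mu_i P_{ij}\log P_{ij}$, and verifies that the per-symbol block entropy $\frac{1}{n}H(X_1,\ldots,X_n)$, computed exactly by enumerating all length-$n$ sequences, approaches that rate.

```python
import itertools
import numpy as np

def entropy(p):
    """Entropy in bits of a collection of probabilities (zero entries ignored)."""
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log2(p)).sum())

# Hypothetical two-state chain; alpha, beta are arbitrary illustrative values.
alpha, beta = 0.2, 0.4                          # Pr{0 -> 1}, Pr{1 -> 0}
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
mu = np.array([beta, alpha]) / (alpha + beta)   # stationary distribution, mu P = mu

# Entropy rate: H(X) = -sum_ij mu_i P_ij log P_ij = sum_i mu_i * H(P[i, :]).
rate = sum(mu[i] * entropy(P[i]) for i in range(2))
print(f"entropy rate: {rate:.4f} bits/symbol")

# (1/n) H(X_1, ..., X_n) with X_1 ~ mu should approach the rate as n grows.
for n in (1, 2, 4, 8, 12):
    probs = []
    for xs in itertools.product((0, 1), repeat=n):
        pr = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            pr *= P[a, b]
        probs.append(pr)
    print(f"n={n:2d}: H(X^n)/n = {entropy(probs)/n:.4f}")
```

Because the chain is started from $\mu$, $\frac{1}{n}H(X^n) = \frac{1}{n}H(X_1) + \frac{n-1}{n}H(X_2 \mid X_1)$, so the printed values decrease monotonically toward the entropy rate.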
2.3 Functions of Markov Chains

Theorem. Let $X_1, X_2, \ldots, X_n, \ldots$ be a stationary Markov chain, and let $Y_i = \phi(X_i)$. Then the process $\{Y_i\}$ is stationary, and thus the limit
$$H(\mathcal{Y}) = \lim_{n\to\infty} H(Y_n \mid Y_{n-1}, Y_{n-2}, \ldots, Y_1)$$
exists.

Lemma. $H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(\mathcal{Y})$.

Lemma. $H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0$ as $n \to \infty$.

Theorem (Bounds on the entropy rate). The preceding lemmas bound the entropy rate $H(\mathcal{Y})$, even though $\{Y_i\}$ need not itself be a Markov chain:
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1)$$
and
$$\lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1).$$
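To see these bounds in action, the sketch below (an illustration under assumed parameters, not taken from the notes) builds a hypothetical three-state stationary chain, lumps two of its states together via $\phi$, and evaluates both sides of the sandwich by brute-force enumeration, using the chain-rule identities $H(Y_n \mid Y^{n-1}, X_1) = H(X_1, Y^n) - H(X_1, Y^{n-1})$ and $H(Y_n \mid Y^{n-1}) = H(Y^n) - H(Y^{n-1})$.

```python
import itertools
import numpy as np

# Hypothetical 3-state transition matrix and lumping function phi;
# all numerical values are illustrative, not from the notes.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
phi = {0: 0, 1: 0, 2: 1}            # Y_i = phi(X_i): merge states 0 and 1

# Stationary distribution mu (left eigenvector of P for eigenvalue 1).
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

def entropy(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    p = np.array([q for q in dist.values() if q > 0])
    return float(-(p * np.log2(p)).sum())

def joint(n):
    """Exact distributions of (X_1, Y^n) and of Y^n for the stationary chain."""
    p_xy, p_y = {}, {}
    for xs in itertools.product(range(len(mu)), repeat=n):
        pr = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            pr *= P[a, b]
        ys = tuple(phi[x] for x in xs)
        p_xy[(xs[0], ys)] = p_xy.get((xs[0], ys), 0.0) + pr
        p_y[ys] = p_y.get(ys, 0.0) + pr
    return p_xy, p_y

for n in range(2, 9):
    p_xy_n, p_y_n = joint(n)
    p_xy_m, p_y_m = joint(n - 1)
    lower = entropy(p_xy_n) - entropy(p_xy_m)   # H(Y_n | Y^{n-1}, X_1)
    upper = entropy(p_y_n) - entropy(p_y_m)     # H(Y_n | Y^{n-1})
    print(f"n={n}: {lower:.5f} <= H(Y) <= {upper:.5f}")
```

As $n$ increases, the lower bound is nondecreasing and the upper bound nonincreasing, so the two printed numbers close in on $H(\mathcal{Y})$ even though $\{Y_i\}$ is in general not a Markov chain.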