Elements of Information Theory. Thomas M. Cover, Joy A. Thomas. Copyright 1991 John Wiley & Sons, Inc. Print ISBN 0-471-06259-6, Online ISBN 0-471-20061-1.

Chapter 3: The Asymptotic Equipartition Property

In information theory, the analog of the law of large numbers is the Asymptotic Equipartition Property (AEP). It is a direct consequence of the weak law of large numbers. The law of large numbers states that for independent, identically distributed (i.i.d.) random variables, $\frac{1}{n}\sum_{i=1}^n X_i$ is close to its expected value $EX$ for large values of $n$. The AEP states that $-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)$ is close to the entropy $H$, where $X_1, X_2, \ldots, X_n$ are i.i.d. random variables and $p(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$. Thus the probability $p(X_1, X_2, \ldots, X_n)$ assigned to an observed sequence will be close to $2^{-nH}$.

This enables us to divide the set of all sequences into two sets: the typical set, where the sample entropy is close to the true entropy, and the non-typical set, which contains the other sequences. Most of our attention will be on the typical sequences. Any property that is proved for the typical sequences will then be true with high probability and will determine the average behavior of a large sample.

First, an example. Let the random variable $X \in \{0, 1\}$ have a probability mass function defined by $p(1) = p$ and $p(0) = q$. If $X_1, X_2, \ldots, X_n$ are i.i.d. according to $p(x)$, then the probability of a sequence $x_1, x_2, \ldots, x_n$ is $\prod_{i=1}^n p(x_i)$. For example, the probability of the sequence $(1, 0, 1, 1, 0, 1)$ is $p^{\sum x_i} q^{n - \sum x_i} = p^4 q^2$. Clearly, it is not true that all $2^n$ sequences of length $n$ have the same probability. However, we might be able to predict the probability of the sequence that we actually observe. We ask for the probability $p(X_1, X_2, \ldots, X_n)$ of the outcomes $X_1, X_2, \ldots, X_n$, where $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$. This is insidiously self-referential, but well defined nonetheless.
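The two claims above — that a specific sequence has probability $p^{\sum x_i} q^{n-\sum x_i}$, and that the probability of the sequence we actually observe is close to $2^{-nH}$ — can be checked by simulation. A minimal sketch, assuming an illustrative parameter $p = 0.6$ and sample sizes not taken from the text:

```python
import math
import random

p, q = 0.6, 0.4  # illustrative Bernoulli parameter: p(1) = p, p(0) = q

# Probability of one specific sequence, e.g. (1, 0, 1, 1, 0, 1) -> p^4 q^2
seq = (1, 0, 1, 1, 0, 1)
prob = 1.0
for x in seq:
    prob *= p if x == 1 else q
print(prob)  # equals p**4 * q**2

# Sample entropy -(1/n) log2 p(X1,...,Xn) of an observed i.i.d. sequence
def sample_entropy(xs):
    log_p = sum(math.log2(p) if x == 1 else math.log2(q) for x in xs)
    return -log_p / len(xs)

H = -p * math.log2(p) - q * math.log2(q)  # entropy H(X), about 0.971 bits

random.seed(0)  # fixed seed so the runs are reproducible
for n in (10, 1000, 100000):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    print(n, round(sample_entropy(xs), 4))
# As n grows, the sample entropy settles near H(X),
# i.e. the observed p(X1,...,Xn) is close to 2^(-nH).
```

The convergence printed by the loop is exactly the statement of the AEP made precise in Theorem 3.1.1 below.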
Apparently, we are asking for the probability of an event drawn according to the same probability distribution. Here it turns out that $p(X_1, X_2, \ldots, X_n)$ is close to $2^{-nH}$ with high probability.

We summarize this by saying, "Almost all events are almost equally surprising." This is a way of saying that

  $\Pr\{(X_1, X_2, \ldots, X_n) : p(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)}\} \approx 1$   (3.1)

if $X_1, X_2, \ldots, X_n$ are i.i.d. $\sim p(x)$. In the example just given, where $p(X_1, X_2, \ldots, X_n) = p^{\sum X_i} q^{n - \sum X_i}$, we are simply saying that the number of 1's in the sequence is close to $np$ (with high probability), and all such sequences have (roughly) the same probability $2^{-nH(p)}$.

3.1 THE AEP

The asymptotic equipartition property is formalized in the following theorem:

Theorem 3.1.1 (AEP): If $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then

  $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability.   (3.2)

Proof: Functions of independent random variables are also independent random variables. Thus, since the $X_i$ are i.i.d., so are the $\log p(X_i)$. Hence by the weak law of large numbers,

  $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_i \log p(X_i)$   (3.3)
  $\to -E \log p(X)$ in probability   (3.4)
  $= H(X)$,   (3.5)

which proves the theorem. □

Definition: The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the following property:

  $2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}$.   (3.6)

As a consequence of the AEP, we can show that the set $A_\epsilon^{(n)}$ has the following properties:

Theorem 3.1.2:
1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon$.
2. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large.
3. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$, where $|A|$ denotes the number of elements in the set $A$.
4. $|A_\epsilon^{(n)}| \ge (1 - \epsilon) 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.

Thus the typical set has probability nearly 1, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly $2^{nH}$.

Proof: The proof of property (1) is immediate from the definition of $A_\epsilon^{(n)}$.
The second property follows directly from Theorem 3.1.1, since the probability of the event $(X_1, X_2, \ldots, X_n) \in A_\epsilon^{(n)}$ tends to 1 as $n \to \infty$. Thus for any $\delta > 0$, there exists an $n_0$ such that for all $n \ge n_0$, we have

  $\Pr\left\{\left| -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon\right\} > 1 - \delta$.   (3.7)

Setting $\delta = \epsilon$, we obtain the second part of the theorem. Note that we are using $\epsilon$ for two purposes rather than using both $\epsilon$ and $\delta$. The identification $\delta = \epsilon$ will conveniently simplify notation later.

To prove property (3), we write

  $1 = \sum_{x \in \mathcal{X}^n} p(x)$   (3.8)
  $\ge \sum_{x \in A_\epsilon^{(n)}} p(x)$   (3.9)
  $\ge \sum_{x \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)}$   (3.10)
  $= 2^{-n(H(X)+\epsilon)} |A_\epsilon^{(n)}|$,   (3.11)

where the second inequality follows from (3.6). Hence

  $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.   (3.12)

Finally, for sufficiently large $n$, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$, so that

  $1 - \epsilon < \Pr\{A_\epsilon^{(n)}\}$   (3.13)
  $\le \sum_{x \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)}$   (3.14)
  $= 2^{-n(H(X)-\epsilon)} |A_\epsilon^{(n)}|$,   (3.15)

where the second inequality follows from (3.6). Hence

  $|A_\epsilon^{(n)}| \ge (1 - \epsilon) 2^{n(H(X)-\epsilon)}$,   (3.16)

which completes the proof of the properties of $A_\epsilon^{(n)}$. □

3.2 CONSEQUENCES OF THE AEP: DATA COMPRESSION

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables drawn from the probability mass function $p(x)$. We wish to find short descriptions for such sequences of random variables. We divide all sequences in $\mathcal{X}^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement, as shown in Figure 3.1. We order all elements in each set according to some order (say lexicographic order). Then we can represent each sequence of $A_\epsilon^{(n)}$ by giving the index of the sequence in the set. Since there are $\le 2^{n(H+\epsilon)}$ sequences in $A_\epsilon^{(n)}$, the indexing requires no more than $n(H + \epsilon) + 1$ bits. (The extra bit may be necessary because $n(H + \epsilon)$ may not be an integer.) We prefix all these sequences by a 0, giving a total length of $\le n(H + \epsilon) + 2$ bits to represent each sequence in $A_\epsilon^{(n)}$. See Figure 3.2. Similarly, we can index each sequence not in $A_\epsilon^{(n)}$ by using not more than $n \log |\mathcal{X}| + 1$ bits. Prefixing these indices by 1, we have a code for all the sequences in $\mathcal{X}^n$.
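The two-set code just described can be sketched concretely for a small Bernoulli source: enumerate $\mathcal{X}^n$, split it into $A_\epsilon^{(n)}$ and its complement, and assign flagged fixed-width indices. This is a toy sketch under illustrative parameters ($n = 10$, $p = 0.2$, $\epsilon = 0.2$), not the book's construction verbatim; in particular the atypical index width is simply $n$ bits here, since $|\mathcal{X}| = 2$:

```python
import itertools
import math

def build_code(n, p, eps):
    """Two-part code: typical sequences get flag '0', the rest flag '1'."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    typical, atypical = [], []
    for xs in itertools.product((0, 1), repeat=n):
        k = sum(xs)
        h = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        (typical if abs(h - H) <= eps else atypical).append(xs)
    code = {}
    t_bits = math.ceil(math.log2(len(typical)))  # index width <= n(H + eps) + 1
    a_bits = n  # brute-force index width for the complement of the typical set
    for i, xs in enumerate(typical):
        code[xs] = "0" + format(i, f"0{t_bits}b")
    for i, xs in enumerate(atypical):
        code[xs] = "1" + format(i, f"0{a_bits}b")
    return code, H

n, p, eps = 10, 0.2, 0.2
code, H = build_code(n, p, eps)
# Expected codeword length per source symbol vs. the entropy H(X)
avg = sum(p ** sum(xs) * (1 - p) ** (n - sum(xs)) * len(code[xs]) for xs in code)
print(avg / n, H)
```

The flag bit plus the fixed index widths make the code one-to-one, mirroring the scheme in the text; the rate approaches $H(X) + \epsilon$ only as $n$ grows, so at $n = 10$ the overhead is still clearly visible.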
Note the following features of the above coding scheme:

- The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
- We have used a brute-force enumeration of the atypical set $A_\epsilon^{(n)c}$ without taking into account the fact that the number of elements in $A_\epsilon^{(n)c}$ is less than the number of elements in $\mathcal{X}^n$. Surprisingly, this is good enough to yield an efficient description.
- The typical sequences have short descriptions of length $\approx nH$.

[Figure 3.1: Typical sets and source coding. $\mathcal{X}^n$: $|\mathcal{X}|^n$ elements; the non-typical set; the typical set $A_\epsilon^{(n)}$: $2^{n(H+\epsilon)}$ elements.]

Splitting the expected codeword length $E[l(X^n)]$ into a sum over the typical set and a sum over its complement, and applying the length bounds above, gives

  $E\left[\frac{1}{n} l(X^n)\right] \le H(X) + \epsilon'$   (3.23)

for $n$ sufficiently large. Thus we can represent sequences $X^n$ using $nH(X)$ bits on the average.

3.3 HIGH PROBABILITY SETS AND THE TYPICAL SET

From the definition of $A_\epsilon^{(n)}$, it is clear that $A_\epsilon^{(n)}$ is a fairly small set that contains most of the probability. But from the definition it is not clear whether it is the smallest such set. We will prove that the typical set has essentially the same number of elements as the smallest set, to first order in the exponent.

Definition: For each $n = 1, 2, \ldots$, let $B_\delta^{(n)} \subseteq \mathcal{X}^n$ be any set with

  $\Pr\{B_\delta^{(n)}\} \ge 1 - \delta$.   (3.24)

We argue that $B_\delta^{(n)}$ must have significant intersection with $A_\epsilon^{(n)}$ and therefore must have about as many elements. In Problem 7, we outline the proof of the following theorem:

Theorem 3.3.1: Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim p(x)$. For $\delta < \frac{1}{2}$ and any $\delta' > 0$, if $\Pr\{B_\delta^{(n)}\} > 1 - \delta$, then

  $\frac{1}{n} \log |B_\delta^{(n)}| > H - \delta'$  for $n$ sufficiently large.   (3.25)

Thus $B_\delta^{(n)}$ must have at least $2^{nH}$ elements, to first order in the exponent. But $A_\epsilon^{(n)}$ has $2^{n(H \pm \epsilon)}$ elements. Therefore, $A_\epsilon^{(n)}$ is about the same size as the smallest high-probability set.

We will now define some new notation to express equality to first order in the exponent.

Definition: The notation $a_n \doteq b_n$ means

  $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$.   (3.26)

Thus $a_n \doteq b_n$ implies
that $a_n$ and $b_n$ are equal to the first order in the exponent.

We can now restate the above results as

  $|B_\delta^{(n)}| \doteq |A_\epsilon^{(n)}| \doteq 2^{nH}$.   (3.27)

To illustrate the difference between $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$, let us consider a Bernoulli sequence $X_1, X_2, \ldots, X_n$ with parameter $p = 0.9$. (A Bernoulli($\theta$) random variable is a binary random variable that takes on the value 1 with probability $\theta$.) The typical sequences in this case are the sequences in which the proportion of 1's is close to 0.9. However, this does not include the most likely single sequence, which is the sequence of all 1's. The set $B_\delta^{(n)}$ includes all the most probable sequences, and therefore includes the sequence of all 1's. Theorem 3.3.1 implies that $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$ must both contain the sequences that have about 90% 1's, and the two sets are almost equal in size.

SUMMARY OF CHAPTER 3

AEP ("Almost all events are almost equally surprising"): Specifically, if $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then

  $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability.   (3.28)

Definition: The typical set $A_\epsilon^{(n)}$ is the set of sequences $x_1, x_2, \ldots, x_n$ satisfying

  $2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}$.   (3.29)

Properties of the typical set:
1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $p(x_1, x_2, \ldots, x_n) = 2^{-n(H \pm \epsilon)}$.
2. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large.
3. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$, where $|A|$ denotes the number of elements in set $A$.

Definition: $a_n \doteq b_n$ means $\frac{1}{n} \log \frac{a_n}{b_n} \to 0$ as $n \to \infty$.

Definition: Let $B_\delta^{(n)} \subseteq \mathcal{X}^n$ be the smallest set such that $\Pr\{B_\delta^{(n)}\} \ge 1 - \delta$, where $X_1, X_2, \ldots, X_n$ are i.i.d. $\sim p(x)$.

Smallest probable set: For $\delta < \frac{1}{2}$,

  $|B_\delta^{(n)}| \doteq 2^{nH}$.   (3.30)

PROBLEMS FOR CHAPTER 3

1. Markov's inequality and Chebyshev's inequality.
(a) (Markov's inequality) For any non-negative random variable $X$ and any $\delta > 0$, show that

  $\Pr\{X \ge \delta\} \le \frac{EX}{\delta}$.   (3.31)

Exhibit a random variable that achieves this inequality with equality.
(b) (Chebyshev's inequality) Let $Y$ be a random variable with mean $\mu$ and variance $\sigma^2$.
By letting $X = (Y - \mu)^2$, show that for any $\epsilon > 0$,

  $\Pr\{|Y - \mu| > \epsilon\} \le \frac{\sigma^2}{\epsilon^2}$.   (3.32)

(c) (The weak law of large numbers) Let $Z_1, Z_2, \ldots, Z_n$ be a sequence of i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let $\bar{Z}_n = \frac{1}{n} \sum_{i=1}^n Z_i$ be the sample mean. Show that

  $\Pr\{|\bar{Z}_n - \mu| > \epsilon\} \le \frac{\sigma^2}{n \epsilon^2}$.   (3.33)

Thus $\Pr\{|\bar{Z}_n - \mu| > \epsilon\} \to 0$ as $n \to \infty$. This is known as the weak law of large numbers.

2. An AEP-like limit. Let $X_1, X_2, \ldots$ be i.i.d. drawn according to probability mass function $p(x)$. Find

  $\lim_{n \to \infty} \left[ p(X_1, X_2, \ldots, X_n) \right]^{1/n}$.

3. The AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities $p(1) = 0.005$ and $p(0) = 0.995$. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer ones.
(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones.
(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.
(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).

4. Products. Let

  $X = \begin{cases} 1 & \text{with probability } \frac{1}{2} \\ 2 & \text{with probability } \frac{1}{4} \\ 3 & \text{with probability } \frac{1}{4} \end{cases}$

Let $X_1, X_2, \ldots$ be drawn i.i.d. according to this distribution. Find the limiting behavior of the product

  $(X_1 X_2 \cdots X_n)^{1/n}$.

5. AEP. Let $X_1, X_2, \ldots$ be independent, identically distributed random variables drawn according to the probability mass function $p(x)$, $x \in \{1, 2, \ldots, m\}$. Thus $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i)$. We know that $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability. Let $q(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n q(x_i)$, where $q$ is another probability mass function on $\{1, 2, \ldots, m\}$.
(a) Evaluate $\lim -\frac{1}{n} \log q(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$.
(b) Now evaluate the limit of the log likelihood ratio $\frac{1}{n} \log \frac{q(X_1, \ldots, X_n)}{p(X_1, \ldots, X_n)}$ when $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$. Thus the odds favoring $q$ are exponentially small when $p$ is true.

6. Random box size. An $n$-dimensional rectangular box with sides $X_1, X_2, X_3, \ldots, X_n$ is to be constructed. The volume is $V_n = \prod_{i=1}^n X_i$. The edge length $l$ of an $n$-cube with the same volume as the random box is $l = V_n^{1/n}$. Let $X_1, X_2, \ldots$ be i.i.d. uniform random variables over the unit interval $[0, 1]$. Find $\lim_{n \to \infty} V_n^{1/n}$, and compare it to $(E V_n)^{1/n}$. Clearly the expected edge length does not capture the idea of the volume of the box.

7. Proof of Theorem 3.3.1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim p(x)$. Let $B_\delta^{(n)} \subseteq \mathcal{X}^n$ be such that $\Pr\{B_\delta^{(n)}\} > 1 - \delta$. Fix $\epsilon < \frac{1}{2}$.
(a) Given any two sets $A$, $B$ such that $\Pr(A) > 1 - \epsilon_1$ and $\Pr(B) > 1 - \epsilon_2$, show that $\Pr(A \cap B) > 1 - \epsilon_1 - \epsilon_2$. Hence $\Pr(A_\epsilon^{(n)} \cap B_\delta^{(n)}) \ge 1 - \epsilon - \delta$.
(b) Justify the steps in the chain of inequalities

  $1 - \epsilon - \delta \le \Pr(A_\epsilon^{(n)} \cap B_\delta^{(n)})$   (3.34)
  $= \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} p(x^n)$   (3.35)
  $\le \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} 2^{-n(H-\epsilon)}$   (3.36)
  $= |A_\epsilon^{(n)} \cap B_\delta^{(n)}| \, 2^{-n(H-\epsilon)}$   (3.37)
  $\le |B_\delta^{(n)}| \, 2^{-n(H-\epsilon)}$.   (3.38)

(c) Complete the proof of the theorem.

HISTORICAL NOTES

The Asymptotic Equipartition Property (AEP) was first stated by Shannon in his original 1948 paper [238], where he proved the result for i.i.d. processes and stated the result for stationary ergodic processes. McMillan [192] and Breiman [44] proved the AEP for ergodic finite alphabet sources. Chung [57] extended the theorem to the case of countable alphabets, and Moy [197], Perez [208] and Kieffer [154] proved the $\mathcal{L}^1$ convergence when $\{X_i\}$ is continuous valued and ergodic. Barron [18] and Orey [202] proved almost sure convergence for continuous valued ergodic processes; a simple sandwich argument (Algoet and Cover [8]) will be used in Section 15.7 to prove the general AEP.
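As a numerical companion to Problem 3 above: parts (a) and (b) reduce to counting the sequences with at most three 1's and evaluating a binomial tail. A hedged sketch of one way to compute these quantities (not the book's worked solution):

```python
import math
from math import comb

n, p1 = 100, 0.005

# (a) number of length-100 sequences with three or fewer ones,
# and the fixed codeword length needed to index all of them
count = sum(comb(n, k) for k in range(4))
length = math.ceil(math.log2(count))
print(count, length)

# (b) probability of a sequence with more than three ones,
# i.e. a source sequence to which no codeword has been assigned
p_uncoded = 1 - sum(comb(n, k) * p1**k * (1 - p1) ** (n - k) for k in range(4))
print(p_uncoded)
```

Since the expected number of 1's is only $np_1 = 0.5$, the uncoded event is a far-tail event and its probability is small, which is what makes this truncated code effective.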