Chapter 3: Asymptotic Equipartition Property
University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye

Chapter 3 outline
• Strong vs. weak typicality
• Convergence
• Asymptotic Equipartition Property theorem
• High-probability sets and the "typical set"
• Consequences of the AEP: data compression

Strong versus weak typicality
• Intuition behind typicality?

Another example of typicality
• Bit-sequences of length n = 8, with prob(1) = p and prob(0) = 1 − p
• Strong typicality?
• Weak typicality?
• What if p = 0.5?

Convergence of random variables

Weak Law of Large Numbers + the AEP

Typical sets intuition
• What's the point? Consider iid bit-strings of length N = 100, with prob(1) = $p_1$ = 0.1
• The probability of a given string x with r ones is $P(x) = p_1^r (1 - p_1)^{N-r}$
• The number of strings with r ones is $n(r) = \binom{N}{r}$
• The distribution of r, the number of ones in a string of length N, is thus $P(r) = \binom{N}{r} p_1^r (1 - p_1)^{N-r}$

[Figure 4.11, MacKay: anatomy of the typical set T. For $p_1 = 0.1$ and N = 100 and N = 1000, the panels show n(r), the number of strings containing r 1s; the probability P(x) of a single string that contains r 1s; the same probability on a log scale; and the total probability n(r)P(x) of all strings that contain r 1s, with r on the horizontal axis. The plot of $\log_2 P(x)$ also shows, by a dotted line, the mean value of $\log_2 P(x)$, which equals $-N H_2(p_1)$: −46.9 when N = 100 and −469 when N = 1000. The typical set includes only the strings that have $\log_2 P(x)$ close to this value; the range marked T shows the set $T_{N\beta}$ (as defined in Section 4.4) for N = 100, β = 0.29 (left) and N = 1000, β = 0.09 (right).]
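These quantities are easy to tabulate directly. Below is a minimal Python sketch (not from the slides; the step of 10 in r is just to keep the printout short) reproducing the values plotted in Figure 4.11 for N = 100:

```python
# Sketch (not from the slides): tabulate the quantities plotted in Figure 4.11
# for iid bit-strings of length N with prob(1) = p1 (step 10 in r for brevity).
from math import comb, log2

N, p1 = 100, 0.1

for r in range(0, N + 1, 10):
    P_x = p1**r * (1 - p1)**(N - r)   # probability of one string with r ones
    n_r = comb(N, r)                  # number of strings with r ones
    print(f"r={r:3d}  log2 P(x)={log2(P_x):8.1f}  n(r)={n_r:.3e}  n(r)P(x)={n_r * P_x:.3e}")

# The total probability n(r)P(x) concentrates around r = N*p1 = 10, where
# log2 P(x) is close to -N*H2(p1) = -46.9, the dotted line in MacKay's figure.
```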
Typical sets intuition
• What's the point? Consider iid bit-strings of length N = 100, with prob(1) = $p_1$ = 0.1

[Table 4.10, MacKay pg. 78: the top 15 strings are samples from $X^{100}$, where $p_1 = 0.1$ and $p_0 = 0.9$; the bottom two are the most and least probable strings in this ensemble (all 0s, with $\log_2 P(x) = -15.2$, and all 1s, with $\log_2 P(x) = -332.1$). The final column shows the log-probabilities of the random strings, which range from about −37 to −66 and may be compared with the entropy $H(X^{100}) = 46.9$ bits.]

The quantity $\frac{1}{N} H_\delta(X^N)$ is essentially independent of δ, except for tails close to δ = 0 and 1. As long as we are allowed a tiny probability of error δ, compression down to N H bits is possible. Even if we are allowed a large probability of error, we still can compress only down to N H bits. This is the source coding theorem.

Theorem 4.1 (Shannon's source coding theorem). Let X be an ensemble with entropy H(X) = H bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer $N_0$ such that for $N > N_0$,
$$\left| \frac{1}{N} H_\delta(X^N) - H \right| < \epsilon. \qquad (4.21)$$

Definition: weak typicality (Section 4.4, Typicality)
Why does increasing N help? Let's examine long strings from $X^N$. Table 4.10 shows fifteen samples from $X^N$ for N = 100 and $p_1 = 0.1$.
• The probability of a string x that contains r 1s and N − r 0s is
$$P(x) = p_1^r (1 - p_1)^{N-r}. \qquad (4.22)$$
• The number of strings that contain r 1s is
$$n(r) = \binom{N}{r}. \qquad (4.23)$$
• So the number of 1s, r, has a binomial distribution:
$$P(r) = \binom{N}{r} p_1^r (1 - p_1)^{N-r}. \qquad (4.24)$$
These functions are shown in Figure 4.11.
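Table 4.10 can be recreated in a few lines. Here is a minimal Python sketch (not from the slides; the seed and the five-sample count are arbitrary choices) that draws strings from $X^{100}$ and prints their log-probabilities:

```python
# Sketch (not from the slides): draw samples from X^100 as in Table 4.10 and
# compare their log-probabilities with -N*H2(p1) = -46.9 bits.
import random
from math import log2

N, p1 = 100, 0.1
random.seed(0)

def log2_prob(x):
    """log2 P(x) for an iid Bernoulli(p1) bit-string x."""
    r = sum(x)
    return r * log2(p1) + (len(x) - r) * log2(1 - p1)

for _ in range(5):
    x = [1 if random.random() < p1 else 0 for _ in range(N)]
    print(f"r = {sum(x):2d} ones, log2 P(x) = {log2_prob(x):7.1f}")

# Random draws cluster near -46.9, while the most probable string (all 0s,
# log2 P(x) = -15.2) and the least probable (all 1s, -332.1) are far from it,
# which is why neither extreme lands in the typical set.
```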
The mean of r is $N p_1$, and its standard deviation is $\sqrt{N p_1 (1 - p_1)}$. If N is 100, then
$$r \sim N p_1 \pm \sqrt{N p_1 (1 - p_1)} \approx 10 \pm 3. \qquad (4.25)$$
[MacKay textbook, pg. 78]

Properties of the typical set
$$|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}. \qquad (3.12)$$
Finally, for sufficiently large n, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$, so that
$$1 - \epsilon < \Pr\{A_\epsilon^{(n)}\} \qquad (3.13)$$
$$\le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} \qquad (3.14)$$
$$= 2^{-n(H(X)-\epsilon)}\, |A_\epsilon^{(n)}|, \qquad (3.15)$$
where the second inequality follows from (3.6). Hence
$$|A_\epsilon^{(n)}| \ge (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}, \qquad (3.16)$$
which completes the proof of the properties of $A_\epsilon^{(n)}$.

The typical set visually
[Figure 4.12, MacKay pg. 81: schematic diagram showing all strings in the ensemble $X^N$ ranked by their probability (bit sequences of length 100, prob(1) = 0.1), and the typical set $T_{N\beta}$, whose members have $\log_2 P(x)$ close to $-N H(X)$.]
• Most + least likely sequences NOT in typical set!!

The 'asymptotic equipartition' principle is equivalent to:
Shannon's source coding theorem (verbal statement). N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely, if they are compressed into fewer than N H(X) bits, it is virtually certain that information will be lost.
These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set. [MacKay pg. 81]

3.2 Consequences of the AEP: data compression
Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables drawn from the probability mass function p(x). We wish to find short descriptions for such sequences of random variables. We divide all sequences in $X^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement, as shown in Figure 3.1.
[Figure 3.1, Cover+Thomas pg. 60: typical sets and source coding. The non-typical set contains at most $|X|^n$ elements; the typical set $A_\epsilon^{(n)}$ contains at most $2^{n(H+\epsilon)}$ elements.]

4.5 Proofs
This section may be skipped if found tough going.

The law of large numbers
Our proof of the source coding theorem uses the law of large numbers.
• Mean and variance of a real random variable are $E[u] = \bar{u} = \sum_u P(u)\, u$ and $\mathrm{var}(u) = \sigma_u^2 = E[(u - \bar{u})^2] = \sum_u P(u)(u - \bar{u})^2$.
• Technical note: strictly, u is assumed here to be a function u(x) of a sample x from a finite discrete ensemble X. Then the summations $\sum_u P(u) f(u)$ should be written $\sum_x P(x) f(u(x))$; this means that P(u) is a finite sum of delta functions. The restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).
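Before the formal proof, it may help to see the law of large numbers at work on exactly the right random variable. Below is a minimal Python sketch (not from the slides; the sample sizes and seed are arbitrary) applying it to $u(x) = -\log_2 p(x)$, which is the content of the AEP:

```python
# Sketch (not from the slides): the AEP is the law of large numbers applied to
# u(x) = -log2 p(x); the sample mean -(1/n) log2 p(x^n) concentrates at H(X).
import random
from math import log2

p1 = 0.1
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)   # H(X), about 0.469 bits
random.seed(1)

for n in (10, 100, 1000, 10000):
    x = [1 if random.random() < p1 else 0 for _ in range(n)]
    # -(1/n) log2 p(x^n) = (1/n) * sum_i -log2 p(x_i) for an iid source
    sample_mean = sum(-log2(p1) if xi else -log2(1 - p1) for xi in x) / n
    print(f"n={n:6d}  -(1/n) log2 p(x^n) = {sample_mean:.3f}  (H = {H:.3f})")
```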
Chebyshev's inequality 1
Let t be a non-negative real random variable, and let α be a positive real number. Then
$$P(t \ge \alpha) \le \frac{\bar{t}}{\alpha}. \qquad (4.30)$$
Proof: $P(t \ge \alpha) = \sum_{t \ge \alpha} P(t)$. We multiply each term by $t/\alpha \ge 1$ and obtain $P(t \ge \alpha) \le \sum_{t \ge \alpha} P(t)\, t/\alpha$. We add the (non-negative) missing terms and obtain $P(t \ge \alpha) \le \sum_t P(t)\, t/\alpha = \bar{t}/\alpha$. □

Consequences of the AEP
• Typical set contains almost all the probability!
• How many sequences are in this set? By enumeration! This is useful for source coding (compression)!
[Figure 3.2, Cover+Thomas: source code using the typical set. Non-typical set: description takes $n \log|X| + 2$ bits; typical set: description takes $n(H + \epsilon) + 2$ bits.]

We order all elements in each set according to some order (e.g., lexicographic order). Then we can represent each sequence of $A_\epsilon^{(n)}$ by giving the index of the sequence in the set. Since there are $\le 2^{n(H+\epsilon)}$ sequences in $A_\epsilon^{(n)}$, the indexing requires no more than $n(H + \epsilon) + 1$ bits. [The extra bit may be necessary because $n(H + \epsilon)$ may not be an integer.] We prefix all these sequences by a 0, giving a total length of $\le n(H + \epsilon) + 2$ bits to represent each sequence in $A_\epsilon^{(n)}$ (see Figure 3.2). Similarly, we can index each sequence not in $A_\epsilon^{(n)}$ by using not more than $n \log|X| + 1$ bits. Prefixing these indices by 1, we have a code for all the sequences in $X^n$.

Note the following features of the above coding scheme (a numerical sketch of the code follows below):
• The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
• We have used a brute-force enumeration of the atypical set $A_\epsilon^{(n)c}$ without taking into account the fact that the number of elements in $A_\epsilon^{(n)c}$ is less than the number of elements in $X^n$. Surprisingly, this is good enough to yield an efficient description.
• The typical sequences have short descriptions of length ≈ nH.
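Here is a minimal Python sketch of this two-part code's rate (not from the slides; the source is Bernoulli(0.1) with |X| = 2, and n and ε are illustrative choices):

```python
# Sketch (not from the slides): expected rate of the two-part typical-set code
# of Figure 3.2 for a Bernoulli(p1) source with alphabet size |X| = 2.
from math import comb, log2, ceil

n, p1, eps = 20, 0.1, 0.2
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

len_typ = ceil(n * (H + eps)) + 1   # index into A_eps (<= n(H+eps)+1 bits) + flag '0'
len_atyp = n + 1                    # brute-force index (n log2|X| = n bits) + flag '1'

E_len = 0.0
for r in range(n + 1):              # group strings by their number of ones r
    p_one = p1**r * (1 - p1)**(n - r)
    typical = abs(-log2(p_one) / n - H) <= eps
    E_len += comb(n, r) * p_one * (len_typ if typical else len_atyp)

print(f"E[l(X^n)]/n = {E_len / n:.3f} bits/symbol  (H + eps = {H + eps:.3f})")
# For this small n the typical set carries only ~75% of the probability, so the
# rate is still well above H + eps; as n grows, Pr(A_eps) -> 1 and the expected
# rate per symbol approaches H + eps (Theorem 3.2.1 of Cover & Thomas).
```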
We use the notation $x^n$ to denote a sequence $x_1, x_2, \ldots, x_n$, and let $l(x^n)$ be the length of the codeword corresponding to $x^n$. If n is sufficiently large so that $\Pr\{A_\epsilon^{(n)}\} \ge 1 - \epsilon$, the expected length of the codeword is
$$E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n). \qquad (3.17)$$

AEP and data compression

"High-probability set" vs. "typical set"
• Typical set: small number of outcomes that contain most of the probability
• Is it the smallest such set?

Some notation
[Figure 4.12 again, MacKay pg. 81: all strings in the ensemble $X^N$ ranked by their probability, with the typical set $T_{N\beta}$ marked.]
• What's the difference?
• Why use the "typical set" rather than the "high-probability set"? (A numerical comparison is sketched below.)
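To make the question concrete, here is a minimal Python sketch (not from the slides; the set names B_delta and A_eps, and all parameters, are illustrative) comparing the smallest high-probability set with the weakly typical set for a short Bernoulli(0.1) source:

```python
# Sketch (not from the slides): compare the typical set A_eps with the smallest
# "high-probability set" B_delta of total probability >= 1 - delta.
from itertools import product
from math import log2

n, p1, delta, eps = 12, 0.1, 0.1, 0.1
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

probs = [p1**sum(x) * (1 - p1)**(n - sum(x)) for x in product((0, 1), repeat=n)]

# B_delta: greedily take the most probable strings until they cover 1 - delta.
probs.sort(reverse=True)
total, B_size = 0.0, 0
for p in probs:
    total += p
    B_size += 1
    if total >= 1 - delta:
        break

# A_eps: strings whose per-symbol log-probability is within eps of -H(X).
A_size = sum(1 for p in probs if abs(-log2(p) / n - H) <= eps)

print(f"|B_delta| = {B_size}, |A_eps| = {A_size}")
# B_delta is by construction the smallest set with probability >= 1 - delta,
# but A_eps is far easier to characterize and work with analytically; Cover &
# Thomas (Section 3.3) show that both contain roughly 2^{nH} strings to first
# order in the exponent as n grows, so asymptotically nothing is lost.
```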