One checks that

(5.16)    I(X;Y) = ∑_{x∈𝒳, y∈𝒴} P_{XY}(x,y) log( P_{XY}(x,y) / (P_X(x) P_Y(y)) ),

and so in particular I(X;Y) = I(Y;X) (which can also be seen by noting that I(X;Y) = H(X) + H(Y) − H(XY)).

6 Shannon's Model of Communication

Shannon studied the case where a discrete memoryless channel C from 𝒳 to 𝒴 is given. Such a channel can be specified by providing, for every x ∈ 𝒳, a probability distribution P^{(x)}(y) over the values y ∈ 𝒴 which are received when x is input into the channel. We will often write P_{Y|X=x}(y) or P_{Y|X}(y, x) for this probability distribution. (This is a slight abuse of notation: when a joint distribution P_{XY}(x, y) is given, we also write P_{Y|X=x}(y) to denote the conditional probability distribution. When only a channel is provided, this is not such a conditional distribution, as there is no distribution over 𝒳. However, if P_{XY} is a probability distribution over 𝒳 × 𝒴 which is achieved by picking x according to some distribution and then sending it over the channel, the two objects almost coincide.) Furthermore, if x ∈ 𝒳 is given, we write C(x) to denote a random variable distributed according to P_{Y|X=x}(y). If values x^n = (x_1, . . . , x_n) ∈ 𝒳^n are given, we analogously write C^{(n)}(x^n) for the result of sending each x_i through the channel independently.

Definition 6.1. The capacity Cap(C) of a channel C is defined as

(6.1)    Cap(C) := sup_{P_X, Y=C(X)} I(X;Y),

where the supremum is over all probability distributions P_X over 𝒳; here X is picked according to P_X, and Y is obtained by sending X through the channel C.

Shannon proved that in n uses of the channel C, roughly n · Cap(C) bits can be transmitted. We will provide more exact statements and some proofs in the next two sections.

6.1 The Channel Coding Theorem

We first give the theorem which states that we can transmit information over a channel at any rate below the channel capacity.

Theorem 6.2 (Shannon's Channel Coding Theorem). Let C be a discrete memoryless channel from 𝒳 to 𝒴, ε > 0, and let R = Cap(C) − ε. Then, for every n > n_0(C, ε) and every m ≤ nR, there exist functions Enc : {0,1}^m → 𝒳^n and Dec : 𝒴^n → {0,1}^m such that for every w ∈ {0,1}^m we have

(6.2)    Pr_{C^{(n)}}[ Dec(C^{(n)}(Enc(w))) ≠ w ] ≤ exp(−Ω(ε²n/L²)),

where L := log(|𝒳| · |𝒴|).

We omit the proof of Theorem 6.2, but prove the following related result. It shows that a random linear code over {0,1} works well if the channel achieves capacity on the uniform distribution.

Exercise 6.3. Let C be the binary symmetric channel with error probability p, i.e., the channel from {0,1} to {0,1} which, on input x, outputs x with probability 1 − p and 1 − x with probability p. Prove that the capacity of the channel is 1 − h(p).
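The capacity in Definition 6.1 can also be checked numerically for small channels. The following sketch (assuming Python with NumPy, which is not part of these notes, and hypothetical helper names h and mutual_information) maximizes I(X;Y) over input distributions for a binary symmetric channel and compares the maximum with the closed form 1 − h(p) from Exercise 6.3.

    import numpy as np

    def h(p):
        # Binary entropy in bits, clipping to avoid log(0).
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def mutual_information(q, p):
        # I(X;Y) for P_X = (1 - q, q) sent through a BSC with flip probability p.
        # Here H(Y|X) = h(p) and P_Y(1) = q(1 - p) + (1 - q)p.
        py1 = q * (1 - p) + (1 - q) * p
        return h(py1) - h(p)

    p = 0.11
    qs = np.linspace(0.0, 1.0, 10001)
    best = max(mutual_information(q, p) for q in qs)
    print(f"max_q I(X;Y) = {best:.6f},  1 - h(p) = {1 - h(p):.6f}")

Up to the grid resolution the two numbers agree, and the maximizing input distribution is the uniform one; this is consistent with Theorem 6.4 below, which requires that the channel achieve capacity on the uniform input.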
Theorem 6.4. Let C be a discrete memoryless channel from 𝒳 = {0,1} to 𝒴 which achieves the capacity on the uniform input. Let ε > 0 and R = Cap(C) − ε. Then, for every n > n_0(C, ε) and every m ≤ nR, there exists a family of functions Dec_G : 𝒴^n → {0,1}^m (one for each G ∈ {0,1}^{n×m}) such that for all w ∈ {0,1}^m \ {0^m} we have

(6.3)    Pr_{G, C^{(n)}}[ Dec_G(C^{(n)}(G · w)) ≠ w ] ≤ exp(−Ω(ε²n/L²)),

where L := log(|𝒴|) and G ∈ {0,1}^{n×m} is a uniform random matrix.

Note that the probability in (6.3) is also over the choice of G, which makes it a weaker statement: we only show that for every w a uniform random G will suffice, but one would like to show that some G works for all w.

Proof. We stay a bit more general than the theorem for a moment and consider an arbitrary channel. Let P_X be a distribution which achieves capacity, and consider the distribution P_{XY}(x,y) = P_X(x) P_{Y|X=x}(y) which one obtains by first picking X according to P_X and then setting Y = C(X). Furthermore, let P_{X^n Y^n} be the distribution over n independent such copies. For a parameter δ > 0, we then define the "typical set"

(6.4)    A^n_δ := { (x^n, y^n) : h_{X^n Y^n}(x^n, y^n) ∈ n(H(XY) ± δ)  ∧  h_{X^n}(x^n) ∈ n(H(X) ± δ)  ∧  h_{Y^n}(y^n) ∈ n(H(Y) ± δ) }.

We have the following claims:

(1) Pick (x^n, y^n) according to P_{X^n Y^n}. Then, with probability at least 1 − 2^{−Ω(nδ²/L²)} we have (x^n, y^n) ∈ A^n_δ.

(2) Pick x^n according to P_{X^n} and independently y^n according to P_{Y^n}. Then, the probability that (x^n, y^n) ∈ A^n_δ is at most 2^{−n·Cap(C) + 3nδ}.

Claim (1) follows immediately from Theorem 5.8. To see Claim (2), we note that A^n_δ has at most 2^{n(H(XY)+δ)} elements, because each element has pointwise entropy at most n(H(XY)+δ), and so probability at least 2^{−n(H(XY)+δ)}. Because of the two remaining conditions on elements of A^n_δ, the probability that in the process of Claim (2) a fixed pair (x^n, y^n) ∈ A^n_δ is picked is at most 2^{−n(H(X)−δ) − n(H(Y)−δ)}. Thus, overall we have a probability of at most 2^{n(H(XY)+δ) − n(H(X)−δ) − n(H(Y)−δ)} = 2^{−n·I(X;Y) + 3nδ}, which gives the result, since I(X;Y) = Cap(C) for the capacity-achieving distribution.

We now come to the proof of the theorem. The decoder Dec_G works as follows: given y^n, it enumerates all messages w* ≠ 0^m and checks whether (G·w*, y^n) is in A^n_δ. If there is a single w* for which this holds, it outputs w*; otherwise it outputs a special symbol 'fail'.

By Claim (1) above, the probability (over the choice of G and the randomness of the channel) that for the correct message w the pair (G·w, y^n) is typical is 1 − exp(−Ω(nδ²/L²)). Furthermore, by Claim (2) and the union bound, the probability that for any other non-zero message w* the pair (G·w*, y^n) is typical is at most 2^m · 2^{−n·I(X;Y) + 3nδ} ≤ 2^{−nε + 3nδ}. Thus, if we choose δ = ε/4 we get the result.

Exercise 6.5. Assume that C is the binary symmetric channel, and strengthen Theorem 6.4 for this case, showing that some fixed G works for all w, including 0^m.

Exercise 6.6. Prove Theorem 6.2.

Exercise 6.7. Let 𝒞 ⊆ {0,1}^n be some code. For r ∈ ℕ, let 𝒞^{(r)} ⊆ {0,1}^{nr} be the code which is obtained by concatenating r codewords of 𝒞. Formally, 𝒞^{(r)} := {(c_1 ‖ c_2 ‖ . . . ‖ c_r) : c_1 ∈ 𝒞 ∧ · · · ∧ c_r ∈ 𝒞}. What is the smallest possible minimum distance such a code can have? Now let C be a channel from {0,1} to 𝒴 which achieves capacity on the uniform distribution, and fix ε > 0. Use Theorem 6.4 and the above construction for an appropriate r to show that there exists a family of polynomial-time encodable and decodable codes 𝒞 ⊆ {0,1}^n with rate Cap(C) − ε and error probability at most 1/n (or any other polynomial in n).

Exercise 6.8. Let 𝒞 ⊆ {0,1}^n be a code with ∆(𝒞) ≥ n/4 and |𝒞| ≤ 2^{nε²/1000}. Prove that there is a decoder which finds the sent codeword with overwhelming probability, when a codeword from 𝒞 is sent over a binary symmetric channel which flips each bit with probability (1−ε)/2.

6.2 The Converse to the Channel Coding Theorem

For completeness we add a proof that rates above capacity cannot be achieved.

Theorem 6.9 (Shannon's Converse Channel Coding Theorem). Let Z be a discrete memoryless channel from 𝒳 to 𝒴, ε > 0, and let R = Cap(Z) + ε. Then, for every m ≥ nR and every pair of functions Enc : {0,1}^m → 𝒳^n and Dec : 𝒴^n → {0,1}^m we have

(6.5)    Pr_{w←{0,1}^m, Z^{(n)}}[ Dec(Z^{(n)}(Enc(w))) ≠ w ] ≥ (εn − 1)/m.

Proof sketch. We assume that the encoding function is injective; intuitively this is no problem, but formally it loses generality, and one should do more work which we omit. We pick the message W uniformly in {0,1}^m. We then let X^{(n)} := Enc(W), and Y^{(n)} the result of sending X^{(n)} through the channel. We see that

(6.6)    H(W | Y^{(n)}) = H(W) − I(W; Y^{(n)})
                        = m − I(X^{(n)}; Y^{(n)})
                        = m − ( H(Y^{(n)}) − H(Y^{(n)} | X^{(n)}) )
                        = m − ( H(Y^{(n)}) − ∑_{i=1}^{n} H(Y_i | Y_1, . . . , Y_{i−1}, X^{(n)}) )
                        = m − ( H(Y^{(n)}) − ∑_{i=1}^{n} H(Y_i | X_i) )
                        ≥ m − ∑_{i=1}^{n} I(X_i; Y_i) ≥ nR − n·Cap(Z) ≥ εn.

Here we used that Enc is injective (so that I(W; Y^{(n)}) = I(X^{(n)}; Y^{(n)})), that the channel is memoryless, and that H(Y^{(n)}) ≤ ∑_{i=1}^{n} H(Y_i). Now, define

(6.7)    E := 1 if Dec(Y^{(n)}) ≠ W, and E := 0 otherwise.

We see that H(EW | Y^{(n)}) = H(E | Y^{(n)}) + H(W | E, Y^{(n)}), and H(W | E, Y^{(n)}) = Pr[E = 1] · H(W | Y^{(n)}, E = 1) ≤ Pr[E = 1] · m (the term for E = 0 vanishes, because then W = Dec(Y^{(n)}) is determined by Y^{(n)}). Since H(E | Y^{(n)}) ≤ 1 and H(EW | Y^{(n)}) ≥ H(W | Y^{(n)}), we see that Pr[E = 1] ≥ (H(W | Y^{(n)}) − 1)/m. Combining this with (6.6) gives the result.
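Before moving on to concatenated codes, here is a small simulation sketch of the random linear code construction from Theorem 6.4 over a binary symmetric channel (Python with NumPy is an assumption, and all parameters are toy values chosen for illustration). Brute-force nearest-codeword decoding is used in place of the typical-set decoder; for the binary symmetric channel this is the maximum-likelihood rule.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.05            # flip probability; Cap(C) = 1 - h(p) is about 0.71
    m, n = 8, 24        # rate m/n = 1/3, comfortably below capacity
    trials = 2000

    G = rng.integers(0, 2, size=(n, m))                 # uniform random generator matrix
    msgs = np.array([[(w >> i) & 1 for i in range(m)] for w in range(2 ** m)])
    codebook = (msgs @ G.T) % 2                         # all 2^m codewords, one per row

    errors = 0
    for _ in range(trials):
        w = rng.integers(0, 2 ** m)                     # the all-zero message is included for simplicity
        y = (codebook[w] + (rng.random(n) < p)) % 2     # send G*w through the BSC
        w_hat = int(np.argmin(np.sum(codebook != y, axis=1)))
        errors += (w_hat != w)

    print(f"empirical block error rate: {errors / trials:.4f}")

At these toy block lengths the error probability is of course far from the exponentially small bound of Theorem 6.4, but the sketch illustrates the encoding map w ↦ G·w and the decoding task.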
7 Concatenated Codes

In order to explain concatenated codes, we suppose for a moment that a code is given by an encoding function C : Σ^k → Σ^n.

We then assume we have two codes. First, an outer code C_out : Σ^{k_out} → Σ^{n_out}. Furthermore, we have an inner code C_in : {0,1}^{k_in} → {0,1}^{n_in}, where the assumption that the inner code is binary is just for simplicity. We further assume that Σ = {0,1}^{k_in}. We can then naturally get a "concatenated code" C_out ∘ C_in ⊆ {0,1}^{n_out·n_in} (writing C_out ∘ C_in and not C_in ∘ C_out seems to be standard), which is given by a function D : Σ^{k_out} → {0,1}^{n_out·n_in} as follows: first, map the message (σ_1, . . . , σ_{k_out}) to (τ_1, . . . , τ_{n_out}) = C_out(σ_1, . . . , σ_{k_out}). Then, map each τ_i to a bitstring C_in(τ_i) of length n_in, which gives the final codeword (C_in(τ_1) ‖ . . . ‖ C_in(τ_{n_out})). A graphical explanation is given in Figure 5. Concatenated codes were first studied by Forney in his 1965 PhD thesis.

[Figure 5: Concatenation of codes. The k_out message symbols from Σ are encoded by the outer code C_out into n_out symbols from Σ; each of these n_out symbols is then encoded by the inner code C_in into a block of n_in bits.]

Lemma 7.1. If the outer code C_out has minimum distance d_out and the inner code C_in has minimum distance d_in, then C_out ∘ C_in has minimum distance at least d_out · d_in.

Exercise 7.2. Argue that this bound is not always tight.

7.1 Asymptotically Good Codes

We say that a family of codes {C_n} with C_n ⊆ F_2^n is an "efficient code of relative distance δ" if it has relative minimum distance at least δ, there exists a polynomial time encoding, and there exists a polynomial time decoding procedure which decodes errors up to relative distance at least δ/2.

Theorem 7.3 (Zyablov Codes). For every rate r < 1 and n > n_0(r) there exists an efficient linear code C ⊆ F_2^n with dimension k > r · n and relative distance at least δ(r) > 0. For every δ < 1/2 and every n > n_1(δ) there exists an efficient linear code C ⊆ F_2^n with dimension k > r(δ) · n, where r(δ) > 0, and relative distance at least δ.

Proof. We use a concatenated code. As outer code, we use a Reed-Solomon code over F_{2^m} for large enough m with relative distance δ_out < 1. As inner code, we use a linear code as given by Theorem 2.2 with dimension m and relative distance δ_in < 1/2. The length of the outer code is maximal, i.e., n_out = 2^m. If the dimension is k_out, the relative distance is δ_out = 1 − k_out/n_out + 1/n_out ≈ 1 − k_out/n_out. By Theorem 2.2, the inner code encodes messages of length m into codewords of length roughly m/(1 − h(δ_in)). Thus, the total length of the code is n_out · m/(1 − h(δ_in)), and it can encode binary messages of length k_out · m. This means that the rate of the code is roughly (1 − δ_out)(1 − h(δ_in)). Furthermore, the relative distance is at least δ_in·δ_out. For a given overall relative distance δ ∈ [0, 1/2), one thus looks for δ_out ∈ [0, 1) and δ_in ∈ [0, 1/2) which maximize (1 − δ_out)(1 − h(δ_in)) subject to the constraint that δ_in·δ_out = δ. The result is plotted in Figure 4 and called the "Zyablov radius". Note that both results of the theorem are implied by the curve (and can be easily seen from the optimization). Finally, we note that encoding, and decoding up to relative distance (δ_in/2)·(δ_out/2), can be done in polynomial time (using Theorem 2.4, exhaustive search for decoding the inner code, and Theorem 4.8).
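The optimization at the end of the proof is easy to carry out numerically. The following sketch (assuming Python with NumPy; the function name zyablov_rate is not from the notes) computes, for a few target relative distances δ, the best rate (1 − δ_out)(1 − h(δ_in)) subject to δ_in·δ_out = δ; the resulting curve is the Zyablov radius mentioned above.

    import numpy as np

    def h(x):
        # Binary entropy in bits, clipping to avoid log(0).
        x = np.clip(x, 1e-12, 1 - 1e-12)
        return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

    def zyablov_rate(delta, grid=100000):
        # Parametrize by delta_in in (delta, 1/2); then delta_out = delta / delta_in < 1.
        d_in = np.linspace(delta + 1e-9, 0.5 - 1e-9, grid)
        rates = (1 - delta / d_in) * (1 - h(d_in))
        return float(np.max(rates))

    for delta in (0.05, 0.1, 0.2, 0.3, 0.4):
        print(f"delta = {delta:.2f}  ->  Zyablov rate ~ {zyablov_rate(delta):.4f}")

For every δ < 1/2 the optimum is strictly positive, which is exactly the second statement of Theorem 7.3; reading the curve the other way around gives the first statement.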
7.2 Constructions with Varying Inner Codes

In general, there is no need to use the same inner code C_in for each symbol of the outer code. There are two constructions we want to mention quickly which use this idea.

The Wozencraft ensemble is a family of codes which has the property that most of the codes in the family achieve the Gilbert-Varshamov bound. In his 1972 paper "A class of constructive asymptotically good algebraic codes", Justesen proposed to use this family for the inner codes: each position of the outer codeword is encoded using a different member of the family. For some rates, this construction achieves the same parameters as the construction from the previous section, but without the need for exhaustive search.

The second construction is due to Thommesen. He shows that one can use the Reed-Solomon code together with a new random linear code in each position and with high probability achieve the Gilbert-Varshamov bound (the Reed-Solomon code is used with different parameters than in the following construction).

7.3 Capacity Achieving Concatenated Codes

We discuss this as an exercise.

Exercise 7.4. Let C be a binary symmetric channel (as in Exercise 6.3). Fix 0 < ε < 1. Let C_out be the Reed-Solomon code over an alphabet of size 2^m with rate 1 − ε and length 2^m. Furthermore, we pick for each coordinate of the Reed-Solomon code a random linear code of dimension m over {0,1} with rate Cap(C) · (1 − ε). Consider the decoder which first treats each instance of the inner code individually, decoding each received word to the closest codeword. After this, the decoder runs the Berlekamp-Welch decoder on the outer code.

• Prove that for any codeword, the probability (over the choice of the linear codes and the randomness of the channel) that more than an ε/2 fraction of the symbols of the Reed-Solomon codeword are received incorrectly after inner decoding is exponentially small (in the overall message length).

• Argue that the decoder runs in polynomial time.
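The first part of Exercise 7.4 can also be explored with a rough simulation (a sketch only, assuming Python with NumPy; the parameters p, eps, m and the inner block length are illustrative toy values, and the inner blocks are decoded by exhaustive nearest-codeword search). It draws a fresh random linear inner code for each Reed-Solomon coordinate, sends a random symbol over the channel, and reports the fraction of coordinates whose inner decoding is wrong; this fraction should stay well below ε/2.

    import numpy as np

    rng = np.random.default_rng(1)
    p, eps = 0.05, 0.5      # BSC flip probability and the slack eps
    m = 5                   # bits per Reed-Solomon symbol
    n_in = 14               # inner block length, so inner rate 5/14, roughly Cap(C)*(1 - eps)
    n_out = 2 ** m          # length of the Reed-Solomon code

    def inner_block_correct():
        # Fresh random inner code, one random symbol sent over the BSC,
        # decoded to the nearest codeword by brute force.
        G = rng.integers(0, 2, size=(n_in, m))
        msgs = np.array([[(w >> i) & 1 for i in range(m)] for w in range(2 ** m)])
        codebook = (msgs @ G.T) % 2
        w = rng.integers(0, 2 ** m)
        y = (codebook[w] + (rng.random(n_in) < p)) % 2
        return int(np.argmin(np.sum(codebook != y, axis=1))) == w

    for trial in range(5):
        wrong = sum(not inner_block_correct() for _ in range(n_out))
        print(f"fraction of wrong inner symbols: {wrong / n_out:.3f}   (eps/2 = {eps / 2})")

Since the outer Reed-Solomon code with rate 1 − ε can correct up to an ε/2 fraction of symbol errors (for instance via Berlekamp-Welch decoding), a run in which the printed fraction stays below ε/2 corresponds to a successfully decoded codeword.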