ECE 771 Lecture 10 – The Gaussian channel

Objective: In this lecture we will learn about communication over a channel of practical interest, in which the transmitted signal is subjected to additive white Gaussian noise. We will derive the famous capacity formula.

1 The Gaussian channel

Suppose we send information over a channel that is subjected to additive white Gaussian noise. Then the output is

$$Y_i = X_i + Z_i$$

where $Y_i$ is the channel output, $X_i$ is the channel input, and $Z_i$ is zero-mean Gaussian with variance $N$: $Z_i \sim \mathcal{N}(0, N)$. This is different from the channel models we saw before, in that the output can take on a continuum of values. It is also a good model for a variety of practical communication channels.

We will assume that there is a constraint on the input power. If we have an input codeword $(x_1, x_2, \ldots, x_n)$, we will assume that the average power is constrained so that

$$\frac{1}{n} \sum_{i=1}^{n} x_i^2 \le P.$$

Let us consider the probability of error for binary transmission. Suppose that we can send either $+\sqrt{P}$ or $-\sqrt{P}$ over the channel, each with probability 1/2. The receiver looks at the received signal amplitude and determines the signal transmitted using a threshold test. Then

$$
\begin{aligned}
P_e &= \tfrac{1}{2} P(Y < 0 \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Y > 0 \mid X = -\sqrt{P}) \\
    &= \tfrac{1}{2} P(Z < -\sqrt{P} \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Z > \sqrt{P} \mid X = -\sqrt{P}) \\
    &= P(Z > \sqrt{P}) \\
    &= \frac{1}{\sqrt{2\pi N}} \int_{\sqrt{P}}^{\infty} e^{-x^2/2N} \, dx \\
    &= Q\left(\sqrt{P/N}\right) = 1 - \Phi\left(\sqrt{P/N}\right)
\end{aligned}
$$

where

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-t^2/2} \, dt$$

or

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt.$$

Definition 1 The information capacity of the Gaussian channel with power constraint $P$ is

$$C = \max_{p(x):\, EX^2 \le P} I(X; Y).$$

We can compute this as follows:

$$
\begin{aligned}
I(X; Y) &= h(Y) - h(Y \mid X) \\
        &= h(Y) - h(X + Z \mid X) \\
        &= h(Y) - h(Z \mid X) \\
        &= h(Y) - h(Z) \\
        &\le \tfrac{1}{2} \log 2\pi e (P + N) - \tfrac{1}{2} \log 2\pi e N \\
        &= \tfrac{1}{2} \log(1 + P/N)
\end{aligned}
$$

Here $h(Z \mid X) = h(Z)$ because the noise is independent of the input, and the inequality holds since $EY^2 = P + N$ and the Gaussian is the maximum-entropy distribution for a given variance. So

$$C = \tfrac{1}{2} \log(1 + P/N)$$

bits per channel use (with the logarithm taken to base 2). The maximum is obtained when $X$ is Gaussian distributed.
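As a quick numerical check of the two formulas in this section, $P_e = Q(\sqrt{P/N})$ and $C = \tfrac{1}{2}\log_2(1 + P/N)$, the following is a minimal sketch using only the Python standard library. The function names are illustrative choices, not part of the notes, and $Q$ is evaluated through the identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def binary_error_probability(power: float, noise_var: float) -> float:
    """P_e = Q(sqrt(P/N)) for sending +sqrt(P) or -sqrt(P) with a threshold at 0."""
    return q_function(math.sqrt(power / noise_var))


def gaussian_capacity(power: float, noise_var: float) -> float:
    """C = (1/2) log2(1 + P/N), in bits per channel use."""
    return 0.5 * math.log2(1.0 + power / noise_var)


if __name__ == "__main__":
    # Hypothetical signal-to-noise ratios P/N, with N normalized to 1.
    for snr in (1.0, 3.0, 15.0):
        pe = binary_error_probability(snr, 1.0)
        cap = gaussian_capacity(snr, 1.0)
        print(f"P/N = {snr:5.1f}  P_e = {pe:.3e}  C = {cap:.3f} bits/use")
```

Note that the error probability describes uncoded binary signalling, while the capacity is a limit on coded transmission, so the two columns are not directly comparable rates.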
(How do we make the input distribution look Gaussian?)

Definition 2 An $(M, n)$ code for the Gaussian channel with power constraint $P$ consists of the following:

1. An index set $\{1, 2, \ldots, M\}$.
2. An encoding function $x : \{1, \ldots, M\} \to \mathcal{X}^n$, which maps an input index into a sequence that is $n$ elements long, $x^n(1), x^n(2), \ldots, x^n(M)$, such that the average power constraint is satisfied:
   $$\sum_{i=1}^{n} (x_i(w))^2 \le nP$$
   for $w = 1, 2, \ldots, M$.
3. A decoding function $g : \mathcal{Y}^n \to \{1, 2, \ldots, M\}$.

Definition 3 A rate $R$ is said to be achievable for a Gaussian channel with a power constraint $P$ if there exists a sequence of $(2^{nR}, n)$ codes with codewords satisfying the power constraint such that the maximal probability of error $\lambda^{(n)} \to 0$. The capacity of the channel is the supremum of the achievable rates.

Theorem 1 The capacity of a Gaussian channel with power constraint $P$ and noise variance $N$ is

$$C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \text{ bits per transmission.}$$

Geometric plausibility. For a codeword of length $n$, the received vector (in $n$-space) is normally distributed with mean equal to the true codeword. With high probability, the received vector is contained in a sphere about the mean of radius $\sqrt{n(N + \epsilon)}$. Why? Because with high probability, the vector falls within one standard deviation away from the mean in each direction, and the total distance away is the Euclidean sum:

$$E[z_1^2 + z_2^2 + \cdots + z_n^2] = nN.$$

This is the square of the expected distance within which we expect to fall. If we assign everything within this sphere to the given codeword, we misdetect only if we fall outside this sphere.

Other codewords will have other spheres, each with radius approximately $\sqrt{n(N + \epsilon)}$. The codewords are limited in power to $P$, so the received vectors have energy at most about $n(P + N)$ and must all lie in a sphere of radius $\sqrt{n(P + N)}$.
The number of (approximately) nonintersecting decoding spheres is therefore

$$\text{number of spheres} \approx \frac{\text{volume of sphere in } n\text{-space with radius } r = \sqrt{n(P + N)}}{\text{volume of sphere in } n\text{-space with radius } r = \sqrt{n(N + \epsilon)}}.$$

The volume of a sphere of radius $r$ in $n$-space is proportional to $r^n$. Substituting in this fact we get

$$\text{number of spheres} \approx \frac{(n(P + N))^{n/2}}{(n(N + \epsilon))^{n/2}} \approx 2^{\frac{n}{2} \log\left(1 + \frac{P}{N}\right)}.$$

Proof We will follow essentially the same steps as before.

1. First we generate a codebook at random. This time we generate the codebook according to the Gaussian distribution: let $X_i(w)$, $i = 1, 2, \ldots, n$, be the code sequence corresponding to input index $w$, where each $X_i(w)$ is selected at random i.i.d. according to $\mathcal{N}(0, P - \epsilon)$. (With high probability, this has average power at most $P$.) The codebook is known by both transmitter and receiver.
2. Encode as described above.
3. The receiver gets a $Y^n$, looks at the list of codewords $\{X^n(w)\}$, and searches for one which is jointly typical with the received vector. If there is only one such vector, it is declared as the transmitted vector. If there is more than one such vector, an error is declared. An error is also declared if the chosen codeword does not satisfy the power constraint.

For the probability of error, assume w.l.o.g. that codeword 1 is sent:

$$Y^n = X^n(1) + Z^n.$$

Define the following events:

$$E_0 = \left\{ \frac{1}{n} \sum_{i=1}^{n} X_i^2(1) > P \right\}$$

(the event that the codeword exceeds the power constraint) and

$$E_i = \left\{ (X^n(i), Y^n) \text{ is in } A_\epsilon^{(n)} \right\}.$$

The probability of error is then

$$
\begin{aligned}
P(E) &= P(E_0 \cup E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}) \\
     &\le P(E_0) + P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i) \qquad \text{(union bound)}
\end{aligned}
$$

By the LLN, $P(E_0) \to 0$, so $P(E_0) \le \epsilon$ for $n$ sufficiently large. By the joint AEP, $P(E_1^c) \to 0$, so $P(E_1^c) \le \epsilon$ for $n$ sufficiently large. By the code generation process, $X^n(1)$ and $X^n(i)$ are independent, and so are $Y^n$ and $X^n(i)$, $i \ne 1$. So the probability that $X^n(i)$ and $Y^n$ are jointly typical is $\le 2^{-n(I(X;Y) - 3\epsilon)}$ by the joint AEP.
So

$$
\begin{aligned}
P_e^{(n)} &\le \epsilon + \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + (2^{nR} - 1)\, 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + 2^{nR}\, 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 3\epsilon
\end{aligned}
$$

for $n$ sufficiently large, if $R < I(X; Y) - 3\epsilon$. This gives the average probability of error: we then go through the same kinds of arguments as before to conclude that the maximum probability of error also must go to zero.

The converse is that rates $R > C$ are not achievable, or, equivalently, that if $P_e^{(n)} \to 0$ then it must be that $R \le C$.

Proof The proof starts with Fano's inequality:

$$H(W \mid Y^n) \le 1 + nR P_e^{(n)} = n\epsilon_n$$

where

$$\epsilon_n = \frac{1}{n} + R P_e^{(n)}$$

and $\epsilon_n \to 0$ as $n \to \infty$. Let $P_i$ denote the average power of the $i$th symbol over the codebook, so that $\frac{1}{n} \sum_i P_i \le P$. The proof is a string of inequalities:

$$
\begin{aligned}
nR = H(W) &= I(W; Y^n) + H(W \mid Y^n) && \text{uniform } W \text{; definition of } I \\
  &\le I(W; Y^n) + n\epsilon_n && \text{Fano's inequality} \\
  &\le I(X^n; Y^n) + n\epsilon_n && \text{data processing} \\
  &= h(Y^n) - h(Y^n \mid X^n) + n\epsilon_n \\
  &= h(Y^n) - h(Z^n) + n\epsilon_n \\
  &\le \sum_{i=1}^{n} h(Y_i) - h(Z^n) + n\epsilon_n \\
  &= \sum_{i=1}^{n} h(Y_i) - \sum_{i=1}^{n} h(Z_i) + n\epsilon_n \\
  &\le \sum_{i=1}^{n} \left[ \tfrac{1}{2} \log 2\pi e (P_i + N) - \tfrac{1}{2} \log 2\pi e N \right] + n\epsilon_n && \text{entropies of } Y \text{ and } Z \text{; power constraint} \\
  &= \sum_{i=1}^{n} \tfrac{1}{2} \log(1 + P_i/N) + n\epsilon_n \\
  &= n \left( \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2} \log(1 + P_i/N) \right) + n\epsilon_n \\
  &\le n\, \tfrac{1}{2} \log\left(1 + \frac{1}{n} \sum_{i=1}^{n} P_i/N \right) + n\epsilon_n && \text{Jensen's} \\
  &\le n\, \tfrac{1}{2} \log(1 + P/N) + n\epsilon_n.
\end{aligned}
$$

Dividing through by $n$,

$$R \le \tfrac{1}{2} \log(1 + P/N) + \epsilon_n.$$

2 Band-limited channels

We now come to the first time in the book where the information is actually carried by a time-waveform, instead of a random variable. We will consider transmission over a band-limited channel (such as a phone channel). A key result is the sampling theorem:

Theorem 2 If $f(t)$ is bandlimited to $W$ Hz, then the function is completely determined by samples of the function taken $\frac{1}{2W}$ seconds apart.

This is the classical Nyquist sampling theorem. However, Shannon's name is also attached to it, since he provided a proof and used it. A representation of the function $f(t)$ is

$$f(t) = \sum_{n} f\left(\frac{n}{2W}\right) \operatorname{sinc}\left(t - \frac{n}{2W}\right)$$

where

$$\operatorname{sinc}(t) = \frac{\sin(2\pi W t)}{2\pi W t}.$$

From this theorem, we conclude (the dimensionality theorem) that a bandlimited function has only $2W$ degrees of freedom per second.
For a signal which has "most" of the energy in bandwidth $W$ and "most" of the energy in a time $T$, there are about $2WT$ degrees of freedom, and the time- and band-limited function can be represented using $2WT$ orthogonal basis functions, known as the prolate spheroidal functions. We can view band- and time-limited functions as vectors in a $2TW$-dimensional vector space.

Assume that the noise power-spectral density of the channel is $N_0/2$. Then the noise power is $(N_0/2)(2W) = N_0 W$. Over the time interval of $T$ seconds, the signal energy per sample (per channel use) is

$$\frac{PT}{2WT} = \frac{P}{2W},$$

and the noise energy per sample is $N_0 W T / (2WT) = N_0/2$. Use this information in the capacity:

$$
\begin{aligned}
C &= \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \text{ bits per channel use} \\
  &= \frac{1}{2} \log\left(1 + \frac{P}{N_0 W}\right) \text{ bits per channel use.}
\end{aligned}
$$

There are $2W$ samples each second (channel uses), so the capacity is

$$C = (2W)\, \frac{1}{2} \log\left(1 + \frac{P}{N_0 W}\right) \text{ bits/second}$$

or

$$C = W \log\left(1 + \frac{P}{N_0 W}\right).$$

This is the famous and key result of information theory. As $W \to \infty$, we have to do a little calculus to find that

$$C = \frac{P}{N_0} \log_2 e \text{ bits per second.}$$

This is interesting: even with infinite bandwidth, the capacity is not infinite, but grows linearly with the power.

Example 1 For a phone channel, take $W = 3300$ Hz. If the SNR is $P/(N_0 W) = 40$ dB $= 10000$, we get $C = 43850$ bits per second. If $P/(N_0 W) = 20$ dB $= 100$, we get $C = 21972$ bits/second. (The book is dated.)

We cannot do better than capacity!

3 Kuhn-Tucker Conditions

Before proceeding with the next section, we need a result from constrained optimization theory known as the Kuhn-Tucker condition. Suppose we are minimizing some convex objective function $L(x)$,

$$\min L(x)$$

subject to a constraint

$$f(x) \le 0.$$

Let the optimal value of $x$ be $x_0$. Then either the constraint is inactive, in which case we get

$$\left. \frac{\partial L}{\partial x} \right|_{x_0} = 0$$

or, if the constraint is active, it must be the case that the objective function increases for all admissible values of $x$:

$$\left. \frac{\partial L}{\partial x} \right|_{x \in A} \ge 0$$

where $A$ is the set of admissible values, for which

$$\frac{\partial f}{\partial x} \le 0.$$

(Think about what happens if this is not the case.) Thus,

$$\operatorname{sgn} \frac{\partial L}{\partial x} = -\operatorname{sgn} \frac{\partial f}{\partial x}$$

or

$$\frac{\partial L}{\partial x} + \lambda \frac{\partial f}{\partial x} = 0, \qquad \lambda \ge 0.$$

We can create a new objective function

$$J(x, \lambda) = L(x) + \lambda f(x),$$

so the necessary conditions become

$$\frac{\partial J}{\partial x} = 0 \qquad (1)$$

and

$$f(x) \le 0$$

where

$$
\lambda
\begin{cases}
\ge 0 & f(x) = 0 \quad \text{(constraint is active)} \\
= 0 & f(x) < 0 \quad \text{(constraint is inactive).}
\end{cases}
$$

For a vector variable $x$, condition (1) means:

$\frac{\partial L}{\partial x}$ is parallel to $\frac{\partial f}{\partial x}$ and pointing in the opposite direction,

where $\frac{\partial L}{\partial x}$ is interpreted as the gradient. In words, what condition (1) says is: the gradient of $L$ with respect to $x$ at a minimum must be pointed in such a way that a decrease of $L$ can only come by violating the constraints. Otherwise, we could decrease $L$ further. This is the essence of the Kuhn-Tucker condition.

4 Parallel Gaussian channels

Parallel Gaussian channels are used to model bandlimited channels with a non-flat frequency response. We assume we have $k$ Gaussian channels,

$$Y_j = X_j + Z_j, \qquad j = 1, 2, \ldots, k,$$

where

$$Z_j \sim \mathcal{N}(0, N_j)$$

and the channels are independent. The total power used is constrained:

$$E \sum_{j=1}^{k} X_j^2 \le P.$$

One question we might ask is: how do we distribute the power across the $k$ channels to get maximum throughput? Writing $P_i = E X_i^2$ for the power used on channel $i$, we can find the maximum mutual information (the information channel capacity) as

$$
\begin{aligned}
I(X_1, \ldots, X_k; Y_1, \ldots, Y_k) &= h(Y_1, \ldots, Y_k) - h(Y_1, \ldots, Y_k \mid X_1, \ldots, X_k) \\
  &= h(Y_1, \ldots, Y_k) - h(Z_1, \ldots, Z_k) \\
  &= h(Y_1, \ldots, Y_k) - \sum_{i=1}^{k} h(Z_i) \\
  &\le \sum_{i=1}^{k} \left[ h(Y_i) - h(Z_i) \right] \\
  &\le \sum_{i} \frac{1}{2} \log(1 + P_i/N_i).
\end{aligned}
$$

Equality is obtained when the $X$s are independent and normally distributed. We want to distribute the power available among the various channels, subject to not exceeding the power constraint:

$$J(P_1, \ldots, P_k) = \sum_{i} \frac{1}{2} \log\left(1 + \frac{P_i}{N_i}\right) + \lambda \sum_{i=1}^{k} P_i$$

with a side constraint (not shown) that $P_i \ge 0$. Differentiate w.r.t. $P_j$ to obtain

$$\frac{1}{2} \frac{1}{P_j + N_j} + \lambda \le 0,$$

with equality whenever the side constraint $P_j \ge 0$ is inactive, that is, whenever $P_j > 0$. (The inequality is reversed relative to Section 3 because $J$ is being maximized here rather than minimized.)
After some fiddling, we obtain

$$P_j = \nu - N_j$$

(since $\lambda$ is a constant). However, we must also have $P_j \ge 0$, so we must ensure that we don't violate that if $N_j > \nu$. Thus, we let

$$P_j = (\nu - N_j)^+$$

where

$$
(x)^+ =
\begin{cases}
x & x \ge 0 \\
0 & x < 0
\end{cases}
$$

and $\nu$ is chosen so that

$$\sum_{i=1}^{k} (\nu - N_i)^+ = P.$$

Draw picture; explain "water filling."

5 Channels with colored Gaussian noise

We will now extend the results of the previous section to channels with non-white Gaussian noise. Let $K_Z$ be the covariance of the noise and $K_X$ the covariance of the input, with the input constrained by

$$\frac{1}{n} \sum_{i} E X_i^2 \le P$$

which is the same as

$$\frac{1}{n} \operatorname{tr}(K_X) \le P.$$

We can write

$$I(X_1, \ldots, X_n; Y_1, \ldots, Y_n) = h(Y_1, \ldots, Y_n) - h(Z_1, \ldots, Z_n)$$

where

$$h(Y_1, \ldots, Y_n) \le \frac{1}{2} \log\left((2\pi e)^n |K_X + K_Z|\right).$$

Now how do we choose $K_X$ to maximize $|K_X + K_Z|$, subject to the power constraint? Let

$$K_Z = Q \Lambda Q^T$$

with $Q$ orthogonal and $\Lambda$ diagonal; then

$$
\begin{aligned}
|K_X + K_Z| &= |K_X + Q \Lambda Q^T| \\
  &= |Q|\, |Q^T K_X Q + \Lambda|\, |Q^T| \\
  &= |Q^T K_X Q + \Lambda| \\
  &= |A + \Lambda|
\end{aligned}
$$

where $A = Q^T K_X Q$. Observe that

$$\operatorname{tr}(A) = \operatorname{tr}(Q^T K_X Q) = \operatorname{tr}(Q Q^T K_X) = \operatorname{tr}(K_X).$$

So we want to maximize $|A + \Lambda|$ subject to $\operatorname{tr}(A) \le nP$. The key is to use an inequality, in this case Hadamard's inequality. Hadamard's inequality follows directly from the "conditioning reduces entropy" theorem:

$$h(X_1, \ldots, X_n) \le \sum_{i} h(X_i).$$

Let $X \sim \mathcal{N}(0, K)$. Then

$$h(X) = \frac{1}{2} \log\left((2\pi e)^n |K|\right)$$

and

$$h(X_i) = \frac{1}{2} \log\left((2\pi e) K_{ii}\right).$$

Substituting in and simplifying gives

$$|K| \le \prod_{i} K_{ii}$$

with equality iff $K$ is diagonal. Getting back to our problem,

$$|A + \Lambda| \le \prod_{i} (A_{ii} + \Lambda_{ii})$$

with equality iff $A$ is diagonal. We have

$$\frac{1}{n} \sum_{i} A_{ii} \le P$$

(the power constraint), and $A_{ii} \ge 0$. As before, we take

$$A_{ii} = (\nu - \lambda_i)^+$$

where $\lambda_i = \Lambda_{ii}$ are the eigenvalues of $K_Z$ and $\nu$ is chosen so that

$$\sum_{i} A_{ii} = nP.$$

Now we want to generalize to a continuous-time system. Consider a channel with additive Gaussian noise and covariance matrix $K_Z^{(n)}$. If the channel noise process is stationary, then the covariance matrix is Toeplitz, and the eigenvalues of the covariance matrix tend to a limit as $n \to \infty$.
The density of the eigenvalues on the real line tends to the power spectrum of the stochastic process. That is, if $K_{ij} = K_{i-j}$ are the autocorrelation values $r_k = K_k$ and the power spectrum is $S(\omega) = \mathcal{F}[r_k]$, then

$$\lim_{M \to \infty} \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_M}{M} = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) \, d\omega.$$

In this case, the water filling translates to water filling in the spectral domain. The capacity of the channel with noise spectrum $N(f)$ can be shown to be

$$C = \int \frac{1}{2} \log\left(1 + \frac{(\nu - N(f))^+}{N(f)}\right) df$$

where $\nu$ is chosen so that

$$\int (\nu - N(f))^+ \, df = P.$$
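The water-filling rule $P_j = (\nu - N_j)^+$ used for the parallel channels (and, with noise eigenvalues or samples of a noise spectrum in place of $N_j$, for the colored-noise case) can be sketched numerically as follows. This is a minimal illustration, not the notes' own procedure: the water level $\nu$ is found by bisection, which is one of several reasonable choices, and the function names and the noise levels in the example are hypothetical.

```python
import math


def water_fill(noise, total_power, iterations=100):
    """Return (nu, powers) with powers[j] = max(nu - noise[j], 0) summing to total_power."""
    # The water level lies between the lowest noise floor (no power used)
    # and the highest floor plus the whole budget (more than enough power used).
    lo, hi = min(noise), max(noise) + total_power
    for _ in range(iterations):
        nu = 0.5 * (lo + hi)
        used = sum(max(nu - n_j, 0.0) for n_j in noise)
        if used > total_power:
            hi = nu  # water level too high
        else:
            lo = nu  # water level too low (or exactly right)
    nu = 0.5 * (lo + hi)
    return nu, [max(nu - n_j, 0.0) for n_j in noise]


def parallel_capacity(noise, powers):
    """Sum over channels of (1/2) log2(1 + P_j/N_j), in bits per use of the k channels."""
    return sum(0.5 * math.log2(1.0 + p_j / n_j) for p_j, n_j in zip(powers, noise))


if __name__ == "__main__":
    noise_levels = [1.0, 2.0, 5.0]  # hypothetical noise variances N_j
    nu, powers = water_fill(noise_levels, total_power=3.0)
    print("water level:", nu)
    print("powers:", powers)
    print("capacity:", parallel_capacity(noise_levels, powers))
```

In this example the noisiest channel sits above the water level and receives no power, which is the behaviour the $(\cdot)^+$ operation encodes.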