ECE 771
Lecture 10 – The Gaussian channel
Objective: In this lecture we will learn about communication over a channel of
practical interest, in which the transmitted signal is subjected to additive white
Gaussian noise. We will derive the famous capacity formula.
1 The Gaussian channel
Suppose we send information over a channel that is subjected to additive white
Gaussian noise. Then the output is
Yi = Xi + Zi
where Yi is the channel output, Xi is the channel input, and Zi is zero-mean Gaussian with variance N : Zi ∼ N (0, N ). This is different from channel models we saw
before, in that the output can take on a continuum of values. This is also a good
model for a variety of practical communication channels.
We will assume that there is a constraint on the input power. If we have an input
codeword (x_1, x_2, . . . , x_n), we will assume that the average power is constrained
so that
$$ \frac{1}{n}\sum_{i=1}^{n} x_i^2 \le P. $$
Let us consider the probability of error for binary transmission. Suppose that we
can send either $+\sqrt{P}$ or $-\sqrt{P}$ over the channel. The receiver looks at the received
signal amplitude and determines the signal transmitted using a threshold test. Then
$$
\begin{aligned}
P_e &= \tfrac{1}{2} P(Y < 0 \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Y > 0 \mid X = -\sqrt{P}) \\
    &= \tfrac{1}{2} P(Z < -\sqrt{P} \mid X = +\sqrt{P}) + \tfrac{1}{2} P(Z > \sqrt{P} \mid X = -\sqrt{P}) \\
    &= P(Z > \sqrt{P}) \\
    &= \int_{\sqrt{P}}^{\infty} \frac{1}{\sqrt{2\pi N}}\, e^{-x^2/2N}\, dx \\
    &= Q(\sqrt{P/N}) = 1 - \Phi(\sqrt{P/N}),
\end{aligned}
$$
where
$$
Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt
\qquad \text{and} \qquad
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.
$$
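As a quick numerical sanity check (not from the notes), this error probability can be evaluated with SciPy's normal survival function; the values of P and N below are arbitrary illustrative choices.

# Sketch: P_e = Q(sqrt(P/N)) for antipodal signaling over the Gaussian channel.
# P and N are illustrative values, not taken from the lecture.
import numpy as np
from scipy.stats import norm

P, N = 1.0, 0.25                         # signal power and noise variance
Pe = norm.sf(np.sqrt(P / N))             # Q(x) = 1 - Phi(x) is the normal survival function
print(f"P_e = Q(sqrt(P/N)) = {Pe:.4e}")  # about 2.3e-2 for P/N = 4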
Definition 1 The information capacity of the Gaussian channel with power
constraint P is
$$ C = \max_{p(x):\, E X^2 \le P} I(X; Y). $$
We can compute this as follows:
$$
\begin{aligned}
I(X; Y) &= h(Y) - h(Y \mid X) \\
        &= h(Y) - h(X + Z \mid X) \\
        &= h(Y) - h(Z \mid X) \\
        &= h(Y) - h(Z) \\
        &\le \tfrac{1}{2}\log 2\pi e(P + N) - \tfrac{1}{2}\log 2\pi e N \\
        &= \tfrac{1}{2}\log(1 + P/N),
\end{aligned}
$$
since $EY^2 = P + N$ and the Gaussian is the maximum-entropy distribution for a
given variance. So
$$ C = \frac{1}{2}\log(1 + P/N) $$
bits per channel use. The maximum is obtained when X is Gaussian distributed.
(How do we make the input distribution look Gaussian?)
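A minimal numerical illustration (not part of the notes): evaluating the capacity formula for a few illustrative signal-to-noise ratios; the function name and values are arbitrary.

# Sketch: C = (1/2) log2(1 + P/N) bits per channel use.
import numpy as np

def gaussian_capacity(P, N):
    """Capacity in bits per channel use for power P and noise variance N."""
    return 0.5 * np.log2(1.0 + P / N)

for snr in (1.0, 10.0, 100.0):           # illustrative P/N values
    print(f"P/N = {snr:6.1f}  ->  C = {gaussian_capacity(snr, 1.0):.3f} bits per channel use")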
Definition 2 An (M, n) code for the Gaussian channel with power constraint P
consists of the following:
1. An index set {1, 2, . . . , M }.
2. An encoding function x : {1, . . . , M } → X^n, which maps an input index into
a sequence that is n elements long, x^n(1), x^n(2), . . . , x^n(M), such that the
average power constraint is satisfied:
$$ \sum_{i=1}^{n} (x_i(w))^2 \le nP \qquad \text{for } w = 1, 2, \ldots, M. $$
3. A decoding function g : Y^n → {1, 2, . . . , M }.
Definition 3 A rate R is said to be achievable for a Gaussian channel with
a power constraint P if there exists a sequence of (2^{nR}, n) codes with codewords
satisfying the power constraint such that the maximal probability of error λ^{(n)} → 0.
The capacity of the channel is the supremum of the achievable rates.
Theorem 1 The capacity of a Gaussian channel with power constraint P and noise
variance N is
$$ C = \frac{1}{2}\log\!\left(1 + \frac{P}{N}\right) \quad \text{bits per transmission.} $$
Geometric plausibility For a codeword of length n, the received vector (in n-space)
is normally distributed with mean equal to the true codeword. With high
probability, the received vector is contained in a sphere about the mean of radius
$\sqrt{n(N + \epsilon)}$. Why? Because with high probability, the vector falls within about one
standard deviation away from the mean in each direction, and the total squared distance
away has expectation
$$ E[Z_1^2 + Z_2^2 + \cdots + Z_n^2] = nN. $$
This is the square of the distance within which we expect to fall. If we
assign everything within this sphere to the given codeword, we misdetect only if the
received vector falls outside this sphere.
Other codewords will have other spheres, each with radius approximately $\sqrt{n(N + \epsilon)}$.
Since the transmitted codewords are limited in power to P, the received
vectors all must lie (with high probability) in a sphere of radius $\sqrt{n(P + N)}$. The number
of (approximately) nonintersecting decoding spheres is therefore
$$ \text{number of spheres} \approx
\frac{\text{volume of sphere in $n$-space with radius } \sqrt{n(P+N)}}
     {\text{volume of sphere in $n$-space with radius } \sqrt{n(N+\epsilon)}}. $$
The volume of a sphere of radius r in n-space is proportional to $r^n$. Substituting
in this fact we get
$$ \text{number of spheres} \approx \frac{(n(P+N))^{n/2}}{(n(N+\epsilon))^{n/2}}
\approx 2^{\frac{n}{2}\log(1 + P/N)}. $$
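As an aside (not in the original notes), one can check numerically that the ratio of sphere volumes grows at the rate $2^{\frac{n}{2}\log_2(1+P/N)}$; the block length, powers, and ε below are arbitrary illustrative choices.

# Sketch: compare the sphere-counting ratio with 2^{(n/2) log2(1 + P/N)}.
# n, P, N, eps are illustrative values, not from the notes.
import numpy as np

n, P, N, eps = 100, 1.0, 0.25, 1e-3
log2_volume_ratio = (n / 2) * (np.log2(n * (P + N)) - np.log2(n * (N + eps)))
log2_exponent = (n / 2) * np.log2(1 + P / N)
print(f"log2(number of spheres) ~ {log2_volume_ratio:.1f}, (n/2) log2(1+P/N) = {log2_exponent:.1f}")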
Proof We will follow essentially the same steps as before.
1. First we generate a codebook at random. This time we generate the codebook
according to the Gaussian distribution: let X_i(w), i = 1, 2, . . . , n be the code
sequence corresponding to input index w, where each X_i(w) is selected at
random i.i.d. according to N(0, P − ε). (With high probability, the average
power of such a codeword is close to P − ε, so the power constraint is satisfied.)
The codebook is known by both transmitter and receiver.
2. Encode as described above.
3. The receiver gets a Y n , and looks at the list of codewords {X n (w)} and
searches for one which is jointly typical with the received vector. If there is
only one such vector, it is declared as the transmitted vector. If there is more
than one such vector, an error is declared. An error is also declared if the
chosen codeword does not satisfy the power constraint.
For the probability of error, assume w.l.o.g. that codeword 1 is sent:
$$ Y^n = X^n(1) + Z^n. $$
Define the following events:
$$ E_0 = \left\{ \frac{1}{n}\sum_{i=1}^{n} X_i^2(1) > P \right\} $$
(the event that the codeword exceeds the power constraint) and
$$ E_i = \left\{ (X^n(i), Y^n) \text{ is in } A_\epsilon^{(n)} \right\}. $$
The probability of error is then
$$
\begin{aligned}
P(E) &= P(E_0 \cup E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}) \\
     &\le P(E_0) + P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i) && \text{union bound.}
\end{aligned}
$$
By the LLN, P(E_0) → 0. By the joint AEP, P(E_1^c) → 0, so P(E_1^c) ≤ ε for n sufficiently
large. By the code generation process, X^n(1) and X^n(i) are independent, and so are
Y^n and X^n(i), i ≠ 1. So the probability that X^n(i) and Y^n are jointly typical is
≤ 2^{−n(I(X;Y)−3ε)} by the joint AEP. So
$$
\begin{aligned}
P_e^{(n)} &\le \epsilon + \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + (2^{nR} - 1)\, 2^{-n(I(X;Y) - 3\epsilon)} \\
          &\le 2\epsilon + 2^{nR}\, 2^{-n(I(X;Y) - 3\epsilon)} \le 3\epsilon
\end{aligned}
$$
for n sufficiently large, if R < I(X; Y) − 3ε.
This gives the average probability of error: we then go through the same kinds
of arguments as before to conclude that the maximum probability of error also must
go to zero.
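To make the random-coding argument concrete, here is a small simulation sketch (not from the notes). It uses minimum-distance (nearest-neighbor) decoding in place of the joint-typicality decoder of the proof, which for the Gaussian channel is the maximum-likelihood rule; the block length, rate, powers, and trial count are arbitrary illustrative choices.

# Sketch: random Gaussian codebook with nearest-neighbor (ML) decoding,
# standing in for the joint-typicality decoder used in the proof.
# n, R, P, N, eps and the number of trials are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, P, N = 20, 1.0, 0.25                    # block length, signal power, noise variance
C = 0.5 * np.log2(1 + P / N)               # about 1.16 bits per channel use
R = 0.5                                    # operate below capacity
M = 2 ** int(n * R)                        # number of codewords, 2^{nR}

eps = 0.05
codebook = rng.normal(0.0, np.sqrt(P - eps), size=(M, n))  # X_i(w) i.i.d. N(0, P - eps)

errors, trials = 0, 500
for _ in range(trials):
    w = rng.integers(M)                    # transmitted index
    y = codebook[w] + rng.normal(0.0, np.sqrt(N), size=n)
    w_hat = np.argmin(np.sum((codebook - y) ** 2, axis=1))  # nearest codeword
    errors += (w_hat != w)

print(f"C ~ {C:.2f} bits/use, R = {R}, estimated error rate = {errors / trials:.3f}")
# The estimated error rate should be small, and it shrinks as n grows
# (with M = 2^{nR}), provided R < C.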
The converse is that rates R > C are not achievable, or, equivalently, that if
$P_e^{(n)} \to 0$ then it must be that R ≤ C.
Proof The proof starts with Fano's inequality:
$$ H(W \mid Y^n) \le 1 + nR P_e^{(n)} = n\epsilon_n, $$
where
$$ \epsilon_n = \frac{1}{n} + R P_e^{(n)} $$
and ε_n → 0 as n → ∞.
The proof is a string of inequalities:
$$
\begin{aligned}
nR = H(W) &= I(W; Y^n) + H(W \mid Y^n) && \text{uniform } W;\ \text{definition of } I \\
&\le I(W; Y^n) + n\epsilon_n && \text{Fano's inequality} \\
&\le I(X^n; Y^n) + n\epsilon_n && \text{data processing: } W \to X^n \to Y^n \\
&= h(Y^n) - h(Y^n \mid X^n) + n\epsilon_n \\
&= h(Y^n) - h(Z^n) + n\epsilon_n \\
&\le \sum_{i=1}^{n} h(Y_i) - h(Z^n) + n\epsilon_n \\
&= \sum_{i=1}^{n} h(Y_i) - \sum_{i=1}^{n} h(Z_i) + n\epsilon_n \\
&\le \sum_{i=1}^{n} \left[ \tfrac{1}{2}\log 2\pi e(P_i + N) - \tfrac{1}{2}\log 2\pi e N \right] + n\epsilon_n && \text{entropies of } Y \text{ and } Z \\
&= \sum_{i=1}^{n} \tfrac{1}{2}\log(1 + P_i/N) + n\epsilon_n \\
&= n\left( \frac{1}{n}\sum_{i=1}^{n} \tfrac{1}{2}\log(1 + P_i/N) \right) + n\epsilon_n \\
&\le \frac{n}{2}\log\!\left(1 + \frac{1}{n}\sum_{i=1}^{n} P_i/N \right) + n\epsilon_n && \text{Jensen's inequality} \\
&\le \frac{n}{2}\log(1 + P/N) + n\epsilon_n, && \text{power constraint: } \tfrac{1}{n}\textstyle\sum_i P_i \le P
\end{aligned}
$$
where $P_i = E X_i^2$.
Dividing through by n,
$$ R \le \frac{1}{2}\log(1 + P/N) + \epsilon_n. $$
2 Band-limited channels
We now come to the first time in the book where the information is actually carried
by a time-waveform, instead of a random variable. We will consider transmission
over a band-limited channel (such as a phone channel). A key result is the sampling
theorem:
Theorem 2 If f(t) is bandlimited to W Hz, then the function is completely determined
by samples of the function taken 1/(2W) seconds apart.
This is the classical Nyquist sampling theorem. However, Shannon’s name is also
attached to it, since he provided a proof and used it. A representation of the
function f(t) is
$$ f(t) = \sum_{n} f\!\left(\frac{n}{2W}\right) \operatorname{sinc}\!\left(t - \frac{n}{2W}\right), $$
where
$$ \operatorname{sinc}(t) = \frac{\sin(2\pi W t)}{2\pi W t}. $$
From this theorem, we conclude (the dimensionality theorem) that a bandlimited
function has only 2W degrees of freedom per second.
For a signal which has “most” of its energy in bandwidth W and “most” of
its energy in a time interval T, there are about 2WT degrees of freedom, and the
time- and band-limited function can be represented using 2WT orthogonal basis
functions, known as the prolate spheroidal functions. We can view band- and
time-limited functions as vectors in a 2TW-dimensional vector space.
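A small illustration (not in the notes): reconstructing a bandlimited signal from its samples with the sinc interpolation formula above. The test tone and bandwidth W are arbitrary choices, and the infinite interpolation sum is truncated, so a small residual error remains.

# Sketch: sinc reconstruction of a bandlimited signal from samples taken
# every 1/(2W) seconds. The test tone and W are illustrative values.
import numpy as np

W = 4.0                                    # bandwidth in Hz
Ts = 1.0 / (2 * W)                         # sampling interval
n = np.arange(-200, 201)                   # sample indices (finite window)
f = lambda t: np.cos(2 * np.pi * 3.0 * t)  # 3 Hz tone, bandlimited to W = 4 Hz

t = np.linspace(-1, 1, 1001)
sinc = lambda x: np.sinc(2 * W * x)        # sin(2*pi*W*x) / (2*pi*W*x)
f_rec = sum(f(k * Ts) * sinc(t - k * Ts) for k in n)

# Error is small but nonzero because the infinite sum is truncated.
print("max reconstruction error:", np.max(np.abs(f_rec - f(t))))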
Assume that the noise power-spectral density of the channel is N0 /2. Then the
noise power is (N0 /2)(2W ) = N0 W . Over the time interval of T seconds, the energy
per sample (per channel use) is
$$ \frac{PT}{2WT} = \frac{P}{2W}. $$
Use this information in the capacity:
$$
C = \frac{1}{2}\log\!\left(1 + \frac{P}{N}\right) \text{ bits per channel use}
  = \frac{1}{2}\log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits per channel use.}
$$
There are 2W samples each second (channel uses), so the capacity is
$$ C = (2W)\,\frac{1}{2}\log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits/second} $$
or
$$ C = W \log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits/second.} $$
This is the famous and key result of information theory.
As W → ∞, we have to do a little calculus to find that
$$ C = \frac{P}{N_0}\log_2 e \ \text{ bits per second.} $$
This is interesting: even with infinite bandwidth, the capacity is not infinite, but
grows linearly with the power.
Example 1 For a phone channel, take W = 3300 Hz. If the SNR is P/N0 W =
40dB = 10000, we get
C = 43850 bits per second.
If P/W N0 = 20dB = 100 we get
C = 21972 bits/second.
(The book is dated.)
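A quick numerical check of Example 1 (added here; the code reproduces the values quoted in the example):

# Sketch: band-limited AWGN capacity C = W log2(1 + P/(N0 W)) for Example 1.
import numpy as np

W = 3300.0                                # phone-channel bandwidth, Hz
for snr_db in (40.0, 20.0):               # SNR = P/(N0 W) in dB
    snr = 10 ** (snr_db / 10)
    C = W * np.log2(1 + snr)
    print(f"SNR = {snr_db:.0f} dB -> C = {C:.0f} bits/second")
# Prints roughly 43850 and 21972 bits/second, matching the example.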
We cannot do better than capacity!
3 Kuhn-Tucker Conditions
Before proceeding with the next section, we need a result from constrained optimization theory known as the Kuhn-Tucker condition.
Suppose we are minimizing some convex objective function L(x),
$$ \min_x L(x) $$
subject to a constraint
$$ f(x) \le 0. $$
Let the optimal value of x be x_0. Then either the constraint is inactive, in which
case we get
$$ \left.\frac{\partial L}{\partial x}\right|_{x_0} = 0, $$
or, if the constraint is active, it must be the case that the objective function cannot
decrease for any admissible value of x:
$$ \left.\frac{\partial L}{\partial x}\right|_{x \in A} \ge 0, $$
where A is the set of admissible values, those for which
$$ \frac{\partial f}{\partial x} \le 0. $$
(Think about what happens if this is not the case.) Thus,
$$ \operatorname{sgn}\frac{\partial L}{\partial x} = -\operatorname{sgn}\frac{\partial f}{\partial x}, $$
or
$$ \frac{\partial L}{\partial x} + \lambda \frac{\partial f}{\partial x} = 0, \qquad \lambda \ge 0. $$
We can create a new objective function
$$ J(x, \lambda) = L(x) + \lambda f(x), $$
so the necessary conditions become
$$ \frac{\partial J}{\partial x} = 0 \tag{1} $$
and
$$ f(x) \le 0, $$
where
$$
\lambda \begin{cases} \ge 0 & f(x) = 0 \quad \text{(constraint is active)} \\ = 0 & f(x) < 0 \quad \text{(constraint is inactive).} \end{cases}
$$
For a vector variable x, condition (1) means:
$$ \frac{\partial L}{\partial x} \text{ is parallel to } \frac{\partial f}{\partial x} \text{ and pointing in the opposite direction,} $$
where ∂L/∂x is interpreted as the gradient.
In words, what condition (1) says is: the gradient of L with respect to x at
a minimum must be pointed in such a way that decrease of L can only come by
violating the constraints. Otherwise, we could decrease L further. This is the
essence of the Kuhn-Tucker condition.
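As a small worked example (not in the notes, added for illustration), consider minimizing $L(x) = (x-2)^2$ subject to $f(x) = x - 1 \le 0$:
$$
J(x, \lambda) = (x - 2)^2 + \lambda (x - 1), \qquad
\frac{\partial J}{\partial x} = 2(x - 2) + \lambda = 0.
$$
The unconstrained minimizer x = 2 violates the constraint, so the constraint is active: x_0 = 1 and λ = −2(x_0 − 2) = 2 ≥ 0. At x_0, ∂L/∂x = −2 and ∂f/∂x = +1 point in opposite directions: L could only be decreased further by moving into x > 1, which violates the constraint.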
4 Parallel Gaussian channels
Parallel Gaussian channels are used to model bandlimited channels with a non-flat
frequency response. We assume we have k Gaussian channels,
Yj = Xj + Zj ,
j = 1, 2, . . . , k.
where
Zj ∼ N (0, Nj )
and the channels are independent. The total power used is constrained:
$$ E \sum_{j=1}^{k} X_j^2 \le P. $$
One question we might ask is: how do we distribute the power across the k channels
to get maximum throughput?
We can find the maximum mutual information (the information channel capacity) as
$$
\begin{aligned}
I(X_1, \ldots, X_k; Y_1, \ldots, Y_k) &= h(Y_1, \ldots, Y_k) - h(Y_1, \ldots, Y_k \mid X_1, \ldots, X_k) \\
&= h(Y_1, \ldots, Y_k) - h(Z_1, \ldots, Z_k) \\
&= h(Y_1, \ldots, Y_k) - \sum_{i=1}^{k} h(Z_i) \\
&\le \sum_{i=1}^{k} \left[ h(Y_i) - h(Z_i) \right] \\
&\le \sum_i \tfrac{1}{2}\log(1 + P_i/N_i).
\end{aligned}
$$
Equality is obtained when the X_i are independent and normally distributed. We want to
distribute the available power among the various channels, subject to not exceeding
the power constraint:
$$ J(P_1, \ldots, P_k) = \sum_i \frac{1}{2}\log\!\left(1 + \frac{P_i}{N_i}\right) + \lambda \sum_{i=1}^{k} P_i $$
with a side constraint (not shown) that P_i ≥ 0. Differentiate w.r.t. P_j to obtain
$$ \frac{1}{2(P_j + N_j)} + \lambda \ge 0, $$
with equality when the constraint P_j ≥ 0 is inactive. After some fiddling, we obtain
$$ P_j = \nu - N_j $$
(since λ is a constant). However, we must also have P_j ≥ 0, so we must ensure that
we don't violate that when N_j > ν. Thus, we let
$$ P_j = (\nu - N_j)^+ $$
where
$$ (x)^+ = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases} $$
and ν is chosen so that
$$ \sum_{i=1}^{k} (\nu - N_i)^+ = P. $$
Draw picture; explain “water filling.”
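Since no picture is included here, a short water-filling sketch (not from the notes) may help; the noise levels, total power, and the bisection search for ν are illustrative choices.

# Sketch: water-filling power allocation P_j = (nu - N_j)^+ with sum P_j = P.
# The noise levels N and total power P are illustrative; nu is found by bisection.
import numpy as np

N = np.array([1.0, 0.5, 2.0, 0.1])        # per-channel noise variances
P = 2.0                                   # total power budget

def total_power(nu):
    return np.sum(np.maximum(nu - N, 0.0))

lo, hi = 0.0, N.max() + P                 # total_power(nu) is increasing in nu
for _ in range(100):                      # bisection on the water level nu
    nu = 0.5 * (lo + hi)
    lo, hi = (nu, hi) if total_power(nu) < P else (lo, nu)

Pj = np.maximum(nu - N, 0.0)
C = np.sum(0.5 * np.log2(1 + Pj / N))
print("water level nu =", round(nu, 4), " powers =", np.round(Pj, 4),
      " capacity =", round(C, 4), "bits/use")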
5 Channels with colored Gaussian noise
We will extend the results of the previous section now to channels with non-white
Gaussian noise. Let K_Z be the covariance matrix of the noise and K_X the covariance
matrix of the input, with the input constrained by
$$ \frac{1}{n}\sum_i E X_i^2 \le P, $$
which is the same as
$$ \frac{1}{n}\operatorname{tr}(K_X) \le P. $$
We can write
I(X1 , . . . , Xn ; Y1 , . . . , Yn ) = h(Y1 , . . . , Yn ) − h(Z1 , . . . , Zn )
where
$$ h(Y_1, \ldots, Y_n) \le \frac{1}{2}\log\left( (2\pi e)^n |K_X + K_Z| \right). $$
Now how do we choose K_X to maximize |K_X + K_Z|, subject to the power constraint?
Let
$$ K_Z = Q \Lambda Q^T; $$
then
$$ |K_X + K_Z| = |K_X + Q \Lambda Q^T| = |Q|\,|Q^T K_X Q + \Lambda|\,|Q^T| = |Q^T K_X Q + \Lambda| = |A + \Lambda|, $$
where A = Q^T K_X Q. Observe that
$$ \operatorname{tr}(A) = \operatorname{tr}(Q^T K_X Q) = \operatorname{tr}(K_X Q Q^T) = \operatorname{tr}(K_X). $$
So we want to maximize |A + Λ| subject to tr(A) ≤ nP. The key is to use an inequality,
in this case Hadamard's inequality, which follows directly from the "conditioning
reduces entropy" theorem:
$$ h(X_1, \ldots, X_n) \le \sum_i h(X_i). $$
Let X ∼ N(0, K). Then
$$ h(X) = \frac{1}{2}\log (2\pi e)^n |K| $$
and
$$ h(X_i) = \frac{1}{2}\log (2\pi e) K_{ii}. $$
Substituting in and simplifying gives
$$ |K| \le \prod_i K_{ii}, $$
with equality iff K is diagonal.
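A quick numerical check of Hadamard's inequality (not in the notes); the random covariance matrix below is an arbitrary example.

# Sketch: check |K| <= prod_i K_ii for a randomly generated covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
K = B @ B.T                                # a random positive semidefinite matrix
print("det(K)       =", np.linalg.det(K))
print("prod of K_ii =", np.prod(np.diag(K)))   # always >= det(K)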
Getting back to our problem,
$$ |A + \Lambda| \le \prod_i (A_{ii} + \Lambda_{ii}), $$
with equality iff A is diagonal. We have
$$ \frac{1}{n}\sum_i A_{ii} \le P $$
(the power constraint), and A_ii ≥ 0. As before, we take
$$ A_{ii} = (\nu - \lambda_i)^+ $$
where ν is chosen so that
$$ \sum_i A_{ii} = nP. $$
Now we want to generalize to a continuous-time system. Consider a channel with
additive Gaussian noise and covariance matrix $K_Z^{(n)}$. If the channel noise
process is stationary, then the covariance matrix is Toeplitz, and the eigenvalues of
the covariance matrix tend to a limit as n → ∞. The density of the eigenvalues
on the real line tends to the power spectrum of the stochastic process. That is, if
$K_{ij} = r_{i-j}$ are the autocorrelation values and the power spectrum is
$$ S(\omega) = \mathcal{F}[r_k], $$
then
$$ \lim_{M \to \infty} \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_M}{M}
= \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega. $$
In this case, the water filling translates to water filling in the spectral domain. The
capacity of the channel with noise spectrum N(f) can be shown to be
$$ C = \int \frac{1}{2}\log\!\left(1 + \frac{(\nu - N(f))^+}{N(f)}\right) df, $$
where ν is chosen so that
$$ \int (\nu - N(f))^+\, df = P. $$
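To illustrate (a sketch, not from the notes): discretizing the frequency axis reduces spectral water filling to the parallel-channel allocation above; the noise spectrum N(f), total power P, and frequency grid are arbitrary choices.

# Sketch: spectral water filling on a discretized frequency axis.
# The noise spectrum N(f) and total power P are illustrative choices.
import numpy as np

f = np.linspace(0.0, 1.0, 1000)            # normalized frequency grid
df = f[1] - f[0]
Nf = 0.5 + 0.4 * np.cos(2 * np.pi * f)     # a non-flat noise spectrum N(f) > 0
P = 0.3                                    # total power budget

def used_power(nu):
    return np.sum(np.maximum(nu - Nf, 0.0)) * df

lo, hi = 0.0, Nf.max() + P                 # bisection for the water level nu
for _ in range(100):
    nu = 0.5 * (lo + hi)
    lo, hi = (nu, hi) if used_power(nu) < P else (lo, nu)

Pf = np.maximum(nu - Nf, 0.0)              # power allocation (nu - N(f))^+
C = np.sum(0.5 * np.log2(1 + Pf / Nf)) * df
print(f"water level nu = {nu:.4f}, capacity ~ {C:.4f} bits per (normalized) channel use")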