Download "Approximation Theory of Output Statistics,"

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Ars Conjectandi wikipedia , lookup

Inductive probability wikipedia , lookup

Infinite monkey theorem wikipedia , lookup

Randomness wikipedia , lookup

Probability box wikipedia , lookup

Law of large numbers wikipedia , lookup

Karhunen–Loève theorem wikipedia , lookup

Central limit theorem wikipedia , lookup

Transcript
Approximation Theory of Output Statistics
Te Sun Han, Fellow, IEEE, and Sergio Verdú, Fellow, IEEE
Abstract-Given a channel and an input process, the minimum randomness of those input processes whose output statistics approximate the original output statistics with arbitrary accuracy is studied. The notion of resolvability of a channel, defined as the number of random bits required per channel use in order to generate an input that achieves arbitrarily accurate approximation of the output statistics for any given input process, is introduced. A general formula for resolvability that holds regardless of the channel memory structure is obtained. It is shown that, for most channels, resolvability is equal to Shannon capacity. By-products of the analysis are a general formula for the minimum achievable (fixed-length) source coding rate of any finite-alphabet source, and a strong converse of the identification coding theorem, which holds for any channel that satisfies the strong converse of the channel coding theorem.

Index Terms-Shannon theory, channel output statistics, resolvability, random number generation complexity, channel capacity, noiseless source coding theorem, identification via channels.
I. INTRODUCTION

TO MOTIVATE the problem studied in this paper, let us consider the computer simulation of stochastic systems. Usually, the objective of the simulation is to compute a set of statistics of the response of the system to a given "real-world" input random process. To accomplish this, a sample
path of the input random process is generated and empirical
estimates of the desired output statistics are computed from
the output sample path. A random number generator is used
to generate the input sample path and an important question
is how many random bits are required per input sample. The
answer would depend only on the given “real-world” input
statistics if the objective were to reproduce those statistics
exactly (in which case an infinite number of bits per sample
would be required if the input distribution is continuous, for
example). However, the real objective is to approximate the
output statistics. Therefore, the required number of random
bits depends not only on the input statistics but on the degree
of approximation required for the output statistics, and on
the system itself. In this paper, we are interested in the
approximation of output statistics (via an alternative input
process) with arbitrary accuracy, in the sense that the distance
between the finite-dimensional statistics of the true output
process and the approximated output process is required to
vanish asymptotically. This leads to the introduction of a new
Manuscript received February 7, 1992; revised September 18, 1992. This work was supported in part by the U.S. Office of Naval Research under Grant N00014-90-J-1734 and in part by the NEC Corp. under its grant program.
T. S. Han is with the Graduate School of Information Systems, University of Electro-Communications, Tokyo 182, Japan.
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
IEEE Log Number 9206960.
concept in the Shannon theory: the resolvability of a system (channel), defined as the number of random bits per input
sample required to achieve arbitrarily accurate approximation
of the output statistics regardless of the actual input process.
Intuitively, we can anticipate that the resolvability of a system
will depend on how “noisy” it is. A coarse approximation of
the input statistics whose generation requires comparatively
few bits will be good enough when the system is very
noisy, because, then, the output cannot reflect any fine detail
contained in the input distribution.
Although the problem of approximation of output statistics
involves no codes of any sort or the transmission/reproduction
of information, its analysis and results turn out to be Shannon
theoretic in nature. In fact, our main conclusion is that (for
most systems) resolvability is equal to Shannon capacity.
In order to make the notion of resolvability precise, we
need to specify the “distance” measure between true and
approximated output statistics and the “complexity” measure
of random number generation. Our main, but not exclusive,
focus is on the $\ell_1$-distance (or variational distance) and on
the worst-case measure of randomness, respectively. This
complexity measure of a random variable is equal to the
number of random bits required to generate every possible
realization of the random variable; we refer to it as the
resolution of the random variable and we show how to obtain
it from the probability distribution. The alternative, average
randomness measure is known to equal the entropy plus at
most two bits [11], and it leads to the associated notion of
mean-resolvability.
Section II introduces the main definitions. The class of
channels we consider is very general. To keep the development
as simple as possible we restrict attention to channels with
finite input/output alphabets. However, most of the proofs do
not rely on that assumption, and it is specifically pointed out
when this is not the case. In addition to allowing channels
with arbitrary memory structure, we deal with completely
general input processes, in particular, neither ergodicity nor
stationarity assumptions are imposed.
Further motivation for the notions of resolvability and
mean-resolvability is given in Section III. Section IV gives
a general formula for the resolvability of a channel. The
achievability part of the resolvability theorem (which gives an
upper bound to resolvability) holds for any channel, regardless
of its memory structure or the finiteness of the input/output
alphabets. The finiteness of the input set is the only substantive
restriction under which the converse part (which lower bounds
resolvability) is shown in Section IV via Lemma 6.
The approximation of output statistics has intrinsic connections with the following three major problems in the Shannon theory: (noiseless) source coding, channel coding, and identification via channels [1]. As a by-product of our resolvability
results, we find in Section III a very general formula for the
minimum achievable fixed-length source coding rate that holds
for any finite-alphabet source thereby dispensing with the
classical assumptions of ergodicity and stationarity. In Section
V, we show that as long as the channel satisfies the strong converse to the channel coding theorem, the resolvability formula
found in Section IV is equal to the Shannon capacity. As a simple consequence of the achievability part of the resolvability
theorem, we show in Section VI a general strong converse
to the identification coding theorem, which was known to
hold only for discrete memoryless channels [7]. This result
implies that the identification capacity is guaranteed to equal
the Shannon capacity for any finite-input channel that satisfies
the strong converse to the Shannon channel coding theorem.
The more appropriate kind (average or worst-case) of complexity measure will depend on the specific application. For
example, in single sample-path simulations, the worst-case
measure may be preferable. At any rate, the limited study
in Section VII indicates that in every case we consider, the
mean-resolvability is also equal to the Shannon capacity of
the system.
Similarly, the results presented in Section VIII evidence that
the main conclusions on resolvability (established in previous
sections) also hold when the variational-distance approximation criterion is replaced by the normalized divergence.
Section VIII concludes with the proof of a folk theorem
which fits naturally within the approximation theory of output
statistics: the output distribution due to any good channel code
must approximate the output distribution due to the capacity-achieving input.
Although the problem treated in this paper is new, it is
interesting to note two previous information-theoretic contributions related to the notion of quantifying the minimum
complexity of a randomness source required to approximate
some given distribution. In one of the approaches to measure
the common randomness between two dependent random
variables proposed in [21], the randomness source is the input
to two independent memoryless random transformations, the
outputs of which are required to have a joint distribution which
approximates (in normalized divergence) the nth product of the given joint distribution. The class of channels whose transition probabilities can be approximated (in $\bar{d}$-distance) by sliding-block transformations of the input and an independent noise source is studied in [13], and the minimum entropy rate of the
independent noise source required for accurate approximation
is shown to be the maximum conditional output entropy over
all stationary inputs.
II. PRELIMINARIES

This section introduces the basic notation and fundamental concepts as well as several properties to be used in the sequel.

Definition 1: A channel $W$ with input and output alphabets $A$ and $B$, respectively, is a sequence of conditional distributions
$$W = \{W^n(y^n \mid x^n) = P_{Y^n \mid X^n}(y^n \mid x^n);\ (x^n, y^n) \in A^n \times B^n\}_{n=1}^{\infty}.$$

In order to describe the statistics of input/output processes, we will use the sequence of finite-dimensional distributions$^1$ $\{X^n = (X_1^{(n)}, \ldots, X_n^{(n)})\}_{n=1}^{\infty}$, which is abbreviated as $X$. The following notation will be used for the output distribution when the input is distributed according to $Q^n$:
$$Q^n W^n(y^n) = \sum_{x^n \in A^n} W^n(y^n \mid x^n)\, Q^n(x^n).$$

Definition 2 [14]: Given a joint distribution $P_{X^n Y^n}(x^n, y^n) = P_{X^n}(x^n) W^n(y^n \mid x^n)$, the information density is the function defined on $A^n \times B^n$:
$$i_{X^n W^n}(a^n, b^n) = \log \frac{W^n(b^n \mid a^n)}{P_{Y^n}(b^n)}.$$
The distribution of the random variable $(1/n)\, i_{X^n W^n}(X^n, Y^n)$, where $X^n$ and $Y^n$ have joint distribution $P_{X^n Y^n}$, will be referred to as the information spectrum. The expected value of the information spectrum is the normalized mutual information $(1/n) I(X^n; Y^n)$.

Definition 3: The limsup in probability of a sequence of random variables $\{A_n\}$ is defined as the smallest extended real number $\beta$ such that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P[A_n > \beta + \epsilon] = 0.$$
Analogously, the liminf in probability is the largest extended real number $\alpha$ such that for all $\epsilon > 0$, $\lim_{n \to \infty} P[A_n \le \alpha - \epsilon] = 0$. Note that a sequence of random variables converges in probability to a constant if and only if its limsup in probability is equal to its liminf in probability. The limsup in probability [resp., liminf in probability] of the sequence of random variables $\{(1/n)\, i_{X^n W^n}(X^n, Y^n)\}_{n=1}^{\infty}$ will be referred to as the sup-information rate [resp., inf-information rate] of the pair $(X, Y)$ and will be denoted by $\bar{I}(X; Y)$ [resp., $\underline{I}(X; Y)$]. The mutual information rate of $(X, Y)$, if it exists, is the limit
$$I(X; Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n).$$

Although convergence in probability does not necessarily imply convergence of the means (e.g., [15, p. 135]), in most cases of information-theoretic interest that implication does indeed hold in the context of information rates.

Lemma 1: For any channel with finite input alphabet, if $\bar{I}(X; Y) = \underline{I}(X; Y)$ (i.e., the information spectrum converges in probability to a constant), then
$$\bar{I}(X; Y) = \underline{I}(X; Y) = I(X; Y),$$
and the input-output pair $(X, Y)$ is called information stable.

Proof: See the Appendix for a proof that hinges on the finiteness of the input alphabet. $\square$

$^1$ No consistency restrictions between the channel conditional probabilities, or between the finite-dimensional distributions of the input/output processes, are imposed. Thus, "processes" refer to sequences of finite-dimensional distributions, rather than distributions on spaces of infinite-dimensional sequences.
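To make the information spectrum concrete, the following minimal Python sketch (an assumed toy example, not part of the paper) samples the normalized information density for $n$ uses of a binary symmetric channel with i.i.d. equiprobable inputs; for this information-stable pair the spectrum concentrates around the mutual information rate, in line with Lemma 1.

```python
import numpy as np

# Toy illustration (assumed setup): information spectrum of a BSC(p) with
# i.i.d. equiprobable inputs.  For a memoryless channel the information
# density is a sum of per-letter densities.
rng = np.random.default_rng(0)
p, n, samples = 0.1, 200, 5000

def information_density(x, y, p):
    """i(x;y) = log2 W(y|x) - log2 P_Y(y) for one BSC use with uniform input."""
    w = np.where(x == y, 1 - p, p)      # W(y|x)
    return np.log2(w) - np.log2(0.5)    # output is equiprobable

spectrum = []
for _ in range(samples):
    x = rng.integers(0, 2, size=n)
    y = x ^ (rng.random(n) < p)         # pass x through the BSC
    spectrum.append(information_density(x, y, p).sum() / n)

spectrum = np.array(spectrum)
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print("mean of spectrum:", spectrum.mean(), " 1 - h(p):", 1 - h)
```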
Definition 4 [7]: For any positive integer $M$,$^2$ a probability distribution $P$ is said to be $M$-type if
$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, 1\right\} \qquad \text{for all } \omega \in \Omega.$$
The number of different $M$-type distributions on $\Omega$ is upper bounded by $|\Omega|^M$.

Definition 5: The resolution $R(P)$ of a probability distribution $P$ is the minimum $\log M$ such that $P$ is $M$-type. (If $P$ is not $M$-type for any integer $M$, then $R(P) = +\infty$.)

Resolution is a new measure of randomness which is related to conventional measures via the following immediate inequality.

Lemma 2: Let $H(P)$ denote the entropy of $P$ and let $Z(P)$ denote the Rényi entropy of order 0, i.e., the logarithm of the number of points with positive $P$-mass. Then,
$$H(P) \le Z(P) \le R(P), \tag{2.1}$$
with equality if and only if $P$ is equiprobable.

The information spectrum is upper bounded almost surely by the input (or output) resolution:

Lemma 3:
$$P\left[i_{X^n W^n}(X^n, Y^n) \le R(X^n)\right] = 1.$$

Proof: For every $(x^n, y^n) \in A^n \times B^n$ such that$^3$ $P_{X^n}(x^n) > 0$, we have
$$i_{X^n W^n}(x^n, y^n) \le \log \frac{1}{P_{X^n}(x^n)} \tag{2.2}$$
and
$$P_{X^n}(x^n) = m(x^n) \exp(-R(X^n)), \tag{2.3}$$
where $m(x^n)$ is an integer greater than or equal to 1. Thus, the result follows by combining (2.2) and (2.3). $\square$

Definition 6 (e.g., [3]): The variational distance or $\ell_1$-distance between two distributions $P$ and $Q$ defined on the same measurable space $(\Omega, \mathcal{F})$ is
$$d(P, Q) = \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)| = 2 \sup_{E} |P(E) - Q(E)|.$$

Definition 7: Let $\epsilon \ge 0$. $R$ is an $\epsilon$-achievable resolution rate for channel $W$ if for every input process $X$ and for all $\gamma > 0$, there exists $\tilde{X}$ whose resolution satisfies
$$\frac{1}{n} R(\tilde{X}^n) < R + \gamma \tag{2.4}$$
and
$$d(Y^n, \tilde{Y}^n) < \epsilon \tag{2.5}$$
for all sufficiently large $n$, where $Y$ and $\tilde{Y}$ are the output statistics due to input processes $X$ and $\tilde{X}$, respectively, i.e.,
$$P_{Y^n} = P_{X^n} W^n, \qquad P_{\tilde{Y}^n} = P_{\tilde{X}^n} W^n. \tag{2.6}$$

If $R$ is an $\epsilon$-achievable resolution rate for every $\epsilon > 0$, then we say that $R$ is an achievable resolution rate. By definition, the set of ($\epsilon$-) achievable resolution rates is either empty or a closed interval. The minimum $\epsilon$-achievable resolution rate (resp., achievable resolution rate) is called the $\epsilon$-resolvability (resp., resolvability) of the channel, and it is denoted by $S_\epsilon$ (resp., $S$). Note that $S_\epsilon$ is monotonically nonincreasing in $\epsilon$ and
$$S = \sup_{\epsilon > 0} S_\epsilon.$$

The definitions of achievable resolution rates can be modified so that the defining property applies to a particular input $X$ instead of every input process. In such case, we refer to the corresponding quantities as ($\epsilon$-) achievable resolution rate for $X$ and ($\epsilon$-) resolvability for $X$, for which we use the notation $S_\epsilon(X)$ and $S(X)$. It follows from Definition 7 that
$$S = \sup_{X} S(X).$$

The main focus of this paper is on the resolvability of systems as defined in Definition 7. In addition, we shall investigate another kind of resolvability results by considering a different randomness measure. Specifically, if in Definition 7, (2.4) is replaced by
$$\frac{1}{n} H(\tilde{X}^n) < R + \gamma, \tag{2.7}$$
then achievable resolution rates become achievable entropy rates and resolvability becomes mean-resolvability. It follows from Lemma 2 that for all $\epsilon > 0$ and $X$,
$$\bar{S}_\epsilon(X) \le S_\epsilon(X), \tag{2.8}$$
where $\bar{S}_\epsilon$ and $\bar{S}$ denote ($\epsilon$-) mean-resolvability in parallel with the above definitions of $S_\epsilon$ and $S$. It is obvious that $\bar{S} = \sup_X \bar{S}(X)$.

The motivation for the definitions of resolvability and mean-resolvability is further developed in the following section.

$^2$ The alternative terminology type with denominator $M$ can be found in [2, ch. 12].
$^3$ Following common usage in information theory, when the distributions in Definitions 4-6 denote those of random variables, they will be replaced by the random variables themselves, e.g., $R(X)$, $H(X)$, $d(Y^n, \tilde{Y}^n)$.
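For concreteness, here is a minimal Python sketch (an assumed example, not taken from the paper) of the quantities in Definitions 5 and 6 and of the inequality of Lemma 2; it expects exact rational masses, since a distribution with an irrational mass has infinite resolution.

```python
from fractions import Fraction
from math import lcm, log2

def resolution(masses):
    """R(P) = min log M such that P is M-type, i.e., all masses are multiples of 1/M."""
    fracs = [Fraction(m) for m in masses]            # exact rationals expected
    return log2(lcm(*[f.denominator for f in fracs]))

def entropy(masses):
    return -sum(float(m) * log2(float(m)) for m in masses if m > 0)

def variational_distance(p, q):
    return sum(abs(float(a) - float(b)) for a, b in zip(p, q))

P = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]     # a 4-type distribution
print(entropy(P), "<=", resolution(P))                   # 1.5 <= 2.0 bits (Lemma 2)
print(variational_distance(P, [Fraction(1, 3)] * 3))     # l1 distance to the uniform law
```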
III. RESOLUTION, RANDOM NUMBER GENERATION AND SOURCE CODING
The purpose of this section is to motivate the definitions
of resolvability and mean-resolvability introduced in Section
II through their relationship with random number generation
and noiseless source coding. Along the way, we will show
that our resolvability theorems lead to new general results in
source coding.
A. Resolution and Random Number Generation
A prime way to quantify the “randomness” of a random
variable is via the complexity of its generation with a computer
that has access to a basic random experiment which generates
equally likely random values, such as fair coin flips, dice,
etc. By complexity, we mean the number of random bits that
the most efficient algorithm requires in order to generate the
random variable. Depending on the algorithm, the required
number of random bits may be random itself. For example,
consider the generation of the random variable with probability masses $P[X = -1] = 1/4$, $P[X = 0] = 1/2$, $P[X = 1] = 1/4$, with an algorithm such that if the outcome of a fair coin flip is Heads, then the output is 0, and if the outcome is Tails, another fair coin flip is requested in order to decide $+1$ or $-1$. On the average this algorithm requires 1.5 coin flips, and in the worst case 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average (over the range of outcomes of the random variable).
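The two forms of complexity are easy to see in simulation; the following minimal Python sketch (illustrative only, not from the paper) runs the algorithm just described and reports the average and worst-case number of fair coin flips.

```python
import random

def generate():
    """One fair flip decides 0 vs. 'keep going'; a second flip decides +1 vs. -1."""
    flips = 1
    if random.random() < 0.5:                     # Heads
        return 0, flips
    flips += 1
    return (+1 if random.random() < 0.5 else -1), flips

trials = [generate() for _ in range(100_000)]
avg_flips = sum(f for _, f in trials) / len(trials)
worst_case = max(f for _, f in trials)
print(avg_flips, worst_case)                      # about 1.5 on average, 2 in the worst case
```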
First, let us consider the worst-case complexity. A conceptual model for the generation of arbitrary random variables is
a deterministic transformation of a random variable uniformly
distributed on [0, 1]. Although such a random variable cannot
be generated by a discrete machine, this model suggests an algorithm for the generation of finitely-valued random variables
in a finite-precision computer: a deterministic transformation
of the outcome of a random number generator which outputs
M equally likely values, in lieu of the uniformly distributed
random variable. The lowest value of log M required to
generate the random variable (among all possible deterministic
transformations) is its worst-case complexity. Other algorithms
may require fewer random bits on the average, but not for
every possible outcome. It is now easy to recognize that
the worst-case complexity of a random variable is equal
to its resolution. This is because processing the output of
the M-valued random number generator with a deterministic
transformation (which is conceptually nothing more than a
table lookup) results in a discrete random variable whose
probability masses are multiples of 1/M, i.e., an M-type.
At first sight, it may seem that the use of resolution (as
opposed to entropy) in the definition of resolvability is overly
stringent. However, this is not the case because that definition
is concerned with asymptotic approximation. Analogously, in
practice, M may be constrained to be a power of 2; however,
this possible modification has no effect on the definition
of achievable resolution rates (Definition 7) because it is
only concerned with the asymptotic behavior of the ratio
of resolution to number of dimensions of the approximating
distribution.
The average complexity of random variable generation has
been studied in the work of Knuth and Yao [11], which shows that the minimum expected number of fair bits required to generate a random variable lies between its entropy and its entropy plus two bits (cf. [2, Theorem 5.12.3]). That lower
bound holds even if the basic equally likely random number
generator is allowed to be nonbinary. This result is the reason
for the choice of entropy as the average complexity measure
in the definition of mean-resolvability. Note that the two-bit
uncertainty of the Knuth-Yao theorem is inconsequential for
the purposes of our (asymptotic) definition.
B. Resolution and Source Coding
Having justified the new concepts of resolvability and meanresolvability on the basis of their significance in the complexity
of random variable generation, let us now explore their relationship with well-established concepts in the Shannon theory.
To this end, in the remainder of this section we will focus on
the special case of an identity channel ($A = B$; $W^n(y^n \mid x^n) = 1$ if $x^n = y^n$), in which our approximation theory becomes one of approximation of source statistics.
Suppose we would like to generate random sequences according to the finite-dimensional distributions of some given process X. As we have argued, the worst-case and average number of bits per dimension required are $(1/n)R(X^n)$ and $(1/n)H(X^n)$, respectively. If, however, we are content with
reproducing the source statistics within an arbitrarily small
tolerance, fewer bits may be needed, asymptotically in the
worst case. For example, consider the case of independent flips of a biased coin with tails probability equal to $1/\pi$. It is evident that $R(X^n) = \infty$ for every $n$. However, the asymptotic equipartition property (AEP) states that for any $\epsilon > 0$ and large $n$, the $\exp(nh(1/\pi) + n\epsilon)$ typical sequences exhaust most of the probability. If we let $M = \exp(nh(1/\pi) + 2n\epsilon)$, then we can quantize the probability of each of those sequences to a multiple of $1/M$, thereby achieving a quantization error in each mass of at most $1/M$. Consequently, the sum of the absolute errors on the typical sequences is exponentially small, and the masses of the atypical sequences can be approximated by zero because of the AEP, thereby yielding an arbitrarily small variational distance between the true and approximating statistics. The resolution rate of the approximating statistics is $h(1/\pi) + 2\epsilon$. Indeed, in this case $S(X) = \bar{S}(X) = h(1/\pi)$, and
this reasoning can be applied to any stationary ergodic source
to show that S(X) is equal to the entropy rate of X (always
in the context of an identity channel). The key to the above
procedure to approximate the statistics of the source with finite
resolution is the use of repetition. Had we insisted on a uniform
approximation to the original statistics we would not have
succeeded in bringing the variational distance to negligible
levels, because of the small but exponentially significant
variation in the probability masses of the typical sequences.
By allowing an approximation with a uniform distribution on
a collection of M elements with repetition, i.e., an M-type,
with large enough M, it is possible to closely track those
variations in the probability masses. A nice bonus is that for
this approximation procedure to work it is not necessary that
the masses of the typical sequences be similar, as dictated by
the AEP. This is why the connection between resolvability
and source coding is deeper than that provided by the AEP,
and transcends stationary ergodic sources. To show this, let us
first record the standard definitions of the fundamental limits
in fixed-length and variable-length source coding.
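Before recording those definitions, the quantization argument for the biased coin can be checked numerically; the following is a minimal Python sketch under assumed parameters (small blocklengths and $\epsilon = 0.1$, not taken from the paper) that quantizes the typical-set masses to multiples of $1/M$ and reports the resulting variational distance, which shrinks as $n$ grows.

```python
import numpy as np
from math import comb, log2

q = 1 / np.pi                                        # tails probability 1/pi
h = -q * log2(q) - (1 - q) * log2(1 - q)             # binary entropy h(1/pi)
eps = 0.1

for n in (16, 64, 256):
    M = 2.0 ** (n * h + 2 * n * eps)                 # resolution rate h(1/pi) + 2*eps
    dist = 0.0
    for k in range(n + 1):                           # k = number of tails; all such
        p_seq = q**k * (1 - q) ** (n - k)            # sequences share this probability
        count = comb(n, k)
        if abs(-log2(p_seq) / n - h) <= eps:         # typical sequences: quantize mass
            dist += count * abs(p_seq - round(p_seq * M) / M)
        else:                                        # atypical sequences: mass set to zero
            dist += count * p_seq
    print(n, dist)
```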
Definition 8: $R$ is an $\epsilon$-achievable source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists a collection of $M$ $n$-tuples $\{x_1^n, \ldots, x_M^n\}$ such that
$$\frac{1}{n}\log M < R + \gamma$$
and
$$P[X^n \notin \{x_1^n, \ldots, x_M^n\}] \le \epsilon.$$
$R$ is an achievable (fixed-length) source coding rate for $X$ if it is $\epsilon$-achievable for all $0 < \epsilon < 1$. $T(X)$ denotes the minimum achievable source coding rate for $X$.

Definition 9: Fix an integer $r \ge 2$. $R$ is an achievable variable-length source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists an $r$-ary prefix code for $X^n$ such that the average codeword length $L_n$ satisfies
$$\frac{1}{n} L_n \log r < R + \gamma.$$
The minimum achievable variable-length source coding rate for $X$ is denoted by $\bar{T}(X)$.

As shown below, in the special case of the identity channel, resolvability and mean-resolvability reduce to the minimum achievable fixed-length and variable-length source coding rates, respectively, for any source. Although quite different from the familiar setting of combined source and channel coding (e.g., no decoding is present at the channel output), the approximation theory of output statistics could be subtitled "source coding via channels" because of the following two results.

Theorem 1: For any $X$ and the identity channel,
$$S(X) = T(X).$$

Proof:
1) $T(X) \le S(X)$. We show that if $R$ is an $\epsilon$-achievable resolution rate for $X$, then it is an $\epsilon/2$-achievable source coding rate for $X$. According to Definition 7, for every $\gamma > 0$ and all sufficiently large $n$, there exists $\tilde{X}^n$ with
$$\frac{1}{n} R(\tilde{X}^n) < R + \gamma$$
and
$$d(X^n, \tilde{X}^n) < \epsilon.$$
We can view $\tilde{X}^n$ as putting mass $1/M$ on each member of a collection of $M = \exp(R(\tilde{X}^n))$ elements of $A^n$ denoted by $D = \{x_1^n, \ldots, x_M^n\}$. (Note that the $M$ elements of this collection need not all be different.) The collection $D$ is a source code with probability of error smaller than $\epsilon/2$ because
$$\epsilon > d(X^n, \tilde{X}^n) \ge 2 P_{X^n}(D^c) - 2 P_{\tilde{X}^n}(D^c) = 2 P_{X^n}(D^c).$$
2) $S(X) \le T(X)$. We show that if $R$ is an $\epsilon$-achievable source coding rate for $X$, then it is a $3\epsilon$-achievable resolution rate for $X$. For arbitrary $\gamma > 0$ and all sufficiently large $n$, select $D = \{x_1^n, \ldots, x_M^n\}$ such that
$$\frac{1}{n}\log M < R + \gamma, \qquad P[X^n \notin \{x_1^n, \ldots, x_M^n\}] \le \epsilon.$$
Choose $M'$ such that
$$\exp(nR + 2n\gamma) \le M' \le \exp(nR + 3n\gamma)$$
and an arbitrary element $x_0^n \notin D$. We are going to construct an approximation $\tilde{X}^n$ to $X^n$ which satisfies the following conditions:
a) $\tilde{X}^n$ is an $M'$-type,
b) $P_{\tilde{X}^n}(x_0^n) \le \epsilon$,
c) $|P_{\tilde{X}^n}(x_i^n) - P_{X^n}(x_i^n)| \le 1/M'$, $i = 1, \ldots, M$,
d) $P_{\tilde{X}^n}(D) + P_{\tilde{X}^n}(x_0^n) = 1$.
It will then follow immediately that $R$ is a $3\epsilon$-achievable resolution rate, as
$$\frac{1}{n} R(\tilde{X}^n) \le R + 3\gamma$$
and
$$d(X^n, \tilde{X}^n) \le \sum_{i=1}^{M} |P_{\tilde{X}^n}(x_i^n) - P_{X^n}(x_i^n)| + P_{\tilde{X}^n}(x_0^n) + \sum_{x^n \in D^c} P_{X^n}(x^n) < 2\epsilon + \frac{M}{M'} < 2\epsilon + \exp(-n\gamma) < 3\epsilon$$
for all sufficiently large $n$. The construction of $\tilde{X}^n$ is
$$P_{\tilde{X}^n}(x_i^n) = \frac{k_i}{M'}, \quad i = 0, \ldots, M, \qquad P_{\tilde{X}^n}(x^n) = 0, \quad \text{if } x^n \notin \{x_0^n, x_1^n, \ldots, x_M^n\},$$
where the integers $k_i$ are selected as follows. If
$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil \le M',$$
then
$$k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = 1, \ldots, M, \qquad k_0 = M' - \sum_{i=1}^{M} k_i,$$
and properties a)-d) are readily seen to be satisfied. On the other hand, consider the case where
$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil = M' + L$$
with $1 \le L \le M$. Since it may be assumed, without loss of generality, that $P_{X^n}(x_i^n) > 0$ for all $i = 1, \ldots, M$, we may set
$$k_0 = 0, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil - 1 \ge 0, \quad i = 1, \ldots, L, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = L+1, \ldots, M,$$
which again guarantees that a)-d) are satisfied. $\square$
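The rounding step in part 2 of the proof is easy to mechanize; the following minimal Python sketch (with hypothetical code-word probabilities, not from the paper) builds the integers $k_i$ of the construction above.

```python
from math import ceil

def m_prime_type(masses, M_prime):
    """masses: P_{X^n}(x_i^n) for the code words; returns (k_0, [k_1,...,k_M])."""
    k = [ceil(M_prime * p) for p in masses]
    excess = sum(k) - M_prime
    if excess <= 0:
        return -excess, k                      # k_0 = M' - sum k_i
    for i in range(excess):                    # subtract 1 from the first L code words
        k[i] -= 1
    return 0, k

masses = [0.30, 0.25, 0.20, 0.10]              # hypothetical code-word probabilities (sum < 1)
k0, k = m_prime_type(masses, M_prime=64)
approx = [ki / 64 for ki in k]
print(k0 / 64, approx, sum(approx) + k0 / 64)  # an M'-type distribution summing to 1
```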
Theorem 2: For any $X$ and the identity channel,
$$\bar{S}(X) = \bar{T}(X) = \limsup_{n \to \infty} \frac{1}{n} H(X^n).$$

Proof:
1) $\bar{S}(X) \le \bar{T}(X)$. Suppose that $R$ is an achievable variable-length source coding rate. Then, Definition 9 states that there exists for all $\gamma > 0$ and all sufficiently large $n$ a prefix code whose average length $L_n$ satisfies
$$\frac{1}{n} L_n \log r < R + \gamma. \tag{3.1}$$
Moreover, the fundamental source coding lower bound for an $r$-ary code (e.g., [2, Theorem 5.3.1]) is
$$H(X^n) \le L_n \log r. \tag{3.2}$$
Now, let $\tilde{X} = X$. Then, $d(X^n, \tilde{X}^n) = 0$, and (3.1)-(3.2) imply that
$$\frac{1}{n} H(\tilde{X}^n) < R + \gamma, \tag{3.3}$$
concluding that $R$ is an achievable mean-resolution rate for $X$.
2) $\bar{T}(X) \le \bar{S}(X)$. Let $R$ be an achievable mean-resolution rate for $X$. For arbitrary $\gamma > 0$, $0 < \epsilon < 1/2$ and all sufficiently large $n$, choose $\tilde{X}^n$ such that (3.3) is satisfied and $d(X^n, \tilde{X}^n) < \epsilon$. On the other hand, there exists an $r$-ary prefix code for $X^n$ with average length bounded by (e.g., [2, Theorem 5.4.1])
$$L_n \log r < H(X^n) + \log r. \tag{3.4}$$
We want to show that if the above $\epsilon$ is chosen sufficiently small then the code satisfies
$$\frac{1}{n} L_n \log r \le R + 2\gamma$$
for all sufficiently large $n$, thereby proving that $R$ is an achievable variable-length source coding rate for $X$. To that end, all that is required is the continuity of entropy in variational distance:

Lemma 4 [3, p. 33]: If $P$ and $Q$ are distributions defined on $\Omega$ such that $d(P, Q) \le \theta < 1/2$, then
$$|H(P) - H(Q)| \le \theta \log(|\Omega|/\theta).$$

Using Lemma 4 and (3.3), we obtain
$$\frac{1}{n} L_n \log r \le \frac{1}{n} H(X^n) + \frac{1}{n}\log r \le \frac{1}{n} H(\tilde{X}^n) + \epsilon\log(|A|/\epsilon) + \frac{1}{n}\log r \le R + 2\gamma$$
for sufficiently large $n$ if $\epsilon\log(|A|/\epsilon) < \gamma$.
3) $\bar{T}(X) = \limsup_{n \to \infty} (1/n) H(X^n)$. This follows immediately from the bounds in (3.2) and (3.4). See also [10]. $\square$

Theorems 1 and 2 show a pleasing parallelism between the resolvability of a process (with the identity channel) and its minimum achievable source coding rate. Theorem 1 and the Shannon-McMillan theorem lead to the solution of $S(X)$ as the entropy rate of $X$ in the special case of stationary ergodic $X$. Interestingly, the results of this paper allow us to find the resolvability of any process with the identity channel, and thus a completely general formula for the minimum achievable source-coding rate for any source.

Theorem 3: For any $X$ and the identity channel,
$$S(X) = T(X) = \bar{H}(X),$$
where $\bar{H}(X)$ is the sup-entropy rate defined as $\bar{I}(X; Y)$ for the identity channel (cf. Definition 3), i.e., the smallest real number $\beta$ such that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P\left[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > \beta + \epsilon\right] = 0.$$

Proof:
1) $\bar{H}(X) \le S(X)$. We will argue by contradiction: choose an achievable resolution rate $R$ for $X$ such that for some $\delta > 0$,
$$R + \delta < \bar{H}(X). \tag{3.5}$$
By the definition of $\bar{H}(X)$, there exists $\alpha > 0$ such that
$$P_{X^n}(D_0) \ge \alpha \tag{3.6}$$
infinitely often, with $D_0$ defined as the set of least likely source words:
$$D_0 = \left\{x^n \in A^n: \frac{1}{n}\log\frac{1}{P_{X^n}(x^n)} > R + \delta\right\}.$$
Select $0 < \epsilon < \alpha^2$, and $\tilde{X}^n$ for all sufficiently large $n$ to satisfy
$$\frac{1}{n} R(\tilde{X}^n) < R + \frac{\delta}{2}$$
and
$$d(X^n, \tilde{X}^n) < \epsilon.$$
Define
$$D_1 = \left\{x^n \in A^n: P_{X^n}(x^n) > 0 \ \text{ and } \ \left|1 - \frac{P_{\tilde{X}^n}(x^n)}{P_{X^n}(x^n)}\right| \le \epsilon^{1/2}\right\}$$
and consider
$$P_{X^n}(D_1 \cap D_0) \ge P_{X^n}(D_0) - P_{X^n}(D_1^c) \ge \alpha - \epsilon^{1/2} > 0, \tag{3.7}$$
which holds infinitely often because of (3.6) and
$$\epsilon^{1/2} P_{X^n}(D_1^c) \le \sum_{x^n \in D_1^c} P_{X^n}(x^n)\left|1 - \frac{P_{\tilde{X}^n}(x^n)}{P_{X^n}(x^n)}\right| \le d(X^n, \tilde{X}^n) \le \epsilon.$$
For those $n$ such that (3.7) holds, we can find $x_0^n \in D_1 \cap D_0$ whose $P_{\tilde{X}^n}$-mass satisfies the following lower and upper bounds:
$$P_{\tilde{X}^n}(x_0^n) \ge (1 - \epsilon^{1/2}) P_{X^n}(x_0^n) > 0$$
and
$$\frac{1}{n}\log\frac{1}{P_{\tilde{X}^n}(x_0^n)} \ge \frac{1}{n}\log\frac{1}{P_{X^n}(x_0^n)} + \frac{1}{n}\log\frac{1}{1 + \epsilon^{1/2}} > R + \frac{\delta}{2},$$
if $n$ is sufficiently large. Therefore, we have found (an infinite number of) $n$ such that
$$P_{\tilde{X}^n}\left[\frac{1}{n}\log\frac{1}{P_{\tilde{X}^n}(\tilde{X}^n)} > \frac{1}{n} R(\tilde{X}^n)\right] \ge P_{\tilde{X}^n}(x_0^n) > 0,$$
contradicting Lemma 3.
2) $S(X) \le \bar{H}(X)$ is a special case (identity channel) of the general direct resolvability result (Theorem 4 in Section IV). $\square$

Remark 1: We may consider a modified version of Definition 9 as follows. Let us say that an $r$-ary variable-length code $\{\phi(x^n)\}_{x^n \in A^n}$ for $X^n$ is an $\epsilon$-prefix code for $X^n$ ($0 < \epsilon < 1$) if there exists a subset $D$ of $A^n$ such that $P_{X^n}(D) \ge 1 - \epsilon$ and $\{\phi(x^n)\}_{x^n \in D}$ is a prefix code. It is easy to check that Theorem 2 continues to hold if "all $\gamma > 0$" and "$r$-ary prefix code" are replaced by "all $\gamma > 0$ and $0 < \epsilon < 1$" and "$r$-ary $\epsilon$-prefix code," respectively, in Definition 9.

A general formula for the minimum achievable rate for noiseless source coding without stationarity and ergodicity assumptions has been a longstanding goal. It had been achieved [10] in the setting of variable-length coding (see Theorem 2). In fixed-length coding, progress towards that goal had been achieved mainly in the context of stationary sources (via the ergodic decomposition theorem, e.g., [6]). A general result that holds for nonstationary/nonergodic sources is stated in$^4$ [9] without introducing the notions of $T(X)$ and $\bar{H}(X)$. The results established in this section from the standpoint of distribution approximation attain general formulas for both fixed-length and variable-length source coding without recourse to stationarity or ergodicity assumptions. It should be noted that an independent proof of $T(X) = \bar{H}(X)$ can be obtained by generalizing the proof of the source coding theorem in [3, Theorem 1.1].

$^4$ [9] refers to a Nankai University thesis by T. S. Yang for a proof.

IV. RESOLVABILITY THEOREMS

A general formula for the resolvability of any channel in terms of its statistical description is obtained in this section. This result will be shown by means of an achievability (or direct) theorem which provides an upper bound to resolvability along with a converse theorem which gives a lower bound to resolvability.

A. Direct Resolvability Theorem

Theorem 4: Every channel $W$ and input process $X$ satisfy
$$S_\epsilon(X) \le \bar{I}(X; Y)$$
for any $\epsilon > 0$, where $Y$ is the output of $W$ due to $X$.

Proof: Fix an arbitrary $\gamma > 0$. According to Definition 7, we have to show the existence of a process $\tilde{X}$ such that
$$\lim_{n \to \infty} d(Y^n, \tilde{Y}^n) = 0$$
and $\tilde{X}^n$ is an $M$-type distribution with
$$M = \exp(n\bar{I}(X; Y) + n\gamma),$$
and $Y^n$, $\tilde{Y}^n$ are the output distributions due to $X^n$ and $\tilde{X}^n$, respectively.

We will construct the approximating input statistics by the Shannon random selection approach. For any collection of (not necessarily distinct) $M$ elements of $A^n$, the distribution constructed by placing $1/M$ mass on each of the members of the collection is an $M$-type distribution. If each member of the collection is generated randomly and independently with distribution $X^n$, we will show that the variational distance between $Y^n$ and the approximated output averaged over the selection of the $M$-collection vanishes, and hence there must exist a sequence of realizations for which the variational distance also vanishes.

For any $\{c_j \in A^n,\ j = 1, \ldots, M\}$ denote the output distribution
$$\tilde{Y}^n[c_1, \ldots, c_M](y^n) = \frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j). \tag{4.1}$$
The objective is to show that
$$\lim_{n \to \infty} E\, d(Y^n, \tilde{Y}^n[X_1^n, \ldots, X_M^n]) = 0,$$
where the expectation is with respect to i.i.d. $(X_1^n, \ldots, X_M^n)$ with common distribution $X^n$. Instead of using the standard Csiszar-Kullback-Pinsker bound in terms of divergence [3], in order to upper bound $d(Y^n, \tilde{Y}^n[X_1^n, \ldots, X_M^n])$ we will use the following new bound in terms of the distribution of the log-likelihood ratio.

Lemma 5: For every $\mu > 0$,
$$d(P, Q) \le \frac{2\mu}{\log e} + 2 P\left[\log\frac{P(X)}{Q(X)} > \mu\right],$$
where $X$ is distributed according to $P$.
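Before the proof, here is a quick numerical sanity check of the bound as stated above (assumed toy distributions, not from the paper); logarithms, and hence $\mu$, are taken in bits.

```python
import numpy as np

# Check d(P,Q) <= 2*mu/log2(e) + 2*P[log2(P(X)/Q(X)) > mu] on random examples.
rng = np.random.default_rng(0)
for _ in range(5):
    P = rng.random(8); P /= P.sum()
    Q = rng.random(8); Q /= Q.sum()
    d = np.abs(P - Q).sum()                       # variational distance
    for mu in (0.1, 0.5, 1.0):
        tail = P[np.log2(P / Q) > mu].sum()       # P[log P(X)/Q(X) > mu], X ~ P
        bound = 2 * mu / np.log2(np.e) + 2 * tail
        assert d <= bound + 1e-12
print("Lemma 5 bound verified on random examples")
```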
Proof of Lemma 5: We can write the variational distance as
$$d(P, Q) = \sum_{x \in \Omega:\ \log\frac{P(x)}{Q(x)} > 0} [P(x) - Q(x)] + \sum_{x \in \Omega:\ \log\frac{P(x)}{Q(x)} \le 0} [Q(x) - P(x)] = 2 d_1 = 2 d_2,$$
where
$$d_1 = \sum_{x:\ 0 < \log\frac{P(x)}{Q(x)} \le \mu} [P(x) - Q(x)] + \sum_{x:\ \log\frac{P(x)}{Q(x)} > \mu} [P(x) - Q(x)] \le \frac{\mu}{\log e} + P\left[\log\frac{P(X)}{Q(X)} > \mu\right]. \qquad \square$$

Proof of Theorem 4 (cont.): According to Lemma 5, it suffices to show that the following expression goes to 0 as $n \to \infty$, for every $\mu > 0$:
$$\sum_{c_1 \in A^n}\cdots\sum_{c_M \in A^n} P_{X^n}(c_1)\cdots P_{X^n}(c_M)\; P_{\tilde{Y}^n[c_1, \ldots, c_M]}\left[\log\frac{P_{\tilde{Y}^n[c_1, \ldots, c_M]}(Y^n)}{P_{Y^n}(Y^n)} > \mu\right]$$
$$\le P\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n)) > \tau\right] + P\left[\frac{1}{M}\sum_{j=2}^{M}\exp(i_{X^n W^n}(X_j^n, Y^n)) > 1 + \tau\right], \tag{4.2}$$
where $\tau = (\exp\mu - 1)/2 > 0$, $X^n$ and $Y^n$ are connected through $W^n$, and $\{Y^n, X_2^n, \ldots, X_M^n\}$ are independent.

The first probability in the right-hand side of (4.2) is
$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \gamma + \frac{1}{n}\log\tau\right],$$
which goes to 0 by definition of sup-information rate. The second probability in (4.2) is upper bounded by
$$P\left[\frac{1}{M}\sum_{j=1}^{M}\exp(i_{X^n W^n}(X_j^n, Y^n)) > 1 + \tau\right], \tag{4.3}$$
where $\{Y^n, X_1^n, \ldots, X_M^n\}$ are independent.

Had we a maximum over $j = 1, \ldots, M$ instead of the sum in (4.3), showing that the corresponding probability vanishes would be a standard step in the random-coding proof of the direct channel coding theorem. In the present case, we need to work harder. Despite the fact that for every $j = 1, \ldots, M$,
$$E[\exp(i_{X^n W^n}(X_j^n, Y^n))] = 1,$$
it is not possible to apply the weak law of large numbers to (4.3) directly because the distribution of each of the random variables in the sum depends on the number of terms in the sum (through $n$). In order to show that the probability in (4.3) vanishes it will be convenient to condition on $Y^n$ and, accordingly, to define the following random variables for every $y^n \in B^n$ and $j = 1, \ldots, M$:
$$V_{n,j}(y^n) = \exp(i_{X^n W^n}(X_j^n, y^n)), \tag{4.4}$$
$$Z_{n,j}(y^n) = V_{n,j}(y^n)\, 1\{V_{n,j}(y^n) \le M\}, \tag{4.5}$$
and
$$U_M(y^n) = \frac{1}{M}\sum_{j=1}^{M} V_{n,j}(y^n), \qquad T_M(y^n) = \frac{1}{M}\sum_{j=1}^{M} Z_{n,j}(y^n).$$
Note that for every $y^n \in B^n$ both $\{V_{n,j}(y^n)\}_{j=1}^{M}$ and $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are independent collections of random variables because $\{X_j^n\}_{j=1}^{M}$ are independent. According to (4.4) and (4.5), the probability in (4.3) is equal to the expected value with respect to $P_{Y^n}$ of
$$P[U_M(y^n) > 1 + \tau] \le P[U_M(y^n) \ne T_M(y^n)] + P[T_M(y^n) > 1 + \tau]. \tag{4.6}$$
The first term in the right-hand side of (4.6) is equal to
$$P[U_M(y^n) \ne T_M(y^n)] \le \sum_{j=1}^{M} P[V_{n,j}(y^n) \ne Z_{n,j}(y^n)] = M\, P[V_{n,1}(y^n) > M],$$
whose expectation with respect to $P_{Y^n}$ yields
$$M\sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n)\exp(-i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\}$$
$$\le \sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n)\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\} = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \gamma\right],$$
which, again, goes to 0 by definition of $\bar{I}(X; Y)$. Regarding the second term in (4.6), notice first that
$$E[T_M(y^n)] = E[Z_{n,1}(y^n)] \le \sum_{x^n \in A^n} P_{X^n}(x^n)\frac{W^n(y^n \mid x^n)}{P_{Y^n}(y^n)} = 1. \tag{4.7}$$
Therefore, using (4.7) and the Chebychev inequality, we get
$$P[T_M(y^n) > 1 + \tau] \le P[T_M(y^n) - E[T_M(y^n)] > \tau] \le \frac{1}{\tau^2}\,\mathrm{var}(T_M(y^n)) \le \frac{1}{\tau^2 M}\, E[Z_{n,1}^2(y^n)], \tag{4.8}$$
where we have used the fact that $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are i.i.d. Finally, unconditioning the expectation on the right side of (4.8), we get
$$\frac{1}{M} E[Z_{n,1}^2(Y^n)] = \frac{1}{M}\sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n}(x^n) P_{Y^n}(y^n)\exp(2 i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) \le M\}$$
$$= E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\{\exp i_{X^n W^n}(X^n, Y^n) \le M\}\right],$$
where the expectation in the right-hand side is with respect to $P_{X^n Y^n}$ and can be decomposed as
$$E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\left\{\frac{1}{M}\exp i_{X^n W^n}(X^n, Y^n) \le \exp\left(-\frac{n\gamma}{2}\right)\right\}\right]$$
$$+\, E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\left\{\exp\left(-\frac{n\gamma}{2}\right) < \frac{1}{M}\exp i_{X^n W^n}(X^n, Y^n) \le 1\right\}\right]$$
$$\le \exp\left(-\frac{n\gamma}{2}\right) + P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \frac{\gamma}{2}\right],$$
which goes to 0 as $n \to \infty$ by definition. $\square$

We remark that in most cases of interest in applications, $X$ and $W$ will be such that $(X; Y)$ is information stable, in which case the upper bound in Theorem 4 is equal to the input-output mutual information rate (cf. Lemma 1).
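The random selection construction of the proof is easy to visualize numerically. The following minimal Python sketch (an assumed toy memoryless channel and i.i.d. biased input, not from the paper) draws roughly $\exp(n(I(X;Y)+\gamma))$ codewords i.i.d. from $X^n$, forms the $M$-type input of (4.1), and reports the variational distance between the true and approximated output statistics, which decreases as $n$ grows.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, q, gamma = 0.1, 0.3, 0.3                     # BSC crossover, input bias, rate slack

def h(t):                                        # binary entropy in bits
    return -t * np.log2(t) - (1 - t) * np.log2(1 - t)

qy = q * (1 - p) + (1 - q) * p                   # output bias
I = h(qy) - h(p)                                 # mutual information rate (i.i.d. input)

for n in (4, 8, 12):
    M = int(2 ** (n * (I + gamma)))
    y = np.array(list(product([0, 1], repeat=n)))            # all output n-tuples
    py = np.prod(np.where(y == 1, qy, 1 - qy), axis=1)       # true output distribution
    c = (rng.random((M, n)) < q).astype(int)                  # random codebook ~ X^n
    dist = (y[None, :, :] != c[:, None, :]).sum(axis=2)       # Hamming distances
    w = p ** dist * (1 - p) ** (n - dist)                     # W^n(y | c_j)
    py_tilde = w.mean(axis=0)                                 # output of the M-type input
    print(n, M, np.abs(py - py_tilde).sum())                  # variational distance
```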
B. Converse Resolvability Theorem

Theorem 4 together with the converse resolvability theorem proved in this subsection will enable us to find a general formula for the resolvability of a channel. However, let us start by giving a negative answer to the immediate question as to whether the upper bound in Theorem 4 is always tight.

Fig. 1. Discrete memoryless channel in Example 1.

Example 1: Consider the 3-input, 2-output memoryless channel of Fig. 1, and the i.i.d. input process $X$ that uses 0 and 1 with probability 1/2, respectively. It is clear that $\bar{I}(X; Y) = I(X; Y) = 1$ bit/symbol. However, the deterministic input process that concentrates all the probability in $(e, \ldots, e)$ achieves exactly the same output statistics. Thus $S(X) = \bar{S}(X) = 0$. On the other hand, it turns out that we can find a capacity-achieving input process for which the bound in Theorem 4 is tight. (We will see in the sequel that this is always true.) Let $X'$ be the uniform distribution on all sequences that contain no symbol $e$ and the same number of 0's and 1's (i.e., their type is (1/2, 0, 1/2)). The entropy rate, the resolution rate, and the mutual information rate of this process are all equal to 1 bit/symbol. Moreover, any input process which approximates $X'$ arbitrarily accurately cannot have a lower entropy rate (nor lower resolution rate, a fortiori). To see this, first consider the case when the input is restricted not to use $e$. Then the input is equal to the output and close variational distance implies that the entropies are also close (cf. Lemma 4). If $e$ is allowed in the input sequences, then the capabilities for approximating $X'$ do not improve because for any input sequence containing at least one $e$, the probability that the output sequence has type (1/2, 1/2) is less than 1/2. Therefore, the distance between the output distributions is lower bounded by one half the probability of the input sequences containing at least one $e$. Thus $S(X') = \bar{S}(X') = 1$ bit/symbol.

The degeneracy illustrated by Example 1 is avoided in important classes of channels such as discrete memoryless channels with full rank (cf. Remark 4). In those settings, sharp results including the tightness of Theorem 4 can be proved using the method of types [8]. In general, however, the converse resolvability theorem does not apply to individual inputs.
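The degeneracy in Example 1 can be checked directly. The following minimal Python sketch assumes the natural reading of Fig. 1 (symbols 0 and 1 are received noiselessly while $e$ is mapped to 0 or 1 with probability 1/2 each) and verifies that the i.i.d. equiprobable $\{0,1\}$ input and the deterministic all-$e$ input induce identical output statistics.

```python
import numpy as np
from itertools import product

W = {"0": {0: 1.0, 1: 0.0}, "1": {0: 0.0, 1: 1.0}, "e": {0: 0.5, 1: 0.5}}
n = 6

def output_distribution(input_dist):
    """input_dist: dict mapping input n-strings to probabilities."""
    p_out = {}
    for y in product([0, 1], repeat=n):
        p_out[y] = sum(px * np.prod([W[s][b] for s, b in zip(xs, y)])
                       for xs, px in input_dist.items())
    return p_out

uniform_01 = {"".join(x): 0.5 ** n for x in product("01", repeat=n)}
all_e = {"e" * n: 1.0}
py1, py2 = output_distribution(uniform_01), output_distribution(all_e)
print(max(abs(py1[y] - py2[y]) for y in py1))   # 0.0: identical output statistics
```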
Theorem 5: For any channel with finite input alphabet and all sufficiently small $\epsilon > 0$,
$$S_\epsilon \ge \sup_X \bar{I}(X; Y).$$
Proof: The following simple result turns the proof of the converse resolvability theorem into a constructive proof.

Lemma 6: Given a channel $W$ with finite input alphabet, fix $R > 0$, $\epsilon > 0$. If for every $\gamma > 0$ there exists a collection $\{Q_i^n\}_{i=1}^{N}$ such that
$$\frac{1}{n}\log\log N > R - \gamma \tag{4.9}$$
and
$$\min_{i \ne j} d(Q_i^n W^n, Q_j^n W^n) > 2\epsilon \tag{4.10}$$
infinitely often in $n$, then
$$S_\epsilon \ge R. \tag{4.11}$$

Proof of Lemma 6: We first make the point that the achievability in Definition 7 is equivalent to its uniform version where the "sufficiently large $n$" for which the statement holds is independent of $X$. To see this, suppose that $R$ is $\epsilon$-achievable in the sense of Definition 7 and denote by $n_0(\epsilon, R, \gamma, W, X)$ the minimum $n_0$ such that for all $n \ge n_0$ there exists $\tilde{X}$ satisfying (2.4) and (2.5). We claim
$$\sup_X n_0(\epsilon, R, \gamma, W, X) < \infty. \tag{4.12}$$
Assume otherwise. Then there exists a sequence of input processes $\{X_k\}_{k=1}^{\infty}$ such that
$$\bar{n}_k = n_0(\epsilon, R, \gamma, W, X_k)$$
is an increasing divergent sequence. Construct a new input process $X$ by letting
$$X^n = X_{k+1}^n, \qquad \text{if } \bar{n}_k \le n < \bar{n}_{k+1}.$$
Note that for all $k$, $n = \bar{n}_{k+1} - 1$ is such that there is no $\tilde{X}^n$ with the desired properties. On the other hand, since $\bar{n}_{k+1} - 1$ is a divergent sequence, $R$ cannot be an $\epsilon$-achievable resolution rate for the constructed input process $X$, and we arrive at a contradiction, thereby establishing (4.12).

Let us now contradict$^5$ the claim of Lemma 6 and suppose that for some $R' < R$,
$$S_\epsilon \le R'.$$
Then, for each $Q_i^n$, we can find $\tilde{Q}_i^n$ such that
$$\frac{1}{n} R(\tilde{Q}_i^n) < R' + \gamma \tag{4.13}$$
and
$$d(Q_i^n W^n, \tilde{Q}_i^n W^n) < \epsilon, \tag{4.14}$$
if $n \ge \sup_X n_0(\epsilon, R', \gamma, W, X)$. Along those $n$, there is an infinite number of integers, denoted by $J$, for which (4.10) holds. Let us focus attention on those blocklengths only. We must have $\tilde{Q}_i^n \ne \tilde{Q}_j^n$ if $i \ne j$, for otherwise the triangle inequality implies
$$d(Q_i^n W^n, Q_j^n W^n) \le d(Q_i^n W^n, \tilde{Q}_i^n W^n) + d(\tilde{Q}_j^n W^n, Q_j^n W^n) < 2\epsilon,$$
contradicting (4.10). The number of different $M'$-type distributions on $A^n$ is upper bounded by $|A^n|^{M'}$. Therefore, the number of different distributions satisfying (4.13) is upper bounded by
$$N \le \exp(nR' + n\gamma)\,|A^n|^{\exp(nR' + n\gamma)} \le |A^n|^{\exp(nR' + 2n\gamma)}$$
for sufficiently large $n \in J$, which results in
$$\frac{1}{n}\log\log N \le R' + 3\gamma$$
for sufficiently large $n \in J$, contradicting (4.9) because $\gamma > 0$ is arbitrarily small. $\square$

Proof of Theorem 5 (cont.): The construction of the $N$ distributions required by Lemma 6 boils down to the construction of a channel derived from the original one and a code for that channel. This is because of Lemma 7, which is akin to the direct identification coding theorem [1] and whose proof readily follows from the argument used in the proof of [7, Theorem 3].

Lemma 7: Fix $0 < \lambda < 1/2$, and a conditional distribution $W_0^n: T_X^n \to T_Y^n$ with arbitrary alphabets $T_X^n \subset A^n$ and $T_Y^n \subset B^n$. If there exists an $(n, M, \lambda)$ code in the maximal error probability sense for $W_0^n$, then there exist $\rho > 1$ and a collection $\{(Q_i^n, D_i)\}_{i=1}^{N}$, where $Q_i^n$ is a distribution on $T_X^n$ and $D_i \subset T_Y^n$, such that for all $i = 1, \ldots, N$,
$$Q_i^n W_0^n(D_i) \ge 1 - \lambda, \tag{4.15}$$
$$Q_j^n W_0^n(D_i) \le 2\lambda, \qquad \text{if } j \ne i, \tag{4.16}$$
$$R(Q_i^n) \le \log M, \tag{4.17}$$
and
$$N \ge \frac{1}{\rho}\left(\rho^M - 1\right).$$

We will show that the collection $\{Q_i^n\}_{i=1}^{N}$ satisfies the conditions of Lemma 6, with an appropriate choice of $M$. For now, let us construct an appropriate conditional distribution $W_0^n$ which will be suitable for finding the channel code required by Lemma 7.

Lemma 8: For every $X$ and $R_1 < \bar{I}(X; Y)$ there exist $0 < \alpha < 1$, $T_X^n \subset A^n$, $T_Y^n \subset B^n$, $T_{XY}^n \subset T_X^n \times T_Y^n$ and a conditional distribution $W_0^n: T_X^n \to T_Y^n$ such that if $(x^n, y^n) \in T_{XY}^n$, then
$$\alpha W_0^n(y^n \mid x^n) \le W^n(y^n \mid x^n) \le W_0^n(y^n \mid x^n) \tag{4.18}$$
and
$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) > R_1 \tag{4.19}$$
infinitely often, where
$$P_{X_0^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(T_X^n), & x^n \in T_X^n, \\ 0, & x^n \notin T_X^n, \end{cases} \tag{4.20}$$
and $P_{Y_0^n} = P_{X_0^n} W_0^n$.

$^5$ The finiteness of the input alphabet is crucial in this argument.
Proof: Choose $R_1 < R_2 < \bar{I}(X; Y)$ and define
$$D_{XY}^n = \left\{(x^n, y^n) \in A^n \times B^n: \frac{1}{n} i_{X^n W^n}(x^n, y^n) > R_2\right\}.$$
By definition of $\bar{I}(X; Y)$, there exists $\alpha > 0$ such that
$$P_{X^n Y^n}[D_{XY}^n] > 2\alpha, \tag{4.21}$$
if $n \in I$, where $I$ is an infinite set. Define
$$T_Y^n(x^n) = \{y^n \in B^n: (x^n, y^n) \in D_{XY}^n\}, \qquad \sigma(x^n) = W^n(T_Y^n(x^n) \mid x^n),$$
$$T_X^n = \{x^n \in A^n: \sigma(x^n) > \alpha\}, \qquad T_Y^n = \bigcup_{x^n \in T_X^n} T_Y^n(x^n), \qquad T_{XY}^n = D_{XY}^n \cap (T_X^n \times T_Y^n),$$
and $W_0^n: T_X^n \to T_Y^n$ by
$$W_0^n(y^n \mid x^n) = \begin{cases} W^n(y^n \mid x^n)/\sigma(x^n), & \text{if } y^n \in T_Y^n(x^n), \\ 0, & \text{if } y^n \notin T_Y^n(x^n). \end{cases}$$
Now, (4.18) follows immediately because if $x^n \in T_X^n$, then $\alpha < \sigma(x^n) \le 1$. In general, if $x^n \in A^n$, then $0 \le \sigma(x^n) \le 1$ and
$$P_{X^n Y^n}[D_{XY}^n] = \sum_{x^n \in A^n} P_{X^n}(x^n)\sigma(x^n) \le P_{X^n}(T_X^n) + \alpha,$$
which together with (4.21) implies that
$$P_{X^n}(T_X^n) > \alpha, \tag{4.22}$$
if $n \in I$. In turn, this implies
$$P_{Y_0^n}(y^n) = \sum_{x^n \in T_X^n} W_0^n(y^n \mid x^n) P_{X_0^n}(x^n) \le \frac{1}{\alpha^2} P_{Y^n}(y^n).$$
Now, to prove (4.19), note that if $(x^n, y^n) \in T_{XY}^n$ and $n \in I$,
$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) = \frac{1}{n}\log\frac{W^n(y^n \mid x^n)}{\sigma(x^n) P_{Y_0^n}(y^n)} \ge \frac{1}{n}\log\frac{W^n(y^n \mid x^n)}{P_{Y_0^n}(y^n)} \ge \frac{1}{n} i_{X^n W^n}(x^n, y^n) + \frac{2}{n}\log\alpha > R_2 + \frac{2}{n}\log\alpha > R_1$$
infinitely often, where we have used (4.22) to derive the second inequality. Finally, by definition of $W_0^n$,
$$P_{X_0^n Y_0^n}[T_{XY}^n] = 1. \qquad \square$$

Proof of Theorem 5 (cont.): Now it remains to show the existence of the channel code required by Lemma 7. This is accomplished via a fundamental result in the Shannon theory applied to $X_0^n$ and $W_0^n$.

Lemma 9 [4]: For every $n$, $M$ and $\theta > 0$, there exists an $(n, M, \lambda)$ code for $W_0^n$ whose maximal error probability $\lambda$ satisfies
$$\lambda \le \exp(-n\theta) + P\left[\frac{1}{n} i_{X_0^n W_0^n}(X_0^n, Y_0^n) \le \frac{1}{n}\log M + \theta\right]. \tag{4.23}$$

Now, let us choose $M$ so that
$$R_1 - 2\theta \le \frac{1}{n}\log M < R_1 - \theta, \tag{4.24}$$
for arbitrary $\theta > 0$. Then, owing to (4.19) and the second inequality in (4.24), the second term on the right-hand side of (4.23) vanishes. Lemmas 8 and 9 along with the left inequality in (4.24) provide the $(n, M, \exp(-n\theta))$ code required by Lemma 7 for an infinite number of blocklengths $n$. Then, (4.17) and (4.24) imply that for sufficiently large $n$,
$$\frac{1}{n}\log\log N \ge \frac{1}{n}\log M - \theta \ge R_1 - 3\theta,$$
which satisfies (4.9) for arbitrary $\gamma > 0$ because $R_1 < \sup_X \bar{I}(X; Y)$, and $\theta > 0$ can be chosen arbitrarily close to those boundaries. Finally, to show (4.10) we apply Lemma 7 to get
$$d(Q_i^n W^n, Q_j^n W^n) \ge 2 Q_i^n W^n(D_i) - 2 Q_j^n W^n(D_i) \ge 2\alpha Q_i^n W_0^n(D_i) - 2 Q_j^n W_0^n(D_i) \ge 2\alpha(1 - \lambda) - 4\lambda, \tag{4.25}$$
where the second inequality follows from (4.18) and the third inequality results from (4.15) and (4.16). With an adequate choice of $\lambda$ (guaranteed by Lemma 7) the right side of (4.25) is strictly positive and (4.10) holds, concluding the proof of Theorem 5. $\square$

Theorems 4 and 5 and $S = \sup_X S(X)$ readily result in the general formula for channel resolvability.

Theorem 6: The resolvability of a channel $W$ with finite input alphabet is given by
$$S = \sup_X \bar{I}(X; Y). \tag{4.26}$$

Remark 2: Theorems 4 and 5 actually show the stronger result:
$$S_\epsilon = \sup_X \bar{I}(X; Y) \tag{4.27}$$
for all sufficiently small $\epsilon > 0$.

V. RESOLVABILITY AND CAPACITY

Having derived a general expression for resolvability in Theorem 6, this section shows that for the great majority of channels of interest, resolvability is equal to the Shannon capacity.$^6$ Let us first record the following fact.

$^6$ We adhere to the conventional definition of channel capacity [3, p. 101].
Theorem 7: For any channel with finite input alphabet,
$$C \le S. \tag{5.1}$$

Proof: The result follows from Theorem 6 and the following chain of inequalities:
$$C \le \liminf_{n \to \infty}\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \limsup_{n \to \infty}\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \sup_X \bar{I}(X; Y).$$
The first inequality is the general (weak) converse to the channel coding theorem, which can be proved using the Fano inequality (cf. [16, Theorem 1]). So only the third inequality needs to be proved, i.e., for all $\gamma > 0$ and all sufficiently large $n$,
$$\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \sup_X \bar{I}(X; Y) + \gamma,$$
but this follows from Lemma A1 (whose proof hinges on the finiteness of the input alphabet) as developed in the proof of Lemma 1 (cf. Appendix). $\square$

In order to investigate under what conditions equality holds in (5.1) we will need the following definition.

Definition 10: A channel $W$ with capacity $C$ is said$^7$ to satisfy the strong converse if for every $\gamma > 0$ and every sequence of $(n, M, \lambda_n)$ codes that satisfy
$$\frac{1}{n}\log M > C + \gamma,$$
it holds that
$$\lambda_n \to 1, \qquad \text{as } n \to \infty.$$

$^7$ This is the strong converse in the sense of Wolfowitz [20].

Theorem 8: For any channel with finite input alphabet which satisfies the strong converse:
$$C = S.$$

Proof: In view of Theorem 7 and its proof, it is enough to show $S \le C$. The following lemma implies the desired result
$$C \ge \sup_X \bar{I}(X; Y) = S.$$

Lemma 10: A channel $W$ that satisfies the strong converse has the property that for all $\delta > 0$ and $X$,
$$\lim_{n \to \infty} P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] = 0.$$

Proof of Lemma 10: Arguing by contradiction, assume that there exist $\delta > 0$, $\alpha > 0$ and $X^n$ such that
$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha,$$
for all $n$ in an infinite set $J$ of integers. Under such an assumption, we will construct an $(n, M, 1 - \alpha/3)$ code with
$$C + \frac{\delta}{4} \le \frac{1}{n}\log M \le C + \frac{\delta}{2} \tag{5.2}$$
for every sufficiently large $n \in J$. The construction will follow from the standard Shannon random selection with a suitably chosen input distribution.

Codebook: Each codeword is selected independently and randomly with distribution
$$P_{X_0^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(G), & \text{if } x^n \in G, \\ 0, & \text{otherwise}, \end{cases}$$
where
$$G = \left\{x^n \in A^n: W^n(D(x^n) \mid x^n) \ge \frac{\alpha}{2}\right\}$$
and
$$D(x^n) = \left\{y^n \in B^n: \frac{1}{n} i_{X^n W^n}(x^n, y^n) > C + \delta\right\}.$$
Decoder: The following decoding set $D_i$ corresponds to codeword $c_i$:
$$D_i = D(c_i) - \bigcup_{\substack{j=1 \\ j \ne i}}^{M} D(c_j).$$
Error Probability:
$$W^n(D_i^c \mid c_i) \le W^n(D^c(c_i) \mid c_i) + \sum_{\substack{j=1 \\ j \ne i}}^{M} W^n(D(c_j) \mid c_i) \le 1 - \frac{\alpha}{2} + \sum_{\substack{j=1 \\ j \ne i}}^{M} W^n(D(c_j) \mid c_i). \tag{5.3}$$
Let us now estimate the average of the last term in (5.3) with respect to the choice of $c_i$:
$$\sum_{c_i \in G} W^n(D(c_j) \mid c_i) P_{X_0^n}(c_i) \le \frac{1}{P_{X^n}(G)}\sum_{x^n \in A^n} W^n(D(c_j) \mid x^n) P_{X^n}(x^n) = P_{Y^n}(D(c_j))/P_{X^n}(G),$$
where
$$P_{Y^n}(D(c_j)) = \sum_{y^n \in B^n} P_{Y^n}(y^n)\, 1\{P_{Y^n}(y^n) < \exp(-nC - n\delta) W^n(y^n \mid c_j)\} < \exp(-nC - n\delta).$$
Thus, the average of the right side of (5.3) is upper bounded by
$$1 - \frac{\alpha}{2} + (M - 1)\exp(-nC - n\delta)/P_{X^n}(G) \le 1 - \frac{\alpha}{2} + \exp\left(-\frac{n\delta}{2}\right)\Big/P_{X^n}(G), \tag{5.4}$$
using (5.2). Finally, all that remains is to lower bound $P_{X^n}(G)$. We notice that $P_{X^n}(G) = P[Z \ge \alpha/2]$, with $Z = W^n(D(X^n) \mid X^n)$. The random variable $Z$ lies in the interval $[0, 1]$ and its expectation is bounded by
$$P\left[Z \ge \frac{\alpha}{2}\right] + \frac{\alpha}{2} \ge E[Z] = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha.$$
Therefore, the right side of (5.4) is upper bounded by
$$1 - \frac{\alpha}{2} + \exp\left(-\frac{n\delta}{2}\right)\frac{2}{\alpha} < 1 - \frac{\alpha}{3},$$
for all sufficiently large $n \in J$, thereby completing the construction and analysis of the sought-after code. $\square$

The condition of Theorem 8 is not only sufficient but also necessary for $S = C$. This fact along with a general formula for the Shannon capacity of a channel (obtained without any assumptions such as information stability):
$$C = \sup_X \underline{I}(X; Y) \tag{5.5}$$
(cf. the dual expression (4.26) for the channel resolvability) is proved in [17], by way of a new approach to the converse of the channel coding theorem.

Important classes of channels (such as those with finite memory) are known to satisfy the strong converse [5], [18], [20]. The archetypical example of a channel that does not satisfy the strong converse is the following.

Fig. 2. Channel whose capacity is less than its resolvability.

Fig. 3. Information spectrum (probability mass function of the normalized information density) of the channel in Example 2 with $\delta_1 = 0.1$, $\delta_2 = 0.15$, $a = 0.5$, $n = 1000$.

Example 2: Consider the channel in Fig. 2 where the switch selects one of the binary symmetric channels (BSC) with probabilities $(a, 1 - a)$, $0 < a < 1$, and remains fixed for the whole duration of the transmission. Thus, its conditional probabilities are
$$W^n(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = a\prod_{i=1}^{n} W_1(y_i \mid x_i) + (1 - a)\prod_{i=1}^{n} W_2(y_i \mid x_i),$$
where
$$W_i(y \mid x) = (1 - \delta_i)\,1\{y = x\} + \delta_i\,1\{y \ne x\}, \qquad 0 < \delta_i < 1/2, \quad i = 1, 2.$$
A typical information spectrum for this channel and large $n$ is depicted in Fig. 3. The $\epsilon$-capacity of this channel depends on $\epsilon$ [19] and its capacity is equal to $\min\{C_1, C_2\}$, where $C_i = \log 2 - h(\delta_i)$ is the capacity of BSC$_i$. In order to compute the resolvability of this channel, note first that if the distribution of the random variable $A_n$ is a mixture of two distributions, then the limsup in probability of $\{A_n\}$ is equal to the maximum of the corresponding limsups in probability that would result if $A_n$ were distributed under each of the individual distributions comprising the mixture. Now, to investigate the asymptotic behavior of the information spectrum, consider the bounds
$$\frac{1}{n}\log\min\{a, 1 - a\} + \max\{u, v\} \le \frac{1}{n}\log\left(a\exp(nu) + (1 - a)\exp(nv)\right) \le \max\{u, v\}.$$
Therefore, the information spectrum evaluated with the optimal i.i.d. input distribution $X$ (which assigns equal probability to all the input sequences) converges asymptotically to the distribution of
$$\max\left\{\frac{1}{n}\sum_{j=1}^{n} i_{X W_1}(X_j; Y_j),\ \frac{1}{n}\sum_{j=1}^{n} i_{X W_2}(X_j; Y_j)\right\}, \tag{5.6}$$
where we have identified $u$ and $v$ with the quantities within the maximum in (5.6). If $(X_j, Y_j)$ are connected through $W_1$ (which occurs with probability $a$), then the expected value of the first term in (5.6) exceeds that of the second one by
$$E[i_{X W_1}(X_j; Y_j)] - E[i_{X W_2}(X_j; Y_j)] = D(W_1 \| W_2) \ge 0, \tag{5.7}$$
where the expectations in (5.7) are with respect to the joint distribution of $(X_j, Y_j)$ connected through $W_1$. Reversing the roles of channels 1 and 2, we obtain an analogous expression to (5.7). Therefore, the weak law of large numbers results in
$$\bar{I}(X; Y) = \max\{C_1, C_2\}. \qquad \square$$

We have seen that for the majority of channels, resolvability is equal to capacity, and therefore the body of results in information theory devoted to the maximization of mutual information is directly applicable to the calculation of resolvability for these channels. Example 2 has illustrated the computation of resolvability using the formula in Theorem 6 in a case where the capacity is strictly smaller than resolvability. For channels that do not satisfy the strong converse it is of interest to develop tools for the maximization of the sup-information rate (resolvability) and of the inf-information rate (capacity). It turns out [17] that the basic results on mutual information which are the key to its maximization, such as the data-processing lemma and the optimality of independent inputs for memoryless systems, are inherited by $\bar{I}(X; Y)$ and $\underline{I}(X; Y)$.
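The bimodal spectrum of Fig. 3 is easy to reproduce by simulation; the following minimal Python sketch (assumed parameters matching the figure caption, not from the paper) samples the normalized information density of the switched-BSC channel with i.i.d. equiprobable inputs and shows the two clusters near $C_1 = 1 - h(\delta_1)$ and $C_2 = 1 - h(\delta_2)$, so that the sup-information rate (resolvability) is $\max\{C_1, C_2\}$ while the capacity is $\min\{C_1, C_2\}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, a, n, trials = 0.1, 0.15, 0.5, 1000, 20000

def h(t):
    return -t * np.log2(t) - (1 - t) * np.log2(1 - t)

which = rng.random(trials) < a                     # switch position per transmission
delta = np.where(which, d1, d2)
flips = rng.random((trials, n)) < delta[:, None]   # error pattern of the selected BSC
k = flips.sum(axis=1)                              # number of crossovers
# normalized information density (1/n) log2 [ W^n(y|x) / 2^{-n} ] with
# W^n(y|x) = a d1^k (1-d1)^(n-k) + (1-a) d2^k (1-d2)^(n-k)
w = a * d1**k * (1 - d1) ** (n - k) + (1 - a) * d2**k * (1 - d2) ** (n - k)
spectrum = (np.log2(w) + n) / n
print("spikes expected near:", 1 - h(d1), 1 - h(d2))
print("cluster means:", spectrum[which].mean(), spectrum[~which].mean())
```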
From such a sequence of codebooks { Qy , i = 1, . . . , N}
where N grows monotonically (doubly exponentially) with
n, we can construct the sequence {Qi = (Qi, Qz, .. .)}E”=,
required in Lemma 6, with an arbitrary choice of Qy if i > N.
Then {Qi}~!i
satisfies (4.9). Furthermore, for i # j and
i < N, j 5 N, then for all sufficiently large n,
d(QaW”,
QyWn)
> 2Q~W’“(Di)
- aQ,“w”(Di)
> 2(1 - Xl) - 2x2 > 26,
satisfying (4.10). Thus, the conclusion of Lemma 6 is that
0
and Theorem 9 is proved.
Theorem 9 and Theorem 4 imply that the (Xi, &)-ID
capacity is upper bounded by
VI. RESOLVABILITYAND IDENTIFICATIONVIA CHANNELS
A major recent achievement in the Shannon Theory was the
identification (ID) coding theorem of Ahlswede and Dueck [l].
The ID capacity of a channel is the maximal iterated logarithm
of the number of messages per channel use which can be
reliably transmitted when the receiver is only interested in
deciding whether a specific message is equal to the transmitted
message. The direct part of the ID coding theorem states
that the ID-capacity of any channel is lower bounded by its
capacity [l]. A version of the converse theorem (soft converse)
which requires the error probabilities to vanish exponentially
fast and applies to discrete memoryless channels was proved in
[l]. The strong converse to the ID coding theorem for discrete
memoryless channels was proved in [7]. Both proofs (of the
soft-converse and the strong converse) are nonelementary and
crucially rely on the assumption that the channel is discrete and
memoryless. The purpose of this section is to provide a version
of the strong converse to the ID coding theorem which not only
holds in wide generality, but follows immediately from the
direct part of the resolvability theorem. The link between the
theories of approximation of output statistics and identification
via channels is not accidental. We have already seen that the
proof of the converse resolvability theorem (Theorem 5) uses
Lemma 7, which is, in essence, the central tool in the proof
of the direct ID coding theorem.
The root of the interplay between both bodies of results is
the following simple theorem.*
Theorem 9: Let the channel have finite input alphabet. Its
(Xi, &)-ID capacity is upper bounded by its e-resolvability
S,, with 0 < E < 1 - Xr - X2.
s We refer the reader to [l], [7] for the pertinent definitions in identification
via channels.
supT(X;
X
Y),
which is equal to the Shannon capacity under the mild sufficient condition (strong converse) found in Section V. This
gives a very general version of the strong converse to the
identification coding theorem, which applies to any finite-input
channel-well beyond the discrete memoryless channels for
which it was already known to hold [7, Theorem 21.
It should be noted that Theorem 9 and [7, Theorem 1] imply that if $0 < \lambda < \lambda_1$, $\lambda < \lambda_2$, $\epsilon > 0$, and $\epsilon + \lambda_1 + \lambda_2 \le 1$, then
$$C_\lambda \le D_{\lambda_1, \lambda_2} \le S_\epsilon, \qquad (6.1)$$
where $C_\lambda$ is the $\lambda$-capacity of the channel in the maximal error probability sense and $D_{\lambda_1, \lambda_2}$ is the $(\lambda_1, \lambda_2)$-ID capacity. Note that, unlike the bound on $\epsilon$-resolvability in Theorem 5, (6.1) can be used with arbitrary $0 < \epsilon < 1$, but may not be tight if the channel does not satisfy the strong converse. If the strong converse is satisfied, however, (6.1) holds with equality for all sufficiently small $\epsilon > 0$, because of (4.27) and Theorem 8, as well as the fact that $C = C_\lambda$ for all $0 < \lambda < 1$ due to the assumed strong converse. Consequently, we have the following corollary.

Corollary: For any finite-input channel satisfying the strong converse,
$$C = D_{\lambda_1, \lambda_2} = S \qquad (6.2)$$
if $\lambda_1 + \lambda_2 < 1$.

The first equality in (6.2) had been proved in [7, Theorem 2] for the specific case of discrete memoryless channels using a different approach.
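The logic behind the corollary can be summarized by the following chain, which merely strings together the results quoted above (the strong converse, [7, Theorem 1], Theorem 9, and the resolvability formula) and is recorded here only as a reading aid:
$$C = C_\lambda \le D_{\lambda_1, \lambda_2} \le S_\epsilon \le S = \sup_X \overline{I}(X; Y) = C,$$
valid for any $0 < \lambda < \min\{\lambda_1, \lambda_2\}$ and all sufficiently small $\epsilon > 0$, which forces all the quantities to coincide.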
VII. MEAN-RESOLVABILITY THEOREMS

This section briefly investigates the effect of replacing the worst-case randomness measure (resolution) by the average randomness measure (entropy) on the results obtained in Section IV. The treatment here will not be as thorough as that in Section IV, and in particular, we will leave the proof of a general formula for mean-resolvability for future work. Instead, we will present some evidence that, in channels of interest, mean-resolvability is also equal to capacity.

An immediate consequence of (2.8) is that the direct resolvability theorem (Theorem 4) holds verbatim for mean-resolvability, i.e., for all $\epsilon > 0$ and $X$,
$$\bar{S}_\epsilon(X) \le \bar{S}(X) \le \overline{I}(X; Y), \qquad (7.1)$$
and if the channel satisfies the strong converse, then
$$\bar{S} \le C. \qquad (7.2)$$
Therefore, in this section our attention can be focused on converse mean-resolvability theorems.

First, we illustrate a proof technique which is completely different from that of the converse resolvability theorem (Theorem 5) in order to find the mean-resolvability of binary symmetric channels (BSC).

Theorem 10: The mean-resolvability of a BSC is equal to its capacity.

Proof: Since BSC's satisfy the strong converse, (7.2) holds and we need to show
$$\bar{S} \ge C. \qquad (7.3)$$
Suppose otherwise, i.e., for some $\mu > 0$,
$$0 \le \bar{S} < C - \mu. \qquad (7.4)$$
Let $\lambda = \mu/\log 4$ and $\gamma = \mu/8$. For all sufficiently large $n$, there exists an $(n, M, \lambda)$ code (in the maximal error probability sense) such that all its codewords are distinct ($\lambda < 1/2$ because, without loss of generality, $\mu < \log 2$) and
$$\log 2 \ge \frac{1}{n}\log M \ge C - \gamma. \qquad (7.5)$$
Let $X^n$ be uniformly distributed on the codewords of the $(n, M, \lambda)$ code. Thus,
$$H(X^n) = \log M. \qquad (7.6)$$
According to (7.4), for any $0 < \theta < 1/2$, there exists $\tilde{X}^n$ such that the outputs to $X^n$ and $\tilde{X}^n$ satisfy
$$d(Y^n, \tilde{Y}^n) \le \theta \qquad (7.7)$$
and
$$\frac{1}{n} H(\tilde{X}^n) \le C - \mu + \gamma. \qquad (7.8)$$
Since, by definition of the BSC,
$$H(\tilde{Y}^n \mid \tilde{X}^n) = H(Y^n \mid X^n), \qquad (7.9)$$
we obtain
$$\frac{1}{n} H(Y^n) - \frac{1}{n} H(\tilde{Y}^n) = \frac{1}{n} H(X^n) - \frac{1}{n} H(X^n \mid Y^n) + \frac{1}{n} H(\tilde{X}^n \mid \tilde{Y}^n) - \frac{1}{n} H(\tilde{X}^n)$$
$$\ge \frac{1}{n} H(X^n) - \lambda\,\frac{1}{n}\log M - \frac{1}{n} h(\lambda) - \frac{1}{n} H(\tilde{X}^n)$$
$$\ge C - \gamma - [C - \mu + \gamma] - \lambda\log 2 - \frac{1}{n} h(\lambda)$$
$$= \frac{\mu}{4} - \frac{1}{n} h(\lambda) \ge \frac{\mu}{8}, \qquad (7.10)$$
where the first inequality is a result of the Fano inequality, the second inequality follows from (7.5), (7.6), and (7.8), and the last inequality holds for all sufficiently large $n$. Now, applying Lemma 4 to the present case (here the output alphabet is $\{0,1\}^n$, of size $2^n$), (7.7) results in
$$|H(Y^n) - H(\tilde{Y}^n)| \le n\theta\log 2 + \theta\log(1/\theta), \qquad (7.11)$$
which contradicts (7.10) because $\theta$ can be chosen arbitrarily small. □

Remark 3: The only feature of the BSC used in the proof of Theorem 10 is that $H(Y \mid X = x)$ is independent of $x$, which holds for any "additive-noise" discrete memoryless channel.
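As a purely numerical aside (not part of the original development), the quantities at stake in Theorem 10 can be computed exactly for very short blocklengths. The sketch below, with names and parameters chosen here for illustration, drives a BSC with $M = 2^{nR}$ equiprobable random codewords and evaluates the exact variational distance to the true output statistics of the capacity-achieving (uniform) input; the sharp threshold at capacity only emerges as $n$ grows, but the monotone effect of the rate is already visible.

    import itertools
    import numpy as np

    # Illustrative sketch: exact output statistics of a BSC(p) driven by M
    # equiprobable random codewords, compared with the true output of the
    # capacity-achieving (uniform i.i.d.) input, which is uniform on {0,1}^n.

    def bsc_output_mixture(codewords, p):
        """Exact output distribution on {0,1}^n induced by a uniform choice
        among the rows of `codewords` followed by a BSC with crossover p."""
        n = codewords.shape[1]
        dist = np.zeros(2 ** n)
        for idx, y in enumerate(itertools.product((0, 1), repeat=n)):
            flips = np.sum(codewords != np.array(y), axis=1)   # Hamming distances
            dist[idx] = np.mean(p ** flips * (1 - p) ** (n - flips))
        return dist

    rng = np.random.default_rng(0)
    n, p = 10, 0.11                       # capacity approx. 0.5 bit per channel use
    uniform = np.full(2 ** n, 2.0 ** (-n))
    for rate in (0.25, 0.5, 0.75, 1.0):   # bits per channel use
        M = int(round(2 ** (n * rate)))
        codewords = rng.integers(0, 2, size=(M, n))
        d = np.abs(uniform - bsc_output_mixture(codewords, p)).sum()
        print(f"R = {rate:.2f}  M = {M:4d}  d(Y^n, Y~^n) = {d:.3f}")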
A converse mean-resolvability theorem which, unlike Theorem 5, does not hinge on the assumption of finite input alphabet can be proved by curtailing the freedom of choice of the approximating input distribution. In Example 1, we illustrated the pathological behavior that may arise when the approximating input distribution puts mass on sequences which have zero probability under the original distribution. One way to avoid this behavior is to restrict the approximating input distribution to be absolutely continuous with respect to the original input distribution.

Theorem 11 (Mean-Resolvability Semi-Converse): For any channel W with capacity C there exists an input process X such that if $\tilde{X}$ satisfies
$$\lim_{n\to\infty} d(Y^n, \tilde{Y}^n) = 0 \qquad (7.12)$$
and $P_{\tilde{X}^n} \ll P_{X^n}$, then, for every $\mu > 0$,
$$\frac{1}{n} H(\tilde{X}^n) \ge C - \mu \qquad (7.13)$$
infinitely often.

Proof: Let us suppose that the result is not true and therefore there exists $\mu_0 > 0$ such that for every input process X we can find $\tilde{X}$ such that $P_{\tilde{X}^n} \ll P_{X^n}$, (7.12), and
$$\frac{1}{n} H(\tilde{X}^n) \le C - \mu_0 \quad \text{for all sufficiently large } n \qquad (7.14)$$
are satisfied. Fix $0 < \gamma < \mu_0$ and choose
$$\tau < \frac{\mu_0 - \gamma}{C - \mu_0} \qquad (7.15)$$
along with
$$\lambda < \frac{\tau}{2\tau + 1}. \qquad (7.16)$$
For all sufficiently large $n$, select an $(n, M, \lambda)$ code $\{(c_i, D_i)\}_{i=1}^{M}$ (in the maximal error probability sense) with rate
$$\frac{1}{n}\log M \ge C - \gamma.$$
Let $X^n$ be equal to $c_i$ with probability $1/M$, $i = 1, \ldots, M$. The restriction $P_{\tilde{X}^n} \ll P_{X^n}$ means that the approximating distribution can only put mass on the codewords $\{c_1, \ldots, c_M\}$. However, the mass on those points is not restricted in any way (e.g., $P_{\tilde{X}^n}$ need not have finite resolution).

Define the set of the most likely codewords under $P_{\tilde{X}^n}$:
$$T_\tau = \{x^n \in A^n : P_{\tilde{X}^n}(x^n) \ge \exp(-n(C - \mu_0)(1 + \tau))\}, \qquad (7.17)$$
whose cardinality is obviously bounded by
$$|T_\tau| \le \exp(n(C - \mu_0)(1 + \tau)). \qquad (7.18)$$
From (7.14), we have
$$n(C - \mu_0) \ge E\bigl[\log 1/P_{\tilde{X}^n}(\tilde{X}^n)\bigr] \ge n(C - \mu_0)(1 + \tau)\, P_{\tilde{X}^n}(T_\tau^c),$$
or the lower bound
$$P_{\tilde{X}^n}(T_\tau) \ge \frac{\tau}{1 + \tau}. \qquad (7.19)$$
We will lower bound the variational distance between $Y^n$ and $\tilde{Y}^n$ by estimating the respective probabilities of the set
$$B = \bigcup_{i \in I} D_i, \qquad (7.20)$$
where $I = \{i \in \{1, \ldots, M\} : c_i \in T_\tau\}$. Since the sets in (7.20) are disjoint, we have
$$P_{\tilde{Y}^n}(B) \ge \sum_{i \in I} W^n(D_i \mid c_i)\, P_{\tilde{X}^n}(c_i) \ge (1 - \lambda)\, P_{\tilde{X}^n}(T_\tau) \ge (1 - \lambda)\frac{\tau}{1 + \tau}. \qquad (7.21)$$
On the other hand,
$$P_{Y^n}(B) = \frac{1}{M}\sum_{i \in I} W^n(B \mid c_i) + \frac{1}{M}\sum_{i \notin I} W^n(B \mid c_i) \le \frac{|I|}{M} + \lambda \le \exp(-n\delta) + \lambda \qquad (7.22)$$
for some $\delta > 0$ because of (7.15) and (7.18). Combining (7.21) and (7.22), we get
$$d(Y^n, \tilde{Y}^n) \ge 2 P_{\tilde{Y}^n}(B) - 2 P_{Y^n}(B) \qquad (7.23)$$
$$\ge 2\frac{\tau}{1 + \tau}(1 - \lambda) - 2\lambda - 2\exp(-n\delta), \qquad (7.24)$$
which is bounded away from zero because of (7.16), thereby contradicting (7.12). □

In connection with Theorem 11, it should be pointed out that the conclusion of the achievability result in Theorem 4 holds even if $\tilde{X}^n$ is restricted to be absolutely continuous with respect to $X^n$, as long as $X^n$ is a discrete distribution.⁹ (Recall that in the proof of Theorem 4, $\tilde{X}^n$ is generated by random selection from $X^n$.)

⁹Theorem 1 holds for infinite-alphabet channels as well.

Remark 4: Recall that a general formula for the individual resolvability of input processes is attainable only for channels that avoid the pathological behavior of Example 1. In [8], it is shown that
$$S(X) = \overline{I}(X; Y)$$
if the channel W is discrete memoryless with full rank, i.e., the transition vectors $\{W(\cdot \mid a)\}_{a \in A}$ are linearly independent. This class of channels includes as a special case the BSC (Theorem 10). Even in this special case, however, the complete characterization of $\bar{S}(X)$ remains unsolved.

VIII. DIVERGENCE-GAUGED APPROXIMATION

So far, we have studied the approximation of output statistics under the criterion of vanishing variational distance. Here, we will consider the effect of replacing the variational distance $d(Y^n, \tilde{Y}^n)$ by the normalized divergence
$$\frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}).$$
We point out that this criterion is neither weaker nor stronger than the previous one. Although we do not attempt to give a comprehensive body of results based on this criterion, we will show several achievability and converse results by proof techniques that differ substantially from the ones that have appeared in the previous sections.

We give first an achievability result within the context of information-stable input/output pairs.

Theorem 12: Suppose that (X, Y) is information stable and $I(X; Y) < \infty$, where Y is the output of W due to X. Then,
$$\frac{1}{n} I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) \le \Bigl[\frac{1}{n} I(X^n; Y^n) - \frac{1}{n}\log M\Bigr]^{+} + o(1), \qquad (8.1)$$
where $\{X_1^n, \ldots, X_M^n\}$ is i.i.d. with common distribution $X^n$, $\tilde{Y}^n$ is connected with $X_1^n, \ldots, X_M^n$ via the channel in Fig. 4, and the term $o(1)$ is nonnegative and vanishes as $n \to \infty$.
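Before turning to the proof, the arrangement of Fig. 4 is easy to describe procedurally. The following sketch is only an illustration of that arrangement (the names, the random seed, and the example channel are chosen here, not taken from the paper): it draws the M codewords i.i.d. from $P_{X^n}$, selects one of them with the uniformly distributed switch, and passes it through the memoryless channel W used n times.

    import numpy as np

    rng = np.random.default_rng(1)

    def fig4_sample(P_X, W, n, M):
        """One realization of the Fig. 4 arrangement: returns the codebook
        (X_1^n,...,X_M^n), the switch position, and the channel output."""
        A = len(P_X)
        codebook = rng.choice(A, size=(M, n), p=P_X)   # i.i.d. codewords ~ X^n
        j = rng.integers(M)                            # uniform switch position
        transmitted = codebook[j]
        output = np.array([rng.choice(W.shape[1], p=W[a]) for a in transmitted])
        return codebook, j, output

    # Example: binary symmetric channel with crossover 0.1 and uniform input.
    W = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
    codebook, j, output = fig4_sample(np.array([0.5, 0.5]), W, n=8, M=4)
    print(codebook, j, output, sep="\n")

In the proof below, the codeword actually transmitted through the channel is denoted $\hat{X}^n$; by construction, its joint distribution with the output coincides with that of $(X^n, Y^n)$.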
Fig. 4. Input-output transformation in random coding. Switch is equally likely to be in each of its M positions.

Proof: Note that the joint distribution of $(\hat{X}^n, \tilde{Y}^n)$ is that of $(X^n, Y^n)$, where $\hat{X}^n$ denotes the codeword selected by the switch in Fig. 4. By Kolmogorov's identity and the conditional independence of $\tilde{Y}^n$ and $X_1^n, \ldots, X_M^n$ given $\hat{X}^n$,
$$I(X^n; Y^n) = I(\hat{X}^n; \tilde{Y}^n) = I(\hat{X}^n, X_1^n, \ldots, X_M^n; \tilde{Y}^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) + I(\hat{X}^n; \tilde{Y}^n \mid X_1^n, \ldots, X_M^n), \qquad (8.2)$$
where the second term on the right side is less than or equal to $\log M$. This shows that the term $o(1)$ in (8.1) is nonnegative.

It remains to upper bound the left side of (8.1):
$$I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) = \sum_{y^n \in B^n} \sum_{c_1 \in A^n} \cdots \sum_{c_M \in A^n} P_{X^n}(c_1) \cdots P_{X^n}(c_M)\, \frac{1}{M}\sum_{i=1}^{M} W^n(y^n \mid c_i)\, \log\Bigl(\frac{1}{M}\sum_{j=1}^{M} \exp i_{X^nW^n}(c_j, y^n)\Bigr)$$
$$\le \sum_{y^n \in B^n} \sum_{c^n \in A^n} P_{X^n}(c^n)\, W^n(y^n \mid c^n)\, \log\Bigl(\frac{1}{M}\Bigl[\exp i_{X^nW^n}(c^n, y^n) + \sum_{j=2}^{M} E[\exp i_{X^nW^n}(X_j^n, y^n)]\Bigr]\Bigr)$$
$$\le E\Bigl[\log\Bigl(1 + \frac{1}{M}\exp i_{X^nW^n}(X^n, Y^n)\Bigr)\Bigr], \qquad (8.3)$$
where the first inequality follows from the concavity of the logarithm and the second is a result of
$$E[\exp i_{X^nW^n}(X_j^n, y^n)] = 1,$$
for all $y^n \in B^n$ and $j = 1, \ldots, M$.

Consider first the case where $M \le \exp(I(X^n; Y^n) - n\delta)$ for some $\delta > 0$. Using (8.3) and denoting $\Lambda_n = i_{X^nW^n}(X^n, Y^n) - I(X^n; Y^n)$, we obtain
$$\frac{1}{n} I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) - \frac{1}{n} I(X^n; Y^n) + \frac{1}{n}\log M \le \frac{1}{n} E[\log(M \exp(-I(X^n; Y^n)) + \exp \Lambda_n)]$$
$$\le \frac{1}{n} E[\log(\exp(-n\delta) + \exp \Lambda_n)] \le \frac{1}{n}\log 2 + E[\max\{-\delta, \Lambda_n/n\}], \qquad (8.4)$$
where the second inequality is a result of $M \le \exp(I(X^n; Y^n) - n\delta)$. The expectation on the right side of (8.4) is upper bounded by
$$E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} > -\delta\Bigr\}\Bigr] = -E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} \le -\delta\Bigr\}\Bigr] = \frac{1}{n} I(X^n; Y^n)\, P\Bigl[\frac{\Lambda_n}{n} \le -\delta\Bigr] - \frac{1}{n} E\bigl[i_{X^nW^n}(X^n, Y^n)\, 1\{i_{X^nW^n}(X^n, Y^n) \le I(X^n; Y^n) - n\delta\}\bigr], \qquad (8.5)$$
where we have used $E[\Lambda_n] = 0$. The first term in (8.5) vanishes asymptotically because (X, Y) is information stable with finite mutual information rate, whereas the expectation of the negative part of an information density cannot be less than $-e^{-1}\log e$ (e.g., [14, (2.3.2)]); thus, the second term in (8.5) also vanishes asymptotically.

In the remaining case,
$$\Bigl[\frac{1}{n} I(X^n; Y^n) - \frac{1}{n}\log M\Bigr]^{+} = 0,$$
and we can choose any arbitrarily small $\delta > 0$ while satisfying $M \ge \exp(I(X^n; Y^n) - n\delta)$. Now, normalizing by $n$, we can further upper bound (8.3) with
$$\frac{1}{n} E[\log(1 + \exp(\Lambda_n + n\delta))] \le \frac{1}{n}\log 2 + \delta + E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} > -\delta\Bigr\}\Bigr],$$
where we used $\frac{1}{M}\exp i_{X^nW^n}(X^n, Y^n) \le \exp(\Lambda_n + n\delta)$ together with $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$, and the last expectation vanishes asymptotically, as shown after (8.5). The theorem now follows since $\delta$ can be chosen arbitrarily small. □

Theorem 12 evaluates the mutual information between the channel output and a collection of random codewords such as those used in the random coding proofs of the direct part of the channel coding theorem. However, the rationale for its inclusion here is the following corollary. The special case of this corollary for i.i.d. inputs and discrete memoryless channels is [21, Theorem 6.3], proved using a different approach.

Corollary: For every X such that (X, Y) is information stable, and for all $\gamma > 0$, $\epsilon > 0$, there exists $\tilde{X}$ whose resolution satisfies
$$\frac{1}{n} R(\tilde{X}^n) \le I(X; Y) + \gamma$$
and
$$\frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}) \le \epsilon$$
for all sufficiently large $n$.

Proof: The link between Theorem 12 and this section is the following identity:
$$D(P_{\tilde{Y}^n \mid X_1^n, \ldots, X_M^n} \,\|\, P_{Y^n} \mid X_1^n, \ldots, X_M^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n),$$
where $\tilde{Y}^n[X_1^n, \ldots, X_M^n]$ is defined in (4.1). As in the proof of Theorem 4, (8.1) implies that there exists $(c_1^n, \ldots, c_M^n)$ such that the output distribution due to a uniform distribution on $(c_1^n, \ldots, c_M^n)$ approximates the true output distribution, in the sense that their unconditional divergence per symbol can be made arbitrarily small, by choosing $(1/n)\log M$ to be appropriately close to $(1/n) I(X^n; Y^n)$. □

A sharper achievability result (parallel to Theorem 4), whereby the assumption of (X, Y) being information stable is dropped and $I(X; Y)$ is replaced by $\overline{I}(X; Y)$, can be shown by (1) letting $M = \exp(n(\overline{I}(X; Y) + \gamma))$, (2) using $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$ to bound the right side of (8.3), and (3) invoking Lemma A1 (under the assumption that the input alphabet is finite).

The extension of the general converse resolvability results (Theorems 5 and 9) to the divergence-gauged approximation criterion is an open problem. On the other hand, the analogous exercise with the converse mean-resolvability results of Section VII is comparatively easy. A much more general class of channels than the BSCs of Theorem 10 is the scope of the next result.

Theorem 13: Let a finite-input channel W with capacity C be such that, for each sufficiently large $n$, there exists $\bar{X}^n$ for which
$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n) \qquad (8.6)$$
and
$$P[\bar{X}^n = x^n] > 0, \quad \text{for all } x^n \in A^n. \qquad (8.7)$$
If $\tilde{X}$ is such that
$$\lim_{n\to\infty} \frac{1}{n} D(P_{\tilde{Y}^n} \| P_{\bar{Y}^n}) = 0, \qquad (8.8)$$
then
$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: The following result will be used.

Lemma 11 [3, p. 147]: If $I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n)$, then
$$I(\bar{X}^n; \bar{Y}^n) \ge D(W^n(\cdot \mid x^n) \| \bar{Y}^n),$$
with equality for all $x^n \in A^n$ such that $P_{\bar{X}^n}(x^n) > 0$.

According to Lemma 11 and the assumption in Theorem 13,
$$I(\bar{X}^n; \bar{Y}^n) = D(W^n(\cdot \mid x^n) \| \bar{Y}^n) \qquad (8.9)$$
for all $x^n \in A^n$. Thus, for every distribution $\tilde{X}^n$,
$$0 \le I(\tilde{X}^n; \tilde{Y}^n) = D(W^n \| \bar{Y}^n \mid \tilde{X}^n) - D(\tilde{Y}^n \| \bar{Y}^n) = I(\bar{X}^n; \bar{Y}^n) - D(\tilde{Y}^n \| \bar{Y}^n),$$
where the second equation follows from (8.9). Thus,
$$\frac{1}{n} H(\tilde{X}^n) \ge \frac{1}{n} I(\tilde{X}^n; \tilde{Y}^n) = \frac{1}{n} \max_{X^n} I(X^n; Y^n) - \frac{1}{n} D(\tilde{Y}^n \| \bar{Y}^n),$$
from where the result follows because of (8.8) and the channel capacity converse
$$C \le \liminf_{n\to\infty} \frac{1}{n} \max_{X^n} I(X^n; Y^n). \qquad \square$$

Remark 5: It is obvious that the channel in Example 1 does not satisfy the sufficient condition in Theorem 13, whereas the full-rank discrete memoryless channels (cf. Remark 4) always satisfy that condition.

A counterpart of Theorem 11 (mean-resolvability semi-converse) with divergence in lieu of variational distance is easy to prove using the same idea as in the proof of Theorem 13.

Theorem 14: For any finite-input channel with capacity C, there exists an input process X such that if $\tilde{X}$ satisfies
$$\lim_{n\to\infty} \frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}) = 0$$
and $P_{\tilde{X}^n} \ll P_{X^n}$, then
$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: Let $X^n$ maximize $I(X^n; Y^n)$. It follows from Lemma 11 and the assumed condition $P_{\tilde{X}^n} \ll P_{X^n}$ that
$$D(W^n(\cdot \mid x^n) \| Y^n) = I(X^n; Y^n)$$
for every $x^n$ with $P_{\tilde{X}^n}(x^n) > 0$. Repeating the chain in the proof of Theorem 13,
$$\frac{1}{n} H(\tilde{X}^n) \ge \frac{1}{n} I(\tilde{X}^n; \tilde{Y}^n) = \frac{1}{n} \max_{X^n} I(X^n; Y^n) - \frac{1}{n} D(\tilde{Y}^n \| Y^n),$$
from where the result follows immediately. □

Let us now state our concluding result, which falls outside the theory of resolvability but is still within the boundaries of the approximation theory of output statistics. It is a folk theorem in information theory whose proof is intimately related to the arguments used in this section.

Fix a codebook $\{c_1^n, \ldots, c_M^n\}$. If all the codewords are equally likely, the distributions of the input and output of the channel are
$$P_{X^n}(x^n) = \frac{1}{M}, \quad \text{if } x^n = c_j^n \text{ for some } j \in \{1, \ldots, M\},$$
and
$$P_{Y^n}(y^n) = \frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j^n), \qquad (8.10)$$
respectively. The issue we want to address, in the spirit of this paper, is the relationship between $Y^n$ and the output distribution corresponding to the input that maximizes the mutual information. It is widely believed that if the code is good (with rate close to capacity and low error probability), then $Y^n$ must approximate the output distribution due to the input maximizing the mutual information. (See, e.g., [2, Section 8.10] for a discussion on the plausibility of this statement.)
To focus ideas, take a DMC with capacity
$$C = I(\bar{X}; \bar{Y}) = \max_X I(X; Y).$$
Is it true that the output due to a good code looks i.i.d. with distribution $P_{\bar{Y}}$? Sometimes this is erroneously taken for
granted based on some sort of “random coding” reasoning.
However, recall that the objective is to analyze the behavior
of the output due to any individual good code, rather than
to average the output statistics over a hypothetical random
choice of codebooks.
Our formalization of the folk-theorem is very general, and
only rules out channels whose capacity is not obtained through
the maximization of mutual information. Naturally, the result
ceases to be meaningful for those channels.
Theorem 15: For any channel W with finite input alphabet and capacity C that satisfies the strong converse, the following holds. Fix any $\gamma > 0$ and any sequence of $(n, M, \lambda_n)$ codes such that
$$\lim_{n\to\infty} \lambda_n = 0$$
and
$$\frac{1}{n}\log M \ge C - \frac{\gamma}{2}.$$
Then,¹⁰
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) \le \gamma \qquad (8.11)$$
for all sufficiently large $n$, where $Y^n$ is the output due to the $(n, M, \lambda_n)$ code (cf. (8.10)) and $\bar{Y}^n$ is the output due to the $\bar{X}^n$ that satisfies
$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n).$$
¹⁰It can be shown that the output distribution due to a maximal mutual information input is unique.

Proof: For every $X^n$, we write
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) = \frac{1}{n} D(W^n \| \bar{Y}^n \mid X^n) - \frac{1}{n} I(X^n; Y^n) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} I(X^n; Y^n), \qquad (8.12)$$
where the inequality follows from Lemma 11. If we now particularize $X^n$ to the uniform distribution on the $(n, M, \lambda_n)$ codebook of the above statement, then $(1/n) I(X^n; Y^n)$ must approach capacity because of the Fano inequality:
$$\frac{1}{n} I(X^n; Y^n) \ge (1 - \lambda_n)\frac{1}{n}\log M - \frac{1}{n}\log 2 \ge (1 - \lambda_n)\Bigl(C - \frac{\gamma}{2}\Bigr) - \frac{1}{n}\log 2. \qquad (8.13)$$
Thus, (8.12) and (8.13) result in
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - (1 - \lambda_n)\Bigl(C - \frac{\gamma}{2}\Bigr) + \frac{1}{n}\log 2 \le \gamma,$$
for sufficiently large $n$, because $\lambda_n \to 0$ and the strong-converse assumption guarantees that the inequalities in (5.1) actually reduce to identities owing to Theorem 8. □
As a simple exercise, we may particularize Theorem 15 to a BSC, in which case the output $\bar{Y}^n$ due to the $\bar{X}^n$ achieving capacity is given by
$$P_{\bar{Y}^n}(y^n) = 2^{-n},$$
for all $y^n \in \{0, 1\}^n$. Then, (8.11) is equivalent to
$$\log 2 - \gamma \le \frac{1}{n} H(Y^n),$$
for an arbitrarily small $\gamma > 0$ and all sufficiently large $n$. This implies that the output $Y^n$ due to the input distribution $X^n$ of a good codebook must be almost uniformly distributed on $\{0, 1\}^n$ (cf. [2, Example 2, Section 8.10]).
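Since the reduction of (8.11) to an output-entropy bound holds for any code over the BSC, it can be checked numerically on a toy codebook. The snippet below is only an illustration (the codebook is an arbitrary small example, not a good code, and all names are chosen here): it computes the exact code-induced output distribution and confirms that $D(Y^n \| \bar{Y}^n) = n\log 2 - H(Y^n)$ when $\bar{Y}^n$ is uniform.

    import itertools
    import numpy as np

    # Numerical check of the identity used above for the BSC (natural logs).

    def bsc_output(codewords, p):
        n = codewords.shape[1]
        dist = np.zeros(2 ** n)
        for idx, y in enumerate(itertools.product((0, 1), repeat=n)):
            flips = np.sum(codewords != np.array(y), axis=1)
            dist[idx] = np.mean(p ** flips * (1 - p) ** (n - flips))
        return dist

    codewords = np.array([[0, 0, 0, 0, 0, 0, 0],
                          [1, 0, 1, 0, 1, 0, 1],
                          [0, 1, 1, 0, 0, 1, 1],
                          [1, 1, 0, 0, 1, 1, 0]])   # 4 equiprobable codewords, n = 7
    p, n = 0.1, 7
    P = bsc_output(codewords, p)
    H = -np.sum(P * np.log(P))                      # output entropy H(Y^n) in nats
    D = np.sum(P * np.log(P * 2.0 ** n))            # divergence from the uniform output
    print(H, D, n * np.log(2) - H)                  # the last two numbers agree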
Can a result in the spirit of Theorem 15 be proved for the input statistics rather than the output statistics? The answer is negative, despite the widespread belief that the statistics of any good code must approximate those that maximize mutual information. To see this, simply consider the normalized entropy of $X^n$ versus that of $\bar{X}^n$:
$$\frac{1}{n} H(\bar{X}^n) - \frac{1}{n} H(X^n) = \frac{1}{n} H(\bar{X}^n \mid \bar{Y}^n) + \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} H(X^n),$$
where the last two terms in the right-hand side are each asymptotically close to capacity. However, the term $(1/n) H(\bar{X}^n \mid \bar{Y}^n)$ does not vanish in general. For example, in the case of a BSC with crossover probability p, $(1/n) H(\bar{X}^n \mid \bar{Y}^n) = h(p)$.
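For the BSC the bookkeeping in the last display can be made explicit (a routine check, not an additional claim):
$$\frac{1}{n} H(\bar{X}^n) = \log 2, \qquad \frac{1}{n} H(X^n) = \frac{1}{n}\log M \approx C = \log 2 - h(p), \qquad \frac{1}{n} H(\bar{X}^n \mid \bar{Y}^n) = h(p),$$
so the two normalized input entropies differ by approximately $h(p)$, which is exactly the conditional-entropy term that fails to vanish.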
Despite this negative result concerning the approximation
of the input statistics, it is possible in many cases to bootstrap
some conclusions on the behavior of input distributions with
fixed dimension from Theorem 15. For example, in the case
of the BSC ($p \neq 1/2$), the approximation of the first-order input statistics follows from that of the output because of the
invertibility of the transition probability matrix. Thus, in a
good code for the BSC, every input symbol must be equal
to 0 for roughly half of the codewords. As another example,
consider the Gaussian noise channel with constrained input
power. The output spectral density is the sum of the input
spectrum and the noise spectrum. Thus, a good code must
have an input spectrum that approximates asymptotically the
water-filling solution.
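The water-filling solution referred to here is, for parallel Gaussian subchannels, the standard allocation $S_k = \max(0, \theta - N_k)$, with the water level $\theta$ chosen to meet the power constraint. The following is only a minimal generic sketch of that computation (names and numbers are illustrative, not tied to any construction in the paper):

    import numpy as np

    def water_filling(noise, total_power, tol=1e-12):
        """Power allocation S_k = max(0, theta - N_k) with sum_k S_k = total_power."""
        noise = np.asarray(noise, dtype=float)
        lo, hi = noise.min(), noise.max() + total_power
        while hi - lo > tol:                    # bisect on the water level theta
            theta = 0.5 * (lo + hi)
            if np.maximum(theta - noise, 0.0).sum() > total_power:
                hi = theta
            else:
                lo = theta
        return np.maximum(0.5 * (lo + hi) - noise, 0.0)

    S = water_filling(noise=[1.0, 2.0, 4.0, 8.0], total_power=5.0)
    print(S, S.sum())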
The conventional intuition that regards the statistics of good
codes as those that maximize mutual information, constitutes
the basis for an important component of the practical value
of the Shannon theory. The foregoing discussion points out
that that intuition can often be put on a sound footing via the
approximation of output statistics, despite the danger inherent
in far-reaching statements on the statistics of good codes.
IX. CONCLUSION
Aside from the setting of system simulation alluded to
in the introduction, we have not dwelled on other plausible
applications of the approximation theory of output statistics.
Rather, our focus has been on highlighting the new information
theoretic concepts and their strong relationships with source
coding, channel coding and identification via channels. Other
applications could be found in topics such as transmission
without decoding and remote artificial synthesis of images,
speech and other signals (e.g., [12]).
A novel aspect of our development has been the unveiling of sup/inf-information rate and sup-entropy rate as the
right way to generalize the conventional average quantities
(mutual information rate and entropy rate) when dealing
with nonergodic channels and sources. We have seen that
those concepts actually open the way towards new types of
general formulas in source coding (Section III), channel coding
[17] and approximation of output statistics (Section IV). In
particular, the formula (5.5) for channel capacity [17] exhibits
a nice duality with the formula for resolvability (4.26).
In parallel with well-established results on channel capacity,
it is relatively straightforward to generalize the results in this
paper so as to incorporate input constraints, i.e., cases where
the input distributions can be chosen only within a class that
satisfies a specified constraint on the expectation of a certain
cost functional.
Presently, exact results on the resolvability of individual
input processes can be attained only within restricted contexts,
such as that of full-rank discrete memoryless channels [8].
In those cases, the resolvability of individual inputs is given
by the sup-information rate; this provides one of those rare
instances where an operational characterization of the mutual
information rate (for information stable input/output pairs) is
known. Whereas our proof of the achievability part of the
resolvability theorem holds in complete generality, the main
weakness of our present proof of the converse part is its strong
reliance on the finiteness of the input alphabet. So far, we
have not mentioned how to relax such a sufficient condition.
However, it is indeed possible to remove such a restriction for
a class of channels. In a forthcoming paper, the counterpart of
Theorem 5 will be shown for infinite-input channels under a
mild smoothness condition, which is satisfied, for example, by
additive Gaussian noise channels with power constraints.
APPENDIX
In this appendix, we address a technical issue dealing with
the information stability of input-output pairs.
Lemma A1: Let $G > \log |A|$. Then,
$$E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > G\Bigr\}\Bigr] \to 0.$$
Proof: The main idea is that an input-output pair can attain a large information density only if the input has low probability. Since for all $(x^n, y^n) \in A^n \times B^n$
$$i_{X^nW^n}(x^n, y^n) = \log\frac{P_{X^n \mid Y^n}(x^n \mid y^n)}{P_{X^n}(x^n)} \le \log\frac{1}{P_{X^n}(x^n)}, \qquad (A.1)$$
we can upper bound the expectation in the statement of the lemma by
$$E\Bigl[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)}\, 1\{X^n \in D_G\}\Bigr], \qquad (A.2)$$
where
$$D_G = \{x^n \in A^n : P_{X^n}(x^n) \le \exp(-nG)\}. \qquad (A.3)$$
The right side of (A.2) can be decomposed as
$$E\Bigl[\frac{1}{n}\, 1\{X^n \in D_G\}\, \log\frac{P_{X^n}(D_G)}{P_{X^n}(X^n)}\Bigr] + \frac{1}{n}\, P_{X^n}(D_G)\, \log\frac{1}{P_{X^n}(D_G)} \le P_{X^n}(D_G)\, \frac{1}{n}\Bigl[\log |D_G| - \log P_{X^n}(D_G)\Bigr], \qquad (A.4)$$
because entropy is maximized by the uniform distribution. Now, bounding $|D_G| \le |A|^n$ and
$$P_{X^n}(D_G) \le |D_G| \exp(-nG) \le \exp(-n\delta),$$
where $\delta = G - \log|A| > 0$, the result follows. One consequence of Lemma A1 is Lemma 1.
Proof of Lemma 1: First, we lower bound mutual information for any $\gamma > 0$ as
$$\frac{1}{n} I(X^n; Y^n) \ge E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \le \underline{I}(X; Y) - \gamma\Bigr\}\Bigr] + (\underline{I}(X; Y) - \gamma)\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > \underline{I}(X; Y) - \gamma\Bigr]. \qquad (A.5)$$
It is well known [14] that the first term in the right side of (A.5) vanishes and the probability in (A.5) goes to 1 by definition of $\underline{I}(X; Y)$. Thus, $(1/n) I(X^n; Y^n) \ge \underline{I}(X; Y) - 2\gamma$ for all sufficiently large $n$.
Conversely, we can upper bound mutual information as
$$\frac{1}{n} I(X^n; Y^n) \le E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \ge G\Bigr\}\Bigr] + G\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > \overline{I}(X; Y) + \gamma\Bigr] + (\overline{I}(X; Y) + \gamma)\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \le \overline{I}(X; Y) + \gamma\Bigr]. \qquad (A.6)$$
If G is chosen to satisfy the condition in Lemma A1, then (A.6) results in
$$\frac{1}{n} I(X^n; Y^n) \le \overline{I}(X; Y) + 2\gamma,$$
for all sufficiently large $n$. □
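The two bounds just established can be recorded compactly as follows (a restatement for the reader's convenience, not an additional claim):
$$\underline{I}(X; Y) \le \liminf_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \le \limsup_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \le \overline{I}(X; Y).$$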
ACKNOWLEDGMENT
Discussions with V. Anantharam, A. Barron, M. Burnashev, A. Orlitsky, S. Shamai, A. Wyner, and J. Ziv are
acknowledged. References [9], [13], [21] were brought to the
authors' attention by I. Csiszár, D. Neuhoff, and A. Wyner,
respectively. Fig. 3 was generated by R. Cheng.
REFERENCES

[1] R. Ahlswede and G. Dueck, "Identification via channels," IEEE Trans. Inform. Theory, vol. 35, pp. 15-29, Jan. 1989.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[4] A. Feinstein, "A new basic theorem of information theory," IRE Trans. PGIT, vol. 4, pp. 2-22, 1954.
[5] —, "On the coding theorem and its converse for finite-memory channels," Inform. Contr., vol. 2, pp. 25-44, 1959.
[6] R. M. Gray and L. D. Davisson, "The ergodic decomposition of stationary discrete random processes," IEEE Trans. Inform. Theory, vol. IT-20, pp. 625-636, Sept. 1974.
[7] T. S. Han and S. Verdú, "New results in the theory and applications of identification via channels," IEEE Trans. Inform. Theory, vol. 38, pp. 14-25, Jan. 1992.
[8] —, "Spectrum invariancy under output approximation for full-rank discrete memoryless channels," Probl. Peredach. Inform. (in Russian), no. 2, 1993.
[9] G. D. Hu, "On Shannon theorem and its converse for sequences of communication schemes in the case of abstract random variables," Trans. Third Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, Czechoslovak Academy of Sciences, Prague, 1964, pp. 285-333.
[10] J. C. Kieffer, "Finite-state adaptive block-to-variable length noiseless coding of a nonstationary information source," IEEE Trans. Inform. Theory, vol. 35, pp. 1259-1263, 1989.
[11] D. E. Knuth and A. C. Yao, "The complexity of random number generation," in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976.
[12] R. W. Lucky, Silicon Dreams: Information, Man and Machine. New York: St. Martin's Press, 1989.
[13] D. L. Neuhoff and P. C. Shields, "Channel entropy and primitive approximation," Ann. Probab., vol. 10, pp. 188-198, Feb. 1982.
[14] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[15] J. M. Stoyanov, Counterexamples in Probability. New York: Wiley, 1987.
[16] S. Verdú, "Multiple-access channels with memory with and without frame-synchronism," IEEE Trans. Inform. Theory, vol. 35, pp. 605-619, May 1989.
[17] S. Verdú and T. S. Han, "A new converse leading to a general channel capacity formula," to be presented at the IEEE Inform. Theory Workshop, Susono-shi, Japan, June 1993.
[18] J. Wolfowitz, "A note on the strong converse of the coding theorem for the general discrete finite-memory channel," Inform. Contr., vol. 3, pp. 89-93, 1960.
[19] —, "On channels without capacity," Inform. Contr., vol. 6, pp. 49-54, 1963.
[20] —, "Notes on a general strong converse," Inform. Contr., vol. 12, pp. 1-4, 1968.
[21] A. D. Wyner, "The common information of two dependent random variables," IEEE Trans. Inform. Theory, vol. IT-21, pp. 163-179, Mar. 1975.