Download "Approximation Theory of Output Statistics,"

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Ars Conjectandi wikipedia , lookup

Inductive probability wikipedia , lookup

Infinite monkey theorem wikipedia , lookup

Randomness wikipedia , lookup

Probability box wikipedia , lookup

Law of large numbers wikipedia , lookup

Karhunen–Loève theorem wikipedia , lookup

Central limit theorem wikipedia , lookup

Transcript
Approximation Theory of Output Statistics
Te Sun Han, Fellow, IEEE, and Sergio Verdú, Fellow, IEEE
Abstract-Given a channel and an input process, the minimum randomness of those input processes whose output statistics approximate the original output statistics with arbitrary accuracy is studied. The notion of resolvability of a channel, defined as the number of random bits required per channel use in order to generate an input that achieves arbitrarily accurate approximation of the output statistics for any given input process, is introduced. A general formula for resolvability that holds regardless of the channel memory structure is obtained. It is shown that, for most channels, resolvability is equal to Shannon capacity. By-products of the analysis are a general formula for the minimum achievable (fixed-length) source coding rate of any finite-alphabet source, and a strong converse of the identification coding theorem, which holds for any channel that satisfies the strong converse of the channel coding theorem.

Index Terms-Shannon theory, channel output statistics, resolvability, random number generation complexity, channel capacity, noiseless source coding theorem, identification via channels.
I. INTRODUCTION

TO MOTIVATE the problem studied in this paper, let us consider the computer simulation of stochastic systems. Usually, the objective of the simulation is to compute a set of statistics of the response of the system to a given "real-world" input random process. To accomplish this, a sample
path of the input random process is generated and empirical
estimates of the desired output statistics are computed from
the output sample path. A random number generator is used
to generate the input sample path and an important question
is how many random bits are required per input sample. The
answer would depend only on the given “real-world” input
statistics if the objective were to reproduce those statistics
exactly (in which case an infinite number of bits per sample
would be required if the input distribution is continuous, for
example). However, the real objective is to approximate the
output statistics. Therefore, the required number of random
bits depends not only on the input statistics but on the degree
of approximation required for the output statistics, and on
the system itself. In this paper, we are interested in the
approximation of output statistics (via an alternative input
process) with arbitrary accuracy, in the sense that the distance
between the finite-dimensional statistics of the true output
process and the approximated output process is required to
vanish asymptotically. This leads to the introduction of a new
Manuscript received February 7, 1992; revised September 18, 1992. This work was supported in part by the U.S. Office of Naval Research under Grant N00014-90-J-1734 and in part by the NEC Corp. under its grant program.
T. S. Han is with the Graduate School of Information Systems, University of Electro-Communications, Tokyo 182, Japan.
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
IEEE Log Number 9206960.
concept in the Shannon theory: the resolvability of a system (channel), defined as the number of random bits per input
sample required to achieve arbitrarily accurate approximation
of the output statistics regardless of the actual input process.
Intuitively, we can anticipate that the resolvability of a system
will depend on how “noisy” it is. A coarse approximation of
the input statistics whose generation requires comparatively
few bits will be good enough when the system is very
noisy, because, then, the output cannot reflect any fine detail
contained in the input distribution.
Although the problem of approximation of output statistics
involves no codes of any sort or the transmission/reproduction
of information, its analysis and results turn out to be Shannon
theoretic in nature. In fact, our main conclusion is that (for
most systems) resolvability is equal to Shannon capacity.
In order to make the notion of resolvability precise, we
need to specify the “distance” measure between true and
approximated output statistics and the “complexity” measure
of random number generation. Our main, but not exclusive,
focus is on the $\ell_1$-distance (or variational distance) and on
the worst-case measure of randomness, respectively. This
complexity measure of a random variable is equal to the
number of random bits required to generate every possible
realization of the random variable; we refer to it as the
resolution of the random variable and we show how to obtain
it from the probability distribution. The alternative, average
randomness measure is known to equal the entropy plus at
most two bits [11], and it leads to the associated notion of
mean-resolvability.
Section II introduces the main definitions. The class of
channels we consider is very general. To keep the development
as simple as possible we restrict attention to channels with
finite input/output alphabets. However, most of the proofs do
not rely on that assumption, and it is specifically pointed out
when this is not the case. In addition to allowing channels
with arbitrary memory structure, we deal with completely
general input processes, in particular, neither ergodicity nor
stationarity assumptions are imposed.
Further motivation for the notions of resolvability and
mean-resolvability is given in Section III. Section IV gives
a general formula for the resolvability of a channel. The
achievability part of the resolvability theorem (which gives an
upper bound to resolvability) holds for any channel, regardless
of its memory structure or the finiteness of the input/output
alphabets. The finiteness of the input set is the only substantive
restriction under which the converse part (which lower bounds
resolvability) is shown in Section IV via Lemma 6.
The approximation of output statistics has intrinsic connections with the following three major problems in the Shannon theory: (noiseless) source coding, channel coding, and identification via channels [1]. As a by-product of our resolvability
results, we find in Section III a very general formula for the
minimum achievable fixed-length source coding rate that holds
for any finite-alphabet source thereby dispensing with the
classical assumptions of ergodicity and stationarity. In Section
V, we show that as long as the channel satisfies the strong converse to the channel coding theorem, the resolvability formula
found in Section IV is equal to the Shannon capacity. As a simple consequence of the achievability part of the resolvability
theorem, we show in Section VI a general strong converse
to the identification coding theorem, which was known to
hold only for discrete memoryless channels [7]. This result
implies that the identification capacity is guaranteed to equal
the Shannon capacity for any finite-input channel that satisfies
the strong converse to the Shannon channel coding theorem.
The more appropriate kind (average or worst-case) of complexity measure will depend on the specific application. For
example, in single sample-path simulations, the worst-case
measure may be preferable. At any rate, the limited study
in Section VII indicates that in every case we consider, the
mean-resolvability is also equal to the Shannon capacity of
the system.
Similarly, the results presented in Section VIII evidence that
the main conclusions on resolvability (established in previous
sections) also hold when the variational-distance approximation criterion is replaced by the normalized divergence.
Section VIII concludes with the proof of a folk theorem
which fits naturally within the approximation theory of output
statistics: the output distribution due to any good channel code
must approximate the output distribution due to the capacity-achieving input.
Although the problem treated in this paper is new, it is
interesting to note two previous information-theoretic contributions related to the notion of quantifying the minimum
complexity of a randomness source required to approximate
some given distribution. In one of the approaches to measure
the common randomness between two dependent random
variables proposed in [21], the randomness source is the input
to two independent memoryless random transformations, the
outputs of which are required to have a joint distribution which
approximates (in normalized divergence) the nth product of the given joint distribution. The class of channels whose transition probabilities can be approximated (in $\bar{d}$-distance) by sliding-block transformations of the input and an independent noise source is studied in [13], and the minimum entropy rate of the
independent noise source required for accurate approximation
is shown to be the maximum conditional output entropy over
all stationary inputs.
II. PRELIMINARIES

This section introduces the basic notation and fundamental concepts as well as several properties to be used in the sequel.

Definition 1: A channel $W$ with input and output alphabets $A$ and $B$, respectively, is a sequence of conditional distributions
$$W = \{W^n(y^n \mid x^n) = P_{Y^n \mid X^n}(y^n \mid x^n);\ (x^n, y^n) \in A^n \times B^n\}_{n=1}^{\infty}.$$

In order to describe the statistics of input/output processes, we will use the sequence of finite-dimensional distributions$^1$ $\{X^n = (X_1^{(n)}, \ldots, X_n^{(n)})\}_{n=1}^{\infty}$, which is abbreviated as $X$. The following notation will be used for the output distribution when the input is distributed according to $Q^n$:
$$Q^n W^n(y^n) = \sum_{x^n \in A^n} W^n(y^n \mid x^n)\, Q^n(x^n).$$

Definition 2 [14]: Given a joint distribution $P_{X^n Y^n}(x^n, y^n) = P_{X^n}(x^n) W^n(y^n \mid x^n)$, the information density is the function defined on $A^n \times B^n$:
$$i_{X^n W^n}(a^n, b^n) = \log \frac{W^n(b^n \mid a^n)}{P_{Y^n}(b^n)}.$$
The distribution of the random variable $(1/n)\, i_{X^n W^n}(X^n, Y^n)$, where $X^n$ and $Y^n$ have joint distribution $P_{X^n Y^n}$, will be referred to as the information spectrum. The expected value of the information spectrum is the normalized mutual information $(1/n) I(X^n; Y^n)$.

Definition 3: The limsup in probability of a sequence of random variables $\{A_n\}$ is defined as the smallest extended real number $\beta$ such that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P[A_n > \beta + \epsilon] = 0.$$
Analogously, the liminf in probability is the largest extended real number $\alpha$ such that for all $\epsilon > 0$, $\lim_{n \to \infty} P[A_n \le \alpha - \epsilon] = 0$. Note that a sequence of random variables converges in probability to a constant if and only if its limsup in probability is equal to its liminf in probability. The limsup in probability [resp., liminf in probability] of the sequence of random variables $\{(1/n)\, i_{X^n W^n}(X^n, Y^n)\}_{n=1}^{\infty}$ will be referred to as the sup-information rate [resp., inf-information rate] of the pair $(X, Y)$ and will be denoted by $\bar{I}(X; Y)$ [resp., $\underline{I}(X; Y)$]. The mutual information rate of $(X, Y)$, if it exists, is the limit
$$I(X; Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n).$$

Although convergence in probability does not necessarily imply convergence of the means (e.g., [15, p. 135]), in most cases of information-theoretic interest that implication does indeed hold in the context of information rates.

Lemma 1: For any channel with finite input alphabet, if $\bar{I}(X; Y) = \underline{I}(X; Y)$ (i.e., the information spectrum converges in probability to a constant), then
$$\bar{I}(X; Y) = \underline{I}(X; Y) = I(X; Y),$$
and the input-output pair $(X, Y)$ is called information stable.

Proof: See the Appendix for a proof that hinges on the finiteness of the input alphabet. $\square$

$^1$ No consistency restrictions between the channel conditional probabilities, or between the finite-dimensional distributions of the input/output processes, are imposed. Thus, "processes" refer to sequences of finite-dimensional distributions, rather than distributions on spaces of infinite-dimensional sequences.
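To make the information spectrum concrete, the following minimal Python sketch (an assumed toy example, not part of the paper) samples the normalized information density for $n$ uses of a binary symmetric channel with i.i.d. equiprobable inputs; for this information-stable pair the spectrum concentrates around the mutual information rate, in line with Lemma 1.

```python
import numpy as np

# Toy illustration (assumed setup): information spectrum of a BSC(p) with
# i.i.d. equiprobable inputs.  For a memoryless channel the information
# density is a sum of per-letter densities.
rng = np.random.default_rng(0)
p, n, samples = 0.1, 200, 5000

def information_density(x, y, p):
    """i(x;y) = log2 W(y|x) - log2 P_Y(y) for one BSC use with uniform input."""
    w = np.where(x == y, 1 - p, p)      # W(y|x)
    return np.log2(w) - np.log2(0.5)    # output is equiprobable

spectrum = []
for _ in range(samples):
    x = rng.integers(0, 2, size=n)
    y = x ^ (rng.random(n) < p)         # pass x through the BSC
    spectrum.append(information_density(x, y, p).sum() / n)

spectrum = np.array(spectrum)
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print("mean of spectrum:", spectrum.mean(), " 1 - h(p):", 1 - h)
```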
Definition 4 [7]: For any positive integer $M$,$^2$ a probability distribution $P$ is said to be $M$-type if
$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, 1\right\} \qquad \text{for all } \omega \in \Omega.$$
The number of different $M$-type distributions on $\Omega$ is upper bounded by $|\Omega|^M$.

Definition 5: The resolution $R(P)$ of a probability distribution $P$ is the minimum $\log M$ such that $P$ is $M$-type. (If $P$ is not $M$-type for any integer $M$, then $R(P) = +\infty$.)

Resolution is a new measure of randomness which is related to conventional measures via the following immediate inequality.

Lemma 2: Let $H(P)$ denote the entropy of $P$ and let $Z(P)$ denote the Rényi entropy of order 0, i.e., the logarithm of the number of points with positive $P$-mass. Then,
$$H(P) \le Z(P) \le R(P), \tag{2.1}$$
with equality if and only if $P$ is equiprobable.

The information spectrum is upper bounded almost surely by the input (or output) resolution:

Lemma 3:
$$P\left[i_{X^n W^n}(X^n, Y^n) \le R(X^n)\right] = 1.$$

Proof: For every $(x^n, y^n) \in A^n \times B^n$ such that$^3$ $P_{X^n}(x^n) > 0$, we have
$$i_{X^n W^n}(x^n, y^n) \le \log \frac{1}{P_{X^n}(x^n)} \tag{2.2}$$
and
$$P_{X^n}(x^n) = m(x^n) \exp(-R(X^n)), \tag{2.3}$$
where $m(x^n)$ is an integer greater than or equal to 1. Thus, the result follows by combining (2.2) and (2.3). $\square$

Definition 6 (e.g., [3]): The variational distance or $\ell_1$-distance between two distributions $P$ and $Q$ defined on the same measurable space $(\Omega, \mathcal{F})$ is
$$d(P, Q) = \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)| = 2 \sup_{E} |P(E) - Q(E)|.$$

Definition 7: Let $\epsilon \ge 0$. $R$ is an $\epsilon$-achievable resolution rate for channel $W$ if for every input process $X$ and for all $\gamma > 0$, there exists $\tilde{X}$ whose resolution satisfies
$$\frac{1}{n} R(\tilde{X}^n) < R + \gamma \tag{2.4}$$
and
$$d(Y^n, \tilde{Y}^n) < \epsilon \tag{2.5}$$
for all sufficiently large $n$, where $Y$ and $\tilde{Y}$ are the output statistics due to input processes $X$ and $\tilde{X}$, respectively, i.e.,
$$P_{Y^n} = P_{X^n} W^n, \qquad P_{\tilde{Y}^n} = P_{\tilde{X}^n} W^n. \tag{2.6}$$

If $R$ is an $\epsilon$-achievable resolution rate for every $\epsilon > 0$, then we say that $R$ is an achievable resolution rate. By definition, the set of ($\epsilon$-) achievable resolution rates is either empty or a closed interval. The minimum $\epsilon$-achievable resolution rate (resp., achievable resolution rate) is called the $\epsilon$-resolvability (resp., resolvability) of the channel, and it is denoted by $S_\epsilon$ (resp., $S$). Note that $S_\epsilon$ is monotonically nonincreasing in $\epsilon$ and
$$S = \sup_{\epsilon > 0} S_\epsilon.$$

The definitions of achievable resolution rates can be modified so that the defining property applies to a particular input $X$ instead of every input process. In such case, we refer to the corresponding quantities as ($\epsilon$-) achievable resolution rate for $X$ and ($\epsilon$-) resolvability for $X$, for which we use the notation $S_\epsilon(X)$ and $S(X)$. It follows from Definition 7 that
$$S = \sup_{X} S(X).$$

The main focus of this paper is on the resolvability of systems as defined in Definition 7. In addition, we shall investigate another kind of resolvability results by considering a different randomness measure. Specifically, if in Definition 7, (2.4) is replaced by
$$\frac{1}{n} H(\tilde{X}^n) < R + \gamma, \tag{2.7}$$
then achievable resolution rates become achievable entropy rates and resolvability becomes mean-resolvability. It follows from Lemma 2 that for all $\epsilon > 0$ and $X$,
$$\bar{S}_\epsilon(X) \le S_\epsilon(X), \tag{2.8}$$
where $\bar{S}_\epsilon$ and $\bar{S}$ denote ($\epsilon$-) mean-resolvability in parallel with the above definitions of $S_\epsilon$ and $S$. It is obvious that $\bar{S} = \sup_X \bar{S}(X)$.

The motivation for the definitions of resolvability and mean-resolvability is further developed in the following section.

$^2$ The alternative terminology type with denominator $M$ can be found in [2, ch. 12].
$^3$ Following common usage in information theory, when the distributions in Definitions 4-6 denote those of random variables, they will be replaced by the random variables themselves, e.g., $R(X)$, $H(X)$, $d(Y^n, \tilde{Y}^n)$.
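For concreteness, here is a minimal Python sketch (an assumed example, not taken from the paper) of the quantities in Definitions 5 and 6 and of the inequality of Lemma 2; it expects exact rational masses, since a distribution with an irrational mass has infinite resolution.

```python
from fractions import Fraction
from math import lcm, log2

def resolution(masses):
    """R(P) = min log M such that P is M-type, i.e., all masses are multiples of 1/M."""
    fracs = [Fraction(m) for m in masses]            # exact rationals expected
    return log2(lcm(*[f.denominator for f in fracs]))

def entropy(masses):
    return -sum(float(m) * log2(float(m)) for m in masses if m > 0)

def variational_distance(p, q):
    return sum(abs(float(a) - float(b)) for a, b in zip(p, q))

P = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]     # a 4-type distribution
print(entropy(P), "<=", resolution(P))                   # 1.5 <= 2.0 bits (Lemma 2)
print(variational_distance(P, [Fraction(1, 3)] * 3))     # l1 distance to the uniform law
```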
III. RESOLUTION, RANDOM NUMBER GENERATION AND SOURCE CODING
The purpose of this section is to motivate the definitions
of resolvability and mean-resolvability introduced in Section
II through their relationship with random number generation
and noiseless source coding. Along the way, we will show
that our resolvability theorems lead to new general results in
source coding.
A. Resolution and Random Number Generation
A prime way to quantify the “randomness” of a random
variable is via the complexity of its generation with a computer
that has access to a basic random experiment which generates
equally likely random values, such as fair coin flips, dice,
etc. By complexity, we mean the number of random bits that
the most efficient algorithm requires in order to generate the
random variable. Depending on the algorithm, the required
number of random bits may be random itself. For example,
consider the generation of the random variable with probability masses $P[X = -1] = 1/4$, $P[X = 0] = 1/2$, $P[X = 1] = 1/4$, with an algorithm such that if the outcome of a fair coin flip is Heads, then the output is 0, and if the outcome is Tails, another fair coin flip is requested in order to decide $+1$ or $-1$. On the average this algorithm requires 1.5 coin flips, and in the worst case 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average (over the range of outcomes of the random variable).
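The two forms of complexity are easy to see in simulation; the following minimal Python sketch (illustrative only, not from the paper) runs the algorithm just described and reports the average and worst-case number of fair coin flips.

```python
import random

def generate():
    """One fair flip decides 0 vs. 'keep going'; a second flip decides +1 vs. -1."""
    flips = 1
    if random.random() < 0.5:                     # Heads
        return 0, flips
    flips += 1
    return (+1 if random.random() < 0.5 else -1), flips

trials = [generate() for _ in range(100_000)]
avg_flips = sum(f for _, f in trials) / len(trials)
worst_case = max(f for _, f in trials)
print(avg_flips, worst_case)                      # about 1.5 on average, 2 in the worst case
```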
First, let us consider the worst-case complexity. A conceptual model for the generation of arbitrary random variables is
a deterministic transformation of a random variable uniformly
distributed on [0, 1]. Although such a random variable cannot
be generated by a discrete machine, this model suggests an algorithm for the generation of finitely-valued random variables
in a finite-precision computer: a deterministic transformation
of the outcome of a random number generator which outputs
M equally likely values, in lieu of the uniformly distributed
random variable. The lowest value of log M required to
generate the random variable (among all possible deterministic
transformations) is its worst-case complexity. Other algorithms
may require fewer random bits on the average, but not for
every possible outcome. It is now easy to recognize that
the worst-case complexity of a random variable is equal
to its resolution. This is because processing the output of
the M-valued random number generator with a deterministic
transformation (which is conceptually nothing more than a
table lookup) results in a discrete random variable whose
probability masses are multiples of 1/M, i.e., an M-type.
At first sight, it may seem that the use of resolution (as
opposed to entropy) in the definition of resolvability is overly
stringent. However, this is not the case because that definition
is concerned with asymptotic approximation. Analogously, in
practice, M may be constrained to be a power of 2; however,
this possible modification has no effect on the definition
of achievable resolution rates (Definition 7) because it is
only concerned with the asymptotic behavior of the ratio
of resolution to number of dimensions of the approximating
distribution.
The average complexity of random variable generation has
been studied in the work of Knuth and Yao [11], which shows that the minimum expected number of fair bits required to generate a random variable lies between its entropy and its entropy plus two bits (cf. [2, Theorem 5.12.3]). That lower
bound holds even if the basic equally likely random number
generator is allowed to be nonbinary. This result is the reason
for the choice of entropy as the average complexity measure
in the definition of mean-resolvability. Note that the two-bit
uncertainty of the Knuth-Yao theorem is inconsequential for
the purposes of our (asymptotic) definition.
B. Resolution and Source Coding
Having justified the new concepts of resolvability and meanresolvability on the basis of their significance in the complexity
of random variable generation, let us now explore their relationship with well-established concepts in the Shannon theory.
To this end, in the remainder of this section we will focus on
the special case of an identity channel ($A = B$; $W^n(y^n \mid x^n) = 1$ if $x^n = y^n$), in which our approximation theory becomes one of approximation of source statistics.
Suppose we would like to generate random sequences according to the finite-dimensional distributions of some given process X. As we have argued, the worst-case and average number of bits per dimension required are $(1/n)R(X^n)$ and $(1/n)H(X^n)$, respectively. If, however, we are content with
reproducing the source statistics within an arbitrarily small
tolerance, fewer bits may be needed, asymptotically in the
worst case. For example, consider the case of independent flips of a biased coin with tails probability equal to $1/\pi$. It is evident that $R(X^n) = \infty$ for every $n$. However, the asymptotic equipartition property (AEP) states that for any $\epsilon > 0$ and large $n$, the $\exp(nh(1/\pi) + n\epsilon)$ typical sequences exhaust most of the probability. If we let $M = \exp(nh(1/\pi) + 2n\epsilon)$, then we can quantize the probability of each of those sequences to a multiple of $1/M$, thereby achieving a quantization error in each mass of at most $1/M$. Consequently, the sum of the absolute errors on the typical sequences is exponentially small, and the masses of the atypical sequences can be approximated by zero because of the AEP, thereby yielding an arbitrarily small variational distance between the true and approximating statistics. The resolution rate of the approximating statistics is $h(1/\pi) + 2\epsilon$. Indeed, in this case $S(X) = \bar{S}(X) = h(1/\pi)$, and
this reasoning can be applied to any stationary ergodic source
to show that S(X) is equal to the entropy rate of X (always
in the context of an identity channel). The key to the above
procedure to approximate the statistics of the source with finite
resolution is the use of repetition. Had we insisted on a uniform
approximation to the original statistics we would not have
succeeded in bringing the variational distance to negligible
levels, because of the small but exponentially significant
variation in the probability masses of the typical sequences.
By allowing an approximation with a uniform distribution on
a collection of M elements with repetition, i.e., an M-type,
with large enough M, it is possible to closely track those
variations in the probability masses. A nice bonus is that for
this approximation procedure to work it is not necessary that
the masses of the typical sequences be similar, as dictated by
the AEP. This is why the connection between resolvability
and source coding is deeper than that provided by the AEP,
and transcends stationary ergodic sources. To show this, let us
first record the standard definitions of the fundamental limits
in fixed-length and variable-length source coding.
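Before recording those definitions, the quantization argument for the biased coin can be checked numerically; the following is a minimal Python sketch under assumed parameters (small blocklengths and $\epsilon = 0.1$, not taken from the paper) that quantizes the typical-set masses to multiples of $1/M$ and reports the resulting variational distance, which shrinks as $n$ grows.

```python
import numpy as np
from math import comb, log2

q = 1 / np.pi                                        # tails probability 1/pi
h = -q * log2(q) - (1 - q) * log2(1 - q)             # binary entropy h(1/pi)
eps = 0.1

for n in (16, 64, 256):
    M = 2.0 ** (n * h + 2 * n * eps)                 # resolution rate h(1/pi) + 2*eps
    dist = 0.0
    for k in range(n + 1):                           # k = number of tails; all such
        p_seq = q**k * (1 - q) ** (n - k)            # sequences share this probability
        count = comb(n, k)
        if abs(-log2(p_seq) / n - h) <= eps:         # typical sequences: quantize mass
            dist += count * abs(p_seq - round(p_seq * M) / M)
        else:                                        # atypical sequences: mass set to zero
            dist += count * p_seq
    print(n, dist)
```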
Definition 8: $R$ is an $\epsilon$-achievable source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists a collection of $M$ $n$-tuples $\{x_1^n, \ldots, x_M^n\}$ such that
$$\frac{1}{n}\log M < R + \gamma$$
and
$$P[X^n \notin \{x_1^n, \ldots, x_M^n\}] \le \epsilon.$$
$R$ is an achievable (fixed-length) source coding rate for $X$ if it is $\epsilon$-achievable for all $0 < \epsilon < 1$. $T(X)$ denotes the minimum achievable source coding rate for $X$.

Definition 9: Fix an integer $r \ge 2$. $R$ is an achievable variable-length source coding rate for $X$ if for all $\gamma > 0$ and all sufficiently large $n$, there exists an $r$-ary prefix code for $X^n$ such that the average codeword length $L_n$ satisfies
$$\frac{1}{n} L_n \log r < R + \gamma.$$
The minimum achievable variable-length source coding rate for $X$ is denoted by $\bar{T}(X)$.

As shown below, in the special case of the identity channel, resolvability and mean-resolvability reduce to the minimum achievable fixed-length and variable-length source coding rates, respectively, for any source. Although quite different from the familiar setting of combined source and channel coding (e.g., no decoding is present at the channel output), the approximation theory of output statistics could be subtitled "source coding via channels" because of the following two results.

Theorem 1: For any $X$ and the identity channel,
$$S(X) = T(X).$$

Proof:
1) $T(X) \le S(X)$. We show that if $R$ is an $\epsilon$-achievable resolution rate for $X$, then it is an $\epsilon/2$-achievable source coding rate for $X$. According to Definition 7, for every $\gamma > 0$ and all sufficiently large $n$, there exists $\tilde{X}^n$ with
$$\frac{1}{n} R(\tilde{X}^n) < R + \gamma$$
and
$$d(X^n, \tilde{X}^n) < \epsilon.$$
We can view $\tilde{X}^n$ as putting mass $1/M$ on each member of a collection of $M = \exp(R(\tilde{X}^n))$ elements of $A^n$ denoted by $D = \{x_1^n, \ldots, x_M^n\}$. (Note that the $M$ elements of this collection need not all be different.) The collection $D$ is a source code with probability of error smaller than $\epsilon/2$ because
$$\epsilon > d(X^n, \tilde{X}^n) \ge 2 P_{X^n}(D^c) - 2 P_{\tilde{X}^n}(D^c) = 2 P_{X^n}(D^c).$$
2) $S(X) \le T(X)$. We show that if $R$ is an $\epsilon$-achievable source coding rate for $X$, then it is a $3\epsilon$-achievable resolution rate for $X$. For arbitrary $\gamma > 0$ and all sufficiently large $n$, select $D = \{x_1^n, \ldots, x_M^n\}$ such that
$$\frac{1}{n}\log M < R + \gamma, \qquad P[X^n \notin \{x_1^n, \ldots, x_M^n\}] \le \epsilon.$$
Choose $M'$ such that
$$\exp(nR + 2n\gamma) \le M' \le \exp(nR + 3n\gamma)$$
and an arbitrary element $x_0^n \notin D$. We are going to construct an approximation $\tilde{X}^n$ to $X^n$ which satisfies the following conditions:
a) $\tilde{X}^n$ is an $M'$-type,
b) $P_{\tilde{X}^n}(x_0^n) \le \epsilon$,
c) $|P_{\tilde{X}^n}(x_i^n) - P_{X^n}(x_i^n)| \le 1/M'$, $i = 1, \ldots, M$,
d) $P_{\tilde{X}^n}(D) + P_{\tilde{X}^n}(x_0^n) = 1$.
It will then follow immediately that $R$ is a $3\epsilon$-achievable resolution rate, as
$$\frac{1}{n} R(\tilde{X}^n) \le R + 3\gamma$$
and
$$d(X^n, \tilde{X}^n) \le \sum_{i=1}^{M} |P_{\tilde{X}^n}(x_i^n) - P_{X^n}(x_i^n)| + P_{\tilde{X}^n}(x_0^n) + \sum_{x^n \in D^c} P_{X^n}(x^n) < 2\epsilon + \frac{M}{M'} < 2\epsilon + \exp(-n\gamma) < 3\epsilon$$
for all sufficiently large $n$. The construction of $\tilde{X}^n$ is
$$P_{\tilde{X}^n}(x_i^n) = \frac{k_i}{M'}, \quad i = 0, \ldots, M, \qquad P_{\tilde{X}^n}(x^n) = 0, \quad \text{if } x^n \notin \{x_0^n, x_1^n, \ldots, x_M^n\},$$
where the integers $k_i$ are selected as follows. If
$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil \le M',$$
then
$$k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = 1, \ldots, M, \qquad k_0 = M' - \sum_{i=1}^{M} k_i,$$
and properties a)-d) are readily seen to be satisfied. On the other hand, consider the case where
$$\sum_{i=1}^{M} \lceil M' P_{X^n}(x_i^n) \rceil = M' + L$$
with $1 \le L \le M$. Since it may be assumed, without loss of generality, that $P_{X^n}(x_i^n) > 0$ for all $i = 1, \ldots, M$, we may set
$$k_0 = 0, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil - 1 \ge 0, \quad i = 1, \ldots, L, \qquad k_i = \lceil M' P_{X^n}(x_i^n) \rceil, \quad i = L+1, \ldots, M,$$
which again guarantees that a)-d) are satisfied. $\square$
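The rounding step in part 2 of the proof is easy to mechanize; the following minimal Python sketch (with hypothetical code-word probabilities, not from the paper) builds the integers $k_i$ of the construction above.

```python
from math import ceil

def m_prime_type(masses, M_prime):
    """masses: P_{X^n}(x_i^n) for the code words; returns (k_0, [k_1,...,k_M])."""
    k = [ceil(M_prime * p) for p in masses]
    excess = sum(k) - M_prime
    if excess <= 0:
        return -excess, k                      # k_0 = M' - sum k_i
    for i in range(excess):                    # subtract 1 from the first L code words
        k[i] -= 1
    return 0, k

masses = [0.30, 0.25, 0.20, 0.10]              # hypothetical code-word probabilities (sum < 1)
k0, k = m_prime_type(masses, M_prime=64)
approx = [ki / 64 for ki in k]
print(k0 / 64, approx, sum(approx) + k0 / 64)  # an M'-type distribution summing to 1
```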
Theorem 2: For any $X$ and the identity channel,
$$\bar{S}(X) = \bar{T}(X) = \limsup_{n \to \infty} \frac{1}{n} H(X^n).$$

Proof:
1) $\bar{S}(X) \le \bar{T}(X)$. Suppose that $R$ is an achievable variable-length source coding rate. Then, Definition 9 states that there exists for all $\gamma > 0$ and all sufficiently large $n$ a prefix code whose average length $L_n$ satisfies
$$\frac{1}{n} L_n \log r < R + \gamma. \tag{3.1}$$
Moreover, the fundamental source coding lower bound for an $r$-ary code (e.g., [2, Theorem 5.3.1]) is
$$H(X^n) \le L_n \log r. \tag{3.2}$$
Now, let $\tilde{X} = X$. Then, $d(X^n, \tilde{X}^n) = 0$, and (3.1)-(3.2) imply that
$$\frac{1}{n} H(\tilde{X}^n) < R + \gamma, \tag{3.3}$$
concluding that $R$ is an achievable mean-resolution rate for $X$.
2) $\bar{T}(X) \le \bar{S}(X)$. Let $R$ be an achievable mean-resolution rate for $X$. For arbitrary $\gamma > 0$, $0 < \epsilon < 1/2$ and all sufficiently large $n$, choose $\tilde{X}^n$ such that (3.3) is satisfied and $d(X^n, \tilde{X}^n) < \epsilon$. On the other hand, there exists an $r$-ary prefix code for $X^n$ with average length bounded by (e.g., [2, Theorem 5.4.1])
$$L_n \log r < H(X^n) + \log r. \tag{3.4}$$
We want to show that if the above $\epsilon$ is chosen sufficiently small then the code satisfies
$$\frac{1}{n} L_n \log r \le R + 2\gamma$$
for all sufficiently large $n$, thereby proving that $R$ is an achievable variable-length source coding rate for $X$. To that end, all that is required is the continuity of entropy in variational distance:

Lemma 4 [3, p. 33]: If $P$ and $Q$ are distributions defined on $\Omega$ such that $d(P, Q) \le \theta < 1/2$, then
$$|H(P) - H(Q)| \le \theta \log(|\Omega|/\theta).$$

Using Lemma 4 and (3.3), we obtain
$$\frac{1}{n} L_n \log r \le \frac{1}{n} H(X^n) + \frac{1}{n}\log r \le \frac{1}{n} H(\tilde{X}^n) + \epsilon\log(|A|/\epsilon) + \frac{1}{n}\log r \le R + 2\gamma$$
for sufficiently large $n$ if $\epsilon\log(|A|/\epsilon) < \gamma$.
3) $\bar{T}(X) = \limsup_{n \to \infty} (1/n) H(X^n)$. This follows immediately from the bounds in (3.2) and (3.4). See also [10]. $\square$

Theorems 1 and 2 show a pleasing parallelism between the resolvability of a process (with the identity channel) and its minimum achievable source coding rate. Theorem 1 and the Shannon-McMillan theorem lead to the solution of $S(X)$ as the entropy rate of $X$ in the special case of stationary ergodic $X$. Interestingly, the results of this paper allow us to find the resolvability of any process with the identity channel, and thus a completely general formula for the minimum achievable source-coding rate for any source.

Theorem 3: For any $X$ and the identity channel,
$$S(X) = T(X) = \bar{H}(X),$$
where $\bar{H}(X)$ is the sup-entropy rate defined as $\bar{I}(X; Y)$ for the identity channel (cf. Definition 3), i.e., the smallest real number $\beta$ such that for all $\epsilon > 0$,
$$\lim_{n \to \infty} P\left[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > \beta + \epsilon\right] = 0.$$

Proof:
1) $\bar{H}(X) \le S(X)$. We will argue by contradiction: choose an achievable resolution rate $R$ for $X$ such that for some $\delta > 0$,
$$R + \delta < \bar{H}(X). \tag{3.5}$$
By the definition of $\bar{H}(X)$, there exists $\alpha > 0$ such that
$$P_{X^n}(D_0) \ge \alpha \tag{3.6}$$
infinitely often, with $D_0$ defined as the set of least likely source words:
$$D_0 = \left\{x^n \in A^n: \frac{1}{n}\log\frac{1}{P_{X^n}(x^n)} > R + \delta\right\}.$$
Select $0 < \epsilon < \alpha^2$, and $\tilde{X}^n$ for all sufficiently large $n$ to satisfy
$$\frac{1}{n} R(\tilde{X}^n) < R + \frac{\delta}{2}$$
and
$$d(X^n, \tilde{X}^n) < \epsilon.$$
Define
$$D_1 = \left\{x^n \in A^n: P_{X^n}(x^n) > 0 \ \text{ and } \ \left|1 - \frac{P_{\tilde{X}^n}(x^n)}{P_{X^n}(x^n)}\right| \le \epsilon^{1/2}\right\}$$
and consider
$$P_{X^n}(D_1 \cap D_0) \ge P_{X^n}(D_0) - P_{X^n}(D_1^c) \ge \alpha - \epsilon^{1/2} > 0, \tag{3.7}$$
which holds infinitely often because of (3.6) and
$$\epsilon^{1/2} P_{X^n}(D_1^c) \le \sum_{x^n \in D_1^c} P_{X^n}(x^n)\left|1 - \frac{P_{\tilde{X}^n}(x^n)}{P_{X^n}(x^n)}\right| \le d(X^n, \tilde{X}^n) \le \epsilon.$$
For those $n$ such that (3.7) holds, we can find $x_0^n \in D_1 \cap D_0$ whose $P_{\tilde{X}^n}$-mass satisfies the following lower and upper bounds:
$$P_{\tilde{X}^n}(x_0^n) \ge (1 - \epsilon^{1/2}) P_{X^n}(x_0^n) > 0$$
and
$$\frac{1}{n}\log\frac{1}{P_{\tilde{X}^n}(x_0^n)} \ge \frac{1}{n}\log\frac{1}{P_{X^n}(x_0^n)} + \frac{1}{n}\log\frac{1}{1 + \epsilon^{1/2}} > R + \frac{\delta}{2},$$
if $n$ is sufficiently large. Therefore, we have found (an infinite number of) $n$ such that
$$P_{\tilde{X}^n}\left[\frac{1}{n}\log\frac{1}{P_{\tilde{X}^n}(\tilde{X}^n)} > \frac{1}{n} R(\tilde{X}^n)\right] \ge P_{\tilde{X}^n}(x_0^n) > 0,$$
contradicting Lemma 3.
2) $S(X) \le \bar{H}(X)$ is a special case (identity channel) of the general direct resolvability result (Theorem 4 in Section IV). $\square$

Remark 1: We may consider a modified version of Definition 9 as follows. Let us say that an $r$-ary variable-length code $\{\phi(x^n)\}_{x^n \in A^n}$ for $X^n$ is an $\epsilon$-prefix code for $X^n$ ($0 < \epsilon < 1$) if there exists a subset $D$ of $A^n$ such that $P_{X^n}(D) \ge 1 - \epsilon$ and $\{\phi(x^n)\}_{x^n \in D}$ is a prefix code. It is easy to check that Theorem 2 continues to hold if "all $\gamma > 0$" and "$r$-ary prefix code" are replaced by "all $\gamma > 0$ and $0 < \epsilon < 1$" and "$r$-ary $\epsilon$-prefix code," respectively, in Definition 9.

A general formula for the minimum achievable rate for noiseless source coding without stationarity and ergodicity assumptions has been a longstanding goal. It had been achieved [10] in the setting of variable-length coding (see Theorem 2). In fixed-length coding, progress towards that goal had been achieved mainly in the context of stationary sources (via the ergodic decomposition theorem, e.g., [6]). A general result that holds for nonstationary/nonergodic sources is stated in$^4$ [9] without introducing the notions of $T(X)$ and $\bar{H}(X)$. The results established in this section from the standpoint of distribution approximation attain general formulas for both fixed-length and variable-length source coding without recourse to stationarity or ergodicity assumptions. It should be noted that an independent proof of $T(X) = \bar{H}(X)$ can be obtained by generalizing the proof of the source coding theorem in [3, Theorem 1.1].

$^4$ [9] refers to a Nankai University thesis by T. S. Yang for a proof.

IV. RESOLVABILITY THEOREMS

A general formula for the resolvability of any channel in terms of its statistical description is obtained in this section. This result will be shown by means of an achievability (or direct) theorem which provides an upper bound to resolvability along with a converse theorem which gives a lower bound to resolvability.

A. Direct Resolvability Theorem

Theorem 4: Every channel $W$ and input process $X$ satisfy
$$S_\epsilon(X) \le \bar{I}(X; Y)$$
for any $\epsilon > 0$, where $Y$ is the output of $W$ due to $X$.

Proof: Fix an arbitrary $\gamma > 0$. According to Definition 7, we have to show the existence of a process $\tilde{X}$ such that
$$\lim_{n \to \infty} d(Y^n, \tilde{Y}^n) = 0$$
and $\tilde{X}^n$ is an $M$-type distribution with
$$M = \exp(n\bar{I}(X; Y) + n\gamma),$$
and $Y^n$, $\tilde{Y}^n$ are the output distributions due to $X^n$ and $\tilde{X}^n$, respectively.

We will construct the approximating input statistics by the Shannon random selection approach. For any collection of (not necessarily distinct) $M$ elements of $A^n$, the distribution constructed by placing $1/M$ mass on each of the members of the collection is an $M$-type distribution. If each member of the collection is generated randomly and independently with distribution $X^n$, we will show that the variational distance between $Y^n$ and the approximated output averaged over the selection of the $M$-collection vanishes, and hence there must exist a sequence of realizations for which the variational distance also vanishes.

For any $\{c_j \in A^n,\ j = 1, \ldots, M\}$ denote the output distribution
$$\tilde{Y}^n[c_1, \ldots, c_M](y^n) = \frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j). \tag{4.1}$$
The objective is to show that
$$\lim_{n \to \infty} E\, d(Y^n, \tilde{Y}^n[X_1^n, \ldots, X_M^n]) = 0,$$
where the expectation is with respect to i.i.d. $(X_1^n, \ldots, X_M^n)$ with common distribution $X^n$. Instead of using the standard Csiszar-Kullback-Pinsker bound in terms of divergence [3], in order to upper bound $d(Y^n, \tilde{Y}^n[X_1^n, \ldots, X_M^n])$ we will use the following new bound in terms of the distribution of the log-likelihood ratio.

Lemma 5: For every $\mu > 0$,
$$d(P, Q) \le \frac{2\mu}{\log e} + 2 P\left[\log\frac{P(X)}{Q(X)} > \mu\right],$$
where $X$ is distributed according to $P$.
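Before the proof, here is a quick numerical sanity check of the bound as stated above (assumed toy distributions, not from the paper); logarithms, and hence $\mu$, are taken in bits.

```python
import numpy as np

# Check d(P,Q) <= 2*mu/log2(e) + 2*P[log2(P(X)/Q(X)) > mu] on random examples.
rng = np.random.default_rng(0)
for _ in range(5):
    P = rng.random(8); P /= P.sum()
    Q = rng.random(8); Q /= Q.sum()
    d = np.abs(P - Q).sum()                       # variational distance
    for mu in (0.1, 0.5, 1.0):
        tail = P[np.log2(P / Q) > mu].sum()       # P[log P(X)/Q(X) > mu], X ~ P
        bound = 2 * mu / np.log2(np.e) + 2 * tail
        assert d <= bound + 1e-12
print("Lemma 5 bound verified on random examples")
```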
Proof of Lemma 5: We can write the variational distance as
$$d(P, Q) = \sum_{x \in \Omega:\ \log\frac{P(x)}{Q(x)} > 0} [P(x) - Q(x)] + \sum_{x \in \Omega:\ \log\frac{P(x)}{Q(x)} \le 0} [Q(x) - P(x)] = 2 d_1 = 2 d_2,$$
where
$$d_1 = \sum_{x:\ 0 < \log\frac{P(x)}{Q(x)} \le \mu} [P(x) - Q(x)] + \sum_{x:\ \log\frac{P(x)}{Q(x)} > \mu} [P(x) - Q(x)] \le \frac{\mu}{\log e} + P\left[\log\frac{P(X)}{Q(X)} > \mu\right]. \qquad \square$$

Proof of Theorem 4 (cont.): According to Lemma 5, it suffices to show that the following expression goes to 0 as $n \to \infty$, for every $\mu > 0$:
$$\sum_{c_1 \in A^n}\cdots\sum_{c_M \in A^n} P_{X^n}(c_1)\cdots P_{X^n}(c_M)\; P_{\tilde{Y}^n[c_1, \ldots, c_M]}\left[\log\frac{P_{\tilde{Y}^n[c_1, \ldots, c_M]}(Y^n)}{P_{Y^n}(Y^n)} > \mu\right]$$
$$\le P\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n)) > \tau\right] + P\left[\frac{1}{M}\sum_{j=2}^{M}\exp(i_{X^n W^n}(X_j^n, Y^n)) > 1 + \tau\right], \tag{4.2}$$
where $\tau = (\exp\mu - 1)/2 > 0$, $X^n$ and $Y^n$ are connected through $W^n$, and $\{Y^n, X_2^n, \ldots, X_M^n\}$ are independent.

The first probability in the right-hand side of (4.2) is
$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \gamma + \frac{1}{n}\log\tau\right],$$
which goes to 0 by definition of sup-information rate. The second probability in (4.2) is upper bounded by
$$P\left[\frac{1}{M}\sum_{j=1}^{M}\exp(i_{X^n W^n}(X_j^n, Y^n)) > 1 + \tau\right], \tag{4.3}$$
where $\{Y^n, X_1^n, \ldots, X_M^n\}$ are independent.

Had we a maximum over $j = 1, \ldots, M$ instead of the sum in (4.3), showing that the corresponding probability vanishes would be a standard step in the random-coding proof of the direct channel coding theorem. In the present case, we need to work harder. Despite the fact that for every $j = 1, \ldots, M$,
$$E[\exp(i_{X^n W^n}(X_j^n, Y^n))] = 1,$$
it is not possible to apply the weak law of large numbers to (4.3) directly because the distribution of each of the random variables in the sum depends on the number of terms in the sum (through $n$). In order to show that the probability in (4.3) vanishes it will be convenient to condition on $Y^n$ and, accordingly, to define the following random variables for every $y^n \in B^n$ and $j = 1, \ldots, M$:
$$V_{n,j}(y^n) = \exp(i_{X^n W^n}(X_j^n, y^n)), \tag{4.4}$$
$$Z_{n,j}(y^n) = V_{n,j}(y^n)\, 1\{V_{n,j}(y^n) \le M\}, \tag{4.5}$$
and
$$U_M(y^n) = \frac{1}{M}\sum_{j=1}^{M} V_{n,j}(y^n), \qquad T_M(y^n) = \frac{1}{M}\sum_{j=1}^{M} Z_{n,j}(y^n).$$
Note that for every $y^n \in B^n$ both $\{V_{n,j}(y^n)\}_{j=1}^{M}$ and $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are independent collections of random variables because $\{X_j^n\}_{j=1}^{M}$ are independent. According to (4.4) and (4.5), the probability in (4.3) is equal to the expected value with respect to $P_{Y^n}$ of
$$P[U_M(y^n) > 1 + \tau] \le P[U_M(y^n) \ne T_M(y^n)] + P[T_M(y^n) > 1 + \tau]. \tag{4.6}$$
The first term in the right-hand side of (4.6) is equal to
$$P[U_M(y^n) \ne T_M(y^n)] \le \sum_{j=1}^{M} P[V_{n,j}(y^n) \ne Z_{n,j}(y^n)] = M\, P[V_{n,1}(y^n) > M],$$
whose expectation with respect to $P_{Y^n}$ yields
$$M\sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n)\exp(-i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\}$$
$$\le \sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n Y^n}(x^n, y^n)\, 1\{\exp i_{X^n W^n}(x^n, y^n) > M\} = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \gamma\right],$$
which, again, goes to 0 by definition of $\bar{I}(X; Y)$. Regarding the second term in (4.6), notice first that
$$E[T_M(y^n)] = E[Z_{n,1}(y^n)] \le \sum_{x^n \in A^n} P_{X^n}(x^n)\frac{W^n(y^n \mid x^n)}{P_{Y^n}(y^n)} = 1. \tag{4.7}$$
Therefore, using (4.7) and the Chebychev inequality, we get
$$P[T_M(y^n) > 1 + \tau] \le P[T_M(y^n) - E[T_M(y^n)] > \tau] \le \frac{1}{\tau^2}\,\mathrm{var}(T_M(y^n)) \le \frac{1}{\tau^2 M}\, E[Z_{n,1}^2(y^n)], \tag{4.8}$$
where we have used the fact that $\{Z_{n,j}(y^n)\}_{j=1}^{M}$ are i.i.d. Finally, unconditioning the expectation on the right side of (4.8), we get
$$\frac{1}{M} E[Z_{n,1}^2(Y^n)] = \frac{1}{M}\sum_{x^n \in A^n}\sum_{y^n \in B^n} P_{X^n}(x^n) P_{Y^n}(y^n)\exp(2 i_{X^n W^n}(x^n, y^n))\, 1\{\exp i_{X^n W^n}(x^n, y^n) \le M\}$$
$$= E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\{\exp i_{X^n W^n}(X^n, Y^n) \le M\}\right],$$
where the expectation in the right-hand side is with respect to $P_{X^n Y^n}$ and can be decomposed as
$$E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\left\{\frac{1}{M}\exp i_{X^n W^n}(X^n, Y^n) \le \exp\left(-\frac{n\gamma}{2}\right)\right\}\right]$$
$$+\, E\left[\frac{1}{M}\exp(i_{X^n W^n}(X^n, Y^n))\, 1\left\{\exp\left(-\frac{n\gamma}{2}\right) < \frac{1}{M}\exp i_{X^n W^n}(X^n, Y^n) \le 1\right\}\right]$$
$$\le \exp\left(-\frac{n\gamma}{2}\right) + P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > \bar{I}(X; Y) + \frac{\gamma}{2}\right],$$
which goes to 0 as $n \to \infty$ by definition. $\square$

We remark that in most cases of interest in applications, $X$ and $W$ will be such that $(X; Y)$ is information stable, in which case the upper bound in Theorem 4 is equal to the input-output mutual information rate (cf. Lemma 1).
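The random selection construction of the proof is easy to visualize numerically. The following minimal Python sketch (an assumed toy memoryless channel and i.i.d. biased input, not from the paper) draws roughly $\exp(n(I(X;Y)+\gamma))$ codewords i.i.d. from $X^n$, forms the $M$-type input of (4.1), and reports the variational distance between the true and approximated output statistics, which decreases as $n$ grows.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, q, gamma = 0.1, 0.3, 0.3                     # BSC crossover, input bias, rate slack

def h(t):                                        # binary entropy in bits
    return -t * np.log2(t) - (1 - t) * np.log2(1 - t)

qy = q * (1 - p) + (1 - q) * p                   # output bias
I = h(qy) - h(p)                                 # mutual information rate (i.i.d. input)

for n in (4, 8, 12):
    M = int(2 ** (n * (I + gamma)))
    y = np.array(list(product([0, 1], repeat=n)))            # all output n-tuples
    py = np.prod(np.where(y == 1, qy, 1 - qy), axis=1)       # true output distribution
    c = (rng.random((M, n)) < q).astype(int)                  # random codebook ~ X^n
    dist = (y[None, :, :] != c[:, None, :]).sum(axis=2)       # Hamming distances
    w = p ** dist * (1 - p) ** (n - dist)                     # W^n(y | c_j)
    py_tilde = w.mean(axis=0)                                 # output of the M-type input
    print(n, M, np.abs(py - py_tilde).sum())                  # variational distance
```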
B. Converse Resolvability Theorem

Theorem 4 together with the converse resolvability theorem proved in this subsection will enable us to find a general formula for the resolvability of a channel. However, let us start by giving a negative answer to the immediate question as to whether the upper bound in Theorem 4 is always tight.

Fig. 1. Discrete memoryless channel in Example 1.

Example 1: Consider the 3-input, 2-output memoryless channel of Fig. 1, and the i.i.d. input process $X$ that uses 0 and 1 with probability 1/2, respectively. It is clear that $\bar{I}(X; Y) = I(X; Y) = 1$ bit/symbol. However, the deterministic input process that concentrates all the probability in $(e, \ldots, e)$ achieves exactly the same output statistics. Thus $S(X) = \bar{S}(X) = 0$. On the other hand, it turns out that we can find a capacity-achieving input process for which the bound in Theorem 4 is tight. (We will see in the sequel that this is always true.) Let $X'$ be the uniform distribution on all sequences that contain no symbol $e$ and the same number of 0's and 1's (i.e., their type is (1/2, 0, 1/2)). The entropy rate, the resolution rate, and the mutual information rate of this process are all equal to 1 bit/symbol. Moreover, any input process which approximates $X'$ arbitrarily accurately cannot have a lower entropy rate (nor lower resolution rate, a fortiori). To see this, first consider the case when the input is restricted not to use $e$. Then the input is equal to the output and close variational distance implies that the entropies are also close (cf. Lemma 4). If $e$ is allowed in the input sequences, then the capabilities for approximating $X'$ do not improve because for any input sequence containing at least one $e$, the probability that the output sequence has type (1/2, 1/2) is less than 1/2. Therefore, the distance between the output distributions is lower bounded by one half the probability of the input sequences containing at least one $e$. Thus $S(X') = \bar{S}(X') = 1$ bit/symbol.

The degeneracy illustrated by Example 1 is avoided in important classes of channels such as discrete memoryless channels with full rank (cf. Remark 4). In those settings, sharp results including the tightness of Theorem 4 can be proved using the method of types [8]. In general, however, the converse resolvability theorem does not apply to individual inputs.
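The degeneracy in Example 1 can be checked directly. The following minimal Python sketch assumes the natural reading of Fig. 1 (symbols 0 and 1 are received noiselessly while $e$ is mapped to 0 or 1 with probability 1/2 each) and verifies that the i.i.d. equiprobable $\{0,1\}$ input and the deterministic all-$e$ input induce identical output statistics.

```python
import numpy as np
from itertools import product

W = {"0": {0: 1.0, 1: 0.0}, "1": {0: 0.0, 1: 1.0}, "e": {0: 0.5, 1: 0.5}}
n = 6

def output_distribution(input_dist):
    """input_dist: dict mapping input n-strings to probabilities."""
    p_out = {}
    for y in product([0, 1], repeat=n):
        p_out[y] = sum(px * np.prod([W[s][b] for s, b in zip(xs, y)])
                       for xs, px in input_dist.items())
    return p_out

uniform_01 = {"".join(x): 0.5 ** n for x in product("01", repeat=n)}
all_e = {"e" * n: 1.0}
py1, py2 = output_distribution(uniform_01), output_distribution(all_e)
print(max(abs(py1[y] - py2[y]) for y in py1))   # 0.0: identical output statistics
```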
Theorem 5: For any channel with finite input alphabet and all sufficiently small $\epsilon > 0$,
$$S_\epsilon \ge \sup_X \bar{I}(X; Y).$$
Proof: The following simple result turns the proof of the converse resolvability theorem into a constructive proof.

Lemma 6: Given a channel $W$ with finite input alphabet, fix $R > 0$, $\epsilon > 0$. If for every $\gamma > 0$ there exists a collection $\{Q_i^n\}_{i=1}^{N}$ such that
$$\frac{1}{n}\log\log N > R - \gamma \tag{4.9}$$
and
$$\min_{i \ne j} d(Q_i^n W^n, Q_j^n W^n) > 2\epsilon \tag{4.10}$$
infinitely often in $n$, then
$$S_\epsilon \ge R. \tag{4.11}$$

Proof of Lemma 6: We first make the point that the achievability in Definition 7 is equivalent to its uniform version where the "sufficiently large $n$" for which the statement holds is independent of $X$. To see this, suppose that $R$ is $\epsilon$-achievable in the sense of Definition 7 and denote by $n_0(\epsilon, R, \gamma, W, X)$ the minimum $n_0$ such that for all $n \ge n_0$ there exists $\tilde{X}$ satisfying (2.4) and (2.5). We claim
$$\sup_X n_0(\epsilon, R, \gamma, W, X) < \infty. \tag{4.12}$$
Assume otherwise. Then there exists a sequence of input processes $\{X_k\}_{k=1}^{\infty}$ such that
$$\bar{n}_k = n_0(\epsilon, R, \gamma, W, X_k)$$
is an increasing divergent sequence. Construct a new input process $X$ by letting
$$X^n = X_{k+1}^n, \qquad \text{if } \bar{n}_k \le n < \bar{n}_{k+1}.$$
Note that for all $k$, $n = \bar{n}_{k+1} - 1$ is such that there is no $\tilde{X}^n$ with the desired properties. On the other hand, since $\bar{n}_{k+1} - 1$ is a divergent sequence, $R$ cannot be an $\epsilon$-achievable resolution rate for the constructed input process $X$, and we arrive at a contradiction, thereby establishing (4.12).

Let us now contradict$^5$ the claim of Lemma 6 and suppose that for some $R' < R$,
$$S_\epsilon \le R'.$$
Then, for each $Q_i^n$, we can find $\tilde{Q}_i^n$ such that
$$\frac{1}{n} R(\tilde{Q}_i^n) < R' + \gamma \tag{4.13}$$
and
$$d(Q_i^n W^n, \tilde{Q}_i^n W^n) < \epsilon, \tag{4.14}$$
if $n \ge \sup_X n_0(\epsilon, R', \gamma, W, X)$. Along those $n$, there is an infinite number of integers, denoted by $J$, for which (4.10) holds. Let us focus attention on those blocklengths only. We must have $\tilde{Q}_i^n \ne \tilde{Q}_j^n$ if $i \ne j$, for otherwise the triangle inequality implies
$$d(Q_i^n W^n, Q_j^n W^n) \le d(Q_i^n W^n, \tilde{Q}_i^n W^n) + d(\tilde{Q}_j^n W^n, Q_j^n W^n) < 2\epsilon,$$
contradicting (4.10). The number of different $M'$-type distributions on $A^n$ is upper bounded by $|A^n|^{M'}$. Therefore, the number of different distributions satisfying (4.13) is upper bounded by
$$N \le \exp(nR' + n\gamma)\,|A^n|^{\exp(nR' + n\gamma)} \le |A^n|^{\exp(nR' + 2n\gamma)}$$
for sufficiently large $n \in J$, which results in
$$\frac{1}{n}\log\log N \le R' + 3\gamma$$
for sufficiently large $n \in J$, contradicting (4.9) because $\gamma > 0$ is arbitrarily small. $\square$

Proof of Theorem 5 (cont.): The construction of the $N$ distributions required by Lemma 6 boils down to the construction of a channel derived from the original one and a code for that channel. This is because of Lemma 7, which is akin to the direct identification coding theorem [1] and whose proof readily follows from the argument used in the proof of [7, Theorem 3].

Lemma 7: Fix $0 < \lambda < 1/2$, and a conditional distribution $W_0^n: T_X^n \to T_Y^n$ with arbitrary alphabets $T_X^n \subset A^n$ and $T_Y^n \subset B^n$. If there exists an $(n, M, \lambda)$ code in the maximal error probability sense for $W_0^n$, then there exist $\rho > 1$ and a collection $\{(Q_i^n, D_i)\}_{i=1}^{N}$, where $Q_i^n$ is a distribution on $T_X^n$ and $D_i \subset T_Y^n$, such that for all $i = 1, \ldots, N$,
$$Q_i^n W_0^n(D_i) \ge 1 - \lambda, \tag{4.15}$$
$$Q_j^n W_0^n(D_i) \le 2\lambda, \qquad \text{if } j \ne i, \tag{4.16}$$
$$R(Q_i^n) \le \log M, \tag{4.17}$$
and
$$N \ge \frac{1}{\rho}\left(\rho^M - 1\right).$$

We will show that the collection $\{Q_i^n\}_{i=1}^{N}$ satisfies the conditions of Lemma 6, with an appropriate choice of $M$. For now, let us construct an appropriate conditional distribution $W_0^n$ which will be suitable for finding the channel code required by Lemma 7.

Lemma 8: For every $X$ and $R_1 < \bar{I}(X; Y)$ there exist $0 < \alpha < 1$, $T_X^n \subset A^n$, $T_Y^n \subset B^n$, $T_{XY}^n \subset T_X^n \times T_Y^n$ and a conditional distribution $W_0^n: T_X^n \to T_Y^n$ such that if $(x^n, y^n) \in T_{XY}^n$, then
$$\alpha W_0^n(y^n \mid x^n) \le W^n(y^n \mid x^n) \le W_0^n(y^n \mid x^n) \tag{4.18}$$
and
$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) > R_1 \tag{4.19}$$
infinitely often, where
$$P_{X_0^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(T_X^n), & x^n \in T_X^n, \\ 0, & x^n \notin T_X^n, \end{cases} \tag{4.20}$$
and $P_{Y_0^n} = P_{X_0^n} W_0^n$.

$^5$ The finiteness of the input alphabet is crucial in this argument.
Proof: Choose $R_1 < R_2 < \bar{I}(X; Y)$ and define
$$D_{XY}^n = \left\{(x^n, y^n) \in A^n \times B^n: \frac{1}{n} i_{X^n W^n}(x^n, y^n) > R_2\right\}.$$
By definition of $\bar{I}(X; Y)$, there exists $\alpha > 0$ such that
$$P_{X^n Y^n}[D_{XY}^n] > 2\alpha, \tag{4.21}$$
if $n \in I$, where $I$ is an infinite set. Define
$$T_Y^n(x^n) = \{y^n \in B^n: (x^n, y^n) \in D_{XY}^n\}, \qquad \sigma(x^n) = W^n(T_Y^n(x^n) \mid x^n),$$
$$T_X^n = \{x^n \in A^n: \sigma(x^n) > \alpha\}, \qquad T_Y^n = \bigcup_{x^n \in T_X^n} T_Y^n(x^n), \qquad T_{XY}^n = D_{XY}^n \cap (T_X^n \times T_Y^n),$$
and $W_0^n: T_X^n \to T_Y^n$ by
$$W_0^n(y^n \mid x^n) = \begin{cases} W^n(y^n \mid x^n)/\sigma(x^n), & \text{if } y^n \in T_Y^n(x^n), \\ 0, & \text{if } y^n \notin T_Y^n(x^n). \end{cases}$$
Now, (4.18) follows immediately because if $x^n \in T_X^n$, then $\alpha < \sigma(x^n) \le 1$. In general, if $x^n \in A^n$, then $0 \le \sigma(x^n) \le 1$ and
$$P_{X^n Y^n}[D_{XY}^n] = \sum_{x^n \in A^n} P_{X^n}(x^n)\sigma(x^n) \le P_{X^n}(T_X^n) + \alpha,$$
which together with (4.21) implies that
$$P_{X^n}(T_X^n) > \alpha, \tag{4.22}$$
if $n \in I$. In turn, this implies
$$P_{Y_0^n}(y^n) = \sum_{x^n \in T_X^n} W_0^n(y^n \mid x^n) P_{X_0^n}(x^n) \le \frac{1}{\alpha^2} P_{Y^n}(y^n).$$
Now, to prove (4.19), note that if $(x^n, y^n) \in T_{XY}^n$ and $n \in I$,
$$\frac{1}{n} i_{X_0^n W_0^n}(x^n, y^n) = \frac{1}{n}\log\frac{W^n(y^n \mid x^n)}{\sigma(x^n) P_{Y_0^n}(y^n)} \ge \frac{1}{n}\log\frac{W^n(y^n \mid x^n)}{P_{Y_0^n}(y^n)} \ge \frac{1}{n} i_{X^n W^n}(x^n, y^n) + \frac{2}{n}\log\alpha > R_2 + \frac{2}{n}\log\alpha > R_1$$
infinitely often, where we have used (4.22) to derive the second inequality. Finally, by definition of $W_0^n$,
$$P_{X_0^n Y_0^n}[T_{XY}^n] = 1. \qquad \square$$

Proof of Theorem 5 (cont.): Now it remains to show the existence of the channel code required by Lemma 7. This is accomplished via a fundamental result in the Shannon theory applied to $X_0^n$ and $W_0^n$.

Lemma 9 [4]: For every $n$, $M$ and $\theta > 0$, there exists an $(n, M, \lambda)$ code for $W_0^n$ whose maximal error probability $\lambda$ satisfies
$$\lambda \le \exp(-n\theta) + P\left[\frac{1}{n} i_{X_0^n W_0^n}(X_0^n, Y_0^n) \le \frac{1}{n}\log M + \theta\right]. \tag{4.23}$$

Now, let us choose $M$ so that
$$R_1 - 2\theta \le \frac{1}{n}\log M < R_1 - \theta, \tag{4.24}$$
for arbitrary $\theta > 0$. Then, owing to (4.19) and the second inequality in (4.24), the second term on the right-hand side of (4.23) vanishes. Lemmas 8 and 9 along with the left inequality in (4.24) provide the $(n, M, \exp(-n\theta))$ code required by Lemma 7 for an infinite number of blocklengths $n$. Then, (4.17) and (4.24) imply that for sufficiently large $n$,
$$\frac{1}{n}\log\log N \ge \frac{1}{n}\log M - \theta \ge R_1 - 3\theta,$$
which satisfies (4.9) for arbitrary $\gamma > 0$ because $R_1 < \sup_X \bar{I}(X; Y)$, and $\theta > 0$ can be chosen arbitrarily close to those boundaries. Finally, to show (4.10) we apply Lemma 7 to get
$$d(Q_i^n W^n, Q_j^n W^n) \ge 2 Q_i^n W^n(D_i) - 2 Q_j^n W^n(D_i) \ge 2\alpha Q_i^n W_0^n(D_i) - 2 Q_j^n W_0^n(D_i) \ge 2\alpha(1 - \lambda) - 4\lambda, \tag{4.25}$$
where the second inequality follows from (4.18) and the third inequality results from (4.15) and (4.16). With an adequate choice of $\lambda$ (guaranteed by Lemma 7) the right side of (4.25) is strictly positive and (4.10) holds, concluding the proof of Theorem 5. $\square$

Theorems 4 and 5 and $S = \sup_X S(X)$ readily result in the general formula for channel resolvability.

Theorem 6: The resolvability of a channel $W$ with finite input alphabet is given by
$$S = \sup_X \bar{I}(X; Y). \tag{4.26}$$

Remark 2: Theorems 4 and 5 actually show the stronger result:
$$S_\epsilon = \sup_X \bar{I}(X; Y) \tag{4.27}$$
for all sufficiently small $\epsilon > 0$.

V. RESOLVABILITY AND CAPACITY

Having derived a general expression for resolvability in Theorem 6, this section shows that for the great majority of channels of interest, resolvability is equal to the Shannon capacity.$^6$ Let us first record the following fact.

$^6$ We adhere to the conventional definition of channel capacity [3, p. 101].
Theorem 7: For any channel with finite input alphabet,
$$C \le S. \tag{5.1}$$

Proof: The result follows from Theorem 6 and the following chain of inequalities:
$$C \le \liminf_{n \to \infty}\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \limsup_{n \to \infty}\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \sup_X \bar{I}(X; Y).$$
The first inequality is the general (weak) converse to the channel coding theorem, which can be proved using the Fano inequality (cf. [16, Theorem 1]). So only the third inequality needs to be proved, i.e., for all $\gamma > 0$ and all sufficiently large $n$,
$$\sup_{X^n}\frac{1}{n} I(X^n; Y^n) \le \sup_X \bar{I}(X; Y) + \gamma,$$
but this follows from Lemma A1 (whose proof hinges on the finiteness of the input alphabet) as developed in the proof of Lemma 1 (cf. Appendix). $\square$

In order to investigate under what conditions equality holds in (5.1) we will need the following definition.

Definition 10: A channel $W$ with capacity $C$ is said$^7$ to satisfy the strong converse if for every $\gamma > 0$ and every sequence of $(n, M, \lambda_n)$ codes that satisfy
$$\frac{1}{n}\log M > C + \gamma,$$
it holds that
$$\lambda_n \to 1, \qquad \text{as } n \to \infty.$$

$^7$ This is the strong converse in the sense of Wolfowitz [20].

Theorem 8: For any channel with finite input alphabet which satisfies the strong converse:
$$C = S.$$

Proof: In view of Theorem 7 and its proof, it is enough to show $S \le C$. The following lemma implies the desired result
$$C \ge \sup_X \bar{I}(X; Y) = S.$$

Lemma 10: A channel $W$ that satisfies the strong converse has the property that for all $\delta > 0$ and $X$,
$$\lim_{n \to \infty} P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] = 0.$$

Proof of Lemma 10: Arguing by contradiction, assume that there exist $\delta > 0$, $\alpha > 0$ and $X^n$ such that
$$P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha,$$
for all $n$ in an infinite set $J$ of integers. Under such an assumption, we will construct an $(n, M, 1 - \alpha/3)$ code with
$$C + \frac{\delta}{4} \le \frac{1}{n}\log M \le C + \frac{\delta}{2} \tag{5.2}$$
for every sufficiently large $n \in J$. The construction will follow from the standard Shannon random selection with a suitably chosen input distribution.

Codebook: Each codeword is selected independently and randomly with distribution
$$P_{X_0^n}(x^n) = \begin{cases} P_{X^n}(x^n)/P_{X^n}(G), & \text{if } x^n \in G, \\ 0, & \text{otherwise}, \end{cases}$$
where
$$G = \left\{x^n \in A^n: W^n(D(x^n) \mid x^n) \ge \frac{\alpha}{2}\right\}$$
and
$$D(x^n) = \left\{y^n \in B^n: \frac{1}{n} i_{X^n W^n}(x^n, y^n) > C + \delta\right\}.$$
Decoder: The following decoding set $D_i$ corresponds to codeword $c_i$:
$$D_i = D(c_i) - \bigcup_{\substack{j=1 \\ j \ne i}}^{M} D(c_j).$$
Error Probability:
$$W^n(D_i^c \mid c_i) \le W^n(D^c(c_i) \mid c_i) + \sum_{\substack{j=1 \\ j \ne i}}^{M} W^n(D(c_j) \mid c_i) \le 1 - \frac{\alpha}{2} + \sum_{\substack{j=1 \\ j \ne i}}^{M} W^n(D(c_j) \mid c_i). \tag{5.3}$$
Let us now estimate the average of the last term in (5.3) with respect to the choice of $c_i$:
$$\sum_{c_i \in G} W^n(D(c_j) \mid c_i) P_{X_0^n}(c_i) \le \frac{1}{P_{X^n}(G)}\sum_{x^n \in A^n} W^n(D(c_j) \mid x^n) P_{X^n}(x^n) = P_{Y^n}(D(c_j))/P_{X^n}(G),$$
where
$$P_{Y^n}(D(c_j)) = \sum_{y^n \in B^n} P_{Y^n}(y^n)\, 1\{P_{Y^n}(y^n) < \exp(-nC - n\delta) W^n(y^n \mid c_j)\} < \exp(-nC - n\delta).$$
Thus, the average of the right side of (5.3) is upper bounded by
$$1 - \frac{\alpha}{2} + (M - 1)\exp(-nC - n\delta)/P_{X^n}(G) \le 1 - \frac{\alpha}{2} + \exp\left(-\frac{n\delta}{2}\right)\Big/P_{X^n}(G), \tag{5.4}$$
using (5.2). Finally, all that remains is to lower bound $P_{X^n}(G)$. We notice that $P_{X^n}(G) = P[Z \ge \alpha/2]$, with $Z = W^n(D(X^n) \mid X^n)$. The random variable $Z$ lies in the interval $[0, 1]$ and its expectation is bounded by
$$P\left[Z \ge \frac{\alpha}{2}\right] + \frac{\alpha}{2} \ge E[Z] = P\left[\frac{1}{n} i_{X^n W^n}(X^n, Y^n) > C + \delta\right] > \alpha.$$
Therefore, the right side of (5.4) is upper bounded by
$$1 - \frac{\alpha}{2} + \exp\left(-\frac{n\delta}{2}\right)\frac{2}{\alpha} < 1 - \frac{\alpha}{3},$$
for all sufficiently large $n \in J$, thereby completing the construction and analysis of the sought-after code. $\square$

The condition of Theorem 8 is not only sufficient but also necessary for $S = C$. This fact along with a general formula for the Shannon capacity of a channel (obtained without any assumptions such as information stability):
$$C = \sup_X \underline{I}(X; Y) \tag{5.5}$$
(cf. the dual expression (4.26) for the channel resolvability) is proved in [17], by way of a new approach to the converse of the channel coding theorem.

Important classes of channels (such as those with finite memory) are known to satisfy the strong converse [5], [18], [20]. The archetypical example of a channel that does not satisfy the strong converse is the following.

Fig. 2. Channel whose capacity is less than its resolvability.

Fig. 3. Information spectrum (probability mass function of the normalized information density) of the channel in Example 2 with $\delta_1 = 0.1$, $\delta_2 = 0.15$, $a = 0.5$, $n = 1000$.

Example 2: Consider the channel in Fig. 2 where the switch selects one of the binary symmetric channels (BSC) with probabilities $(a, 1 - a)$, $0 < a < 1$, and remains fixed for the whole duration of the transmission. Thus, its conditional probabilities are
$$W^n(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = a\prod_{i=1}^{n} W_1(y_i \mid x_i) + (1 - a)\prod_{i=1}^{n} W_2(y_i \mid x_i),$$
where
$$W_i(y \mid x) = (1 - \delta_i)\,1\{y = x\} + \delta_i\,1\{y \ne x\}, \qquad 0 < \delta_i < 1/2, \quad i = 1, 2.$$
A typical information spectrum for this channel and large $n$ is depicted in Fig. 3. The $\epsilon$-capacity of this channel depends on $\epsilon$ [19] and its capacity is equal to $\min\{C_1, C_2\}$, where $C_i = \log 2 - h(\delta_i)$ is the capacity of BSC$_i$. In order to compute the resolvability of this channel, note first that if the distribution of the random variable $A_n$ is a mixture of two distributions, then the limsup in probability of $\{A_n\}$ is equal to the maximum of the corresponding limsups in probability that would result if $A_n$ were distributed under each of the individual distributions comprising the mixture. Now, to investigate the asymptotic behavior of the information spectrum, consider the bounds
$$\frac{1}{n}\log\min\{a, 1 - a\} + \max\{u, v\} \le \frac{1}{n}\log\left(a\exp(nu) + (1 - a)\exp(nv)\right) \le \max\{u, v\}.$$
Therefore, the information spectrum evaluated with the optimal i.i.d. input distribution $X$ (which assigns equal probability to all the input sequences) converges asymptotically to the distribution of
$$\max\left\{\frac{1}{n}\sum_{j=1}^{n} i_{X W_1}(X_j; Y_j),\ \frac{1}{n}\sum_{j=1}^{n} i_{X W_2}(X_j; Y_j)\right\}, \tag{5.6}$$
where we have identified $u$ and $v$ with the quantities within the maximum in (5.6). If $(X_j, Y_j)$ are connected through $W_1$ (which occurs with probability $a$), then the expected value of the first term in (5.6) exceeds that of the second one by
$$E[i_{X W_1}(X_j; Y_j)] - E[i_{X W_2}(X_j; Y_j)] = D(W_1 \| W_2) \ge 0, \tag{5.7}$$
where the expectations in (5.7) are with respect to the joint distribution of $(X_j, Y_j)$ connected through $W_1$. Reversing the roles of channels 1 and 2, we obtain an analogous expression to (5.7). Therefore, the weak law of large numbers results in
$$\bar{I}(X; Y) = \max\{C_1, C_2\}. \qquad \square$$

We have seen that for the majority of channels, resolvability is equal to capacity, and therefore the body of results in information theory devoted to the maximization of mutual information is directly applicable to the calculation of resolvability for these channels. Example 2 has illustrated the computation of resolvability using the formula in Theorem 6 in a case where the capacity is strictly smaller than resolvability. For channels that do not satisfy the strong converse it is of interest to develop tools for the maximization of the sup-information rate (resolvability) and of the inf-information rate (capacity). It turns out [17] that the basic results on mutual information which are the key to its maximization, such as the data-processing lemma and the optimality of independent inputs for memoryless systems, are inherited by $\bar{I}(X; Y)$ and $\underline{I}(X; Y)$.
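The bimodal spectrum of Fig. 3 is easy to reproduce by simulation; the following minimal Python sketch (assumed parameters matching the figure caption, not from the paper) samples the normalized information density of the switched-BSC channel with i.i.d. equiprobable inputs and shows the two clusters near $C_1 = 1 - h(\delta_1)$ and $C_2 = 1 - h(\delta_2)$, so that the sup-information rate (resolvability) is $\max\{C_1, C_2\}$ while the capacity is $\min\{C_1, C_2\}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, a, n, trials = 0.1, 0.15, 0.5, 1000, 20000

def h(t):
    return -t * np.log2(t) - (1 - t) * np.log2(1 - t)

which = rng.random(trials) < a                     # switch position per transmission
delta = np.where(which, d1, d2)
flips = rng.random((trials, n)) < delta[:, None]   # error pattern of the selected BSC
k = flips.sum(axis=1)                              # number of crossovers
# normalized information density (1/n) log2 [ W^n(y|x) / 2^{-n} ] with
# W^n(y|x) = a d1^k (1-d1)^(n-k) + (1-a) d2^k (1-d2)^(n-k)
w = a * d1**k * (1 - d1) ** (n - k) + (1 - a) * d2**k * (1 - d2) ** (n - k)
spectrum = (np.log2(w) + n) / n
print("spikes expected near:", 1 - h(d1), 1 - h(d2))
print("cluster means:", spectrum[which].mean(), spectrum[~which].mean())
```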
From such a sequence of codebooks { Qy , i = 1, . . . , N}
where N grows monotonically (doubly exponentially) with
n, we can construct the sequence {Qi = (Qi, Qz, .. .)}E”=,
required in Lemma 6, with an arbitrary choice of Qy if i > N.
Then {Qi}~!i
satisfies (4.9). Furthermore, for i # j and
i < N, j 5 N, then for all sufficiently large n,
d(QaW”,
QyWn)
> 2Q~W’“(Di)
- aQ,“w”(Di)
> 2(1 - Xl) - 2x2 > 26,
satisfying (4.10). Thus, the conclusion of Lemma 6 is that
0
and Theorem 9 is proved.
Theorem 9 and Theorem 4 imply that the (Xi, &)-ID
capacity is upper bounded by
VI. RESOLVABILITYAND IDENTIFICATIONVIA CHANNELS
A major recent achievement in the Shannon Theory was the
identification (ID) coding theorem of Ahlswede and Dueck [l].
The ID capacity of a channel is the maximal iterated logarithm
of the number of messages per channel use which can be
reliably transmitted when the receiver is only interested in
deciding whether a specific message is equal to the transmitted
message. The direct part of the ID coding theorem states
that the ID-capacity of any channel is lower bounded by its
capacity [l]. A version of the converse theorem (soft converse)
which requires the error probabilities to vanish exponentially
fast and applies to discrete memoryless channels was proved in
[l]. The strong converse to the ID coding theorem for discrete
memoryless channels was proved in [7]. Both proofs (of the
soft-converse and the strong converse) are nonelementary and
crucially rely on the assumption that the channel is discrete and
memoryless. The purpose of this section is to provide a version
of the strong converse to the ID coding theorem which not only
holds in wide generality, but follows immediately from the
direct part of the resolvability theorem. The link between the
theories of approximation of output statistics and identification
via channels is not accidental. We have already seen that the
proof of the converse resolvability theorem (Theorem 5) uses
Lemma 7, which is, in essence, the central tool in the proof
of the direct ID coding theorem.
The root of the interplay between both bodies of results is
the following simple theorem.*
Theorem 9: Let the channel have finite input alphabet. Its
(Xi, &)-ID capacity is upper bounded by its e-resolvability
S,, with 0 < E < 1 - Xr - X2.
s We refer the reader to [l], [7] for the pertinent definitions in identification
via channels.
supT(X;
X
Y),
which is equal to the Shannon capacity under the mild sufficient condition (strong converse) found in Section V. This
gives a very general version of the strong converse to the
identification coding theorem, which applies to any finite-input
channel-well beyond the discrete memoryless channels for
which it was already known to hold [7, Theorem 21.
It should be noted that Theorem 9 and [7, Theorem 1] imply that if $0 < \lambda < \lambda_1$, $\lambda < \lambda_2$, $\epsilon > 0$, and $\epsilon + \lambda_1 + \lambda_2 \le 1$, then
$$C_\lambda \le D_{\lambda_1, \lambda_2} \le S_\epsilon, \qquad (6.1)$$
where $C_\lambda$ is the $\lambda$-capacity of the channel in the maximal error probability sense and $D_{\lambda_1, \lambda_2}$ is the $(\lambda_1, \lambda_2)$-ID capacity. Note that, unlike the bound on $\epsilon$-resolvability in Theorem 5, (6.1) can be used with arbitrary $0 < \epsilon < 1$, but may not be tight if the channel does not satisfy the strong converse. If the strong converse is satisfied, however, (6.1) holds with equality for all sufficiently small $\epsilon > 0$, because of (4.27) and Theorem 8, as well as the fact that $C = C_\lambda$ for all $0 < \lambda < 1$ due to the assumed strong converse. Consequently, we have the following corollary.

Corollary: For any finite-input channel satisfying the strong converse,
$$C = D_{\lambda_1, \lambda_2} = S \qquad (6.2)$$
if $\lambda_1 + \lambda_2 < 1$.

The first equality in (6.2) had been proved in [7, Theorem 2] for the specific case of discrete memoryless channels using a different approach.
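The logic behind the corollary can be summarized by the following chain, which merely strings together the results quoted above (the strong converse, [7, Theorem 1], Theorem 9, and the resolvability formula) and is recorded here only as a reading aid:
$$C = C_\lambda \le D_{\lambda_1, \lambda_2} \le S_\epsilon \le S = \sup_X \overline{I}(X; Y) = C,$$
valid for any $0 < \lambda < \min\{\lambda_1, \lambda_2\}$ and all sufficiently small $\epsilon > 0$, which forces all the quantities to coincide.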
VII. MEAN-RESOLVABILITY THEOREMS

This section briefly investigates the effect of replacing the worst-case randomness measure (resolution) by the average randomness measure (entropy) on the results obtained in Section IV. The treatment here will not be as thorough as that in Section IV, and in particular, we will leave the proof of a general formula for mean-resolvability for future work. Instead, we will present some evidence that, in channels of interest, mean-resolvability is also equal to capacity.

An immediate consequence of (2.8) is that the direct resolvability theorem (Theorem 4) holds verbatim for mean-resolvability, i.e., for all $\epsilon > 0$ and $X$,
$$\bar{S}_\epsilon(X) \le \bar{S}(X) \le \overline{I}(X; Y), \qquad (7.1)$$
and if the channel satisfies the strong converse, then
$$\bar{S} \le C. \qquad (7.2)$$
Therefore, in this section our attention can be focused on converse mean-resolvability theorems.

First, we illustrate a proof technique which is completely different from that of the converse resolvability theorem (Theorem 5) in order to find the mean-resolvability of binary symmetric channels (BSC).

Theorem 10: The mean-resolvability of a BSC is equal to its capacity.

Proof: Since BSC's satisfy the strong converse, (7.2) holds and we need to show
$$\bar{S} \ge C. \qquad (7.3)$$
Suppose otherwise, i.e., for some $\mu > 0$,
$$0 \le \bar{S} < C - \mu. \qquad (7.4)$$
Let $\lambda = \mu/\log 4$ and $\gamma = \mu/8$. For all sufficiently large $n$, there exists an $(n, M, \lambda)$ code (in the maximal error probability sense) such that all its codewords are distinct ($\lambda < 1/2$ because, without loss of generality, $\mu < \log 2$) and
$$\log 2 \ge \frac{1}{n}\log M \ge C - \gamma. \qquad (7.5)$$
Let $X^n$ be uniformly distributed on the codewords of the $(n, M, \lambda)$ code. Thus,
$$H(X^n) = \log M. \qquad (7.6)$$
According to (7.4), for any $0 < \theta < 1/2$, there exists $\tilde{X}^n$ such that the outputs to $X^n$ and $\tilde{X}^n$ satisfy
$$d(Y^n, \tilde{Y}^n) \le \theta \qquad (7.7)$$
and
$$\frac{1}{n} H(\tilde{X}^n) \le C - \mu + \gamma. \qquad (7.8)$$
Since, by definition of the BSC,
$$H(\tilde{Y}^n \mid \tilde{X}^n) = H(Y^n \mid X^n), \qquad (7.9)$$
we obtain
$$\frac{1}{n} H(Y^n) - \frac{1}{n} H(\tilde{Y}^n) = \frac{1}{n} H(X^n) - \frac{1}{n} H(X^n \mid Y^n) + \frac{1}{n} H(\tilde{X}^n \mid \tilde{Y}^n) - \frac{1}{n} H(\tilde{X}^n)$$
$$\ge \frac{1}{n} H(X^n) - \lambda\,\frac{1}{n}\log M - \frac{1}{n} h(\lambda) - \frac{1}{n} H(\tilde{X}^n)$$
$$\ge C - \gamma - [C - \mu + \gamma] - \lambda\log 2 - \frac{1}{n} h(\lambda)$$
$$= \frac{\mu}{4} - \frac{1}{n} h(\lambda) \ge \frac{\mu}{8}, \qquad (7.10)$$
where the first inequality is a result of the Fano inequality, the second inequality follows from (7.5), (7.6), and (7.8), and the last inequality holds for all sufficiently large $n$. Now, applying Lemma 4 to the present case (here the output alphabet is $\{0,1\}^n$, of size $2^n$), (7.7) results in
$$|H(Y^n) - H(\tilde{Y}^n)| \le n\theta\log 2 + \theta\log(1/\theta), \qquad (7.11)$$
which contradicts (7.10) because $\theta$ can be chosen arbitrarily small. □

Remark 3: The only feature of the BSC used in the proof of Theorem 10 is that $H(Y \mid X = x)$ is independent of $x$, which holds for any "additive-noise" discrete memoryless channel.
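As a purely numerical aside (not part of the original development), the quantities at stake in Theorem 10 can be computed exactly for very short blocklengths. The sketch below, with names and parameters chosen here for illustration, drives a BSC with $M = 2^{nR}$ equiprobable random codewords and evaluates the exact variational distance to the true output statistics of the capacity-achieving (uniform) input; the sharp threshold at capacity only emerges as $n$ grows, but the monotone effect of the rate is already visible.

    import itertools
    import numpy as np

    # Illustrative sketch: exact output statistics of a BSC(p) driven by M
    # equiprobable random codewords, compared with the true output of the
    # capacity-achieving (uniform i.i.d.) input, which is uniform on {0,1}^n.

    def bsc_output_mixture(codewords, p):
        """Exact output distribution on {0,1}^n induced by a uniform choice
        among the rows of `codewords` followed by a BSC with crossover p."""
        n = codewords.shape[1]
        dist = np.zeros(2 ** n)
        for idx, y in enumerate(itertools.product((0, 1), repeat=n)):
            flips = np.sum(codewords != np.array(y), axis=1)   # Hamming distances
            dist[idx] = np.mean(p ** flips * (1 - p) ** (n - flips))
        return dist

    rng = np.random.default_rng(0)
    n, p = 10, 0.11                       # capacity approx. 0.5 bit per channel use
    uniform = np.full(2 ** n, 2.0 ** (-n))
    for rate in (0.25, 0.5, 0.75, 1.0):   # bits per channel use
        M = int(round(2 ** (n * rate)))
        codewords = rng.integers(0, 2, size=(M, n))
        d = np.abs(uniform - bsc_output_mixture(codewords, p)).sum()
        print(f"R = {rate:.2f}  M = {M:4d}  d(Y^n, Y~^n) = {d:.3f}")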
A converse mean-resolvability theorem which, unlike Theorem 5, does not hinge on the assumption of finite input alphabet can be proved by curtailing the freedom of choice of the approximating input distribution. In Example 1, we illustrated the pathological behavior that may arise when the approximating input distribution puts mass on sequences which have zero probability under the original distribution. One way to avoid this behavior is to restrict the approximating input distribution to be absolutely continuous with respect to the original input distribution.

Theorem 11 (Mean-Resolvability Semi-Converse): For any channel W with capacity C there exists an input process X such that if $\tilde{X}$ satisfies
$$\lim_{n\to\infty} d(Y^n, \tilde{Y}^n) = 0 \qquad (7.12)$$
and $P_{\tilde{X}^n} \ll P_{X^n}$, then, for every $\mu > 0$,
$$\frac{1}{n} H(\tilde{X}^n) \ge C - \mu \qquad (7.13)$$
infinitely often.

Proof: Let us suppose that the result is not true and therefore there exists $\mu_0 > 0$ such that for every input process X we can find $\tilde{X}$ such that $P_{\tilde{X}^n} \ll P_{X^n}$, (7.12), and
$$\frac{1}{n} H(\tilde{X}^n) \le C - \mu_0 \quad \text{for all sufficiently large } n \qquad (7.14)$$
are satisfied. Fix $0 < \gamma < \mu_0$ and choose
$$\tau < \frac{\mu_0 - \gamma}{C - \mu_0} \qquad (7.15)$$
along with
$$\lambda < \frac{\tau}{2\tau + 1}. \qquad (7.16)$$
For all sufficiently large $n$, select an $(n, M, \lambda)$ code $\{(c_i, D_i)\}_{i=1}^{M}$ (in the maximal error probability sense) with rate
$$\frac{1}{n}\log M \ge C - \gamma.$$
Let $X^n$ be equal to $c_i$ with probability $1/M$, $i = 1, \ldots, M$. The restriction $P_{\tilde{X}^n} \ll P_{X^n}$ means that the approximating distribution can only put mass on the codewords $\{c_1, \ldots, c_M\}$. However, the mass on those points is not restricted in any way (e.g., $P_{\tilde{X}^n}$ need not have finite resolution).

Define the set of the most likely codewords under $P_{\tilde{X}^n}$:
$$T_\tau = \{x^n \in A^n : P_{\tilde{X}^n}(x^n) \ge \exp(-n(C - \mu_0)(1 + \tau))\}, \qquad (7.17)$$
whose cardinality is obviously bounded by
$$|T_\tau| \le \exp(n(C - \mu_0)(1 + \tau)). \qquad (7.18)$$
From (7.14), we have
$$n(C - \mu_0) \ge E\bigl[\log 1/P_{\tilde{X}^n}(\tilde{X}^n)\bigr] \ge n(C - \mu_0)(1 + \tau)\, P_{\tilde{X}^n}(T_\tau^c),$$
or the lower bound
$$P_{\tilde{X}^n}(T_\tau) \ge \frac{\tau}{1 + \tau}. \qquad (7.19)$$
We will lower bound the variational distance between $Y^n$ and $\tilde{Y}^n$ by estimating the respective probabilities of the set
$$B = \bigcup_{i \in I} D_i, \qquad (7.20)$$
where $I = \{i \in \{1, \ldots, M\} : c_i \in T_\tau\}$. Since the sets in (7.20) are disjoint, we have
$$P_{\tilde{Y}^n}(B) \ge \sum_{i \in I} W^n(D_i \mid c_i)\, P_{\tilde{X}^n}(c_i) \ge (1 - \lambda)\, P_{\tilde{X}^n}(T_\tau) \ge (1 - \lambda)\frac{\tau}{1 + \tau}. \qquad (7.21)$$
On the other hand,
$$P_{Y^n}(B) = \frac{1}{M}\sum_{i \in I} W^n(B \mid c_i) + \frac{1}{M}\sum_{i \notin I} W^n(B \mid c_i) \le \frac{|I|}{M} + \lambda \le \exp(-n\delta) + \lambda \qquad (7.22)$$
for some $\delta > 0$ because of (7.15) and (7.18). Combining (7.21) and (7.22), we get
$$d(Y^n, \tilde{Y}^n) \ge 2 P_{\tilde{Y}^n}(B) - 2 P_{Y^n}(B) \qquad (7.23)$$
$$\ge 2\frac{\tau}{1 + \tau}(1 - \lambda) - 2\lambda - 2\exp(-n\delta), \qquad (7.24)$$
which is bounded away from zero because of (7.16), thereby contradicting (7.12). □

In connection with Theorem 11, it should be pointed out that the conclusion of the achievability result in Theorem 4 holds even if $\tilde{X}^n$ is restricted to be absolutely continuous with respect to $X^n$, as long as $X^n$ is a discrete distribution.⁹ (Recall that in the proof of Theorem 4, $\tilde{X}^n$ is generated by random selection from $X^n$.)

⁹Theorem 1 holds for infinite-alphabet channels as well.

Remark 4: Recall that a general formula for the individual resolvability of input processes is attainable only for channels that avoid the pathological behavior of Example 1. In [8], it is shown that
$$S(X) = \overline{I}(X; Y)$$
if the channel W is discrete memoryless with full rank, i.e., the transition vectors $\{W(\cdot \mid a)\}_{a \in A}$ are linearly independent. This class of channels includes as a special case the BSC (Theorem 10). Even in this special case, however, the complete characterization of $\bar{S}(X)$ remains unsolved.

VIII. DIVERGENCE-GAUGED APPROXIMATION

So far, we have studied the approximation of output statistics under the criterion of vanishing variational distance. Here, we will consider the effect of replacing the variational distance $d(Y^n, \tilde{Y}^n)$ by the normalized divergence
$$\frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}).$$
We point out that this criterion is neither weaker nor stronger than the previous one. Although we do not attempt to give a comprehensive body of results based on this criterion, we will show several achievability and converse results by proof techniques that differ substantially from the ones that have appeared in the previous sections.

We give first an achievability result within the context of information-stable input/output pairs.

Theorem 12: Suppose that (X, Y) is information stable and $I(X; Y) < \infty$, where Y is the output of W due to X. Then,
$$\frac{1}{n} I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) \le \Bigl[\frac{1}{n} I(X^n; Y^n) - \frac{1}{n}\log M\Bigr]^{+} + o(1), \qquad (8.1)$$
where $\{X_1^n, \ldots, X_M^n\}$ is i.i.d. with common distribution $X^n$, $\tilde{Y}^n$ is connected with $X_1^n, \ldots, X_M^n$ via the channel in Fig. 4, and the term $o(1)$ is nonnegative and vanishes as $n \to \infty$.
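Before turning to the proof, the arrangement of Fig. 4 is easy to describe procedurally. The following sketch is only an illustration of that arrangement (the names, the random seed, and the example channel are chosen here, not taken from the paper): it draws the M codewords i.i.d. from $P_{X^n}$, selects one of them with the uniformly distributed switch, and passes it through the memoryless channel W used n times.

    import numpy as np

    rng = np.random.default_rng(1)

    def fig4_sample(P_X, W, n, M):
        """One realization of the Fig. 4 arrangement: returns the codebook
        (X_1^n,...,X_M^n), the switch position, and the channel output."""
        A = len(P_X)
        codebook = rng.choice(A, size=(M, n), p=P_X)   # i.i.d. codewords ~ X^n
        j = rng.integers(M)                            # uniform switch position
        transmitted = codebook[j]
        output = np.array([rng.choice(W.shape[1], p=W[a]) for a in transmitted])
        return codebook, j, output

    # Example: binary symmetric channel with crossover 0.1 and uniform input.
    W = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
    codebook, j, output = fig4_sample(np.array([0.5, 0.5]), W, n=8, M=4)
    print(codebook, j, output, sep="\n")

In the proof below, the codeword actually transmitted through the channel is denoted $\hat{X}^n$; by construction, its joint distribution with the output coincides with that of $(X^n, Y^n)$.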
Fig. 4. Input-output transformation in random coding. Switch is equally likely to be in each of its M positions.

Proof: Note that the joint distribution of $(\hat{X}^n, \tilde{Y}^n)$ is that of $(X^n, Y^n)$, where $\hat{X}^n$ denotes the codeword selected by the switch in Fig. 4. By Kolmogorov's identity and the conditional independence of $\tilde{Y}^n$ and $X_1^n, \ldots, X_M^n$ given $\hat{X}^n$,
$$I(X^n; Y^n) = I(\hat{X}^n; \tilde{Y}^n) = I(\hat{X}^n, X_1^n, \ldots, X_M^n; \tilde{Y}^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) + I(\hat{X}^n; \tilde{Y}^n \mid X_1^n, \ldots, X_M^n), \qquad (8.2)$$
where the second term on the right side is less than or equal to $\log M$. This shows that the term $o(1)$ in (8.1) is nonnegative.

It remains to upper bound the left side of (8.1):
$$I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) = \sum_{y^n \in B^n} \sum_{c_1 \in A^n} \cdots \sum_{c_M \in A^n} P_{X^n}(c_1) \cdots P_{X^n}(c_M)\, \frac{1}{M}\sum_{i=1}^{M} W^n(y^n \mid c_i)\, \log\Bigl(\frac{1}{M}\sum_{j=1}^{M} \exp i_{X^nW^n}(c_j, y^n)\Bigr)$$
$$\le \sum_{y^n \in B^n} \sum_{c^n \in A^n} P_{X^n}(c^n)\, W^n(y^n \mid c^n)\, \log\Bigl(\frac{1}{M}\Bigl[\exp i_{X^nW^n}(c^n, y^n) + \sum_{j=2}^{M} E[\exp i_{X^nW^n}(X_j^n, y^n)]\Bigr]\Bigr)$$
$$\le E\Bigl[\log\Bigl(1 + \frac{1}{M}\exp i_{X^nW^n}(X^n, Y^n)\Bigr)\Bigr], \qquad (8.3)$$
where the first inequality follows from the concavity of the logarithm and the second is a result of
$$E[\exp i_{X^nW^n}(X_j^n, y^n)] = 1,$$
for all $y^n \in B^n$ and $j = 1, \ldots, M$.

Consider first the case where $M \le \exp(I(X^n; Y^n) - n\delta)$ for some $\delta > 0$. Using (8.3) and denoting $\Lambda_n = i_{X^nW^n}(X^n, Y^n) - I(X^n; Y^n)$, we obtain
$$\frac{1}{n} I(X_1^n, \ldots, X_M^n; \tilde{Y}^n) - \frac{1}{n} I(X^n; Y^n) + \frac{1}{n}\log M \le \frac{1}{n} E[\log(M \exp(-I(X^n; Y^n)) + \exp \Lambda_n)]$$
$$\le \frac{1}{n} E[\log(\exp(-n\delta) + \exp \Lambda_n)] \le \frac{1}{n}\log 2 + E[\max\{-\delta, \Lambda_n/n\}], \qquad (8.4)$$
where the second inequality is a result of $M \le \exp(I(X^n; Y^n) - n\delta)$. The expectation on the right side of (8.4) is upper bounded by
$$E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} > -\delta\Bigr\}\Bigr] = -E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} \le -\delta\Bigr\}\Bigr] = \frac{1}{n} I(X^n; Y^n)\, P\Bigl[\frac{\Lambda_n}{n} \le -\delta\Bigr] - \frac{1}{n} E\bigl[i_{X^nW^n}(X^n, Y^n)\, 1\{i_{X^nW^n}(X^n, Y^n) \le I(X^n; Y^n) - n\delta\}\bigr], \qquad (8.5)$$
where we have used $E[\Lambda_n] = 0$. The first term in (8.5) vanishes asymptotically because (X, Y) is information stable with finite mutual information rate, whereas the expectation of the negative part of an information density cannot be less than $-e^{-1}\log e$ (e.g., [14, (2.3.2)]); thus, the second term in (8.5) also vanishes asymptotically.

In the remaining case,
$$\Bigl[\frac{1}{n} I(X^n; Y^n) - \frac{1}{n}\log M\Bigr]^{+} = 0,$$
and we can choose any arbitrarily small $\delta > 0$ while satisfying $M \ge \exp(I(X^n; Y^n) - n\delta)$. Now, normalizing by $n$, we can further upper bound (8.3) with
$$\frac{1}{n} E[\log(1 + \exp(\Lambda_n + n\delta))] \le \frac{1}{n}\log 2 + \delta + E\Bigl[\frac{\Lambda_n}{n}\, 1\Bigl\{\frac{\Lambda_n}{n} > -\delta\Bigr\}\Bigr],$$
where we used $\frac{1}{M}\exp i_{X^nW^n}(X^n, Y^n) \le \exp(\Lambda_n + n\delta)$ together with $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$, and the last expectation vanishes asymptotically, as shown after (8.5). The theorem now follows since $\delta$ can be chosen arbitrarily small. □

Theorem 12 evaluates the mutual information between the channel output and a collection of random codewords such as those used in the random coding proofs of the direct part of the channel coding theorem. However, the rationale for its inclusion here is the following corollary. The special case of this corollary for i.i.d. inputs and discrete memoryless channels is [21, Theorem 6.3], proved using a different approach.

Corollary: For every X such that (X, Y) is information stable, and for all $\gamma > 0$, $\epsilon > 0$, there exists $\tilde{X}$ whose resolution satisfies
$$\frac{1}{n} R(\tilde{X}^n) \le I(X; Y) + \gamma$$
and
$$\frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}) \le \epsilon$$
for all sufficiently large $n$.

Proof: The link between Theorem 12 and this section is the following identity:
$$D(P_{\tilde{Y}^n \mid X_1^n, \ldots, X_M^n} \,\|\, P_{Y^n} \mid X_1^n, \ldots, X_M^n) = I(X_1^n, \ldots, X_M^n; \tilde{Y}^n),$$
where $\tilde{Y}^n[X_1^n, \ldots, X_M^n]$ is defined in (4.1). As in the proof of Theorem 4, (8.1) implies that there exists $(c_1^n, \ldots, c_M^n)$ such that the output distribution due to a uniform distribution on $(c_1^n, \ldots, c_M^n)$ approximates the true output distribution, in the sense that their unconditional divergence per symbol can be made arbitrarily small, by choosing $(1/n)\log M$ to be appropriately close to $(1/n) I(X^n; Y^n)$. □

A sharper achievability result (parallel to Theorem 4), whereby the assumption of (X, Y) being information stable is dropped and $I(X; Y)$ is replaced by $\overline{I}(X; Y)$, can be shown by (1) letting $M = \exp(n(\overline{I}(X; Y) + \gamma))$, (2) using $\log(1 + \exp t) \le \log 2 + t\,1\{t > 0\}$ to bound the right side of (8.3), and (3) invoking Lemma A1 (under the assumption that the input alphabet is finite).

The extension of the general converse resolvability results (Theorems 5 and 9) to the divergence-gauged approximation criterion is an open problem. On the other hand, the analogous exercise with the converse mean-resolvability results of Section VII is comparatively easy. A much more general class of channels than the BSCs of Theorem 10 is the scope of the next result.

Theorem 13: Let a finite-input channel W with capacity C be such that, for each sufficiently large $n$, there exists $\bar{X}^n$ for which
$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n) \qquad (8.6)$$
and
$$P[\bar{X}^n = x^n] > 0, \quad \text{for all } x^n \in A^n. \qquad (8.7)$$
If $\tilde{X}$ is such that
$$\lim_{n\to\infty} \frac{1}{n} D(P_{\tilde{Y}^n} \| P_{\bar{Y}^n}) = 0, \qquad (8.8)$$
then
$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: The following result will be used.

Lemma 11 [3, p. 147]: If $I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n)$, then
$$I(\bar{X}^n; \bar{Y}^n) \ge D(W^n(\cdot \mid x^n) \| \bar{Y}^n),$$
with equality for all $x^n \in A^n$ such that $P_{\bar{X}^n}(x^n) > 0$.

According to Lemma 11 and the assumption in Theorem 13,
$$I(\bar{X}^n; \bar{Y}^n) = D(W^n(\cdot \mid x^n) \| \bar{Y}^n) \qquad (8.9)$$
for all $x^n \in A^n$. Thus, for every distribution $\tilde{X}^n$,
$$0 \le I(\tilde{X}^n; \tilde{Y}^n) = D(W^n \| \bar{Y}^n \mid \tilde{X}^n) - D(\tilde{Y}^n \| \bar{Y}^n) = I(\bar{X}^n; \bar{Y}^n) - D(\tilde{Y}^n \| \bar{Y}^n),$$
where the second equation follows from (8.9). Thus,
$$\frac{1}{n} H(\tilde{X}^n) \ge \frac{1}{n} I(\tilde{X}^n; \tilde{Y}^n) = \frac{1}{n} \max_{X^n} I(X^n; Y^n) - \frac{1}{n} D(\tilde{Y}^n \| \bar{Y}^n),$$
from where the result follows because of (8.8) and the channel capacity converse
$$C \le \liminf_{n\to\infty} \frac{1}{n} \max_{X^n} I(X^n; Y^n). \qquad \square$$

Remark 5: It is obvious that the channel in Example 1 does not satisfy the sufficient condition in Theorem 13, whereas the full-rank discrete memoryless channels (cf. Remark 4) always satisfy that condition.

A counterpart of Theorem 11 (mean-resolvability semi-converse) with divergence in lieu of variational distance is easy to prove using the same idea as in the proof of Theorem 13.

Theorem 14: For any finite-input channel with capacity C, there exists an input process X such that if $\tilde{X}$ satisfies
$$\lim_{n\to\infty} \frac{1}{n} D(P_{\tilde{Y}^n} \| P_{Y^n}) = 0$$
and $P_{\tilde{X}^n} \ll P_{X^n}$, then
$$\liminf_{n\to\infty} \frac{1}{n} H(\tilde{X}^n) \ge C.$$

Proof: Let $X^n$ maximize $I(X^n; Y^n)$. It follows from Lemma 11 and the assumed condition $P_{\tilde{X}^n} \ll P_{X^n}$ that
$$D(W^n(\cdot \mid x^n) \| Y^n) = I(X^n; Y^n)$$
for every $x^n$ with $P_{\tilde{X}^n}(x^n) > 0$. Repeating the chain in the proof of Theorem 13,
$$\frac{1}{n} H(\tilde{X}^n) \ge \frac{1}{n} I(\tilde{X}^n; \tilde{Y}^n) = \frac{1}{n} \max_{X^n} I(X^n; Y^n) - \frac{1}{n} D(\tilde{Y}^n \| Y^n),$$
from where the result follows immediately. □

Let us now state our concluding result, which falls outside the theory of resolvability but is still within the boundaries of the approximation theory of output statistics. It is a folk theorem in information theory whose proof is intimately related to the arguments used in this section.

Fix a codebook $\{c_1^n, \ldots, c_M^n\}$. If all the codewords are equally likely, the distributions of the input and output of the channel are
$$P_{X^n}(x^n) = \frac{1}{M}, \quad \text{if } x^n = c_j^n \text{ for some } j \in \{1, \ldots, M\},$$
and
$$P_{Y^n}(y^n) = \frac{1}{M}\sum_{j=1}^{M} W^n(y^n \mid c_j^n), \qquad (8.10)$$
respectively. The issue we want to address, in the spirit of this paper, is the relationship between $Y^n$ and the output distribution corresponding to the input that maximizes the mutual information. It is widely believed that if the code is good (with rate close to capacity and low error probability), then $Y^n$ must approximate the output distribution due to the input maximizing the mutual information. (See, e.g., [2, Section 8.10] for a discussion on the plausibility of this statement.)
To focus ideas, take a DMC with capacity
$$C = I(\bar{X}; \bar{Y}) = \max_X I(X; Y).$$
Is it true that the output due to a good code looks i.i.d. with distribution $P_{\bar{Y}}$? Sometimes this is erroneously taken for
granted based on some sort of “random coding” reasoning.
However, recall that the objective is to analyze the behavior
of the output due to any individual good code, rather than
to average the output statistics over a hypothetical random
choice of codebooks.
Our formalization of the folk-theorem is very general, and
only rules out channels whose capacity is not obtained through
the maximization of mutual information. Naturally, the result
ceases to be meaningful for those channels.
Theorem 15: For any channel W with finite input alphabet and capacity C that satisfies the strong converse, the following holds. Fix any $\gamma > 0$ and any sequence of $(n, M, \lambda_n)$ codes such that
$$\lim_{n\to\infty} \lambda_n = 0$$
and
$$\frac{1}{n}\log M \ge C - \frac{\gamma}{2}.$$
Then,¹⁰
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) \le \gamma \qquad (8.11)$$
for all sufficiently large $n$, where $Y^n$ is the output due to the $(n, M, \lambda_n)$ code (cf. (8.10)) and $\bar{Y}^n$ is the output due to the $\bar{X}^n$ that satisfies
$$I(\bar{X}^n; \bar{Y}^n) = \max_{X^n} I(X^n; Y^n).$$
¹⁰It can be shown that the output distribution due to a maximal mutual information input is unique.

Proof: For every $X^n$, we write
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) = \frac{1}{n} D(W^n \| \bar{Y}^n \mid X^n) - \frac{1}{n} I(X^n; Y^n) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} I(X^n; Y^n), \qquad (8.12)$$
where the inequality follows from Lemma 11. If we now particularize $X^n$ to the uniform distribution on the $(n, M, \lambda_n)$ codebook of the above statement, then $(1/n) I(X^n; Y^n)$ must approach capacity because of the Fano inequality:
$$\frac{1}{n} I(X^n; Y^n) \ge (1 - \lambda_n)\frac{1}{n}\log M - \frac{1}{n}\log 2 \ge (1 - \lambda_n)\Bigl(C - \frac{\gamma}{2}\Bigr) - \frac{1}{n}\log 2. \qquad (8.13)$$
Thus, (8.12) and (8.13) result in
$$\frac{1}{n} D(P_{Y^n} \| P_{\bar{Y}^n}) \le \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - (1 - \lambda_n)\Bigl(C - \frac{\gamma}{2}\Bigr) + \frac{1}{n}\log 2 \le \gamma,$$
for sufficiently large $n$, because $\lambda_n \to 0$ and the strong-converse assumption guarantees that the inequalities in (5.1) actually reduce to identities owing to Theorem 8. □
As a simple exercise, we may particularize Theorem 15 to a BSC, in which case the output $\bar{Y}^n$ due to the $\bar{X}^n$ achieving capacity is given by
$$P_{\bar{Y}^n}(y^n) = 2^{-n},$$
for all $y^n \in \{0, 1\}^n$. Then, (8.11) is equivalent to
$$\log 2 - \gamma \le \frac{1}{n} H(Y^n),$$
for an arbitrarily small $\gamma > 0$ and all sufficiently large $n$. This implies that the output $Y^n$ due to the input distribution $X^n$ of a good codebook must be almost uniformly distributed on $\{0, 1\}^n$ (cf. [2, Example 2, Section 8.10]).
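Since the reduction of (8.11) to an output-entropy bound holds for any code over the BSC, it can be checked numerically on a toy codebook. The snippet below is only an illustration (the codebook is an arbitrary small example, not a good code, and all names are chosen here): it computes the exact code-induced output distribution and confirms that $D(Y^n \| \bar{Y}^n) = n\log 2 - H(Y^n)$ when $\bar{Y}^n$ is uniform.

    import itertools
    import numpy as np

    # Numerical check of the identity used above for the BSC (natural logs).

    def bsc_output(codewords, p):
        n = codewords.shape[1]
        dist = np.zeros(2 ** n)
        for idx, y in enumerate(itertools.product((0, 1), repeat=n)):
            flips = np.sum(codewords != np.array(y), axis=1)
            dist[idx] = np.mean(p ** flips * (1 - p) ** (n - flips))
        return dist

    codewords = np.array([[0, 0, 0, 0, 0, 0, 0],
                          [1, 0, 1, 0, 1, 0, 1],
                          [0, 1, 1, 0, 0, 1, 1],
                          [1, 1, 0, 0, 1, 1, 0]])   # 4 equiprobable codewords, n = 7
    p, n = 0.1, 7
    P = bsc_output(codewords, p)
    H = -np.sum(P * np.log(P))                      # output entropy H(Y^n) in nats
    D = np.sum(P * np.log(P * 2.0 ** n))            # divergence from the uniform output
    print(H, D, n * np.log(2) - H)                  # the last two numbers agree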
Can a result in the spirit of Theorem 15 be proved for the input statistics rather than the output statistics? The answer is negative, despite the widespread belief that the statistics of any good code must approximate those that maximize mutual information. To see this, simply consider the normalized entropy of $X^n$ versus that of $\bar{X}^n$:
$$\frac{1}{n} H(\bar{X}^n) - \frac{1}{n} H(X^n) = \frac{1}{n} H(\bar{X}^n \mid \bar{Y}^n) + \frac{1}{n} I(\bar{X}^n; \bar{Y}^n) - \frac{1}{n} H(X^n),$$
where the last two terms in the right-hand side are each asymptotically close to capacity. However, the term $(1/n) H(\bar{X}^n \mid \bar{Y}^n)$ does not vanish in general. For example, in the case of a BSC with crossover probability p, $(1/n) H(\bar{X}^n \mid \bar{Y}^n) = h(p)$.
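For the BSC the bookkeeping in the last display can be made explicit (a routine check, not an additional claim):
$$\frac{1}{n} H(\bar{X}^n) = \log 2, \qquad \frac{1}{n} H(X^n) = \frac{1}{n}\log M \approx C = \log 2 - h(p), \qquad \frac{1}{n} H(\bar{X}^n \mid \bar{Y}^n) = h(p),$$
so the two normalized input entropies differ by approximately $h(p)$, which is exactly the conditional-entropy term that fails to vanish.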
Despite this negative result concerning the approximation
of the input statistics, it is possible in many cases to bootstrap
some conclusions on the behavior of input distributions with
fixed dimension from Theorem 15. For example, in the case
of the BSC ($p \neq 1/2$), the approximation of the first-order input statistics follows from that of the output because of the
invertibility of the transition probability matrix. Thus, in a
good code for the BSC, every input symbol must be equal
to 0 for roughly half of the codewords. As another example,
consider the Gaussian noise channel with constrained input
power. The output spectral density is the sum of the input
spectrum and the noise spectrum. Thus, a good code must
have an input spectrum that approximates asymptotically the
water-filling solution.
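The water-filling solution referred to here is, for parallel Gaussian subchannels, the standard allocation $S_k = \max(0, \theta - N_k)$, with the water level $\theta$ chosen to meet the power constraint. The following is only a minimal generic sketch of that computation (names and numbers are illustrative, not tied to any construction in the paper):

    import numpy as np

    def water_filling(noise, total_power, tol=1e-12):
        """Power allocation S_k = max(0, theta - N_k) with sum_k S_k = total_power."""
        noise = np.asarray(noise, dtype=float)
        lo, hi = noise.min(), noise.max() + total_power
        while hi - lo > tol:                    # bisect on the water level theta
            theta = 0.5 * (lo + hi)
            if np.maximum(theta - noise, 0.0).sum() > total_power:
                hi = theta
            else:
                lo = theta
        return np.maximum(0.5 * (lo + hi) - noise, 0.0)

    S = water_filling(noise=[1.0, 2.0, 4.0, 8.0], total_power=5.0)
    print(S, S.sum())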
The conventional intuition that regards the statistics of good
codes as those that maximize mutual information, constitutes
the basis for an important component of the practical value
of the Shannon theory. The foregoing discussion points out
that that intuition can often be put on a sound footing via the
approximation of output statistics, despite the danger inherent
in far-reaching statements on the statistics of good codes.
IX. CONCLUSION
Aside from the setting of system simulation alluded to
in the introduction, we have not dwelled on other plausible
applications of the approximation theory of output statistics.
Rather, our focus has been on highlighting the new information
theoretic concepts and their strong relationships with source
coding, channel coding and identification via channels. Other
applications could be found in topics such as transmission
without decoding and remote artificial synthesis of images,
speech and other signals (e.g., [12]).
A novel aspect of our development has been the unveiling of sup/inf-information rate and sup-entropy rate as the
right way to generalize the conventional average quantities
(mutual information rate and entropy rate) when dealing
with nonergodic channels and sources. We have seen that
those concepts actually open the way towards new types of
general formulas in source coding (Section III), channel coding
[17] and approximation of output statistics (Section IV). In
particular, the formula (5.5) for channel capacity [17] exhibits
a nice duality with the formula for resolvability (4.26).
In parallel with well-established results on channel capacity,
it is relatively straightforward to generalize the results in this
paper so as to incorporate input constraints, i.e., cases where
the input distributions can be chosen only within a class that
satisfies a specified constraint on the expectation of a certain
cost functional.
Presently, exact results on the resolvability of individual
input processes can be attained only within restricted contexts,
such as that of full-rank discrete memoryless channels [8].
In those cases, the resolvability of individual inputs is given
by the sup-information rate; this provides one of those rare
instances where an operational characterization of the mutual
information rate (for information stable input/output pairs) is
known. Whereas our proof of the achievability part of the
resolvability theorem holds in complete generality, the main
weakness of our present proof of the converse part is its strong
reliance on the finiteness of the input alphabet. So far, we
have not mentioned how to relax such a sufficient condition.
However, it is indeed possible to remove such a restriction for
a class of channels. In a forthcoming paper, the counterpart of
Theorem 5 will be shown for infinite-input channels under a
mild smoothness condition, which is satisfied, for example, by
additive Gaussian noise channels with power constraints.
APPENDIX
In this appendix, we address a technical issue dealing with
the information stability of input-output pairs.
Lemma A1: Let $G > \log |A|$. Then,
$$E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > G\Bigr\}\Bigr] \to 0.$$
Proof: The main idea is that an input-output pair can attain a large information density only if the input has low probability. Since for all $(x^n, y^n) \in A^n \times B^n$
$$i_{X^nW^n}(x^n, y^n) = \log\frac{P_{X^n \mid Y^n}(x^n \mid y^n)}{P_{X^n}(x^n)} \le \log\frac{1}{P_{X^n}(x^n)}, \qquad (A.1)$$
we can upper bound the expectation in the statement of the lemma by
$$E\Bigl[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)}\, 1\{X^n \in D_G\}\Bigr], \qquad (A.2)$$
where
$$D_G = \{x^n \in A^n : P_{X^n}(x^n) \le \exp(-nG)\}. \qquad (A.3)$$
The right side of (A.2) can be decomposed as
$$E\Bigl[\frac{1}{n}\, 1\{X^n \in D_G\}\, \log\frac{P_{X^n}(D_G)}{P_{X^n}(X^n)}\Bigr] + \frac{1}{n}\, P_{X^n}(D_G)\, \log\frac{1}{P_{X^n}(D_G)} \le P_{X^n}(D_G)\, \frac{1}{n}\Bigl[\log |D_G| - \log P_{X^n}(D_G)\Bigr], \qquad (A.4)$$
because entropy is maximized by the uniform distribution. Now, bounding $|D_G| \le |A|^n$ and
$$P_{X^n}(D_G) \le |D_G| \exp(-nG) \le \exp(-n\delta),$$
where $\delta = G - \log|A| > 0$, the result follows. One consequence of Lemma A1 is Lemma 1.
Proof of Lemma 1: First, we lower bound mutual information for any $\gamma > 0$ as
$$\frac{1}{n} I(X^n; Y^n) \ge E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \le \underline{I}(X; Y) - \gamma\Bigr\}\Bigr] + (\underline{I}(X; Y) - \gamma)\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > \underline{I}(X; Y) - \gamma\Bigr]. \qquad (A.5)$$
It is well known [14] that the first term in the right side of (A.5) vanishes and the probability in (A.5) goes to 1 by definition of $\underline{I}(X; Y)$. Thus, $(1/n) I(X^n; Y^n) \ge \underline{I}(X; Y) - 2\gamma$ for all sufficiently large $n$.
Conversely, we can upper bound mutual information as
$$\frac{1}{n} I(X^n; Y^n) \le E\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n)\, 1\Bigl\{\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \ge G\Bigr\}\Bigr] + G\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) > \overline{I}(X; Y) + \gamma\Bigr] + (\overline{I}(X; Y) + \gamma)\, P\Bigl[\frac{1}{n}\, i_{X^nW^n}(X^n, Y^n) \le \overline{I}(X; Y) + \gamma\Bigr]. \qquad (A.6)$$
If G is chosen to satisfy the condition in Lemma A1, then (A.6) results in
$$\frac{1}{n} I(X^n; Y^n) \le \overline{I}(X; Y) + 2\gamma,$$
for all sufficiently large $n$. □
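The two bounds just established can be recorded compactly as follows (a restatement for the reader's convenience, not an additional claim):
$$\underline{I}(X; Y) \le \liminf_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \le \limsup_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \le \overline{I}(X; Y).$$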
ACKNOWLEDGMENT
Discussions with V. Anantharam, A. Barron, M. Burnashev, A. Orlitsky, S. Shamai, A. Wyner, and J. Ziv are
acknowledged. References [9], [13], [21] were brought to the
authors' attention by I. Csiszár, D. Neuhoff, and A. Wyner,
respectively. Fig. 3 was generated by R. Cheng.
REFERENCES

[1] R. Ahlswede and G. Dueck, "Identification via channels," IEEE Trans. Inform. Theory, vol. 35, pp. 15-29, Jan. 1989.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[4] A. Feinstein, "A new basic theorem of information theory," IRE Trans. PGIT, vol. 4, pp. 2-22, 1954.
[5] —, "On the coding theorem and its converse for finite-memory channels," Inform. Contr., vol. 2, pp. 25-44, 1959.
[6] R. M. Gray and L. D. Davisson, "The ergodic decomposition of stationary discrete random processes," IEEE Trans. Inform. Theory, vol. IT-20, pp. 625-636, Sept. 1974.
[7] T. S. Han and S. Verdú, "New results in the theory and applications of identification via channels," IEEE Trans. Inform. Theory, vol. 38, pp. 14-25, Jan. 1992.
[8] —, "Spectrum invariancy under output approximation for full-rank discrete memoryless channels," Probl. Peredach. Inform. (in Russian), no. 2, 1993.
[9] G. D. Hu, "On Shannon theorem and its converse for sequences of communication schemes in the case of abstract random variables," Trans. Third Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, Czechoslovak Academy of Sciences, Prague, 1964, pp. 285-333.
[10] J. C. Kieffer, "Finite-state adaptive block-to-variable length noiseless coding of a nonstationary information source," IEEE Trans. Inform. Theory, vol. 35, pp. 1259-1263, 1989.
[11] D. E. Knuth and A. C. Yao, "The complexity of random number generation," in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976.
[12] R. W. Lucky, Silicon Dreams: Information, Man and Machine. New York: St. Martin's Press, 1989.
[13] D. L. Neuhoff and P. C. Shields, "Channel entropy and primitive approximation," Ann. Probab., vol. 10, pp. 188-198, Feb. 1982.
[14] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[15] J. M. Stoyanov, Counterexamples in Probability. New York: Wiley, 1987.
[16] S. Verdú, "Multiple-access channels with memory with and without frame-synchronism," IEEE Trans. Inform. Theory, vol. 35, pp. 605-619, May 1989.
[17] S. Verdú and T. S. Han, "A new converse leading to a general channel capacity formula," to be presented at the IEEE Inform. Theory Workshop, Susono-shi, Japan, June 1993.
[18] J. Wolfowitz, "A note on the strong converse of the coding theorem for the general discrete finite-memory channel," Inform. Contr., vol. 3, pp. 89-93, 1960.
[19] —, "On channels without capacity," Inform. Contr., vol. 6, pp. 49-54, 1963.
[20] —, "Notes on a general strong converse," Inform. Contr., vol. 12, pp. 1-4, 1968.
[21] A. D. Wyner, "The common information of two dependent random variables," IEEE Trans. Inform. Theory, vol. IT-21, pp. 163-179, Mar. 1975.