Finding the best mismatched detector for channel coding and hypothesis testing

Emmanuel Abbe, EECS and LIDS, MIT, [email protected]
Muriel Medard, EECS and LIDS, MIT, [email protected]
Sean Meyn, ECE and CSL, UIUC, [email protected]
Lizhong Zheng, EECS and LIDS, MIT, lizhong@mit.edu

Abstract— The mismatched-channel formulation is generalized to obtain simplified algorithms for computation of capacity bounds and improved signal constellation designs. The following issues are addressed: (i) For a given finite-dimensional family of linear detectors, how can we compute the best in this class to maximize the reliably received rate? That is, what is the best mismatched detector in a given class? (ii) For computation of the best detector, a new algorithm is proposed based on a stochastic approximation implementation of the Newton-Raphson method. (iii) The geometric setting provides a unified treatment of channel coding and robust/adaptive hypothesis testing.

I. INTRODUCTION

Consider a discrete memoryless channel (DMC) with input sequence $\mathbf{X}$ on $\mathsf{X}$ and output sequence $\mathbf{Y}$ on $\mathsf{Y}$, with $\mathsf{Y} = \mathbb{C}$ and $\mathsf{X}$ a subset of $\mathbb{R}$ or $\mathbb{C}$. It is assumed that a channel density exists,

$\mathsf{P}\{Y(t) \in dy \mid X(t) = x\} = p_{Y|X}(y \mid x)\,dy, \qquad x \in \mathsf{X},\ y \in \mathsf{Y}.$

We consider in this paper the reliably received transmission rate over a non-coherent DMC using a mismatched detector, following the work of Csiszar, Kaplan, Lapidoth, Merhav, Narayan and Shamai [1], [2], [3]. The basic problem statement can be described as follows: We are given a codebook consisting of $e^{nR}$ codewords of length $n$, where $R$ is the input rate in nats. The $i$th codeword is denoted $\{X_t^i : 1 \le t \le n\}$. We let $\Gamma^{i,n}$ denote the associated empirical distribution for the joint input-output process,

$\Gamma^{i,n} = n^{-1} \sum_{t=1}^{n} \delta_{(X_t^i,\, Y_t)}. \qquad (1)$

Suppose that a function of two variables, $F \colon \mathsf{X} \times \mathsf{Y} \to \mathbb{R}$, is given. A linear detector based on this function is described as follows: The codeword $i^*$ is chosen as the maximizer,

$i^* = \arg\max_i \langle \Gamma^{i,n}, F \rangle. \qquad (2)$

For a random codebook this is the Maximum Likelihood decoder when $F$ is the log likelihood ratio (LLR). In the mismatched detector the function $F$ is arbitrary. There is ample motivation for considering general $F$: First, this can be interpreted as the typical situation in which side information at the receiver is imperfect. Also, in some cases the LLR is difficult to evaluate, so simpler functions can be used to balance the tradeoff between complexity and performance.

In Merhav et al. [1] and Csiszar [2] the following bound on the reliably received rate is obtained, known as the generalized mutual information:

$I_{\mathrm{GMI}}(P_X; F) := \min\{\, D(\Gamma \,\|\, P_X \otimes P_Y) : \langle \Gamma, F \rangle \ge \beta,\ \Gamma_2 = P_Y \,\}, \qquad (3)$

where $\Gamma_i$ denotes the $i$th marginal of the distribution $\Gamma$ on $\mathsf{X} \times \mathsf{Y}$, and $\beta = \langle P_{XY}, F \rangle$. Under certain conditions the generalized mutual information coincides with the reliably received rate using (2).

In this paper we provide a geometric derivation of this result based on Sanov's theorem. This approach is related to Csiszar's Information Geometry [3], [4], [5], which is itself a close cousin of the geometry associated with Large Deviations Theory [6], [7], [8], [9], [10], [11]. Based on this geometry we answer the following question: Given a basis $\{\phi_j : 1 \le j \le d\}$ of functions on the product space $\mathsf{X} \times \mathsf{Y}$, and the resulting family of detectors based on the collection of functions $\{F_a = \sum_j a_j \phi_j\}$, how can we obtain the best reliably received rate over this class? To compute $a^*$ we construct a new recursive algorithm that is a stochastic approximation implementation of the Newton-Raphson method. The algorithm can be run at the receiver blindly, without knowledge of $\mathbf{X}$ or the channel statistics. The geometry surveyed here is specialized to the low-SNR setting in the companion paper [12].
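To make the decoding rule (2) concrete, the sketch below implements the linear detector for a toy random codebook. The channel model, the two metric choices, and all parameter values are our own illustrative assumptions, not taken from the paper: with the matched metric the rule reduces to ML decoding for this channel, while the hard-limited metric illustrates a mismatched $F$.

```python
# A minimal sketch of the linear (mismatched) decoder in (2), using NumPy.
# All names and parameters here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-input channel with additive Gaussian noise; SNR chosen arbitrarily.
n, num_codewords, sigma = 200, 8, 1.0
codebook = rng.choice([-1.0, 1.0], size=(num_codewords, n))   # random codebook

# Transmit codeword 0 through the channel.
y = codebook[0] + sigma * rng.normal(size=n)

def decode(codebook, y, F):
    """Linear detector: pick i maximizing <Gamma^{i,n}, F> = (1/n) sum_t F(x_t^i, y_t)."""
    scores = np.array([F(xi, y).mean() for xi in codebook])
    return int(np.argmax(scores))

# Matched metric: F equal to the LLR, here proportional to x*y for this channel.
F_matched = lambda x, y: x * y / sigma**2
# A mismatched metric: a hard-limited version of the same statistic.
F_mismatched = lambda x, y: x * np.sign(y)

print("matched decision:   ", decode(codebook, y, F_matched))
print("mismatched decision:", decode(codebook, y, F_mismatched))
```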
II. GEOMETRY IN INFORMATION THEORY

We begin with the following general setting: $\mathbf{Z}$ denotes an i.i.d. sequence taking values in a compact set $\mathsf{Z}$, and the empirical distributions $\{\Gamma^n : n \ge 1\}$ are defined as the sequence of discrete probability measures on $\mathcal{B}(\mathsf{Z})$,

$\Gamma^n(A) = n^{-1} \sum_{k=1}^{n} \mathbb{1}\{Z_k \in A\}, \qquad A \in \mathcal{B}(\mathsf{Z}). \qquad (4)$

For a given probability measure $\pi$ and constant $\beta > 0$ we denote the divergence sets

$Q_\beta(\pi) = \{\Gamma : D(\Gamma \,\|\, \pi) \le \beta\} \quad \text{and} \quad Q_\beta^+(\pi) = \{\Gamma : D(\Gamma \,\|\, \pi) < \beta\}.$

The logarithmic moment generating function (log-MGF) for a given probability measure $\pi$ is defined for any bounded measurable function $G$ via

$\Lambda(G) = \log\Bigl( \int e^{G(z)}\, \pi(dz) \Bigr). \qquad (5)$

We write $\Lambda_\pi$ when we wish to emphasize the particular probability measure used in this definition.

The geometry emphasized in this paper is based on Proposition 2.1, which expresses divergence as the convex dual of the log-MGF. For a proof see [6].

Proposition 2.1: The relative entropy $D(\pi^1 \,\|\, \pi^0)$ is the solution to the convex program (6), where the supremum is over all bounded continuous functions on $\mathsf{Z}$:

$\sup\ -\Lambda_{\pi^0}(F) \quad \text{subject to} \quad \pi^1(F) \ge 0. \qquad (6)$

It is also the solution to the following unconstrained convex program,

$D(\pi^1 \,\|\, \pi^0) = \sup\{\, \pi^1(F) - \Lambda_{\pi^0}(F) \,\}. \qquad (7)$

That is, the relative entropy is the convex dual of $\Lambda_{\pi^0}$.

We conclude this section with a brief sketch of the proof to set the stage for the characterizations that follow. For a given function $F$ we denote

$\mathcal{H} = \{\Gamma : \langle \Gamma, F \rangle = 0\}. \qquad (8)$

We interpret $\mathcal{H}$ as a hyperplane, and consider the two 'half spaces', $\mathcal{H}_- := \{\Gamma \in \mathcal{M} : \langle \Gamma, F \rangle \le 0\}$ and $\mathcal{H}_+ := \{\Gamma \in \mathcal{M} : \langle \Gamma, F \rangle \ge 0\}$. Suppose that $\mathcal{H}$ separates $\pi^0$ and $\pi^1$ as shown in Figure 1.

Fig. 1. The entropy neighborhoods $Q_{\beta_1}(\pi^1)$ and $Q_{\beta_0}(\pi^0)$ lie on opposite sides of the hyperplane $\mathcal{H}$.

Assuming also that $F$ is feasible for (6), we must have $\pi^1(F) \ge 0$ and $\pi^0(F) \le 0$. That is, $\pi^1 \in \mathcal{H}_+$. Evidently $D(\pi^1 \,\|\, \pi^0) \ge \beta_0$, where $Q_{\beta_0}(\pi^0)$ is the largest divergence set contained in $\mathcal{H}_-$. If $\Gamma^{0*}$ lies on the intersection of the hyperplane and the divergence set shown, $\Gamma^{0*} \in \mathcal{H} \cap Q_{\beta_0}(\pi^0)$, then by definition $D(\Gamma^{0*} \,\|\, \pi^0) = \beta_0$. Moreover, Sanov's Theorem (in the form of Chernoff's bound) gives $D(\Gamma^{0*} \,\|\, \pi^0) \ge -\Lambda(\theta_0 F)$ for any $\theta_0 > 0$. Taking $\theta_0 = 1$ we obtain $D(\Gamma^{0*} \,\|\, \pi^0) \ge -\Lambda(F)$. This shows that the value of (6) is a lower bound on the divergence. To see that this lower bound can be approximated arbitrarily closely, consider a continuous approximation to the log likelihood ratio $L = \log(d\pi^1 / d\pi^0)$, for which the value of (7) is $D(\pi^1 \,\|\, \pi^0)$.
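Proposition 2.1 is easy to check numerically on a finite alphabet. The following sketch (our own toy example, not from the paper) evaluates the dual objective in (7) at the log likelihood ratio, where it attains $D(\pi^1 \,\|\, \pi^0)$, and at an arbitrary $F$, where it falls short.

```python
# Numerical check of Proposition 2.1 on a small finite alphabet: the dual
# objective pi1(F) - Lambda_{pi0}(F) is maximized by the log-likelihood
# ratio, where it equals D(pi1 || pi0).
import numpy as np

pi0 = np.array([0.5, 0.3, 0.2])
pi1 = np.array([0.2, 0.3, 0.5])

def dual_objective(F):
    # pi1(F) - log E_{pi0}[exp(F)], the objective of (7)
    return pi1 @ F - np.log(pi0 @ np.exp(F))

D = np.sum(pi1 * np.log(pi1 / pi0))           # relative entropy D(pi1 || pi0)
llr = np.log(pi1 / pi0)                       # the optimizing F

print("D(pi1||pi0)        =", D)
print("dual at F = LLR    =", dual_objective(llr))                       # equals D
print("dual at arbitrary F=", dual_objective(np.array([1.0, -0.5, 0.2])))  # <= D
```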
A. Hypothesis testing

In a similar fashion we can obtain a solution to the Neyman-Pearson hypothesis testing problem in terms of a supremum over linear tests. Consider the binary hypothesis testing problem based on a finite number of observations from a sequence $\mathbf{Z} = \{Z_t : t = 1, 2, \ldots\}$, taking values in the set $\mathsf{X}$. It is assumed that, conditioned on either of the hypotheses $H_0$ or $H_1$, these observations are independent and identically distributed (i.i.d.). The marginal probability distribution on $\mathsf{X}$ is denoted $\pi^j$ under hypothesis $H_j$ for $j = 0, 1$. The goal is to classify a given set of observations into one of the two hypotheses.

For a given $n \ge 1$, suppose that a decision test $\phi_n$ is constructed based on the finite set of measurements $\{Z_1, \ldots, Z_n\}$. This may be expressed as the characteristic function of a subset $A_n \subset \mathsf{X}^n$: the test declares that hypothesis $H_1$ is true if $\phi_n = 1$, or equivalently, $(Z_1, Z_2, \ldots, Z_n) \in A_n$. The performance of a sequence of tests $\phi := \{\phi_n : n \ge 1\}$ is reflected in the error exponents for the type-I and type-II error probabilities, defined respectively by

$J_\phi^0 := \liminf_{n \to \infty} -\tfrac{1}{n} \log\bigl( \mathsf{P}_{\pi^0}\{\phi_n(Z_1, \ldots, Z_n) = 1\} \bigr),$
$J_\phi^1 := \liminf_{n \to \infty} -\tfrac{1}{n} \log\bigl( \mathsf{P}_{\pi^1}\{\phi_n(Z_1, \ldots, Z_n) = 0\} \bigr).$

The asymptotic Neyman-Pearson criterion of Hoeffding [13] is described as follows: For a given constant bound $R > 0$ on the false-alarm exponent, an optimal test is the solution to

$\beta^* = \sup\{\, J_\phi^1 : \text{subject to } J_\phi^0 \ge R \,\}, \qquad (9)$

where the supremum is over all test sequences $\phi$ (see also Csiszar et al. [14], [15]). It is known that the optimal value of the exponent $\beta^*$ is described as the solution to a convex program involving relative entropy [7], and that the solution leads to optimal tests defined in terms of the empirical distributions. The form of an optimal test is as follows:

$\phi_n = \mathbb{1}\{\Gamma^n \in A\}. \qquad (10)$

Under general conditions on the set $A$, we can conclude from Sanov's Theorem that the error exponents are given by

$J_\phi^1 = \inf\{\, D(\Gamma \,\|\, \pi^1) : \Gamma \in A^c \,\} \quad \text{and} \quad J_\phi^0 = \inf\{\, D(\Gamma \,\|\, \pi^0) : \Gamma \in A \,\}.$

One choice of $A$ is the half-space $A = \{\Gamma : \langle \Gamma, F \rangle \ge 0\}$ for a given function $F$. On optimizing over all $F$ that give a feasible test we obtain a new characterization of the solution to the Neyman-Pearson problem. The proof of Proposition 2.2 is similar to the proof of Proposition 2.1, based on Figure 1.

Proposition 2.2: For a given $R > 0$ there exists $\varrho > 0$ such that the solution $\beta^*$ to the Neyman-Pearson hypothesis testing problem is the solution to the convex program

$\sup\ -\Lambda_{\pi^1}(-\varrho F) - \varrho(\Lambda_{\pi^0}(F) + R), \qquad (11)$

where the supremum is over continuous and bounded functions. The value of (11) is achieved with the function $F^* = -\eta + \iota \log(d\pi^1 / d\pi^0)$, with $\iota$ and $\eta$ constant, $\iota = (1 + \varrho)^{-1}$.

B. Capacity

Using a random codebook with distribution $P_X$ combined with maximum likelihood decoding, the reliably received rate is the mutual information $I(X; Y) = D(P_{XY} \,\|\, P_X \otimes P_Y)$. Hence, the mutual information is the solution to the convex program (6) with $\pi^1 = P_{XY}$ and $\pi^0 = P_X \otimes P_Y$. The function $F$ appearing in (6) is a function of two variables since $\pi^0$ and $\pi^1$ are joint distributions on $\mathsf{X} \times \mathsf{Y}$.

III. A GEOMETRIC VIEW OF THE MISMATCHED CHANNEL

In this section we consider the reliably received rate based on a given linear test. Suppose that $F$ is given with $\beta := \langle P_{XY}, F \rangle > 0$ and $\langle P_X \otimes P_Y, F \rangle < \beta$. Sanov's Theorem implies a lower bound on the pair-wise probability of error using the detector (2), and on applying a union bound we obtain the following bound on the reliably received rate:

$I_{\mathrm{LDP}}(P_X; F) = \min\{\, D(\Gamma \,\|\, P_X \otimes P_Y) : \langle \Gamma, F \rangle \ge \beta \,\}. \qquad (12)$

We have $I_{\mathrm{LDP}}(P_X; F) \le I_{\mathrm{GMI}}(P_X; F)$ since the typicality constraint $\Gamma_2 = P_Y$ is absent in (12).
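On finite alphabets the rate bound (12) reduces to a one-dimensional concave maximization over the tilting parameter (the Chernoff dual of the constrained I-projection). The sketch below computes $I_{\mathrm{LDP}}$ for a made-up three-output DMC; with the matched metric $F$ equal to the LLR it recovers the mutual information, consistent with Section II-B. The channel and all names here are our own illustrative assumptions.

```python
# A sketch computing the rate bound I_LDP in (12) on finite alphabets via the
# one-dimensional Chernoff dual  sup_{theta>=0} { theta*beta - Lambda(theta F) },
# with Lambda taken under the product measure P_X (x) P_Y.
import numpy as np
from scipy.optimize import minimize_scalar

# Toy DMC: rows of P_YgX are P(y|x); input distribution P_X.
P_X = np.array([0.5, 0.5])
P_YgX = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
P_XY = P_X[:, None] * P_YgX
P_Y = P_XY.sum(axis=0)
prod = P_X[:, None] * P_Y[None, :]            # P_X (x) P_Y

F = np.log(P_YgX / P_Y[None, :])              # matched metric (the LLR)
beta = np.sum(P_XY * F)                       # beta = <P_XY, F>

def neg_dual(theta):
    # -(theta*beta - log <P_X (x) P_Y, exp(theta F)>)
    return -(theta * beta - np.log(np.sum(prod * np.exp(theta * F))))

res = minimize_scalar(neg_dual, bounds=(0.0, 50.0), method="bounded")
I_LDP = -res.fun
I_XY = np.sum(P_XY * np.log(P_XY / prod))     # mutual information, for reference

print("I_LDP with matched F:", I_LDP)         # coincides with I(X;Y) here
print("I(X;Y)              :", I_XY)
```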
For the given function $F$ and $y \in \mathsf{Y}$ we define $F_y \colon \mathsf{X} \to \mathbb{R}$ by $F_y(x) = F(x, y)$, and for any $\theta \in \mathbb{R}$ the log-MGF is denoted

$\Lambda_{P_X}(\theta F_y) := \log \int e^{\theta F(x,y)}\, P_X(dx). \qquad (13)$

Proposition 3.1: For a given function $F \colon \mathsf{X} \times \mathsf{Y} \to \mathbb{R}$, define the new function

$F^\circ(x, y) = \theta^* F(x, y) - \Lambda_{P_X}(\theta^* F_y), \qquad (14)$

where $\theta^*$ is chosen so that the following identity holds,

$\iint e^{\theta^* F(x,y) - \Lambda_{P_X}(\theta^* F_y)}\, F(x, y)\, P_X(dx)\, P_Y(dy) = \beta.$

Then, the generalized mutual information $I_{\mathrm{GMI}}(P_X; F)$ can be expressed in the following three equivalent forms:
(a) $\sup_{G(y)} I_{\mathrm{LDP}}(P_X; F + G)$,
(b) $I_{\mathrm{LDP}}(P_X; F^\circ)$,
(c) $\theta^* \beta - \int \Lambda_{P_X}(\theta^* F_y)\, P_Y(dy)$.

Proof: The proof of (a) will be contained in the full paper. To prove (b) and (c) we construct a Lagrangian relaxation of the convex program (3) that defines $I_{\mathrm{GMI}}(P_X; F)$. The Lagrangian is denoted

$\mathcal{L}(\Gamma) = D(\Gamma \,\|\, P_X \otimes P_Y) - \theta(\langle \Gamma, F \rangle - \beta) - \theta_0(\langle \Gamma, 1 \rangle - 1) - \langle \Gamma_2 - P_Y, G \rangle,$

where $\theta$ is the Lagrange multiplier corresponding to the mean constraint, $\theta_0$ corresponds to the constraint that $\Gamma$ has mass one, and the function $G$ corresponds to the typicality constraint, $\Gamma_2 = P_Y$. The dual functional is defined as the infimum, $\Psi(\theta_0, \theta, G) = \inf_{\Gamma \ge 0} \mathcal{L}(\Gamma)$. A bit of calculus gives the optimizer

$\Gamma^*(dx, dy) = e^{\theta F(x,y) - \Pi + G(y)}\, P_X(dx)\, P_Y(dy),$

where $\Pi \in \mathbb{R}$ is a constant. The next task is to understand the Lagrange multipliers. To find $G$ we simply integrate over $x$ to obtain

$P_Y(dy) = \int_{\mathsf{X}} \Gamma^*(dx, dy) = \Bigl( \int e^{\theta F(x,y)}\, P_X(dx) \Bigr) e^{-\Pi + G(y)}\, P_Y(dy).$

Hence we have, for a.e. $y$ [$P_Y$],

$e^{-\Pi + G(y)} = \Bigl( \int_{\mathsf{X}} e^{\theta F(x,y)}\, P_X(dx) \Bigr)^{-1} = e^{-\Lambda_{P_X}(\theta F_y)}.$

We thereby obtain a formula for the optimizer,

$\Gamma^*(dx, dy) = e^{\theta^* F(x,y) - \Lambda_{P_X}(\theta^* F_y)}\, P_X(dx)\, P_Y(dy),$

as well as the value, $D(\Gamma^* \,\|\, P_X \otimes P_Y) = \theta^* \beta - \int \Lambda_{P_X}(\theta^* F_y)\, P_Y(dy)$. If $\Gamma^*$ is feasible, so that the mean of $F$ under $\Gamma^*$ is equal to $\beta$, then it optimizes the convex program, and we obtain the desired expression (c) for the generalized mutual information $I_{\mathrm{GMI}}(P_X; F)$. The identity (b) then follows from the definition of the relaxation: writing $\beta^\circ$ for the mean of $F^\circ$ under $P_{XY}$, where $F^\circ$ is given in (14), and restricting to only those $\Gamma$ that are probability measures with the common mean $\langle \Gamma, F^\circ \rangle = \beta^\circ$, we obtain the convex program

$\min\ D(\Gamma \,\|\, P_X \otimes P_Y) \quad \text{subject to} \quad \langle \Gamma, F^\circ \rangle = \beta^\circ.$

This is the rate function $I_{\mathrm{LDP}}(P_X; F^\circ)$ defined in (12).

IV. OPTIMIZATION OF A LINEAR FAMILY OF DETECTORS

We now consider methods for computing an effective test. We restrict to functions in a linear class of the form $\{F_a = \sum_j a_j \phi_j : a \in \mathbb{R}^d\}$, where $\{\phi_i : 1 \le i \le d\}$ are given. We obtain recursive algorithms based on the Newton-Raphson method. Our main goal is to set the stage for the construction of sample-path algorithms (popularly called 'machine learning'). The resulting algorithms are similar to a technique introduced in [8] to solve a robust Neyman-Pearson hypothesis testing problem, and to recent independent work of Basu et al. [17] concerning risk-sensitive optimal control.

A. Hypothesis testing

We first consider approximation techniques for the Neyman-Pearson hypothesis testing problem. For a given false-alarm exponent $R > 0$ we seek the best test in this class: It must be feasible, so that $Q_R^+(\pi^0) \subset \mathcal{H}_-$, and subject to this constraint we seek the largest exponent $\beta(F_a) = \min\{\, D(\Gamma \,\|\, \pi^1) : \Gamma \in \mathcal{H}_- \,\}$. Define $\mathcal{H}_-(a) := \{\Gamma \in \mathcal{M} : \langle \Gamma, F_a \rangle \le 0\}$ for $a \in \mathbb{R}^d$, and let $Q^- \subset \mathcal{M}$ denote the intersection of all possible acceptance regions for $H_0$,

$Q^- = \bigcap \{\, \mathcal{H}_-(a) : a \in \mathbb{R}^d,\ Q_R^+(\pi^0) \subset \mathcal{H}_-(a) \,\}.$

Then, $\beta^* = \min\{\, D(\Gamma \,\|\, \pi^1) : \Gamma \in Q^- \,\}$. Alternatively, as in Proposition 2.2 we can show that $\beta^*$ is the solution to the $d$-dimensional convex program,

$\min\ \Lambda_{\pi^1}(-\varrho F_a) + \varrho\, \Lambda_{\pi^0}(F_a) \quad \text{subject to} \quad \pi^0(F_a) \le \pi^1(F_a). \qquad (15)$

The first-order condition for optimality of $a$, ignoring the inequality constraint $\pi^0(F_a) \le \pi^1(F_a)$, is $\nabla_a\{\Lambda_{\pi^1}(-\varrho\, a^T\phi) + \varrho\, \Lambda_{\pi^0}(a^T\phi)\} = 0$. This can be expressed

$-\varrho\, \check\pi^1(\phi) + \varrho\, \check\pi^0(\phi) = 0 \in \mathbb{R}^d, \qquad (16)$

where for any function $G$ the twisted means are defined by

$\check\pi^0(G) := \frac{\pi^0(e^{F_a} G)}{\pi^0(e^{F_a})}, \qquad \check\pi^1(G) := \frac{\pi^1(e^{-\varrho F_a} G)}{\pi^1(e^{-\varrho F_a})}. \qquad (17)$

On multiplying through by $a^{*T}$ and dividing by $\varrho$, with $a^*$ a solution to (16), we then have $-\check\pi^1(a^{*T}\phi) + \check\pi^0(a^{*T}\phi) = 0$. Recall that if $F^\circ$ optimizes (11) then $F^\circ - \eta$ has the same value for any $\eta \in \mathbb{R}$. We define $F^*$ as the normalized function,

$F^* = a^{*T}\phi - \check\pi^0(a^{*T}\phi),$

so that $\check\pi^1(F^*) = \check\pi^0(F^*) = 0$. We then necessarily have, by appealing to convexity of the log moment generating functions,

$\pi^1(F^*) = \frac{d}{d\theta} \Lambda_{\pi^1}(\theta F^*)\Big|_{\theta=0} \ \ge\ \frac{d}{d\theta} \Lambda_{\pi^1}(\theta F^*)\Big|_{\theta=-\varrho} = \check\pi^1(F^*) = 0,$

and similarly $\pi^0(F^*) \le \check\pi^0(F^*) = 0$. Consequently we have feasibility, $\pi^0(F^*) \le 0 \le \pi^1(F^*)$.
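The convex program (15) can be solved directly on a finite alphabet with an off-the-shelf optimizer, after which the first-order condition (16) can be verified: at the optimum the twisted means of $\phi$ under the two tilted distributions in (17) agree. The basis, distributions, and $\varrho$ below are hypothetical choices for illustration, and the inequality constraint in (15) is ignored, as in the text.

```python
# A sketch of the d-dimensional convex program (15) on a finite alphabet,
# followed by a check of the first-order condition (16).
import numpy as np
from scipy.optimize import minimize

rho = 0.5
z = np.linspace(0.1, 1.0, 10)                 # finite observation alphabet
pi0 = np.ones(10) / 10.0
pi1_un = z**2
pi1 = pi1_un / pi1_un.sum()

Phi = np.stack([z, np.log(z)], axis=1)        # basis phi(z) = [z, log z], d = 2

def Lambda(pi, G):
    return np.log(pi @ np.exp(G))             # log-MGF (5) on a finite alphabet

def objective(a):
    Fa = Phi @ a
    return Lambda(pi1, -rho * Fa) + rho * Lambda(pi0, Fa)   # objective of (15)

a_star = minimize(objective, x0=np.zeros(2)).x
Fa = Phi @ a_star

# Twisted distributions, as in (17).
check0 = pi0 * np.exp(Fa);        check0 /= check0.sum()
check1 = pi1 * np.exp(-rho * Fa); check1 /= check1.sum()

print("a* =", a_star)
print("twisted mean of phi under pi0-tilt:", check0 @ Phi)
print("twisted mean of phi under pi1-tilt:", check1 @ Phi)   # matches at optimum
```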
For fixed $\varrho$ we obtain the corresponding exponents,

$R_\varrho = -\Lambda_{\pi^0}(F^*) = -\Lambda_{\pi^0}(a^{*T}\phi) + \check\pi^0(a^{*T}\phi),$
$E_\varrho = -\Lambda_{\pi^1}(-\varrho F^*) = -\Lambda_{\pi^1}(-\varrho\, a^{*T}\phi) - \varrho\, \check\pi^0(a^{*T}\phi).$

The Hessian of the objective function in (15) can be expressed

$\nabla_a^2\{\Lambda_{\pi^1}(-\varrho\, a^T\phi) + \varrho\, \Lambda_{\pi^0}(a^T\phi)\} = \varrho^2 \check\Sigma^1 + \varrho\, \check\Sigma^0, \qquad (18)$

where the matrices $\{\check\Sigma^i\}$ are the covariances of $\phi$ under the respective twisted distributions $\{\check\pi^i\}$:

$\check\Sigma^i = \check\pi^i(\phi \phi^T) - \check\pi^i(\phi)\, \check\pi^i(\phi)^T, \qquad i = 0, 1.$

Hence, the Newton-Raphson method to estimate $a^*$ is given by the recursion,

$a_{t+1} = a_t - [\varrho^2 \check\Sigma_t^1 + \varrho\, \check\Sigma_t^0]^{-1} \bigl[ -\varrho\, \check\pi_t^1(\phi) + \varrho\, \check\pi_t^0(\phi) \bigr], \qquad t \ge 0, \qquad (19)$

where the twisted means and covariances are evaluated at $a_t$.

B. Capacity

Since $\theta^* F_a$ again lies in the linear class, the rate function (12) can be expressed in terms of the log-MGF via

$I_{\mathrm{LDP}}(P_X; F_a) = \theta^* \beta_a - \int \Lambda_{P_X}(\theta^* F_{a,y})\, P_Y(dy), \qquad (20)$

and on optimizing over the class we arrive at the $d$-dimensional convex program,

$I_{\mathrm{LDP}}(P_X; \phi) = \max_a \Bigl\{\, a^T \bar\phi - \int \Lambda_{P_X}(a^T \phi_y)\, P_Y(dy) \,\Bigr\}, \qquad (21)$

where $\bar\phi = \langle P_{XY}, \phi \rangle$. Although it is in principle possible to find a test maximizing the generalized mutual information $I_{\mathrm{GMI}}$, the resulting optimization problem is complex and requires significant channel information. A simpler algorithm is obtained if we maximize the bound $I_{\mathrm{LDP}}$; there is no loss of optimality if the function class $\{F_a = \sum_i a_i \phi_i : a \in \mathbb{R}^d\}$ is normalized so that the correction term in (14) vanishes, i.e. $F_a^\circ = F_a$, for each $a$.

V. SIMULATION-BASED COMPUTATION

Optimization via simulation is compelling: (i) It is convenient in complex models since we avoid numerical integration. (ii) It offers the opportunity to optimize based on real data. (iii) One obstacle in solving (15) is that this nonlinear program must be solved for several values of $\varrho > 0$ until an appropriate value of $R$ is discovered. This is largely resolved using simulation techniques since parallelization is straightforward.

A. Hypothesis testing

We introduce here a stochastic approximation algorithm intended to approximate the Newton-Raphson recursion (19). Let $\{\gamma_t, \varepsilon_t\}$ denote a pair of positive scalar sequences. In the following algorithm $\nabla_t^2 \ge 0$ is an estimate of the Hessian (18), and $\varepsilon_t$ is chosen so that for some constant $c > 0$,

$\varepsilon_t \le c \quad \text{and} \quad [\varepsilon_t I + \nabla_t^2] \ge \tfrac{1}{c} I, \qquad t \ge 0. \qquad (22)$

The sequence $\{\gamma_t\}$ is the gain for the stochastic approximation, such as $\gamma_t = \gamma_0 / (t + 1)$, $t \ge 0$, with $\gamma_0 > 0$ given. With $a_0 \in \mathbb{R}^d$ given as an initial condition, the algorithm proceeds as follows for $t = 1$ to the runlength:

1) Draw independent samples from $\{\pi^i\}$ and evaluate $\phi$: $X_t^0 \sim \pi^0$, $X_t^1 \sim \pi^1$, $\phi_t^i = \phi(X_t^i)$.
2) Update running estimates of the normalizers $\pi^0(e^{F_{a_t}})$ and $\pi^1(e^{-\varrho F_{a_t}})$, of the un-normalized twisted means of $\phi$ under $\pi^0$ and $\pi^1$, and of the corresponding un-normalized twisted covariances, each by a stochastic approximation step with gain $\gamma_t$, of the form $\hat q_{t+1} = \hat q_t + \gamma_t\bigl( e^{a_t^T \phi_t^0}\, \phi_t^0 - \hat q_t \bigr)$ and its analogues.
3) Estimate the Hessian $\nabla_t^2$ from (18) using the normalized estimates, and apply the regularized Newton-Raphson recursion,

$a_{t+1} = a_t - \gamma_t\, [\varepsilon_t I + \nabla_t^2]^{-1} \bigl[ -\varrho\, \check\pi_t^1(\phi) + \varrho\, \check\pi_t^0(\phi) \bigr]. \qquad (23)$

We illustrate this technique with an example: $\pi^1$ is uniform on $[0, 1]$, and $\pi^0$ is the distribution of the $m$th root of a realization of $\pi^1$ for a fixed constant $m > 0$. Shown in Figure 2 are plots of $E_\varrho$ vs. $R_\varrho$ for $m = 5$, with various choices of $\phi \colon \mathbb{R} \to \mathbb{R}^d$. These results were obtained using the stochastic Newton-Raphson method.

Fig. 2. Solutions to the Neyman-Pearson hypothesis testing problem for various linear tests, with the curves labeled by the choice of basis $\phi$. The marginal distribution of $X^1$ was taken uniform on $[0, 1]$, and $X^0 = (X^1)^{1/m}$ with $m = 5$; here $D(\pi^1 \,\|\, \pi^0) = 2.39$.

Figure 3 is intended to illustrate the transient behavior of this stochastic algorithm. The trajectories of the estimates of $E_\varrho$ and $R_\varrho$ appear highly deterministic. This is the typical behavior seen in all of the experiments we have run for this model.

Fig. 3. Evolution of the transient phase of the stochastic Newton-Raphson method. These plots show estimates of $R_\varrho$ and $E(R_\varrho)$ as a function of time using $\phi = [(1 + 10x)^{-1}, (1 + 10x)^{-2}, \ldots, (1 + 10x)^{-6}]$, with $\varrho$ equally spaced among ten points from 0.1 to 1.
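The following is a minimal sketch of how the simulation-based recursion (23) can be organized, under our own choices of gain sequence, regularization, basis, and sampling model (the uniform/$m$th-root pair of the example above). The paper's exact bookkeeping of the running averages differs in detail; here each un-normalized twisted statistic is tracked by a single stochastic-approximation update.

```python
# A stochastic-approximation sketch of the regularized Newton-Raphson
# recursion (23): running averages estimate the un-normalized twisted means
# and covariances; step sizes and regularization are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
rho, d, T = 0.5, 2, 20000
phi = lambda z: np.array([z, np.log(z)])       # hypothetical basis, d = 2

# pi^1 = Uniform(0,1]; pi^0 = law of U^(1/m), the m-th root of a uniform draw.
m = 5.0
draw0 = lambda: rng.uniform(1e-6, 1.0) ** (1.0 / m)
draw1 = lambda: rng.uniform(1e-6, 1.0)

a = np.zeros(d)
n0 = n1 = 1.0                                   # normalizer estimates
m0, m1 = np.zeros(d), np.zeros(d)               # un-normalized twisted means
S0, S1 = np.eye(d), np.eye(d)                   # un-normalized second moments

for t in range(T):
    g = 1.0 / (t + 1.0)                         # SA gain gamma_t
    f0, f1 = phi(draw0()), phi(draw1())
    w0 = np.exp(a @ f0)                         # weight e^{F_a} under pi^0
    w1 = np.exp(-rho * (a @ f1))                # weight e^{-rho F_a} under pi^1
    n0 += g * (w0 - n0); n1 += g * (w1 - n1)
    m0 += g * (w0 * f0 - m0); m1 += g * (w1 * f1 - m1)
    S0 += g * (w0 * np.outer(f0, f0) - S0)
    S1 += g * (w1 * np.outer(f1, f1) - S1)

    mu0, mu1 = m0 / n0, m1 / n1                 # twisted means, as in (17)
    Sig0 = S0 / n0 - np.outer(mu0, mu0)         # twisted covariances
    Sig1 = S1 / n1 - np.outer(mu1, mu1)
    grad = rho * (mu0 - mu1)                    # gradient of the objective (15)
    H = rho**2 * Sig1 + rho * Sig0              # Hessian estimate, as in (18)
    eps = 0.1                                   # constant regularization eps_t
    a = a - g * np.linalg.solve(eps * np.eye(d) + H, grad)   # recursion (23)

print("estimated a*:", a)
```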
B. Capacity

We close with some recent results on computation of the best mismatched detector. Our goal is to maximize the objective function $J(a) := I_{\mathrm{LDP}}(P_X; F_a)$ defined in (20), with first and second derivatives

$\nabla J(a) = \langle P_{XY}, \phi \rangle - \check\Gamma_a(\phi), \qquad \nabla^2 J(a) = -\bigl( \check\Gamma_a(\phi \phi^T) - \check\Gamma_a(\phi)\, \check\Gamma_a(\phi)^T \bigr),$

where, with the normalization of Section IV-B and for any set $A \in \mathcal{B}(\mathsf{X} \times \mathsf{Y})$, the twisted distribution is defined by

$\check\Gamma_a(A) := \frac{\langle P_X \otimes P_Y,\ e^{F_a} \mathbb{1}_A \rangle}{\langle P_X \otimes P_Y,\ e^{F_a} \rangle}.$

Hence a stochastic Newton-Raphson algorithm can be constructed in analogy with the Neyman-Pearson problem. Results for the AWGN channel are shown in Figure 4 for a special case in which the LLR lies in the quadratic function class obtained with $\phi = (x^2, y^2, xy)$.

Fig. 4. Capacity estimates for the AWGN channel with SNR = 2.
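As a sanity check of the capacity computation, the sketch below maximizes the finite-alphabet form of (21) by plain gradient ascent, used here in place of the paper's stochastic Newton-Raphson step; the discretized channel, input constellation, and step size are our own assumptions. The quadratic basis $\phi = (x^2, y^2, xy)$ mirrors the class used for Figure 4, and the resulting value is a lower bound on $I(X; Y)$.

```python
# A sketch maximizing J(a) = I_LDP(P_X; F_a) over the linear class via the
# finite-alphabet form of (21); gradient ascent replaces Newton-Raphson.
import numpy as np

xs = np.array([-1.5, -0.5, 0.5, 1.5])          # 4-PAM input constellation
P_X = np.full(4, 0.25)
ys = np.linspace(-4.0, 4.0, 81)
sigma = 1.0
W = np.exp(-(ys[None, :] - xs[:, None])**2 / (2 * sigma**2))
W /= W.sum(axis=1, keepdims=True)              # discretized Gaussian channel
P_XY = P_X[:, None] * W
P_Y = P_XY.sum(axis=0)

# Quadratic basis phi(x, y) = (x^2, y^2, x*y); array of shape (|X|, |Y|, 3).
Phi = np.stack([(xs**2)[:, None] + 0.0 * ys,
                0.0 * xs[:, None] + ys**2,
                xs[:, None] * ys], axis=2)
phibar = np.tensordot(P_XY, Phi, axes=([0, 1], [0, 1]))     # <P_XY, phi>

a = np.zeros(3)
for _ in range(5000):
    Fa = Phi @ a
    tilt = P_X[:, None] * np.exp(Fa)
    tilt /= tilt.sum(axis=0, keepdims=True)    # tilted input law, one per y
    grad = phibar - np.tensordot(tilt * P_Y[None, :], Phi, axes=([0, 1], [0, 1]))
    a += 0.05 * grad                           # ascent on the concave objective

Fa = Phi @ a
J = a @ phibar - P_Y @ np.log(np.exp(Fa).T @ P_X)           # value of (21)
I = np.sum(P_XY * np.log(P_XY / (P_X[:, None] * P_Y[None, :])))
print("a* =", a)
print("I_LDP over the class:", J, " <= I(X;Y) =", I)
```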
VI. CONCLUSIONS

In this paper we have relied on the geometry surrounding Sanov's Theorem to obtain a simple derivation of the mismatched coding bound. The insight obtained, combined with stochastic approximation, provides a powerful new computational tool. Based on these results we are currently considering the following:
(i) Signal constellation and code design based on a finite-dimensional detector class $\{F_a\}$.
(ii) Maximization of the error exponent for a given transmission rate, and a given detector.
(iii) Applications to MIMO channels and networks, to simultaneously resolve coding and resource allocation in complex models.

ACKNOWLEDGMENT

Research supported in part by Motorola RPS #32 and ITMANET DARPA RK 2006-07284.

REFERENCES

[1] G. Kaplan, A. Lapidoth, S. Shamai, and N. Merhav, "On information rates for mismatched decoders," IEEE Trans. Inform. Theory, vol. 40, no. 6, pp. 1953-1967, 1994.
[2] I. Csiszar and P. Narayan, "Channel capacity for a given decoding metric," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 35-43, 1995.
[3] I. Csiszar and J. Korner, "Graph decomposition: a new key to coding theorems," IEEE Trans. Inform. Theory, vol. 27, no. 1, pp. 5-12, 1981.
[4] I. Csiszar, "The method of types," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2505-2523, 1998 (information theory: 1948-1998).
[5] S. Borade and L. Zheng, "I-projection and the geometry of error exponents," in Proc. Forty-Fourth Annual Allerton Conference on Communication, Control, and Computing, UIUC, Illinois, USA, Sept. 27-29, 2006.
[6] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed. New York: Springer-Verlag, 1998.
[7] O. Zeitouni and M. Gutman, "On universal hypotheses testing via large deviations," IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 285-290, 1991.
[8] C. Pandit and S. P. Meyn, "Worst-case large-deviations with application to queueing and information theory," Stoch. Proc. Applns., vol. 116, no. 5, pp. 724-756, May 2006.
[9] J. Huang, C. Pandit, S. Meyn, and V. V. Veeravalli, "Extremal distributions in information theory and hypothesis testing (invited)," in Proc. IEEE Information Theory Workshop, San Antonio, TX, Oct. 24-29, 2004, pp. 76-81.
[10] J. Huang, S. Meyn, and M. Medard, "Error exponents for channel coding and signal constellation design," IEEE J. Selected Areas in Comm., vol. 24, no. 8, pp. 1647-, 2006, special issue on nonlinear optimization of communication systems (Guest Editors M. Chiang, S. H. Low, Z.-Q. Luo, N. B. Shroff, and W. Yu).
[11] I. Kontoyiannis, L. Lastras-Montaño, and S. Meyn, "Relative entropy and exponential deviation bounds for general Markov chains," in Proc. 2005 IEEE International Symposium on Information Theory, Sept. 2005, pp. 1563-1567.
[12] E. Abbe, M. Medard, S. Meyn, and L. Zheng, "Geometry of mismatched decoders," 2007, submitted to IEEE International Symposium on Information Theory.
[13] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," Ann. Math. Statist., vol. 36, pp. 369-408, 1965.
[14] I. Csiszar, G. Katona, and G. Tusnady, "Information sources with different cost scales and the principle of conservation of entropy," in Proc. Colloquium on Information Theory (Debrecen, 1967), Vol. I. Budapest: Janos Bolyai Math. Soc., 1968, pp. 101-128.
[15] I. Csiszar and G. Tusnady, "Information geometry and alternating minimization procedures," Statistics & Decisions, Supplement Issue no. 1, pp. 205-237, 1984.
[16] R. G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, Inc., 1968.
[17] A. Basu, T. Bhattacharyya, and V. S. Borkar, "A learning algorithm for risk-sensitive cost," Tata Institute for Fundamental Research, unpublished report available at http://math.iisc.ernet.in/eprints/, 2006.