IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 58, NO. 3, MARCH 2012

Functional Properties of Minimum Mean-Square Error and Mutual Information

Yihong Wu, Student Member, IEEE, and Sergio Verdú, Fellow, IEEE

Abstract—In addition to exploring its various regularity properties, we show that the minimum mean-square error (MMSE) is a concave functional of the input–output joint distribution. In the case of additive Gaussian noise, the MMSE is shown to be weakly continuous in the input distribution and Lipschitz continuous with respect to the quadratic Wasserstein distance for peak-limited inputs. Regularity properties of mutual information are also obtained. Several applications to information theory and the central limit theorem are discussed.

Index Terms—Bayesian statistics, central limit theorem, Gaussian noise, minimum mean-square error (MMSE), mutual information, non-Gaussian noise.

Manuscript received December 29, 2010; accepted September 02, 2011. Date of current version February 29, 2012. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory. Y. Wu was with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA. He is now with the Department of Statistics, Wharton School of the University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: [email protected]). S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]). Communicated by D. Guo, Associate Editor for Shannon Theory. Digital Object Identifier 10.1109/TIT.2011.2174959

I. INTRODUCTION

Monotonicity, convexity, and infinite differentiability of the minimum mean-square error (MMSE) in Gaussian noise as a function of the signal-to-noise ratio (SNR) have been shown in [2]. In contrast, this paper deals with the functional aspects of MMSE, i.e., as a function of the input–output joint distribution P_XY, and in particular, as a function of the input distribution P_X when the channel P_{Y|X} is fixed. We devote special attention to additive Gaussian noise.

The MMSE of estimating X based on Y is a functional of the input–output joint distribution P_XY defined on (ℝ², B(ℝ²)), where B denotes the Borel σ-algebra, or equivalently of the pair (P_X, P_{Y|X}). Define

mmse(X|Y) = mmse(P_XY) = mmse(P_X, P_{Y|X}) = E[(X − E[X|Y])²].   (1)–(3)

These notations will be used interchangeably. When Y is related to X through an additive-noise channel with gain √snr, i.e., Y = √snr X + N, where the noise N is independent of X, we denote

mmse(X, N, snr) = mmse(X | √snr X + N),   (4)
mmse(X, snr) = mmse(X, N_G, snr),   (5)

where N_G is standard Gaussian distributed. Similarly, we denote the mutual information by

I(X, N, snr) = I(X; √snr X + N),   (6)
I(X, snr) = I(X, N_G, snr).   (7)
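As a concrete illustration of the notation in (1)–(7) (this numerical sketch is ours, not part of the original paper), consider an equiprobable ±1 input, for which the conditional-mean estimator in the Gaussian channel is E[X|Y] = tanh(√snr·Y). The snippet below estimates mmse(X, snr) by Monte Carlo and also prints the value for a unit-variance Gaussian input, 1/(1 + snr), for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def mmse_binary(snr, n=1_000_000):
    """Monte Carlo estimate of mmse(X, snr) for X equiprobable on {-1, +1}
    observed through Y = sqrt(snr)*X + N with N ~ N(0, 1)."""
    x = rng.choice([-1.0, 1.0], size=n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    x_hat = np.tanh(np.sqrt(snr) * y)      # E[X | Y] for this particular input law
    return np.mean((x - x_hat) ** 2)

for snr in (0.1, 1.0, 10.0):
    print(f"snr={snr:5.1f}  mmse(binary)≈{mmse_binary(snr):.4f}  mmse(Gaussian)={1/(1+snr):.4f}")
```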
In Section II, we study various concavity properties of the MMSE functional defined in (1)–(5). Unlike the mutual information I(P_X, P_{Y|X}), which is concave in P_X, convex in P_{Y|X}, but neither convex nor concave in the pair (P_X, P_{Y|X}), we show that the MMSE functional is concave in the joint distribution P_XY, hence concave individually in P_X when P_{Y|X} is fixed, and in P_{Y|X} when P_X is fixed. It is neither concave nor convex in the pair (P_X, P_{Y|X}).

Various regularity properties of MMSE are explored in Section III. In particular, we show that: 1) mmse(X|Y) is in general neither weakly lower semicontinuous (l.s.c.) nor upper semicontinuous (u.s.c.) in P_XY, although it is weakly u.s.c. when the input is bounded; 2) when the additive noise N has a continuous and bounded density, mmse(X, N, snr) is weakly continuous in P_X; 3) when the noise is Gaussian and the input has a finite moment of a certain order, mmse(X, snr) is Lipschitz continuous with respect to the Wasserstein distance [3]. For more general distortion functions (not necessarily mean square), concavity and lower semicontinuity properties of the optimal distortion as a function of the input distribution or channel statistics have recently been studied in [4] in the context of stochastic control.

Via the I-MMSE relationship¹ [5]

I(X, snr) = (1/2) ∫₀^snr mmse(X, γ) dγ,   (8)

regularities of MMSE are inherited by the mutual information when the input power is bounded. Several applications of these continuity properties are given in Section IV.
1) Using the weak continuity of MMSE, we prove that the differential version of the I-MMSE relationship

d/dsnr I(X, snr) = (1/2) mmse(X, snr)   (9)

holds for all snr as long as the mutual information is finite, thus dropping the finite-variance condition imposed on the input in [5, Th. 1].
2) The Lipschitz continuity of MMSE enables us to gauge the gap between the Gaussian channel capacity and the mutual information achieved by a given input by computing its Wasserstein distance to the Gaussian distribution.
3) We give upper bounds on the convergence rate (in terms of relative entropy) in the central limit theorem for densities obtained as the convolution of a Gaussian with another distribution, under various moment assumptions.
4) We give an example of the central limit theorem in the sense of weak convergence where the non-Gaussianness is finite but does not vanish.
We also give a new bound on the decrease of MMSE due to additional observations in terms of the mutual information.

¹Throughout this paper, natural logarithms are adopted and information units are nats.

In Section V, we discuss the data processing inequality associated with MMSE, which implies that mmse(X, N, snr) is decreasing in snr for those N with a stable distribution,² e.g., Gaussian. In Section VI, we present relevant results on the extremization of the MMSE functional with Gaussian inputs and/or Gaussian noise. While the least favorable input with additive Gaussian noise is Gaussian, the worst random transformation faced by Gaussian inputs is an attenuation followed by additive Gaussian noise, which coincides with the optimal forward random transformation achieving the Gaussian rate-distortion function, i.e., the backward random transformation is an additive Gaussian noise. Nonetheless, the worst additive-noise channel is still Gaussian. We also discuss the MMSE-maximizing input distribution under an amplitude constraint and its discrete nature.

²A distribution P is called stable if, for X₁, X₂ independent and identically distributed according to P and any a, b ∈ ℝ, the random variable aX₁ + bX₂ has the same distribution as cX + d for some c, d ∈ ℝ [6, p. 6].

II. CONCAVITY

Theorem 1: For any joint distributions P_XY and Q_XY and any α ∈ [0, 1],

mmse(αP_XY + (1 − α)Q_XY) ≥ α mmse(P_XY) + (1 − α) mmse(Q_XY).   (10)

Consequently, mmse(P_XY) is a concave functional in P_XY.
Proof: See Appendix A.

Corollary 1: mmse(P_X, P_{Y|X}) is individually concave in each of its arguments when the other one is fixed.
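As a quick numerical sanity check of Theorem 1 and Corollary 1 (our own illustration, not from the paper), the helper below evaluates mmse(P_X, snr) for a finitely supported input in a Gaussian channel by numerical integration and verifies the concavity inequality (10) for a mixture of two binary inputs; the quadrature grid, the snr, and the two example distributions are arbitrary choices.

```python
import numpy as np

def mmse_discrete(xs, ps, snr, grid=np.linspace(-12.0, 12.0, 4001)):
    """mmse(P_X, snr) for a finitely supported input P_X = (xs, ps) observed in
    standard Gaussian noise, via numerical integration of E[(X - E[X|Y])^2]."""
    xs, ps = np.asarray(xs, float), np.asarray(ps, float)
    lik = np.exp(-0.5 * (grid[None, :] - np.sqrt(snr) * xs[:, None]) ** 2) / np.sqrt(2 * np.pi)
    joint = ps[:, None] * lik                        # p(x_i) * p(y | x_i) on the grid
    py = joint.sum(axis=0)                           # output density p(y)
    xhat = (xs[:, None] * joint).sum(axis=0) / py    # conditional mean E[X | Y = y]
    err = ((xs[:, None] - xhat[None, :]) ** 2 * joint).sum(axis=0)
    return float(np.sum(err) * (grid[1] - grid[0]))

snr, a = 2.0, 0.3
P = ([-1.0, 1.0], [0.5, 0.5])                        # symmetric binary input
Q = ([0.0, 2.0], [0.9, 0.1])                         # asymmetric binary input
mix = ([-1.0, 1.0, 0.0, 2.0],
       [a * 0.5, a * 0.5, (1 - a) * 0.9, (1 - a) * 0.1])
lhs = mmse_discrete(*mix, snr)                       # mmse of the mixture aP + (1-a)Q
rhs = a * mmse_discrete(*P, snr) + (1 - a) * mmse_discrete(*Q, snr)
print(f"mmse of mixture = {lhs:.4f}  >=  mixture of mmses = {rhs:.4f}")
```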
Remark 1: MMSE is not concave in the pair (P_X, P_{Y|X}). We illustrate this point by the following example: for i = 1, 2, let Y_i = X_i + N_i, where the input X_i and the noise N_i are independent and equiprobable on two-point sets chosen differently for i = 1 and i = 2, so that the two MMSEs mmse(P_{X_i}, P_{N_i}) can be computed directly. Let (P_X, P_{Y|X}) be the pair in which the distribution of the input (resp. noise) is the equal mixture of those of X_1 and X_2 (resp. N_1 and N_2). Then the computations in (11) and (12) exhibit a violation of concavity in the pair.

Corollary 2 (Nonstrict Concavity): In general, MMSE is not strictly concave.
Proof: According to (10), equality in (13) holds for all α ∈ [0, 1] if and only if (14) holds, i.e., the optimal estimators under the two joint distributions coincide almost everywhere with respect to each of them. Therefore, instances of nonstrict concavity can be established by constructing pairs of distributions which give rise to the same optimal estimator. Consider the following examples.
1) Let Y = X + X′, where X and X′ are i.i.d. By symmetry, E[X|Y] = E[X′|Y], and since these sum to Y, the optimal estimator is E[X|Y] = Y/2 regardless of the common distribution of X and X′. Therefore, in this case, the MMSE functional is affine between those joint distributions.
2) Let X and N be independent and standard Gaussian. Denote the joint distributions of (X, X + N) and (N, X + N) by P and Q, respectively. Then the optimal estimators under P and Q are both Y/2. Therefore, the MMSE functional is not strictly concave.
3) Let Y = X + N, where N is independent of X and equiprobable Bernoulli. Consider the two densities of X given in (15) and (16), where φ denotes the standard normal density. It can be shown that the optimal estimators for (15) and (16) are the same, as given by (17). Hence, the MMSE functional for this channel is the same for any mixture of (15) and (16).

Despite the nonstrict concavity in P_XY for a general channel, in the special case of additive Gaussian noise MMSE is indeed a strictly concave functional of the input distribution, as shown next. The proof exploits the relationship between the optimal estimator in Gaussian channels and the Weierstrass transform [7] of the input distribution.

Theorem 2: For fixed snr > 0, the functionals P_X ↦ mmse(X, snr) and P_X ↦ I(X, snr) are both strictly concave.
Proof: See Appendix B.

III. REGULARITY OF MMSE

A. Continuity and Semicontinuity

In general, the functional P_XY ↦ mmse(X|Y) is not weakly semicontinuous. To see this, consider the sequence in (18): it converges in distribution, yet the MMSEs along the sequence stay strictly below the MMSE of the limit, and therefore mmse(X|Y) is not weakly lower semicontinuous (l.s.c.) in P_XY. On the other hand, for the sequence in (19), which also converges in distribution, the MMSEs along the sequence exceed the MMSE of the limit, as recorded in (20); hence mmse(X|Y) is not weakly u.s.c. either. Nevertheless, assuming either a bounded input or an additive-noise channel, MMSE is indeed a weakly u.s.c. functional.

Theorem 3: Let K > 0 and denote by 𝒫_K the collection of all Borel probability measures P_XY under which |X| ≤ K almost surely. Then P_XY ↦ mmse(X|Y) is weakly u.s.c. on 𝒫_K.
Proof: Variational representations prove to be effective tools in proving semicontinuity and convexity of information measures (for example, relative entropy [8], Fisher information [9], etc.). Here, we follow the same approach by using the variational characterization of MMSE³

mmse(X|Y) = inf over Borel f of E[(X − f(Y))²] = inf over continuous bounded f of E[(X − f(Y))²],   (22), (23)

where the second infimum may be restricted to the collection C_b(ℝ) of real-valued continuous bounded functions on ℝ by denseness of C_b(ℝ) in the relevant space. For a fixed f ∈ C_b(ℝ), the map P_XY ↦ E[(X − f(Y))²] is weakly continuous on 𝒫_K, since (x, y) ↦ (x − f(y))² is continuous and bounded on [−K, K] × ℝ. Therefore, by (22), mmse(X|Y) is weakly u.s.c. because it is the pointwise infimum of weakly continuous functions. In view of the counterexample in (20), the boundedness assumption on X is not superfluous.

³The Borel measurability of the estimators in (22) is not superfluous. For example [10], it is possible to construct a random variable Y and a nonmeasurable function f̂ such that X = f̂(Y) is a random variable independent of Y. Then E[X|Y] = E[X], while E[(X − f̂(Y))²] = 0.

Remark 2: The variational representation of MMSE in (22) provides an alternative proof of its concavity: for any fixed f, the map P_XY ↦ E[(X − f(Y))²] is affine in P_XY, so mmse(X|Y) is concave because it is the pointwise infimum of affine functionals. Since the infimum in (22) may be taken over all Borel f, this argument does not rely on the boundedness of X.

Theorem 4: Let Y = √snr X + N. Then, for any snr > 0, P_X ↦ mmse(X, N, snr) is weakly u.s.c. In addition, if N has a continuous and bounded density, then P_X ↦ mmse(X, N, snr) is weakly continuous.
Proof: See Appendix C.

Remark 3: Theorem 4 cannot be extended to snr = 0, in which case the MMSE equals the input variance, which is weakly l.s.c. in P_X but not continuous, as the example in (19) illustrates. For snr > 0, mmse(X, N, snr) need not be weakly continuous if the sufficient conditions in Theorem 4 are not satisfied. For example, suppose that X and N are both equiprobable Bernoulli, and let X_n = a_n X, where {a_n} is a sequence of irrational numbers converging to 1. Then X_n converges to X in distribution and mmse(X_n | X_n + N) = 0 for all n (the irrational gain makes X_n exactly recoverable from X_n + N), whereas mmse(X | X + N) > 0, as recorded in (21). This also shows that, under the conditions of Theorem 3 alone, mmse(X|Y) need not be weakly continuous in P_XY.

Corollary 3: For fixed snr > 0, P_X ↦ mmse(X, snr) is weakly continuous.

Corollary 3 guarantees that the MMSE of a random variable can be calculated using the MMSE of its successively finer discretizations, which paves the way for numerically calculating the MMSE of singular inputs (e.g., the Cantor distribution) in [11]. One caveat, however, is that to calculate the value of MMSE within a given accuracy, the quantization level needs to grow so that the quantization error is much smaller than the noise.

In view of the representation of the MMSE in terms of the Fisher information of the channel output with additive Gaussian noise [5, eq. (58)] (known as Brown's identity in the statistics literature [12, eq. (1.3.4)])

mmse(X, snr) = (1/snr) [1 − J(√snr X + N_G)],   (24)

Corollary 3 implies the weak continuity of P_X ↦ J(√snr X + N_G) for fixed snr > 0. While Fisher information is in general only l.s.c. [9, p. 79], here the continuity is due to the convolution with the Gaussian density.
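The identity (24) is easy to check numerically. The sketch below (our illustration; the binary input, the snr, and the quadrature grid are arbitrary choices) computes mmse(X, snr) and the Fisher information J(Y) of the output density for a ±1 input and compares the two sides of Brown's identity.

```python
import numpy as np

def gaussian_output_stats(xs, ps, snr, grid=np.linspace(-12.0, 12.0, 8001)):
    """For Y = sqrt(snr)*X + N with finitely supported X and N ~ N(0, 1), return
    (mmse(X, snr), Fisher information J(Y)) by numerical integration."""
    xs, ps = np.asarray(xs, float), np.asarray(ps, float)
    d = grid[None, :] - np.sqrt(snr) * xs[:, None]
    phi = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)
    joint = ps[:, None] * phi
    py = joint.sum(axis=0)                      # output density p(y)
    dpy = (-d * joint).sum(axis=0)              # its derivative p'(y)
    xhat = (xs[:, None] * joint).sum(axis=0) / py
    dy = grid[1] - grid[0]
    mmse = np.sum(((xs[:, None] - xhat) ** 2 * joint).sum(axis=0)) * dy
    fisher = np.sum(dpy ** 2 / py) * dy         # J(Y) = integral of (p')^2 / p
    return mmse, fisher

snr = 3.0
mmse, J = gaussian_output_stats([-1.0, 1.0], [0.5, 0.5], snr)
print(f"mmse(X, snr)     = {mmse:.6f}")
print(f"(1 - J(Y)) / snr = {(1.0 - J) / snr:.6f}   # Brown's identity (24)")
```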
B. Lipschitz Continuity

Seeking a finer characterization of the modulus of continuity of P_X ↦ mmse(X, snr), we introduce the Wasserstein distance [3].

Definition 1: For p ≥ 1, the Wasserstein space of order p on ℝ is defined as the collection of all Borel probability measures with finite pth-order moments, denoted by 𝒲_p(ℝ). The Wasserstein distance of order p (W_p distance) is a metric on 𝒲_p(ℝ), defined for P, Q ∈ 𝒲_p(ℝ) as

W_p(P, Q) = inf (E|X − Y|^p)^{1/p},   (25)

where the infimum is over all joint distributions of (X, Y) with marginals P and Q.

On the real line, the W_p distance coincides with the L_p distance between the quantile functions of the two distributions [13], [14]:

W_p(P, Q) = ( ∫₀¹ |F_P⁻¹(t) − F_Q⁻¹(t)|^p dt )^{1/p},   (26)

where F_P denotes the cumulative distribution function of P. By Hölder's inequality, W_p(P, Q) ≤ W_q(P, Q) whenever p ≤ q.   (27)

The W_p distance metrizes the topology of weak convergence plus convergence of pth-order moments. Because, in general, convergence in distribution does not yield convergence of moments, this topology is strictly finer than the weak topology. Since W₂ convergence implies convergence of the variance, in view of Corollary 3, P_X ↦ mmse(X, snr) is continuous on the metric space (𝒲₂(ℝ), W₂) for all snr ≥ 0. Capitalizing on the smoothness of the optimal estimator with Gaussian noise, the Lipschitz continuity of P_X ↦ mmse(X, snr) can be established as follows.

Theorem 5: For any pair of input distributions satisfying the moment conditions stated in (28) and any snr > 0, the difference of the corresponding MMSEs is controlled by their W₂ distance as in (28). Consequently, P_X ↦ mmse(X, snr) is Lipschitz continuous with respect to the W₂ distance on any compact set in the Wasserstein space, for any snr > 0.
Proof: See Appendix D.
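On the real line, (26) reduces W_p to a one-dimensional integral of quantile functions. The following sketch (ours, not part of the paper; the two Gaussian test distributions are chosen only because W₂ then has the closed form sqrt((m₁ − m₂)² + (s₁ − s₂)²), and the uniform grid on (0, 1) is a crude quadrature) checks the quantile formula numerically.

```python
import numpy as np
from scipy.stats import norm

def w2_quantile(ppf_p, ppf_q, n=200_000):
    """W_2(P, Q) on the real line via the quantile representation (26):
    the L_2 distance between the quantile (inverse CDF) functions."""
    t = (np.arange(n) + 0.5) / n                 # midpoint grid on (0, 1)
    return np.sqrt(np.mean((ppf_p(t) - ppf_q(t)) ** 2))

# Two Gaussians, for which W_2 has a well-known closed form.
m, s = 1.5, 2.0
est = w2_quantile(norm(0, 1).ppf, norm(m, s).ppf)
print(f"quantile formula: {est:.4f}   closed form: {np.sqrt(m**2 + (s - 1)**2):.4f}")
```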
Finally, we present two results which frequently enable us to approximate the MMSE of a given input using its truncated version, where the approximation error is uniform in the random transformation.

Lemma 1: For any pair of inputs and any random transformation, the difference of the two MMSEs obeys the bound (29).
Proof: The chain of estimates (30)–(32) bounds the MMSE of one input by that of the other plus the stated error term; interchanging the roles of the two inputs, (29) follows.

Lemma 2: Let Y be the output of the random transformation P_{Y|X} when the input is X ~ P_X. For K > 0, denote by X_K a random variable distributed according to P_X conditioned on the event {|X| ≤ K}, i.e., P_{X_K}(A) = P_X(A ∩ [−K, K]) / P_X([−K, K]) for any measurable set A (this is well defined because the denominator is positive for sufficiently large K), and let Y_K be the corresponding output. Then, for any P_{Y|X}, the approximation bounds (33) and (34) hold.
Proof of Lemma 2: The joint distribution of (X_K, Y_K) is that of (X, Y) conditioned on the event {|X| ≤ K}, cf. (27) and (35). Since the optimal estimator for X_K is a suboptimal estimator for X, the chain (36)–(40) follows, which implies (33). To show (34), note (41)–(43).

IV. APPLICATIONS TO MUTUAL INFORMATION

A. Finite Mutual Information and the I-MMSE Relationship

Capitalizing on the weak continuity of MMSE proved in Theorem 4, we prove that the I-MMSE relationship holds as long as the mutual information is finite, thus removing the finite-variance condition imposed on the input in [5, Th. 1].

Theorem 6: For any random variable X, the following are equivalent: a) I(X, snr) is finite for some snr > 0; b) I(X, snr) is finite for all snr > 0; c) the distribution of X satisfies the condition characterized in [15, Th. 1]. Furthermore, if the mutual information is finite, then (9) holds for all snr > 0.

It should be remarked that finiteness of the mutual information is much weaker than finiteness of the input variance, as the following sufficient condition shows, which also applies to non-Gaussian noise.

Lemma 3: Let the noise N have a density. Let g be an increasing continuous function that satisfies the following conditions: 1) the growth condition (44); 2) for any a > 0, there exists a constant such that (45) holds. If, in addition, (46) and (47) hold, then 1) I(X, N, snr) < ∞ for all snr > 0; 2) snr ↦ I(X, N, snr) is continuous on [0, ∞).
Proof: See Appendix E.

Examples of functions g that satisfy (45) include: 1) g convex; 2) g increasing and subadditive (e.g., concave with g(0) ≥ 0). Consequently, any convex, increasing, and nonconstant g satisfies both (44) and (45). Particularizing to Gaussian noise and choosing g to be quadratic shows that a finite second moment is sufficient for finite mutual information. In fact, choosing g to be a nested logarithm, we obtain a family of sufficient conditions much weaker than square integrability: with the functions defined in (48) in terms of iterated exponentials (tetrations), finiteness of the corresponding moment for some order implies I(X, snr) < ∞ for all snr > 0.

Proof of Theorem 6: The equivalence between a) and c) has been shown in [15, Th. 1]. To prove that a) implies b), assume that I(X, snr₀) < ∞ for some snr₀ > 0. By monotonicity, it is sufficient to consider snr > snr₀. Since the input and the outputs at the two SNRs form a Markov chain,⁴ we have (49)–(51), where (50) follows from the concavity of the logarithm. Therefore, I(X, snr) < ∞ for all snr > 0.

⁴X − Y − Z means that X and Z are conditionally independent given Y, i.e., (X, Y, Z) is a Markov chain.

Next, we show that for any snr > snr₀ > 0 the incremental form (53) of the I-MMSE relationship holds. For K > 0, let X_K be distributed according to the distribution of X conditioned on the event {|X| ≤ K}, as in Lemma 2. Since X_K is bounded, (54) holds, i.e., (53) holds with X replaced by X_K. As K → ∞, X_K → X in distribution, so by the weak continuity of MMSE proved in Corollary 3, mmse(X_K, γ) → mmse(X, γ) for every γ > 0. Since mmse(X_K, γ) ≤ 1/γ, which is integrable on the interval [snr₀, snr], applying the dominated convergence theorem to (54) yields (55). To establish (53), it remains to show (56). By the lower semicontinuity of relative entropy, it is sufficient to show (57). Note that (58)–(61) hold, which implies the desired (57). In view of the continuity of snr ↦ mmse(X, snr) on (0, ∞) established in [2, Proposition 7], differentiating (53) yields (9) for every snr > 0.

Remark 4: It can be shown that the integral I-MMSE relationship (8) holds for every input X. First, consider the case of finite mutual information. Sending snr₀ → 0 in (53) and applying the monotone convergence theorem, we obtain (62). Therefore, establishing (8) amounts to proving the continuity of snr ↦ I(X, snr) at snr = 0, which has been established in [16] for arbitrary X with finite mutual information. In the case I(X, snr) = ∞, the integral I-MMSE relationship (8) still holds in the sense that both sides are infinite. To see this, let X_K be distributed according to the conditional distribution defined in Lemma 2. Then the chain (63)–(66) applies, where (64) holds by (34), (65) holds by (8) since X_K is bounded, and (66) holds by the lower semicontinuity of mutual information, which is inherited from that of relative entropy.
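To see the integral I-MMSE relationship (8) in action (our numerical illustration, with an equiprobable ±1 input and all grid sizes chosen arbitrarily), the script below computes I(X, snr) directly from the output density and compares it with one half of the integral of mmse(X, γ) over γ ∈ [0, snr].

```python
import numpy as np

GRID = np.linspace(-15.0, 15.0, 6001)
DY = GRID[1] - GRID[0]
XS = np.array([[-1.0], [1.0]])                 # equiprobable input atoms

def joint(snr):
    """0.5 * N(y; sqrt(snr)*x, 1) for each atom x, evaluated on GRID."""
    d = GRID[None, :] - np.sqrt(snr) * XS
    return 0.5 * np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)

def mmse(snr):
    j = joint(snr)
    py = j.sum(axis=0)
    xhat = (XS * j).sum(axis=0) / py           # E[X | Y = y]
    return float(np.sum((XS - xhat) ** 2 * j) * DY)

def mi(snr):
    """I(X; Y) = h(Y) - h(N) in nats, with h(N) = 0.5*log(2*pi*e)."""
    py = joint(snr).sum(axis=0)
    return float(-np.sum(py * np.log(py)) * DY - 0.5 * np.log(2 * np.pi * np.e))

snr = 2.0
gammas = np.linspace(0.0, snr, 401)
vals = np.array([mmse(g) for g in gammas])
integral = 0.5 * np.sum((vals[:-1] + vals[1:]) / 2) * (gammas[1] - gammas[0])  # trapezoid rule
print(f"I(X, snr)          = {mi(snr):.5f} nats")
print(f"(1/2) * int(mmse)  = {integral:.5f} nats")
```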
B. Regularity of Mutual Information

Note that, unlike in the finite-dimensional setting (e.g., [17]), the concavity of mutual information does not imply continuity. In view of the weak lower semicontinuity of relative entropy [8], P_X ↦ I(X, snr) is weakly l.s.c. but not continuous in general, as the following example illustrates: let the sequence of inputs be given by (67), which converges weakly to a fixed limit regardless of the choice of the perturbation parameter. Using the concavity of mutual information and the dominated convergence theorem, it can be shown that (68) holds for any snr > 0, yet the mutual informations along the sequence do not converge to that of the limit. Nevertheless, mutual information is indeed weakly continuous in the input distribution if the input variance is bounded and the additive noise is Gaussian. Applying Corollary 3 and the dominated convergence theorem to (8), we obtain the following.

Theorem 7: If X_n → X in distribution and sup_n var(X_n) < ∞, then I(X_n, snr) → I(X, snr) for any snr ≥ 0.

By the W₂-continuity of MMSE (Theorem 5), P_X ↦ I(X, snr) is also W₂-continuous. Moreover, Lipschitz continuity holds when the input is peak-limited, as the following result, obtained by integrating both sides of (28) via (8), shows.

Corollary 4: Under the conditions of Theorem 5, for any snr ≥ 0, the bound (69) controls the difference of the mutual informations in terms of the W₂ distance between the inputs. Consequently, P_X ↦ I(X, snr) is Lipschitz continuous with respect to W₂ on any compact set in the Wasserstein space, for any snr > 0.

In fact, W₂-continuity also carries over to non-Gaussian noise.

Theorem 8: Let X_n → X in W₂, i.e., in distribution with convergence of second-order moments. Then I(X_n, N, snr) → I(X, N, snr) whenever the output has finite non-Gaussianness, cf. (70).
Proof: Note the decomposition (71) of the mutual information in terms of the non-Gaussianness of the output. Since the variance converges under W₂ convergence, upper semicontinuity follows from the lower semicontinuity of relative entropy; as mentioned before, the lower semicontinuity of mutual information is likewise inherited from that of relative entropy.

As a consequence of Theorem 7, we can restrict the inputs to a weakly dense subset (e.g., discrete distributions) in the maximization (72) defining the Gaussian channel capacity under a variance constraint. It is interesting to analyze how the gap between the capacity C(snr) and the maximal mutual information achieved by unit-variance inputs taking m values, denoted by C_m(snr), closes as m grows. The W₂-Lipschitz continuity of mutual information allows us to obtain an upper bound on the convergence rate. It is known that the optimal W₂ distance between a given distribution and discrete distributions taking m values coincides with the square root of the mean-square quantization error of the optimal m-point quantizer [18], which scales according to 1/m [19]. Choosing the input to be the output of the optimal quantizer and applying a truncation argument, we conclude that the gap C(snr) − C_m(snr) is O(1/m). In fact, the gap vanishes exponentially fast [20].

C. Applications to the Central Limit Theorem

We proceed to prove upper bounds on the convergence rate of the non-Gaussianness in the central limit theorem for i.i.d. random variables whose distribution is the convolution of a Gaussian distribution with an arbitrary distribution satisfying certain moment assumptions.

Theorem 9: Let {W_i} and {Z_i} be independent i.i.d. sequences of random variables, where the Z_i are standard Gaussian, and let the normalized sums S_n be defined as in (73). Then: 1) if W₁ has a finite sixth moment, then the non-Gaussianness of S_n vanishes at the rate given in (74); 2) if the moment generating function of W₁ is finite, then the stronger estimate (75) holds.
Proof: Let G denote a normal random variable with the same mean and variance as the summand. According to (71), we have (76)–(79), where (78) follows from Corollary 4. The W₂ distance between the law of the normalized sum and the Gaussian satisfies the following: 1) under the corresponding moment condition, the bound (80) of [21], [22, eq. (1.5)]; 2) if the moment generating function is finite, the bound (81) of [22, eq. (1.6)]. Applying (80) and (81) to (78), respectively, we obtain the desired results (74) and (75).

Remark 5: Nonasymptotic upper bounds on the non-Gaussianness can be obtained by combining (78) with the nonasymptotic results on the Wasserstein distance in [22, Th. 2.1 and eq. (1.6)]. Theorem 9 can also be generalized to non-i.i.d. sequences by applying [22, Th. 4.1].
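To visualize the setting of Theorem 9 (our illustration; the choice of W_i equiprobable on {−1, +1} is merely an example of a summand whose law convolved with a Gaussian yields an explicit mixture density), the script below computes the non-Gaussianness D(S_n ‖ N(0, 2)) of the normalized sum by one-dimensional quadrature and prints its decay with n. It is only meant to show the decrease numerically, not to verify any particular rate.

```python
import numpy as np
from scipy.stats import binom, norm

def non_gaussianness(n, grid=np.linspace(-12.0, 12.0, 8001)):
    """D( law of S_n || N(0, 2) ) in nats, where S_n = (W_1+Z_1+...+W_n+Z_n)/sqrt(n),
    with W_i equiprobable on {-1, +1} and Z_i standard Gaussian. The law of S_n is
    an explicit binomial mixture of Gaussians, so the divergence is a 1-D integral."""
    k = np.arange(n + 1)
    w = binom.pmf(k, n, 0.5)                     # P(number of W_i equal to +1 is k)
    means = (2 * k - n) / np.sqrt(n)
    f = (w[:, None] * norm.pdf(grid[None, :], loc=means[:, None], scale=1.0)).sum(axis=0)
    g = norm.pdf(grid, scale=np.sqrt(2.0))       # matching Gaussian N(0, 2)
    dx = grid[1] - grid[0]
    return float(np.sum(f * (np.log(f) - np.log(g))) * dx)

for n in (1, 2, 4, 8, 16, 32):
    print(f"n={n:3d}  D(S_n || N(0,2)) ≈ {non_gaussianness(n):.6f} nats")
```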
Remark 6: The asymptotic behavior of the non-Gaussianness of S_n was studied in [23], where it was shown that it vanishes if and only if it is finite for some n. Finer analysis in [24] and [25] showed that if the summand has a finite Poincaré constant [25, Definition 1.2], the non-Gaussianness vanishes at the optimal rate O(1/n). However, finiteness of the Poincaré constant implies the existence of moments of all orders [26], while (74) only assumes the existence of the sixth moment. In the special case of a compactly supported W₁, [27, Th. 1.8] implies that W₁ + Z₁ has a finite Poincaré constant; therefore, in this case the non-Gaussianness is O(1/n), which is stronger than (75). After the submission of this paper, we became aware of the results in [28, Th. 1.1], which established the exact asymptotics of the non-Gaussianness using Edgeworth-type expansions: for S_n defined in (73), the expansion (82) holds, which improves Theorem 9 significantly.

To conclude this subsection, we give an example where the non-Gaussianness of a sequence of absolutely continuous distributions does not vanish in the central limit theorem. Consider the following example [29, eq. (17.4)]: let {X_i} be a sequence of independent random variables with the distributions specified in (83). Although the normalized sum converges to a Gaussian in distribution, direct computation of characteristic functions, combined with Theorem 7 (applied to a bounded truncation of the sequence) and the decomposition (71), shows in (84) that the non-Gaussianness is finite but does not vanish.

D. Bounds on MMSE via Mutual Information

To conclude this section, we present a result related to Tao's inequality [30], [31], which shows that the reduction of MMSE afforded by an additional observation never exceeds half of the corresponding mutual information.

Theorem 10: Let X take values in the unit ball of a Euclidean space. For any pair of observations, the reduction in the MMSE of X afforded by the additional observation is bounded in terms of the conditional mutual information as in (85); in particular, (86) holds.
Proof: The chain (87)–(92) applies, where (90) makes use of the fact that X belongs to the unit ball, (91) follows from the Kullback–Csiszár–Kemperman–Pinsker inequality (e.g., [32]), and (89) follows from (93). To verify (93), consider the normalized versions of the quantities involved and denote by |·| the absolute value taken componentwise; then (94)–(97) hold, where (96) follows from Jensen's inequality and the convexity of norms.

V. DATA PROCESSING INEQUALITY

In [33], it is pointed out that, like mutual information, MMSE satisfies a data processing inequality. We prove this property along with a necessary and sufficient condition for equality to hold.

Theorem 11: If X − Y − Z is a Markov chain, then

mmse(X|Y) ≤ mmse(X|Z),   (98)

with equality if and only if E[X|Y] = E[X|Z] almost everywhere.
Proof: By the orthogonality principle and the Markov property, (99) holds.

Corollary 5: For any X, mmse(X, snr) is decreasing in snr.
Proof: The monotonicity is a consequence of Theorem 11: for any 0 < snr₁ < snr₂, the input and the two Gaussian-channel outputs form the Markov chain X − (√snr₂ X + N_G) − (√snr₁ X + N_G′), since the lower-SNR output can be obtained from the higher-SNR one by scaling and adding independent Gaussian noise. Using the same reasoning, we can conclude that, for any noise N with a stable law, mmse(X, N, snr) is monotonically decreasing in snr for all X.

It should be noted that, unlike the data processing inequality for mutual information, equality in (98) does not imply that Z is a sufficient statistic of Y for X. Consider the following example of a multiplicative channel: let X, U, and V be independent square-integrable random variables with zero mean, and let Y and Z be the outputs of the two cascaded multiplicative channels driven by U and V, respectively, so that X − Y − Z. Then the optimal estimates of X based on Y and on Z coincide, as shown in (100); hence, by Theorem 11, equality holds in (98). However, Z is not a sufficient statistic of Y for X.
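The data processing inequality of Theorem 11 is easy to check in a fully Gaussian Markov chain, where both MMSEs have closed forms (this is our illustration; the variances below are arbitrary, and the empirical estimates use plain least squares, which is optimal here because all variables are jointly Gaussian).

```python
import numpy as np

rng = np.random.default_rng(1)

# Markov chain X -- Y -- Z: X ~ N(0,1), Y = X + N1, Z = Y + N2 (all Gaussian).
n, v1, v2 = 2_000_000, 0.5, 1.0
x = rng.standard_normal(n)
y = x + np.sqrt(v1) * rng.standard_normal(n)
z = y + np.sqrt(v2) * rng.standard_normal(n)

mmse_xy_closed = 1 - 1 / (1 + v1)          # mmse(X|Y) = 1 - cov(X,Y)^2 / var(Y)
mmse_xz_closed = 1 - 1 / (1 + v1 + v2)     # mmse(X|Z)

def empirical_mmse(obs):
    a = np.dot(x, obs) / np.dot(obs, obs)  # least-squares (here optimal) coefficient
    return np.mean((x - a * obs) ** 2)

print(f"mmse(X|Y): closed {mmse_xy_closed:.4f}, empirical {empirical_mmse(y):.4f}")
print(f"mmse(X|Z): closed {mmse_xz_closed:.4f}, empirical {empirical_mmse(z):.4f}")
print("data processing inequality mmse(X|Y) <= mmse(X|Z):",
      mmse_xy_closed <= mmse_xz_closed)
```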
VI. MAXIMIZATION OF MMSE

In this section, we consider the maximization of MMSE over a convex set of input distributions for a fixed random transformation, and vice versa.

A. Worst Input Distribution

Theorem 12 ([2, Proposition 12]): Let N_G be standard Gaussian, independent of X, and suppose that var(X) ≤ σ². Then

mmse(X, snr) ≤ σ²/(1 + snr σ²),   (101)

where the maximum is achieved if and only if X is Gaussian with variance σ² (and arbitrary mean).

Through the I-MMSE relationship (8), Theorem 12 provides an alternative explanation for the optimality of Gaussian inputs in Gaussian channels, because the integrand in (8) is maximized pointwise. Consequently, to approximate the capacity-achieving Gaussian distribution under some given constraints, it is equivalent to find an input whose MMSE profile approximates the Gaussian MMSE in the integral sense of (8). This observation has been exploited in Section IV to estimate how discrete inputs approach the Gaussian channel capacity.

Theorem 12 states that the least favorable input distribution under a variance constraint is Gaussian. However, under an amplitude constraint |X| ≤ A, the input distribution X*_A that achieves the maximal MMSE in (102) is finitely supported (see, for example, [34]), a consequence of the analyticity of the Gaussian density. This phenomenon is reminiscent of the fact that the capacity-achieving input distribution for the Gaussian channel under amplitude constraints is also finitely supported [35]. In general, there is no closed-form solution for X*_A (see [36] for numerical recipes). However, when the amplitude is sufficiently small relative to the noise, it has been shown that the worst input distribution is binary, namely equiprobable on {−A, +A} [37], as recorded in (103). Capitalizing on the I-MMSE relationship (8), this result has been used to establish that the input in (103) also achieves the amplitude-constrained capacity in (104) [38].

The behavior of X*_A when A is large has been investigated in [39], where it is shown that if X_A is distributed according to X*_A, then X_A/A converges in distribution as A → ∞, and the limiting distribution has the density

p*(x) = cos²(πx/2), |x| ≤ 1.   (105)

Moreover, the corresponding asymptotic behavior of the maximal MMSE is given by (106). An intuitive explanation for (105) and (106) is the following: by Brown's identity (24), the chain (107)–(109) relates the maximal MMSE to the Fisher information of the output. Suppose that X_A/A converges in distribution and that the maximal MMSE behaves according to (106). Then the limiting distribution must minimize the Fisher information among all distributions supported on [−1, 1]; this unique minimizer is given by (105) [39], [40]. Similarly, if a variance constraint is added to (102) in addition to the amplitude constraint, it can be shown that X_A/A converges in distribution to the density that minimizes the Fisher information among distributions supported on [−1, 1] with variance not exceeding the prescribed value, which has been obtained in [41]. Compared to (105), it is interesting to observe that the capacity-achieving input distribution in (104) has a different limiting behavior: if X̂_A achieves (104), then X̂_A/A converges to the uniform distribution on [−1, 1], which maximizes the differential entropy instead of minimizing the Fisher information.
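The following sketch (ours, not part of the paper; it reuses the same quadrature-based MMSE routine as the earlier snippets, and the snr, amplitude, and candidate inputs are arbitrary) numerically illustrates two points from this subsection: under a unit-variance constraint, the Gaussian input of Theorem 12 yields a larger MMSE than a binary input of the same variance, and under a small amplitude constraint, the equiprobable ±A input yields a larger MMSE than the other amplitude-limited examples shown, in line with the discussion around (103).

```python
import numpy as np

def mmse_discrete(xs, ps, snr, grid=np.linspace(-12.0, 12.0, 4001)):
    """mmse(P_X, snr) for a finitely supported input in standard Gaussian noise."""
    xs, ps = np.asarray(xs, float), np.asarray(ps, float)
    d = grid[None, :] - np.sqrt(snr) * xs[:, None]
    joint = ps[:, None] * np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)
    py = joint.sum(axis=0)
    xhat = (xs[:, None] * joint).sum(axis=0) / py
    return float(np.sum(((xs[:, None] - xhat) ** 2 * joint).sum(axis=0)) * (grid[1] - grid[0]))

snr, A = 1.0, 0.5
print("variance constraint (var = 1):")
print(f"  Gaussian (101): {1/(1+snr):.4f}   binary ±1: {mmse_discrete([-1, 1], [0.5, 0.5], snr):.4f}")
print(f"amplitude constraint |X| <= {A}:")
for name, xs, ps in [("binary ±A   ", [-A, A], [0.5, 0.5]),
                     ("3-point     ", [-A, 0.0, A], [0.25, 0.5, 0.25]),
                     ("5-pt ~unif  ", [-A, -A / 2, 0.0, A / 2, A], [0.2] * 5)]:
    print(f"  {name}: {mmse_discrete(xs, ps, snr):.4f}")
```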
B. Worst Random Transformation

In this section, we investigate the variance-constrained additive noise that maximizes the MMSE of the input given its noisy version. It is important to note that we do not constrain the noise to be independent of the input. If the criterion is the minimization of mutual information and the additive noise is not allowed to depend on the input, the worst-case noise is known for specific input distributions such as Gaussian and binary [42].

Theorem 13: The maximal MMSE over all random transformations whose additive noise has variance not exceeding the prescribed budget is given by (110), which holds for any input X with finite mean.
Proof of Theorem 13: (Converse) The chain of inequalities (111)–(113) shows that the MMSE can exceed neither the input variance nor the noise-variance budget. (Achievability) If the budget exceeds the input variance, the bound in (110) is achieved by a deterministic output. Otherwise, consider the output of a Gaussian channel √snr X + N_G with N_G independent of X and with snr chosen so that the resulting MMSE meets the budget; such an snr always exists because mmse(X, snr) is a decreasing continuous function of snr [2, Proposition 7] which vanishes as snr → ∞. Moreover, it can be shown that the variance constraint is satisfied, even if the input variance is infinite. Denoting the relevant densities as in (118), the optimal estimator for X based on the constructed observation is given by (119)–(120), and by the orthogonality principle, (121) and (122) show that the upper bound in (110) is attained.

It is interesting to analyze the worst channel in (110) for a Gaussian input X_G with variance σ_X². When the noise-variance budget σ² does not exceed σ_X², the maximal MMSE is achieved by an attenuator followed by contamination with additive independent Gaussian noise:

Y = (1 − σ²/σ_X²) X_G + N′, with N′ ~ N(0, σ²(1 − σ²/σ_X²)) independent of X_G.   (114)

Interestingly, (114) also minimizes the mutual information subject to the same quadratic constraint, cf. (115) and (116): it coincides with the optimal forward random transformation achieving the Gaussian rate-distortion function, i.e., the backward random transformation from Y to X_G consists of additive Gaussian noise with variance σ². Nonetheless, the worst independent additive-noise channel is still Gaussian, i.e., (117), because when the noise is restricted to be independent of the Gaussian input, by (111) the problem reduces to the situation of Theorem 12 with the roles of input and noise exchanged, and the same expression applies.

APPENDIX A
PROOF OF THEOREM 1

Proof: Let P_XY and Q_XY be two joint distributions, and let B be a random variable taking values in {1, 2} with P(B = 1) = α. Let (X, Y) be distributed according to P_XY conditioned on B = 1 and according to Q_XY conditioned on B = 2. Then (X, Y) has joint distribution αP_XY + (1 − α)Q_XY, and supplying the estimator with the extra observation B can only reduce the MMSE:

mmse(αP_XY + (1 − α)Q_XY) = mmse(X|Y) ≥ mmse(X|Y, B) = α mmse(P_XY) + (1 − α) mmse(Q_XY),

which proves the desired inequality (10).

APPENDIX B
PROOF OF THEOREM 2

Proof: The concavity follows from Corollary 1. To prove strict concavity, it is sufficient to consider snr = 1. Suppose that equality holds in (10) for two input distributions and some α ∈ (0, 1), as in (126). Then, by (14), the corresponding optimal estimators coincide, i.e., (127) holds for Lebesgue-a.e. y. The density of the channel output is given by (128), which is the Weierstrass transform [7] of the input distribution. Define the likelihood ratio of the two output densities as in (129). The properties (130) and (131) of the log-likelihood ratio are proved in [43] and [44, Property 3], respectively; the solution of (130) is (132). Hence, in view of (127), (133) holds for all y. Next, we show that the two input distributions coincide. For i = 1, 2, define a probability measure according to (134). For each positive Borel function, (135)–(137) hold, where (136) follows from (138). From (129), we observe that the quantity in (134) is determined by a Laplace transform; by (133) the two transforms agree, and by (137) and the arbitrariness of the test function, the two input distributions coincide. In view of (8), the strict concavity of P_X ↦ I(X, snr) follows as well.

APPENDIX C
PROOF OF THEOREM 4

Proof of Theorem 4: Note that (139)–(141) hold; therefore, it is equivalent to prove the upper semicontinuity of the auxiliary functional appearing there. Without loss of generality, we shall assume snr = 1. Let {P_{X_n}} be a sequence of distributions converging weakly to P_X. By Skorohod's representation theorem [45, Th. 25.6], there exist random variables X_n with distributions P_{X_n}, respectively, such that X_n → X almost surely. Let N be a random variable defined on the same probability space and independent of (X, {X_n}). By the denseness of continuous bounded functions, for every ε > 0 there exists an estimator f such that the chain (142)–(144) holds, where (142) is due to the suboptimality of f and (144) follows from the dominated convergence theorem. By the arbitrariness of ε, the proof of upper semicontinuity is complete, and it remains to show (145) under the assumption that N has a continuous and bounded density. For every positive integer m, define the continuous and compactly supported function in (146). Denote the output density by (147) and, for fixed y, define the quantities in (148)–(150). Since the noise density is continuous and bounded, (149) and (150) are both continuous and bounded functions. Therefore, (151) is a sequence of nonnegative measurable functions converging pointwise to (152), and Fatou's lemma gives (153). Note that (154) holds for any K; by Lemma 1, (155) follows. Plugging (155) into (153) yields (156), which, upon sending K → ∞, gives the desired (145).

APPENDIX D
PROOF OF THEOREM 5
In view of (130) and (131), the optimal estimator in the Gaussian channel is an increasing differentiable function of the observation whose derivative is the conditional variance. First, we prove an upper bound on the conditional variance, which improves the estimate in [46, Proposition 1.2].

Lemma 4: Let the observation be the sum of the input and independent standard Gaussian noise. Then the conditional variance satisfies the bounds (157) and (158), which involve the probability density function of the observation.
Proof: Without loss of generality, fix the point at which the conditional variance is evaluated. The chain (159)–(161) applies; minimizing the upper bound in (161) over the free parameter yields (162) and (163), where (163) follows from Jensen's inequality, cf. (164) and (165). The proof of (157) and (158) is completed by observing (166).

Proof of Theorem 5: Let X and X′ be jointly distributed according to their optimal coupling, so that their mean-square difference equals W₂²(P_X, P_{X′}), cf. (167). Define the quantities in (168) and (169), where the upper bound follows from further weakening (158) by applying the inequality (170). The decomposition (171)–(172) then follows from the suboptimality of the estimator designed for X′ when applied to X. Next, we upper bound the third term in (172): fixing the relevant parameters, (173)–(175) hold, where (174) follows from (169). Therefore (176)–(180) follow, where (178) uses the definition of the coupling and (179) is Hölder's inequality. Taking the square root of both sides of (180) and applying it to (176) gives (181). Substituting (167) and (181) into (172), we obtain (182), which implies (28). The Lipschitz continuity claimed in Theorem 5 then follows from (28) on any compact set of the Wasserstein space.

APPENDIX E
PROOF OF LEMMA 3

Proof: Define the quantity in (183). In view of (44), we introduce a reference probability density via (184) and (185). Then the chain (186)–(189) holds, where (187) is by the monotonicity of g and the nonnegativity of relative entropy, (188) is by (45), and (189) is by (44), (46), and (47); this establishes the finiteness claim. Next, we establish the continuity of snr ↦ I(X, N, snr). In view of the weak lower semicontinuity of relative entropy [8], we have (190), and hence it remains to show (191). By (186), we have (192). For any⁵ snr, (193) holds; as reasoned in obtaining (188), the right-hand side of (193) is integrable. By continuity and the reverse Fatou lemma, (194) follows, and by the lower semicontinuity of relative entropy, (195) holds. Plugging (194) and (195) into (192) yields the desired (191).

⁵If snr = 0, replace the left limit snr⁻ by 0.

ACKNOWLEDGMENT

The authors are grateful to Associate Editor Dongning Guo for a careful reading of the draft and many fruitful discussions; in particular, the proof of (8) in the case of infinite mutual information in Remark 4 is due to him. The authors also thank Oliver Johnson for stimulating discussions and Ramon van Handel for pointing us to the recent results in [28].

REFERENCES
[1] Y. Wu and S. Verdú, "Functional properties of MMSE," in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 2010, pp. 1453–1457.
[2] D. Guo, Y. Wu, S. Shamai, and S. Verdú, "Estimation in Gaussian noise: Properties of the minimum mean-square error," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2371–2385, Apr. 2011.
[3] C. Villani, Optimal Transport: Old and New. Berlin, Germany: Springer-Verlag, 2008.
[4] S. Yüksel and T. Linder, "Optimization and convergence of observation channels in stochastic control," 2010 [Online]. Available: arxiv.org/abs/1009.3824
[5] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1283, Apr. 2005.
[6] V. M. Zolotarev, One-Dimensional Stable Distributions. Providence, RI: American Mathematical Society, 1986.
[7] Y. A. Brychkov and A. P.
Prudnikov, Integral Transforms of Generalized Functions. New York: Gordon and Breach, 1989. [8] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory of Large Deviations. New York: Wiley-Interscience, 1997. [9] P. J. Huber, Robust Statistics. New York, NY: Wiley-Interscience, 1981. [10] G. L. Wise, “A note on a common misconception in estimation,” Syst. Control Lett., vol. 5, pp. 355–0356, Apr. 1985. [11] Y. Wu and S. Verdú, “MMSE dimension,” IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4857–4879, Aug. 2011. [12] L. D. Brown, “Admissible estimators, recurrent diffusions, and insoluble boundary value problems,” Ann. Math. Statist., vol. 42, no. 3, pp. 855–903, 1971. [13] S. Vallender, “Calculation of the Wasserstein distance between probability distributions on the line,” Theory Probabil. Appl., vol. 18, pp. 784–786, 1974. [14] S. T. Rachev and L. Rüschendorf, Mass Transportation Problems: Vol. I: Theory. Berlin, Germany: Springer-Verlag, 1998. [15] Y. Wu, “Shannon theory for compressed sensing,” Ph.D. dissertation, Dept. Elect. Eng., Princeton Univ., Princeton, NJ, 2011. [16] D. Guo, S. Shamai, and S. Verdú, The Interplay Between Information and Estimation Measures 2011, Preprint. [17] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970. [18] J. A. Cuesta-Albertos, “Shape of a distribution through the L -Wasserstein distance,” in Proceedings of Distributions with Given Marginals and Statistical Modelling. Norwell, MA: Kluwer, 2002, pp. 51–62. [19] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2325–2383, Oct. 1998. [20] Y. Wu and S. Verdú, “The impact of constellation cardinality on Gaussian channel capacity,” in Proc. 48th Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, Oct. 1, 2010, pp. 620–628. [21] A. I. Sakhanenko, “Estimates in an invariance principle,” Proc. Inst. Math. Novosibirsk, vol. 45, no. 5, pp. 27–44, 1985. [22] E. Rio, “Upper bounds for minimal distances in the central limit theorem,” Annales de l’Institut Henri Poincaré-Probabilités et Statistiques, vol. 45, no. 3, pp. 802–817, 2009. [23] A. R. Barron, “Entropy and the central limit theorem,” Ann. Probabil., vol. 14, no. 1, pp. 336–342, 1986. [24] S. Artstein, K. Ball, F. Barthe, and A. Naor, “On the rate of convergence in the entropic central limit theorem,” Probabil. Theory Related Fields, vol. 129, no. 3, pp. 381–390, 2004. [25] O. Johnson and A. Barron, “Fisher information inequalities and the central limit theorem,” Probabil. Theory Related Fields, vol. 129, no. 3, pp. 391–409, 2004. [26] A. A. Borovkov and S. A. Utev, “On an inequality and a related characterization of the normal distribution,” Theory Probabil. Appl., vol. 28, pp. 219–228, 1984. [27] S. A. Utev, “An application of integro-differential inequalities in probability theory,” Siberian Adv. Math., vol. 2, pp. 164–199, 1992. [28] S. G. Bobkov, G. P. Chistyakov, and F. Götze, Rate of Convergence and Edgeworth-Type Expansion in the Entropic Central Limit Theorem 2011 [Online]. Available: arxiv.org/abs/1104.3994 [29] J. M. Stoyanov, Counterexamples in Probability, 2nd ed. Chichester, U.K.: Wiley, 1997. [30] T. Tao, “Szemerédi’s regularity lemma revisited,” Contributions Discrete Math., vol. 1, no. 1, pp. 8–28, 2006. [31] R. Ahlswede, “The final form of Tao’s inequality relating conditional expectation and conditional mutual information,” Adv. Math. Commun., vol. 1, no. 2, pp. 239–242, 2007. [32] I. Csiszár and J. 
Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1982. [33] R. Zamir, “A proof of the Fisher information inequality via a data processing argument,” IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 1246–1250, May 1998. 1301 [34] M. N. Ghosh, “Uniform approximation of minimax point estimates,” Ann. Math. Statist., vol. 35, no. 3, pp. 1031–1047, 1964. [35] J. G. Smith, “The information capacity of amplitude and variance-constrained scalar Gaussian channels,” Inf. Control, vol. 18, pp. 203–219, 1971. [36] E. Gourdin, B. Jaumard, and B. MacGibbon, “Global optimization decomposition methods for bounded parameter minimax risk evaluation,” SIAM J. Sci. Comput., vol. 15, pp. 16–35, 1994. [37] G. Casella and W. E. Strawderman, “Estimating a bounded normal mean,” Ann. Statist., vol. 9, pp. 870–878, 1981. [38] M. Raginsky, “On the information capacity of Gaussian channels under small peak power constraints,” in Proc. 46th Annu. Allerton Conf. Commun., Control, Comput., 2008, pp. 286–293. [39] P. J. Bickel, “Minimax estimation of the mean of a normal distribution when the parameter space is restricted,” Ann. Statist., vol. 9, pp. 1301–1309, 1981. [40] P. J. Huber, “Fisher information and spline interpolation,” Ann. Statist., vol. 2, pp. 1029–1033, 1974. [41] E. Uhrmann-Klingen, “Minimal Fisher information distributions with compact-supports,” Sankhyā: Indian J. Statist., Series A, vol. 57, pp. 360–374, 1995. [42] S. S. Shitz and S. Verdú, “Worst-case power-constrained noise for binary-input channels,” IEEE Trans. Inf. Theory, vol. 38, no. 5, pp. 1494–1511, Sep. 1992. [43] R. Esposito, “On a relation between detection and estimation in decision theory,” Inf. Control, vol. 12, pp. 116–120, 1968. [44] C. Hatsell and L. Nolte, “Some geometric properties of the likelihood ratio,” IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 616–618, Sep. 1971. [45] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995. [46] M. Fozunbal, “On regret of parametric mismatch in minimum mean square error estimation,” in Proc. 2010 IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 2010, pp. 1408–1412. Yihong Wu (S’10) received the B.E. degree from Tsinghua University, Beijing, China, in 2006 and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, in 2008 and 2011, all in electrical engineering. He is currently a postdoctoral fellow with the Statistics Department, The Wharton School, University of Pennsylvania, Philadelphia, PA. His research interests are in information theory, statistical decision theory, high-dimensional inference, approximation theory and fractal geometry. Dr. Wu is a recipient of the Princeton University Wallace Memorial Honorific Fellowship in 2010. He was a recipient of 2011 Marconi Society Paul Baran Young Scholar Award and the Best Student Paper Award at the 2011 IEEE International Symposiums on Information Theory (ISIT). Sergio Verdú (S’80–M’84–SM’88–F’93) received the telecommunications engineering degree from the Universitat Politènica de Barcelona, Barcelona, Spain, in 1980, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Urbana, in 1984. Since 1984, he has been a member of the faculty of Princeton University, Princeton, NJ, where he is the Eugene Higgins Professor of Electrical Engineering. Dr. Verdú is the recipient of the 2007 Claude E. Shannon Award and the 2008 IEEE Richard W. Hamming Medal. 
He is a member of the National Academy of Engineering and was awarded a Doctorate Honoris Causa from the Universitat Politècnica de Catalunya in 2005. He is a recipient of several paper awards from the IEEE: the 1992 Donald Fink Paper Award, the 1998 Information Theory Outstanding Paper Award, an Information Theory Golden Jubilee Paper Award, the 2002 Leonard Abraham Prize Award, the 2006 Joint Communications/Information Theory Paper Award, and the 2009 Stephen O. Rice Prize from the IEEE Communications Society. He has also received paper awards from the Japanese Telecommunications Advancement Foundation and from EURASIP. He received the 2000 Frederick E. Terman Award from the American Society for Engineering Education for his book Multiuser Detection. He served as President of the IEEE Information Theory Society in 1997 and is currently Editor-in-Chief of Foundations and Trends in Communications and Information Theory.