IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 58, NO. 3, MARCH 2012
Functional Properties of Minimum Mean-Square
Error and Mutual Information
Yihong Wu, Student Member, IEEE, and Sergio Verdú, Fellow, IEEE
Abstract—In addition to exploring its various regularity properties, we show that the minimum mean-square error (MMSE)
is a concave functional of the input–output joint distribution.
In the case of additive Gaussian noise, the MMSE is shown to
be weakly continuous in the input distribution and Lipschitz
continuous with respect to the quadratic Wasserstein distance for
peak-limited inputs. Regularity properties of mutual information
are also obtained. Several applications to information theory and
the central limit theorem are discussed.
Index Terms—Bayesian statistics, central limit theorem,
Gaussian noise, minimum mean-square error (MMSE), mutual
information, non-Gaussian noise.
I. INTRODUCTION
MONOTONICITY, convexity, and infinite differentiability of the minimum mean-square error (MMSE) in Gaussian noise as a function of the signal-to-noise ratio (SNR) have been shown in [2]. In contrast, this paper deals with the functional aspects of MMSE, i.e., as a function of the input–output joint distribution P_XY, and in particular, as a function of the input distribution P_X when P_{Y|X} is fixed. We devote special attention to additive Gaussian noise.
The MMSE is a functional of the input–output joint distribution P_XY defined on (R², B(R²)), where B(R²) denotes the Borel σ-algebra, or equivalently of the pair (P_X, P_{Y|X}). Define

    mmse(X | Y) = E[(X − E[X | Y])²]    (1)
                = mmse(P_X, P_{Y|X})    (2)
                = mmse(P_XY).           (3)

These notations will be used interchangeably. When Y is related to X through an additive-noise channel with gain √snr, i.e., Y = √snr X + N, where N is independent of X, we denote

    mmse(X, snr, N) = mmse(X | √snr X + N)    (4)
    mmse(X, snr) = mmse(X, snr, N_G)          (5)

where N_G is standard Gaussian distributed. Similarly, we denote the mutual information by

    I(X, snr, N) = I(X; √snr X + N)    (6)
    I(X, snr) = I(X, snr, N_G).        (7)
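To make the notation above concrete, here is a short numerical sketch (ours, not part of the paper): it estimates mmse(X, snr) by Monte Carlo for an equiprobable input X ∈ {−1, +1} in standard Gaussian noise, for which the conditional mean in (1) is E[X | Y = y] = tanh(√snr · y). Function names and parameters are illustrative.

```python
import numpy as np

def mmse_bpsk_mc(snr, n=200_000, seed=0):
    """Monte Carlo estimate of mmse(X, snr) for X uniform on {-1, +1},
    with Y = sqrt(snr) * X + N and N ~ N(0, 1) independent of X."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    x_hat = np.tanh(np.sqrt(snr) * y)      # conditional mean E[X | Y = y]
    return np.mean((x - x_hat) ** 2)

for snr in [0.1, 1.0, 10.0]:
    print(snr, mmse_bpsk_mc(snr))
```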
In Section II, we study various concavity properties of the MMSE functional defined in (3)–(5). Unlike the mutual information I(P_X, P_{Y|X}), which is concave in P_X, convex in P_{Y|X}, but neither convex nor concave in P_XY, we show that the MMSE functional mmse(P_XY) is concave in the joint distribution P_XY, hence concave individually in P_X when P_{Y|X} is fixed, and in P_{Y|X} when P_X is fixed. However, mmse(P_X, P_{Y|X}) is neither concave nor convex in the pair (P_X, P_{Y|X}).
Various regularity properties of MMSE are explored in Section III. In particular, we show that:
1) the MMSE functional is weakly lower semicontinuous (l.s.c.) in the noise distribution but not continuous in general;
2) when P_{Y|X} has a continuous and bounded density, P_X ↦ mmse(P_X, P_{Y|X}) is weakly continuous;
3) when the noise is Gaussian and X has a finite moment of a certain order, P_X ↦ mmse(X, snr) is Lipschitz continuous with respect to the Wasserstein distance [3].
For more general distortion functions (not necessarily mean square), concavity and lower semicontinuity properties of the optimal distortion as a function of the input distribution or channel statistics have recently been studied in [4] in the context of stochastic control.
Via the I-MMSE relationship¹ [5]

    I(X, snr) = (1/2) ∫₀^snr mmse(X, γ) dγ,    (8)

regularities of MMSE are inherited by the mutual information when the input power is bounded. Several applications of these continuity properties are given in Section IV.
1) Using the weak continuity of MMSE, we prove that the differential version of the I-MMSE relationship

    d/d snr I(X, snr) = (1/2) mmse(X, snr)    (9)

holds for all snr as long as the mutual information is finite, thus dropping the finite-variance condition imposed on X in [5, Th. 1].
2) The Lipschitz continuity of MMSE enables us to gauge the gap between the Gaussian channel capacity and the mutual information achieved by a given input by computing its Wasserstein distance to the Gaussian distribution.
3) We give upper bounds on the convergence rate (in terms of relative entropy) in the central limit theorem for densities obtained as the convolution of a Gaussian with another distribution, under various moment assumptions.
4) We give an example of the central limit theorem in the sense of weak convergence where the non-Gaussianness is finite but does not vanish.

¹Throughout this paper, natural logarithms are adopted and information units are nats.

Manuscript received December 29, 2010; accepted September 02, 2011. Date of current version February 29, 2012. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory.
Y. Wu was with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA. He is now with the Department of Statistics, Wharton School of the University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: [email protected]).
S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).
Communicated by D. Guo, Associate Editor for Shannon Theory.
Digital Object Identifier 10.1109/TIT.2011.2174959
We also give a new bound on the decrease of MMSE due to
additional observations in terms of the mutual information.
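As a quick sanity check of the I-MMSE relationship (8)–(9), the following sketch (ours, not from the paper) verifies the integral identity in the one case where everything is available in closed form: for a standard Gaussian input, mmse(X, γ) = 1/(1 + γ) and I(X, snr) = ½ log(1 + snr), so both sides of (8) can be evaluated directly.

```python
import numpy as np
from scipy.integrate import quad

def mmse_gaussian(gamma):
    # mmse(X, gamma) for X ~ N(0, 1) in standard Gaussian noise
    return 1.0 / (1.0 + gamma)

snr = 3.0
lhs = 0.5 * np.log(1.0 + snr)                              # I(X, snr) in nats
rhs, _ = quad(lambda g: 0.5 * mmse_gaussian(g), 0.0, snr)  # right side of (8)
print(lhs, rhs)   # both approximately 0.6931
```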
In Section V, we discuss the data processing inequality associated with MMSE, which implies that mmse(X, snr, N) is decreasing in snr for those N with a stable distribution,² e.g., Gaussian.

²A distribution P is called stable if, for X₁, X₂ independent and identically distributed according to P and any a, b ∈ R, the random variable aX₁ + bX₂ has the same distribution as cX₁ + d for some c, d ∈ R [6, p. 6].
In Section VI, we present relevant results on the extremization of the MMSE functional with Gaussian inputs and/or
Gaussian noise. While the least favorable input with additive
Gaussian noise is Gaussian, the worst random transformation
faced by Gaussian inputs is an attenuation followed by additive
Gaussian noise, which coincides with the optimal forward
random transformation achieving the Gaussian rate-distortion
function, i.e., the backward random transformation is an additive Gaussian noise. Nonetheless, the worst additive-noise
channel is still Gaussian. We also discuss the MMSE-maximizing input distribution under an amplitude constraint and its
discrete nature.
II. CONCAVITY
Theorem 1: For any α ∈ [0, 1] and any joint distributions P_XY and Q_XY with mixture R_XY = α P_XY + (1 − α) Q_XY,

    mmse(R_XY) = α mmse(P_XY) + (1 − α) mmse(Q_XY)
                 + α E_P[(E_P[X|Y] − E_R[X|Y])²] + (1 − α) E_Q[(E_Q[X|Y] − E_R[X|Y])²],    (10)

where E_P, E_Q, and E_R denote expectation (and conditional expectation) under P_XY, Q_XY, and R_XY, respectively. Consequently, mmse(P_XY) is a concave functional of P_XY.
Proof: See Appendix A.

Corollary 1: mmse(P_X, P_{Y|X}) is individually concave in each of its arguments when the other one is fixed.

Remark 1: MMSE is not concave in the pair (P_X, P_{Y|X}). We illustrate this point by the following example: for i = 1, 2, let Y_i = X_i + N_i, where X_i and N_i are independent and equiprobable on two-point sets. Let Y = X + N, where the distribution of X (resp. N) is the equal mixture of those of X_1 and X_2 (resp. N_1 and N_2). Then

    (11)
    (12)

which shows that concavity in the pair fails.

Corollary 2 (Nonstrict Concavity): In general, MMSE is not strictly concave.
Proof: According to (10), equality

    mmse(R_XY) = α mmse(P_XY) + (1 − α) mmse(Q_XY)    (13)

holds for all α ∈ [0, 1] if and only if the optimal estimators coincide:

    E_P[X | Y = y] = E_Q[X | Y = y]    (14)

for P_Y-a.e. and Q_Y-a.e. y, where P_Y and Q_Y denote the marginals of Y under P_XY and Q_XY, respectively. Therefore, instances of nonstrict concavity can be established by constructing pairs of distributions which give rise to the same optimal estimator. Consider the following examples.
1) Let Y = X + N, where X and N are i.i.d. By symmetry, E[X | Y] = E[N | Y]. Then, since E[X | Y] + E[N | Y] = Y, the optimal estimator is given by E[X | Y] = Y/2, regardless of the distribution of X. Therefore, in this case, the mapping P_XY ↦ mmse(P_XY) is affine between those joint distributions.
2) Let X₁ and X₂ be independent and standard Gaussian, and denote by P_XY and Q_XY the joint distributions of two suitably chosen pairs built from X₁ and X₂. The optimal estimators of X under P_XY and Q_XY are then identical. Therefore, mmse is not strictly concave.
3) Let Y = X + N, where N is independent of X and equiprobable Bernoulli. Consider two densities of X:

    (15)
    (16)

where φ denotes the standard normal density. It can be shown that the optimal estimators for (15) and (16) are the same:

    (17)

Hence, the MMSE functional for this channel is the same for any mixture of (15) and (16).

Despite the nonstrict concavity in P_X for general P_{Y|X}, in the special case of additive Gaussian noise, MMSE is indeed a strictly concave functional of P_X, as shown next. The proof exploits the relationship between the optimal estimator in Gaussian channels and the Weierstrass transform [7] of the input distribution.

Theorem 2: For fixed snr > 0, P_X ↦ mmse(X, snr) and P_X ↦ I(X, snr) are both strictly concave.
Proof: See Appendix B.
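The concavity asserted by Theorem 1 and Corollary 1 can be observed numerically; the following sketch is ours and not part of the paper. We mix two Gaussian input laws in a Gaussian channel: the MMSE of the mixture input is at least the corresponding mixture of the two MMSEs, which for an N(0, σ²) input equals σ²/(1 + σ² snr).

```python
import numpy as np
from scipy.stats import norm

def mmse_gauss_component(sigma2, snr):
    # closed form: mmse(N(0, sigma2), snr) = sigma2 / (1 + sigma2 * snr)
    return sigma2 / (1.0 + sigma2 * snr)

def mmse_mixture_mc(alpha, s1, s2, snr, n=400_000, seed=0):
    """Monte Carlo mmse(X, snr) for X ~ alpha*N(0,s1) + (1-alpha)*N(0,s2)."""
    rng = np.random.default_rng(seed)
    comp = rng.random(n) < alpha
    x = np.where(comp, np.sqrt(s1), np.sqrt(s2)) * rng.standard_normal(n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    # posterior weight of each component given y, and component-wise E[X|y]
    w1 = alpha * norm.pdf(y, scale=np.sqrt(1 + snr * s1))
    w2 = (1 - alpha) * norm.pdf(y, scale=np.sqrt(1 + snr * s2))
    m1 = np.sqrt(snr) * s1 * y / (1 + snr * s1)
    m2 = np.sqrt(snr) * s2 * y / (1 + snr * s2)
    x_hat = (w1 * m1 + w2 * m2) / (w1 + w2)
    return np.mean((x - x_hat) ** 2)

alpha, s1, s2, snr = 0.5, 0.25, 4.0, 1.0
mix = mmse_mixture_mc(alpha, s1, s2, snr)
avg = alpha * mmse_gauss_component(s1, snr) + (1 - alpha) * mmse_gauss_component(s2, snr)
print(mix, ">=", avg)   # concavity in P_X: mixture MMSE dominates the average
```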
III. REGULARITY OF MMSE
A. Continuity and Semicontinuity
In general, the functional P_XY ↦ mmse(P_XY) is not weakly semicontinuous. To see this, consider a sequence of joint distributions that converges in distribution to a limit with strictly positive MMSE while the MMSE vanishes along the sequence. Thus

    (18)

and therefore mmse(P_XY) is not lower semicontinuous (l.s.c.) in P_XY. On the other hand, consider

    (19)

which also converges in distribution, while the MMSE along the sequence exceeds that of the limit; hence mmse(P_XY) is not u.s.c. either:

    (20)

Nevertheless, assuming either bounded input or additive noise, MMSE is indeed a weakly u.s.c. functional.

Theorem 3: Let K ⊂ R be bounded. Denote the collection of all Borel probability measures on K by P(K). Then, the MMSE functional is weakly u.s.c. on joint distributions with X supported in K.
Proof: Variational representations prove to be effective tools in proving semicontinuity and convexity of information measures (for example, relative entropy [8], Fisher information [9], etc.). Here, we follow the same approach by using the following variational characterization of MMSE³:

    mmse(X | Y) = inf_{f ∈ B} E[(X − f(Y))²]      (21)
                = inf_{f ∈ C_b} E[(X − f(Y))²],    (22)

where B and C_b denote the collections of all real-valued Borel and continuous bounded functions on R, respectively, and (22) is due to the denseness of C_b in L²(P_Y). For a fixed estimator f ∈ C_b,

    P_XY ↦ E[(X − f(Y))²]    (23)

is weakly continuous in P_XY, since (x, y) ↦ (x − f(y))² is continuous and bounded on K × R. Therefore, by (22), the MMSE is weakly u.s.c. because it is the pointwise infimum of weakly continuous functions. In view of the counterexample in (20), we see that the boundedness assumption on X is not superfluous.

Remark 2: The variational representation of MMSE in (22) provides an alternative proof of its concavity as follows: since, for any f, P_XY ↦ E[(X − f(Y))²] is affine in P_XY, mmse(P_XY) is concave because it is the pointwise infimum of affine functions. This proof does not rely on the boundedness of X. Hence, we could set K = R.

³The Borel measurability of estimators in (22) is not superfluous. For example [10], it is possible to construct a random variable Y and a nonmeasurable function f̂ such that X = f̂(Y) is a random variable independent of Y. Then, E[(X − E[X|Y])²] = var(X), while E[(X − f̂(Y))²] = 0.

Theorem 4: Let Y = X + N with N independent of X. Then, for any P_N, P_X ↦ mmse(X | X + N) is weakly u.s.c. In addition, if N has a continuous and bounded density, then P_X ↦ mmse(X | X + N) is weakly continuous.
Proof: See Appendix C.

Remark 3: Theorem 4 cannot be extended to the dependence on the noise, i.e., P_N ↦ mmse(X | X + N), which is weakly l.s.c. but not continuous, as the example in (19) illustrates. Moreover, P_X ↦ mmse(X | X + N) need not be weakly continuous if the sufficient conditions in Theorem 4 are not satisfied. For example, suppose that X and N are both equiprobable Bernoulli. Let X_n = a_n X, where (a_n) is a sequence of irrational numbers converging to 1. Then, X_n → X in distribution and mmse(X_n | X_n + N) = 0 for all n (the irrational scaling allows X_n to be recovered exactly from X_n + N), but mmse(X | X + N) > 0. This also shows that, under the condition of Theorem 3, the MMSE need not be weakly continuous in P_X.

Corollary 3: For fixed snr ≥ 0, P_X ↦ mmse(X, snr) is weakly continuous.

Corollary 3 guarantees that the MMSE of a random variable can be calculated using the MMSE of its successively finer discretizations, which paves the way for numerically calculating MMSE for singular inputs (e.g., the Cantor distribution) in [11]. However, one caveat is that to calculate the value of MMSE within a given accuracy, the quantization level needs to grow with snr, such that the quantization error is much smaller than the noise.

In view of the representation of the MMSE by the Fisher information of the channel output with additive Gaussian noise [5, eq. (58)] (known as Brown's identity in the statistics literature [12, eq. (1.3.4)]):

    mmse(X, snr) = (1/snr) (1 − J(√snr X + N_G)),    (24)

Corollary 3 implies the weak continuity of P_X ↦ J(√snr X + N_G). While Fisher information is only l.s.c. [9, p. 79], here the continuity is due to convolution with the Gaussian density.
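Corollary 3 can be visualized numerically; the following sketch is ours, not from [11]. We replace a standard Gaussian input with an m-point quantization of it; as m grows, the MMSE of the quantized input in Gaussian noise approaches mmse(X, snr) = 1/(1 + snr). For a discrete input the conditional mean is a ratio of finite Gaussian mixtures, so the computation is exact up to Monte Carlo error in the outer expectation.

```python
import numpy as np
from scipy.stats import norm

def mmse_discrete_mc(points, probs, snr, n=100_000, seed=1):
    """Monte Carlo mmse(X, snr) for a discrete input with given atoms/probs."""
    rng = np.random.default_rng(seed)
    x = rng.choice(points, p=probs, size=n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    # E[X|Y=y] = sum_i p_i x_i phi(y - sqrt(snr) x_i) / sum_i p_i phi(y - sqrt(snr) x_i)
    lik = probs * norm.pdf(y[:, None] - np.sqrt(snr) * points[None, :])
    x_hat = (lik * points[None, :]).sum(axis=1) / lik.sum(axis=1)
    return np.mean((x - x_hat) ** 2)

snr = 2.0
for m in [2, 4, 8, 32]:
    # quantize N(0,1) into m equiprobable cells; each atom is the cell conditional mean
    edges = norm.ppf(np.linspace(0, 1, m + 1))
    atoms = np.array([(norm.pdf(a) - norm.pdf(b)) * m for a, b in zip(edges[:-1], edges[1:])])
    probs = np.full(m, 1.0 / m)
    print(m, mmse_discrete_mc(atoms, probs, snr))
print("limit 1/(1+snr) =", 1 / (1 + snr))
```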
B. Lipschitz Continuity

Seeking a finer characterization of the modulus of continuity of P_X ↦ mmse(X, snr), we introduce the Wasserstein distance [3].

Definition 1: For p ≥ 1, the Wasserstein space of order p on R is defined as the collection of all Borel probability measures with finite pth-order moments, denoted by P_p(R). The Wasserstein distance of order p (W_p distance) is a metric on P_p(R), defined for P, Q ∈ P_p(R) as

    W_p(P, Q) = ( inf E[|X − Y|^p] )^{1/p},    (25)

where the infimum is over all joint distributions of (X, Y) with marginals P and Q. On the real line, the W_p distance coincides with the L_p distance between the quantile functions of the two distributions [13], [14]:

    W_p(P, Q) = ( ∫₀¹ |F_P⁻¹(t) − F_Q⁻¹(t)|^p dt )^{1/p},    (26)
where F_P denotes the cumulative distribution function of P. By Hölder's inequality,

    (27)

The W_p distance metrizes the topology of weak convergence plus convergence of pth-order moments. Because, in general, convergence in distribution does not yield convergence of moments, this topology is strictly finer than the weak topology.
Since W₂ convergence implies convergence of variance, in view of Corollary 3, P_X ↦ mmse(X, snr) is continuous on the metric space (P₂(R), W₂) for all snr ≥ 0. Capitalizing on the smoothness of the optimal estimator with Gaussian noise, the Lipschitz continuity of P_X ↦ mmse(X, snr) can be established as follows.
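Before stating the result, here is a small illustration in the spirit of this W₂-continuity (ours, with illustrative names, and making no claim about the exact Lipschitz constant of the theorem below): we compute the W₂ distance between two bounded discrete inputs via the sorted-sample version of (26), and compare it with the resulting change in mmse(·, snr) under Gaussian noise.

```python
import numpy as np
from scipy.stats import norm

def w2_empirical(xs, ys):
    """Empirical W2 distance via sorted samples (quantile coupling on the real line)."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

def mmse_discrete(points, probs, snr, n=200_000, seed=2):
    rng = np.random.default_rng(seed)
    x = rng.choice(points, p=probs, size=n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    lik = probs * norm.pdf(y[:, None] - np.sqrt(snr) * points[None, :])
    x_hat = (lik * points[None, :]).sum(axis=1) / lik.sum(axis=1)
    return np.mean((x - x_hat) ** 2)

snr = 1.0
p = np.array([0.5, 0.5])
a = np.array([-1.0, 1.0])      # bounded input X
b = np.array([-0.9, 1.1])      # a small, still bounded, perturbation of X
rng = np.random.default_rng(3)
w2 = w2_empirical(rng.choice(a, p=p, size=100_000), rng.choice(b, p=p, size=100_000))
gap = abs(mmse_discrete(a, p, snr) - mmse_discrete(b, p, snr))
print("W2 ≈", w2, " |Δ mmse| ≈", gap)
```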
Theorem 5: For any snr > 0 and any bounded random variables X and X̃,

    (28)

Consequently, P_X ↦ mmse(X, snr) is W₂-Lipschitz continuous on any compact set in (P₂(R), W₂) for any snr > 0.
Proof: See Appendix D.

Finally, we present two results which frequently enable us to approximate the MMSE of a given input using its truncated version, where the approximation error is uniform in the random transformation.

Lemma 1: For any X and X̃,

    (29)

Proof:

    (30)
    (31)
    (32)

Interchanging the roles of X and X̃, (29) follows.

Lemma 2: Let X be distributed according to P_X. For a > 0, denote by P_{X_a} the distribution of X conditioned on the event {|X| ≤ a}, i.e., P_{X_a}(B) = P_X(B ∩ [−a, a]) / P_X([−a, a]) for any measurable set B. Then, for any P_{Y|X},

    (33)
    (34)

Proof of Lemma 2: Let X_a be distributed according to P_{X_a} and let Y_a be the output of P_{Y|X} when the input is X_a. The joint distribution of (X_a, Y_a) is equal to that of (X, Y) conditioned on the event {|X| ≤ a}, i.e.,

    (35)
    (36)

Since the optimal estimator of X given Y is a suboptimal estimator of X_a given Y_a, we have

    (37)
    (38)
    (39)
    (40)

which implies (33). To show (34), note that

    (41)
    (42)
    (43)

IV. APPLICATIONS TO MUTUAL INFORMATION

A. Finite Mutual Information and the I-MMSE Relationship

Capitalizing on the weak continuity of MMSE proved in Theorem 4, we prove that the I-MMSE relationship holds as long as the mutual information is finite, thus removing the finite-variance condition imposed on the input in [5, Th. 1].

Theorem 6: For any random variable X, the following are equivalent:
a) I(X, snr) is finite for some snr > 0;
b) I(X, snr) is finite for all snr > 0;
c)
Furthermore, if the mutual information is finite, then (9) holds for all snr > 0.

It should be remarked that the finiteness condition of Theorem 6 is much weaker than square integrability of the input, as the following sufficient condition shows, which also applies to non-Gaussian noise.

Lemma 3: Let Y be distributed according to P_{Y|X} when the input is X, and suppose Y has a density. Let φ be an increasing continuous function that satisfies the following conditions:
1)

    (44)

2) For every ε > 0, there exists δ > 0 such that

    (45)

If

    (46)
    (47)

then
1) the mutual information is finite for all snr;
2) snr ↦ I(X, snr, N) is continuous.
Proof: See Appendix E.

Examples of functions that satisfy (45) include:
1) φ is convex;
2) φ is increasing and subadditive (e.g., φ concave with φ(0) = 0).
Consequently, any convex, increasing, and nonconstant φ satisfies both (44) and (45). Particularizing to Gaussian noise, choosing φ to be quadratic implies that E[X²] < ∞ is sufficient for the mutual information to be finite. In fact, choosing φ to be nested logarithmic, we obtain a family of sufficient conditions much weaker than square integrability: define

    (48)

in terms of the kth tetration (iterated exponential). Then, finiteness of the corresponding nested-logarithmic moment of X for some k implies that I(X, snr) is finite for all snr.

Proof of Theorem 6: The equivalence between a) and c) has been shown in [15, Th. 1]. To prove that a) implies b), assume that the mutual information is finite for some snr₀. By monotonicity, it is sufficient to consider snr > snr₀. Since the corresponding channel outputs form a Markov chain,⁴ we have

    (49)
    (50)
    (51)
    (52)

where (50) follows from the concavity of the logarithm. Therefore, the mutual information is finite for all snr. Next, we show that for any snr,

    (53)

For a > 0, let X_a be a random variable distributed according to the distribution of X conditioned on the event {|X| ≤ a}, i.e.,

for any Borel subset. This is well defined because the denominator is positive for sufficiently large a. Since X_a is bounded, we have

    (54)

Since X_a → X in distribution, by the weak continuity of MMSE proved in Corollary 3 the corresponding MMSEs converge. Since the integrand is dominated by a function which is integrable on the interval of integration, applying the dominated convergence theorem to (54) yields

    (55)

To establish (53), it remains to show

    (56)

By the lower semicontinuity of relative entropy, it is sufficient to show

    (57)

Note that

    (58)
    (59)
    (60)
    (61)

which implies the desired (57). In view of the continuity of snr ↦ mmse(X, snr) on (0, ∞) established in [2, Proposition 7], differentiating (53) yields (9) for every snr > 0.

⁴X − Y − Z means that X and Z are conditionally independent given Y, i.e., (X, Y, Z) is a Markov chain.

Remark 4: It can be shown that the integral I-MMSE relationship (8) holds for all X. First, consider the case of finite mutual information. Sending a → ∞ in (53) and applying the monotone convergence theorem, we have

    (62)

Therefore, establishing (8) for any X amounts to proving the continuity of the mutual information at snr = 0, which has been established in [16] for arbitrary X with finite mutual information. In the case of infinite mutual information, the integral I-MMSE relationship (8) still holds in the sense that both sides are infinity. To see this, let X_a be distributed according to the conditional distribution defined in Lemma 2. Then

    (63)
    (64)
    (65)
    (66)

where
1) Equation (64): by (34);
2) Equation (65): by (8), since X_a is bounded;
3) Equation (66): by lower semicontinuity, which is inherited from the lower semicontinuity of relative entropy.
B. Regularity of Mutual Information
Note that, unlike the finite-dimensional setting (e.g., [17]), the concavity of mutual information does not imply continuity. In view of the weak lower semicontinuity of relative entropy [8], the mutual information is weakly l.s.c. in the input distribution but not continuous in general, as the following example illustrates: let

    (67)

which converges weakly to P_X regardless of the choice of the parameters. Using the concavity of the mutual information and the dominated convergence theorem, it can be shown that for any X with finite mutual information,

    (68)

but the mutual information is not continuous along this sequence.
Nevertheless, mutual information is indeed weakly continuous in the input distribution if the input variance is bounded and the additive noise is Gaussian. Applying Corollary 3 and the dominated convergence theorem to (8), we obtain the following.

Theorem 7: If X_n → X in distribution and the variances of X_n are uniformly bounded, then I(X_n, snr) → I(X, snr) for any snr ≥ 0.

By the W₂-continuity of MMSE (in Theorem 5), I(X, snr) is also W₂-continuous. Moreover, Lipschitz continuity holds when the input is peak-limited, as the following result, which is obtained by integrating both sides of (28) and using (8), shows.

Corollary 4: Under the conditions in Theorem 5, for any snr > 0,

    (69)

Consequently, P_X ↦ I(X, snr) is W₂-Lipschitz continuous on any compact set in (P₂(R), W₂) for any snr > 0.

In fact, W₂-continuity also carries over to non-Gaussian noise.

Theorem 8: Let X_n → X in W₂, i.e., X_n → X in distribution and second-order moments also converge. Then, whenever the non-Gaussianness is finite,

    (70)

Proof: Note that

    (71)

Since variance converges under W₂ convergence, the upper semicontinuity follows; the lower semicontinuity follows, as we mentioned before, from the lower semicontinuity of relative entropy.

As a consequence of Theorem 7, we can restrict inputs to a weakly dense subset (e.g., discrete distributions) in the maximization

    (72)

with the input distribution replaced by a discrete one. It is interesting to analyze how the gap between the capacity and the maximal mutual information achieved by unit-variance inputs taking m values closes as m grows. The W₂-Lipschitz continuity of mutual information allows us to obtain an upper bound on the convergence rate. It is known that the optimal W₂ distance between a given distribution and discrete distributions taking m values coincides with the square root of the mean-square quantization error of the optimal m-point quantizer [18], which scales according to 1/m [19]. Choosing the input to be the output of the optimal quantizer and applying a truncation argument, we conclude that the gap vanishes at least as fast as 1/m. In fact, the gap vanishes exponentially fast [20].
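As an illustration of the preceding discussion (our sketch, not the construction of [20]), one can compute the mutual information achieved by an m-point quantization of a unit-variance Gaussian input and watch the gap to the capacity ½ log(1 + snr) shrink as m grows. For a discrete input, I(X; √snr X + N) = h(Y) − ½ log(2πe) in nats, and h(Y) can be evaluated by one-dimensional quadrature of the Gaussian-mixture output density.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def mi_discrete_gaussian(points, probs, snr):
    """I(X; sqrt(snr) X + N) in nats for a discrete X; N ~ N(0,1)."""
    def out_density(y):
        return np.sum(probs * norm.pdf(y - np.sqrt(snr) * points))
    def integrand(y):
        d = out_density(y)
        return -d * np.log(d) if d > 0 else 0.0
    h_y, _ = quad(integrand, -12, 12, limit=200)   # differential entropy of the output
    return h_y - 0.5 * np.log(2 * np.pi * np.e)    # subtract h(N)

snr = 1.0
cap = 0.5 * np.log(1 + snr)
for m in [2, 4, 8, 16]:
    edges = norm.ppf(np.linspace(0, 1, m + 1))
    atoms = np.array([(norm.pdf(a) - norm.pdf(b)) * m for a, b in zip(edges[:-1], edges[1:])])
    atoms /= np.sqrt(np.sum(atoms ** 2) / m)       # renormalize to unit variance
    print(m, cap - mi_discrete_gaussian(atoms, np.full(m, 1 / m), snr))
```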
C. Applications to the Central Limit Theorem

We proceed to prove upper bounds for the convergence rate of the non-Gaussianness in the central limit theorem with i.i.d. random variables whose distribution is the convolution between a Gaussian distribution and an arbitrary distribution satisfying certain moment assumptions.

Theorem 9: Let {U_i} and {Z_i} be independent i.i.d. sequences of random variables, where the Z_i are Gaussian. Let

    (73)

denote the standardized sum of the X_i = U_i + Z_i. Then:
1) If the sixth moment of U₁ is finite, then

    (74)

2) If the moment generating function of U₁ is finite, then

    (75)

Proof: Let S_n denote the standardized sum in (73); it can be written as the sum of a Gaussian component and an independent component. Let G denote a normal random variable with the same mean and variance as S_n. According to (71), we have

    (76)
    (77)
    (78)
    (79)

where (78) follows from Corollary 4.
The Wasserstein distance between the standardized sum and the Gaussian satisfies the following.
1) If a moment of the appropriate order is finite, then [21], [22, eq. (1.5)]

    (80)

2) If the moment generating function is finite, then [22, eq. (1.6)], for any n,

    (81)

Note that the assumptions of Theorem 9 guarantee the applicability of these bounds. Applying (80) and (81) to (78), respectively, we obtain the desired results in (74) and (75).
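The setting of Theorem 9 can be explored numerically; the following sketch is ours. Take X_i as a Bernoulli component plus Gaussian noise, so that the standardized sum S_n has an explicit density (a binomial mixture of Gaussians), and evaluate the non-Gaussianness D(P_{S_n} || N(0, 1)) by quadrature to watch it decay in n (we make no claim here about matching the exact constants of (74)–(75)).

```python
import numpy as np
from scipy.stats import norm, binom
from scipy.integrate import quad

def non_gaussianness(n, sigma2=0.5):
    """D(P_Sn || N(0,1)) in nats, where Sn is the standardized sum of n i.i.d.
    X_i = B_i + Z_i, with B_i equiprobable on {0,1} and Z_i ~ N(0, sigma2)."""
    v = 0.25 + sigma2                       # Var(X_i)
    k = np.arange(n + 1)
    w = binom.pmf(k, n, 0.5)                # weights of the Gaussian mixture
    loc = (k - n / 2) / np.sqrt(n * v)
    scale = np.sqrt(sigma2 / v)
    def p(s):
        return np.sum(w * norm.pdf(s, loc=loc, scale=scale))
    def integrand(s):
        ps = p(s)
        return ps * (np.log(ps) - norm.logpdf(s)) if ps > 0 else 0.0
    d, _ = quad(integrand, -10, 10, limit=400)
    return d

for n in [1, 2, 4, 8, 16, 32]:
    print(n, non_gaussianness(n))
```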
Remark 5: Nonasymptotic upper bounds on the non-Gaussianness can be obtained by combining (78) with the nonasymptotic results on the Wasserstein distance in [22, Th. 2.1 and eq. (1.6)]. Theorem 9 can also be generalized to non-i.i.d. sequences by applying [22, Th. 4.1].

Remark 6: The asymptotics of the non-Gaussianness were studied in [23], where it was shown that it vanishes if and only if it is finite for some n. Finer analysis in [24] and [25] showed that if the summands have a finite Poincaré constant [25, Definition 1.2], the non-Gaussianness vanishes at the optimal rate. However, finiteness of the Poincaré constant implies the existence of moments of all orders [26], while (74) only assumes the existence of the sixth moment. In the special case of compactly supported summands, [27, Th. 1.8] implies that the Poincaré constant is finite. Therefore, in this case, the non-Gaussianness vanishes at the optimal rate, which is stronger than (75).
After the submission of this paper, we became aware of the results in [28, Th. 1.1], which established the exact asymptotics of the non-Gaussianness using Edgeworth-type expansions: let the sequence be defined as in (73). Then

    (82)

which improves Theorem 9 significantly.

To conclude this subsection, we give an example where the non-Gaussianness of a sequence of absolutely continuous distributions does not vanish in the central limit theorem. Consider the following example [29, eq. (17.4)]: let there be a sequence of independent random variables for which direct computation of characteristic functions reveals convergence in distribution of the normalized sums. Since the relevant variance is bounded, Theorem 7 applies. In view of (71), we have

    (83)
    (84)

D. Bounds on MMSE via Mutual Information

To conclude this section, we present a result related to Tao's inequality [30], [31], which shows that the contribution of an additional observation in estimating X never exceeds half of the mutual information between the observations.

Theorem 10: Let X take values in the unit ball of a Euclidean space. Then

    (85)

In particular,

    (86)

Proof:

    (87)
    (88)
    (89)
    (90)
    (91)
    (92)

where (90) makes use of the fact that X belongs to the unit ball, (91) follows from the Kullback–Csiszár–Kemperman–Pinsker inequality (e.g., [32]), and (89) follows from

    (93)

To verify (93), consider the normalized version of the relevant vector and denote by a separate symbol its absolute value taken componentwise. Then

    (94)
    (95)
    (96)
    (97)

where (96) follows from Jensen's inequality and the convexity of norms.

V. DATA PROCESSING INEQUALITY

In [33], it is pointed out that, like mutual information, MMSE satisfies a data processing inequality. We prove this property along with a necessary and sufficient condition for equality to hold.

Theorem 11: If X − Y − Z, then

    mmse(X | Y) ≤ mmse(X | Z),    (98)
with equality if and only if E[X | Y] = E[X | Z] a.e.
Proof: By the orthogonality principle and the Markov property,

    (99)

Corollary 5: For any X, mmse(X, snr) is decreasing in snr.
Proof: The monotonicity is a consequence of Theorem 11: for any snr₁ > snr₂ > 0 and suitably coupled noises, X − (√snr₁ X + N₁) − (√snr₂ X + N₂) forms a Markov chain.

Using the same reasoning, we can conclude that, for any N with a stable law, mmse(X, snr, N) is monotonically decreasing in snr for all X.
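The data processing inequality of Theorem 11 can be checked in closed form in the jointly Gaussian case (our illustration): for X ~ N(0, 1), Y = X + N₁ and Z = Y + N₂ with independent standard Gaussian noises, X − Y − Z is a Markov chain, E[X | Y] = Y/2, E[X | Z] = Z/3, and mmse(X | Y) = 1/2 ≤ 2/3 = mmse(X | Z).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)          # Y = X + N1
z = y + rng.standard_normal(n)          # Z = Y + N2, so X - Y - Z is Markov
# jointly Gaussian, so conditional means are linear: E[X|Y] = Y/2, E[X|Z] = Z/3
mmse_xy = np.mean((x - y / 2) ** 2)     # closed form: 1/2
mmse_xz = np.mean((x - z / 3) ** 2)     # closed form: 2/3
print(mmse_xy, "<=", mmse_xz)           # data processing: mmse(X|Y) <= mmse(X|Z)
```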
It should be noted that, unlike the data processing inequality for mutual information, equality in (98) does not imply that Y is a sufficient statistic of Z for X. Consider the following example of a multiplicative channel. Let X and two auxiliary random variables be independent, square integrable, and zero mean, and let Y and Z be obtained by successive multiplications. Then

    (100)

and the analogous identity holds for Z. Hence, by Theorem 11, equality holds in (98). However, sufficiency does not hold.

VI. MAXIMIZATION OF MMSE

In this section, we consider the maximization of MMSE over a convex set of input distributions for fixed P_{Y|X}, and vice versa.

A. Worst Input Distribution

Theorem 12 ([2, Proposition 12]): Let N be standard Gaussian, independent of X, and suppose that var(X) ≤ σ². Then

    mmse(X, snr) ≤ σ² / (1 + σ² snr),    (101)

where the maximum is achieved if and only if X is Gaussian with variance σ² (for some mean).

Through the I-MMSE relationship (8), Theorem 12 provides an alternative explanation for the optimality of Gaussian inputs in Gaussian channels, because the integrand in (8) is maximized pointwise. Consequently, to approximate the capacity-achieving Gaussian distribution under some given constraints, it is equivalent to find inputs whose MMSE profile approximates the Gaussian MMSE. This observation has been exploited in Section IV to estimate how discrete inputs approach the Gaussian channel capacity.

Theorem 12 states that the least favorable input distribution under the variance constraint is Gaussian. However, under the amplitude constraint, the input distribution that achieves

    (102)

is finitely supported (see, for example, [34]), a consequence of the analyticity of the Gaussian density. This phenomenon is reminiscent of the fact that the capacity-achieving input distribution for the Gaussian channel under amplitude constraints is also finitely supported [35]. In general, there are no closed-form solutions for (102) (see [36] for numerical recipes). However, when the amplitude constraint is small enough, it has been shown that the worst input distribution is binary [37]:

    (103)

Capitalizing on the I-MMSE relationship (8), this result has been used to establish that (103) also achieves

    (104)

in the same regime [38]. The behavior of the maximizer of (102) when the amplitude constraint is large has been investigated in [39], where it is shown that the suitably scaled maximizer converges in distribution, and the limiting distribution has the following density:

    (105)

Moreover,

    (106)

An intuitive explanation for (105) and (106) is the following. By (24),

    (107)
    (108)
    (109)

Suppose that the maximizers converge in distribution. Then the limit must minimize the Fisher information among all distributions supported on the constraint interval. This unique minimizer is given by (105) [39], [40]. Similarly, if a variance constraint is added to (102) in addition to the amplitude constraint, it can be shown that the maximizer converges in distribution to the density that minimizes the Fisher information among distributions supported on the constraint interval with variance not exceeding the prescribed value, which has been obtained in [41]. Compared to (105), it is interesting to observe that the capacity-achieving input distribution in (104) has a different limiting behavior: if the input achieves (104), then, suitably scaled, it converges to
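Theorem 12 can be checked numerically; the following sketch is ours. Among unit-variance inputs in a Gaussian channel, the Gaussian input attains the largest MMSE, 1/(1 + snr); for instance, the unit-variance binary input falls strictly below it.

```python
import numpy as np

def mmse_bpsk(snr, n=400_000, seed=0):
    # unit-variance binary input X in {-1, +1}; E[X | Y = y] = tanh(sqrt(snr) y)
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=n)
    y = np.sqrt(snr) * x + rng.standard_normal(n)
    return np.mean((x - np.tanh(np.sqrt(snr) * y)) ** 2)

for snr in [0.5, 1.0, 4.0]:
    print(snr, mmse_bpsk(snr), "<=", 1 / (1 + snr))   # Gaussian input is least favorable
```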
B. Worst Random Transformation
In this section, we investigate the variance-constrained additive noise that maximizes the MMSE of the input given its noisy
version. It is important to note that we do not constrain the noise
to be independent of the input. If the criterion is the minimization of mutual information and the additive noise is not allowed
to depend on the input, the worst case noise is known for specific input distributions such as Gaussian and binary [42].
Theorem 13:

    (110)

holds for any X with finite mean.
Proof of Theorem 13: (Converse)

    (111)
    (112)
    (113)

Denote the densities of the relevant distributions with respect to the reference measure by f and g, respectively. The optimal estimator for X based on Y is given by

    (118)

where

    (119)

Then, the optimal estimator for the truncated input based on its noisy observation is given by

    (120)

By the orthogonality principle,

    (121)
    (122)

(Achievability) In the trivial case the maximum is achieved directly. Otherwise, let the noise be chosen with an independent Gaussian component whose level is selected so that the variance constraint is met with equality. Such a choice always exists because the MMSE is a decreasing continuous function of the noise level [2, Proposition 7] which vanishes in the limit. Moreover, it can be shown that the construction remains valid in the boundary case. Then, this choice achieves the upper bound since

    (123)
    (124)
    (125)

It is interesting to analyze the worst channel in (110) for a Gaussian input. In that case, the maximal MMSE is achieved by an attenuator followed by contamination by additive independent Gaussian noise:

    (114)

Interestingly, (114) is the minimizer of

    (115)
    (116)

Hence, the backward random transformation consists of additive Gaussian noise whose variance equals the maximal MMSE. Nonetheless, the worst independent additive-noise channel is still Gaussian, i.e.,

    (117)

because when the noise is independent of the input, by (111), the problem reduces to the situation in Theorem 12 and the same expression applies.
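The rate-distortion interpretation of the worst channel for a Gaussian input can be verified directly; the sketch below is ours and relies on standard Gaussian rate-distortion facts rather than on the paper's displayed equations. For X ~ N(0, σ²) and distortion D < σ², the forward test channel X̂ = (1 − D/σ²) X + W with W ~ N(0, D(1 − D/σ²)) independent of X satisfies E[X | X̂] = X̂ and mmse(X | X̂) = D, i.e., the backward channel is additive Gaussian noise with variance D.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, D, n = 2.0, 0.5, 400_000
a = 1 - D / sigma2
x = rng.standard_normal(n) * np.sqrt(sigma2)
xhat = a * x + rng.standard_normal(n) * np.sqrt(D * a)   # forward test channel
# backward channel: X = Xhat + Z with Z ~ N(0, D) independent of Xhat
print(np.mean((x - xhat) ** 2))      # ~ D = 0.5, since E[X | Xhat] = Xhat
print(np.cov(x - xhat, xhat)[0, 1])  # ~ 0: the error is uncorrelated with Xhat
```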
APPENDIX A
PROOF OF THEOREM 1

Proof: Let P_XY and Q_XY be two joint distributions. Let U be a random variable taking values on {1, 2} with P[U = 1] = α. Let (X, Y) be distributed according to P_XY conditioned on U = 1 and according to Q_XY conditioned on U = 2. Then, (X, Y) has joint distribution R_XY = α P_XY + (1 − α) Q_XY. Conditioning on U and applying the orthogonality principle to the conditional-mean estimators under P_XY, Q_XY, and R_XY gives the decomposition on the right-hand side of (10). This proves the desired equality (10).

APPENDIX B
PROOF OF THEOREM 2

Proof: The concavity follows from Corollary 1. To prove strict concavity, it suffices to consider two input distributions for which equality holds: suppose that for X₁ and X₂ with distributions P₁ and P₂, we have

    (126)

Denote the corresponding channel outputs for i = 1, 2. Then, by (14),

    (127)

holds for Lebesgue-a.e. y. The density of the channel output is given by

    (128)

which is the Weierstrass transform [7] of the input distribution. Define the likelihood ratio as

    (129)
The following properties of the log-likelihood ratio are proved in [43] and [44, Property 3], respectively:

    (130)
    (131)

The solution of (130) is

    (132)

Hence, in view of (127), we have, for all y,

    (133)

Next, we show that the two input distributions coincide. For i = 1, 2, define a probability measure according to

    (134)

for each positive Borel function. From (129), we observe that (134) involves the Laplace transform of the underlying measure. By (133), we conclude that the two measures agree. Then

    (135)
    (136)
    (137)

where (136) follows from

    (138)

By (137) and the arbitrariness of the test function, we conclude that the two input distributions coincide, so P_X ↦ mmse(X, snr) is strictly concave. In view of (8), the strict concavity of P_X ↦ I(X, snr) follows.
APPENDIX C
PROOF OF THEOREM 4
Proof of Theorem 4: Note that

    (139)
    (140)
    (141)

Therefore, it is equivalent to prove the upper semicontinuity of the right-hand side. Without loss of generality, we shall assume unit gain. Let {P_{X_n}} be a sequence of distributions converging weakly to P_X. By Skorohod's representation [45, Th. 25.6], there exists a sequence of random variables X_n with distributions P_{X_n}, respectively, such that X_n → X a.s. Let N be a random variable defined on the same probability space and independent of the X_n and X. By the denseness of C_b in L², for every ε > 0 there exists f ∈ C_b whose loss is within ε of the MMSE. Then

    (142)
    (143)
    (144)

where (142) is due to the suboptimality of f and (144) follows from the dominated convergence theorem. By the arbitrariness of ε, the proof of upper semicontinuity is complete, and it remains to show

    (145)

under the assumption that N has a continuous and bounded density. For every positive integer, define the following continuous and compactly supported function:

    (146)

Denote the density of N accordingly. For fixed y, define

    (147)
    (148)

where

    (149)
    (150)

Since the density is continuous and bounded, the functions in (147) and (148) are both continuous and bounded. We therefore obtain a sequence of nonnegative measurable functions converging pointwise. By Fatou's lemma,

    (151)
    (152)
    (153)

By Lemma 1, for any a,

    (154)
    (155)

Plugging (155) into (153) yields

    (156)

which, upon sending a → ∞, gives the desired (145).

The proof of (157) and (158) is completed by observing that

    (166)

1) Proof of Theorem 5: Let the two random variables be jointly distributed according to their optimal coupling, so that
(167)
Let
(131),
APPENDIX D
PROOF OF THEOREM 5
. In view of (130) and
is an increasing differentiable function with derivative
First, we prove an upper bound on the conditional variance,
which improves the estimate in [46, Proposition 1.2.].
Lemma 4: Let
is independent of . If
(168)
(169)
, where
, then for any
where the upper bound follows from further weakening (158)
for
. Then
by applying the inequality
(157)
(170)
(158)
is the probability density
where
function of .
Proof: Without loss of generality, we assume that
. Fix
. Then
(159)
(160)
(171)
(172)
where (170) follows from the suboptimality of for estimating
.
Next, we upper bound the third term in (172): fix
and
. Then, for any ,
let
(173)
(161)
(174)
where (161) follows from that
. Minimizing the upper bound in (161)
by choosing
, we have
(175)
where (174) follows from (169). Therefore
(162)
(163)
(176)
Since
(177)
where (163) follows from Jensen’s inequality:
(178)
(164)
(165)
(179)
(180)
where
1) Equation (178): by
.
2) Equation (179): by Hölder’s inequality with
,
and
.
Taking square root on both sides of (180) before applying it to
(176), we have
where
1) Equation (187): by the monotonicity of and the nonnegativity of relative entropy;
2) Equation (188): by (45).
3) Equation (189): by (44), (46), and (47).
on
Next, we establish the continuity of
. In view of the weak lower semicontinuity of relative entropy [8], we have
(190)
(181)
Hence, it remains to show
(191)
Substituting (167) and (181) into (172) and using the fact that
, we obtain
By (186), we have
(192)
(182)
For any5
which implies (28) since
.
-Lipschitz continuity of
The
. Since
-distance is finite on
setting
require
, i.e.,
.
,
follows from
for
, we
(193)
APPENDIX E
PROOF OF LEMMA 3
Proof: Let
As reasoned to obtain (188), the right-hand side of (193) is integrable. By the continuity of and the reverse Fatou’s lemma
and
(194)
(183)
By the lower semicontinuity of relative entropy, we have
In view of (44), we define a reference probability density
by
on
(184)
(195)
Plugging (194) and (195) into (192) yields the desired (191).
ACKNOWLEDGMENT
Then
(185)
(186)
(187)
(188)
(189)
The authors are grateful to Associate Editor Dongning Guo
for a careful reading of the draft and many fruitful discussions.
In particular, the proof of (8) in the case of infinite mutual information in Remark 4 is due to him. The authors also thank Oliver
Johnson for stimulating discussions and Ramon van Handel for
pointing us to the recent results in [28].
REFERENCES
[1] Y. Wu and S. Verdú, “Functional properties of MMSE,” in Proc. IEEE
Int. Symp. Inf. Theory, Austin, TX, Jun. 2010, pp. 1453–1457.
[2] D. Guo, Y. Wu, S. Shamai, and S. Verdú, “Estimation in Gaussian
noise: Properties of the minimum mean-square error,” IEEE Trans. Inf.
Theory, vol. 57, no. 4, pp. 2371–2385, Apr. 2011.
5If snr = 0,
replace snr
0 by 0.
[3] C. Villani, Optimal Transport: Old and New. Berlin, Germany:
Springer-Verlag, 2008.
[4] S. Yüksel and T. Linder, “Optimization and convergence of observation channels in stochastic control,” 2010 [Online]. Available: arxiv.org/abs/1009.3824
[5] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum
mean-square error in Gaussian channels,” IEEE Trans. Inf. Theory, vol.
51, no. 4, pp. 1261–1283, Apr. 2005.
[6] V. M. Zolotarev, One-Dimensional Stable Distributions. Providence,
RI: American Mathematical Society, 1986.
[7] Y. A. Brychkov and A. P. Prudnikov, Integral Transforms of Generalized Functions. New York: Gordon and Breach, 1989.
[8] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory
of Large Deviations. New York: Wiley-Interscience, 1997.
[9] P. J. Huber, Robust Statistics. New York, NY: Wiley-Interscience,
1981.
[10] G. L. Wise, “A note on a common misconception in estimation,” Syst.
Control Lett., vol. 5, pp. 355–356, Apr. 1985.
[11] Y. Wu and S. Verdú, “MMSE dimension,” IEEE Trans. Inf. Theory,
vol. 57, no. 8, pp. 4857–4879, Aug. 2011.
[12] L. D. Brown, “Admissible estimators, recurrent diffusions, and insoluble boundary value problems,” Ann. Math. Statist., vol. 42, no. 3, pp.
855–903, 1971.
[13] S. Vallender, “Calculation of the Wasserstein distance between probability distributions on the line,” Theory Probabil. Appl., vol. 18, pp.
784–786, 1974.
[14] S. T. Rachev and L. Rüschendorf, Mass Transportation Problems: Vol.
I: Theory. Berlin, Germany: Springer-Verlag, 1998.
[15] Y. Wu, “Shannon theory for compressed sensing,” Ph.D. dissertation,
Dept. Elect. Eng., Princeton Univ., Princeton, NJ, 2011.
[16] D. Guo, S. Shamai, and S. Verdú, The Interplay Between Information
and Estimation Measures 2011, Preprint.
[17] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ.
Press, 1970.
[18] J. A. Cuesta-Albertos, “Shape of a distribution through the L2-Wasserstein distance,” in Proceedings of Distributions with Given Marginals
and Statistical Modelling. Norwell, MA: Kluwer, 2002, pp. 51–62.
[19] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Inf.
Theory, vol. 44, no. 6, pp. 2325–2383, Oct. 1998.
[20] Y. Wu and S. Verdú, “The impact of constellation cardinality on
Gaussian channel capacity,” in Proc. 48th Annu. Allerton Conf.
Commun., Control, Comput., Monticello, IL, Oct. 1, 2010, pp.
620–628.
[21] A. I. Sakhanenko, “Estimates in an invariance principle,” Proc. Inst.
Math. Novosibirsk, vol. 45, no. 5, pp. 27–44, 1985.
[22] E. Rio, “Upper bounds for minimal distances in the central limit
theorem,” Annales de l’Institut Henri Poincaré-Probabilités et Statistiques, vol. 45, no. 3, pp. 802–817, 2009.
[23] A. R. Barron, “Entropy and the central limit theorem,” Ann. Probabil.,
vol. 14, no. 1, pp. 336–342, 1986.
[24] S. Artstein, K. Ball, F. Barthe, and A. Naor, “On the rate of convergence
in the entropic central limit theorem,” Probabil. Theory Related Fields,
vol. 129, no. 3, pp. 381–390, 2004.
[25] O. Johnson and A. Barron, “Fisher information inequalities and the
central limit theorem,” Probabil. Theory Related Fields, vol. 129, no.
3, pp. 391–409, 2004.
[26] A. A. Borovkov and S. A. Utev, “On an inequality and a related characterization of the normal distribution,” Theory Probabil. Appl., vol. 28,
pp. 219–228, 1984.
[27] S. A. Utev, “An application of integro-differential inequalities in probability theory,” Siberian Adv. Math., vol. 2, pp. 164–199, 1992.
[28] S. G. Bobkov, G. P. Chistyakov, and F. Götze, Rate of Convergence
and Edgeworth-Type Expansion in the Entropic Central Limit Theorem
2011 [Online]. Available: arxiv.org/abs/1104.3994
[29] J. M. Stoyanov, Counterexamples in Probability, 2nd ed. Chichester,
U.K.: Wiley, 1997.
[30] T. Tao, “Szemerédi’s regularity lemma revisited,” Contributions Discrete Math., vol. 1, no. 1, pp. 8–28, 2006.
[31] R. Ahlswede, “The final form of Tao’s inequality relating conditional expectation and conditional mutual information,” Adv. Math.
Commun., vol. 1, no. 2, pp. 239–242, 2007.
[32] I. Csiszár and J. Körner, Information Theory: Coding Theorems for
Discrete Memoryless Systems. New York: Academic, 1982.
[33] R. Zamir, “A proof of the Fisher information inequality via a data
processing argument,” IEEE Trans. Inf. Theory, vol. 44, no. 3, pp.
1246–1250, May 1998.
[34] M. N. Ghosh, “Uniform approximation of minimax point estimates,”
Ann. Math. Statist., vol. 35, no. 3, pp. 1031–1047, 1964.
[35] J. G. Smith, “The information capacity of amplitude and variance-constrained scalar Gaussian channels,” Inf. Control, vol. 18, pp. 203–219,
1971.
[36] E. Gourdin, B. Jaumard, and B. MacGibbon, “Global optimization decomposition methods for bounded parameter minimax risk evaluation,”
SIAM J. Sci. Comput., vol. 15, pp. 16–35, 1994.
[37] G. Casella and W. E. Strawderman, “Estimating a bounded normal
mean,” Ann. Statist., vol. 9, pp. 870–878, 1981.
[38] M. Raginsky, “On the information capacity of Gaussian channels
under small peak power constraints,” in Proc. 46th Annu. Allerton
Conf. Commun., Control, Comput., 2008, pp. 286–293.
[39] P. J. Bickel, “Minimax estimation of the mean of a normal distribution when the parameter space is restricted,” Ann. Statist., vol. 9, pp.
1301–1309, 1981.
[40] P. J. Huber, “Fisher information and spline interpolation,” Ann. Statist.,
vol. 2, pp. 1029–1033, 1974.
[41] E. Uhrmann-Klingen, “Minimal Fisher information distributions with
compact-supports,” Sankhyā: Indian J. Statist., Series A, vol. 57, pp.
360–374, 1995.
[42] S. S. Shitz and S. Verdú, “Worst-case power-constrained noise for
binary-input channels,” IEEE Trans. Inf. Theory, vol. 38, no. 5, pp.
1494–1511, Sep. 1992.
[43] R. Esposito, “On a relation between detection and estimation in decision theory,” Inf. Control, vol. 12, pp. 116–120, 1968.
[44] C. Hatsell and L. Nolte, “Some geometric properties of the likelihood
ratio,” IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 616–618, Sep. 1971.
[45] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley,
1995.
[46] M. Fozunbal, “On regret of parametric mismatch in minimum mean
square error estimation,” in Proc. 2010 IEEE Int. Symp. Inf. Theory,
Austin, TX, Jun. 2010, pp. 1408–1412.
Yihong Wu (S’10) received the B.E. degree from Tsinghua University, Beijing, China, in 2006 and the M.A. and Ph.D. degrees from Princeton University,
Princeton, NJ, in 2008 and 2011, all in electrical engineering. He is currently a
postdoctoral fellow with the Statistics Department, The Wharton School, University of Pennsylvania, Philadelphia, PA.
His research interests are in information theory, statistical decision theory,
high-dimensional inference, approximation theory and fractal geometry.
Dr. Wu is a recipient of the Princeton University Wallace Memorial Honorific
Fellowship in 2010. He was a recipient of the 2011 Marconi Society Paul Baran
Young Scholar Award and the Best Student Paper Award at the 2011 IEEE International Symposium on Information Theory (ISIT).
Sergio Verdú (S’80–M’84–SM’88–F’93) received the telecommunications
engineering degree from the Universitat Politècnica de Barcelona, Barcelona,
Spain, in 1980, and the Ph.D. degree in electrical engineering from the
University of Illinois at Urbana-Champaign, Urbana, in 1984.
Since 1984, he has been a member of the faculty of Princeton University, Princeton, NJ, where he is the Eugene Higgins Professor of Electrical
Engineering.
Dr. Verdú is the recipient of the 2007 Claude E. Shannon Award and the 2008
IEEE Richard W. Hamming Medal. He is a member of the National Academy of
Engineering and was awarded a Doctorate Honoris Causa from the Universitat
Politècnica de Catalunya in 2005. He is a recipient of several paper awards from
the IEEE: the 1992 Donald Fink Paper Award, the 1998 Information Theory
Outstanding Paper Award, an Information Theory Golden Jubilee Paper Award,
the 2002 Leonard Abraham Prize Award, the 2006 Joint Communications/Information Theory Paper Award, and the 2009 Stephen O. Rice Prize from the
IEEE Communications Society. He has also received paper awards from the
Japanese Telecommunications Advancement Foundation and from Eurasip. He
received the 2000 Frederick E. Terman Award from the American Society for
Engineering Education for his book Multiuser Detection. He served as President
of the IEEE Information Theory Society in 1997 and is currently Editor-in-Chief
of Foundations and Trends in Communications and Information Theory.