CHAPTER 8
THE ENTROPY OF THE NORMAL DISTRIBUTION
INTRODUCTION
The “normal distribution” or “Gaussian distribution” or Gaussian probability density function is
defined by
N(x; µ, σ) = (2πσ²)^(−1/2) e^(−(x−µ)²/(2σ²)) .    (8.1)
This density function, which is symmetrical about the line x = µ, has the familiar bell shape shown
in Figure 8.1. The two parameters, µ and σ², each have special significance; µ is the mean and σ² the
variance of the distribution. All probability density functions must be normalized to unity, and it is
shown in most textbooks on advanced calculus that
∫_{−∞}^{∞} N(x; µ, σ) dx = 1 .    (8.2)
The expectation of x, E(x), is equal to the mean; that is,
E(x) = ∫_{−∞}^{∞} x N(x; µ, σ) dx = µ .    (8.3)
The expectation of (x − µ)², E(x − µ)², is equal to the variance; that is,

E(x − µ)² = ∫_{−∞}^{∞} (x − µ)² N(x; µ, σ) dx = σ² .    (8.4)
Figure 8.1 The normal distribution with mean µ and variance σ²: N(x; µ, σ). About 2/3 of the
area under the curve lies within one standard deviation, σ, of the mean.
The latter two equations, if unfamiliar, may be found in all textbooks on mathematical statistics, or
may be verified directly by the reader.
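These identities are easy to check numerically. The following is a minimal sketch, assuming NumPy and SciPy are available; the parameters µ = 1.7 and σ = 2.3 are arbitrary illustrative choices.

```python
# Minimal numerical check of Equations (8.2)-(8.4); NumPy/SciPy and the
# parameter values are assumptions for illustration only.
import numpy as np
from scipy.integrate import quad

def normal_pdf(x, mu, sigma):
    """N(x; mu, sigma) as defined in Equation (8.1)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 1.7, 2.3                       # arbitrary illustrative parameters
lo, hi = mu - 12 * sigma, mu + 12 * sigma  # wide enough that the tails are negligible

area, _ = quad(lambda x: normal_pdf(x, mu, sigma), lo, hi)                 # Eq. (8.2)
mean, _ = quad(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)             # Eq. (8.3)
var, _ = quad(lambda x: (x - mu) ** 2 * normal_pdf(x, mu, sigma), lo, hi)  # Eq. (8.4)

print(area)             # ~1.0
print(mean, mu)         # ~1.7
print(var, sigma ** 2)  # ~5.29
```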
The differential entropy of the normal distribution can be found without difficulty. From the
definition of differential entropy given in Chapter 7, and using Equation (8.1),
H = − ∫_{−∞}^{∞} (2πσ²)^(−1/2) e^(−(x−µ)²/(2σ²)) ln[(2πσ²)^(−1/2) e^(−(x−µ)²/(2σ²))] dx

H = ½ ln(2πσ²) ∫_{−∞}^{∞} (2πσ²)^(−1/2) e^(−(x−µ)²/(2σ²)) dx
    + (1/(2σ²)) ∫_{−∞}^{∞} (2πσ²)^(−1/2) (x − µ)² e^(−(x−µ)²/(2σ²)) dx .

Introducing Equations (8.2) and (8.4),

H = ½ ln(2πσ²) + ½ .

Writing ½ as ½ ln e,

H = ½ ln(2πeσ²) ,    (8.5)
which is the simple result we sought. We note that the differential entropy of the Gaussian probability
density function depends only on the variance and not on the mean.
It has often been demonstrated (for example, Goldman, 1953) that for a given, fixed value of
variance, σ², the probability density with the greatest value of H is the Gaussian density.
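As a rough check on Equation (8.5) and on this maximum-entropy property, the sketch below (NumPy and SciPy assumed; σ = 1.5 is an arbitrary choice) integrates −p ln p numerically and compares the result with ½ ln(2πeσ²) and with the differential entropy of a uniform density of the same variance, ½ ln(12σ²), which is smaller because 12 < 2πe.

```python
# Sketch: numerical check of Equation (8.5), plus a comparison with a uniform
# density of equal variance; NumPy/SciPy and sigma = 1.5 are assumptions.
import numpy as np
from scipy.integrate import quad

mu, sigma = 0.0, 1.5

def p(x):
    """N(x; mu, sigma) of Equation (8.1)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Integrate -p ln p over a wide finite interval; the tails contribute negligibly.
H_numeric, _ = quad(lambda x: -p(x) * np.log(p(x)), mu - 12 * sigma, mu + 12 * sigma)
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)   # Eq. (8.5)
H_uniform = 0.5 * np.log(12 * sigma ** 2)                # uniform density, same variance

print(H_numeric, H_closed)   # agree to several decimal places
print(H_uniform < H_closed)  # True: the Gaussian has the larger differential entropy
```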
For an n-dimensional Gaussian density defined by
N(x₁, x₂, . . . ; µ₁, µ₂, . . . , σ₁, σ₂, . . . ) = ∏_{i=1}^{n} (2πσ_i²)^(−1/2) exp[−(x_i − µ_i)²/(2σ_i²)] ,    (8.6)
the differential entropy is given by
H = (n/2) ln[2πe(σ₁² σ₂² ⋯ σₙ²)^(1/n)]    (8.7)
as shown by McEliece (1977). In the limiting case, for n = 1, Equation (8.7) reduces to (8.5).
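Equation (8.7) is simply the sum of n one-dimensional entropies of the form (8.5), regrouped around the geometric mean of the variances. A minimal numerical sketch (NumPy assumed; the σᵢ values are arbitrary):

```python
# Sketch: the sum of one-dimensional entropies (Eq. 8.5) over independent
# components equals the expression of Eq. (8.7); sigma values are arbitrary.
import numpy as np

sigmas = np.array([0.5, 1.0, 2.0, 4.0])
n = len(sigmas)

H_sum = np.sum(0.5 * np.log(2 * np.pi * np.e * sigmas ** 2))                   # sum of Eq. (8.5)
H_eq87 = (n / 2) * np.log(2 * np.pi * np.e * np.prod(sigmas ** 2) ** (1 / n))  # Eq. (8.7)

print(H_sum, H_eq87)   # identical up to rounding
```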
CONVOLUTION OF TWO GAUSSIANS
Suppose that a pure signal is described by N(x; µ_S, σ_S), and its obfuscating noise by N(x; µ_N, σ_N).
Then, as shown by Equation (7.20), the density function resulting from pure signal in the presence of
noise is provided by the convolution

p_SN(x) = ∫_{−∞}^{∞} N(x − x_U; µ_S, σ_S) N(x_U; µ_N, σ_N) dx_U .    (8.8)
In fact, when we carry out the convolution of two Gaussians, the result is a third Gaussian density
whose mean is the sum of the means of the two component functions and whose variance is the sum of
the variances of the two component functions. That is,
p_SN = N(x; µ_S + µ_N, (σ_S² + σ_N²)^(1/2)) .    (8.9)
The full demonstration of (8.9) is not usually given in the textbooks because it is rather tedious, but
it is provided in the Appendix for completeness. Writing Equation (8.9) explicitly,
p_SN = [2π(σ_S² + σ_N²)]^(−1/2) exp{−[x − (µ_S + µ_N)]²/[2(σ_S² + σ_N²)]} .    (8.10)
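Equation (8.9) can be spot-checked numerically by carrying out the convolution (8.8) on a grid. A minimal sketch, assuming NumPy; the means and standard deviations are arbitrary illustrative values.

```python
# Sketch: numerical check of Equations (8.9)/(8.10) -- the convolution of two
# Gaussians is a Gaussian whose mean and variance are the sums of the component
# means and variances. Parameter values are illustrative only.
import numpy as np

mu_S, sigma_S, mu_N, sigma_N = 2.0, 1.0, -1.0, 0.5

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-20, 20, 8001)   # symmetric grid so "same" mode aligns with x
dx = x[1] - x[0]

# Discrete approximation to the convolution integral of Equation (8.8).
p_SN_numeric = np.convolve(gaussian(x, mu_S, sigma_S),
                           gaussian(x, mu_N, sigma_N), mode="same") * dx
p_SN_closed = gaussian(x, mu_S + mu_N, np.sqrt(sigma_S ** 2 + sigma_N ** 2))  # Eq. (8.10)

print(np.max(np.abs(p_SN_numeric - p_SN_closed)))   # small discretization error
```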
Using Equations (8.5) and (8.10), we can now write down directly the differential entropy of the
two component densities and of the convolution of the two Gaussian components:
H_S = ½ ln(2πeσ_S²) ,    (8.11)

H_N = ½ ln(2πeσ_N²) ,    (8.12)

and

H_SN = ½ ln[2πe(σ_S² + σ_N²)] .    (8.13)
INFORMATION
Following Equation (7.17), we represent the information of a measurement as a difference
(Shannon, 1948):
I = H_SN − H_N    (7.18)
  = ½ ln[2πe(σ_S² + σ_N²)] − ½ ln(2πeσ_N²)

I = ½ ln(1 + σ_S²/σ_N²)  natural units per signal.    (8.14)
Remember that (7.18) was put forward as a “reasonable” candidate for the information obtained by
making a measurement of a variable that was distributed continuously. Equation (8.14) is just a specific
instance of (7.18) in which the respective density functions are Gaussian. Equation (8.14) demonstrates
various properties that support its candidacy for an information function. I increases
monotonically with increasing σ_S: the greater the standard deviation of the measured signal, the more
uncertain we are about what the signal is, and the more information we obtain from the measurement.
Moreover, I increases monotonically with decreasing σ_N: the smaller the obfuscating factor, the
greater the information obtained from the measurement. And, of course, when σ_S² = 0, I = ½ ln 1 = 0;
when the signal (effectively) vanishes, no information is obtained. Or, looked at in another way, when
the measurement is certain (σ_S = 0), no information is obtained.
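A short sketch (NumPy assumed; the values of σ_S and σ_N are arbitrary) tabulates Equation (8.14) and makes the monotonic behaviour just described concrete.

```python
# Sketch: tabulate I = (1/2) ln(1 + sigma_S^2/sigma_N^2) of Eq. (8.14);
# NumPy and the parameter values are assumptions for illustration.
import numpy as np

def info_nats(sigma_S, sigma_N):
    return 0.5 * np.log(1.0 + sigma_S ** 2 / sigma_N ** 2)

for sigma_S in (0.0, 0.5, 1.0, 2.0, 4.0):      # I grows with sigma_S (sigma_N fixed)
    print(sigma_S, info_nats(sigma_S, 1.0))

for sigma_N in (4.0, 2.0, 1.0, 0.5, 0.25):     # I grows as sigma_N shrinks (sigma_S fixed)
    print(sigma_N, info_nats(1.0, sigma_N))
```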
A brief derivation of Equation (8.14) and its relation to “Shannon’s second theorem” is provided by
Beck (1976).
In the sensory analysis that follows, it will be helpful to interpret I as an information, because
information is rather a tangible quantity that may conjure a picture in our minds. However, the
informational interpretation of Equation (8.14) is not mandatory; no problem of a mathematical nature
will be encountered by regarding I in this equation as simply the difference between differential
entropies. In fact, since we shall hold σ_N² to be constant, I may be regarded simply as a differential
entropy plus a constant, which definitely conjures no picture. The application of the function I in the
analysis of sensory events will proceed in either case, with an informational or an entropic interpretation.
Information is just a useful “currency” in which we can visualize a sensory neuron as trading. It is
rather a concrete matter to state that a certain afferent neuron has relayed b bits of information to the
brain. However, although less concrete, it is equally valid to say simply that a sensory receptor served
by this afferent has reduced its entropy by b bits. But we are getting a little ahead of our story.
MORE ON THE INTERPRETATION OF THE INFORMATION FROM CONTINUOUS SOURCES
Using Equation (8.14), we can continue from Chapter 7 the attempt to interpret the information
from continuous probability densities into the more intuitive information from discrete probability
functions. You will remember that the probability density function for noise was used to limit the
number of discrete rectangles into which the probability density for the signal might be divided: the less
intense the noise, the greater the number of rectangles, and the greater the value of the (discrete)
entropy. The process of dividing into narrower and narrower rectangles had to be limited by some
natural constraint, so that a unique value of (discrete) entropy could be obtained. The problem was one
of “discretizing” the continuum.

Figure 8.2 Squaring the normal curve (sort of), or “discretizing” the continuum. Two normal
distributions are shown, the one on the right-hand side representing the pure signal, and the
other representing the noise signal. AB designates the region between −σ_N and +σ_N, and CD
designates the region between −σ_S and +σ_S (see Figure 8.1). It is seen that CD can be divided
into 8 rectangles of width AB. We can then regard the 8 rectangles as a histogram defining 8
equally probable, discrete outcomes to an event. The information obtained from a measurement
of the outcome is equal to ln(σ_S/σ_N) = ln(4/0.5) = ln 8, which is approximately equal to the
information of the original continuous event, ½ ln[1 + (½CD)²/(½AB)²]. We can see that as the
noise variance, σ_N², becomes smaller, σ_N becomes smaller, more AB’s fit into CD, and the
information is greater.
For the normal distribution, Equation (8.14) illustrates how the continuous distribution can be
rendered, in effect, discrete. Suppose that σ_S²/σ_N² >> 1. Then Equation (8.14) becomes, effectively,

I = ½ ln(σ_S²/σ_N²) .    (8.15)
But σ_S and σ_N are the standard deviations of the probability density functions for signal and noise
respectively. If we regard σ_S/σ_N, rounded to the nearest integer, n, as the equivalent number of
equally probable outcomes to the measurement event, then from the usual equation for discrete entropy,

H = ln(σ_S/σ_N) = ln(n) .    (8.16)

From (8.15), of course, I = H. This idea is shown schematically in Figure 8.2.
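The approximation can be checked directly. The sketch below (NumPy assumed) compares the continuous information of Equation (8.14) with the discretized entropy of Equation (8.16), first for the ratio σ_S/σ_N = 8 used in Figure 8.2 and then for two larger, arbitrarily chosen ratios.

```python
# Sketch: continuous information (Eq. 8.14) versus the "discretized" entropy
# ln(sigma_S/sigma_N) (Eq. 8.16); only the first ratio comes from Figure 8.2.
import numpy as np

for sigma_S, sigma_N in [(4.0, 0.5), (10.0, 1.0), (100.0, 1.0)]:
    ratio = sigma_S / sigma_N
    I_cont = 0.5 * np.log(1.0 + ratio ** 2)   # Eq. (8.14)
    H_disc = np.log(round(ratio))             # Eq. (8.16), n = ratio rounded to an integer
    print(ratio, I_cont, H_disc)
# The agreement improves as sigma_S/sigma_N grows, in line with Eq. (8.15).
```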
THE CENTRAL LIMIT THEOREM
Suppose that x₁, x₂, . . . , xₙ constitute a random sample drawn from an infinite population. We say
that x₁, x₂, . . . , xₙ constitute a random sample of size n. Let

x̄ = (x₁ + x₂ + . . . + xₙ)/n    (8.17)

be the sample mean. The Central Limit Theorem states that if random samples of size n are drawn from
a large or infinite population with mean µ and variance σ², the sample mean, x̄, is approximately
normally distributed with mean µ and variance σ²/n. Note that the theorem makes no mention of the
nature of the population from which samples are drawn. Even if the population is far from a “normal”
or “Gaussian” population, the sample means will still be distributed normally for sample size ≥ 30. If,
however, the population is not too different from normal, the distribution of means will be normal for
values of n much smaller than 30. The populations we shall be considering in our sensory work are
expected to fall into the latter category.
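A brief Monte Carlo sketch of the theorem, assuming NumPy; the exponential population, the sample size n = 30, and the number of repetitions are arbitrary illustrative choices.

```python
# Sketch: Monte Carlo illustration of the Central Limit Theorem with a decidedly
# non-normal (exponential) population; all numerical values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0          # an exponential(1) population has mean 1 and variance 1
n, repeats = 30, 100_000       # sample size and number of samples drawn

samples = rng.exponential(scale=1.0, size=(repeats, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean(), mu)         # ~1.0   (mean of the sample means)
print(sample_means.var(), sigma2 / n)  # ~0.033 (variance ~ sigma^2 / n)
# A histogram of sample_means lies close to N(x; mu, sigma/sqrt(n)).
```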
Statistically, no mention need be made about how a sample of size n is obtained. However, in our
scientific applications of the Central Limit Theorem it is, indeed, necessary to consider how the sample
was obtained. In fact, a measuring device will reach into the large or infinite population and
sequentially make n measurements, x₁, x₂, . . . , xₙ. I shall refer to each of these measurements as one
sampling. That is, it is necessary to make n samplings (or measurements) of the population to obtain
one sample of size n. The language is a little unwieldy, but I hope it is clear.
We have seen, now, that if the original population has variance σ², the means of samples of size n
are normally distributed with variance σ²/n. Therefore, the differential entropy of the original
distribution is given by Equation (8.5) directly, while the differential entropy of the distribution of
means of samples of size n is obtained from Equation (8.5) by replacing σ² by σ²/n:

H_mean = ½ ln(2πeσ²/n) .    (8.18)
If the precision of the measurement of the means (net result of sampling + computation) is limited
by Gaussian noise with variance σ_N², the information obtained from such a measurement is given by
Equation (8.14) with σ_S² replaced by σ_S²/n:

I = ½ ln[1 + σ_S²/(nσ_N²)]  natural units per measurement.    (8.19)
Thus it would appear that the information received by obtaining a measurement of the mean of 10
samplings (n = 10) is less than the information received by making a measurement based on a single
sampling from the population (n = 1). However, this is not the interpretation I wish to pursue.
If one looks at the information given by Equation (8.19) as a function of n, the sample size, it is
seen that I is maximum for n = 1, and I → 0 as n → ∞. When n → ∞, the variance of the sample mean,
σ²/n, approaches 0, implying that one has near-perfect knowledge of the population mean. We have
incorporated the idea of Fisher’s information (Chapter 5) into Shannon’s structure. However, Equation
(8.19) was not given by Shannon. I interpret
H(n) = I(n) = ½ ln[1 + σ_S²/(nσ_N²)]    (8.19a)
as the information which can still be gained about the population mean after n samplings of a
population have produced a single sample of size n. That is, I(n) is an absolute entropy: an uncertainty
about the value of the population mean, and a potential information that may be received as the process
of sampling continues. Thus, with increasing n, uncertainty and potential information decrease, while
information about the population mean increases. This interpretation of Equation (8.19) will be pursued
in the next chapter when we come to model the process of sensation.
The difference in potential information, and, therefore, the gain in information, when the sample
size is increased from n₁ to n₂ is given by H(n₁) − H(n₂). The reader might like to show that for
σ_S²/(n₂σ_N²) >> 1, the gain in information is equal to ½ ln(n₂/n₁) (cf. Equation (11.10)).
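A numerical spot-check of this exercise, with NumPy assumed and σ_S² = 10⁴, σ_N² = 1, n₁ = 2, n₂ = 10 chosen arbitrarily so that the large-ratio condition holds:

```python
# Sketch: H(n1) - H(n2) compared with (1/2) ln(n2/n1) under the large-ratio
# condition; NumPy and all numerical values are illustrative assumptions.
import numpy as np

sigma_S2, sigma_N2 = 1.0e4, 1.0    # chosen so that sigma_S^2/(n2*sigma_N^2) >> 1

def H(n):
    return 0.5 * np.log(1.0 + sigma_S2 / (n * sigma_N2))   # Eq. (8.19a)

n1, n2 = 2, 10
print(H(n1) - H(n2))           # ~0.804
print(0.5 * np.log(n2 / n1))   # (1/2) ln 5 ~ 0.805
```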
ANALOG CHANNELS
You may remember that in Chapter 5 we left some unfinished business. We discussed the
applications of information from discrete systems to communications engineering, but we could not, at
that time, examine continuous or analog systems. However, we are now in a position to do so.
Communications systems usually deal with signals such as electrical potentials (voltages) that are
transmitted with complex waveforms having, effectively, zero mean value. Equation (8.14) gives
information in natural units per sample (of a complex signal). The well-known sampling theorem
(Shannon, 1949) states that if a function contains no frequencies higher than W, it is completely
determined by giving its ordinates at a series of points spaced 1/(2W) seconds apart. The theorem has
also been generalized to include the case where the frequency band does not start at zero but at some
higher value. W is then a bandwidth. Therefore, if we divide the right-hand side of Equation (8.14)
[natural units of information per sample] by 1/(2W) [seconds per sample] we obtain
C = W ln(1 + σ_S²/σ_N²)  natural units per second.    (8.20)
Shannon has shown (1949), using an argument involving the volumes of spheres in hyperspace, that
C is the capacity of the channel. If we divide (8.20) by ln 2 we get, of course, bits per second.
The ratio of variances is usually written as P/N, the signal-to-noise ratio, so that

C = W ln(1 + P/N) .    (8.21)
This equation, then, gives the greatest rate at which an analog channel with a given signal-to-noise
ratio and Gaussian noise (“white thermal noise”) can transmit information. (Remember that the
Gaussian distribution has the greatest differential entropy for a given variance.) As an example (from
Raisbeck), if an audio circuit for the transmission of speech has a signal-to-noise ratio P/N equal to 36
decibels, and the bandwidth, W, is 4500 Hz, we can immediately calculate the channel capacity, C.
Since P/N = 10^3.6 (note 2, Chapter 3),

C = (4500 / ln 2) ln(1 + 10^3.6) ,

or about 50,000 bits per second.
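The arithmetic of the example can be reproduced in a few lines (NumPy assumed):

```python
# Sketch: the channel-capacity arithmetic of the example above (NumPy assumed).
import numpy as np

W = 4500.0             # bandwidth in Hz
P_over_N = 10 ** 3.6   # 36-decibel signal-to-noise (power) ratio

C_nats = W * np.log(1.0 + P_over_N)   # Eq. (8.21), natural units per second
C_bits = C_nats / np.log(2.0)         # bits per second

print(C_bits)   # ~5.4e4, i.e. roughly 50,000 bits per second
```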
APPENDIX: CONVOLUTION OF TWO GAUSSIAN FUNCTIONS
The convolution of the two Gaussian functions N(x; µ_S, σ_S) and N(x; µ_N, σ_N) that is given formally
in Equation (8.8) is now carried out explicitly.
p_SN(x) = ∫_{−∞}^{∞} [1/(2πσ_Sσ_N)] exp[−(x − x_U − µ_S)²/(2σ_S²)] exp[−(x_U − µ_N)²/(2σ_N²)] dx_U .

Changing variable, we set

Z = x_U − µ_N .

p_SN(x) = ∫_{−∞}^{∞} [1/(2πσ_Sσ_N)] exp[−(x − Z − µ_S − µ_N)²/(2σ_S²)] exp[−Z²/(2σ_N²)] dZ .

Setting

X = x − µ_S − µ_N ,    (A8.1)

p_SN(x) = [1/(2πσ_Sσ_N)] ∫_{−∞}^{∞} e^(−(X−Z)²/(2σ_S²)) e^(−Z²/(2σ_N²)) dZ

        = [1/(2πσ_Sσ_N)] e^(−X²/(2σ_S²)) ∫_{−∞}^{∞} exp[(2XZ − Z²)/(2σ_S²) − Z²/(2σ_N²)] dZ

        = [1/(2πσ_Sσ_N)] e^(−X²/(2σ_S²)) ∫_{−∞}^{∞} exp[(X/σ_S²)Z − (1/(2σ_S²) + 1/(2σ_N²))Z²] dZ

        = [1/(2πσ_Sσ_N)] e^(bX) ∫_{−∞}^{∞} e^(−(aZ² + 2bZ)) dZ ,

where

a = ½ (1/σ_S² + 1/σ_N²)    and    b = −X/(2σ_S²) .    (A8.2)

Completing the square in the exponent,

p_SN(x) = [1/(2πσ_Sσ_N)] e^(bX) e^(b²/a) ∫_{−∞}^{∞} e^(−a(Z + b/a)²) dZ .

Changing variable by setting u = Z + b/a, du = dZ,

p_SN(x) = [1/(2πσ_Sσ_N)] e^(bX) e^(b²/a) ∫_{−∞}^{∞} e^(−au²) du .

Since ∫_{−∞}^{∞} e^(−au²) du = √(π/a) for a > 0 (see any discussion of the error function),

p_SN(x) = [√(π/a)/(2πσ_Sσ_N)] e^(bX + b²/a) .    (A8.3)

b²/a can be evaluated from Equation (A8.2):

b²/a = [X²/(4σ_S⁴)] / [½(1/σ_S² + 1/σ_N²)] = [σ_N²/(σ_S² + σ_N²)] · X²/(2σ_S²) .

Completing the algebra,

bX + b²/a = −X²/[2(σ_S² + σ_N²)] .    (A8.4)

From the definition of a in (A8.2),

√(π/a)/(2πσ_Sσ_N) = 1/√(2π(σ_S² + σ_N²)) .    (A8.5)

Substituting Equations (A8.4) and (A8.5) into (A8.3), and returning the value for X from (A8.1), we
obtain the required result, Equation (8.10):

p_SN = [2π(σ_S² + σ_N²)]^(−1/2) exp{−[x − (µ_S + µ_N)]²/[2(σ_S² + σ_N²)]} .
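The identities (A8.4) and (A8.5) can be spot-checked numerically for arbitrary parameter values (NumPy assumed):

```python
# Sketch: spot-check of the appendix algebra (A8.2)-(A8.5); the parameter
# values sigma_S, sigma_N, and X are arbitrary illustrative choices.
import numpy as np

sigma_S, sigma_N, X = 1.3, 0.7, 0.9

a = 0.5 * (1 / sigma_S ** 2 + 1 / sigma_N ** 2)   # Eq. (A8.2)
b = -X / (2 * sigma_S ** 2)

lhs = b * X + b ** 2 / a
rhs = -X ** 2 / (2 * (sigma_S ** 2 + sigma_N ** 2))            # Eq. (A8.4)
print(lhs, rhs)                                                # equal

lhs2 = np.sqrt(np.pi / a) / (2 * np.pi * sigma_S * sigma_N)
rhs2 = 1 / np.sqrt(2 * np.pi * (sigma_S ** 2 + sigma_N ** 2))  # Eq. (A8.5)
print(lhs2, rhs2)                                              # equal
```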
REFERENCES
Beck, A.H.W. 1976. Statistical Mechanics, Fluctuations and Noise. Edward Arnold, London.
Goldman, S. 1953. Information Theory. Prentice-Hall, Englewood Cliffs, N.J.
McEliece, R.J. 1977. The Theory of Information and Coding: A Mathematical Framework for Communication.
Addison-Wesley, Reading, Mass.
Raisbeck, G. 1963. Information Theory: An Introduction for Scientists and Engineers. M.I.T. Press, Cambridge.
Shannon, C.E. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 623-656.
Shannon, C.E. 1949. Communication in the presence of noise. Proceedings of the IRE, 37, 10-21.