CHAPTER 8

THE ENTROPY OF THE NORMAL DISTRIBUTION

INTRODUCTION

The "normal distribution", "Gaussian distribution", or Gaussian probability density function is defined by

N(x; µ, σ) = (2πσ²)^(−1/2) exp[−(x − µ)²/2σ²] .   (8.1)

This density function, which is symmetrical about the line x = µ, has the familiar bell shape shown in Figure 8.1. The two parameters, µ and σ², each have special significance: µ is the mean and σ² the variance of the distribution. All probability density functions must be normalized to unity, and it is shown in most textbooks on advanced calculus that

∫_{−∞}^{∞} N(x; µ, σ) dx = 1 .   (8.2)

The expectation of x, E(x), is equal to the mean; that is,

E(x) = ∫_{−∞}^{∞} x N(x; µ, σ) dx = µ .   (8.3)

The expectation of (x − µ)², E(x − µ)², is equal to the variance; that is,

E(x − µ)² = ∫_{−∞}^{∞} (x − µ)² N(x; µ, σ) dx = σ² .   (8.4)

Figure 8.1  The normal distribution with mean µ and variance σ²: N(x; µ, σ). About 2/3 of the area under the curve lies within one standard deviation, σ, of the mean.

Information, Sensation and Perception. Kenneth H. Norwich, 2003.

The latter two equations, if unfamiliar, may be found in all textbooks on mathematical statistics, or may be verified directly by the reader.

The differential entropy of the normal distribution can be found without difficulty. From the definition of differential entropy given in Chapter 7, and using Equation (8.1),

H = − ∫_{−∞}^{∞} (2πσ²)^(−1/2) exp[−(x − µ)²/2σ²] ln{(2πσ²)^(−1/2) exp[−(x − µ)²/2σ²]} dx

H = (1/2) ln(2πσ²) ∫_{−∞}^{∞} (2πσ²)^(−1/2) exp[−(x − µ)²/2σ²] dx
  + ∫_{−∞}^{∞} (2πσ²)^(−1/2) [(x − µ)²/2σ²] exp[−(x − µ)²/2σ²] dx .

Introducing Equations (8.2) and (8.4),

H = (1/2) ln(2πσ²) + 1/2 .

Writing 1/2 as (1/2) ln e,

H = (1/2) ln(2πeσ²) ,   (8.5)

which is the simple result we sought. We note that the differential entropy of the Gaussian probability density function depends only on the variance and not on the mean.
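Equation (8.5) is easy to confirm numerically. The sketch below (plain Python, used here purely for illustration) compares the closed form with a midpoint-rule evaluation of the defining integral −∫ p ln p dx, and also checks that H is independent of µ:

```python
import math

def normal_pdf(x, mu, sigma):
    """The Gaussian density of Equation (8.1)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def entropy_numeric(mu, sigma, half_width=10.0, steps=200_000):
    """Differential entropy -integral(p ln p), midpoint rule over mu +/- half_width*sigma."""
    a = mu - half_width * sigma
    dx = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        p = normal_pdf(a + (i + 0.5) * dx, mu, sigma)
        total -= p * math.log(p) * dx
    return total

def entropy_closed(sigma):
    """Equation (8.5): H = (1/2) ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

h1 = entropy_numeric(mu=0.0, sigma=2.0)
h2 = entropy_numeric(mu=7.0, sigma=2.0)   # same sigma, different mean
print(h1, h2, entropy_closed(2.0))        # all three agree closely
```

Shifting µ leaves the numerical entropy unchanged, as the closed form predicts.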
It has often been demonstrated (for example, Goldman, 1953) that for a given, fixed value of variance, σ², the probability density with the greatest value of H is the Gaussian density. For an n-dimensional Gaussian density defined by

N(x_1, x_2, ...; µ_1, µ_2, ..., σ_1, σ_2, ...) = ∏_{i=1}^{n} (2πσ_i²)^(−1/2) exp[−(x_i − µ_i)²/2σ_i²] ,   (8.6)

the differential entropy is given by

H = (n/2) ln[2πe(σ_1²σ_2² ... σ_n²)^(1/n)] ,   (8.7)

as shown by McEliece (1977). In the limiting case, n = 1, Equation (8.7) reduces to (8.5).

CONVOLUTION OF TWO GAUSSIANS

Suppose that a pure signal is described by N(x; µ_S, σ_S), and its obfuscating noise by N(x; µ_N, σ_N). Then, as shown by Equation (7.20), the density function resulting from pure signal in the presence of noise is provided by the convolution

p_SN(x) = ∫_{−∞}^{∞} N(x − x_U; µ_S, σ_S) N(x_U; µ_N, σ_N) dx_U .   (8.8)

In fact, when we carry out the convolution of two Gaussians, the result is a third Gaussian density whose mean is the sum of the means of the two component functions and whose variance is the sum of their variances. That is,

p_SN = N(x; µ_S + µ_N, (σ_S² + σ_N²)^(1/2)) .   (8.9)

The full demonstration of (8.9) is not usually given in the textbooks because it is rather tedious, but it is provided in the Appendix for completeness. Writing Equation (8.9) explicitly,

p_SN = [2π(σ_S² + σ_N²)]^(−1/2) exp{−[x − (µ_S + µ_N)]²/2(σ_S² + σ_N²)} .   (8.10)

Using Equations (8.5) and (8.10), we can now write down directly the differential entropy of the two component densities and of the convolution of the two Gaussian components:

H_S = (1/2) ln(2πeσ_S²) ,   (8.11)

H_N = (1/2) ln(2πeσ_N²) ,   (8.12)

and

H_SN = (1/2) ln[2πe(σ_S² + σ_N²)] .   (8.13)
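Equation (8.9) can be spot-checked without the algebra of the Appendix: carry out the convolution integral of Equation (8.8) by quadrature and compare it with the single Gaussian of summed mean and variance. A minimal sketch (the particular parameter values are arbitrary):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def convolved(x, mu_s, sig_s, mu_n, sig_n, half_width=12.0, steps=100_000):
    """Equation (8.8): quadrature of N(x - u; mu_S, sig_S) * N(u; mu_N, sig_N) over u."""
    a = mu_n - half_width * sig_n
    du = 2 * half_width * sig_n / steps
    total = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * du
        total += normal_pdf(x - u, mu_s, sig_s) * normal_pdf(u, mu_n, sig_n) * du
    return total

# Arbitrary illustrative parameters; note sqrt(2.0**2 + 1.5**2) = 2.5.
mu_s, sig_s, mu_n, sig_n = 1.0, 2.0, -0.5, 1.5
for x in (-2.0, 0.5, 3.0):
    print(x, convolved(x, mu_s, sig_s, mu_n, sig_n),
          normal_pdf(x, mu_s + mu_n, 2.5))   # Equation (8.9)
```

At every x the quadrature reproduces the Gaussian with mean µ_S + µ_N and variance σ_S² + σ_N².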
INFORMATION

Following Equation (7.17), we represent the information of a measurement as a difference (Shannon, 1948):

I = H_SN − H_N = (1/2) ln[2πe(σ_S² + σ_N²)] − (1/2) ln(2πeσ_N²) ,   (7.18)

I = (1/2) ln(1 + σ_S²/σ_N²) natural units per signal.   (8.14)

Remember that (7.18) was put forward as a "reasonable" candidate for the information obtained by making a measurement of a variable that was distributed continuously. Equation (8.14) is just a specific instance of (7.18) where the respective density functions are Gaussian.

Equation (8.14) demonstrates various properties that would support its candidacy for an information function. I increases monotonically with increasing σ_S: the greater the standard deviation of the measured signal, the more uncertain we are about what the signal is, and the more information we obtain from the measurement. Moreover, I increases monotonically with decreasing σ_N: the smaller the obfuscating factor, the greater the information obtained from the measurement. And of course, when σ_S² = 0, I = (1/2) ln 1 = 0; when the signal (effectively) vanishes, no information is obtained. Or, looked at in another way, when the measurement is certain (σ_S = 0), no information is obtained. A brief derivation of Equation (8.14) and its relation to "Shannon's second theorem" is provided by Beck (1976).

In the sensory analysis that follows, it will be helpful to interpret I as an information, because information is rather a tangible quantity that may conjure a picture in our minds. However, the informational interpretation of Equation (8.14) is not mandatory; no problem of a mathematical nature will be encountered by regarding I in this equation as simply the difference between differential entropies. In fact, since we shall hold σ_N² to be constant, I may be regarded simply as a differential entropy plus a constant, which definitely conjures no picture.
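The properties claimed for Equation (8.14) — monotone in σ_S, monotone in 1/σ_N, zero for a vanishing signal — are easy to verify directly, along with the identity I = H_SN − H_N from Equations (8.11)–(8.13). A small sketch:

```python
import math

def info_nats(sig_s, sig_n):
    """Equation (8.14): I = (1/2) ln(1 + sigma_S^2 / sigma_N^2), in natural units."""
    return 0.5 * math.log(1.0 + sig_s ** 2 / sig_n ** 2)

assert info_nats(4.0, 1.0) > info_nats(2.0, 1.0)   # I grows with signal spread
assert info_nats(2.0, 0.5) > info_nats(2.0, 1.0)   # I grows as noise shrinks
assert info_nats(0.0, 1.0) == 0.0                  # vanishing signal: no information

# I is also exactly H_SN - H_N (Equations 8.11-8.13), here with sig_S = 2, sig_N = 1:
h_sn = 0.5 * math.log(2 * math.pi * math.e * (2.0 ** 2 + 1.0 ** 2))
h_n = 0.5 * math.log(2 * math.pi * math.e * 1.0 ** 2)
print(info_nats(2.0, 1.0), h_sn - h_n)
```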
The application of the function I in the analysis of sensory events will proceed in either case, with informational or entropic interpretation. Information is just a useful "currency" in which we can visualize a sensory neuron as trading. It is rather a concrete matter to state that a certain afferent neuron has relayed b bits of information to the brain. However, although less concrete, it is equally valid to say simply that a sensory receptor served by this afferent has reduced its entropy by b bits. But we are getting a little ahead of our story.

MORE ON THE INTERPRETATION OF THE INFORMATION FROM CONTINUOUS SOURCES

Using Equation (8.14), we can continue from Chapter 7 the attempt to interpret the information from continuous probability densities in terms of the more intuitive information from discrete probability functions. You will remember that the probability density function for noise was used to limit the number of discrete rectangles into which the probability density for the signal might be divided: the less intense the noise, the greater the number of rectangles, and the greater the value of the (discrete) entropy. The process of dividing into narrower and narrower rectangles had to be limited by some natural constraint, so that a unique value of (discrete) entropy could be obtained. The problem was one of "discretizing" the continuum.

Figure 8.2  Squaring the normal curve (sort of), or "discretizing" the continuum. Two normal distributions are shown, the one on the right-hand side representing the pure signal, and the other representing the noise signal. AB designates the region between −σ_N and +σ_N, and CD designates the region between −σ_S and +σ_S (see Figure 8.1). It is seen that CD can be divided into 8 rectangles of width AB. We can then regard the 8 rectangles as a histogram defining 8 equally probable, discrete outcomes to an event. The information obtained from a measurement of the outcome is equal to ln(σ_S/σ_N) = ln(4/0.5) = ln 8, which is approximately equal to the information of the original continuous event, (1/2) ln[1 + ((1/2)CD)²/((1/2)AB)²]. We can see that as the noise variance, σ_N², becomes smaller, σ_N becomes smaller, more AB's fit into CD, and the information is greater.

For the normal distribution, Equation (8.14) illustrates how the continuous distribution can be rendered, in effect, discrete. Suppose we regard σ_S²/σ_N² >> 1. Then Equation (8.14) becomes, effectively,

I = (1/2) ln(σ_S²/σ_N²) .   (8.15)

But σ_S and σ_N are the standard deviations of the probability density functions for signal and noise respectively. If we regard σ_S/σ_N, rounded to the nearest integer, n, as the equivalent number of equally probable outcomes to the measurement event, then from the usual equation for discrete entropy,

H = ln(σ_S/σ_N) = ln(n) .   (8.16)

From (8.15), of course, I = H. This idea is shown schematically in Figure 8.2.

THE CENTRAL LIMIT THEOREM

Suppose that x_1, x_2, ..., x_n constitute a random sample drawn from an infinite population. We say that x_1, x_2, ..., x_n constitute a random sample of size n. Let

x̄ = (x_1 + x_2 + ... + x_n)/n   (8.17)

be the sample mean. The Central Limit Theorem states that if random samples of size n are drawn from a large or infinite population with mean µ and variance σ², the sample mean, x̄, is approximately distributed normally with mean µ and variance σ²/n. Note that the theorem makes no mention of the nature of the population from which samples are drawn. Even if the population is far from a "normal" or "Gaussian" population, the sample means will still be distributed approximately normally for sample size ≥ 30.
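As a quick check of the discretization idea before proceeding: the numbers of Figure 8.2 show how good the approximation of Equations (8.15)–(8.16) is. With σ_S = 4 and σ_N = 0.5, the exact continuous information (1/2) ln(1 + 64) differs from the discrete value ln 8 by well under 1%:

```python
import math

sig_s, sig_n = 4.0, 0.5   # the values used in Figure 8.2

exact = 0.5 * math.log(1.0 + sig_s ** 2 / sig_n ** 2)   # Equation (8.14)
approx = math.log(sig_s / sig_n)                        # Equations (8.15)-(8.16): ln(n)
n = round(sig_s / sig_n)                                # equivalent discrete outcomes
print(n, exact, approx)
```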
If, however, the population is not too different from normal, the distribution of means will be normal for values of n much smaller than 30. The populations we shall be considering in our sensory work are expected to fall into the latter category.

Statistically, no mention need be made about how a sample of size n is obtained. However, in our scientific applications of the Central Limit Theorem it is, indeed, necessary to consider how the sample was obtained. In fact, a measuring device will reach into the large or infinite population and sequentially make n measurements, x_1, x_2, ..., x_n. I shall refer to each of these measurements as one sampling. That is, it is necessary to make n samplings (or measurements) of the population to obtain one sample of size n. The language is a little unwieldy, but I hope it is clear.

We have seen, now, that if the original population has variance σ², the means of samples of size n are normally distributed with variance σ²/n. Therefore, the differential entropy of the original distribution is given by Equation (8.5) directly, while the differential entropy of the distribution of means of samples of size n is obtained from Equation (8.5) by replacing σ² with σ²/n:

H_mean = (1/2) ln(2πeσ²/n) .   (8.18)

If the precision of the measurement of the means (net result of sampling + computation) is limited by Gaussian noise with variance σ_N², the information obtained from such a measurement is given by Equation (8.14) with σ_S² replaced by σ_S²/n:

I = (1/2) ln(1 + σ_S²/(nσ_N²)) natural units per measurement.   (8.19)

Thus it would appear that the information received by obtaining a measurement of the mean of 10 samplings (n = 10) is less than the information received by making a measurement based on a single sampling from the population (n = 1). However, this is not the interpretation I wish to pursue.
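Both halves of this passage lend themselves to a quick numerical check: that sample means of even a very non-normal population acquire variance σ²/n, and that the information of Equation (8.19) falls monotonically with n. A sketch, where the uniform population and the particular σ values are arbitrary illustrative choices:

```python
import math
import random

random.seed(1)

# Central Limit Theorem with a decidedly non-Gaussian population:
# uniform on [0, 1], which has mean 1/2 and variance 1/12.
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

n = 30
means = [sample_mean(n) for _ in range(20_000)]
m = sum(means) / len(means)
v = sum((x - m) ** 2 for x in means) / len(means)
print(m, v, (1 / 12) / n)   # empirical variance of the mean approaches sigma^2 / n

# Equation (8.19): potential information remaining after n samplings.
def H(n, sig_s2=4.0, sig_n2=0.25):
    return 0.5 * math.log(1.0 + sig_s2 / (n * sig_n2))

assert H(1) > H(10) > H(100)   # uncertainty falls as sampling continues

# The gain from enlarging the sample, H(n1) - H(n2), tends to (1/2) ln(n2/n1)
# when sigma_S^2 / (n2 * sigma_N^2) >> 1:
gain = H(1, 1e6, 1.0) - H(100, 1e6, 1.0)
print(gain, 0.5 * math.log(100))
```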
If one looks at the information given by Equation (8.19) as a function of n, the sample size, it is seen that I is maximum for n = 1, and I → 0 as n → ∞. When n → ∞, the sample variance, σ²/n → 0, implying that one has near-perfect knowledge of the population mean. We have incorporated the idea of Fisher's information (Chapter 5) into Shannon's structure. However, Equation (8.19) was not given by Shannon. I interpret

H(n) = I(n) = (1/2) ln(1 + σ_S²/(nσ_N²))   (8.19a)

as the information which can still be gained about the population mean after n samplings of a population have produced a single sample of size n. That is, I(n) is an absolute entropy, an uncertainty about the value of the population mean, and a potential information that may be received as the process of sampling continues. That is, with increasing n, uncertainty and potential information decrease, while information about the population mean increases. This interpretation of Equation (8.19) will be pursued in the next chapter when we come to model the process of sensation.

The difference in potential information, and, therefore, the gain in information, when the sample size is increased from n_1 to n_2 is given by H(n_1) − H(n_2). The reader might like to show that for σ_S²/(n_2σ_N²) >> 1, the gain in information is equal to (1/2) ln(n_2/n_1) (cf. Equation (11.10)).

ANALOG CHANNELS

You may remember that in Chapter 5 we left some unfinished business. We discussed the applications of information from discrete systems to communications engineering, but we could not, at that time, examine continuous or analog systems. However, we are now in a position to do so.

Communications systems usually deal with signals such as electrical potentials (voltages) that are transmitted with complex waveforms having, effectively, zero mean value. Equation (8.14) gives information in natural units per sample (of a complex signal). The well-known sampling theorem
(Shannon, 1949) states that if a function contains no frequencies higher than W, it is completely determined by giving its ordinates at a series of points spaced 1/(2W) seconds apart. The theorem has also been generalized to include the case where the frequency band does not start at zero but at some higher value; W is then a bandwidth. Therefore, if we divide the right-hand side of Equation (8.14) [natural units of information per sample] by 1/(2W) [seconds per sample] we obtain

C = W ln(1 + σ_S²/σ_N²) natural units per second.   (8.20)

Shannon has shown (1949), using an argument involving the volumes of spheres in hyperspace, that C is the capacity of the channel. If we divide (8.20) by ln 2 we get, of course, bits per second. The ratio of variances is usually written as P/N, the signal-to-noise ratio, so that

C = W ln(1 + P/N) .   (8.21)

This equation, then, gives the greatest rate at which an analog channel with a given signal-to-noise ratio and Gaussian noise ("white thermal noise") can transmit information. (Remember that the Gaussian distribution has the greatest differential entropy for a given variance.) As an example (from Raisbeck, 1963), if an audio circuit for the transmission of speech has a signal-to-noise ratio P/N equal to 36 decibels, and the bandwidth, W, is 4500 Hz, we can immediately calculate the channel capacity, C. Since P/N = 10^3.6 (note 2, Chapter 3),

C = (4500/ln 2) ln(1 + 10^3.6) ,

or about 50,000 bits per second.

APPENDIX: CONVOLUTION OF TWO GAUSSIAN FUNCTIONS

The convolution of the two Gaussian functions N(x; µ_S, σ_S) and N(x; µ_N, σ_N) that is given formally in Equation (8.8) is now carried out explicitly:

p_SN(x) = (1/(2π σ_S σ_N)) ∫_{−∞}^{∞} exp[−(x − x_U − µ_S)²/2σ_S²] exp[−(x_U − µ_N)²/2σ_N²] dx_U .

Changing variable, we set Z = x_U − µ_N:

p_SN(x) = (1/(2π σ_S σ_N)) ∫_{−∞}^{∞} exp[−(x − Z − µ_S − µ_N)²/2σ_S²] exp[−Z²/2σ_N²] dZ .
Setting X = x − µ_S − µ_N ,   (A8.1)

p_SN(x) = (1/(2π σ_S σ_N)) ∫_{−∞}^{∞} exp[−(X − Z)²/2σ_S²] exp[−Z²/2σ_N²] dZ

        = (1/(2π σ_S σ_N)) exp(−X²/2σ_S²) ∫_{−∞}^{∞} exp[(2XZ − Z²)/2σ_S² − Z²/2σ_N²] dZ

        = (1/(2π σ_S σ_N)) exp(bX) ∫_{−∞}^{∞} exp[−(aZ² + 2bZ)] dZ ,

where

a = (1/2)(1/σ_S² + 1/σ_N²)  and  b = −X/(2σ_S²) .   (A8.2)

Completing the square in the exponent,

p_SN(x) = (1/(2π σ_S σ_N)) exp(bX) exp(b²/a) ∫_{−∞}^{∞} exp[−a(Z + b/a)²] dZ .

Changing variable by setting u = Z + b/a, du = dZ,

p_SN(x) = (1/(2π σ_S σ_N)) exp(bX) exp(b²/a) ∫_{−∞}^{∞} exp(−au²) du .

Since ∫_{−∞}^{∞} exp(−au²) du = (π/a)^(1/2), a > 0 (see any discussion of the error function),

p_SN(x) = [(π/a)^(1/2)/(2π σ_S σ_N)] exp(bX + b²/a) .   (A8.3)

b²/a can be evaluated from Equation (A8.2):

b²/a = [X²/(4σ_S⁴)] / [(1/2)(1/σ_S² + 1/σ_N²)] = [σ_N²/(σ_S² + σ_N²)] · X²/(2σ_S²) .

Completing the algebra,

bX + b²/a = −X²/2(σ_S² + σ_N²) .   (A8.4)

From the definition of a in (A8.2),

(π/a)^(1/2)/(2π σ_S σ_N) = [2π(σ_S² + σ_N²)]^(−1/2) .   (A8.5)

Substituting Equations (A8.4) and (A8.5) into (A8.3), and returning the value for X from (A8.1), we obtain the required result,

p_SN = [2π(σ_S² + σ_N²)]^(−1/2) exp{−[x − (µ_S + µ_N)]²/2(σ_S² + σ_N²)} .   (A8.10)

REFERENCES

Beck, A.H.W. 1976. Statistical Mechanics, Fluctuations and Noise. Edward Arnold, London.

Goldman, S. 1953. Information Theory. Prentice-Hall, Englewood Cliffs, N.J.

McEliece, R.J. 1977. The Theory of Information and Coding: A Mathematical Framework for Communication. Addison-Wesley, Reading, Mass.

Raisbeck, G. 1963. Information Theory: An Introduction for Scientists and Engineers. M.I.T. Press, Cambridge.

Shannon, C.E. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 623-656.

Shannon, C.E. 1949. Communication in the presence of noise. Proceedings of the IRE 37, 10-21.