ESTIMATING THE ENTROPY OF A SIGNAL WITH APPLICATIONS

J.-F. Bercher
Équipe Communications Numériques, ESIEE, 93162 Noisy-le-Grand, FRANCE,
and Laboratoire Systèmes de Communications, UMLV
[email protected]

C. Vignat
Laboratoire Systèmes de Communications, Université de Marne la Vallée,
93166 Noisy-le-Grand, FRANCE
[email protected]

ABSTRACT

We present an estimator of the entropy of a signal. The basic idea is to adopt a model of the probability law, in the form of an AR spectrum. The law parameters can then be estimated from the data. We examine the statistical behaviour of our estimates of laws and entropy. Finally, we give several examples of applications: an adaptive version of our entropy estimator is applied to the detection of law changes, to blind deconvolution and to source separation.

This paper is organized as follows. In the first part, we present the theoretical background of the estimation procedure. Then we present the statistical behaviour of the estimates. Finally, we exhibit sample applications of this method, such as detection of law changes, blind deconvolution and source separation.

1. INTRODUCTION

Entropy is a major tool of information theory. However, this tool is not used directly in the context of signal processing, but only in theoretical frameworks, because it is difficult to compute, or even to estimate, from a set of data. A few interesting attempts at the direct use of entropy for signal processing applications can be found in [1, 2].

Let us recall that the entropy H_X of a random variable X(\omega) with continuous probability law f_X(x) is defined as

H_X = - \int_{-\infty}^{+\infty} f_X(x) \log f_X(x) \, dx.    (1)

This formula raises the two main difficulties that characterize the computation of the entropy. The first difficulty is due to the fact that the probability law f_X(x) is generally unknown. There exist, however, numerous methods for estimating probability laws, such as kernel methods.
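This kernel-based baseline can be made concrete in a few lines. The sketch below is our own NumPy illustration, not the paper's method: it pairs a Gaussian kernel density estimate with a numerical integration of (1); the kernel width, grid size and function name are arbitrary choices.

```python
import numpy as np

def entropy_kernel_numint(x, h=0.02, M=1000):
    """Baseline entropy estimate: Gaussian kernel density estimate of f_X,
    followed by numerical integration of -f log f as in (1)."""
    x = np.asarray(x, float)
    grid = np.linspace(x.min() - 4 * h, x.max() + 4 * h, M)
    # Kernel density estimate: average of Gaussian bumps centred on the samples.
    f = np.mean(np.exp(-(grid[:, None] - x[None, :])**2 / (2 * h**2)), axis=1) \
        / (h * np.sqrt(2 * np.pi))
    dx = grid[1] - grid[0]
    return -np.sum(f * np.log(f)) * dx        # rectangle-rule integration

rng = np.random.default_rng(0)
H = entropy_kernel_numint(rng.normal(0.0, 0.1, 2000))
# for N(0, 0.1^2) the exact differential entropy is (1/2) log(2*pi*e*0.01), about -0.884
```

Note that both the grid and the integration rule are free choices here, which is precisely the kind of arbitrariness and cost the rest of the paper seeks to avoid.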
But these methods share the major drawback of a slow convergence rate (typically T^{-4/5}). The second difficulty comes from the fact that, even if f_X(x) is known, formula (1) requires numerical integration. These two remarks illustrate the necessity of finding fast-converging and accurate estimates of entropy.

2. ESTIMATING ENTROPY

2.1. Construction of the estimate

The underlying idea of the new entropy estimate we propose here is as follows: the probability law is sought in a set of functions characterized by two fundamental features: (i) the set is described by a finite set of parameters (parametric model); (ii) the functions are chosen in such a manner that the computation of the entropy as defined by (1) reduces to a problem of AR process identification. Moreover, this method inherits regularization techniques allowing the integration of a priori information about the law (such as its smoothness). The first property obviously induces a small bias in the estimated entropy. The second property makes this estimate a new and potentially powerful tool in signal processing.

The main features we wish for our estimate are: a fixed number of parameters, as opposed to nonparametric approaches; an easy method to select the best-fitting parameters from the sole observation of samples of the random process X(n, \omega); and the capability of iteratively updating these parameters.

2.2. The AR identification problem

Given observations w(n) of a process W(n, \omega) and a fixed integer p, a set of parameters {a_i}_{1 \le i \le p} is sought that best fits the model

w(n) = \sum_{i=1}^{p} a_i w(n-i) + \epsilon(n),

where \epsilon(n) is a white noise with power \sigma^2. The exact solution of this problem is well known and requires knowledge of the correlation function R_w(k) of the process W(n, \omega). This solution is given by a = R_w^{-1} r_w, with matrix (R_w)_{i,j} = R_w(i-j) and vector (r_w)_i = R_w(i). The spectrum of this AR process writes

S_w(f) = \sigma^2 / |1 - \sum_{k=1}^{p} a_k e^{-j 2 \pi k f}|^2.    (2)

2.3. Application to the estimation of entropy

2.3.1.
The approach

Application of the AR identification tool to the entropy estimation problem is straightforward if we look for an estimate \hat{f}_X(x) of the probability law f_X(x), for x \in [-0.5, 0.5], under the form

\hat{f}_X(x) = \sigma^2 / |1 - \sum_{k=1}^{p} a_k e^{-j 2 \pi k x}|^2    (3)
            = S_w(x),    (4)

that is, the restriction to the interval [-0.5, +0.5] of the power spectral density S_w(x) of an AR process denoted W(n, \omega).

2.3.2. An underlying process

Given the law (3), it is possible to exhibit a random process W(n, \omega) whose spectrum is precisely S_w(f) = \hat{f}_X(f). If we define W(n, \omega) = e^{j(nX + \phi(\omega))}, where X is any sample of the process X(n, \omega) (for example X = X(1, \omega)) and \phi(\omega) is a uniformly distributed phase, then W(n, \omega) is a centered process whose correlation function R_w(k) is

R_w(k) = E[W(n+k) W^*(n)] = E[e^{jkX}] = FT^{-1}[\hat{f}_X(x)],

so that S_w(x) = \hat{f}_X(x). Note that this process is obviously not ergodic.

2.3.3. Estimation of entropy

Although an analytical expression of the entropy can be derived in terms of the AR parameters [4], the particular choice expressed by (3) also leads to an easy and natural procedure for the estimation of entropy. Denote by \hat{\Phi}_X(k) and \hat{\Psi}_X(k) the first and second characteristic functions associated with \hat{f}_X, defined respectively by \hat{\Phi}_X(k) = FT^{-1}[\hat{f}_X(x)] and \hat{\Psi}_X(k) = FT^{-1}[\log \hat{f}_X(x)]; then

\hat{\Phi}_X(k) = R_w(k)  for all k,
\hat{\Psi}_X(k) = C_w(k)  for all k,    (5)

where C_w(k) denotes the cepstrum function of W(n, \omega). Application of the Parseval-Plancherel identity to the definition (1) of the entropy then writes, with the help of relations (5),

\hat{H}_X = - \sum_{k=-\infty}^{+\infty} \hat{\Phi}_X(k) \hat{\Psi}_X(k) = - \sum_{k=-\infty}^{+\infty} R_w(k) C_w(k).    (6)

The entropy H_X of X(n, \omega) can thus be computed from both the correlation and cepstrum functions of W(n, \omega).
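The whole chain, samples to smoothed characteristic function to AR law model to the correlation-cepstrum sum (6), can be sketched end to end. The NumPy illustration below is ours, with several simplifying assumptions: a Gaussian kernel of width h, an arbitrary model order p, a cepstrum obtained by an inverse FFT of the log-spectrum rather than by a recursion, and the infinite sum (6) evaluated through a discrete Parseval identity on a grid.

```python
import numpy as np

def ar_entropy(x, p=10, h=0.02, M=512):
    """Entropy of samples x (assumed rescaled to [-1/2, 1/2]) via the AR law model:
    smoothed empirical characteristic function -> AR fit -> correlation and
    cepstrum sequences -> entropy as the sum (6) of their products."""
    x = np.asarray(x, float)
    k = np.arange(p + 1)
    Hk = np.exp(-2 * np.pi**2 * h**2 * k**2)     # transform of a Gaussian kernel
    R = Hk * np.exp(2j * np.pi * k[:, None] * x[None, :]).mean(axis=1)
    # AR parameters from the normal equations on the Hermitian Toeplitz matrix.
    Rlag = lambda m: R[m] if m >= 0 else np.conj(R[-m])
    Rmat = np.array([[Rlag(i - j) for j in range(p)] for i in range(p)])
    a = np.linalg.solve(Rmat, R[1:])
    sigma2 = max((R[0] - np.vdot(R[1:], a)).real, 1e-12)   # innovation power
    # Law model (3) on a grid; inverse DFTs of f and log f.
    xs = np.arange(M) / M - 0.5
    A = 1 - np.exp(-2j * np.pi * xs[:, None] * np.arange(1, p + 1)) @ a
    f = sigma2 / np.abs(A)**2
    Rg = np.fft.ifft(f)                          # inverse DFT of the law
    Cg = np.fft.ifft(np.log(f))                  # inverse DFT of its logarithm
    return -np.sum(Rg * np.conj(Cg)).real        # discrete version of the sum (6)

rng = np.random.default_rng(0)
H = ar_entropy(rng.normal(0.0, 0.1, 2000))
# the differential entropy of N(0, 0.1^2) is about -0.884
```

The final line never integrates the density explicitly: by Parseval, the sum of the two inverse-DFT sequences' products reproduces -\int f log f on the grid, which is the point of formula (6).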
At this step, we can take advantage of the AR structure, exploiting the fact that the correlation and cepstrum functions of an AR process verify

R_w(k) = \sum_{i=1}^{p} a_i R_w(k-i) + \sigma^2 \delta(k),    (7)

C_w(k) =
  \rho_w(k) - \sum_{i=k+1}^{-1} (i/k) C_w(i) \rho_w(k-i)    if k < 0,
  \log(\sigma^2 / R_w(0))                                    if k = 0,    (8)
  \rho_w(k) - \sum_{i=1}^{k-1} (i/k) C_w(i) \rho_w(k-i)     if k > 0,

where the function \rho_w(k) is the normalized correlation (or correlation coefficient) of the process W(n, \omega), defined as \rho_w(k) = R_w(k) / R_w(0). The trick here lies in the fact that the first and second characteristic functions involved in formula (6), being identified with the correlation and cepstrum functions of the AR process W(n, \omega), can be computed using relations (7), (8). Thus the estimate \hat{H}_X of the entropy as defined by (6) can be computed without any numerical integration, and from the sole observation of the data.

2.3.4. Implementation of the method

The proposed method thus consists of the three following steps.

First step: compute a raw estimate of the law of X (equivalently, of S_w(x)), involving the samples {x_i}_{1 \le i \le n+1}, as, for instance,

\hat{f}_X^{(n+1)}(x) = [ (1/(n+1)) \sum_{i=1}^{n+1} \delta(x - x_i) ] * h(x),

where the function h is any kernel function. However, as we wish an iterative estimate, it is useful to notice that

\hat{f}_X^{(n+1)}(x) = (n/(n+1)) \hat{f}_X^{(n)}(x) + (1/(n+1)) h(x - x_{n+1}),

so that, by inverse Fourier transform,

R_w^{(n+1)}(k) = (n/(n+1)) R_w^{(n)}(k) + (1/(n+1)) H(k) e^{j 2 \pi k x_{n+1}},    (9)

where H(k) = FT^{-1}(h(x)).

Second step: the set of estimated correlations {R_w^{(n+1)}(k)}_{0 \le k \le p} allows one to compute the set of parameters {a_i^{(n+1)}}_{1 \le i \le p}, and thus both series \hat{\Phi}_X^{(n+1)}(k) and \hat{\Psi}_X^{(n+1)}(k), using (7), (8).

Third step: application of formula (6) gives the estimated entropy of the process X(n, \omega) computed with the (n+1) first samples.

2.4. Estimation of the AR parameters of the law

Accurate estimates of the law parameters {a_k} now have to be derived from a finite number of samples {x_i}_{1 \le i \le N}. We present three methods for solving this problem in our special context.

2.4.1.
Normal equations

The AR parameters satisfy the well-known normal, or Yule-Walker, equations

R_w [1, -a^t]^t = [\sigma^2, 0, ..., 0]^t,

where R_w is here the (p+1) \times (p+1) correlation matrix. The correlation coefficients are estimated using (9). However, some free parameters remain to be chosen: (i) the form and length of the kernel h; (ii) the order p of the AR model. The choice of these parameters should result, as usual in estimation problems, from a trade-off between bias and variance.

2.4.2. Long AR models and regularization

Accurate modelling of non-AR processes via AR techniques requires the use of long AR models. The counterpart of adopting a high number of coefficients is a loss in the stability of the estimate. The exploitation of regularization techniques enables the use of long AR models, and thus the modelling of 'non-AR' spectra, without sacrificing stability. The idea is to use a long AR model with the addition of some prior knowledge about the 'smoothness' of the spectrum. In [3], Kitagawa and Gersch defined the kth-order smoothness of the PSD by

D_k = \int |\partial^k A(f) / \partial f^k|^2 df \propto a^t \Delta_k a,

where \Delta_k is the diagonal matrix with elements [\Delta_k]_{ii} = i^{2k}. The corresponding regularized least-squares estimate is

\hat{a} = (\hat{R}_w + \lambda \Delta_k)^{-1} \hat{r}_w.    (10)

The hyperparameter \lambda balances the fidelity to the data and the smoothness prior. In [3], a Bayesian interpretation of this regularized least-squares estimate is derived, which also leads to a selection rule for the hyperparameter \lambda, as the minimizer of the marginal likelihood

L(\lambda) = \log(\det(\hat{R}_w + \lambda \Delta_k)) - p \log(\lambda) - N \log(\sigma^2),    (11)

where \sigma^2 is chosen such that the AR probability distribution is properly normalized.

[Figure 1: The AR estimates and the theoretical gaussian (a) and uniform (b) laws.]

2.4.3. Maximum a posteriori estimation of the parameters

Another possible approach is to derive the parameters from the probabilistic model of the available data.
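To make the regularized solution (10) of the preceding subsection concrete, here is a small NumPy sketch on analytically known correlations. The test law (uniform on [-1/3, 1/3]), the order p = 50, the smoothness order k = 2 and the value of \lambda are our own illustrative choices, not the paper's.

```python
import numpy as np

# Correlations of the uniform law U[-1/3, 1/3]: R(0) = 1 and, for k >= 1,
# R(k) = sin(2*pi*k/3) / (2*pi*k/3) (its characteristic function).
p = 50
k = np.arange(1, p + 1)
r = np.sin(2 * np.pi * k / 3) / (2 * np.pi * k / 3)
Rfull = np.concatenate([[1.0], r])
Rmat = np.array([[Rfull[abs(i - j)] for j in range(p)] for i in range(p)])

# Smoothness penalty of order 2: [Delta_2]_ii = i^4, hyperparameter lam.
Delta = np.diag(k.astype(float) ** 4)
lam = 1e-3
a_reg = np.linalg.solve(Rmat + lam * Delta, r)   # regularized estimate (10)
sigma2 = 1.0 - r @ a_reg                         # residual (innovation) power

# The penalty leaves the low-order coefficients essentially free while
# shrinking the high-order ones, which is what keeps the long model stable.
```

With \lambda = 0, the 50 x 50 Toeplitz system built from such a law (whose density vanishes on part of the interval) is typically very ill-conditioned; the penalty term restores a well-posed solve.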
Since the model of the probability density is given by (3), the probability law of a sample of data of length N is the product law (assuming, in the absence of any extra information, independence of the data). Prior knowledge regarding the smoothness of the law can be introduced in the form of the Kitagawa-Gersch gaussian prior for a, f_A(a) \propto e^{-\lambda a^t \Delta_k a}. This leads to the posterior distribution

f_{A|X}(a|x) \propto [ \prod_{i=1}^{N} \sigma^2 / |1 - \sum_{k=1}^{p} a_k e^{-j 2 \pi k x_i}|^2 ] e^{-\lambda a^t \Delta_k a}.

The MAP estimate of the AR parameters can now be computed as the minimizer of

J(a, \sigma) = \sum_{i=1}^{N} \log |1 - \sum_{k=1}^{p} a_k e^{-j 2 \pi k x_i}|^2 - N \log \sigma^2 + \lambda a^t \Delta_k a.

2.5. First simulation results

In this section, we give simulation results regarding the estimation of probability laws using our AR modelling approach. We also examine the accuracy of the entropy estimates for several distributions.

Experiments were performed on sequences of 500 samples. (Since the law is modelled as the restriction of the spectrum to the interval [-1/2, +1/2], the data had to be rescaled to this interval. This does not restrict our approach, because the entropy of the rescaled variable differs from the original entropy only by a known additive term.) The laws were modelled with an order p = 50. We tested the long AR method presented in 2.4.2. We examined the cases of a uniform law U[-1/20, 1/20] and a gaussian law N(0, 0.01). The optimal regularization hyperparameter \lambda was selected by minimizing the likelihood as in (11). Typical results are given in Figs. 1 (a) and (b).

We also tested the MAP approach presented in 2.4.3. From a practical point of view, the MAP estimation approach usually gives more accurate, but more slowly converging, estimates than the long AR approach. As far as the accuracy of the entropy estimate is concerned, we evaluated the entropy of gaussian and uniform data using a Monte Carlo simulation over 100 trials.
The results are given in the following table, where M_H is the mean of the estimates and \sigma_H their standard deviation.

law                         theoretical   M_H      \sigma_H
Gaussian, \sigma^2 = 1/8    1.09          0.9544   0.0485
Uniform [-1/6, 1/6]         1.58          1.4786   0.0159

These results exhibit the good statistical behaviour of our estimate, that is, a low bias and a small variance.

3. SAMPLE APPLICATIONS

3.1. Detecting law changes

3.1.1. An adaptive estimate

As our approach involves iterative evaluations of the empirical law and of the corresponding correlation sequence, it is straightforward to derive an adaptive version of the entropy estimate. It suffices to introduce a forgetting factor in the updating formula (9) of the correlation sequence. The AR parameters and the entropy are then evaluated for each new sample. The regularized least-squares solution (10) can be computed recursively by the gradient-type update

a^{(n+1)} = a^{(n)} - \mu [ (\hat{R}^{(n)} + \lambda \Delta_k) a^{(n)} - \hat{r}^{(n)} ],

where \mu is a step size.

An interesting application of this adaptive estimate is the detection of law changes in signals. As an illustration, we consider a signal x(n) that consists of 400 samples generated according to a mixture of two gaussian distributions, followed by 400 uniformly distributed samples. The first law has entropy H_1 = 1.8, whereas the second has entropy H_2 = 1.18. Figure 2 below shows x(n): it is difficult to detect by simple inspection that there is a law change. Figure 3 shows the adaptive estimate of the negentropy on this test signal, using a forgetting factor of 0.98. The following points are of importance: (i) the law change appears clearly, (ii) the rupture time is properly revealed, (iii) the entropy is estimated with high accuracy, and (iv) our adaptive estimate has a high tracking capability due to its fast convergence (typically in less than 100 samples).
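A simplified variant of this experiment can be sketched as follows. This NumPy sketch is ours and makes several substitutions with respect to the text: a single gaussian law instead of the two-gaussian mixture, a Gaussian kernel, a full refit of the AR parameters at every sample in place of the recursive update of the coefficients, and a grid evaluation of (1) instead of the correlation-cepstrum sum (6).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([0.05 * rng.standard_normal(400),   # first law: gaussian
                    rng.uniform(-0.3, 0.3, 400)])      # second law: uniform

p, h, gamma, M = 10, 0.02, 0.98, 256
k = np.arange(p + 1)
Hk = np.exp(-2 * np.pi**2 * h**2 * k**2)      # Gaussian-kernel transform H(k)
R = np.zeros(p + 1, complex)
R[0] = 1.0                                    # start from a flat (uniform) law
xs = np.arange(M) / M - 0.5
E = np.exp(-2j * np.pi * xs[:, None] * np.arange(1, p + 1))

H_trace = []
for xn in x:
    # correlation update (9), with a forgetting factor gamma
    R = gamma * R + (1 - gamma) * Hk * np.exp(2j * np.pi * k * xn)
    Rlag = lambda m: R[m] if m >= 0 else np.conj(R[-m])
    Rmat = np.array([[Rlag(i - j) for j in range(p)] for i in range(p)])
    a = np.linalg.solve(Rmat, R[1:])
    sigma2 = max((R[0] - np.vdot(R[1:], a)).real, 1e-10)
    f = sigma2 / np.abs(1 - E @ a) ** 2       # current law estimate (3)
    H_trace.append(-(f * np.log(f)).sum() / M)   # grid evaluation of (1)
H_trace = np.array(H_trace)
# the entropy trace rises after the law change at sample 400
```

Since the exponentially weighted correlation sequence is a convex combination of valid correlation sequences, the Toeplitz system stays positive definite throughout the run.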
[Figure 2: x(n) with an abrupt law change at sample 400.]

[Figure 3: Adaptive estimate of the entropy.]

3.2. Blind deconvolution of AR systems

The difficult problem of blind deconvolution lies in recovering the input and the parameters of a filter from the sole observation of its output. The concept of entropy brings an interesting answer to this problem, relying on the following proposition.

Proposition 1. Let Y(n, \omega) be the output of a filter G(f), normalized such that \int |G(f)|^2 df = 1, whose input is a non-gaussian i.i.d. sequence X(n, \omega). Then H_Y > H_X.

The intuitive reason behind this result is that f_Y(y) is closer to a gaussian distribution than f_X(x), the gaussian distribution having the maximum entropy in the set of distributions of given variance; see [5]. The deconvolution procedure then simply consists in adjusting the filter so that the reconstructed input has minimum (estimated) entropy. If \theta are the filter parameters, this writes

\theta_{opt} = \arg \min_\theta \hat{H}_X  subject to  X(f) = Y(f) / G_\theta(f).

Simulations were performed in the case of non-minimum-phase AR systems. They show that the AR parameters can be identified with outstanding accuracy, and that the input can be perfectly reconstructed, even if the AR order is overestimated. These simulations were performed in the case of uniform and binary inputs, with 500 samples of data.

3.3. Source separation

In the context of source separation, N signals s(n) = [s_1(n), ..., s_N(n)] are mixed by an unknown N \times N matrix A to provide the observed signals x(n) = [x_1(n), ..., x_N(n)]. The task is, from the sole observation of the signals x(n), to recover the sources, assuming only their independence. This goal is reached by designing a matrix B such that the reconstructed signal \hat{s}(n) = B x(n) has independent components. The information-theoretic measure of independence is the mutual information, that is, the Kullback-Leibler divergence between p_{S_1,...,S_N}(s_1, ..., s_N) and \prod_{1 \le i \le N} p_{S_i}(s_i). In the source separation context, this reduces to minimizing the cost function

C(B) = - \log |\det B| + \sum_{i=1}^{N} H(\hat{s}_i).    (12)

In classical methods, as no estimate of the entropy is available, B is chosen as the solution of

E[\hat{s}_i \psi_j(\hat{s}_j)] = 0  (i \ne j),    (13)

which expresses the stationarity condition of C(B). The function \psi_j(s_j) is the so-called score function, the log-derivative of the density p_{S_j}(s_j). Using our AR parametrization, we can (i) either estimate the cost function C(B) and minimize it using any standard optimization procedure, or (ii) estimate the solution of (13) using an analytical expression (in terms of the AR parameters) of the score functions \psi_i(s_i).

We performed simulations using the first approach, with 500 samples of data, in the case of a mixture of N = 2 sources. The following table presents, for several distributions of the sources, the resulting matrix M = AB, which should be the identity matrix up to a scaling factor and a permutation.

source 1              source 2                               M
U[-0.5, 0.5]          (1/2){N(0.3, 0.2) + N(-0.3, 0.2)}     [1 0.0395; 0.0011 1]
U[-0.5, 0.5]          U[-0.5, 0.5]                          [1 0.02; 0.0163 1]
binary {-1/2, 1/2}    N(0, 1)                               [1 0.0354; 0.02 1]
U[-0.5, 0.5]          N(0, 1)                               [1 0.23; -0.0491 1]

4. REFERENCES

[1] P. Viola, N. N. Schraudolph, and T. J. Sejnowski, "Empirical Entropy Manipulation for Real-World Problems", Advances in Neural Information Processing Systems 8, MIT Press, 1996.
[2] D. T. Pham, "Blind Separation of Instantaneous Mixture of Sources via an Independent Component Analysis", IEEE Trans. on Signal Processing, vol. 44, no. 11, pp. 2768-2779, Nov. 1996.
[3] G. Kitagawa and W. Gersch, "A smoothness priors long AR model method for spectral estimation", IEEE Trans. on Automatic Control, no. 1, pp. 57-65, 1985.
[4] J.-F. Bercher and C. Vignat, "Entropy Estimation with Applications to Signal Processing Problems", LSC internal report no. 98-057.
[5] D. Donoho, "On minimum entropy deconvolution", Applied Time Series Analysis II, pp. 565-609, Academic Press, 1981.