ESTIMATING THE ENTROPY OF A SIGNAL WITH APPLICATIONS
J.-F. Bercher
C. Vignat
Équipe Communications Numériques, ESIEE
93 162 Noisy-le-Grand, FRANCE,
and Laboratoire Systèmes de Communications, UMLV
[email protected]
Laboratoire Systèmes de Communications
Université de Marne la Vallée
93 166 Noisy-le-Grand, FRANCE
[email protected]
ABSTRACT
We present an estimator of the entropy of a signal. The basic idea is to adopt a model of the probability law in the form of an AR spectrum; the law parameters can then be estimated from the data. We examine the statistical behaviour of our estimates of laws and entropy. Finally, we give several examples of applications: an adaptive version of our entropy estimator is applied to the detection of law changes, to blind deconvolution and to source separation.
1. INTRODUCTION
Entropy is a major tool of information theory. However, this tool is seldom used directly in signal processing, but rather in theoretical frameworks, because it is difficult to compute or even to estimate from a set of data. A few interesting attempts at a direct use of entropy for signal processing applications can be found in [1, 2].
Let us recall that the entropy H_X of a random variable X(ω) with continuous probability law f_X(x) is defined as

    H_X = -\int_{-\infty}^{+\infty} f_X(x) \log f_X(x) \, dx.    (1)
This formula raises the two main difficulties that characterize the computation of the entropy: the first difficulty is due to the fact that the probability law f_X(x) is generally unknown; numerous methods exist for estimating probability laws, such as kernel methods, but they share the major drawback of a slow convergence rate (typically T^{-4/5}). The second difficulty comes from the fact that, even if f_X(x) is known, formula (1) requires numerical integration.
This paper is organized as follows: in the first part, we present the theoretical background of the estimation procedure. Then we present the statistical behaviour of the estimates. Finally, we exhibit sample applications of this method, such as detection of law changes, blind deconvolution and source separation.
2. ESTIMATING ENTROPY
2.1. Construction of the estimate
These two remarks illustrate the necessity of finding fast-converging and accurate estimates of entropy. The underlying idea of the new estimate of entropy we propose here is the following: the probability law is sought in a set of functions characterized by two fundamental features:
(i) the set is characterized by a finite set of parameters (parametric model);
(ii) the functions are chosen in such a manner that computation of the entropy as defined by (1) reduces to a problem of AR process identification. Moreover, this method inherits regularization techniques allowing the integration of a priori information about the law (such as its smoothness).
The first property obviously induces a small bias in the estimated entropy. The second property makes this estimate a new and potentially powerful tool in signal processing.
The main features we wish for our estimate are:
- a fixed number of parameters, as opposed to nonparametric approaches,
- an easy method to select the best-fitting parameters from the sole observation of samples of the random process X(n, ω),
- the capability of iteratively updating these parameters.
2.2. The AR identification problem
Given observations w(n) of a process W(n, ω) and a fixed integer p, a set of parameters {a_i}, 1 ≤ i ≤ p, is sought that best fits the following model:

    w(n) = \sum_{i=1}^{p} a_i w(n-i) + \epsilon(n),

where ε(n) is a white noise with power σ². The exact solution of this problem is well known and requires the knowledge of the correlation function R_w(k) of the process W(n, ω). This solution is given by

    a = R_w^{-1} r_w,

with matrix (R_w)_{i,j} = R_w(i-j) and vector (r_w)_i = R_w(i). The spectrum of this AR process writes

    S_w(f) = \frac{\sigma^2}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j 2\pi k f} \right|^2}.    (2)
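For concreteness, here is a minimal numpy sketch of this identification step; the biased correlation estimator, the function names and the spectrum evaluator are our own illustrative choices, not prescriptions of the paper.

```python
import numpy as np

def yule_walker(w, p):
    """Fit an AR(p) model to samples w by solving a = Rw^{-1} rw (Section 2.2)."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    # Biased correlation estimates Rw(0), ..., Rw(p)
    R = np.array([np.dot(w[:n - k], w[k:]) / n for k in range(p + 1)])
    Rmat = np.array([[R[abs(i - j)] for j in range(p)] for i in range(p)])  # (Rw)_{ij} = Rw(i - j)
    rvec = R[1:]                                                            # (rw)_i = Rw(i)
    a = np.linalg.solve(Rmat, rvec)
    sigma2 = R[0] - np.dot(a, rvec)   # innovation power
    return a, sigma2

def ar_spectrum(a, sigma2, f):
    """Evaluate the AR spectrum (2) at the normalized frequencies f."""
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-2j * np.pi * np.outer(f, k)) @ a
    return sigma2 / np.abs(A) ** 2
```

In the entropy estimator described next, the same linear system is solved, but with R_w(k) obtained from the observed data x_i through a characteristic-function estimate rather than from samples of W.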
2.3. Application to the estimation of entropy
2.3.1. The approach
Application of the AR identification tool to the entropy estimation problem is straightforward if we look for an estimate f̂_X(x) of the probability law f_X(x), for x ∈ [-0.5, 0.5], under the form

    \hat{f}_X(x) = \frac{\sigma^2}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j 2\pi k x} \right|^2}    (3)
                 = S_w(x),    (4)

that is, the restriction to the interval [-0.5, +0.5] of the power spectral density S_w(x) of an AR process denoted W(n, ω).
2.3.2. An underlying process
Given law (3), it is possible to exhibit a random process W(n, ω) whose spectrum is precisely S_w(f) = f̂_X(f). If we define

    W(n, \omega) = e^{j (n X + \Phi(\omega))},

where X is any sample of process X(n, ω) (for example X = X(1, ω)) and Φ(ω) is a uniformly distributed phase, then W(n, ω) is a centered process whose correlation function R_w(k) is

    R_w(k) = E[W^*(n) W(n+k)] = E[e^{j k X}] = FT^{-1}[\hat{f}_X(x)](k),

so that obviously S_w(x) = f̂_X(x). Note that this process is not ergodic.
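Numerically, this identification means that the empirical characteristic function of the data directly provides correlation estimates for W. A small sketch; the 2π convention below is our choice for data rescaled to [-1/2, 1/2], matching the spectrum convention of (2):

```python
import numpy as np

def empirical_char_function(x, kmax):
    """First characteristic function Phi_X(k) = E[exp(j 2 pi k X)] estimated
    from samples x; Section 2.3.2 identifies it with the correlation R_w(k)
    of the underlying (non-ergodic) process W."""
    x = np.asarray(x, dtype=float)
    k = np.arange(kmax + 1)
    return np.exp(2j * np.pi * np.outer(k, x)).mean(axis=1)
```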
2.3.3. Estimation of entropy
Although an analytical expression of the entropy can be derived in terms of the AR parameters [4], the particular choice expressed by (3) also leads to an easy and natural procedure for the estimation of entropy.
Denote by Φ̂_X(k) and Ψ̂_X(k) the first and second characteristic functions of X(n, ω), defined respectively by Φ̂_X(k) = FT^{-1}[f̂_X(x)] and Ψ̂_X(k) = FT^{-1}[log f̂_X(x)]; then

    \hat{\Phi}_X(k) = R_w(k)   ∀k,
    \hat{\Psi}_X(k) = C_w(k)   ∀k,    (5)

where C_w(k) denotes the cepstrum function of W(n, ω).
Applying the Parseval-Plancherel identity to the definition (1) of the entropy, and using relations (5), yields

    \hat{H}_X = -\sum_{k=-\infty}^{+\infty} \hat{\Phi}_X(k) \hat{\Psi}_X(k) = -\sum_{k=-\infty}^{+\infty} R_w(k) C_w(k).    (6)

The entropy H_X of X(n, ω) can thus be computed from both the correlation and the cepstrum functions of W(n, ω).
At this step, we can take advantage of the AR structure, exploiting the fact that the correlation and cepstrum functions of an AR process verify

    R_w(k) = \sum_{i=1}^{p} a_i R_w(k-i) + \sigma^2 \delta(k),    (7)

    C_w(k) = \begin{cases}
      \sum_{i=k+1}^{0} \frac{-i}{k} C_w(i) \rho_w(k-i) & \text{if } k < 0, \\
      \log\left(\sigma^2 / R_w(0)\right) & \text{if } k = 0, \\
      \rho_w(k) - \sum_{i=1}^{k-1} \frac{i}{k} C_w(i) \rho_w(k-i) & \text{if } k > 0,
    \end{cases}    (8)

where ρ_w(k) is the normalized correlation (or correlation coefficient) of the process W(n, ω), defined as ρ_w(k) = R_w(k)/R_w(0).
The trick here lies in the fact that the first and second characteristic functions involved in formula (6), being identified with the correlation and cepstrum functions of the AR process W(n, ω), can be computed using relations (7) and (8). Thus the estimate Ĥ_X of the entropy as defined by (6) can be computed without any numerical integration, from the sole observation of the data.
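As a minimal sketch of this computation (assuming a real, even correlation sequence truncated at a finite lag K, which is our simplification), one might write:

```python
import numpy as np

def ar_cepstrum(rho, sigma2, R0):
    """Cepstrum C_w(k), k = 0..K, from the normalized correlation
    rho(k) = R_w(k)/R_w(0), following the recursion (8) for k >= 0."""
    K = len(rho) - 1
    C = np.zeros(K + 1)
    C[0] = np.log(sigma2 / R0)
    for k in range(1, K + 1):
        C[k] = rho[k] - sum((i / k) * C[i] * rho[k - i] for i in range(1, k))
    return C

def entropy_estimate(R, C):
    """Entropy estimate (6), H = -sum_k R_w(k) C_w(k), truncated at lag K and
    using the symmetry of the two sequences to fold the negative lags."""
    return -(R[0] * C[0] + 2.0 * np.dot(R[1:], C[1:]))
```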
2.3.4. Implementation of the method
The proposed method thus consists of the three following steps.
First step: compute a raw estimate of the law of X (equivalently of S_w(x)), involving the samples {x_i}, 1 ≤ i ≤ n+1, as, for instance,

    \hat{f}_X^{(n+1)}(x) = \frac{1}{n+1} \sum_{i=1}^{n+1} \delta(x - x_i) * h(x),

where h is any kernel function. However, as we wish an iterative estimate, it is useful to notice that

    \hat{f}_X^{(n+1)}(x) = \frac{n}{n+1} \hat{f}_X^{(n)}(x) + \frac{1}{n+1} h(x - x_{n+1}),

so that, by inverse Fourier transform,

    R_w^{(n+1)}(k) = \frac{n}{n+1} R_w^{(n)}(k) + \frac{1}{n+1} H(k) e^{j 2\pi k x_{n+1}},    (9)

where H(k) = FT^{-1}(h(x)).
Second step: the set of estimated correlations {R_w^{(n+1)}(k)}, 0 ≤ k ≤ p, allows one to compute the set of parameters {a_i}, 1 ≤ i ≤ p, and thus both series Φ̂_X^{(n+1)}(k) and Ψ̂_X^{(n+1)}(k), using (7) and (8).
Third step: application of formula (6) gives the estimated entropy of the process X(n, ω) computed from the first (n+1) samples.
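The first step lends itself to a simple streaming implementation. In the sketch below, a Gaussian kernel h is assumed (our choice), for which H(k) has the closed form exp(-2(πkw)²) with w the kernel width; the class name is ours.

```python
import numpy as np

class CorrelationTracker:
    """Recursive correlation estimate (9) of R_w(k), k = 0..K, built from
    samples x_n rescaled to [-1/2, 1/2]."""

    def __init__(self, K, width=0.01):
        self.k = np.arange(K + 1)
        self.H = np.exp(-2.0 * (np.pi * self.k * width) ** 2)  # FT^{-1} of a Gaussian kernel
        self.R = np.zeros(K + 1, dtype=complex)
        self.n = 0

    def update(self, x_new):
        # R^{(n+1)}(k) = n/(n+1) R^{(n)}(k) + 1/(n+1) H(k) exp(j 2 pi k x_{n+1})
        phase = np.exp(2j * np.pi * self.k * x_new)
        self.R = (self.n * self.R + self.H * phase) / (self.n + 1)
        self.n += 1
        return self.R
```

The first p+1 values of R then feed the normal equations of Section 2.4.1, and the recursion (8) together with (6) yields the entropy estimate.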
2.4. Estimation of the AR parameters of the law
Accurate estimates of the law parameters {a_k} now have to be derived from a finite number of samples {x_i}, 1 ≤ i ≤ N. We present three methods for solving this problem in our special context.
2.4.1. Normal equations
The AR parameters satisfy the well-known normal, or Yule-Walker, equations

    R_w \begin{bmatrix} 1 \\ -a \end{bmatrix} = \begin{bmatrix} \sigma^2 \\ 0 \end{bmatrix},

where R_w is the (p+1) × (p+1) correlation matrix. The correlation coefficients are estimated using (9). However, some free parameters remain to be chosen: (i) the form and length of the kernel h; (ii) the order p of the AR model. The choice of these parameters should result, as usual in estimation problems, from a trade-off between bias and variance.
2.4.2. Long AR models and regularization
Accurate modelization of non-AR processes via AR techniques requires the use of long AR models. The counterpart of adopting a high number of coefficients is a loss of stability of the estimate. The exploitation of regularization techniques enables the use of long AR models, and thus the modelling of 'non-AR' spectra, without sacrificing stability.
The idea is to use a long AR model with the addition of some prior knowledge about the 'smoothness' of the spectrum. In [3], Kitagawa and Gersch defined the k-th smoothness of the PSD by

    D_k = \int_0^1 \left| \frac{\partial^k A(f)}{\partial f^k} \right|^2 df \propto a^t \Delta_k a,
where Δ_k is the diagonal matrix with elements [Δ_k]_{ii} = i^{2k}. The corresponding regularized least-squares estimate is

    \hat{a} = (\hat{R}_w + \lambda \Delta_k)^{-1} \hat{r}_w.    (10)

The hyperparameter λ balances the fidelity to the data and the smoothness prior. In [3], a Bayesian interpretation of this regularized least-squares estimate is derived, which also leads to a selection rule for the hyperparameter λ, as the minimizer of the following marginal likelihood:

    L(\lambda) = \log\left(\det(\hat{R}_w + \lambda \Delta_k)\right) - p \log(\lambda) - N \log(\sigma^2),    (11)

where σ² is chosen such that the AR probability distribution is properly normalized.
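A minimal sketch of the regularized estimate (10) and of the selection rule (11), assuming a real-valued correlation sequence R(0..p); the function names and default values are ours:

```python
import numpy as np

def regularized_long_ar(R, p, lam, k=2):
    """Regularized least-squares AR estimate (10) with the smoothness prior
    [Delta_k]_{ii} = i^{2k}."""
    Rmat = np.array([[R[abs(i - j)] for j in range(p)] for i in range(p)], dtype=float)
    rvec = np.asarray(R[1:p + 1], dtype=float)
    Delta = np.diag(np.arange(1, p + 1, dtype=float) ** (2 * k))
    return np.linalg.solve(Rmat + lam * Delta, rvec)

def marginal_likelihood(R, p, lam, sigma2, N, k=2):
    """Selection criterion (11) for the hyperparameter lam; the best lam is
    found by scanning a grid of candidate values and keeping the minimizer."""
    Rmat = np.array([[R[abs(i - j)] for j in range(p)] for i in range(p)], dtype=float)
    Delta = np.diag(np.arange(1, p + 1, dtype=float) ** (2 * k))
    _, logdet = np.linalg.slogdet(Rmat + lam * Delta)
    return logdet - p * np.log(lam) - N * np.log(sigma2)
```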
2.4.3. Maximum a Posteriori estimation of the parameters
Another possible approach is to derive the parameters from the probabilistic model of the available data. Since the model of the probability density is given by (3), the probability law of a sample of data of length N is the product law (assuming, without any extra information, independence of the data). Prior knowledge regarding the smoothness of the law can be introduced in the form of the Kitagawa-Gersch Gaussian prior for a, f_A(a) ∝ e^{-a^t Δ_k a}. This leads to the posterior distribution

    f_{A|X}(a|x) \propto \prod_{i=1}^{N} \frac{\sigma^2}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j 2\pi k x_i} \right|^2} \, e^{-a^t \Delta_k a}.

The MAP estimate of the AR parameters can now be computed as the minimizer of

    J(a, \sigma) = \sum_{i=1}^{N} \log \left| 1 - \sum_{k=1}^{p} a_k e^{-j 2\pi k x_i} \right|^2 - N \log \sigma^2 + a^t \Delta_k a.
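A sketch of the corresponding objective, for data x rescaled to [-1/2, 1/2]; we expose the prior weight as an explicit parameter `lam` and leave the choice of optimizer open (both are our own choices):

```python
import numpy as np

def map_objective(a, sigma2, x, lam, Delta):
    """MAP criterion J(a, sigma) of Section 2.4.3: data term from the AR law (3),
    plus the Kitagawa-Gersch smoothness penalty a^t Delta a weighted by lam."""
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-2j * np.pi * np.outer(x, k)) @ a   # A(x_i) = 1 - sum_k a_k e^{-j 2 pi k x_i}
    return (np.sum(np.log(np.abs(A) ** 2))
            - len(x) * np.log(sigma2)
            + lam * a @ Delta @ a)
```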
2.5. First simulation results
In this section, we give simulation results regarding the estimation of probability laws using our AR modelization approach. We also examine the accuracy of entropy estimates for several distributions. Experiments were performed on sequences of 500 samples¹. The laws were modelized with an order p = 50. We tested the long AR method presented in 2.4.2. We examined the cases of a uniform law U[-1/20, 1/20] and a Gaussian law N(0, 0.01). The optimal regularization hyperparameter λ was selected by minimizing the likelihood (11). Typical results are given in Figs. 1(a) and (b).
[Figure 1: The AR estimates and theoretical Gaussian (a) and uniform (b) laws.]
¹ Since the law is modelized as the restriction of the spectrum to the interval [-1/2, +1/2], the data had to be rescaled to this interval. This does not restrict our approach, because the entropy of the rescaled variable differs from the original entropy only by a known additive term.
We also tested the MAP approach presented in 2.4.3. From a practical point of view, the MAP estimation approach usually gives more accurate but more slowly converging estimates than the long AR approach. As far as the accuracy of the entropy estimate is concerned, we evaluated the entropy of Gaussian and uniform data using a Monte Carlo simulation over 100 trials. The results are given in the following table, where M_H is the mean of the estimates and σ_H their standard deviation.
                          theoretical    M_H       σ_H
Gaussian, σ² = 1/8        1.09           0.9544    0.0485
Uniform [-1/6, 1/6]       1.58           1.4786    0.0159
These results exhibit the good statistical behaviour of our estimate, that is a low bias and a small variance.
3. SAMPLE APPLICATIONS
3.1. Detecting law changes
3.1.1. An adaptive estimate
As our approach involves iterative evaluations of the empirical law and of the corresponding correlation sequence, it is straightforward to derive an adaptive version of the entropy estimation. It suffices to introduce a forgetting factor γ in the updating formula (9) of the correlation sequence. Then, the AR parameters and the entropy are evaluated for each new sample. The regularized least-squares solution (10) can be computed recursively as

    a^{(n+1)} = a^{(n)} - \mu \left[ (\hat{R}^{(n)} + \lambda \Delta_k) a^{(n)} - \hat{r}^{(n)} \right].
An interesting application of this adaptive estimate consists in detecting law changes in signals. As an illustration, we consider a signal x(n) that consists of 400 samples generated according to a mixture of two Gaussian distributions, followed by 400 uniformly distributed samples. The first law has entropy H1 = 1.8 whereas the second has entropy H2 = 1.18.
Figure 2 below shows x(n): it is difficult to detect by simple inspection that there is a law change. Figure 3 shows the adaptive estimate of the negentropy on this test signal, using a forgetting factor γ = 0.98.
The following points are of importance: (i) the law change appears clearly, (ii) the rupture time is properly revealed, (iii) the entropy is estimated with high accuracy, and (iv) our adaptive estimate has a high tracking capability due to its fast convergence (typically within less than 100 samples).
[Figure 2: x(n) with abrupt law change at sample 400.]
[Figure 3: Adaptive estimate of the entropy.]
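For reference, a toy sketch of the forgetting-factor correlation update used above; the Gaussian kernel and the exponential-weighting form are our choices, and the AR fit and the entropy evaluation of Section 2 are then re-run at each step:

```python
import numpy as np

def track_correlation(x_stream, K, forget=0.98, width=0.01):
    """Adaptive version of update (9): R <- forget * R + (1 - forget) * H(k) * e^{j 2 pi k x_n}.
    Returns the sequence of correlation estimates; each one can be turned into
    an entropy estimate with the recursions of Section 2.3."""
    k = np.arange(K + 1)
    H = np.exp(-2.0 * (np.pi * k * width) ** 2)   # assumed Gaussian kernel
    R = np.zeros(K + 1, dtype=complex)
    history = []
    for x in x_stream:
        R = forget * R + (1.0 - forget) * H * np.exp(2j * np.pi * k * x)
        history.append(R.copy())
    return history
```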
3.2. Blind deconvolution of AR systems
The difficult problem of blind deconvolution lies in recovering the input and the parameters of a filter from the sole observation of its output. The concept of entropy brings an interesting answer to this problem, relying on the following proposition:

Proposition 1. Let Y(n, ω) be the output of a filter G(f), normalized such that \int_{-\infty}^{+\infty} |G(f)|^2 df = 1, whose input is a non-Gaussian i.i.d. sequence X(n, ω). Then H_Y > H_X.

The intuitive reason behind this result is that f_Y(y) is closer to a Gaussian distribution than f_X(x), the Gaussian distribution having the maximum entropy in the set of distributions of given variance; see [5]. The deconvolution procedure then simply consists in adjusting the filter such that the reconstructed input has minimum (estimated) entropy. If θ denotes the filter parameters, this writes

    \theta_{opt} = \arg\min_{\theta} \hat{H}_{\hat{X}_\theta}   subject to   \hat{X}_\theta(f) = Y(f) / G_\theta(f).

Simulations were performed in the case of non-minimum phase AR systems. They show that the AR parameters can be identified with outstanding accuracy, and that the input can be perfectly reconstructed, even if the AR order is overestimated. These simulations were performed in the case of uniform and binary inputs, with 500 samples of data.
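A deliberately naive sketch of this minimum-entropy principle: the FIR parametrization of the inverse filter and the random search are stand-ins (the paper deconvolves non-minimum-phase AR systems and any optimizer may be used), and `entropy_estimator` is any callable, for example the AR-based estimator of Section 2.

```python
import numpy as np

def deconvolve_min_entropy(y, entropy_estimator, n_taps=5, n_iter=2000, seed=0):
    """Blind deconvolution by entropy minimization (Section 3.2): search for
    inverse-filter taps theta such that x_hat = theta * y has minimum
    estimated entropy."""
    rng = np.random.default_rng(seed)
    best_theta, best_h = None, np.inf
    for _ in range(n_iter):
        theta = rng.standard_normal(n_taps)
        theta /= np.linalg.norm(theta)              # fix the scale ambiguity
        x_hat = np.convolve(y, theta, mode="same")  # candidate reconstructed input
        h = entropy_estimator(x_hat)
        if h < best_h:
            best_theta, best_h = theta, h
    return best_theta, best_h
```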
3.3. Source separation
In the context of source separation, N signals s(n) = [s_1(n), ..., s_N(n)] are mixed by an unknown N × N matrix A to provide the observed signals x(n) = [x_1(n), ..., x_N(n)]. The task is, from the sole observation of the signals x(n), to recover the sources, assuming only their independence. This goal is reached by designing a matrix B such that the reconstructed signal ŝ(n) = B x(n) has independent components. The information theoretic measure of independence is the mutual information, that is, the Kullback-Leibler divergence between p_{S_1,...,S_N}(s_1, ..., s_N) and \prod_{1 \le i \le N} p_{S_i}(s_i). In the source separation context, this reduces to minimizing the following cost function:

    C(B) = -\log |\det B| + \sum_{i=1}^{N} H(\hat{s}_i).    (12)

In classical methods, as no estimate of the entropy is available, B is chosen as the solution of

    E[\hat{s}_i \, \psi_j(\hat{s}_j)] = 0   (i \neq j),    (13)

which expresses the stationarity condition of C(B). The function ψ_j(s_j) is the so-called score function, the logarithmic derivative of the density p_{S_j}(s_j).
Using our AR parametrization, we can
(i) either estimate the cost function C(B) and minimize it using any standard optimization procedure (as sketched below),
(ii) or estimate the solution of (13) using an analytical expression (in terms of the AR parameters) of the score functions ψ_i(s_i).
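A minimal sketch of the cost (12) for approach (i); the data layout and the use of a generic external optimizer are our choices, and `entropy_estimator` is any marginal entropy estimate, e.g. the AR-based one of Section 2.

```python
import numpy as np

def separation_cost(B, x, entropy_estimator):
    """Cost (12): C(B) = -log|det B| + sum_i H(s_hat_i), with s_hat = B x and
    x of shape (N_sources, N_samples).  Minimizing this over B (with any
    standard optimizer) performs the separation."""
    s_hat = B @ x
    _, logabsdet = np.linalg.slogdet(B)
    return -logabsdet + sum(entropy_estimator(s) for s in s_hat)
```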
We performed simulations with the first approach, using 500 samples of data in the case of a mixture of N = 2 sources. The following table presents, for several distributions of the sources, the resulting matrix M = AB, which should be the identity matrix up to a scaling factor and a permutation.
source 1                source 2                                  M
U[-0.5, 0.5]            (1/2){N(0.3, 0.2) + N(-0.3, 0.2)}         [1  0.0395; 0.0011  1]
U[-0.5, 0.5]            U[-0.5, 0.5]                              [1  0.02;   0.0163  1]
binary {-1/2, 1/2}      N(0, 1)                                   [1  0.0354; 0.02    1]
U[-0.5, 0.5]            N(0, 1)                                   [1  0.23;  -0.0491  1]
4. REFERENCES
[1] P. Viola, N. N. Schraudolph, T. J. Sejnowski, "Empirical Entropy Manipulation for Real-World Problems", Advances in Neural Information Processing Systems 8, MIT Press, 1996.
[2] D. T. Pham, "Blind Separation of Instantaneous Mixture of Sources via an Independent Component Analysis", IEEE Trans. on Signal Processing, vol. 44, no. 11, pp. 2768-2779, Nov. 1996.
[3] G. Kitagawa and W. Gersch, "A smoothness priors long AR model method for spectral estimation", IEEE Trans. on Automatic Control, no. 1, pp. 57-65, 1985.
[4] J.-F. Bercher and C. Vignat, "Entropy Estimation with Applications to Signal Processing Problems", LSC internal report no. 98-057.
[5] D. Donoho, "On minimum entropy deconvolution", Applied Time Series Analysis II, pp. 565-609, Academic Press, 1981.