SPATIO-TEMPORAL PATTERN CLUSTERING METHOD BASED ON INFORMATION
BOTTLENECK PRINCIPLE
L. Gueguen
M. Datcu
GET-Télécom Paris
Signal and Image Processing department
46 rue Barrault, 75013 Paris, France
German Aerospace Center DLR
Remote Sensing Technology Institute IMF
Oberpfaffenhofen, D-82234 Wessling, Germany
ABSTRACT
We present a spatio-temporal pattern clustering method based on information theory. We first present the Information Bottleneck optimization problem, which emerged from Rate-Distortion theory. Then, we present Gauss-Markov Random Fields (GMRF) for the characterization of the spatio-temporal information contained in data; this class of stochastic models provides an intuitive description of 3-dimensional relations in data. Using GMRF modeling and the Information Bottleneck principle, we derive a method for calculating clusters which represent a varying quantity of the information contained in the data. Finally, we present results obtained by clustering spatio-temporal patterns in Satellite Image Time-Series.
1. INTRODUCTION
Nowadays, huge quantities of satellite images are available
thanks to the growing number of satellite sensors. Moreover,
the same scene can be acquired several times a year, which makes it possible to create Satellite Image Time-Series (SITS). The high spatial resolution of the sensors gives access to detailed spatial structures, which extend to spatio-temporal structures when the time evolution of the scene is considered. Therefore, SITS are high-complexity data containing numerous and varied spatio-temporal structures: in a SITS, for example, the growth, maturation or harvest of crops can be observed. Specialized tools for information extraction in SITS have been developed, such as change detection, monitoring, or the validation of physical models. However, these techniques are dedicated to specific applications. Consequently, in order to exploit the information contained in SITS, more general analysis methods are required. Some methods for low-resolution images and uniform sampling have been studied in [1]. For high-resolution and non-uniformly time-sampled SITS, a new spatio-temporal analysis tool is presented in [2, 3]. It is based on a Bayesian
hierarchical model of information content. The concept was first introduced in [4, 5, 6] for information mining in remote sensing image archives. The method is based on two representations of the information: objective and subjective. The subjective representation is obtained from the objective representation by machine learning, under constraints provided by the users. The advantage of such a concept is that it is free of application specificity and adapts to the user’s query. This paper addresses the problem of objectively representing the information contained in SITS by unsupervised clustering.

(This work has been done within the Competence Center in the field of Information Extraction and Image Understanding for Earth Observation, funded by CNES, DLR and ENST.)
The paper is organized as follows. Section 2 introduces the information theoretical concept for spatio-temporal pattern detection and recognition. Section 3 introduces the theory of the Information Bottleneck principle. Section 4 presents spatio-temporal feature extraction. Section 5 presents the Information Bottleneck approach for unsupervised clustering. Experiments and discussion are detailed in Section 6. Finally, Section 7 concludes.
2. INFORMATION THEORETICAL CONCEPT FOR SPATIO-TEMPORAL PATTERN DETECTION AND RECOGNITION
In order to detect or recognize spatio-temporal patterns, it
is essential to characterize information in a low-dimensional
space. Features are extracted by fitting parametric models to the data, in order to characterize the information they contain.
This task can be viewed as a Bayesian hierarchical model in
two stages. The first level of inference is the model selection and the second level is the model fitting. Some criteria,
such as Minimum Description Length [7], have been developed, stating that a good model provides the shortest joint description of the data and the model. The description contains the whole data information, and the data are represented by features. An unsupervised clustering is processed on the feature space, reducing the complexity for fast retrieval of similar patterns. Clustering is equivalent to vector quantization. The choice of model is done first; then the optimal clustering is calculated using the model. However, a joint choice of the model and the clustering would be more accurate. Clustering is done by minimizing a distortion functional, while the choice of model is
done by minimizing the entropy of the signal expressed with
the model. Consequently, the problem can be viewed as a
Rate-Distortion optimization. There is a trade-off between
the amount of relevant information (distortion defined with a
divergence measure d), and the complexity of representation
(rate quantified using mutual information I). Considering the
signal X, the features Θ, the model M and E the expectation
operator, the optimization problem is formally expressed as:
$$\min_{M,\, p(\theta \mid x)} \; I(X, \Theta) + \beta\, \mathbb{E}_{X,\Theta}\left[ d(X, \Theta) \right] \quad (1)$$
The best model is the one that gives the lowest Rate-Distortion curve, which is a parametric function of β. When β → 0, simpler models are chosen, which leads to few clusters. On the contrary, when β → ∞, more complex models are chosen to fit the data. Therefore, with varying β, we can obtain different clusterings depending on how much we trust the model: the more we trust the model, the more the distortion measure is decreased. Hence, the idea is to introduce a new distortion measure between the model and the representation, quantifying how much the model is trusted. This is done using the Information Bottleneck principle.
3. INFORMATION BOTTLENECK PRINCIPLE
Information Bottleneck emerged from Rate-Distortion theory. The problem is stated as follows: we would like a relevant quantizer X̃ to compress X as much as possible, under the constraint of a distortion measure between X and X̃. At the same time, we want X̃ to capture as much information as possible about a third variable Y. In effect, the information that X provides about Y is passed through a bottleneck formed by the compact summary X̃. The problem is mathematically expressed as:

$$\min_{p(\tilde{x} \mid x)} \; I(\tilde{X}, X) - \beta\, I(\tilde{X}, Y) \quad (2)$$

Algorithms to solve this problem are described in [8]; they are mainly inspired by the Blahut-Arimoto algorithm, and they make the assumption of the following Markov chain: Y ↔ X ↔ X̃. However, Banerjee demonstrated in [9] that the Information Bottleneck can be viewed as a Rate-Distortion problem based on a Bregman divergence. He considered Z = p(Y | X) and Z̃ = p(Y | X̃) as sufficient statistics for X and X̃ respectively. Z takes its values over the set of conditional distributions {p(Y | x)}, and Z̃ takes its values over the set of conditional distributions {p(Y | x̃)} = Z̃s. The problem equivalent to the Information Bottleneck is therefore written as:

$$\min_{\tilde{Z}_s,\, p(\tilde{z} \mid z)} \; I(Z, \tilde{Z}) + \beta\, \mathbb{E}_{Z,\tilde{Z}}\left[ d(Z, \tilde{Z}) \right] \quad (3)$$

d is a Bregman divergence, which corresponds here to the Kullback-Leibler divergence:

$$d(z, \tilde{z}) = \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid \tilde{x})} \quad (4)$$

Cover and Thomas gave the solution to this problem for a fixed Z̃s:

$$p(\tilde{z} \mid z) = \frac{p(\tilde{z})\, e^{-\beta d(z, \tilde{z})}}{N(z, \beta)} \quad (5)$$

$$N(z, \beta) = \sum_{\tilde{z}} p(\tilde{z})\, e^{-\beta d(z, \tilde{z})} \quad (6)$$

where N(z, β) is the partition function. For fixed probabilistic assignments p(z̃ | z), the solution is given by:

$$\tilde{z}^{*} = \mathbb{E}_{Z \mid \tilde{z}}[Z] \quad (7)$$

$$\tilde{z}^{*} = \sum_{z} p(z \mid \tilde{z})\, z \quad (8)$$

Using these two properties, Banerjee proposed in [9, 10] an iterative algorithm to compute Z̃s and p(z̃ | z). This algorithm is used to solve the problem and reaches a local optimum of the functional. Finally, from this optimization, the divergence Dβ and the rate Rβ can be computed with the following formulas:

$$D_\beta = \sum_{z, \tilde{z}} p(z)\, p(\tilde{z} \mid z)\, d(z, \tilde{z}) \quad (9)$$

$$R_\beta = \sum_{z, \tilde{z}} p(z)\, p(\tilde{z} \mid z) \log \frac{p(\tilde{z} \mid z)}{p(\tilde{z})} \quad (10)$$
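For concreteness, here is a minimal NumPy sketch of this alternating scheme: soft assignments via (5)–(6), centroid updates via (7)–(8), and the final Dβ and Rβ via (9)–(10). It assumes the data are given as rows of conditional distributions p(Y | x_i); the function name, the random initialization and the fixed iteration count are assumptions of this sketch, not details taken from [9, 10].

```python
import numpy as np

def ib_clustering(Z, p_z, n_centroids, beta, n_iter=100, seed=0):
    """Rate-Distortion/IB clustering sketch in the spirit of [9, 10].

    Z    : (n, k) array; row i is the conditional distribution p(Y | x_i).
    p_z  : (n,) weights p(z_i), summing to 1.
    Returns soft assignments q = p(z~ | z), centroids, and (D_beta, R_beta).
    """
    rng = np.random.default_rng(seed)
    n, _ = Z.shape
    eps = 1e-12
    Zt = Z[rng.choice(n, n_centroids, replace=False)].copy()  # initial z~
    p_zt = np.full(n_centroids, 1.0 / n_centroids)            # initial p(z~)

    for _ in range(n_iter):
        # KL divergences d(z_i, z~_j), eq. (4).
        d = np.sum(Z[:, None, :] * (np.log(Z[:, None, :] + eps)
                                    - np.log(Zt[None, :, :] + eps)), axis=2)
        # Soft assignments p(z~ | z) proportional to p(z~) exp(-beta d), eqs. (5)-(6).
        logits = np.log(p_zt + eps)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
        # Centroid update z~* = E[Z | z~], eqs. (7)-(8).
        w = q * p_z[:, None]          # joint p(z, z~)
        p_zt = w.sum(axis=0)          # marginal p(z~)
        Zt = (w.T @ Z) / (p_zt[:, None] + eps)

    # Distortion and rate of the last iterate, eqs. (9)-(10).
    D = float(np.sum(w * d))
    R = float(np.sum(w * (np.log(q + eps) - np.log(p_zt + eps)[None, :])))
    return q, Zt, (D, R)
```

Running this sketch for several values of β traces out one point (Rβ, Dβ) per run, i.e. the Rate-Distortion curves discussed in the experiments.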
4. SPATIO-TEMPORAL FEATURE EXTRACTION
Gauss-Markov Random Fields (GMRF) have shown interesting properties for characterizing textures in satellite images [11, 12]. GMRF are parametric models, and the principle can be extended to a 3-dimensional signal. The field is defined on a rectangular grid. Let Xs be the signal, with s belonging to a lattice Ω, and let N be the half of a symmetric 3-D neighborhood (Fig. 1). GMRF are then defined as follows:

$$X_s = \sum_{r \in N} \theta_r \left( X_{s+r} + X_{s-r} \right) + e_s \quad (11)$$
where es is a white Gaussian noise. The parameters Θ̂ and the noise variance σ̂ are then estimated by Least Mean Squares, which corresponds to the Maximum Likelihood estimation under a white Gaussian error. Equation (11) is expressed vectorially in (12), introducing a matrix G built from X; the estimated parameters are then given by the following equations:

$$X = G\Theta + E \quad (12)$$

$$\hat{\Theta} = (G^T G)^{-1} G^T X \quad (13)$$

$$\hat{\sigma}^2 = X^T X - (G\hat{\Theta})^T (G\hat{\Theta}) \quad (14)$$
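To illustrate the estimation, here is a minimal NumPy sketch of (12)–(14) for a 3-D GMRF; the offset convention, the interior-crop border handling and all names are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def estimate_gmrf(x, offsets):
    """Least Mean Square GMRF estimation, eqs. (12)-(14).

    x       : 3-D array (width, height, time), one realization of X.
    offsets : list of (dr, dc, dt) tuples, the half-neighborhood N (Fig. 1).
    Returns (theta_hat, sigma2_hat).
    """
    m = max(max(abs(c) for c in off) for off in offsets)  # margin, >= 1
    core = (slice(m, -m),) * 3          # interior points only, so every
    X = x[core].ravel()                 # shifted neighbor stays on the grid

    # Build G: one column per parameter theta_r, holding X_{s+r} + X_{s-r}.
    cols = []
    for off in offsets:
        plus = np.roll(x, tuple(-c for c in off), axis=(0, 1, 2))
        minus = np.roll(x, off, axis=(0, 1, 2))
        cols.append((plus + minus)[core].ravel())
    G = np.stack(cols, axis=1)

    # Eq. (13), solved without forming the explicit inverse of G^T G.
    theta_hat, *_ = np.linalg.lstsq(G, X, rcond=None)
    fit = G @ theta_hat
    sigma2_hat = float(X @ X - fit @ fit)   # residual energy, eq. (14)
    return theta_hat, sigma2_hat
```

For the first-order neighborhood of Fig. 1, offsets could for instance be [(1, 0, 0), (0, 1, 0), (0, 0, 1)], which yields the 3 parameters of order 1.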
From the estimated parameters, it is possible in the case of GMRF to approximate the evidence of the model [13]. With P and Q the respective dimensions of X and Θ, the evidence is given by:

$$p(X \mid M) \approx \frac{\pi^{-N/2}\, \Gamma\!\left(\tfrac{Q}{2}\right) \Gamma\!\left(\tfrac{P-Q}{2}\right) |G^T G|^{-1/2}}{4 R_\delta R_\sigma\, (\hat{\Theta}^T \hat{\Theta})^{Q/2}\, \hat{\sigma}^{P-Q}} \quad (15)$$

[Fig. 1. Half of a symmetric 3-D neighborhood, shown for three neighborhood orders (orders 1, 2 and 3, with 3, 9 and 13 parameters respectively) over the time planes t = 0 and t = −1. The central pixel, in black, does not belong to the neighborhood; t is the third dimension.]
By maximizing the evidence, we select the model order that best represents the information contained in the data. However, we have several realizations of the random variable X, and one model order has to be chosen to represent the data in a unique feature space. We therefore consider T independent realizations of X, named {X_i}_{i≤T}, so that the evidence of the model is expressed as:

$$p(X_1, X_2, \cdots, X_T \mid M) = \prod_{1 \le i \le T} p(X_i \mid M) \quad (16)$$
The model order is selected as the one which maximizes this evidence. Each realization is then represented by the estimated features corresponding to the selected model order. Introducing a distance in the feature space makes it possible to compare spatio-temporal structures efficiently through their centroids. By this space transposition, the dimensionality of the data to be processed is greatly reduced while the similarity between spatio-temporal structures is preserved.
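As an illustration, a small sketch of this selection step, assuming a hypothetical helper log_evidence(x, order) that evaluates the logarithm of (15) for one realization:

```python
def select_model_order(realizations, orders, log_evidence):
    """Pick the GMRF order maximizing the joint evidence (16).

    realizations : list of 3-D arrays, the T independent realizations {X_i}.
    orders       : candidate model orders, e.g. [1, 2, 3].
    log_evidence : callable (x, order) -> log p(x | M_order), from eq. (15).
    """
    # Independence across realizations turns the product (16) into a sum of logs.
    scores = {m: sum(log_evidence(x, m) for x in realizations) for m in orders}
    return max(scores, key=scores.get)
```

Working with log-evidences avoids the underflow that the product in (16) would cause for large T.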
5. INFORMATION BOTTLENECK APPROACH FOR CLUSTERING
We consider that X is the data and that Θ = X̃ is the summary of X. Let the model M = Y be the random variable that contains the relevant information. The Information Bottleneck then gives a formalism expressing the trade-off between compression (a short summary) and the relevant information contained in the summary. In fact, this principle gives a way to know how much information can be extracted from the data by a predetermined set of models. The problem is written as:

$$\min_{p(\theta \mid x)} \; I(\Theta, X) - \beta\, I(\Theta, M) \quad (17)$$
Considering z = p(M | x) and z̃ = p(M | θ), the problem is written as in (3) and its solutions are given by (5) and (7). To compute the results, we need to evaluate p(z) and z; the other variables are evaluated by the Banerjee algorithm. To compute z = p(M | x), we use Bayes’ rule:

$$z = p(M \mid x) = \frac{p(x \mid M)\, p(M)}{p(x)} \quad (18)$$

If we consider that the random variable M takes its values in a GMRF family of different orders, the following equations avoid calculating p(x):

$$\sum_{m} p(m \mid x) = 1 \quad (19)$$

$$p(m \mid x) = \frac{p(x \mid m)\, p(m)}{\sum_{m} p(x \mid m)\, p(m)} \quad (20)$$

p(x | m) is given by (15); to calculate it, we assume that the constant R_δ R_σ in (15) does not vary when changing the order of the GMRF. There are two choices for evaluating p(m): either we consider p(m) constant, or we take p(m) proportional to the number of times m is selected by maximizing the evidence over the realizations of X. Finally, to evaluate p(z), since z has been computed, we estimate p(z) with a histogram based on Parzen windows.
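A minimal sketch of this step, reusing the hypothetical log_evidence helper from the previous snippets; working in the log domain stands in for the explicit normalization (20), and the Gaussian Parzen kernel with width h is our assumption:

```python
import numpy as np

def model_posterior(x, orders, log_evidence, log_prior=None):
    """Posterior z = p(M | x) over candidate GMRF orders, eqs. (18)-(20)."""
    log_ev = np.array([log_evidence(x, m) for m in orders], dtype=float)
    if log_prior is not None:          # constant p(m) when omitted
        log_ev += np.asarray(log_prior, dtype=float)
    log_ev -= log_ev.max()             # stabilize before exponentiating
    z = np.exp(log_ev)
    return z / z.sum()                 # normalization enforces eq. (19)

def parzen_weights(Z, h=0.1):
    """Parzen-window estimate of p(z) at the sample points, normalized to 1.

    Z : (n, k) array of posteriors z_i = p(M | x_i), one row per realization.
    """
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    dens = np.exp(-d2 / (2.0 * h ** 2)).sum(axis=1)
    return dens / dens.sum()
```

The rows of Z and the weights returned by parzen_weights are exactly the inputs expected by the ib_clustering sketch of Section 3.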
6. EXPERIMENTS AND DISCUSSION
For our experiments, we worked on SITS provided by CNES. A parallelepipedic partition of the data is performed, and each parallelepiped is considered as a realization of a random variable X. The partition is determined by the size (width × height × time) of the parallelepipeds; this size is also called the analyzing window size. We fixed the number of centroids z̃ at 10% of the number of realizations. Some Rate-Distortion curves have been computed (Fig. 2) to show the influence of the analyzing window size.

[Fig. 2. Rate-Distortion curves (Rate vs. −log(Distortion)) computed for 3 analyzing window sizes (12×12×5, 10×10×5 and 8×8×5), the size being noted width×length×time.]
[Fig. 3. Influence of β on the number of clusters found; the number of clusters is plotted against log(β).]

[Fig. 4. 150 centroids z̃ fused into 9 distinguishable cluster centroids. The points represent the projection of the z̃ onto the (p(M1 | x), p(M2 | x)) plane, where M1 and M2 are the first and second order GMRF models.]
It is noticeable that the bigger the window is, the better the curve. It has to be noted that the same number of realizations was used for all the experiments; therefore, a bigger analyzing window size would be preferred. However, the size cannot be extended indefinitely, so it is arbitrarily limited. We conducted another experiment on the number of clusters derived from the soft clustering by Maximum A Posteriori: for a fixed number of centroids z̃, we calculated the number of clusters z̃* obtained for several values of β. As shown in Fig. 3, the number of clusters increases with β. This phenomenon follows from Rate-Distortion theory. Indeed, the more we trust the models, the more numerous the clusters are. On the contrary, the less we trust the models, the more the mutual information between signal and centroids decreases, which leads to a more compact representation and few clusters. In fact, the centroids fuse, and the number of clusters obtained depends on β (Fig. 4).
7. CONCLUSION
In this paper, we have presented a new way of clustering spatio-temporal patterns based on the Information-Bottleneck principle. This method differs from others in that it does not require the choice of a single model. Moreover, the method enables the creation of a compact representation of the information for data-mining. Finally, the technique can be extended by applying the Information-Bottleneck principle to a wider family of models, such as autobinomial GRF.

8. REFERENCES

[1] C.M. Antunes and A.L. Oliveira, “Temporal Data Mining: an Overview,” Workshop on Temporal Data Mining, IST, Lisbon Technical University, 2001.

[2] P. Heas and M. Datcu, “Modeling Trajectory of Dynamic Clusters in Image Time-Series for Spatio-Temporal Reasoning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 7, pp. 1635–1647, July 2005.

[3] P. Heas, P. Marthon, M. Datcu, and A. Giros, “Image time-series mining,” in IGARSS’04, Anchorage, USA, Sept. 2004, vol. 4, pp. 2420–2423.

[4] M. Datcu and K. Seidel, “Image Information Mining: Exploration of Image Content in Large Archives,” in IEEE Aerospace Conference Proceedings, March 2000, vol. 3, pp. 253–264.
[5] M. Datcu, K. Seidel, S. D’Elia, and P.G. Marchetti,
“Knowledge-driven Information Mining in Remote-Sensing
Image Archives,” ESA Bulletin, vol. 110, pp. 26–33, May
2002.
[6] M. Datcu, H. Daschiel, et al., “Information Mining in Remote Sensing Image Archives: System Concepts,” IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 12, pp. 2923–2936, Dec. 2003.
[7] J.J. Rissanen, “A Universal Data Compression System,” IEEE
Transactions on Information Theory, vol. IT-29, no. 5, pp. 656–
664, Sept. 1983.
[8] N. Tishby, F. Pereira, and W. Bialek, “The Information Bottleneck Method,” in Proc. 37th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368–377.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu, “An information theoretic analysis of maximum likelihood mixture estimation for exponential families,” in Proc. Twenty-First International Conference on Machine Learning, Alberta, Canada, July 2004, ACM Press.
[10] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering
with Bregman Divergences,” in SIAM International Conference on Data Mining, 2004.
[11] M. Schroder, H. Rehrauer, K. Seidel, and M. Datcu, “Spatial Information Retrieval from Remote-Sensing Images. II. Gibbs-Markov Random Fields,” IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 5, pp. 1446–1455, Sept. 1998.
[12] R. Chellappa and R.I. Kashyap, “Texture Synthesis Using 2-D Noncausal Autoregressive Models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 1, pp. 194–204, Feb. 1985.
[13] J.J.K. O Ruanaidh and W.J. Fitzgerald, Numerical Bayesian
Methods Applied to Signal Processing, chapter 2, Springer,
1996.