SPATIO-TEMPORAL PATTERN CLUSTERING METHOD BASED ON INFORMATION BOTTLENECK PRINCIPLE

L. Gueguen, GET-Télécom Paris, Signal and Image Processing department, 46 rue Barrault, 75013 Paris, France
M. Datcu, German Aerospace Center DLR, Remote Sensing Technology Institute IMF, Oberpfaffenhofen, D-82234 Wessling, Germany

ABSTRACT

We present a spatio-temporal pattern clustering method based on information theory. We first introduce the Information Bottleneck optimization problem, which emerged from Rate-Distortion theory. We then present Gauss-Markov Random Fields (GMRF) for characterizing the spatio-temporal information contained in data; this class of stochastic models provides an intuitive description of 3-dimensional relations in data. Using GMRF modeling and the Information Bottleneck principle, we derive a method for computing clusters that represent a varying quantity of the information contained in the data. Finally, we present results obtained by clustering spatio-temporal patterns in Satellite Image Time-Series.

1. INTRODUCTION

Nowadays, huge quantities of satellite images are available thanks to the growing number of satellite sensors. Moreover, the same scene can be acquired several times a year, which makes it possible to create Satellite Image Time-Series (SITS). The high spatial resolution of the sensors gives access to detailed spatial structures, which extend to spatio-temporal structures when the time evolution of the scene is considered. SITS are therefore high-complexity data containing numerous and varied spatio-temporal structures: in a SITS, for example, the growth, maturation or harvest of crops can be observed. Specialized tools for information extraction in SITS have been developed, such as change detection, monitoring or validation of physical models. However, these techniques are dedicated to specific applications. Consequently, in order to exploit the information contained in SITS, more general analysis methods are required.
Some methods for low-resolution images with uniform time sampling have been studied in [1]. For high-resolution, non-uniformly time-sampled SITS, a new spatio-temporal analysis tool is presented in [2, 3]. It is based on a Bayesian hierarchical model of information content. The concept was first introduced in [4, 5, 6] for information mining in remote sensing image archives. The method is based on two representations of the information: objective and subjective. The subjective representation is obtained from the objective representation by machine learning, under constraints provided by the users. The advantage of such a concept is that it is free of application specificity and adapts to the user's query. This paper addresses the problem of representing objectively the information contained in SITS by unsupervised clustering.

(This work has been done within the Competence Center in the field of Information Extraction and Image Understanding for Earth Observation, funded by CNES, DLR and ENST.)

The paper is organized as follows. Section 2 introduces the information-theoretical concept for spatio-temporal pattern detection and recognition. Section 3 presents the Information Bottleneck principle. Section 4 presents spatio-temporal feature extraction. Section 5 presents the Information Bottleneck approach for unsupervised clustering. Experiments and discussion are detailed in Section 6. Finally, Section 7 concludes.

2. INFORMATION THEORETICAL CONCEPT FOR SPATIO-TEMPORAL PATTERN DETECTION AND RECOGNITION

In order to detect or recognize spatio-temporal patterns, it is essential to characterize information in a low-dimensional space. Features are extracted by fitting parametric models to the data, in order to characterize the information they contain. This task can be viewed as a Bayesian hierarchical model in two stages: the first level of inference is model selection, and the second level is model fitting.
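These two inference levels can be sketched on a toy one-dimensional signal. In the sketch below, least-squares fitting of an autoregressive model stands in for the model-fitting level, and a BIC-style description-length criterion stands in for the model-selection level; the AR process, the orders tried and the criterion are illustrative choices, not the paper's GMRF machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D signal: an AR(2) process standing in for the spatio-temporal data.
n = 2000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

def fit_ar(x, p):
    """Level 2 (model fitting): least-squares AR(p) fit; returns params and residual variance."""
    G = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    y = x[p:]
    theta, *_ = np.linalg.lstsq(G, y, rcond=None)
    sigma2 = np.mean((y - G @ theta) ** 2)
    return theta, sigma2

def description_length(x, p):
    """Level 1 (model selection): two-part code length, data given model + model cost."""
    _, sigma2 = fit_ar(x, p)
    n_eff = len(x) - p
    return 0.5 * n_eff * np.log(sigma2) + 0.5 * p * np.log(n_eff)

orders = range(1, 6)
best = min(orders, key=lambda p: description_length(x, p))
print("selected order:", best)
```

The shortest-description criterion penalizes extra parameters, so overly complex orders lose to the order that actually generated the data.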
Some criteria, such as Minimum Description Length [7], have been developed, stating that a good model provides the shortest description of the data together with the model. The description contains the whole data information, and the data are represented by features. Unsupervised clustering is then performed in the feature space, reducing the complexity for fast retrieval of similar patterns; clustering is equivalent to vector quantization. Usually, the model is chosen first, and the optimal clustering is then computed given the model. However, a joint choice of the model and the clustering would be more accurate. Clustering is done by minimizing a distortion functional, while the model is chosen by minimizing the entropy of the signal expressed with the model. Consequently, the problem can be viewed as a Rate-Distortion optimization: there is a trade-off between the amount of relevant information (distortion, defined with a divergence measure d) and the complexity of the representation (rate, quantified using the mutual information I). Considering the signal X, the features Θ, the model M and the expectation operator E, the optimization problem is formally expressed as:

    min_{M, p(θ|x)}  I(X, Θ) + β E_{X,Θ}[d(X, Θ)]    (1)

The best model is the one that gives the lowest Rate-Distortion curve, which is a parametric function of β. When β → 0, simpler models are chosen and few clusters are obtained. On the contrary, when β → ∞, more complex models are chosen to fit the data. Therefore, by varying β, we obtain different clusterings depending on how much we trust the model: the more we trust the model, the more the distortion measure is decreased. Hence, the idea is to introduce a new distortion measure between the model and the representation, quantifying how much the model is trusted. This is done by using the Information Bottleneck principle.
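The two competing terms of (1) can be made concrete for a fixed soft assignment p(θ|x). The sketch below uses made-up distributions and a made-up distortion table (none of these numbers come from the paper) to evaluate the rate term I(X, Θ) and the expected distortion, and to show how β weights them.

```python
import numpy as np

# Toy discrete setting: 4 signal values x, 2 feature/cluster values theta.
p_x = np.array([0.4, 0.3, 0.2, 0.1])           # p(x)
p_t_given_x = np.array([[0.9, 0.1],            # p(theta | x), one row per x
                        [0.8, 0.2],
                        [0.3, 0.7],
                        [0.1, 0.9]])
d = np.array([[0.1, 2.0],                      # d(x, theta): illustrative distortion
              [0.2, 1.5],
              [1.0, 0.3],
              [2.0, 0.1]])

p_xt = p_x[:, None] * p_t_given_x              # joint p(x, theta)
p_t = p_xt.sum(axis=0)                         # marginal p(theta)

# Rate term: mutual information I(X, Theta).
rate = np.sum(p_xt * np.log(p_t_given_x / p_t))
# Distortion term: E_{X,Theta}[d(X, Theta)].
distortion = np.sum(p_xt * d)

for beta in (0.1, 1.0, 10.0):
    print(f"beta={beta}: functional = {rate + beta * distortion:.3f}")
```

Small β mostly rewards compression (low rate), while large β mostly rewards fidelity (low distortion), which is the trade-off (1) formalizes.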
3. INFORMATION BOTTLENECK PRINCIPLE

The Information Bottleneck emerged from Rate-Distortion theory. The problem is stated as follows: we would like a relevant quantizer X̃ to compress X as much as possible, under the constraint of a distortion measure between X and X̃. At the same time, we want X̃ to capture as much information as possible about a third variable Y. In effect, we pass the information that X provides about Y through a bottleneck formed by the compact summary X̃. The problem is mathematically expressed as:

    min_{p(x̃|x)}  I(X̃, X) − β I(X̃, Y)    (2)

Algorithms solving this problem are described in [8]; they are mainly inspired by the Blahut-Arimoto algorithm and make the assumption of the Markov chain Y ↔ X ↔ X̃.

However, Banerjee demonstrated in [9] that the Information Bottleneck can be viewed as a Rate-Distortion problem based on a Bregman divergence. He considered Z = p(Y | X) and Z̃ = p(Y | X̃) as sufficient statistics for X and X̃ respectively. Z takes values over the set of conditional distributions {p(Y | x)}, and Z̃ takes values over the set of conditional distributions {p(Y | x̃)} = Z̃_s. The problem equivalent to the Information Bottleneck is then written as:

    min_{Z̃_s, p(z̃|z)}  I(Z, Z̃) + β E_{Z,Z̃}[d(Z, Z̃)]    (3)

where d is a Bregman divergence, which here corresponds to the Kullback-Leibler divergence:

    d(z, z̃) = Σ_y p(y | x) log [ p(y | x) / p(y | x̃) ]    (4)

Cover and Thomas gave the solution to this problem for a fixed Z̃_s:

    p(z̃ | z) = [ p(z̃) / N(z, β) ] e^{−β d(z, z̃)}    (5)

    N(z, β) = Σ_{z̃} p(z̃) e^{−β d(z, z̃)}    (6)

where N(z, β) is the partition function. For fixed probabilistic assignments p(z̃ | z), the solution is given by:

    z̃* = E_{Z|z̃}[Z]    (7)
       = Σ_z p(z | z̃) z    (8)

Using these two properties, Banerjee proposed in [9, 10] an iterative algorithm to compute Z̃_s and p(z̃ | z). This algorithm is used to solve the problem and reaches a local optimum of the functional. Finally, from this optimization, the distortion D_β and the rate R_β can be computed by the following formulas:

    D_β = Σ_{z,z̃} p(z) p(z̃ | z) d(z, z̃)    (9)

    R_β = Σ_{z,z̃} p(z) p(z̃ | z) log [ p(z̃ | z) / p(z̃) ]    (10)

4. SPATIO-TEMPORAL FEATURE EXTRACTION

Gauss-Markov Random Fields have shown interesting properties for characterizing textures in satellite images [11, 12]. GMRF are parametric models, and the principle extends to a 3-dimensional signal. The field is defined on a rectangular grid. Let X_s be the signal, with s belonging to a lattice Ω, and let N be one half of a symmetric 3-d neighborhood (Fig. 1). GMRF are then defined as follows:

    X_s = Σ_{r∈N} θ_r (X_{s+r} + X_{s−r}) + e_s    (11)

where e_s is a white Gaussian noise. The parameters Θ̂ and the noise variance σ̂² are estimated by least squares, which corresponds to the Maximum Likelihood estimation for a white Gaussian error. Equation (11) is expressed vectorially in (12), introducing a matrix G built from X. The estimated parameters are then expressed as:

    X = GΘ + E    (12)
    Θ̂ = (G^T G)^{−1} G^T X    (13)
    σ̂² = X^T X − (GΘ̂)^T (GΘ̂)    (14)

[Fig. 1. Half of a symmetric 3-d neighborhood. Three different neighborhood orders are represented. The central pixel, which does not belong to the neighborhood, is in black; t is the third dimension.]

From the estimated parameters, it is possible in the case of GMRF to approximate the evidence of the model [13]. With P and Q the respective dimensions of X and Θ, the evidence is given by:

    p(X | M) ≈ [ π^{−N/2} Γ(Q/2) Γ((P−Q)/2) |G^T G|^{−1/2} ] / [ 4 R_δ R_σ (Θ̂^T Θ̂)^{Q/2} σ̂^{P−Q} ]    (15)

By maximizing the evidence, we select the model order that best represents the information contained in the data. However, we have several realizations of the random variable X, and a single model order has to be chosen to represent the data in a unique feature space. We therefore consider T independent realizations of X, named {X_i}_{i≤T}, so that the evidence of the model is expressed as:

    p(X_1, X_2, · · · , X_T | M) = ∏_{1≤i≤T} p(X_i | M)    (16)

The model order is selected as the one maximizing this evidence. Each realization is then represented by the estimated features corresponding to the selected model order. By introducing a distance in the feature space, it becomes possible to compare spatio-temporal structures efficiently through their centroids. Through this transposition, the dimensionality of the data to be processed is greatly reduced while the similarity between spatio-temporal structures is kept.

5. INFORMATION BOTTLENECK APPROACH FOR CLUSTERING

We consider that X is the data and Θ = X̃ is the summary of X. Let the model M = Y be the random variable that contains the relevant information. The Information Bottleneck then gives a formalism expressing the trade-off between compression (a short summary) and the relevant information contained in the summary. In fact, this principle gives a way to know how much information can be extracted from the data by a predetermined set of models. The problem is written as:

    min_{p(θ|x)}  I(Θ, X) − β I(Θ, M)    (17)

Considering z = p(M | x) and z̃ = p(M | θ), the problem is written as in (3), and the solutions are given by (5) and (7). To compute the results, we need to evaluate p(z) and z; the other variables are computed by the Banerjee algorithm. To compute z = p(M | x), we use Bayes' rule:

    z = p(M | x) = p(x | M) p(M) / p(x)    (18)

Considering that the random variable M takes values in the GMRF family of different orders, the following equations avoid computing p(x):

    Σ_m p(m | x) = 1    (19)

    p(m | x) = p(x | m) p(m) / Σ_m p(x | m) p(m)    (20)

p(x | m) is given by (15).

[Fig. 2. Rate-Distortion curves (Rate vs −log(Distortion)) computed for 3 analyzing window sizes: 12x12x5, 10x10x5 and 8x8x5. The size is noted as width×length×time.]
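This clustering loop can be sketched numerically. The sketch below assumes synthetic Dirichlet-distributed conditionals z = p(M | x) in place of evidences computed from real SITS, and the values of β and of the number of centroids are arbitrary; it alternates the Boltzmann assignment of equations (5)-(6) with the centroid update (7)-(8), then evaluates the distortion (9) and the rate (10).

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """Pairwise KL divergence d(z, z~) between rows of p and rows of q (eq. 4)."""
    return np.sum(p[:, None, :] * np.log(p[:, None, :] / q[None, :, :]), axis=2)

# Stand-in data: each row of z is a conditional distribution p(M | x) over 3 models.
n, k, beta = 200, 6, 5.0
z = rng.dirichlet(np.ones(3), size=n)
p_z = np.full(n, 1.0 / n)                      # uniform weight per realization
zt = z[rng.choice(n, size=k, replace=False)]   # initial centroids z~
p_zt = np.full(k, 1.0 / k)

for _ in range(100):
    # Eq. (5)-(6): Boltzmann assignment, normalized by the partition function N(z, beta).
    w = p_zt[None, :] * np.exp(-beta * kl(z, zt))
    p_zt_given_z = w / w.sum(axis=1, keepdims=True)
    joint = p_z[:, None] * p_zt_given_z        # p(z, z~)
    p_zt = joint.sum(axis=0)                   # updated marginal p(z~)
    zt = (joint.T @ z) / p_zt[:, None]         # Eq. (7)-(8): z~* = E[Z | z~]

# One final assignment against the converged centroids, then eq. (9) and (10).
w = p_zt[None, :] * np.exp(-beta * kl(z, zt))
p_zt_given_z = w / w.sum(axis=1, keepdims=True)
joint = p_z[:, None] * p_zt_given_z
D_beta = np.sum(joint * kl(z, zt))
R_beta = np.sum(joint * np.log(p_zt_given_z / p_zt[None, :]))
print(f"D_beta = {D_beta:.4f}, R_beta = {R_beta:.4f}")
```

Sweeping β in this loop traces out one point of the Rate-Distortion curve per value, which is how curves such as those in Fig. 2 are obtained.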
There are two choices for evaluating p(m): either we consider p(m) constant, or we take p(m) proportional to the number of occurrences where m is selected by maximizing the evidence for each realization of X. Finally, to evaluate p(z), since z has been computed, we estimate p(z) using a Parzen-window histogram.

6. EXPERIMENTS AND DISCUSSION

For our experiments, we have worked on SITS provided by the CNES. A parallelepipedic partition of the data is performed, and each parallelepiped is considered a realization of the random variable X. The partition is determined by the size (width × height × time) of the parallelepipeds, also called the analyzing window size. We fixed the number of centroids z̃ at 10% of the number of realizations. (To calculate p(x | m), we made the assumption that the constant R_δ R_σ of equation (15) does not vary with the order of the GMRF.)

Some Rate-Distortion curves have been computed (Fig. 2) to show the influence of the analyzing window size. It is noticeable that the bigger the size, the better the curve; note that the same number of realizations was used in all experiments. Therefore, a bigger analyzing window size would be chosen. However, the size cannot be extended indefinitely, so it is limited arbitrarily.

We performed another experiment on the number of clusters derived from the soft clustering by Maximum A Posteriori. For a fixed number of centroids z̃, we calculated the number of clusters z̃* obtained for several values of β. In Fig. 3, as β increases, the number of clusters increases as well. This phenomenon follows from Rate-Distortion theory: the more we trust the models, the more numerous the clusters are. Conversely, the less we trust the models, the more the mutual information between signal and centroids decreases, which leads to a more compact representation with few clusters. In fact, the centroids fuse, and the number of clusters obtained depends on β (Fig. 4).

[Fig. 3. Influence of β on the number of clusters found.]

[Fig. 4. 150 centroids z̃ fused into 9 distinguishable cluster centroids. The points represent the projection of z̃ on a plane with axes p(M1|x) and p(M2|x), where M1 and M2 are the first- and second-order GMRF models.]

7. CONCLUSION

In this paper, we have presented a new way of clustering spatio-temporal patterns based on the Information Bottleneck principle. This method differs from others in that it does not require the prior choice of a model. Moreover, the method enables the creation of a compact representation of the information for data mining. Finally, the technique can be extended by applying the Information Bottleneck principle to a wider family of models, such as autobinomial GRF.

8. REFERENCES

[1] C.M. Antunes and A.L. Oliveira, "Temporal Data Mining: an Overview," Workshop on Temporal Data Mining, IST, Lisbon Technical University, 2001.
[2] P. Heas and M. Datcu, "Modeling Trajectory of Dynamic Clusters in Image Time-Series for Spatio-Temporal Reasoning," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 7, pp. 1635–1647, July 2005.
[3] P. Heas, P. Marthon, M. Datcu, and A. Giros, "Image time-series mining," in IGARSS'04, Anchorage, USA, Sept. 2004, vol. 4, pp. 2420–2423.
[4] M. Datcu and K. Seidel, "Image Information Mining: Exploration of Image Content in Large Archives," in IEEE Aerospace Conference Proceedings, March 2000, vol. 3, pp. 253–264.
[5] M. Datcu, K. Seidel, S. D'Elia, and P.G. Marchetti, "Knowledge-driven Information Mining in Remote-Sensing Image Archives," ESA Bulletin, vol. 110, pp. 26–33, May 2002.
[6] M. Datcu, H. Daschiel, et al., "Information Mining in Remote Sensing Image Archives: System Concepts," IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 12, pp. 2923–2936, Dec. 2003.
[7] J.J. Rissanen, "A Universal Data Compression System," IEEE Transactions on Information Theory, vol. IT-29, no. 5, pp. 656–664, Sept. 1983.
[8] N. Tishby, F. Pereira, and W. Bialek, "The Information Bottleneck Method," in Proc. 37th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368–377.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu, "An information theoretic analysis of maximum likelihood mixture estimation for exponential families," in Proc. 21st International Conference on Machine Learning, Alberta, Canada, July 2004, ACM Press.
[10] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," in SIAM International Conference on Data Mining, 2004.
[11] M. Schroder, H. Rehrauer, K. Seidel, and M. Datcu, "Spatial Information Retrieval from Remote-Sensing Images. II. Gibbs-Markov Random Fields," IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 5, pp. 1446–1455, Sept. 1998.
[12] R. Chellappa and R.L. Kashyap, "Texture Synthesis Using 2-D Noncausal Autoregressive Models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 1, pp. 194–204, Feb. 1985.
[13] J.J.K. Ó Ruanaidh and W.J. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing, chapter 2, Springer, 1996.