Realizing Teracomputing. W. Zwieflhofer and N. Kreitz, editors.
World Scientific, 215 - 240, 2003
IMPLEMENTATION OF DATA MINING TECHNIQUES
FOR METEOROLOGICAL APPLICATIONS
A.S. COFIÑO AND J.M. GUTIÉRREZ
Dept. of Applied Mathematics. University of Cantabria,
Avda. de los Castros, Santander 39005, Spain
E-mail: [email protected]
B. JAKUBIAK AND M. MELONEK
Interdisciplinary Centre for Mathematical and Computational Modelling,
Warsaw University, Pawinskiego 5A, 02-106, Poland
E-mail: [email protected]
The CrossGrid project is one of the ongoing research projects involving GRID technology.
One of the main tasks in the Meteorological applications package is the implementation of
data mining systems for the analysis of operational and reanalysis databases of atmospheric
circulation patterns. Previous parallel data mining algorithms reported in the literature focus
on parallel computers with predetermined resources (processing units) and high-performance
communications. The main goal in this project is designing adaptive schemes for distributing
data and computational load according to the changing resources available for each GRID job
submitted. In this paper, some preliminary work regarding two different data mining
algorithms (self organizing maps and smoothing filters) is presented. These techniques can be
used in combination with databases of observations to provide downscaled local forecasts
from operative model outputs. This is a more general framework to look at data mining
techniques from the meteorological point of view.
1 INTRODUCTION
Nowadays, the increasing need for computing power in meteorological
applications is addressed either by using supercomputers or by using
high-performance computer networks. In each case, the development
of parallel computing techniques and algorithms faces different
problems, due to the respective hardware and network constraints. In
this article we present the current state of the work on one specific
application of the EU 5th Framework Programme project CrossGrid
(http://www.cyf-kr.edu.pl/crossgrid/). The primary objective of the
CrossGrid project is to further extend the GRID high-performance
compute- and data-distributed network to a new category of
applications, such as interactive simulation and visualisation for
surgical procedures, decision support systems for flood crisis teams,
distributed data analysis in high-energy physics, and air pollution and
weather forecasting. Eleven European countries are involved in this
project. The applications we are interested in are characterised by
the interaction with a person in a processing loop. They require a
response from the computer system to an action by the person on
different time scales, from real time through intermediate to long time, and
they are simultaneously compute- as well as data-intensive.
Meteorology is involved in three subtasks of the CrossGrid project:
air pollution issues, mesoscale weather forecasting for
maritime applications, and the development of data mining systems. The
challenging points are: seamless access to large distributed databases
in the Grid environment, development of distributed data mining
techniques suited to meteorological and hydrological applications,
and integration in a user-friendly interactive way, including specific
portal tools. The work of our Spanish-Polish team focuses on the
implementation of data mining techniques. This application meets the
requirements to be suitable for GRID processing, since the databases are
distributed among different weather services and research
laboratories, and data mining algorithms are compute-intensive, thus
requiring parallel processing.
An efficient development of this category of applications for the
GRID environment requires extension of the GRID software resources
by new components for application-performance monitoring, efficient
distributed data access, and specific resource management. Users
should be able to run their applications on the Grid in an easy and
transparent way, without needing to know details of the GRID
structure and operation. CrossGrid will develop user-friendly portals
and mobile personalised environments and will integrate new
components into the GRID and application development tools. The
elaborated methodology, generic application architecture,
programming environment, and new Grid services will be validated
and tested thoroughly on the CrossGrid testbeds. This will result in
further extension of the GRID across Europe. In our CrossGrid
development we will exploit all the available achievements of
DataGrid, EuroGrid and other related projects in a way which enables
their interoperability. CrossGrid will closely collaborate with
DataGrid.
This paper is organized as follows. In Section 2 we describe the
databases used in this work. In Section 3 we analyze the problem of
statistical downscaling and show how it is closely related to data
mining problems. Finally, in Section 4 we present two different data
mining techniques which can be used for this problem.
2 DISTRIBUTED DATABASES
Our data mining applications can use the global meteorological databases
created by the NCEP/NCAR and ECMWF reanalysis projects, the
mesoscale data archive generated by the operational UMPL system [3],
and local databases with observations from Spanish and Polish
meteorological stations and radar sites. These elements are
stored in very heterogeneous forms, ranging from magnetic tapes to
binary files and object-relational databases. Thus, data access will
be a middleware challenge for this application. At the beginning, data
mining applications will be designed considering a machine-independent
structured data format. During the project at least two data mining
algorithms will be developed and applied to selected cases from
the databases described below.
2.1 ECMWF & NCEP reanalysis databases
The NCEP reanalysis project covers the period from 1958 to the present
(integrating a T62L28 model with 1.875º resolution). On the other
hand, the ECMWF ERA reanalysis project extends from December 1978
to February 1994 (ERA-15; a T106L31 model with 1.125º resolution);
an extension with 0.675º resolution from mid-1957 to 2001 (ERA-40) is still
under development (see http://dss.ucar.edu/pub/reanalyses.html for a
description and links to different reanalysis products). In both cases
the total information volume is very high, since it comprises data for
approx. 20 variables at 30 pressure levels on a 360x180-node grid
(for instance, Fig. 1(a) shows the grid covering Europe). All this
information is stored on magnetic tapes managed by a tape robot.
Fig. 1 (a) Europe region of the ERA reanalysis grid (separation is 1º in meridians and parallels). (b) Region
of the mesoscale UMPL model forecasts; every fourth grid point is plotted.
2.2 Database of the mesoscale forecasts for Central Europe
The UMPL model (a mesoscale version of the British Unified Model
with a resolution of 17 km) came into operation at ICM in May 1997
(the operative grid of the model is shown in Fig. 1(b)). Since then,
ICM has archived the outputs of the model every day, and the archive now
embodies over five years of data. All results, in the form of binary pp
files (a format characteristic of the Unified Model), as well as the observational
files and boundary conditions needed to run the model, are kept in
the tape archive. Because the pp format is not a standard in the
meteorological community, work is being carried out to transform
these files to GRIB (GRIdded Binary), the common format in the
meteorological world. The results in the new format will be
available in the near future.
2.3 Databases of local observations in Spain and Poland
In addition to gridded forecasts, an important source of meteorological
information is encoded in the historical series of soil variables
(precipitation, temperature, wind speed, etc.) registered at local
stations over long periods of time.
For instance, the Spanish weather service INM database contains
observations of several soil variables, including precipitation and
minimum and maximum temperature, at 6750 stations over the
Iberian peninsula and Balearic islands from 1970 to the present. Fig. 2
shows the almost homogeneous network, which provides a high-resolution
description of soil variables (the same area is covered by
fewer than 100 reanalysis grid points; see Fig. 1(a)). All this information
has recently been stored in a relational database.
Fig. 2 Network of 6750 stations over the Iberian peninsula and Balearic islands.
On the other hand, since April 1999 ICM has collected hourly
synoptic reports containing observational data from 61 Polish
meteorological stations. The observations archive now embodies over
3 years of data, stored in an object-relational database (PostgreSQL).
The databases used in this work will be grid-distributed and only
very loosely coupled through a meta-catalog. In practice this means that
a “global database service” will be defined via this catalog and
implemented via the corresponding “subdatabase services”
implemented locally on grid nodes. These grid nodes will run the
database server and provide the information in a three-tier scheme: a
servlet will return the information to the client in XML format after
querying the O/R DBMS database engine. The first implementation
will use IBM IDS 9.20, and will implement queries via JDBC calls. The
ResultSet, as indicated before, will be returned as an XML ResultSet output
in the specified format. This scheme will be implemented in three
steps: first in a local node (currently being done), then in a local
cluster, and finally in the grid environment.
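A minimal sketch of this three-tier pattern is given below. The implementation described above targets a Java servlet querying IBM IDS 9.20 through JDBC; Python's http.server and sqlite3 are used here only as self-contained stand-ins, and the database path, table, and column names are hypothetical.

```python
# Sketch of the three-tier scheme: a thin HTTP service queries a
# relational database and returns the ResultSet wrapped in XML.
# sqlite3 stands in for the O/R DBMS and the table "obs" is hypothetical.
import sqlite3
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "observations.db"  # hypothetical local sub-database

def query_to_xml(sql: str) -> bytes:
    """Run a query and wrap every row of the result set in XML."""
    con = sqlite3.connect(DB_PATH)
    cur = con.execute(sql)
    root = ET.Element("ResultSet")
    cols = [d[0] for d in cur.description]
    for row in cur:
        rec = ET.SubElement(root, "row")
        for name, value in zip(cols, row):
            ET.SubElement(rec, name).text = str(value)
    con.close()
    return ET.tostring(root, encoding="utf-8")

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real service would parse and validate the client's query;
        # this sketch serves one fixed query for demonstration.
        body = query_to_xml("SELECT station_id, date, precip FROM obs LIMIT 100")
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), QueryHandler).serve_forever()
```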
Adequate load balance is provided via database partitioning, and
so partitioning is an essential component. Adequate striping of these
databases is also mandatory, both to avoid losing this load balance after
any prefiltering and to avoid possible systematic shifts in the results
due to partial processing of the database.
3 DOWNSCALING AS A DATA MINING PROBLEM
The skill of the numerical Atmospheric Circulation Models (ACMs) used
for short and medium-range weather prediction has increased
substantially in the last decades, due to advances both in
assimilation procedures and in physical parametrizations. Current ACMs
integrated by different weather services accurately simulate the
synoptic atmospheric dynamics on coarse-grained 40-100 km
resolution grids covering a wide geographical area. However, at finer
spatial resolution these models have much lower skill, since the
physical parametrization of sub-grid scale processes -such as cloud
formation, evaporation, orography, turbulence, etc.- is a difficult task
and, moreover, the parameters may not be tuned for a particular region
of interest.
Therefore, the patterns of, e.g., precipitation predicted by ACMs
are not suitable for local or regional forecasts, since the resolution of
local observations is higher than the resolution of ACMs (see Figs. 1
and 2). With the aim of gaining sub-grid local detail, several methods
have been proposed in the literature; these techniques take advantage
of local databases of historical records and are referred to as
statistical downscaling methods. These methods work by relating the
forecasted atmospheric patterns to local observations (by means of
regression, etc.) and provide a simple way of performing local
weather forecasts. Downscaling methods are also needed to provide
input to models which need high-resolution information on soil
variables and precipitation patterns (agrometeorological models of
crop growth, hydrological models of river state prediction, etc.). In this
project we are concerned with both local forecasts and hydrological
models.
As we have seen, statistical downscaling is done by relating
different sources of information stored in different databases. Thus,
this problem can greatly benefit from standard data mining
techniques, introduced to simplify the huge amount of information
stored in large databases. The techniques described in this paper can
be easily applied to downscaling applications, although we skip the
details of the implementation here; a regression-based sketch of the
idea follows.
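As a toy illustration of the regression flavour of downscaling, the following sketch fits a linear model from coarse-grid predictor values to a single local station series. The data are synthetic and all sizes are arbitrary assumptions; in practice X would hold reanalysis or ACM grid values and y a station record, and any regression or analog technique could take the place of the linear model.

```python
# Toy regression-based statistical downscaling on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_days, n_grid = 1000, 50
X = rng.standard_normal((n_days, n_grid))               # daily grid predictors
y = X @ (rng.standard_normal(n_grid) / n_grid) \
    + 0.1 * rng.standard_normal(n_days)                 # synthetic local series

# fit the statistical link on a historical (training) period ...
beta, *_ = np.linalg.lstsq(X[:800], y[:800], rcond=None)
# ... then downscale new model output to a local forecast
y_local = X[800:] @ beta
print("RMSE:", np.sqrt(np.mean((y_local - y[800:]) ** 2)))
```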
4 IMPLEMENTED DATA MINING TECHNIQUES
Due to the high-dimensional character of the data involved in the
problems described above, it is necessary to first analyze and simplify
the data before proceeding with other analyses. Some data mining
techniques are appropriate in this context. Unsupervised self-organizing
map (SOM) clustering techniques allow us to partition the
global reanalysis databases, producing realistic prototypes for weather
forecasting applications. The mesoscale data archive will be used to
develop methods able to improve direct atmospheric model
output. We plan to use B-spline smoothing of the data with standard
error bands and bootstrap techniques to complete the implementation of
nonparametric or parametric maximum likelihood model inference.
4.1 SOM. Self-Organizing Maps
The Self-Organizing Map (SOM) is one of the most popular neural
network models, and is especially suitable for high-dimensional
data visualization and modelling. It uses unsupervised learning (no
domain knowledge and no human intervention are required)
to create a set of prototype vectors representing the data. Moreover,
a topology-preserving projection of the prototypes from the original
input space onto a low-dimensional grid is carried out. Thus, the
resulting ordered grid can be efficiently used for extracting data
features, clustering the data, etc. [6]. Self-organizing maps have
recently been applied to several meteorological problems, such as
classifying climate modes and anomalies in the Balkans area [1].
A SOM is formed by an arbitrary number of neurons $c_1, \dots, c_m$,
located on a regular low-dimensional grid, usually a 2D lattice for
visualization purposes; each of the vectors $c_i$ describes the position of
the neuron on the grid and is represented by a prototype $w_i = (w_{i1},
\dots, w_{in})$, where n is the dimension of the data. Following a neural nets
analogy, we could say that $c_i$ is connected to each of the components
of the data vectors through a weight vector $w_i$. For instance, if we
define daily surface temperature patterns using the information
provided by the ERA-15 reanalysis at 12 UTC, then the resulting
$15 \times 365 = 5475$ vectors of dimension $60 \times 30 = 1800$ would characterize
the evolution of surface temperature over Europe during the period 1979-1993.
Therefore, a SOM trained on this data would find some
representative configurations (prototypes) displayed on a 2D lattice
which preserves the original distances, hence self-organizing the obtained
prototypes (see Fig. 3).
Fig. 3 (left) SOM lattice projected onto the space spanned by the first two principal components of ERA-15
data for the grid shown in Fig. 1(a). (right) 1000 mb temperature fields of some of the resulting
prototypes.
The weight vectors of the SOM are first initialized to random
values. The goal of the training algorithm is to iteratively adapt the
weights, so that each final neuron comes to represent a cluster of data
vectors (those which are closest to its weight vector). What makes the
SOM different from other clustering algorithms is that the training
process includes a neighbourhood adaptation mechanism, so that
neighboring clusters in the 2D lattice space are quite similar, while
more distant clusters become increasingly diverse. Therefore, the SOM
provides us with a neighbourhood-preserving mapping from a high-dimensional
space to an $m = k \times k$ 2D lattice which quantizes the original
data by means of m prototypes (the final weight vectors of the SOM).
The training proceeds in cycles; on each training step, a data
sample $v_i$ is considered and the best-matching (or “winning”) neuron
$c_{ki}$ is obtained as the one minimizing the Euclidean distance to the
pattern vector: $\| v_i - w_{ki} \| = \min_k \{ \| v_i - w_k \|,\ k = 1, \dots, m \}$.
Then, the prototype vector of the winning neuron, and also those of its
neighbours, are moved towards the sample vector:

$w_k(t+1) = w_k(t) + \alpha(t)\, h_{ki}(t)\, (v_i - w_k(t))$,

where the factor $\alpha(t)$ gives the learning rate and the function $h_{ki}(t)$ is a
neighbourhood kernel, centred on the grid on the winning neuron $c_{ki}$, which
determines the rate of change around the winner unit (usually a
Gaussian function is considered): $h_{ki}(t) = \exp(-\| c_k - c_{ki} \|^2 / \sigma(t)^2)$.
As a consequence of the neighbourhood kernel, during the
iterative training the SOM behaves like a flexible grid folding onto the
cloud formed by the data in the original n-dimensional space. Both the
learning rate and the neighbourhood kernel radius decrease
monotonically with time, softening the folding process; a linear decay
to zero is usually chosen for these functions [6].
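The complete on-line training loop described above can be summarized in a few lines. The following sketch assumes a k x k lattice, random initialization, and linear decay of both the learning rate alpha(t) and the kernel width sigma(t); these schedules and the default values are illustrative choices, not the paper's exact configuration.

```python
# Minimal on-line SOM training sketch with a Gaussian neighbourhood.
import numpy as np

def train_som(data, k=10, n_cycles=20, alpha0=0.5, sigma0=None, seed=0):
    """data: (n_samples, n_dims). Returns prototype vectors W: (k*k, n_dims)."""
    rng = np.random.default_rng(seed)
    n_samples, n_dims = data.shape
    W = rng.standard_normal((k * k, n_dims))          # random initial weights
    # positions c_k of the neurons on the 2D lattice
    C = np.array([(i, j) for i in range(k) for j in range(k)], dtype=float)
    sigma0 = sigma0 or k / 2.0
    T = n_cycles * n_samples
    t = 0
    for cycle in range(n_cycles):
        for v in data[rng.permutation(n_samples)]:
            alpha = alpha0 * (1.0 - t / T)            # linear decay to zero
            sigma = sigma0 * (1.0 - t / T) + 1e-3
            # winning neuron: minimal Euclidean distance to the sample
            ki = np.argmin(np.linalg.norm(W - v, axis=1))
            # Gaussian neighbourhood kernel on the lattice, centred on c_ki
            h = np.exp(-np.sum((C - C[ki]) ** 2, axis=1) / (2.0 * sigma ** 2))
            # move the winner and its neighbours towards the sample
            W += alpha * h[:, None] * (v - W)
            t += 1
    return W
```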
We have analyzed the suitability of different scalable parallel
implementations of this algorithm for the GRID environment.
Previous SOM parallel algorithms have focused on parallel computers
with predetermined resources and fast communications [5]. On the
contrary, our aim is to design an adaptive scheme for distributing data
and computational load according to the changing resources available
for each GRID job submitted. To this end, some preliminary work is
needed to understand the limitations and requirements of different
parallel implementations of the SOM algorithm.
The simplest way to parallelize the SOM algorithm is to split
the data between different processors, as shown in Fig. 4(a).
However, in this case, after each complete cycle the slaves must send
their prototypes to the master, which combines them and sends the
final centers back to the slaves. This is not an efficient
implementation for the GRID environment, since it requires intensive
message passing of high-dimensional data. Figures 4(b) and (c) show
two different alternatives for distributing computational resources
with replicated (or centralized) prototype vectors. The messages
required by each of the schemes are shown using dashed
lines, which may correspond to either an iteration of the algorithm or
a whole cycle.
Fig. 4 Three different parallel schemes for the SOM training algorithm. (a) distributing data, (b)
distributing computational resources with replicated prototype vectors, (c) distributing computational
resources with centralized prototype vectors.
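The data-distributed scheme of Fig. 4(a) can be sketched as follows using the batch variant of the SOM, in which each worker accumulates kernel-weighted sums over its data partition and the master combines them once per cycle. Python's multiprocessing pool stands in for the GRID message passing, and the batch formulation itself is an assumption made here to keep the per-cycle communication pattern explicit.

```python
# Data-partitioned batch SOM: workers accumulate partial sums, the
# master combines them and broadcasts the new prototypes each cycle.
import numpy as np
from multiprocessing import Pool

def partial_sums(args):
    """Worker: kernel-weighted sums over one data partition."""
    chunk, W, C, sigma = args
    num = np.zeros_like(W)
    den = np.zeros(W.shape[0])
    for v in chunk:
        ki = np.argmin(np.linalg.norm(W - v, axis=1))
        h = np.exp(-np.sum((C - C[ki]) ** 2, axis=1) / (2 * sigma ** 2))
        num += h[:, None] * v
        den += h
    return num, den

def batch_som_distributed(data, k=10, n_cycles=10, n_workers=4, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k * k, data.shape[1]))
    C = np.array([(i, j) for i in range(k) for j in range(k)], dtype=float)
    chunks = np.array_split(data, n_workers)          # split data among slaves
    with Pool(n_workers) as pool:
        for cycle in range(n_cycles):
            sigma = (k / 2.0) * (1.0 - cycle / n_cycles) + 0.5
            parts = pool.map(partial_sums, [(c, W, C, sigma) for c in chunks])
            num = sum(p[0] for p in parts)            # master combines sums
            den = sum(p[1] for p in parts)
            W = num / den[:, None]                    # new prototypes broadcast
    return W
```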
To check the efficiency of these algorithms, we performed several
experiments and found that there is no single optimal alternative;
rather, a combination (hybridization) of algorithms is required for the GRID
framework.
4.2 MCMC. Markov Chain Monte Carlo
Precipitation fields produced by NWP models need further processing
to be useful for hydrological purposes. The starting point for this
purpose is an application of the linear state space model and the
Kalman filter. In the standard state space form, multivariate
observations $y_t$ are related to unobserved state vectors $\alpha_t$ by a linear
observation equation

$y_t = Z_t \alpha_t + \epsilon_t$,  $t = 1, 2, \dots$

where $Z_t$ is a design matrix of appropriate dimension, and $\{\epsilon_t\}$ is a
white noise process, i.e., a sequence of mutually uncorrelated error
variables with $E(\epsilon_t) = 0$ and $\mathrm{cov}(\epsilon_t) = \Sigma_t$. The observation equation is a
varying-coefficient model with time t as effect-modifier. The
sequence of states is defined by a linear transition equation

$\alpha_t = F_t \alpha_{t-1} + \xi_t$,  $t = 1, 2, \dots$

where $F_t$ is a transition matrix, $\{\xi_t\}$ is a white noise sequence with $E(\xi_t) = 0$,
$\mathrm{cov}(\xi_t) = Q_t$, and the initial state has $E(\alpha_0) = a_0$ and $\mathrm{cov}(\alpha_0) = Q_0$.
The mean and covariance structure of the model is fully specified by
assuming that $\{\epsilon_t\}$ and $\{\xi_t\}$ are mutually uncorrelated and
uncorrelated with the initial state $\alpha_0$. The joint and marginal
distributions of $\{y_t, \alpha_t\}$ are completely specified by distributional
assumptions on the errors and the initial state.
As soon as a model can be written in state space form, it provides
the key for employing unified methods of statistical inference. Given
the observations $y_1, \dots, y_T$, estimation of $\alpha_t$ is the primary goal. This is
termed filtering for $t = T$, smoothing for $t < T$ and prediction for $t > T$.
Under the normality assumption, the optimal solution to the filtering
problem is given by the posterior mean

$a_{t|t} = E(\alpha_t \mid y^t, w^t)$

of $\alpha_t$ given $y^t$ and $w^t$, the histories of observations and covariates up
to time t. Since the model is linear and Gaussian, the
posterior distribution of $\alpha_t$ is also Gaussian,

$\alpha_t \mid y^t, w^t \sim N(a_{t|t}, V_{t|t})$

with posterior covariance matrix

$V_{t|t} = E[(\alpha_t - a_{t|t})(\alpha_t - a_{t|t})']$.
The linear Kalman filter and smoother compute the posterior
means and covariance matrices in an efficient recursive way.
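For concreteness, the filtering recursions can be sketched as follows for time-invariant system matrices; the smoother is omitted. This is a standard textbook sketch under the model above, not the project's implementation.

```python
# Linear Kalman filter: prediction and correction recursions.
import numpy as np

def kalman_filter(y, Z, F, Sigma, Q, a0, Q0):
    """y: (T, p) observations. Returns filtered means a_{t|t} and covariances V_{t|t}."""
    a, V = a0, Q0
    means, covs = [], []
    for yt in y:
        # prediction step: a_{t|t-1}, V_{t|t-1}
        a_pred = F @ a
        V_pred = F @ V @ F.T + Q
        # correction step with Kalman gain K_t
        S = Z @ V_pred @ Z.T + Sigma
        K = V_pred @ Z.T @ np.linalg.inv(S)
        a = a_pred + K @ (yt - Z @ a_pred)
        V = V_pred - K @ Z @ V_pred
        means.append(a)
        covs.append(V)
    return np.array(means), np.array(covs)
```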
The usual derivation of the Kalman filter and smoother takes
advantage of the fact that the posterior distributions are normal. A
generalized extended Kalman filter can be derived as an approximate
posterior mode estimator. The filter is derived as a gradient algorithm
via the discrete maximum principle, replacing Gaussian log-likelihoods
and their derivatives by the corresponding terms for non-Gaussian
observations. In the Taylor expansions, unknown parameters have to be
replaced by currently available estimates. This also concerns the
observation covariance matrices $\Sigma_t(\alpha_t)$, in contrast to the Gaussian
case, where $\Sigma_t$ is assumed to be known.
In an empirical Bayesian framework, hyperparameters $\theta$, such as
initial values or covariance matrices, are considered as unknown
constants. To estimate these hyperparameters we used an EM-type
algorithm, which replaces posterior means and covariance matrices by
posterior modes and curvatures obtained from the filtering and
smoothing algorithms. To explain the principles of the EM algorithm we
consider the case of a univariate time-invariant state space model with
unknown vector of hyperparameters $\theta = (\sigma^2, Q, a_0, Q_0)$. The joint
log-likelihood of the complete data is given by

$l(y_1, \dots, y_T, \alpha_0, \dots, \alpha_T; \theta) = -\frac{T}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_t (y_t - z_t' \alpha_t)^2 - \frac{T}{2} \log(\det Q) - \frac{1}{2} \sum_t (\alpha_t - F_t \alpha_{t-1})' Q^{-1} (\alpha_t - F_t \alpha_{t-1}) - \frac{1}{2} \log(\det Q_0) - \frac{1}{2} (\alpha_0 - a_0)' Q_0^{-1} (\alpha_0 - a_0)$.

In the p-th cycle of the algorithm, the E-step consists of computing
the conditional expectation $M(\theta \mid \theta^{(p)})$ of the log-likelihood given the
observations and the current iterate $\theta^{(p)}$. The next iterate $\theta^{(p+1)}$ is obtained as the
maximizer of $M(\theta \mid \theta^{(p)})$ with respect to $\theta$. The conditional expectations
can be computed by running the Kalman filter and smoother fixed at
$\theta^{(p)}$, and the maximization problem can be solved analytically [2].
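The following sketch shows the full EM cycle for the simplest concrete case, the univariate local-level model ($z_t = F_t = 1$, scalar states): the E-step runs the Kalman filter and fixed-interval (RTS) smoother at the current hyperparameters, and the M-step applies the closed-form maximizers of the expected complete-data log-likelihood. The model choice and the decision to keep $Q_0$ fixed are simplifying assumptions made for illustration.

```python
# EM for the hyperparameters of a univariate local-level state space model.
import numpy as np

def smooth(y, sigma2, Q, a0, Q0):
    """E-step: smoothed means a_s, variances V_s, lag-one covariances V_lag."""
    T = len(y)
    a_p, V_p = np.zeros(T), np.zeros(T)   # predicted moments a_{t|t-1}, V_{t|t-1}
    a_f, V_f = np.zeros(T), np.zeros(T)   # filtered moments a_{t|t}, V_{t|t}
    a, V = a0, Q0
    for t in range(T):
        a_p[t], V_p[t] = a, V + Q          # prediction
        K = V_p[t] / (V_p[t] + sigma2)     # Kalman gain
        a_f[t] = a_p[t] + K * (y[t] - a_p[t])
        V_f[t] = (1 - K) * V_p[t]
        a, V = a_f[t], V_f[t]
    a_s, V_s = a_f.copy(), V_f.copy()      # backward (RTS) smoothing
    J = np.zeros(T)
    for t in range(T - 2, -1, -1):
        J[t] = V_f[t] / V_p[t + 1]
        a_s[t] = a_f[t] + J[t] * (a_s[t + 1] - a_p[t + 1])
        V_s[t] = V_f[t] + J[t] ** 2 * (V_s[t + 1] - V_p[t + 1])
    V_lag = J[:-1] * V_s[1:]               # Cov(alpha_t, alpha_{t-1} | y^T)
    return a_s, V_s, V_lag

def em(y, sigma2=1.0, Q=1.0, a0=0.0, Q0=10.0, n_iter=50):
    for _ in range(n_iter):
        a_s, V_s, V_lag = smooth(y, sigma2, Q, a0, Q0)   # E-step
        sigma2 = np.mean((y - a_s) ** 2 + V_s)           # M-step updates
        d = a_s[1:] - a_s[:-1]                           # transitions between
        Q = np.mean(d ** 2 + V_s[1:] + V_s[:-1] - 2 * V_lag)  # observed times
        a0 = a_s[0]                                      # initial state mean
    return sigma2, Q, a0
```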
In contrast to Gaussian linear state space models, the full conditionals
$p(\alpha \mid \cdot)$ for the state parameters are no longer Gaussian for dynamic
generalized linear models with fundamentally non-Gaussian responses.
Therefore, direct Gibbs sampling is not possible. From the few different
approaches to overcoming this problem, we decided to implement the
hybrid Metropolis-Hastings (MH) block move algorithm with conditional prior
proposals [4]. This algorithm uses the full conditional prior distribution
$p(\alpha_{rs} \mid \alpha_{\setminus rs}, Q)$ to generate a proposal value for the blocks $\alpha_{rs} = (\alpha_r, \dots, \alpha_s)$,
$1 \le r < s \le T$. The algorithm is computationally very efficient, since it
avoids any Fisher scoring steps and all proposals are drawn from
Gaussian distributions with known moments.
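A single-site sketch of this idea (blocks of length one, r = s) is given below for a dynamic Poisson model $y_t \sim \text{Poisson}(\exp(\alpha_t))$ with random-walk states and known Q; the observation model, the diffuse treatment of the endpoints, and all default values are assumptions made for illustration. The key property carries over from the text: since the proposal is the conditional prior, the prior terms cancel in the acceptance ratio and only the likelihood of $y_t$ remains.

```python
# MH with conditional prior proposals, single-site version, for a
# dynamic Poisson model with random walk states alpha_t.
import numpy as np

def log_lik(y_t, alpha_t):
    # Poisson log-likelihood with log link (constant log(y!) dropped)
    return y_t * alpha_t - np.exp(alpha_t)

def mh_sweep(alpha, y, Q, rng):
    T = len(y)
    for t in range(T):
        if t == 0:                            # diffuse initial state assumed
            mean, var = alpha[1], Q
        elif t == T - 1:
            mean, var = alpha[T - 2], Q
        else:                                 # conditional prior of alpha_t
            mean, var = 0.5 * (alpha[t - 1] + alpha[t + 1]), 0.5 * Q
        prop = rng.normal(mean, np.sqrt(var))
        # acceptance probability reduces to the likelihood ratio
        if np.log(rng.uniform()) < log_lik(y[t], prop) - log_lik(y[t], alpha[t]):
            alpha[t] = prop
    return alpha

def sampler(y, Q=0.1, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.log(np.maximum(y, 0.5))        # crude initialization
    draws = []
    for _ in range(n_iter):
        alpha = mh_sweep(alpha, y, Q, rng)
        draws.append(alpha.copy())
    return np.array(draws)
```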
The basic tool we applied for statistical inference is the public
domain statistical package R [7]. To achieve distributed, parallel computing
in R we planned to use the Common Object Request Broker Architecture,
available as the CORBA package. In this implementation one R session
acts as a manager and dispatches tasks to different servers running in
other R worker sessions. This uses the ability to invoke asynchronous
or background CORBA calls in R (http://www.omegahat.org/RSCORBA).
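Since the RSCORBA calls themselves are specific to that package, the manager/worker pattern is sketched below in language-neutral form, with Python's multiprocessing standing in for the CORBA calls between R sessions; the task body is a placeholder.

```python
# Manager/worker pattern with asynchronous dispatch, as an analogy for
# one R manager session invoking background calls on R worker sessions.
from multiprocessing import Pool

def worker_task(partition_id):
    # placeholder for a per-partition computation (e.g. fitting a model
    # to one slice of a distributed database)
    return partition_id, sum(i * i for i in range(1000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # asynchronous dispatch: the manager continues while workers run
        async_results = [pool.apply_async(worker_task, (p,)) for p in range(8)]
        for r in async_results:
            pid, value = r.get()              # collect results as they finish
            print(f"partition {pid}: {value}")
```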
Acknowledgements
The work described in this paper is supported in part by the European
Union through the IST-2001-32243 project "CrossGrid". The authors
are also grateful to the Instituto Nacional de Meteorología INM, the
Comisión Interministerial de Ciencia y Tecnología (CICYT Project
REN2000-1572), and to the Polish Committee for Scientific Research
(KBN project 115/E-343/SPB/5.PR UE/DZ 206) for partial financial
support.
References
1. Cavazos T. Using self-organizing maps to investigate extreme climate events. Journal of Climate 2000; 13: 1718-1732.
2. Fahrmeir L. and Tutz G. Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag, 2000.
3. Herman-Izycki L., Jakubiak B., Nowinski K., and Niezgodka B. UMPL - the numerical weather prediction system for operational applications. In: Research Works Based on the ICM's UMPL Numerical Weather Prediction System Results. B. Jakubiak (Ed), Wydawnictwa ICM, Warsaw, 14-27, 2000.
4. Knorr-Held L. Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics 1999; 26: 129-144.
5. Lawrence R.D., Almasi G.S. and Rushmeier H.E. A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Mining and Knowledge Discovery 1999; 3: 171-195.
6. Oja E. and Kaski S. Kohonen Maps. Amsterdam: Elsevier, 1999.
7. Venables W.N., Smith D.M., and the R Development Core Team. An Introduction to R: A Programming Environment for Data Analysis and Graphics. User manual for version 1.6.2, 2003.