Realizing Teracomputing. W. Zwieflhofer and N. Kreitz, editors.
World Scientific, 215 - 240, 2003
IMPLEMENTATION OF DATA MINING TECHNIQUES
FOR METEOROLOGICAL APPLICATIONS
A.S. COFIÑO AND J.M. GUTIÉRREZ
Dept. of Applied Mathematics. University of Cantabria,
Avda. de los Castros, Santander 39005, Spain
E-mail: [email protected]
B. JAKUBIAK AND M. MELONEK
Interdisciplinary Centre for Mathematical and Computational Modelling,
Warsaw University, Pawinskiego 5A, 02-106, Poland
E-mail: [email protected]
The CrossGrid project is one of the ongoing research projects involving GRID technology.
One of the main tasks in the Meteorological applications package is the implementation of
data mining systems for the analysis of operational and reanalysis databases of atmospheric
circulation patterns. Previous parallel data mining algorithms reported in the literature focus
on parallel computers with predetermined resources (processing units) and high-performance
communications. The main goal in this project is designing adaptive schemes for distributing
data and computational load according to the changing resources available for each GRID job
submitted. In this paper, some preliminary work regarding two different data mining
algorithms (self organizing maps and smoothing filters) is presented. These techniques can be
used in combination with databases of observations to provide downscaled local forecasts
from operative model outputs. This is a more general framework to look at data mining
techniques from the meteorological point of view.
1 INTRODUCTION
Nowadays, the increasing need for computing power in meteorological
applications is addressed either by using supercomputers or by using
high-performance computer networks. In each case, the development
of parallel computing techniques and algorithms faces different
problems, due to the respective hardware and network constraints. In
this article we present the current state of the work on one specific
application of the EU 5th Framework Programme project CrossGrid
(http://www.cyf-kr.edu.pl/crossgrid/). The primary objective of the
CrossGrid project is to further extend the GRID high-performance
compute- and data-distributed network to a new category of
applications, such as interactive simulation and visualisation for
surgical procedures, decision support systems for flood crisis teams,
distributed data analysis in high-energy physics, and air pollution and
weather forecasting. Eleven European countries are involved in this
project. The applications we are interested in are characterised by
the interaction with a person in a processing loop. They require a
response from the computer system to an action by the person on
different time scales, from real time through intermediate to long time, and
they are simultaneously compute- as well as data-intensive.
Meteorology is involved in three subtasks of the CrossGrid project:
air pollution issues, mesoscale weather forecasting for
maritime applications, and the development of data mining systems. The
challenging points are: seamless access to large distributed databases
in the Grid environment, development of distributed data mining
techniques suited to meteorological and hydrological applications,
and integration in a user-friendly interactive way, including specific
portal tools. The work of our Spanish-Polish team focuses on the
implementation of data mining techniques. This application meets the
requirements to be suitable for GRID processing, since the databases are
distributed among different weather services and research
laboratories, and data mining algorithms are compute-intensive, thus
requiring parallel processing.
An efficient development of this category of applications for the
GRID environment requires extension of the GRID software resources
by new components for application-performance monitoring, efficient
distributed data access, and specific resource management. Users
should be able to run their applications on the Grid in an easy and
transparent way, without needing to know details of the GRID
structure and operation. CrossGrid will develop user-friendly portals
and mobile personalised environments and will integrate new
components into the GRID and application development tools. The
elaborated methodology, generic application architecture,
programming environment, and new Grid services will be validated
and tested thoroughly on the CrossGrid testbeds. This will result in
further extension of the GRID across Europe. In our CrossGrid
development we will exploit all the available achievements of
DataGrid, EuroGrid and other related projects in a way which enables
their interoperability. CrossGrid will closely collaborate with
DataGrid.
This paper is organized as follows. In Section 2 we describe the
databases used in this work. In Section 3 we analyze the problem of
statistical downscaling and show how it is closely related to data
mining problems. Finally, in Section 4 we present two different data
mining techniques which can be used for this problem.
2 DISTRIBUTED DATABASES
Our data mining applications can use the global meteorological databases
created by the NCEP/NCAR and ECMWF reanalysis projects, the
mesoscale data archive generated by the operational UMPL system [3],
and local databases with observations from Spanish and Polish
meteorological stations and radar sites. These elements are
stored in very heterogeneous forms, ranging from magnetic tapes to
binary files and object-relational databases. Thus, data access will
be a middleware challenge for this application. At the beginning, data
mining applications will be designed considering a machine-independent
structured data format. During the project at least two data mining
algorithms will be developed and applied to selected cases from
the databases described below.
2.1 ECMWF & NCEP reanalysis databases
The NCEP reanalysis project covers the period from 1958 to the present
(integrating a T62L28 model with 1.875º resolution). On the other
hand, the ECMWF ERA reanalysis project extends from December 1978
to February 1994 (ERA-15; a T106L31 model with 1.125º resolution);
an extension with 0.675º resolution from mid-1957 to 2001 (ERA-40) is still
under development (see http://dss.ucar.edu/pub/reanalyses.html for a
description and links to different reanalysis products). In both cases
the total information volume is very high, since it comprises data for
approx. 20 variables at 30 pressure levels on a 360x180-node grid
(for instance, Fig. 1(a) shows the grid covering Europe). All this
information is stored on magnetic tapes managed by a tape robot.
Fig. 1 (a) Europe region of the ERA reanalysis grid (separation is 1º in meridians and parallels). (b) Region
of the mesoscale UMPL model forecasts; every fourth grid point is plotted.
2.2 Database of the mesoscale forecasts for Central Europe
The UMPL model (a mesoscale version of the British Unified Model
with a resolution of 17 km) came into operation at ICM in May 1997
(the operative grid of the model is shown in Fig. 1(b)). Since then,
ICM has archived the outputs of the model every day, and the archive now
embodies over five years of data. All results, in the form of binary pp
files (a format characteristic of the Unified Model), as well as the observational
files and boundary conditions needed to run the model, are kept in
the tape archive. Because the pp format is not a standard in the
meteorological community, work is being carried out to transform
these files to GRIB (GRIdded Binary), the common format in the
meteorological world. The results in the new format will be
available in the near future.
2.3 Databases of local observations in Spain and Poland
In addition to gridded forecasts, an important source of meteorological
information is encoded in the historical series of soil variables
(precipitation, temperature, wind speed, etc.) registered at local
stations over long periods of time.
For instance, the Spanish weather service INM database contains
observations of several soil variables, including precipitation and
minimum and maximum temperature, at 6750 stations over the
Iberian peninsula and Balearic islands from 1970 to the present. Fig. 2
shows the almost homogeneous network, which provides a high-resolution
description of soil variables (the same area is covered by
fewer than 100 reanalysis grid points; see Fig. 1(a)). All this information
has recently been stored in a relational database.
Fig. 2 Network of 6750 stations over the Iberian peninsula and Balearic islands.
On the other hand, since April 1999 ICM has collected hourly
synoptic reports containing observational data from 61 Polish
meteorological stations. The observations archive now embodies over
3 years of data, stored in an object-relational database (PostgreSQL).
The databases used in this work will be grid-distributed and only
very loosely coupled through a meta-catalog. In practice this means that
a “global database service” will be defined via this catalog and
implemented via the corresponding “subdatabase services”
implemented locally on grid nodes. These grid nodes will run the
database server and provide the information in a three-tier scheme: a
servlet will return the information to the client in XML format after
querying the O/R DBMS database engine. The first implementation
will use IBM IDS 9.20, and will implement queries via JDBC calls. The
ResultSet, as indicated before, will be returned as an XML ResultSet output
in the specified format. This scheme will be implemented in three
steps: first in a local node (currently being done), then in a local
cluster, and finally in the grid environment.
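A minimal sketch of this three-tier pattern is given below. The implementation described above targets a Java servlet querying IBM IDS 9.20 through JDBC; Python's http.server and sqlite3 are used here only as self-contained stand-ins, and the database path, table, and column names are hypothetical.

```python
# Sketch of the three-tier scheme: a thin HTTP service queries a
# relational database and returns the ResultSet wrapped in XML.
# sqlite3 stands in for the O/R DBMS and the table "obs" is hypothetical.
import sqlite3
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "observations.db"  # hypothetical local sub-database

def query_to_xml(sql: str) -> bytes:
    """Run a query and wrap every row of the result set in XML."""
    con = sqlite3.connect(DB_PATH)
    cur = con.execute(sql)
    root = ET.Element("ResultSet")
    cols = [d[0] for d in cur.description]
    for row in cur:
        rec = ET.SubElement(root, "row")
        for name, value in zip(cols, row):
            ET.SubElement(rec, name).text = str(value)
    con.close()
    return ET.tostring(root, encoding="utf-8")

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real service would parse and validate the client's query;
        # this sketch serves one fixed query for demonstration.
        body = query_to_xml("SELECT station_id, date, precip FROM obs LIMIT 100")
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), QueryHandler).serve_forever()
```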
Adequate load balance is provided via database partitioning, and
so partitioning is an essential component. Adequate striping of these
databases is also mandatory, both to avoid losing this load balance after
any prefiltering and to avoid possible systematic shifts in the results
due to partial processing of the database.
3 DOWNSCALING AS A DATA MINING PROBLEM
The skill of the numerical Atmospheric Circulation Models (ACMs) used
for short and medium-range weather prediction has increased
substantially in the last decades, due to advances both in
assimilation procedures and in physical parametrizations. Current ACMs
integrated by different weather services accurately simulate the
synoptic atmospheric dynamics on coarse-grained 40-100 km
resolution grids covering a wide geographical area. However, at finer
spatial resolution these models have much lower skill, since the
physical parametrization of sub-grid scale processes -such as cloud
formation, evaporation, orography, turbulence, etc.- is a difficult task
and, moreover, the parameters may not be tuned for a particular region
of interest.
Therefore, the patterns of, e.g., precipitation predicted by ACMs
are not suitable for local or regional forecasts, since the resolution of
local observations is higher than the resolution of ACMs (see Figs. 1
and 2). With the aim of gaining sub-grid local detail, several methods
have been proposed in the literature; these techniques take advantage
of local databases of historical records and are referred to as
statistical downscaling methods. These methods work by relating the
forecasted atmospheric patterns to local observations (by means of
regression, etc.) and provide a simple way of performing local
weather forecasts. Downscaling methods are also needed to provide
input to models which need high-resolution information on soil
variables and precipitation patterns (agrometeorological models of
crop growth, hydrological models of river state prediction, etc.). In this
project we are concerned with both local forecasts and hydrological
models.
As we have seen, statistical downscaling is done by relating
different sources of information stored in different databases. Thus,
this problem can greatly benefit from standard data mining
techniques, introduced to simplify the huge amount of information
stored in large databases. The techniques described in this paper can
be easily applied to downscaling applications, although we skip the
details of the implementation here; a regression-based sketch of the
idea follows.
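As a toy illustration of the regression flavour of downscaling, the following sketch fits a linear model from coarse-grid predictor values to a single local station series. The data are synthetic and all sizes are arbitrary assumptions; in practice X would hold reanalysis or ACM grid values and y a station record, and any regression or analog technique could take the place of the linear model.

```python
# Toy regression-based statistical downscaling on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_days, n_grid = 1000, 50
X = rng.standard_normal((n_days, n_grid))               # daily grid predictors
y = X @ (rng.standard_normal(n_grid) / n_grid) \
    + 0.1 * rng.standard_normal(n_days)                 # synthetic local series

# fit the statistical link on a historical (training) period ...
beta, *_ = np.linalg.lstsq(X[:800], y[:800], rcond=None)
# ... then downscale new model output to a local forecast
y_local = X[800:] @ beta
print("RMSE:", np.sqrt(np.mean((y_local - y[800:]) ** 2)))
```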
4 IMPLEMENTED DATA MINING TECHNIQUES
Due to the high-dimensional character of the data involved in the
problems described above, it is necessary to first analyze and simplify
the data before proceeding with other analyses. Some data mining
techniques are appropriate in this context. Unsupervised self-organizing
map (SOM) clustering techniques allow us to partition the
global reanalysis databases, producing realistic prototypes for weather
forecasting applications. The mesoscale data archive will be used to
develop methods able to improve direct atmospheric model
output. We plan to use B-spline smoothing of the data with standard
error bands and bootstrap techniques to complete the implementation of
nonparametric or parametric maximum likelihood model inference.
4.1 SOM. Self-Organizing Maps
The Self-Organizing Map (SOM) is one of the most popular neural
network models, and is especially suitable for high-dimensional
data visualization and modelling. It uses unsupervised learning (no
domain knowledge and no human intervention are required)
to create a set of prototype vectors representing the data. Moreover,
a topology-preserving projection of the prototypes from the original
input space onto a low-dimensional grid is carried out. Thus, the
resulting ordered grid can be efficiently used for extracting data
features, clustering the data, etc. [6]. Self-organizing maps have
recently been applied to several meteorological problems, such as
classifying climate modes and anomalies in the Balkans area [1].
A SOM is formed by an arbitrary number of neurons $c_1, \dots, c_m$,
located on a regular low-dimensional grid, usually a 2D lattice for
visualization purposes; each of the vectors $c_i$ describes the position of
the neuron on the grid and is represented by a prototype $w_i = (w_{i1},
\dots, w_{in})$, where n is the dimension of the data. Following a neural nets
analogy, we could say that $c_i$ is connected to each of the components
of the data vectors through a weight vector $w_i$. For instance, if we
define daily surface temperature patterns using the information
provided by the ERA-15 reanalysis at 12 UTC, then the resulting
$15 \times 365 = 5475$ vectors of dimension $60 \times 30 = 1800$ would characterize
the evolution of surface temperature over Europe during the period 1979-1993.
Therefore, a SOM trained on this data would find some
representative configurations (prototypes) displayed on a 2D lattice
which preserves the original distances, hence self-organizing the obtained
prototypes (see Fig. 3).
Fig. 3 (left) SOM lattice projected onto the space spanned by the first two principal components of ERA-15
data for the grid shown in Fig. 1(a). (right) 1000 mb temperature fields of some of the resulting
prototypes.
The weight vectors of the SOM are first initialized to random
values. The goal of the training algorithm is to iteratively adapt the
weights, so that each final neuron comes to represent a cluster of data
vectors (those which are closest to its weight vector). What makes the
SOM different from other clustering algorithms is that the training
process includes a neighbourhood adaptation mechanism, so that
neighboring clusters in the 2D lattice space are quite similar, while
more distant clusters become increasingly diverse. Therefore, the SOM
provides us with a neighbourhood-preserving mapping from a high-dimensional
space to an $m = k \times k$ 2D lattice which quantizes the original
data by means of m prototypes (the final weight vectors of the SOM).
The training proceeds in cycles; on each training step, a data
sample $v_i$ is considered and the best-matching (or “winning”) neuron
$c_{ki}$ is obtained as the one minimizing the Euclidean distance to the
pattern vector: $\| v_i - w_{ki} \| = \min_k \{ \| v_i - w_k \|,\ k = 1, \dots, m \}$.
Then, the prototype vector of the winning neuron, and also those of its
neighbours, are moved towards the sample vector:

$w_k(t+1) = w_k(t) + \alpha(t)\, h_{ki}(t)\, (v_i - w_k(t))$,

where the factor $\alpha(t)$ gives the learning rate and the function $h_{ki}(t)$ is a
neighbourhood kernel, centred on the grid on the winning neuron $c_{ki}$, which
determines the rate of change around the winner unit (usually a
Gaussian function is considered): $h_{ki}(t) = \exp(-\| c_k - c_{ki} \|^2 / \sigma(t)^2)$.
As a consequence of the neighbourhood kernel, during the
iterative training the SOM behaves like a flexible grid folding onto the
cloud formed by the data in the original n-dimensional space. Both the
learning rate and the neighbourhood kernel radius decrease
monotonically with time, softening the folding process; a linear decay
to zero is usually chosen for these functions [6].
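The complete on-line training loop described above can be summarized in a few lines. The following sketch assumes a k x k lattice, random initialization, and linear decay of both the learning rate alpha(t) and the kernel width sigma(t); these schedules and the default values are illustrative choices, not the paper's exact configuration.

```python
# Minimal on-line SOM training sketch with a Gaussian neighbourhood.
import numpy as np

def train_som(data, k=10, n_cycles=20, alpha0=0.5, sigma0=None, seed=0):
    """data: (n_samples, n_dims). Returns prototype vectors W: (k*k, n_dims)."""
    rng = np.random.default_rng(seed)
    n_samples, n_dims = data.shape
    W = rng.standard_normal((k * k, n_dims))          # random initial weights
    # positions c_k of the neurons on the 2D lattice
    C = np.array([(i, j) for i in range(k) for j in range(k)], dtype=float)
    sigma0 = sigma0 or k / 2.0
    T = n_cycles * n_samples
    t = 0
    for cycle in range(n_cycles):
        for v in data[rng.permutation(n_samples)]:
            alpha = alpha0 * (1.0 - t / T)            # linear decay to zero
            sigma = sigma0 * (1.0 - t / T) + 1e-3
            # winning neuron: minimal Euclidean distance to the sample
            ki = np.argmin(np.linalg.norm(W - v, axis=1))
            # Gaussian neighbourhood kernel on the lattice, centred on c_ki
            h = np.exp(-np.sum((C - C[ki]) ** 2, axis=1) / (2.0 * sigma ** 2))
            # move the winner and its neighbours towards the sample
            W += alpha * h[:, None] * (v - W)
            t += 1
    return W
```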
We have analyzed the suitability of different scalable parallel
implementations of this algorithm for the GRID environment.
Previous SOM parallel algorithms have focused on parallel computers
with predetermined resources and fast communications [5]. On the
contrary, our aim is to design an adaptive scheme for distributing data
and computational load according to the changing resources available
for each GRID job submitted. To this end, some preliminary work is
needed to understand the limitations and requirements of different
parallel implementations of the SOM algorithm.
The simplest way to parallelize the SOM algorithm is to split
the data between different processors, as shown in Fig. 4(a).
However, in this case, after each complete cycle the slaves must send
their prototypes to the master, which combines them and sends the
final centers back to the slaves. This is not an efficient
implementation for the GRID environment, since it requires intensive
message passing of high-dimensional data. Figures 4(b) and (c) show
two different alternatives for distributing computational resources
with replicated (or centralized) prototype vectors. The messages
required by each of the schemes are shown using dashed
lines, which may correspond to either an iteration of the algorithm or
a whole cycle.
Fig. 4 Three different parallel schemes for the SOM training algorithm. (a) distributing data, (b)
distributing computational resources with replicated prototype vectors, (c) distributing computational
resources with centralized prototype vectors.
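The data-distributed scheme of Fig. 4(a) can be sketched as follows using the batch variant of the SOM, in which each worker accumulates kernel-weighted sums over its data partition and the master combines them once per cycle. Python's multiprocessing pool stands in for the GRID message passing, and the batch formulation itself is an assumption made here to keep the per-cycle communication pattern explicit.

```python
# Data-partitioned batch SOM: workers accumulate partial sums, the
# master combines them and broadcasts the new prototypes each cycle.
import numpy as np
from multiprocessing import Pool

def partial_sums(args):
    """Worker: kernel-weighted sums over one data partition."""
    chunk, W, C, sigma = args
    num = np.zeros_like(W)
    den = np.zeros(W.shape[0])
    for v in chunk:
        ki = np.argmin(np.linalg.norm(W - v, axis=1))
        h = np.exp(-np.sum((C - C[ki]) ** 2, axis=1) / (2 * sigma ** 2))
        num += h[:, None] * v
        den += h
    return num, den

def batch_som_distributed(data, k=10, n_cycles=10, n_workers=4, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k * k, data.shape[1]))
    C = np.array([(i, j) for i in range(k) for j in range(k)], dtype=float)
    chunks = np.array_split(data, n_workers)          # split data among slaves
    with Pool(n_workers) as pool:
        for cycle in range(n_cycles):
            sigma = (k / 2.0) * (1.0 - cycle / n_cycles) + 0.5
            parts = pool.map(partial_sums, [(c, W, C, sigma) for c in chunks])
            num = sum(p[0] for p in parts)            # master combines sums
            den = sum(p[1] for p in parts)
            W = num / den[:, None]                    # new prototypes broadcast
    return W
```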
To check the efficiency of these algorithms, we performed several
experiments and found that there is no single optimal alternative;
rather, a combination (hybridization) of algorithms is required for the GRID
framework.
4.2 MCMC. Markov Chain Monte Carlo
Precipitation fields produced by NWP models need further processing
to be useful for hydrological purposes. The starting point for this
purpose is an application of the linear state space model and the
Kalman filter. In the standard state space form, multivariate
observations $y_t$ are related to unobserved state vectors $\alpha_t$ by a linear
observation equation

$y_t = Z_t \alpha_t + \epsilon_t$,  $t = 1, 2, \dots$

where $Z_t$ is a design matrix of appropriate dimension, and $\{\epsilon_t\}$ is a
white noise process, i.e., a sequence of mutually uncorrelated error
variables with $E(\epsilon_t) = 0$ and $\mathrm{cov}(\epsilon_t) = \Sigma_t$. The observation equation is a
varying-coefficient model with time t as effect-modifier. The
sequence of states is defined by a linear transition equation

$\alpha_t = F_t \alpha_{t-1} + \xi_t$,  $t = 1, 2, \dots$

where $F_t$ is a transition matrix, $\{\xi_t\}$ is a white noise sequence with $E(\xi_t) = 0$,
$\mathrm{cov}(\xi_t) = Q_t$, and the initial state has $E(\alpha_0) = a_0$ and $\mathrm{cov}(\alpha_0) = Q_0$.
The mean and covariance structure of the model is fully specified by
assuming that $\{\epsilon_t\}$ and $\{\xi_t\}$ are mutually uncorrelated and
uncorrelated with the initial state $\alpha_0$. The joint and marginal
distributions of $\{y_t, \alpha_t\}$ are completely specified by distributional
assumptions on the errors and the initial state.
As soon as a model can be written in state space form, it provides
the key for employing unified methods of statistical inference. Given
the observations $y_1, \dots, y_T$, estimation of $\alpha_t$ is the primary goal. This is
termed filtering for $t = T$, smoothing for $t < T$ and prediction for $t > T$.
Under the normality assumption, the optimal solution to the filtering
problem is given by the posterior mean

$a_{t|t} = E(\alpha_t \mid y^t, w^t)$

of $\alpha_t$ given $y^t$ and $w^t$, the histories of observations and covariates up
to time t. Since the model is linear and Gaussian, the
posterior distribution of $\alpha_t$ is also Gaussian,

$\alpha_t \mid y^t, w^t \sim N(a_{t|t}, V_{t|t})$

with posterior covariance matrix

$V_{t|t} = E[(\alpha_t - a_{t|t})(\alpha_t - a_{t|t})']$.
The linear Kalman filter and smoother compute the posterior
means and covariance matrices in an efficient recursive way.
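For concreteness, the filtering recursions can be sketched as follows for time-invariant system matrices; the smoother is omitted. This is a standard textbook sketch under the model above, not the project's implementation.

```python
# Linear Kalman filter: prediction and correction recursions.
import numpy as np

def kalman_filter(y, Z, F, Sigma, Q, a0, Q0):
    """y: (T, p) observations. Returns filtered means a_{t|t} and covariances V_{t|t}."""
    a, V = a0, Q0
    means, covs = [], []
    for yt in y:
        # prediction step: a_{t|t-1}, V_{t|t-1}
        a_pred = F @ a
        V_pred = F @ V @ F.T + Q
        # correction step with Kalman gain K_t
        S = Z @ V_pred @ Z.T + Sigma
        K = V_pred @ Z.T @ np.linalg.inv(S)
        a = a_pred + K @ (yt - Z @ a_pred)
        V = V_pred - K @ Z @ V_pred
        means.append(a)
        covs.append(V)
    return np.array(means), np.array(covs)
```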
The usual derivation of the Kalman filter and smoother takes
advantage of the fact that the posterior distributions are normal. A
generalized extended Kalman filter can be derived as an approximate
posterior mode estimator. The filter is derived as a gradient algorithm
via the discrete maximum principle, replacing Gaussian log-likelihoods
and their derivatives by the corresponding terms for non-Gaussian
observations. In the Taylor expansions, unknown parameters have to be
replaced by currently available estimates. This also concerns the
observation covariance matrices $\Sigma_t(\alpha_t)$, in contrast to the Gaussian
case, where $\Sigma_t$ is assumed to be known.
In an empirical Bayesian framework, hyperparameters $\theta$, such as
initial values or covariance matrices, are considered as unknown
constants. To estimate these hyperparameters we used an EM-type
algorithm, which replaces posterior means and covariance matrices by
posterior modes and curvatures obtained from the filtering and
smoothing algorithms. To explain the principles of the EM algorithm we
consider the case of a univariate time-invariant state space model with
unknown vector of hyperparameters $\theta = (\sigma^2, Q, a_0, Q_0)$. The joint
log-likelihood of the complete data is given by

$l(y_1, \dots, y_T, \alpha_0, \dots, \alpha_T; \theta) = -\frac{T}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_t (y_t - z_t' \alpha_t)^2 - \frac{T}{2} \log(\det Q) - \frac{1}{2} \sum_t (\alpha_t - F_t \alpha_{t-1})' Q^{-1} (\alpha_t - F_t \alpha_{t-1}) - \frac{1}{2} \log(\det Q_0) - \frac{1}{2} (\alpha_0 - a_0)' Q_0^{-1} (\alpha_0 - a_0)$.

In the p-th cycle of the algorithm, the E-step consists of computing
the conditional expectation $M(\theta \mid \theta^{(p)})$ of the log-likelihood given the
observations and the current iterate $\theta^{(p)}$. The next iterate $\theta^{(p+1)}$ is obtained as the
maximizer of $M(\theta \mid \theta^{(p)})$ with respect to $\theta$. The conditional expectations
can be computed by running the Kalman filter and smoother fixed at
$\theta^{(p)}$, and the maximization problem can be solved analytically [2].
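The following sketch shows the full EM cycle for the simplest concrete case, the univariate local-level model ($z_t = F_t = 1$, scalar states): the E-step runs the Kalman filter and fixed-interval (RTS) smoother at the current hyperparameters, and the M-step applies the closed-form maximizers of the expected complete-data log-likelihood. The model choice and the decision to keep $Q_0$ fixed are simplifying assumptions made for illustration.

```python
# EM for the hyperparameters of a univariate local-level state space model.
import numpy as np

def smooth(y, sigma2, Q, a0, Q0):
    """E-step: smoothed means a_s, variances V_s, lag-one covariances V_lag."""
    T = len(y)
    a_p, V_p = np.zeros(T), np.zeros(T)   # predicted moments a_{t|t-1}, V_{t|t-1}
    a_f, V_f = np.zeros(T), np.zeros(T)   # filtered moments a_{t|t}, V_{t|t}
    a, V = a0, Q0
    for t in range(T):
        a_p[t], V_p[t] = a, V + Q          # prediction
        K = V_p[t] / (V_p[t] + sigma2)     # Kalman gain
        a_f[t] = a_p[t] + K * (y[t] - a_p[t])
        V_f[t] = (1 - K) * V_p[t]
        a, V = a_f[t], V_f[t]
    a_s, V_s = a_f.copy(), V_f.copy()      # backward (RTS) smoothing
    J = np.zeros(T)
    for t in range(T - 2, -1, -1):
        J[t] = V_f[t] / V_p[t + 1]
        a_s[t] = a_f[t] + J[t] * (a_s[t + 1] - a_p[t + 1])
        V_s[t] = V_f[t] + J[t] ** 2 * (V_s[t + 1] - V_p[t + 1])
    V_lag = J[:-1] * V_s[1:]               # Cov(alpha_t, alpha_{t-1} | y^T)
    return a_s, V_s, V_lag

def em(y, sigma2=1.0, Q=1.0, a0=0.0, Q0=10.0, n_iter=50):
    for _ in range(n_iter):
        a_s, V_s, V_lag = smooth(y, sigma2, Q, a0, Q0)   # E-step
        sigma2 = np.mean((y - a_s) ** 2 + V_s)           # M-step updates
        d = a_s[1:] - a_s[:-1]                           # transitions between
        Q = np.mean(d ** 2 + V_s[1:] + V_s[:-1] - 2 * V_lag)  # observed times
        a0 = a_s[0]                                      # initial state mean
    return sigma2, Q, a0
```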
In contrast to Gaussian linear state space models, the full conditionals
$p(\alpha \mid \cdot)$ for the state parameters are no longer Gaussian for dynamic
generalized linear models with fundamentally non-Gaussian responses.
Therefore, direct Gibbs sampling is not possible. From the few different
approaches to overcoming this problem, we decided to implement the
hybrid Metropolis-Hastings (MH) block move algorithm with conditional prior
proposals [4]. This algorithm uses the full conditional prior distribution
$p(\alpha_{rs} \mid \alpha_{\setminus rs}, Q)$ to generate a proposal value for the blocks $\alpha_{rs} = (\alpha_r, \dots, \alpha_s)$,
$1 \le r < s \le T$. The algorithm is computationally very efficient, since it
avoids any Fisher scoring steps and all proposals are drawn from
Gaussian distributions with known moments.
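A single-site sketch of this idea (blocks of length one, r = s) is given below for a dynamic Poisson model $y_t \sim \text{Poisson}(\exp(\alpha_t))$ with random-walk states and known Q; the observation model, the diffuse treatment of the endpoints, and all default values are assumptions made for illustration. The key property carries over from the text: since the proposal is the conditional prior, the prior terms cancel in the acceptance ratio and only the likelihood of $y_t$ remains.

```python
# MH with conditional prior proposals, single-site version, for a
# dynamic Poisson model with random walk states alpha_t.
import numpy as np

def log_lik(y_t, alpha_t):
    # Poisson log-likelihood with log link (constant log(y!) dropped)
    return y_t * alpha_t - np.exp(alpha_t)

def mh_sweep(alpha, y, Q, rng):
    T = len(y)
    for t in range(T):
        if t == 0:                            # diffuse initial state assumed
            mean, var = alpha[1], Q
        elif t == T - 1:
            mean, var = alpha[T - 2], Q
        else:                                 # conditional prior of alpha_t
            mean, var = 0.5 * (alpha[t - 1] + alpha[t + 1]), 0.5 * Q
        prop = rng.normal(mean, np.sqrt(var))
        # acceptance probability reduces to the likelihood ratio
        if np.log(rng.uniform()) < log_lik(y[t], prop) - log_lik(y[t], alpha[t]):
            alpha[t] = prop
    return alpha

def sampler(y, Q=0.1, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.log(np.maximum(y, 0.5))        # crude initialization
    draws = []
    for _ in range(n_iter):
        alpha = mh_sweep(alpha, y, Q, rng)
        draws.append(alpha.copy())
    return np.array(draws)
```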
The basic tool we applied for statistical inference is the public
domain statistical package R [7]. To achieve distributed, parallel computing
in R we planned to use the Common Object Request Broker Architecture,
available as the CORBA package. In this implementation one R session
acts as a manager and dispatches tasks to different servers running in
other R worker sessions. This uses the ability to invoke asynchronous
or background CORBA calls in R (http://www.omegahat.org/RSCORBA).
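Since the RSCORBA calls themselves are specific to that package, the manager/worker pattern is sketched below in language-neutral form, with Python's multiprocessing standing in for the CORBA calls between R sessions; the task body is a placeholder.

```python
# Manager/worker pattern with asynchronous dispatch, as an analogy for
# one R manager session invoking background calls on R worker sessions.
from multiprocessing import Pool

def worker_task(partition_id):
    # placeholder for a per-partition computation (e.g. fitting a model
    # to one slice of a distributed database)
    return partition_id, sum(i * i for i in range(1000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # asynchronous dispatch: the manager continues while workers run
        async_results = [pool.apply_async(worker_task, (p,)) for p in range(8)]
        for r in async_results:
            pid, value = r.get()              # collect results as they finish
            print(f"partition {pid}: {value}")
```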
Acknowledgements
The work described in this paper is supported in part by the European
Union through the IST-2001-32243 project "CrossGrid". The authors
are also grateful to the Instituto Nacional de Meteorología INM, the
Comisión Interministerial de Ciencia y Tecnología (CICYT Project
REN2000-1572), and to the Polish Committee for Scientific Research
(KBN project 115/E-343/SPB/5.PR UE/DZ 206) for partial financial
support.
References
1. Cavazos T. Using self-organizing maps to investigate extreme climate events. Journal of Climate 2000; 13: 1718-1732.
2. Fahrmeir L. and Tutz G. Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer-Verlag, 2000.
3. Herman-Izycki L., Jakubiak B., Nowinski K., and Niezgodka B. UMPL - the numerical weather prediction system for operational applications. In: Research Works Based on the ICM's UMPL Numerical Weather Prediction System Results. B. Jakubiak (Ed), Wydawnictwa ICM, Warsaw, 14-27, 2000.
4. Knorr-Held L. Conditional prior proposals in dynamic models. Scandinavian Journal of Statistics 1999; 26: 129-144.
5. Lawrence R.D., Almasi G.S. and Rushmeier H.E. A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Mining and Knowledge Discovery 1999; 3: 171-195.
6. Oja E. and Kaski S. Kohonen Maps. Amsterdam: Elsevier, 1999.
7. Venables W.N., Smith D.M., and the R Development Core Team. An Introduction to R: A Programming Environment for Data Analysis and Graphics. User manual for version 1.6.2, 2003.