Proceedings of the International Conference , “Computational Systems and Communication
Technology” Jan.,9,2009 - by Lord Venkateshwaraa Engineering College,
Kanchipuram Dt.PIN-631 605,INDIA
A Survey on Mining Spatio-Temporal Data using
Independent Component Analysis
B. PRIYA (1), M. VELU (2)
(1) Lecturer, MCA Dept, Sri SaiRam Engineering College, Chennai, [email protected]
(2) Senior Analyst Programmer, London Underground - Information Management, London, [email protected]
Abstract
Recent advances in telecommunications (e.g., GPS, cellular networks)
have facilitated the collection of large spatial and spatio-temporal
datasets. The volume of such data and their potentially high update
rate make manual analysis extremely difficult (if not impossible),
calling for mining techniques that extract valuable information
automatically. Furthermore, the special nature of the data and the
analysis objectives render knowledge extraction techniques designed
for simple data types inadequate. Special issues in this Data Mining
(DM) field include the fuzzy and implicit nature of spatial and
spatio-temporal relationships between objects, the complex geometry of
spatial objects, the varying temporal nature of events (instantaneous
vs. durable), the variability of spatio-temporal data (moving objects,
evolution of spatial events or phenomena, etc.), and the multiple
(spatial and temporal) resolution levels of abstraction.
Spatio-temporal data mining represents the confluence of several
fields, including spatio-temporal databases, machine learning,
statistics, geographic visualization, and information theory. Spatial
data mining and temporal data mining have each received much attention
independently in the KDD and DM research communities. Nevertheless,
the need to investigate both “spatial” and “temporal” relations at the
same time complicates the data mining tasks even further. A crucial
challenge in spatio-temporal DM is finding efficient methods, given
the large amount of spatio-temporal data and the complexity of
spatio-temporal data types, data representations, and spatial data
structures.
The applications of Independent Component Analysis (ICA) range from
speech processing, brain imaging, and electrical brain signals to
telecommunications and stock predictions. This article deals with
mining spatio-temporal data using ICA, with some case studies related
to weather data mining.
Key Words:
Spatio-Temporal Data, ICA, PCA, NAO,
Machine learning, Weather Data Mining.
1. Introduction
Spatio-temporal data records spatial views of
objects across time. Data produced from fluid
dynamics simulations, or geoinformatics data
that tracks the behavior of intrusions are of this
type. A fundamental difference between pure
spatial data and spatio-temporal data is that
objects in spatio-temporal data are under
constant change. Regardless of the nature of a
change (e.g., location change, shape change), a
standard assumption is that change is
continuous. This means that while changing, a
quantity must pass through all the intermediate
values. For example, in the quantity space {-, 0, +},
a value cannot change from '-' to '+' without
going through the value 0. It is often impossible to
model and represent the continuous properties of
changes, especially, when multiple objects are
involved. A commonly used approach is to
represent a continuously changing system as a
sequence of snapshots, where each snapshot
records the state of every involved object at a
certain time point.
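The snapshot representation described above can be sketched as a small data structure. This is only an illustrative sketch: the `Snapshot` class, the `to_data_matrix` helper and the two-object example are hypothetical names and values introduced here, not part of the original paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """State of every tracked object at one time point (hypothetical type)."""
    time: float
    positions: dict  # object id -> (x, y)


def to_data_matrix(snapshots, object_ids):
    """Stack snapshots into a time-by-variable matrix, one row per time point."""
    ordered = sorted(snapshots, key=lambda s: s.time)
    return [
        [coord for oid in object_ids for coord in snap.positions[oid]]
        for snap in ordered
    ]


# Two snapshots of two moving objects, deliberately given out of time order.
snapshots = [
    Snapshot(time=1.0, positions={"a": (0.5, 0.0), "b": (2.0, 1.0)}),
    Snapshot(time=0.0, positions={"a": (0.0, 0.0), "b": (2.0, 2.0)}),
]
matrix = to_data_matrix(snapshots, ["a", "b"])
print(matrix)
```

A time-by-variable matrix of this kind is the usual input layout for decomposition methods such as PCA and ICA.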
Temporal and spatial data are complex to mine because of their
internal structure, which can be considered multi-dimensional. Indeed,
spatial data may involve two or three dimensions for determining a
region, as well as complex relations for describing the relative
positions of regions with respect to each other. Temporal data may
present a linear aspect but also a two-dimensional one
when time intervals are taken into account and have to
be analyzed. In this way, mining temporal or spatial
data is a task related to KDDK [1].
ICA is becoming an increasingly important tool
for analyzing large data sets. In essence, ICA
separates an observed set of signal mixtures into
a set of statistically independent component
signals, or source signals. In so doing, this
powerful method can extract the relatively small
amount of useful information typically found in
large data sets.
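The separation described above is commonly written as a linear mixing model. The notation below is the standard textbook form and is an assumption here: the symbols $A$ (mixing matrix), $s(t)$ (source signals) and $W$ (unmixing matrix) do not appear in the original text.

```latex
x(t) = A\, s(t), \qquad y(t) = W\, x(t) \approx s(t)
```

Here $x(t)$ is the observed vector of signal mixtures, and the estimate $y(t)$ matches the sources only up to the order and scale of the components.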
2. Independent Component Analysis (ICA)
ICA is a fairly new method that is generally applicable to several
challenges in signal processing. It raises a diversity of theoretical
questions and opens a variety of potential applications. Successful
results in electroencephalography (EEG), functional Magnetic Resonance
Imaging (fMRI), speech recognition and face recognition systems
indicate the power of, and the optimism surrounding, the new paradigm.
ICA is a method for finding underlying factors or components from
multidimensional statistical data. There are many latent variable
decomposition methods, such as Principal Component Analysis (PCA),
singular value decomposition (SVD), factor analysis, projection
pursuit and so on. What distinguishes ICA from these methods is that
it looks for components that are both statistically independent and
non-Gaussian. In PCA or factor analysis, an observed vector x(t) is
first centered by removing its mean (in practice, the mean is
estimated as the average value of the vector in a sample). Then the
vector is transformed by a linear transformation into a new vector,
possibly of lower dimension, whose elements are uncorrelated with each
other.
The linear transformation is found by computing the eigenvalue
decomposition of the covariance matrix, which for zero-mean vectors is
the correlation matrix
$$E[x(t)\,x(t)^T]$$
whose eigenvectors form a new coordinate system in which the data are
represented. Often only a small number of components $y_i(t)$ need to
be retained, maybe only 1 or 2, yet these components contain most of
the information and may provide an insight into the structure of the
data in terms of second-order statistics.
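The PCA steps just described (centering, covariance matrix, leading eigenvector) can be sketched in plain Python. This is a minimal illustration under assumptions: the helper names and the five-point data set are made up for the example, and power iteration is used here as one simple way of obtaining the dominant eigenvector, not as the method of the original paper.

```python
import math


def center(rows):
    """Subtract the sample mean of every column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: dominant eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Illustrative 2-D data lying close to the diagonal direction (1, 1).
data = [[2.0, 1.9], [0.0, 0.1], [-1.0, -1.1], [1.0, 1.2], [-2.0, -2.1]]
centred = center(data)
pc1 = leading_eigenvector(covariance(centred))
# Projection of each observation onto the first principal component.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centred]
```

Keeping only `scores` reduces the two-dimensional data to one component while retaining most of its variance.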
The basic PCA network can be described by
$$y_i(t) = \sum_j w_{ij}\, x_j(t)$$
$$x_j'(t) = x_j(t) - \sum_i w_{ij}\, y_i(t)$$
$$\Delta w_{ij} = \eta\, x_j'(t)\, y_i(t), \qquad i = 1, 2, \ldots, N$$
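For intuition, the single-output case (N = 1) of a PCA network of this kind can be sketched as follows. This is a hedged illustration: it uses Oja's learning rule, which is assumed here to be the rule the network equations refer to, and the learning rate, epoch count and synthetic Gaussian data are arbitrary choices made for the example.

```python
import math
import random


def oja_first_component(samples, learning_rate=0.01, epochs=5):
    """Single-output Oja rule: w <- w + eta * y * (x - y * w)."""
    dim = len(samples[0])
    w = [1.0] + [0.0] * (dim - 1)  # arbitrary unit-length starting weights
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))          # output y = w . x
            residual = [xi - y * wi for xi, wi in zip(x, w)]  # x' = x - y * w
            w = [wi + learning_rate * y * ri for wi, ri in zip(w, residual)]
    return w


# Synthetic zero-mean data: large variance along (1, 1), small along (1, -1).
rng = random.Random(0)
samples = []
for _ in range(500):
    major = rng.gauss(0.0, 2.0)
    minor = rng.gauss(0.0, 0.5)
    samples.append([(major + minor) / math.sqrt(2),
                    (major - minor) / math.sqrt(2)])

w = oja_first_component(samples)
# Cosine of the angle between the learned weights and the true direction (1, 1).
alignment = abs(w[0] + w[1]) / math.sqrt(2) / math.hypot(w[0], w[1])
```

The weight vector settles near the direction of largest variance with roughly unit length, which is the first principal component of the data.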
In many applications, however, uncorrelatedness is not enough; we must
find the independent components (ICs). Independence here does not
correspond to the independence in factor analysis. Factor analysis was
originally developed in the social sciences, and its factors are often
claimed to be independent, but this is only partly true, because
factor analysis assumes that the data have a Gaussian distribution.
For Gaussian data it is easy to find the ICs, since uncorrelated
components are then equivalent to independent ones [2].
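A small numerical example shows why uncorrelatedness alone is not enough outside the Gaussian case. The data below are a made-up illustration: `y` is completely determined by `x`, yet their covariance is zero, and the dependence only shows up in a higher-order statistic.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return mean([(x - ma) * (y - mb) for x, y in zip(a, b)])


# x is symmetric around zero; y is a deterministic function of x.
x = [i / 100.0 for i in range(-100, 101)]
y = [v * v for v in x]

cov_xy = covariance(x, y)                     # second order: vanishes
cov_x2_y = covariance([v * v for v in x], y)  # higher order: clearly non-zero
print(cov_xy, cov_x2_y)
```

PCA, which only uses second-order statistics, would treat `x` and `y` as unrelated; a method based on independence would not.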
ICA, on the other hand, tries to find statistically independent
sources by additionally minimizing higher-order statistical dependence
between the components. In applications where we intend to find the
independent factors in a huge data set, ICA is therefore better suited
than PCA or factor analysis.
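The role of higher-order statistics can be illustrated with a toy two-signal ICA. This is a sketch under stated assumptions, not a production algorithm: it whitens the mixtures with PCA and then searches for the rotation that maximizes squared excess kurtosis, which is one classical contrast among several. The sources, the mixing coefficients and all function names are invented for the example.

```python
import math
import random


def centred(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def mean_product(a, b):
    return sum(x * y for x, y in zip(a, b)) / len(a)


def excess_kurtosis(values):
    """Fourth-order statistic; close to zero for a Gaussian signal."""
    v = centred(values)
    squares = [x * x for x in v]
    return mean_product(squares, squares) / mean_product(v, v) ** 2 - 3.0


def whiten(x1, x2):
    """Rotate onto the covariance eigenvectors (PCA), rescale to unit variance."""
    x1, x2 = centred(x1), centred(x2)
    a, b, c = mean_product(x1, x1), mean_product(x1, x2), mean_product(x2, x2)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    p1 = [cos_t * u + sin_t * v for u, v in zip(x1, x2)]
    p2 = [-sin_t * u + cos_t * v for u, v in zip(x1, x2)]
    d1, d2 = math.sqrt(mean_product(p1, p1)), math.sqrt(mean_product(p2, p2))
    return [u / d1 for u in p1], [v / d2 for v in p2]


def rotate(z1, z2, phi):
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    return ([cos_p * u + sin_p * v for u, v in zip(z1, z2)],
            [-sin_p * u + cos_p * v for u, v in zip(z1, z2)])


def ica_2d(x1, x2, steps=90):
    """Pick the rotation of the whitened data with the most non-Gaussian outputs."""
    z1, z2 = whiten(x1, x2)

    def contrast(phi):
        y1, y2 = rotate(z1, z2, phi)
        return excess_kurtosis(y1) ** 2 + excess_kurtosis(y2) ** 2

    best = max((k * math.pi / (2 * steps) for k in range(steps)), key=contrast)
    return rotate(z1, z2, best)


def abs_corr(a, b):
    a, b = centred(a), centred(b)
    return abs(mean_product(a, b)) / math.sqrt(
        mean_product(a, a) * mean_product(b, b))


# Two synthetic independent sources: uniform and Laplace-distributed.
rng = random.Random(0)
n = 2000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
# Observed mixtures (arbitrary mixing coefficients).
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]
y1, y2 = ica_2d(x1, x2)
```

Each recovered signal should correlate strongly with exactly one source, in either order and with either sign, which reflects the usual permutation and scaling ambiguity of ICA.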
3. Mining Spatio-Temporal Data
“Data mining is the process of digging through
large data sets and extracting the useful
information for analyzing to find the hidden
patterns and relationships using modern
statistical and computational techniques.”[3]
Many natural phenomena present intrinsic spatial
and temporal characteristics. With the recent
advances in data collection technologies, high
resolution spatio-temporal datasets can be stored
and analyzed to accurately study the behavior of
such events. However, these datasets are often
very large and difficult to analyze and display.
Recently much attention has been dedicated to
the application of innovative data-mining
techniques to filter out relevant subsets of very
large repositories as well as to the development
of visualization tools to effectively display the
corresponding results.
Spatio-temporal data mining is an emerging research area dedicated to
the development and application of novel computational techniques for
the analysis of large spatio-temporal databases.
Some research estimates that about 80% of the data stored in corporate
databases integrate spatial information (Fayyad and Grinstein, 2001),
leading to huge amounts of geo-referenced information that need to be
analyzed and processed. These datasets are often critical for decision
support, but their value depends on the ability to extract useful
information for studying and understanding the phenomena governing the
data source. Therefore, the need for efficient and effective
techniques for analyzing spatio-temporal datasets has recently emerged
as a research priority (Bédard et al., 2001): spatio-temporal data
mining aims at addressing these needs. It encompasses a set of
exploratory, computational and interactive approaches for analyzing
very large spatial and spatio-temporal datasets. Numerous research
projects on spatial data mining have been conducted in the last two
decades (a comprehensive review is provided by Andrienko et al.,
2003). Several open issues have been identified, ranging from the
definition of mining techniques capable of dealing with
spatio-temporal information to the development of effective methods
for interpreting and visualizing the final results.
In particular, visualization techniques are widely recognized to be
powerful in this domain (Andrienko et al., 2003), (Andrienko et al.,
2005), (Johnston, 2001), since they take advantage of human abilities
to perceive visual patterns and to interpret them (Andrienko et al.,
2003), (Kopanakis and Theodoulidis, 2003), (Costabile and Malerba,
2003). However, it is recognized that the spatial visualization
features provided in existing geographical applications are not
adequate for decision support when used alone. Hence, alternative
solutions have to be defined (Bédard et al., 2001) to dynamically and
interactively obtain different spatial and temporal views, and to
interact in different ways with the results produced during the data
mining process. The problems of how to visualize the spatio-temporal
multidimensional dataset (Bédard et al., 1997) and how to define
effective visual interfaces for viewing and manipulating the
geometrical components of the spatial data (Shneiderman, 2002) are
some of the challenges that still need to be tackled [4]. ICA is an
efficient method to be used in a geospatial environment for mining
spatio-temporal data.
The main impulse to research in this subfield of data mining comes
from the large amount of
- spatial data made available by GIS, CAD, robotics and computer
  vision applications, computational biology, mobile computing
  applications;
- temporal data obtained by registering events (e.g.,
  telecommunication or web traffic data) and monitoring processes and
  workflows.
Both the temporal and spatial dimensions add
substantial complexity to data mining tasks. First
of all, the spatial relations, both metric (such as
distance) and non-metric (such as topology,
direction, shape, etc.) and the temporal relations
(such as before and after) are information
bearing and therefore need to be considered in
the data mining methods.
Secondly, some spatial and temporal relations
are implicitly defined, that is, they are not
explicitly encoded in a database. These relations
must be extracted from the data and there is a
trade-off between precomputing them before the
actual mining process starts (eager approach) and
computing them on-the-fly when they are
actually needed (lazy approach). Moreover,
despite much formalization of space and time
relations available in spatio-temporal reasoning,
the extraction of spatial/temporal relations
implicitly defined in the data introduces some
degree of fuzziness that may have a large impact
on the results of the data mining process.
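The eager versus lazy trade-off for implicit spatial relations can be illustrated with pairwise distances. This is only a sketch: the three points, the distance threshold and the function names are hypothetical, and `functools.lru_cache` stands in for whatever caching a real system would use.

```python
import math
from functools import lru_cache

# Hypothetical point objects stored only as coordinates.
points = {"p1": (0.0, 0.0), "p2": (3.0, 4.0), "p3": (6.0, 8.0)}


def distance(a, b):
    """Metric relation that is implicit in the stored coordinates."""
    (xa, ya), (xb, yb) = points[a], points[b]
    return math.hypot(xb - xa, yb - ya)


# Eager approach: materialise every pairwise distance before mining starts.
eager = {(a, b): distance(a, b) for a in points for b in points if a < b}


# Lazy approach: compute a distance on first request, then reuse the cached value.
@lru_cache(maxsize=None)
def lazy_distance(a, b):
    return distance(a, b)


# A simple "close to" relation derived from the precomputed table.
near_pairs = [pair for pair, d in eager.items() if d <= 5.0]
print(near_pairs)
```

The eager table costs memory proportional to the number of object pairs, while the lazy version pays the computation cost only for pairs the mining step actually asks about.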
Thirdly, working at the level of stored data, that
is, geometric representations (points, lines and
regions) for spatial data or time stamps for
temporal data is often undesirable. For instance,
urban planning researchers are interested in
possible relations between two roads, which
either cross each other, or run parallel, or can be
confluent, independently of the fact that the two
roads are represented by one or more tuples of a
relational table of “lines” or “regions”.
Therefore, complex transformations are required
to describe the units of analysis at higher
conceptual levels, where human-interpretable
properties and relations are expressed.
Fourthly, spatial resolution or temporal granularity can have a direct
impact on the strength of the patterns that can be discovered in the
datasets. Interesting patterns are more likely to be discovered at the
lowest (finest) resolution/granularity level. On the other hand, large
support is more likely to exist at higher (coarser) levels.
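The effect of granularity on support can be seen on a toy event log. The events, the morning/evening split and the grid-cell naming below are all hypothetical, chosen only to show how coarsening time and space merges rare fine-grained patterns into fewer, better-supported ones.

```python
from collections import Counter

# Hypothetical event log: (hour of day, grid cell) of a recurring event.
events = [(8, "A1"), (9, "A1"), (8, "A2"), (17, "A1"), (18, "A2"), (9, "A1")]


def support(events, time_bucket, cell_bucket):
    """Count events per (time, space) unit after coarsening both dimensions."""
    return Counter((time_bucket(h), cell_bucket(c)) for h, c in events)


# Finest level: individual hours and individual cells.
fine = support(events, lambda h: h, lambda c: c)
# Coarser level: half-days and cell rows (first character of the cell id).
coarse = support(events,
                 lambda h: "morning" if h < 12 else "evening",
                 lambda c: c[0])
print(fine.most_common(1), coarse.most_common(1))
```

At the coarse level the pattern "morning events in row A" has higher support, but it no longer says which hour or which cell is involved.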
Fifthly, many rules of qualitative reasoning on
spatial and temporal data (e.g., transitive
properties for temporal relations after and
before) as well as spatio-temporal ontologies,
provide a valuable source of domain independent
knowledge that should be taken into account
when generating patterns. How to express these
rules and how to integrate spatio-temporal
reasoning mechanisms in data mining systems
are still open problems.
Additional research issues related to spatio-temporal data mining
concern visualization of spatio-temporal patterns and phenomena,
scalability of the methods, and data structures used to represent and
efficiently index spatio-temporal data [5].
4. Analysis of Spatio-Temporal Climate
Variability by ICA
Statistical approaches to weather and climate
prediction have a long and distinguished history
that predates modeling based on physics and
dynamics. This trend continues today with new
approaches based on machine learning
algorithms. The central problem in weather and
climate modeling is to predict the future states of
the atmospheric system. It is therefore possible
to view the weather variables as sources of
spatio-temporal signals. The information in these
spatio-temporal signals can be extracted using data
mining techniques. The variation in the weather
variables can be viewed as a mixture of several
independently occurring spatio-temporal signals
with different strengths.
A key problem in climatology is to deduce from
observations the physical phenomena at the
origin of climate variability. Classical
approaches such as PCA are based on
hypotheses that are not always valid for the
analysis of climate (linearity, Gaussian
distributions, orthogonality of components,
maximum of variance in a minimum number of
modes). ICA, by contrast, aims at extracting linearly or nonlinearly
independent components from a dataset of observations or model outputs
using a criterion of statistical independence, which is a stronger
constraint than the decorrelation used in the classical approaches.
Recently there has been increased interest in the use of ICA for image
analysis. ICA can be considered one approach to component analysis.
Among the other approaches, the traditional Principal Component
Analysis (PCA) is the most popular. Component analysis, which extracts
the most important components of the data, is useful for data mining
in remote sensing, which normally involves a very large amount of
data. While PCA attempts to decorrelate the components of a vector,
ICA methods attempt to make the components as statistically
independent as possible. There are several ICA algorithms that can be
implemented efficiently by a neural network. As such, ICA is a very
useful tool for data mining in remote sensing.
5. Case Studies
5.1 Weather Data Mining
Weather data mining is a form of data mining concerned with finding
hidden patterns in the large body of available meteorological data, so
that the information retrieved can be transformed into usable
knowledge.
5.2 Use of ICA in weather Data Mining with
regard to Pacific Decadal Oscillation
If the assumption of independent stable activity
in the weather variables holds true then it is also
possible to extract them using the same
technique of ICA. One basic assumption is that
we view the weather phenomenon as a mixture
of a certain number of signals with independent
stable activity. The weather changes due to the
changes in the mixing patterns of these stable
activities over time. For linear mixtures, the
change in the mixing coefficients gives rise to
the changing nature of the global weather. We
have to investigate whether there exists any set of
stable spatio-temporal patterns such that the
variation of their mixture gives rise to the
observed weather or climate phenomena. The
conjecture is that there exist independent stable
spatio-temporal activities, the mixture of which
gives rise to the weather variables, and these stable activities can
be extracted by applying ICA to the data arising from the weather and
climate patterns, viewed as spatio-temporal signals. If the conjecture
about the existence of stable spatio-temporal activity in the weather
is true, then the mixing coefficients will vary in accordance with the
changes in the weather variables. Figure 5.2 represents an application
of temporal ICA with regard to the Pacific Decadal Oscillation [6].
Figure 5.2 Pacific Decadal Oscillation
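One possible way to write down the linear-mixture assumption of this section is given below. The notation is an assumption introduced here for illustration: $x(r, t)$ denotes a weather variable at location $r$ and time $t$, $p_k(r)$ the $k$-th stable pattern, $a_k(t)$ its time-varying mixing coefficient, and $K$ the number of patterns; none of these symbols appear in the original text.

```latex
x(r, t) = \sum_{k=1}^{K} a_k(t)\, p_k(r)
```

Under this reading, changes in the weather over time are carried entirely by the coefficients $a_k(t)$, while the patterns $p_k$ stay fixed.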
5.3 North Atlantic Oscillation (NAO)
In the work of Basak, Sudarshan, Trivedi and Santhanam [7], the
authors provide a new way of viewing the physical phenomena of
changing weather and climate by mining spatio-temporal data of weather
and climate variables. The NAO is taken as a typical example, and sea
level pressure (SLP) data are mined using ICA. Techniques are provided
for determining the strongest independent components in the
multidimensional data set, and the authors observe that the strongest
stable patterns obtained by ICA match the physical patterns of
oscillation in SLP. The results are also verified by finding a linear
fit of the independent components with the standard NAO index as
provided by meteorological measurements. The method of mining
spatio-temporal data is generic in nature and is not restricted to
weather phenomena. The same method can be applied to find stable
characteristics in other spatio-temporal systems. Even when a
spatio-temporal system is chaotic, the method may be applied to
extract meaningful patterns if the system embeds some such stable
patterns (weather is possibly a natural example of a physical chaotic
system), as shown in Figure 5.3.
Figure 5.3 North Atlantic Oscillation
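The verification step mentioned in this section, a linear fit of an independent component against a reference index, can be sketched with ordinary least squares. The numbers below are synthetic placeholders, not NAO measurements or results from [7], and the function name is invented for the example.

```python
def linear_fit(x, y):
    """Least squares fit y ~ slope * x + intercept, plus Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5


# Synthetic placeholder series (NOT real data): one extracted component
# and a reference index sampled at the same time points.
component = [-1.0, -0.5, 0.0, 0.5, 1.0]
index = [-2.1, -0.9, 0.1, 1.0, 1.9]
slope, intercept, r = linear_fit(component, index)
print(slope, intercept, r)
```

A correlation close to 1 or -1 would suggest that the component tracks the index; the sign is not meaningful because ICA recovers components only up to scale and sign.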
The method can be further investigated in the following manner. First,
it extracts certain stable patterns whose temporal trend closely
matches the physical phenomenon. Therefore, the individual stable
oscillations (obtained as independent components from the
spatio-temporal data) can be analyzed further to predict the
time-series behavior of the oscillation. Second, it is very difficult
to analyze the NAO in order to find the physical correlations between
the various modes that interact to produce the NAO phenomenon.
However, ICA gives a mixing matrix that provides an indication of how
the various modes interact (in a linear manner). Third, a linear
mixture of various independent components is assumed. In further
investigation, this assumption can be relaxed and
nonlinear ICA can be performed on these kinds
of spatio-temporal data sets in order to find even
more meaningful characteristics.
6. Conclusion
Spatio-temporal data mining is an emerging
research area dedicated to the development and
application of novel computational techniques
for the analysis of very large, spatio-temporal
databases. Data mining techniques are typically
inductive, as opposed to deductive, in that they
are not used to prove or disprove pre-existing
hypotheses but rather to identify patterns
embedded within data, and thereby support
hypothesis generation. Most research in spatial,
temporal, and spatio-temporal data mining has
sought to adapt ‘classical’ data mining
algorithms intended to operate on more
conventional data types. Spatiotemporal data
mining presents a number of challenges due to
the complexity of geographic domains, the
mapping of all data values into a spatial and
temporal framework, and the spatial and
temporal autocorrelation exhibited in most
spatio-temporal data sets. ICA is a powerful tool
for mining spatio-temporal data, with weather data
mining as one application.
REFERENCES
[1] http://ralyx.inria.fr/2007/Raweb/orpailleur/uid17.html
[2] Rui Li, “Data mining with Independent Component Analysis”, Proceedings of the 6th World Congress on Intelligent Control and Automation.
[3] http://en.wikipedia.org/wiki/Weather_Data_Mining
[4] http://geoanalytics.net/VisA-SDS2006/paper28.pdf
[5] http://www.di.uniba.it/~malerba/activities/mstd/
[6] ti.arc.nasa.gov/is/IDU/tasks/MLDM.html
[7] J. Basak, A. Sudarshan, D. Trivedi, M. S. Santhanam, “Weather Data Mining Using Independent Component Analysis”, Journal of Machine Learning Research, 2005, Vol. 5, No. 1, pages 239-254.
[8] http://cnl.salk.edu/~tewon/ica_cnl.html
Copy Right @CSE/IT/ECE/MCA-LVEC-2009