A Survey of Spatial Data Mining Approaches:
Algorithms and Architecture
Arvind Sharma1, R K Gupta2
1 Department of Computer Science
MPCT, Gwalior-474011, INDIA
2 Department of Computer Science
MITS, Gwalior, INDIA
1 [email protected], 2 [email protected]
Abstract
Knowledge discovery in spatial data mining is a rapidly
growing field whose development is driven by advanced
research as well as by urgent practical, social, and
environmental needs. Important and sophisticated
application areas include the design of road maps for
regions, states, or countries, cloud cover analysis, traffic
control, and GPS, all based on recorded data collected
from satellites or local cameras. In this paper, we provide
an overview of common knowledge discovery algorithms in
spatial data mining (SPDM). We propose a feature
classification scheme based on clustering and
classification for 3D databases. A comparative study of
the algorithms is also presented.
Keywords: 3D databases, SPDM
1. INTRODUCTION
1.1 Overview and motivation
The collection of data usually referred to as the
database contains information relevant to an entity
such as an organization or enterprise. The primary goal
of a database system is to provide a way to store and
retrieve database information that is both convenient
and efficient. Data mining has been introduced as an
effective method for this purpose. It is usually defined
as searching, analyzing, and sifting through large amounts
of data to find relationships, patterns, or any significant
statistical correlation. More formally, data mining is the
non-trivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
The term 'data mining' refers to finding relevant and
useful information in databases. Data mining and
knowledge discovery in databases form a new
interdisciplinary field that merges ideas from statistics,
machine learning, databases, and parallel computing.
Two broad approaches to handling databases in the data
mining process are spatial data mining and temporal data
mining. Spatial data mining (SPDM) is the process of
discovering interesting, useful, nontrivial patterns,
information, or knowledge from large spatial datasets.
Here the term spatial refers to all data sets that are
related to space or geographical regions. The number and
size of spatial databases, e.g. for geo-marketing, traffic
control, environmental studies, medical diagnosis, or
weather prediction, are growing rapidly, which results in
an increasing need for spatial data mining. These
applications need to store huge amounts of data and
require suitable approaches to obtain useful results. A
variety of SPDM algorithms are discussed in this paper,
with some suggestions. Automated tools with intelligent
algorithms can analyze the raw data and present the
extracted high-level information to the analyst or decision
maker, rather than having the analyst find it himself or
herself. In this paper, we present a survey and study of
different spatial data mining algorithms that have been
implemented in this area [7]. Different research groups
have estimated that the market for SPDM will grow from
$2000 million to $3000 million by 2012.
The aims of this study are:
1) A review of data mining and the SPDM process.
2) A study of existing knowledge discovery and spatial
data mining algorithms.
3) The working architecture of spatial data mining.
Data mining is the process of extracting meaningful
patterns from data, and it is becoming an increasingly
important tool for transforming data into information.
It is, however, just one step in the overall knowledge
discovery in databases (KDD) process. The detailed steps
are given below:
i. Developing an understanding of the application
domain and the goals of the data mining process.
ii. Selection of the target data set.
iii. Integrating and checking the data set.
iv. Data cleaning, preprocessing, and transformation.
v. Hypothesis building and software selection.
vi. Identification, selection, and development of a
suitable algorithm.
vii. Result interpretation and visualization.
viii. Result testing, verification, and refinement.
ix. Result application.
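The chain of steps above can be pictured as a small pipeline in which each
stage consumes the output of the previous one. The following is only a
minimal sketch under stated assumptions: the records, the field layout
(x, y, land-use label), and all function names are hypothetical and are not
taken from the paper, and the "mining" stage is a trivial frequency count
standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records: (x, y, land_use); None marks a missing value.
RAW = [
    (1.0, 2.0, "urban"),
    (1.2, 2.1, "urban"),
    (None, 3.0, "forest"),   # incomplete record, dropped during cleaning
    (8.0, 9.0, "forest"),
    (8.5, 9.5, "Forest "),   # inconsistent label, normalised during cleaning
]

def select(records, max_x=10.0):
    """Step ii: keep records inside the target region (x unknown or <= max_x)."""
    return [r for r in records if r[0] is None or r[0] <= max_x]

def clean(records):
    """Step iv: drop incomplete records and normalise category labels."""
    return [(x, y, label.strip().lower())
            for x, y, label in records
            if x is not None and y is not None]

def mine(records):
    """Step vi: a trivial stand-in 'pattern', the frequency of each class."""
    return Counter(label for _, _, label in records)

def interpret(pattern):
    """Step vii: turn raw counts into relative frequencies for the analyst."""
    total = sum(pattern.values())
    return {label: count / total for label, count in pattern.items()}

def run_pipeline(records):
    """Apply the stages in order, as in the KDD step list."""
    return interpret(mine(clean(select(records))))
```

Keeping each stage as a separate function mirrors the point made in the
text: the mining step is only one link, and its result depends on the
selection and cleaning performed before it.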
2. KNOWLEDGE DISCOVERY AND SPATIAL DATA MINING
This section covers the knowledge discovery and spatial
data mining process for feature extraction and
classification of the spatial and non-spatial attributes of
databases.
2.1 The KDD process
The terms KDD and data mining are sometimes used
interchangeably, but this is not accurate: data mining is
one stage in the whole KDD process. A simple definition
of KDD is as follows: knowledge discovery in databases is
the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data.
2.2 Database sources and issues
With the help of advanced technology and tools, spatial
data can be collected in huge amounts from different
sources for various applications, ranging from remote
sensing and satellite telemetry systems to computer
cartography, medical diagnosis, weather analysis and
prediction, and all kinds of environmental planning.
Various national and international agencies also provide
spatial data in different dimensions. The most common
data sources are satellite images, medical images, the
protein structures of the human body, and any data that
can be represented in the form of cuboids, polygons,
cylinders, etc.
Various sites are also available for collecting GIS
data [10][11][9]. Google Earth, Visible Earth (NASA), the
JSC Digital Image Collection (NASA), and the Global Land
Cover Facility are further sources of spatial data.
Spatial data basically include geographic data, such as
maps and associated information, and computer-aided
design data, such as integrated circuit designs or building
designs. It has been observed that 2D databases are not
very efficient at storing, indexing, and querying data on
the basis of spatial location. Additionally, for 2D
databases we cannot use standard index structures, such as
B-trees or hash indices, to answer such queries efficiently.
It is therefore recommended to work with
higher-dimensional data.
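To illustrate why location-based queries call for a dedicated index rather
than a one-dimensional structure such as a B-tree, the following minimal
sketch buckets 2D points into a uniform grid so that a rectangular range
query only inspects the cells it overlaps. The class name `GridIndex`, the
cell size, and the query interface are assumptions made for illustration;
the paper does not prescribe this structure.

```python
from collections import defaultdict
from math import floor

class GridIndex:
    """Uniform grid over 2D points (illustrative, not from the paper)."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # (col, row) -> list of points

    def _cell(self, x, y):
        """Grid cell that contains the coordinate (x, y)."""
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def insert(self, x, y):
        self.cells[self._cell(x, y)].append((x, y))

    def range_query(self, x_min, y_min, x_max, y_max):
        """Return, sorted, all points inside the inclusive rectangle."""
        c0, r0 = self._cell(x_min, y_min)
        c1, r1 = self._cell(x_max, y_max)
        hits = []
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                # .get avoids creating empty buckets for untouched cells
                for (x, y) in self.cells.get((col, row), ()):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append((x, y))
        return sorted(hits)
```

A query rectangle touches only the buckets it overlaps, so points far from
the query region are never examined. A sorted one-dimensional index cannot
offer this directly, because it orders the data along a single key.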
2.3 Spatial Data Mining Tasks
Extracting patterns from spatial data requires various
methods and techniques for collecting meaningful patterns
from different samples. It should also be noted that
several methods with different goals may be applied
successively to achieve a desired result. Some of the
SPDM tasks are listed here:
Data processing- Analysts or users may select, filter,
aggregate, sample, clean, and/or transform data into a
more understandable form. Unwanted and useless portions
may be cut from the existing data to improve its
productivity and applicability.
Prediction- Prediction means estimating outcomes in
advance on the basis of the previous history or patterns
of data items. Values of specific attributes of the data
items may also be calculated accurately with iterative
methods and different samples of data.
Regression- Given a set of data items, regression
identifies the dependency of some attribute values upon
the values of other attributes in the same item and applies
this dependency to other data items or records.
Classification- Given a set of predefined categorical
classes, determine to which of these classes a specific
data item belongs. For example, in a weather prediction
system we classify satellite images into different classes
on the basis of common properties and patterns.
Clustering- Given a set of data items, group items that are
similar. For example, given a set of satellite images,
identify subgroups of objects or patterns (colored,
non-colored, size, shape) and their behavior.
Link analysis- Given a set of data items, identify
relationships between attributes and items, such as the
presence of one pattern implying the presence of another
pattern.
Model visualization- Visualization plays a very important
role in understanding and demonstrating the desired task
properly. Visualization techniques may range from simple
scatter plots and histogram plots, through parallel
coordinates, to 3D data items.
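As a concrete reading of the link analysis task, the strength of a rule
"pattern A implies pattern B" is commonly measured by support and
confidence. The sketch below assumes each data item is represented as a set
of observed patterns; the function names and this representation are
illustrative assumptions rather than something specified in the paper.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated share of antecedent-containing items that also contain
    the consequent, i.e. support(A and B) / support(A)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

# Hypothetical satellite scenes, each described by the patterns seen in it.
scenes = [
    {"cloud", "rain"},
    {"cloud", "rain"},
    {"cloud"},
    {"clear"},
]
rule_strength = confidence(scenes, ["cloud"], ["rain"])
```

Here `rule_strength` is the share of cloudy scenes that also show rain. A
rule is usually reported only when both its support and its confidence
exceed thresholds chosen by the analyst.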
2.4 Data Mining methods
There are many methods for extracting more information
from data. According to their application and ease of
understanding, these methods can be grouped as shown in
Figure 1. The grouping may differ from author to author.
3. ALGORITHMS AND METHODS INCLUDED IN THIS PAPER
Some of the well-known algorithms are discussed here.
1. A density based algorithm for discovering clusters
in large spatial databases with noise.
The task considered in that paper is class identification,
i.e. the grouping of the objects of a database into
meaningful subclasses. The algorithm, DBSCAN, requires one
input parameter and supports the user in determining an
appropriate value for it. It discovers clusters of
arbitrary shape. DBSCAN was implemented on the basis of
the R*-tree. All experiments were run on HP 735/100
workstations using synthetic data and the database of the
SEQUOIA 2000 benchmark.
Positive aspects: i. Faster. ii. Efficient. iii. Applicable
to large databases. iv. Applicable to clusters of arbitrary
shape. v. Extendable from point objects to polygons.
Points missed: i. High-dimensional data are not considered.
ii. It deals only with static rather than moving obstacles.
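The density-based idea behind DBSCAN can be sketched in a few lines. This is
a simplified illustration and not the implementation evaluated in the
surveyed paper: neighbourhood queries use a linear scan instead of an
R*-tree, and the sketch exposes both a radius `eps` and a minimum neighbour
count `min_pts` (if `min_pts` is held fixed, `eps` is the single value left
for the user to choose). The function names are hypothetical.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (linear scan)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:      # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster                # start a new cluster at a core point
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:         # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:      # already assigned, nothing to expand
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is also core: keep expanding
                seeds.extend(j_neighbours)
        cluster += 1
    return labels
```

Because clusters grow by chaining core points, their shape is not
constrained to be convex, which is the "arbitrary shape" property noted
above. Points that no core point reaches stay labelled as noise.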
2. Algorithm for characterization and trend
detection in spatial databases.
In this work it is observed that, for spatial
characterization, the class membership of a database
object is determined not only by its non-spatial
attributes but also by the attributes of the objects in
its neighborhood. The neighborhood relationship is the
central point of discussion in that paper. Using different
databases, various local and global trends were detected.
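One simple way to picture a spatial trend is to regress a non-spatial
attribute on the distance from a chosen start object. The sketch below is an
illustration under that assumption and is not the surveyed algorithm itself;
the function name, the data layout, and the use of an ordinary least-squares
slope are all choices made here for the example.

```python
from math import dist

def trend_slope(start, objects):
    """Least-squares slope of attribute value against distance from `start`.

    `objects` is a list of ((x, y), value) pairs. A clearly positive or
    negative slope suggests the attribute changes systematically with
    distance from the start object.
    """
    ds = [dist(start, pos) for pos, _ in objects]
    vs = [value for _, value in objects]
    n = len(objects)
    mean_d = sum(ds) / n
    mean_v = sum(vs) / n
    var_d = sum((d - mean_d) ** 2 for d in ds)
    if var_d == 0:
        return 0.0   # all objects equally far away: no trend can be fitted
    cov = sum((d - mean_d) * (v - mean_v) for d, v in zip(ds, vs))
    return cov / var_d
```

For example, if an attribute such as a price index falls by a roughly
constant amount for every unit of distance from a city centre, the slope is
negative and its magnitude estimates the rate of change.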
3. Density connected sets and their application for
trend detection in spatial databases.
In this paper, the concept of density-connected sets and a
generalized form of DBSCAN are introduced. The concept
of trend detection is explained with clear examples. A
systematic change of one or several non-spatial attributes
in 2D or 3D space is described successfully. Certain
predictions are explained on the basis of repeated trends
in the databases. In some cases, however, it has been
observed that the algorithm is not able to give clear-cut
relations between different datasets.
4. Extended algorithm for spatial characterization and
discrimination rules.
In this paper, a new spatial data mining algorithm for
both characterization and discrimination rules has been
proposed. A characterization rule is an assertion which
characterizes a concept satisfied by all or a majority
of the examples in the class undergoing learning (called
the target class). A discrimination rule is an assertion
which discriminates a concept of the class being learned
from other classes (called contrasting classes). Such
rules are very important and useful in medical science
for the diagnosis of diseases.
The proposed algorithm is well suited to the
identification of weather patterns. It can find the
characteristics of some spatial objects as well as the
characteristics that discriminate those spatial objects
from other, contrasting spatial objects.
Positive aspects of this algorithm are:
i. It extracts not only the properties of the target and
contrast objects but also the properties of their
neighbors, as they affect the characteristics of all
objects.
ii. It shows a successful implementation of a general
framework for SPDM.
iii. This algorithm is more suitable for medical and
weather applications.
Negative or future aspects: Sometimes the concept of
relative frequency in the target region does not match
the actual database or work on it at a satisfactory level.
4. SOFTWARE ORGANIZATION AND ARCHITECTURES
Software organization and architecture for SPDM are
shown in Figures 2 and 3.
There is a great need for an intelligent and reliable
tool to support an interactive knowledge discovery
process in large centralized or distributed spatial
databases.
The organization of the SPDM software tool is shown in
Figure 2. Different sites are shown at different locations,
each with its own algorithms and environment, in an
integrated manner. Normally, the organization and
architecture should be designed in a distributed manner so
that the sites can share raw data and intermediate results
under the coordination of a central GUI.
Each distributed site has its own local data, SPDM
software package, and file transfer and remote connection
software. Each user i (i = 1 to n) applies some learning
algorithm to one or more local spatial databases DBi to
produce a local classifier Ci. All local classifiers can
then be sent to a central repository, where they are
combined into a new global classifier (GC) using the
majority or weighted majority principle. The classifier GC
is then sent back to all the individual users as a
possible means of improving their local classifiers.
Designing such software and organizing the data and
working modules in an efficient manner is also an
important research subject nowadays.
5. CONCLUSIONS AND FUTURE RESEARCH
In this paper the definitions of data mining and spatial
data mining have been explained, and it is clear that
spatial data mining is one step at the core of the
knowledge discovery process, dealing with the extraction
of patterns and relationships from large amounts of data.
Spatial data mining is an extension of data mining, and it
is still an emerging field of research.
A list of unexplored and incomplete issues is given here
for future discussion and implementation.
Merging of different techniques. Currently available tools
deploy either a single technique or a limited set of
techniques to carry out data collection and analysis. As
discussed in the sections on issues and methods, there is
no single best technique for all kinds of data analysis.
The problem becomes more complex when we combine
different techniques to get better results.
Compatibility with higher-dimensional data. With time and
the arrival of new applications and technologies, it has
become necessary to work with data of three or more
dimensions and to produce satisfactory results. Designing
such applications is a really difficult job.
Application of methods and algorithms to changing
(dynamic) data. Sometimes it becomes necessary to apply
methods and techniques to dynamic data, i.e. data that
arrive regularly at a defined interval, so that results
must also change according to the new data. Such changing
data may make previously discovered patterns invalid and
introduce new ones instead. There is clearly a need for
incremental or adaptive methods that are able to update
working models.
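As a small illustration of what "incremental" can mean here, the sketch
below keeps a running centroid of 2D points and updates it as each new
observation arrives, without revisiting earlier data. The class name
`RunningCentroid` is an assumption made for illustration only; real
incremental mining methods maintain far richer models than a single
centroid.

```python
class RunningCentroid:
    """Centroid of a stream of 2D points, updated one point at a time."""

    def __init__(self):
        self.count = 0
        self.x = 0.0
        self.y = 0.0

    def update(self, px, py):
        """Fold one new point into the centroid in constant time."""
        self.count += 1
        # Move the current mean towards the new point by 1/count of the gap.
        self.x += (px - self.x) / self.count
        self.y += (py - self.y) / self.count
        return (self.x, self.y)
```

The stored model (a count and two coordinates) stays the same size however
much data has arrived, which is the property that makes such updates
practical for data delivered at regular intervals.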
Non-standard data types. Today's requirement is to
process all kinds of data, such as audio, video, image,
temporal, spatial, and other data types. These data types
contain special patterns which cannot be handled well by
standard analysis methods. Therefore, these applications
require special methods and algorithms.
Figure 1: Data Mining Methods. [Diagram not recoverable
from the extracted text. Labels: Verification, Discovery,
Description, Prediction, Regression, Classification,
Neural Network, Bayesian Network, Decision Trees,
Association Rules, Information Theoretic Networks.]
Figure 2: The organization of the distributed SPDM S/W.
[Diagram not recoverable from the extracted text. Labels:
Site1 SPDM S/W Package, Site2 SPDM S/W Package, Site N-1
SPDM S/W, Site N SPDM S/W Package, Data/Knowledge,
Data/KB N, Central GUI, User.]
Figure 3: Internal architecture of the working of SPDM
S/W. [Diagram not recoverable from the extracted text.
Labels: Model Integration, Modeling, SPDM GUI, Data
Partitioning, Data processing, Data Inspection, Data
Generation and Manipulation, S.Data, NonS Data, Pre. Data
knowledge, Data Partition, User.]
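The combination step of the distributed organization, in which local
classifiers Ci are merged into a global classifier GC by the majority or
weighted majority principle, can be sketched as follows. This is a minimal
illustration under stated assumptions: each local classifier is modelled as
a plain callable, and the function name, the weights, and the tie-breaking
rule are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def make_global_classifier(local_classifiers, weights=None):
    """Combine local classifiers C_i into a global classifier GC.

    Each local classifier is a callable mapping an item to a class label.
    With weights=None every site gets one vote (plain majority); otherwise
    each vote counts with its site's weight (weighted majority).
    """
    if weights is None:
        weights = [1.0] * len(local_classifiers)

    def global_classifier(item):
        votes = defaultdict(float)
        for classify, weight in zip(local_classifiers, weights):
            votes[classify(item)] += weight
        # Ties are broken by label order so that the result is deterministic.
        return max(sorted(votes), key=lambda label: votes[label])

    return global_classifier
```

In a weighted setting the weights could reflect, for instance, how much
local data each site holds or how well its classifier validated. Choosing
them is a design decision the paper leaves open.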
6.0 References
[1] Martin Ester, Hans-Peter Kriegel, Jorg Sander,
“Algorithms and Applications for Spatial Data Mining”,
published in GDM and KD, Research Monographs in GIS,
Taylor and Francis, 2001.
[2] Michael Goebel, Le Gruenwald, “A Survey of Data Mining
and Knowledge Discovery Software Tools”, SIGKDD
Explorations, June 1999, Volume 1, Issue 1, pp. 20-33.
[3] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei
Xu, “A Density-Based Algorithm for Discovering Clusters in
Large Spatial Databases with Noise”, 2nd International
Conference on KDD, Portland, Oregon, pp. 226-231.
[4] Martin Ester, Hans-Peter Kriegel, Jorg Sander,
Alexander Frommelt, “Algorithms for Characterization and
Trend Detection in Spatial Databases”, 4th International
Conference on KDD, New York City, pp. 44-50.
[5] Martin Ester, Hans-Peter Kriegel, Jorg Sander, “Spatial
Data Mining: A Database Approach”, Proc. 5th Int. Symp. on
Large Spatial Databases, pp. 47-66, Berlin, Springer.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei
Xu, “Density-Connected Sets and their Application for Trend
Detection in Spatial Databases”, 3rd Int. Conf. on KDD-97.
[7] Gueting R.H., “An Introduction to Spatial Database
Systems”, Special Issue on Spatial Database Systems of the
VLDB Journal, Vol. 3, No. 4, October 1994.
[8] Md. Rashidul Hasan, Md. Zakir Hossain, Fahim Md.
Chaudahry, Md. Hasan, “Extended Algorithm for Spatial
Characterization and Discrimination Rules”, Proceedings of
the 11th International Conf. on ICCIT 2008, Bangladesh.
[9] Aleksandar Lazarevic, Tim Fiez, “A Software System for
Spatial Data Analysis and Modeling”.
[10] Sample spatial datasets (http::www.apress.com)
[11] IBM Corporation. Intelligent Miner
(http::/www.software.ibm.com)