ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
IJCST Vol. 4, Issue 2, April - June 2013
Database Primitives, Algorithms and Implementation
Procedure: A Study on Spatial Data Mining
1J Rajanikanth, 2Dr. T.V. Rajinikanth
1Dept. of CSE, S.R.K.R Engg. College, Affiliated to Andhra University, Bhimavaram, AP, India
2Dept. of IT, GRIET, Hyderabad, AP, India
Abstract
Spatial data mining is the extraction of useful knowledge from
large amounts of spatial data. It is a highly demanding field because
huge amounts of spatial data have been collected in various
applications, ranging from remote sensing to geographical
information systems, computer cartography, environmental
assessment and planning. The explosive growth of
collected data poses challenges to the research community in terms
of the ability to analyze it. Recent studies have extended
the scope of data mining from traditional relational and
transactional databases to spatial databases. This study presents
data mining primitives, typical algorithms, profile procedures and
their performance under various situations.
Keywords
Spatial Data Mining, Spatial Data, Geographical Information
System, Data Mining Primitives, Profile Procedures
I. Introduction
The computerization of many business and government transactions
and the advances in scientific data collection tools provide us with
a huge and continuously increasing amount of data. This explosive
growth of databases has far outpaced the human ability to interpret
this data, creating an urgent need for new techniques and tools
that support the researchers in transforming the data into useful
information and knowledge. Knowledge Discovery in Databases
(KDD) has been defined as the non-trivial process of discovering
valid, novel, potentially useful, and ultimately understandable
patterns from data [1-2]. The process of KDD is interactive and
iterative, involving several steps such as the following:
A. Selection
Selecting a subset of all attributes and a subset of all data from
which the knowledge should be discovered.
B. Data Reduction
Using dimensionality reduction or transformation techniques to
reduce the effective number of attributes to be considered.
C. Data Mining
The application of appropriate algorithms that, under acceptable
computational efficiency limitations, produce a particular
enumeration of patterns over the data.
D. Evaluation
Interpreting and evaluating the discovered patterns with respect
to their usefulness in the given application.
Spatial Database Systems (SDBS) [3] are database systems for
the management of spatial data. To find implicit regularities,
rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies, spatial data
mining algorithms are very important [4].
Most existing data mining algorithms run on separate,
specially prepared files, but integrating them with a database
management system (DBMS) has the following advantages.
Redundant storage and potential inconsistencies can be avoided
[5]. Furthermore, commercial database systems offer various
index structures to support different types of database queries.
This functionality can be used without extra implementation effort
to speed up the execution of data mining algorithms (which, in
general, have to perform many database queries). As with the
relational standard language SQL, the use of standard primitives
will speed up the development of new data mining algorithms
and will also make them more portable.
II. Primitives of Spatial Data Mining
A. Rules
Several kinds of rules can be discovered from databases
in general. For example, characteristic rules, discriminant rules,
association rules, or deviation and evaluation rules can be used for
mining [6]. A spatial characteristic rule is a general description of
the spatial data. For example, a rule describing the general price
range of houses in various geographic regions in a city is a spatial
characteristic rule. A discriminant rule is a general description of
the features discriminating or contrasting a class of spatial data
from other classes, like the comparison of price ranges of houses
in different geographical regions. A spatial association rule is
a rule which describes the implication of one set of features
by another set of features in spatial databases. For example, a
rule associating the price range of the houses with nearby spatial
features, like beaches, is a spatial association rule.
B. Thematic Maps
A thematic map is a map primarily designed to show a theme, a single
spatial distribution or a pattern, using a specific map type. These
maps show the distribution of features over limited geographic
areas [6]. Each map defines a partitioning of the area into a set of
closed and disjoint regions, each of which includes all the points with the
same feature value. Thematic maps present the spatial distribution
of a single or a few attributes. This differs from general or reference
maps where the main objective is to present the position of the
object in relation to other spatial objects. Thematic maps may
be used for discovering different rules. For example, we may
want to look at a temperature thematic map while analyzing the
general weather pattern of a geographic region. There are two
ways to represent thematic maps: raster and vector. In the
raster image form, thematic maps have pixels associated with the
attribute values. For example, a map may have the altitude of the
spatial objects coded as the intensity of the pixel (or the color).
In the vector representation, a spatial object is represented by its
geometry, most commonly its boundary representation,
along with the thematic attributes. For example, a park may be
represented by the boundary points and corresponding elevation
values.
w w w. i j c s t. c o m
III. Spatial Data Mining Tasks
As shown in the table below, spatial data mining tasks are generally
an extension of data mining tasks in which spatial data and criteria
are combined [7-9]. These tasks aim to:
• Summarize data,
• Find classification rules,
• Make clusters of similar objects,
• Find associations and dependencies to characterize data,
and
• Detect deviations after looking for general trends.
They are carried out using different methods, some of which
are derived from statistics and others from the field of machine
learning.
Table 1: Comparison Between Statistical and Machine Learning Methods

SDM Tasks              Statistics                      Machine learning
---------------------  ------------------------------  --------------------
Summarization          Global autocorrelation,         Generalization,
                       Density analysis,               Characteristic rules
                       Smooth and contrast analysis,
                       Factorial analysis
Class identification   Spatial classification          Decision trees
Clustering             Point pattern analysis          Geometric clustering
Dependencies           Local autocorrelation,          Association rules
                       Correspondence analysis
Trends and deviations  Kriging                         Trend rules
IV. Spatial Data Summarization
The main goal is to describe data in a global way, which can be
done in several ways. One involves extending statistical methods
such as variance or factorial analysis to spatial structures. Another
entails applying the generalization method to spatial data.
A. Statistical Analysis of Contiguous Objects
1. Global Autocorrelation
The most common way of summarizing a dataset is to apply
elementary statistics, such as the calculation of the average, variance,
etc., and graphic tools like histograms and pie charts. New methods
have been developed for measuring neighborhood dependency
at a global level, such as local variance and local covariance,
and the spatial autocorrelation indices of Geary and Moran [10-12].
These methods are based on the notion of a contiguity matrix
that represents the spatial relationships between objects. It should
be noted that this contiguity can correspond to different spatial
relationships, such as adjacency, a distance gap, and so on.
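
As an illustration of how a contiguity matrix enters such an index, the following is a minimal sketch of the global Moran index, assuming a binary adjacency matrix and a small hypothetical set of zone values (the function name `morans_i` and the sample data are not from the original study). Values near +1 suggest that neighboring zones carry similar values, while values near -1 suggest alternating values.

```python
def morans_i(values, w):
    """Global Moran's I for `values`, given a contiguity matrix `w` (list of lists)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all contiguity weights
    # Cross-products of deviations, weighted by contiguity.
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Hypothetical example: four zones in a row, zone i adjacent to zone i + 1.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], W)       # gradual change across neighbors
alternating = morans_i([1.0, 4.0, 1.0, 4.0], W)  # neighbors always differ
```

Replacing `W` by a distance-based matrix changes the notion of contiguity without changing the computation.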
2. Density Analysis
This method forms part of Exploratory Spatial Data Analysis
(ESDA) which, contrary to the autocorrelation measure, does
not require any prior knowledge about the data. The idea is to estimate the
density by computing the intensity within each small circular window
on the space and then to visualize the point pattern. It can be
described as a graphical method.
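
The circular-window idea can be sketched as follows, assuming the intensity of a window is simply the number of points inside it divided by its area. The grid positions, the radius and the sample points are hypothetical, and the resulting grid would normally be passed to a plotting tool.

```python
import math

def window_density(points, center, radius):
    """Points per unit area inside the circular window around `center`."""
    cx, cy = center
    inside = sum(1 for (x, y) in points if math.hypot(x - cx, y - cy) <= radius)
    return inside / (math.pi * radius ** 2)

def density_grid(points, xs, ys, radius):
    """Evaluate the window density at every (x, y) grid node; rows follow `ys`."""
    return [[window_density(points, (x, y), radius) for x in xs] for y in ys]

# Hypothetical point pattern: three points close together and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.0)]
grid = density_grid(points, xs=[1.0, 4.0], ys=[1.0, 4.0], radius=0.5)
```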
V. Class Identification
This task, also called supervised classification, provides a logical
description that yields the best partitioning of the database.
Classification rules constitute a decision tree where each node
contains a criterion on an attribute. The difference in spatial
databases is that this criterion could be a spatial predicate and,
because spatial objects are dependent on neighborhood, a rule
involving the non-spatial properties of an object should be extended
to neighborhood properties. In spatial statistics, classification has
essentially served to analyze remotely-sensed data, and aims to
identify each pixel with a particular category. Homogeneous
pixels are then aggregated in order to form a geographic entity
[13]. In the spatial database approach [14], classification is seen
as an arrangement of objects using both their properties (non-spatial
values) and their neighbors' properties, not only for direct
neighbors but also for the neighbors of neighbors and so on, up
to degree N. Let us take as an example the classification of areas
by their economic power. Classification rules are described as
follows:
High population ∧ neighbor = road ∧ neighbor of neighbor = airport
=> high economic power (95%).
In GeoMiner, a classification
criterion can also be related to a spatial attribute, in which case
it reflects its inclusion in a wider zone. These zones could be
determined by the algorithm, whether by clustering or by merging
adjacent objects, or it could arise from a predefined spatial
hierarchy. A new algorithm [15] extends this classification method
in GeoMiner to spatial predicates. For example, to determine high
level wholesale profits, a decision factor can be the proximity to
densely populated districts.
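
The neighborhood-based rule quoted above can be sketched in code, assuming the neighborhood relation is available as an adjacency list and that "neighbor of neighbor" means an object at exactly two hops. All object names, the helper functions and the data are hypothetical; the 95% confidence of the rule is not modeled.

```python
from collections import deque

def neighbors_by_degree(adjacency, start, max_degree):
    """Map each object reachable within `max_degree` hops to its hop count."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if degree[current] == max_degree:
            continue  # do not expand beyond the requested degree
        for nxt in adjacency.get(current, ()):
            if nxt not in degree:
                degree[nxt] = degree[current] + 1
                queue.append(nxt)
    del degree[start]
    return degree

def high_economic_power(area, adjacency, kind, population):
    """Rule: high population AND neighbor = road AND neighbor of neighbor = airport."""
    reach = neighbors_by_degree(adjacency, area, 2)
    has_road = any(d == 1 and kind[o] == "road" for o, d in reach.items())
    has_airport = any(d == 2 and kind[o] == "airport" for o, d in reach.items())
    return population[area] == "high" and has_road and has_airport

# Hypothetical data set.
adjacency = {
    "area1": ["road1"], "road1": ["area1", "airport1"], "airport1": ["road1"],
    "area2": ["park1"], "park1": ["area2"],
}
kind = {"area1": "area", "road1": "road", "airport1": "airport",
        "area2": "area", "park1": "park"}
population = {"area1": "high", "area2": "high"}
result_area1 = high_economic_power("area1", adjacency, kind, population)
result_area2 = high_economic_power("area2", adjacency, kind, population)
```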
A. Clustering
This task is an automatic or unsupervised classification that yields a
partition of a given dataset depending on a similarity function.
1. Database Approach
Paradoxically, clustering methods for spatial databases do not
appear to be very revolutionary compared with those applied to
relational databases (automatic classification). The clustering is
performed using a similarity function, which in that setting is treated
as a semantic distance. Hence, in spatial databases it appears
natural to use the Euclidean distance in order to group neighboring
objects. Research studies have focused on the optimization of
algorithms. Geometric clustering generates new classes, such as
the location of houses in terms of residential areas [16]. This
stage is often performed before other data mining tasks, such as
association detection between groups or other geographic entities,
or characterization of a group.
GeoMiner combines geometric clustering applied to a point set
distribution with generalization based on non-spatial attributes.
For example, we may want to characterize groups of major cities
in the United States and see how they are grouped. Cluster results
will be represented by new areas, which correspond to the convex
hull of a group of towns. A few points could stay outside clusters
and represent noise. A description of each group may be generated
for each attribute specified.
Many algorithms have been proposed for performing clustering,
such as CLARANS [17], DBSCAN [18] or STING [19]. They
usually focus on cost optimization. Recently, a method that is more
specifically applicable to spatial data, GDBSCAN, was outlined
in [21]. It applies to any spatial shape, not only to point data,
and incorporates attribute data.
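
To make the density-based idea concrete, here is a simplified sketch in the spirit of DBSCAN, assuming 2D points, Euclidean distance and a brute-force neighborhood search instead of the spatial index a database implementation would use. The parameter names `eps` and `min_pts` and the sample points are illustrative choices, not values from the cited papers.

```python
import math

def region_query(points, idx, eps):
    """Indices of all points within distance `eps` of points[idx] (itself included)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # core point: keep growing the cluster
    return labels

# Hypothetical data: two groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point is labeled as noise, which matches the remark above that a few points can stay outside the clusters.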
2. Statistical Approach
Clustering arises from point pattern analysis [20-21] and was
mainly applied to epidemiological research. This is implemented
in Openshaw's well-known Geographical Analysis Machine
(GAM) and can be tested by using the K-function [22]. Clusters
can also be detected by the ratio of two density estimates:
one of the studied subset and the other of the whole reference
dataset.
(i). Database Approach
Using the process described in [18], which is based on central
place theory, the analysis is performed in four stages. The first
one involves discovering centers by computing local maxima of
particular attributes; in the second, the theoretical trend of these
attributes is determined by moving away from the centers; the third
stage determines the deviations in relation to these trends; and
finally, we explain these trends by analyzing the properties of these
zones. One example is the trend analysis of the unemployment
rate in comparison with the distance to a metropolis like Munich.
Another example is the trend analysis of the development of house
construction.
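
The second and third stages can be sketched as follows, assuming the theoretical trend is a straight line fitted by least squares to an attribute as a function of the distance from a center. The linear model, the function name `fit_line` and the sample values are assumptions made for illustration; the original process may use a different trend model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical observations: distance from the center (km) and unemployment rate (%).
distances = [0.0, 10.0, 20.0, 30.0, 40.0]
rates = [4.0, 5.0, 6.0, 9.0, 8.0]

# Stage 2: theoretical trend when moving away from the center.
slope, intercept = fit_line(distances, rates)

# Stage 3: deviations of each zone from that trend.
deviations = [y - (slope * x + intercept) for x, y in zip(distances, rates)]
largest = max(range(len(deviations)), key=lambda k: deviations[k])
```

Zones with large deviations, such as the one at index `largest`, are the candidates whose properties would be analyzed in the fourth stage.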
(ii). Geo-Statistical Approach
Geo-statistics is a tool used for spatial analysis and for the prediction
of spatiotemporal phenomena. It was first used for geological
applications (the geo prefix comes from geology). Nowadays,
geo-statistics encompasses a class of techniques used to analyze
and predict the unknown values of variables distributed in space
and/or time. These values are supposed to be connected to the
environment. The study of such a correlation is called structural
analysis. The prediction of location values outside the sample is
then performed by the "kriging" technique [23]. It is important
to remember that geo-statistics is limited to point set analysis
or polygonal subdivisions and deals with a single variable or
attribute. Under those conditions, it constitutes a good tool for
spatial and spatiotemporal trend analysis.
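
One common way to carry out the structural analysis step is to compute an empirical semivariogram, which relates the squared difference between sample values to the distance separating them. The sketch below assumes 2D sample locations and fixed-width distance bins; the function name, the bin width `lag` and the sample data are hypothetical, and the kriging step itself is not shown.

```python
import math
from itertools import combinations

def empirical_semivariogram(samples, lag, n_lags):
    """samples: list of (x, y, z). Returns gamma per distance bin (None if empty).

    Bin k collects the pairs whose separation lies in [k * lag, (k + 1) * lag).
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for (x1, y1, z1), (x2, y2, z2) in combinations(samples, 2):
        b = int(math.hypot(x1 - x2, y1 - y2) // lag)
        if b < n_lags:
            sums[b] += (z1 - z2) ** 2
            counts[b] += 1
    # gamma(h) = sum of squared differences / (2 * number of pairs)
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

# Hypothetical samples along a line, with values increasing steadily.
samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0), (3.0, 0.0, 4.0)]
gamma = empirical_semivariogram(samples, lag=1.0, n_lags=4)
```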
VI. Implementation Procedure
The join index was proposed by Valduriez in [24] as a technique
to accelerate joins in a relational database framework. Its
extension to spatial data was proposed by Zeitouni et
al. in [25]. This extension consists of adding a third attribute
representing the spatial relationship between two objects, as shown
in Fig. 1. Each tuple (ID1, Spatial Relationship, ID2) expresses
the existence of a spatial relationship between the pair of spatial
objects identified by ID1 and ID2. This relationship can be
topological or metric. In the case of a topological relationship,
the Spatial Relationship attribute contains a negative code.
Otherwise, it stores the exact distance value. For performance
reasons, the calculation of distances is limited to a given useful
perimeter around objects.
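
A minimal sketch of building such an index is given below, assuming each object is approximated by an axis-aligned bounding box (xmin, ymin, xmax, ymax), that a single negative code stands for "the boxes touch or overlap", and that `max_distance` plays the role of the useful perimeter. The code value, the function names and the objects are hypothetical.

```python
import math
from itertools import combinations

INTERSECTS = -1  # hypothetical negative code for a topological relationship

def box_distance(a, b):
    """Shortest distance between two boxes; 0.0 when they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def build_join_index(objects, max_distance):
    """Return tuples (ID1, Spatial Relationship, ID2) for pairs of nearby objects."""
    index = []
    for (id1, box1), (id2, box2) in combinations(sorted(objects.items()), 2):
        d = box_distance(box1, box2)
        if d == 0.0:
            index.append((id1, INTERSECTS, id2))  # topological: negative code
        elif d <= max_distance:
            index.append((id1, d, id2))           # metric: exact distance
        # pairs farther apart than max_distance are not stored
    return index

objects = {
    "house": (0.0, 0.0, 1.0, 1.0),
    "road": (1.0, 0.0, 5.0, 0.5),
    "school": (4.0, 4.0, 5.0, 5.0),
    "airport": (100.0, 100.0, 110.0, 110.0),
}
index = build_join_index(objects, max_distance=5.0)
```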
A spatial join index has two main advantages: (i) storing the
spatial join index avoids recomputing the spatial relationships
for every application; (ii) the same index allows the join
operation to be optimized for any topological or metric spatial predicate.
Spatial data mining methods can directly use the relational schema
instead of a set of predicates. Indeed, the methods use a target table,
the join index table, and neighbor tables describing other themes.
However, this new data organization is not directly exploitable by
conventional data mining methods because they consider
only one input table with one observation per row. Recently, some
work in relational data mining [26] has addressed
this problem. These approaches are based on inductive logic programming
(ILP). Their drawback is that they require expensive
transformations of the relational data into a set of first-order logic
assertions.
Fig. 1: Spatial Join Index
A. Querying on the Fly the Different Tables
Unlike conventional algorithms, the proposed method takes
three tables as input: the table of objects to analyze, the neighborhood
objects table and the spatial join index table. Whenever the attribute
to analyze is a neighborhood attribute, the algorithm performs two
join operations between the target table, the spatial join index
table and the neighbors table (this is where the existing
algorithms have to be modified). Otherwise, the
conventional algorithm is applied without modification. An example of
an algorithm using this alternative is given in [27].
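
The two join operations can be sketched with an in-memory SQLite database, assuming hypothetical table and column names (`target`, `neighbor`, `sji`) and a distance threshold chosen for the example. Non-negative values of `rel` are read as distances, following the convention that topological relationships use a negative code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target   (id INTEGER PRIMARY KEY, risk TEXT);
    CREATE TABLE neighbor (id INTEGER PRIMARY KEY, kind TEXT);
    CREATE TABLE sji      (id1 INTEGER, rel REAL, id2 INTEGER);
    INSERT INTO target   VALUES (1, 'high'), (2, 'low');
    INSERT INTO neighbor VALUES (10, 'school'), (11, 'bar');
    INSERT INTO sji      VALUES (1, 50.0, 11), (1, 400.0, 10), (2, 80.0, 10);
""")

def neighbor_attribute(conn, max_distance):
    """Fetch (target id, class, neighbor kind) for neighbors within max_distance."""
    return conn.execute(
        """SELECT t.id, t.risk, n.kind
           FROM target t
           JOIN sji j      ON j.id1 = t.id      -- first join: target to index
           JOIN neighbor n ON n.id  = j.id2     -- second join: index to neighbors
           WHERE j.rel >= 0 AND j.rel <= ?
           ORDER BY t.id, n.kind""",
        (max_distance,),
    ).fetchall()

rows = neighbor_attribute(conn, 100.0)
```

A mining algorithm following this alternative would issue such a query each time it evaluates a neighborhood attribute, which is why the number of join queries grows with the analysis.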
B. Join Materialization
This alternative materializes, once and for all, the key joins between
the three tables. This avoids the repeated join queries
of the first alternative. However, these joins duplicate
the analyzed objects, so the existing
data mining algorithms must be modified to take this
duplication into account in the different calculations [27, 28]. For example, the
calculation of the information gain for observations represented
on several rows should count each observation only once. This
alternative has the advantage of being faster than the previous one thanks
to the materialization of the joins.
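
The effect of the duplication on an entropy-based measure can be sketched as follows, assuming the materialized join yields rows of the form (object id, class label, neighbor value). The row layout, the function names and the data are hypothetical; only the class entropy term of the information gain is shown.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def class_entropy_naive(rows):
    """Counts every row, so objects with many neighbors are over-represented."""
    return entropy(Counter(label for _, label, _ in rows).values())

def class_entropy_distinct(rows):
    """Counts each analyzed object once, whatever its number of rows."""
    label_of = {obj: label for obj, label, _ in rows}  # one entry per object
    return entropy(Counter(label_of.values()).values())

# Hypothetical materialized join: object 1 appears on three rows, object 2 on one.
rows = [
    (1, "high", "bar"),
    (1, "high", "school"),
    (1, "high", "road"),
    (2, "low", "school"),
]
naive = class_entropy_naive(rows)
distinct = class_entropy_distinct(rows)
```

With one object per class the corrected entropy is 1 bit, while the naive count is biased toward the class of the duplicated object.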
C. Reorganizing the Data
This alternative reorganizes the data into a single table that combines the three
tables without duplicating the analyzed objects. The idea here is
to complete, rather than join, the target table with the data present in
the other tables [27-28]. The principle is to generate, for each attribute
value of the linked table, an attribute in the result table.
Applying this operator beforehand avoids
the duplication of the analyzed objects and allows the use of
any data mining method without modification.
D. Examples
Here, table I is a weighted correspondence table that links the
target table R with the table V, which contains the other dimension
considered. This operator is only recommended when the attributes
Bi of V do not contain many distinct values. The main property of
the COMPLETE operator is that the result always includes, as its left
part, all objects of R without duplication, completed
in its right part by the weights of the "dimension" coming
from V and I. This case usually arises in relational data
where the correspondence table represents links with N-M
cardinalities. This operator can be seen as a means of preparing relational
data for data mining [27-30]. Note that for the same
tuple of R and the same value bij of an attribute of V, there can be
several links in I, possibly with different weights.
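
The behavior described for the COMPLETE operator can be sketched as below, assuming V carries a single attribute and that several weights for the same object and attribute value are combined with an aggregate function. The choice of `min` (for example, the shortest distance), the function signature and the data are assumptions made for illustration, since the text does not state how multiple links are combined.

```python
def complete(R, V, I, aggregate=min):
    """Complete each object of R with one column per distinct value found in V.

    R: {r_id: {attribute: value}}   target table
    V: {v_id: b_value}              linked table with one attribute B
    I: [(r_id, weight, v_id)]       weighted correspondence table
    """
    columns = sorted(set(V.values()))
    weights = {}
    for r_id, weight, v_id in I:
        weights.setdefault((r_id, V[v_id]), []).append(weight)
    result = {}
    for r_id, attrs in R.items():
        row = dict(attrs)  # left part: the object itself, exactly once
        for b in columns:  # right part: one weight per value of B
            linked = weights.get((r_id, b))
            row[b] = aggregate(linked) if linked else None
        result[r_id] = row
    return result

# Hypothetical tables.
R = {1: {"risk": "high"}, 2: {"risk": "low"}}
V = {10: "school", 11: "bar", 12: "school"}
I = [(1, 50.0, 11), (1, 400.0, 10), (1, 120.0, 12), (2, 80.0, 10)]
completed = complete(R, V, I)
```

Each object of R appears on exactly one row of the result, so a conventional single-table algorithm can be applied to it directly.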
Fig. 2: Example 1
Fig. 3: Example 2
VII. Benchmark Results
The tests aim to compare the performance of each alternative.
We kept three criteria: the size of the target table, the size of the
linked table (the neighbors table in the SDM case) and the size of the
correspondence table (the spatial join index in the SDM case). Fig.
4 gives the execution time of the three algorithms according to
the size of the target table [27-30].
Table 2: Performance Based Comparisons of Target Table, Linked Table and Correspondence Table

A: Size of R (objects)
B: Size of I (objects)
C: Size of V (objects)
D: Total time of the 1st alternative (seconds)
E: Total time of the 2nd alternative (seconds)
F: Execution time of the first step, 2nd alternative (seconds)
G: Execution time of the second step, 2nd alternative (seconds)
H: Total time of the 3rd alternative (seconds)
I: Execution time of the first step, 3rd alternative (seconds)
J: Execution time of the second step, 3rd alternative (seconds)

    A      B    C      D    E   F   G     H     I   J
  122    147   37    240    3   1   2     3     1   2
  204    148   52    300    4   1   3     3     1   2
 1594   6330  869  16860    7   1   6     9     5   4
 3437  20180  869  24799    8   1   7    76    71   5
 8668  20180  869  35205   18   2  16   157   150   7
15574  27054  869  48403   19   2  17   640   626  14
21892  53631  869  70020   56   2  54   925   882  43
29810  74302  869  88320   32   2  30  1372  1330  42
Fig. 4: Comparative Study
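
The reported timings can be checked and compared with a few lines of arithmetic. The sketch below copies columns D to J of Table 2 as Python lists (the list names simply reuse the column letters), verifies that each total equals the sum of its two steps, and computes how many times faster the 2nd alternative is than the 1st on each row.

```python
# Columns of Table 2 (seconds), one entry per benchmark row.
D = [240, 300, 16860, 24799, 35205, 48403, 70020, 88320]  # 1st alternative, total
E = [3, 4, 7, 8, 18, 19, 56, 32]                          # 2nd alternative, total
F = [1, 1, 1, 1, 2, 2, 2, 2]                              # 2nd alternative, step 1
G = [2, 3, 6, 7, 16, 17, 54, 30]                          # 2nd alternative, step 2
H = [3, 3, 9, 76, 157, 640, 925, 1372]                    # 3rd alternative, total
I = [1, 1, 5, 71, 150, 626, 882, 1330]                    # 3rd alternative, step 1
J = [2, 2, 4, 5, 7, 14, 43, 42]                           # 3rd alternative, step 2

# Consistency check: every total is the sum of its two steps.
steps_add_up = (all(e == f + g for e, f, g in zip(E, F, G))
                and all(h == i + j for h, i, j in zip(H, I, J)))

# Ratio of the 1st alternative's time to the 2nd alternative's time, per row.
speedup_2nd_vs_1st = [d / e for d, e in zip(D, E)]
```

On the largest data set the ratio is 88320 / 32, which illustrates how costly the repeated join queries of the first alternative become.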
VIII. Conclusion
We have shown that spatial data mining is a promising field of
research with wide applications in GIS, medical imaging, robot
motion planning, etc. Although the field is quite young, a number
of algorithms and techniques have been proposed to discover
various kinds of knowledge from spatial data. This leads to future
directions and suggestions for the spatial data mining field in
general. Finally, this study will help researchers to develop
better techniques in the field of spatial data mining.
References
[1] Fayyad U. M., Piatetsky-Shapiro G., Smyth P., "From
Data Mining to Knowledge Discovery: An Overview", In:
Advances in Knowledge Discovery and Data Mining, AAAI
Press, Menlo Park, 1996, pp. 1-34.
[2] S. Shekhar, P. Zhang, Y. Huang, R. Vatsavai, "Trends in Spatial
Data Mining", as a chapter to appear in Data Mining: Next
Generation Challenges and Future Directions, H. Kargupta,
A. Joshi, K. Sivakumar, and Y. Yesha (eds.), AAAI/MIT Press,
2003.
[3] Gueting R. H., "An Introduction to Spatial Database Systems",
Special Issue on Spatial Database Systems of the VLDB
Journal, Vol. 3, No. 4, October 1994.
[4] Koperski K., Adhikary J., Han J., "Knowledge Discovery in
Spatial Databases: Progress and Challenges", Proc. SIGMOD
Workshop on Research Issues in Data Mining and Knowledge
Discovery, Technical Report 96-08, University of British
Columbia, Vancouver, Canada, 1996.
[5] Zhang, P., Steinbach, M., Kumar, V., Shekhar, S., Tan,
P., Klooster, S., Potter, C., "Discovery of Patterns of Earth
Science Data Using Data Mining", as a chapter to appear in
Next Generation of Data Mining Applications, Mehmed M.
Kantardzic and Jozef Zurada (editors), IEEE Press, 2004.
[6] Hemalatha M., Naga Saranya N., "A Recent Survey
on Knowledge Discovery in Spatial Data Mining", IJCI
International Journal of Computer Science, Vol. 8, Issue 3,
No. 2, May 2011.
[7] M. Ester, H. P. Kriegel, J. Sander, X. Xu, "A density-based
algorithm for discovering clusters in large spatial databases
with noise", Proc. 2nd Int. Conf. Knowledge Discovery and
Data Mining (KDD-96), pp. 226-231, Portland, OR, USA,
August 1996.
[8] R. Ng, J. Han, "Efficient and effective clustering method
for spatial data mining", Proc. 1994 Int. Conf. Very Large
Databases, pp. 144-155, Santiago, Chile, September 1994.
[9] E. M. Knorr, R. Ng, "Extraction of spatial proximity patterns
by concept generalization", Proc. 2nd Int. Conf. Knowledge
Discovery and Data Mining (KDD-96), pp. 347-350, Portland,
OR, USA, August 1996.
[10] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi, "Modeling Spatial
Dependencies for Mining Geospatial Data: An Introduction",
as Chapter of Geographic Data Mining and Knowledge
Discovery, Harvey J. Miller and Jiawei Han (eds.), Taylor
and Francis, 2001, ISBN 0-415-23369-0.
[11] S. Shekhar, C.T. Lu, P. Zhang, "A Unified Approach to Spatial
Outliers Detection", to appear in Geoinformatica, 7(2), June
2003.
[12] S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based
Spatial Outliers: Algorithms and Applications", Proceedings
of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, San Francisco,
CA, 2001.
[13] Longley P. A., Goodchild M. F., Maguire D. J., Rhind D. W.,
Geographical Information Systems - Principles and Technical
Issues, John Wiley & Sons, Inc., Second Edition, 1999.
[14] Ester, M., Kriegel, H.-P., Sander, J., "Spatial Data Mining: A
Database Approach", Proc. 5th Symp. on Spatial Databases,
Berlin, Germany, 1997.
[15] Koperski, K., Han, J., Stefanovic, N., "An Efficient Two-Step
Method for Classification of Spatial Data", In Proc.
International Symposium on Spatial Data Handling (SDH'98),
pp. 45-54, Vancouver, Canada, July 1998.
[16] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi, "Extending Data
Mining for Spatial Applications: A Case Study in Predicting
Nest Locations", Proc. of the 2000 ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge
Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
[17] Ng, R., Han, J., "Efficient and Effective Clustering Method
for Spatial Data Mining", in Proc. of 1994 Int'l Conf. on Very
Large Data Bases (VLDB'94), Santiago, Chile, September
1994, pp. 144-155.
[18] Ester, M., Kriegel, H.-P., Sander, J., Xu, X., "Density-Connected
Sets and their Application for Trend Detection
in Spatial Databases", Proc. 3rd Int. Conf. on Knowledge
Discovery and Data Mining, Newport Beach, CA, 1997, pp.
10-15.
[19] Wang, W., Yang, J., Muntz, R., "STING: A Statistical
Information Grid Approach to Spatial Data Mining", Technical
Report CSD-97006, Computer Science Department,
University of California, Los Angeles, February 1997.
[20] Openshaw S., Charlton M., Wymer C., Craft A., "A mark
1 geographical analysis machine for the automated analysis
of point data sets", International Journal of Geographical
Information Systems, Vol. 1, n° 4, pp. 335-358, 1987.
[21] Fotheringham S., Zhan B., "A comparison of three exploratory
methods for cluster detection in spatial point patterns",
Geographical Analysis, Vol. 28, n° 3, pp. 200-218,
1996.
[22] Diggle P.J., "Point process modeling in environmental
epidemiology", In Barnett V., Turkman K. (eds) Statistics
for the environment, Chichester, John Wiley & Sons, pp.
89-110, 1993.
[23] Clark I., "Practical geostatistics", Applied Science Publisher,
Reprinted 1987. Also at [Online] Available:
http://curie.ej.jrc.it/faq/introduction.html
[24] Valduriez P., "Join indices", ACM Transactions on Database
Systems, 12(2), pp. 218-246, 1987.
[25] Zeitouni K., Yeh L., Aufaure M-A., "Join indices as a tool for
spatial data mining", Int. Workshop on Temporal, Spatial and
Spatio-Temporal Data Mining, Lecture Notes in Artificial
Intelligence n° 2007, Springer, pp. 102-114, Lyon, France,
September 12-16, 2000.
[26] Dzeroski S., Lavrac N., "Relational Data Mining", Springer,
2001.
[27] Chelghoum N., Zeitouni K., "A decision tree for multi-layered
spatial data", In: Joint International Symposium on Geospatial
Theory, Processing and Applications, Ottawa, Canada, July
8-12, 2002.
[28] Shekhar S., Chawla S., "Spatial Databases: A Tour", Prentice
Hall, pp. 237, 2003.
[29] Breiman L., Friedman J.H., Olshen R.A., Stone C.J.,
"Classification and Regression Trees", Wadsworth &
Brooks, Monterey, California, 1984.
[30] Koperski K., Han J., Stefanovic N., "An Efficient Two-Step
Method for Classification of Spatial Data", In Proceedings
of International Symposium on Spatial Data Handling
(SDH'98), pp. 45-54, Vancouver, Canada, July 1998.
International Journal of Computer Science And Technology 655