Spatial association analysis: A literature review
Group 8: Daniel Hess, Yun Zhang
Introduction
The explosive growth of geographically referenced data calls for efficient discovery
of spatial knowledge. Spatial association analysis is a typical data mining approach
for discovering such knowledge. Association rules are patterns of the form X→Y, where
pattern Y is likely to occur when pattern X occurs. One of the most famous patterns,
Diapers → Beer, is a typical association rule. Spatial association analysis
adds spatial features to association analysis, which distinguishes it from traditional
association analysis. Previous research on spatial association analysis mainly focuses
on collocation pattern mining. Collocation is the presence of two or more spatial
objects at the same location or close to each other. Spatial collocation patterns
associate the co-existence of non-spatial features in a spatial neighborhood. They have
many applications, such as ecology, medical science, and crime data analysis. A
typical example of a spatial collocation pattern is birds and houses, which tend to
occur together. The figure below gives a graphical example of collocation patterns.
Figure 1: Sample collocation patterns [7]
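As a minimal sketch of how a rule X→Y such as Diapers → Beer is usually scored, the
Python below computes the standard support and confidence of a rule over a list of
transactions. The function names and the basket data are hypothetical and were invented
for illustration; they do not come from any of the surveyed papers.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate of P(Y | X), computed as support(X and Y) / support(X)."""
    sx = support(transactions, x)
    if sx == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / sx


# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

Spatial data has no natural notion of a transaction, which is why the approaches
surveyed below either define transactions over space or replace support with a
spatial prevalence measure.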
Classification One
Spatial collocation pattern mining approaches can be divided into two categories,
namely the statistical approach and the data mining approach. The data mining approach
can be further divided into the clustering based approach, the association rule based
approach and the classification based approach.
The statistical approach is often based on spatial correlation. The key idea of spatial
correlation is that whatever causes an observation in one location also causes
similar observations in nearby locations. Existing approaches such as the cross-K
function [9] and regression models [8] belong to this category. The problem with the
statistical approach is that its computational cost is often high because of the large
volume of data. Thus the data mining based approach is currently more popular. Our
survey focuses on the data mining based approach.
[Figure: tree diagram. Spatial association analysis splits into statistic and data
mining; data mining splits into classification based, association rule based, and
clustering based.]
Figure 2: Classification One
Classification based approach
The classification based approach uses distance or topological measures to identify
spatial relationships between objects. Based on a spatial relationship such as Euclidean
distance, network distance or adjacency, the objects are classified into
different patterns. A method such as thresholding can then be applied to find
collocation patterns.
Y. Morimoto [1] uses a classification based approach and defines a distance based
pattern named k-neighboring class sets. In that paper, the number of instances of each
pattern is used as the prevalence measure. A non-overlapping-instance constraint is
introduced to obtain the anti-monotone property. The limitation is that the
author assumes instances are non-overlapping, which is not always
true in real applications. Thus the anti-monotone property of the approach is not
guaranteed.
Yan Huang et al. [2] also propose an approach that belongs to this category. It
introduces a new prevalence measure called the participation index, which has the
desired anti-monotone property. Based on this interest measure, the paper proposes an
event centric model that removes the non-overlapping constraint and finds collocation
patterns using threshold based pruning. These are the major contributions of that
paper. However, its limitation is that the approach applies only to point data, so
collocation of line and region objects cannot be addressed.
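To make the participation index concrete, here is a minimal sketch for a pattern of
two features. It assumes the usual definition: the participation ratio of a feature is
the fraction of its instances that have at least one neighbor of the other feature, and
the participation index is the minimum of the ratios. The function name, the Euclidean
neighbor relation, the coordinates and the 0.5 threshold are illustrative assumptions,
not the implementation from [2].

```python
from math import dist


def participation_index(points_a, points_b, radius):
    """Participation index of the size-2 pattern {A, B}.

    Two instances are neighbors when their Euclidean distance is <= radius.
    """
    if not points_a or not points_b:
        return 0.0
    # Instances of each feature that take part in at least one A-B neighbor pair.
    a_in = {i for i, a in enumerate(points_a)
            if any(dist(a, b) <= radius for b in points_b)}
    b_in = {j for j, b in enumerate(points_b)
            if any(dist(a, b) <= radius for a in points_a)}
    ratio_a = len(a_in) / len(points_a)  # participation ratio of A
    ratio_b = len(b_in) / len(points_b)  # participation ratio of B
    return min(ratio_a, ratio_b)


# Hypothetical coordinates, invented for illustration.
birds = [(0, 0), (5, 5), (9, 9)]
houses = [(0, 1), (5, 6)]
pi = participation_index(birds, houses, radius=1.5)
is_prevalent = pi >= 0.5  # 0.5 is an arbitrary example threshold
print(pi, is_prevalent)
```

Because the index takes the minimum over features, adding a feature to a pattern can
only keep the index the same or lower it, which is what makes threshold pruning safe.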
Hui Xiong et al. [6] extend the event centric model. This paper generalizes the concept
of collocation pattern to extended spatial objects, which include points, lines, and
regions. It provides a more general algorithm by introducing the notion of a buffer,
which is a zone within a specified distance of a fixed spatial object. This is the major
contribution. A possible improvement is to extend the algorithm to include
time intervals, making it applicable to spatio-temporal collocation pattern
mining.
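The buffer idea can be illustrated with a small geometric test. The sketch below checks
whether a point lies inside the buffer of a line segment, that is, within a given
distance of it. It only illustrates the notion of a buffer and is not the EXCOM
algorithm of [6]; the function names and the sample road segment are hypothetical.

```python
from math import hypot


def point_segment_distance(p, a, b):
    """Shortest Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def in_buffer(p, a, b, buffer_distance):
    """True if point p falls inside the buffer around segment a-b."""
    return point_segment_distance(p, a, b) <= buffer_distance


# Hypothetical road segment and points, invented for illustration.
road = ((0, 0), (10, 0))
near = in_buffer((5, 2), *road, buffer_distance=3)
far = in_buffer((12, 4), *road, buffer_distance=3)
print(near, far)
```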
Clustering based approach
The clustering based approach treats every spatial attribute as a map layer and
considers spatial regions in each layer as candidates for mining collocation patterns.
For instance, given sets of layers X and Y, where X and Y are disjoint, the association
rule is written as X→Y (CS, CC%). CS is the support value, defined as the ratio of the
area of the clusters that satisfy both X and Y. CC% can be viewed as the confidence
value, defined in terms of the intersection area of X and Y. Given these concepts and
notation, the clustering based approach can apply various methods, such as traditional
data mining algorithms, to find collocation patterns.
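A rough sketch of the area based measures follows. The text above does not spell out
the denominators, so this example makes two explicit assumptions: support is the
overlap area divided by the total study area, and confidence is the overlap area
divided by the area of X. Regions are represented as sets of equal-sized grid cells so
that area reduces to a cell count. All names and data are hypothetical.

```python
def clustered_support(region_x, region_y, total_cells):
    """Share of the study area covered by both X and Y (assumed definition)."""
    return len(region_x & region_y) / total_cells


def clustered_confidence(region_x, region_y):
    """Share of X's area that is also covered by Y (assumed definition)."""
    if not region_x:
        return 0.0
    return len(region_x & region_y) / len(region_x)


# Hypothetical 4x4 study area; each layer's clusters are sets of (row, col) cells.
total_cells = 16
layer_x = {(0, 0), (0, 1), (1, 0), (1, 1)}
layer_y = {(1, 1), (1, 2), (0, 1)}

cs = clustered_support(layer_x, layer_y, total_cells)
cc = clustered_confidence(layer_x, layer_y)
print(f"X -> Y ({cs:.3f}, {cc:.0%})")
```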
Ding, W. et al. [3] use a clustering based approach. The major contribution of this
paper is that it focuses on regional collocation patterns, while previous literature
studied only global collocation patterns. A reward-based regional discovery methodology
is introduced, and a new divisive, grid-based supervised clustering algorithm is
presented to identify interesting sub-regions. The interesting sub-regions are then
merged into the largest possible regions. The figure below illustrates how it works.
However, this approach has a limitation: it assumes that a region is contiguous, meaning
that for each pair of objects belonging to the same region there must always be a
path within the region that connects them. In real life this assumption does not
always hold, which limits the applicability of the approach.
Figure 3: regional association analysis
Zhang, X. et al. [4] focus on the performance of collocation pattern mining algorithms.
They aim to design an algorithm that speeds up the mining process while
minimizing the storage cost. To this end, the paper proposes a framework that
combines the discovery of spatial neighborhoods with the mining process. This
combination reduces the computational cost and is the major contribution of the
paper. Its limitation is similar to that of paper [2]: the approach is designed only
for point data, which limits its scope.
Association rule based approach
The association rule based approach often uses transactions. Work in this category
focuses on defining transactions over space so that an algorithm similar to Apriori can
be applied. Thus the algorithms used in this category are often derived from Apriori.
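The idea of defining transactions over space and then running an Apriori-like search
can be sketched as follows. Here each non-empty grid cell becomes one transaction, which
is only one possible way to define spatial transactions and is an assumption of this
example, not the method of any surveyed paper. The function names, the cell size and
the sample objects are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def grid_transactions(objects, cell_size):
    """Group feature types by grid cell; each non-empty cell is one transaction."""
    cells = defaultdict(set)
    for feature, x, y in objects:
        cells[(int(x // cell_size), int(y // cell_size))].add(feature)
    return list(cells.values())


def frequent_itemsets(transactions, min_support, max_size=2):
    """Level-wise (Apriori-style) search for itemsets up to max_size."""
    n = len(transactions)
    items = sorted({f for t in transactions for f in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    for size in range(1, max_size + 1):
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent sets, keep those of size + 1
        # whose every subset of the current size is itself frequent.
        survivors = list(level)
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == size + 1
                        and all(frozenset(s) in level
                                for s in combinations(a | b, size))})
    return frequent


# Hypothetical (feature, x, y) records, invented for illustration.
objects = [("bird", 1, 1), ("house", 2, 3), ("bird", 11, 2),
           ("house", 13, 4), ("tree", 12, 1), ("bird", 25, 25)]
transactions = grid_transactions(objects, cell_size=10)
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in result.items():
    print(sorted(itemset), round(sup, 3))
```

One known weakness of a fixed grid is that two objects that are close together but fall
on opposite sides of a cell boundary end up in different transactions.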
Celik, M. et al. [5] propose an association rule based approach. However, the major
contribution of this paper is not the association rule algorithm but the dynamic
parameters it handles. The paper allows users to change interest measure values
according to their preference. To achieve this, it designs an indexing structure for
co-location patterns and proposes an algorithm that discovers zonal collocation patterns
efficiently under dynamic parameters. A possible improvement is to extend
the approach to global collocation pattern mining. In addition, when considering the
efficiency of the pattern mining process, the storage cost should also be taken into
account.
Paper / approach used       Classification   Association   Clustering   Statistic
                            based            rule based    based
Collocation miner [2]       ×                -             -            -
SCMRG [3]                   -                -             ×            -
Fast mining algorithm [4]   -                -             ×            -
Zoloc Miner [5]             -                ×             -            -
Apriori-Gen [1]             ×                -             -            -
EXCOM [6]                   ×                -             -            -
Regression model [8]        -                -             -            ×
Table 1: Classification of papers
Classification Two
A second classification scheme that can be applied to the papers is based on data types.
The data types used in the experiments and validation of the algorithms
in the papers fall into one of three categories: 1. real-world data; 2. synthetic
data; 3. a combination of real-world and synthetic data.
Real-World Data
The author of Paper [1] uses real-world data to test the algorithm, namely a digital
road map dataset, and evaluates the accuracy of this dataset.
Paper [2] also falls into this category of using real-world data. The authors of this
paper used data about arsenic concentrations in Texas obtained from the Texas
Ground Water Database (GWDB). The authors have sorted through and cleaned this
data. This was necessary because the data goes back 25 years, and different data
collection procedures have been used over those 25 years, leading to missing,
inconsistent and repeated values in the data.
The other paper in this category, among those included in the survey, is
Paper [5]. The authors of this paper used real-world data from a telephone directory
database. They identified the X and Y coordinate values for each record and
created a database of point records. The algorithm was then tested on
this data to measure the time taken to compute the co-location patterns.
Synthetic Data
The authors of Paper [6] use synthetic datasets to validate their proposed
algorithm. These datasets, created using a spatial data generator, are used to
evaluate the behavior of the naïve approach and of their
Zoloc-Miner algorithm, specifically the computation time, under the following
effects: 1. number of zones; 2. size of a zone; 3. amount of overlap in a zone.
Combination
Paper [3] falls under this category. The authors use real-world NASA
climate datasets, which contain monthly measurements of numeric climate
variables, for example precipitation and sea surface temperature, over a period of
twelve years. The performance of the proposed algorithm is tested on the
real-world dataset and compared against its performance on a synthetic dataset.
The synthetic dataset was generated to allow better control
when studying the effects of parameters.
The other paper in this survey that uses the combination approach is Paper [4]. The
authors use real-world data from the Digital Chart of the World (DCW). Eight
layers of the state of Minnesota are included in this real-world data: Drainage,
Drainage Supplemental, Hypsography, Hypsography Supplemental, Population places,
Aeronautical, Land Cover and Cultural Landmarks. In this paper, both the
real-world and synthetic data were used to compare the running time of the authors'
proposed algorithm against that of previous algorithms.
Paper / data type   [1]   [2]   [3]   [4]   [5]   [6]
Real-World           X     X     -     -     X     -
Synthetic            -     -     -     -     -     X
Combination          -     -     X     X     -     -
Table 2: classification based on data types
Alternative classification scheme
Another possible classification is based on the scale of the spatial association
relationships the papers deal with. Specifically, some of them focus on global spatial
collocation patterns, while others look for regional spatial collocation patterns.
The classification can be represented as follows:
[Figure: tree diagram. Spatial association analysis splits into regional pattern and
global pattern.]
Figure 4: classification based on scale
The papers can therefore be classified as follows:
Paper / scale focused       Global pattern   Regional pattern
Collocation miner [2]       ×                -
SCMRG [3]                   -                ×
Fast mining algorithm [4]   ×                -
Zoloc Miner [5]             -                ×
Apriori-Gen [1]             ×                -
EXCOM [6]                   ×                -
Table 3: classification of papers based on scale
References:
[1] Y. Morimoto. Mining Frequent Neighboring Class Sets in Spatial Databases. In
Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2001.
[2] Y. Huang, S. Shekhar, and H. Xiong. Discovering Co-location Patterns
from Spatial Datasets: A General Approach. IEEE Transactions on Knowledge and
Data Engineering (TKDE), 16(12), pp. 1472-1485, December 2004.
[3] W. Ding, C. F. Eick, J. Wang, and X. Yuan. A Framework for Regional
Association Rule Mining in Spatial Datasets. In Proc. Sixth International
Conference on Data Mining (ICDM), December 18-22, 2006. IEEE Computer
Society, Washington, DC, pp. 851-856.
[4] X. Zhang, N. Mamoulis, D. W. Cheung, and Y. Shou. Fast Mining of
Spatial Collocations. In Proc. Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD '04), Seattle, WA, USA,
August 22-25, 2004. ACM, New York, NY, pp. 384-393.
[5] M. Celik, J. M. Kang, and S. Shekhar. Zonal Co-location Pattern
Discovery with Dynamic Parameters. In Proc. Seventh IEEE
International Conference on Data Mining (ICDM), October 28-31, 2007. IEEE
Computer Society, Washington, DC, pp. 433-438.
[6] H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo.
A Framework for Discovering Co-location Patterns in Data Sets with Extended
Spatial Objects. In Proc. SIAM International Conference on Data Mining (SDM),
Florida, USA, 2004.
[7] S. Shekhar and S. Chawla. Spatial Databases: A Tour. Prentice Hall, 2003.
[8] Y. Chou. Exploring Spatial Analysis in Geographic Information System. Onward
Press, ISBN: 1566901197, 1997.
[9] N. A. C. Cressie. Statistics for Spatial Data. Wiley and Sons, ISBN: 0471843369,
1991.