A Comprehensive Framework for Spatial Association Rule Mining
T.H.D. Dao, J-C. Thill
Department of Geography and Earth Sciences, University of North Carolina at Charlotte
9201 University City Blvd, Charlotte, NC 28223
Email: {tdao; jean-claude.thill}
1. Introduction
Association rule mining has been extensively applied in market basket transaction analysis
and is efficient at finding frequent and meaningful relations, positive associations, and
stochastic and asymmetric patterns in large relational data warehouses. Association rule
mining has been adapted to spatial analysis by simply including spatial predicates (Koperski
and Han 1995). With linguistic expressions, spatial predicates allow flexibility in
representing explicit spatial relations of objects in terms of distance, direction, and topology
but also implicit spatial dependencies, i.e. spatial autocorrelation, or more generally the
spatial structure embedded in the studied phenomenon. However, the dynamics and
complexity of the spatial component captured by spatial predicates are often overlooked.
Moreover, there exists no comprehensive procedure for generating predicates that capture
spatial dependencies. Short of this, adopting association rule mining in spatial analysis is
rather problematic. Interestingly, a very similar predicament afflicts spatial regression
analysis with a spatial weight matrix that would be assigned a priori, without validation on
the specific domain of application.
This study aims to remedy this deficiency by introducing a complete geospatial knowledge
discovery framework using spatial association rule (SAR) mining algorithms for the detection
of spatial patterns. The emphasis is on developing a methodology to incorporate spatial
relations and spatial dependencies in the form of spatial predicates. In addition, novel
visualization techniques and geographical knowledge based evaluation schemes are proposed.
The approach is illustrated on crime analysis in the city of Charlotte, NC.
2. Fundamentals
A SAR mining problem is stated as follows:
Let D be a spatial database containing unique objects O and let P be the set of all possible
predicates, both non-spatial and spatial, that can be derived from D. Each object O possesses
X, a set of predicates, if X ⊆ P and X represents attribute-related or spatial-related
properties of O. A SAR is an implication of the form X → Y, where X ⊆ P, Y ⊆ P,
X ∩ Y = ∅, and ∃ p ∈ X ∪ Y such that p is a spatial predicate. The rule X → Y holds with
confidence c if c% of objects in D that contain X also contain Y. It has support s if s% of
objects in D contain X ∪ Y. The problem of SAR mining can be decomposed into three steps:
extract non-spatial/spatial predicates, find all frequent sets of predicates, and generate strong
association rules.
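As a minimal sketch of how support and confidence are computed, the snippet below represents each object as the set of predicates it possesses. The predicate names and the four toy objects are hypothetical and serve only as an illustration; values are returned as fractions rather than percentages.

```python
# Toy illustration only: predicate names and objects are hypothetical.

def support(objects, predicates):
    """Fraction of objects whose predicate set contains every given predicate."""
    return sum(1 for obj in objects if predicates <= obj) / len(objects)


def confidence(objects, x, y):
    """Fraction of the objects containing X that also contain Y."""
    support_x = support(objects, x)
    return support(objects, x | y) / support_x if support_x else 0.0


objects = [
    {"close_to(bus_stop)", "high_crime"},
    {"close_to(bus_stop)", "high_crime", "low_income"},
    {"close_to(bus_stop)"},
    {"low_income"},
]

X = {"close_to(bus_stop)"}  # antecedent (a spatial predicate)
Y = {"high_crime"}          # consequent (a non-spatial predicate)

rule_support = support(objects, X | Y)
rule_confidence = confidence(objects, X, Y)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule would typically be called strong when both values exceed user-chosen minimum thresholds.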
The literature on SAR shows limited efforts on the issues of how spatial structure is modeled
and represented by spatial predicates. Spatial predicates are often restricted to represent
spatial relations rather than spatial dependencies. Due to the large number of spatial relations
in large databases, much research focuses on developing algorithms to efficiently extract
them. For example, Koperski and Han (1995) proposed a top-down progressive refinement
method towards spatial query results. Others include pre-computing spatial relations and
storing them with inductive logic like SPADA (Malerba and Lisi 2001) or partitioning the
database to reduce the number of spatial relations to be computed like SPIN! (May and
Savinov 2003). Geographical knowledge is incorporated in SAR mining in a limited way.
For instance, Bogorny et al. (2010) refer to the concept of “well-known geographic
dependence”, defined as obvious geographical relations such as gasStation-intersects-street,
for pruning the input space of spatial predication. However, this concept is inadequate as it
only refers to spatial relations derived from geo-ontologies, which are assumed to be given.
One can find implementations of SAR mining related to specific spatial phenomena such as
urban socioeconomic and land cover change (Mennis and Liu 2005), location of convenience
stores (Jung and Sun 2006), or crime (Lee and Phillips 2008). Nevertheless, the spatial
dimension remains overlooked or treated in a very simple manner such as using distances
among spatial features, or feature layers overlaid within GIS environments.
3. Framework
A framework prefixed by the term “spatial” should be centered on its capability to handle the
spatial component of the problem at hand, namely heterogeneity and dependency of the
phenomenon under study across space, as well as spatial interactions among features.
Moreover, from a knowledge discovery perspective, it is essential to perform an evaluation,
especially because association rule mining is well-known for producing a significant number
of rules. In this paper, we propose a geospatial knowledge discovery process using SAR
mining, as depicted in Figure 1, aimed at disentangling the above issues by consecutively
accomplishing the following tasks:
(a) Identify associative variables
(b) Select and transform attribute information; derive non-spatial predicates
(c) Identify and quantify spatial components involved; derive spatial predicates
(d) Mine spatial association rules
(e) Visualize and evaluate intermediate mined results for interestingness using geospatial
knowledge base; update geospatial knowledge base.
While tasks (a), (b), and (d) are straightforward or well documented, the elements that
distinguish this framework are (c) and (e). Regarding task (c), it is always an issue of how to
quantify and represent spatial relations and dependencies because semantics and vagueness
are often involved. Popular but modest means to account for spatial dependency such as
global indicators based on simple statistics (e.g. average of differences to the mean) or
regular neighbourhood structures are undesirable compared to robust data-driven
methodologies. Regarding task (e), as a large number of rules is generated in text format,
it is impossible to evaluate them without visual analytics. Capability to quickly detect strong
and useful associations is the objective of the visualization-evaluation system.
[Figure 1: process diagram. Recoverable labels: Spatial Data; Spatial data modeling for data aggregation; Knowledge Base; Information selection; Input space pruning; Identifying & quantifying spatial components; Frequent Set & Association Rule; Intermediate Results; Spatial Associations; Spatial Scientists; Decision Maker]
Figure 1. A comprehensive framework for SAR mining
4. Algorithms and Technical Specification
Several computational algorithms are coupled to implement the framework described above.
A Multi-directional Optimum Ecotope-Based Algorithm (AMOEBA) (Aldstadt and Getis
2006) is utilized to search locally and multi-directionally for autocorrelation with irregular
structures. Although designed for areal analysis, its deployment on spatial point processes is
feasible by point aggregation to meaningful areal or network-segment units. Local G statistics
are used to test for spatial dependency. Parallel computational implementation is possible for
large databases in order to increase the computational efficiency of the algorithm in searching
for significant spatial neighbourhoods and dependencies. A fuzzy-set mapping mechanism
that transforms quantitative to linguistic measurements for spatial components while
prioritizing automatic procedures is used for predication.
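The two ideas above, a local G test followed by a quantitative-to-linguistic mapping, can be sketched as follows. This is not AMOEBA itself: the neighbourhoods are fixed lists rather than grown iteratively, the weights are binary with each unit included in its own neighbourhood (the Gi* variant), and the function names, the breakpoints 1.65 and 2.58, and the three linguistic labels are assumptions made for illustration.

```python
import math


def local_g_star(values, neighbors):
    """Getis-Ord Gi* z-scores with binary weights.

    neighbors[i] lists the indices in unit i's neighbourhood, including i.
    Assumes values are not all equal and no neighbourhood covers every unit.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        k = len(neighbors[i])  # sum of binary weights (and of their squares)
        local_sum = sum(values[j] for j in neighbors[i])
        spread = s * math.sqrt((n * k - k * k) / (n - 1))
        scores.append((local_sum - mean * k) / spread)
    return scores


def ramp(z, low, high):
    """Piecewise-linear membership: 0 below low, 1 above high."""
    return min(1.0, max(0.0, (z - low) / (high - low)))


def fuzzy_label(z, low=1.65, high=2.58):
    """Map a z-score to a linguistic label via fuzzy memberships.

    The breakpoints are assumed values close to common two-sided
    critical values; they are not taken from the paper.
    """
    hot = ramp(z, low, high)
    cold = ramp(-z, low, high)
    memberships = {
        "hot spot": hot,
        "cold spot": cold,
        "not significant": 1.0 - hot - cold,
    }
    return max(memberships, key=memberships.get), memberships


# Hypothetical incident counts on six consecutive units along a line.
counts = [10, 10, 1, 1, 1, 1]
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5]]

scores = local_g_star(counts, neighbors)
labels = [fuzzy_label(z)[0] for z in scores]
for z, label in zip(scores, labels):
    print(f"{z:6.2f}  {label}")
```

The label with the largest membership becomes the spatial predicate attached to the unit; keeping the full membership dictionary would allow graded rather than crisp predicates.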
An Apriori-based algorithm implemented in the Weka data mining software (Bouckaert et al.
2011) is used to find frequent sets from a single relational table. A multi-dimensional visual
analytic system for evaluation is developed as a stand-alone platform, aiming to display the
strongest rules and allow interactive sub-group visualization. Libraries of known associations
are constructed based on domain theories and ontologies to detect potential unknown and
interesting rules.
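For readers unfamiliar with level-wise frequent set mining, the following is a minimal pure-Python sketch of the general idea. It is not the Weka implementation and makes no claim about its internals; the function name, the predicate strings, and the toy table (one row per object) are hypothetical.

```python
from itertools import combinations


def frequent_sets(rows, min_support):
    """Level-wise search for all predicate sets with support >= min_support.

    rows: list of sets of predicates, one per object.
    Returns a dict mapping each frequent frozenset to its support (0..1).
    """
    n = len(rows)
    items = {item for row in rows for item in row}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        # Count how many rows contain each candidate set.
        level = {}
        for cand in candidates:
            supp = sum(1 for row in rows if cand <= row) / n
            if supp >= min_support:
                level[cand] = supp
        frequent.update(level)
        # Join step: merge frequent sets that differ by a single item.
        size += 1
        joined = {a | b for a, b in combinations(list(level), 2)
                  if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            cand for cand in joined
            if all(frozenset(sub) in level
                   for sub in combinations(cand, size - 1))
        }
    return frequent


rows = [
    {"close_to(bus_stop)", "high_crime"},
    {"close_to(bus_stop)", "high_crime", "low_income"},
    {"close_to(bus_stop)"},
    {"low_income"},
]

result = frequent_sets(rows, min_support=0.5)
for predicates, supp in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(predicates), supp)
```

The prune step relies on the fact that a set cannot be frequent unless all of its subsets are, which is what keeps the candidate space manageable.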
5. Case Study
Analysis of crime in the city of Charlotte, NC in 2009 is presented for illustration. American
Community Survey five-year estimates for 2005-2009 at the block group level, as well as
ancillary data such as business data, bus stops, and park-n-ride facilities, are used as
As incidents are recorded at street addresses, it is reasonable to treat the distribution of
crime as constrained to the street network. Street segments are thus used as analysis units.
They are split where they cross block group boundaries or when their length exceeds a
predefined threshold. Different segmentation strategies are tested. Measurements are
population-weighted counts of incidents along segments. Network distance is used to estimate
distance-based relations.
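One simple way to apply a length threshold is to cut an over-long segment into equal pieces, as sketched below. This is only an assumed strategy for illustration; the study tests several segmentation strategies and does not say which rule is used. The function name and the example lengths are hypothetical, and splitting at block group boundaries, which requires geometry, is not covered.

```python
import math


def split_segment(length, max_length):
    """Return (start, end) measures of equal pieces, none longer than max_length."""
    if length <= 0 or max_length <= 0:
        raise ValueError("length and max_length must be positive")
    pieces = math.ceil(length / max_length)  # fewest pieces that respect the threshold
    step = length / pieces
    return [(i * step, (i + 1) * step) for i in range(pieces)]


# A hypothetical 250-unit street segment with a 100-unit threshold.
parts = split_segment(250.0, 100.0)
print(parts)
```

Equal-length pieces avoid leaving a very short remainder at the end of a street, which could otherwise produce unstable counts.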
6. Conclusions
SAR mining has potential for extracting unknown patterns within large spatial databases only
if spatial components are well addressed. This study proposes a comprehensive framework
and library of algorithms of spatial analysis and visual analytics to resolve this fundamental
challenge. The framework is the first attempt at delivering a complete geospatial knowledge
discovery framework using spatial association rule mining.
