A Comprehensive Framework for Spatial Association Rule Mining

T.H.D. Dao, J-C. Thill
Department of Geography and Earth Sciences, University of North Carolina at Charlotte
9201 University City Blvd, Charlotte, NC 28223
Email: {tdao; jean-claude.thill}@uncc.edu

1. Introduction

Association rule mining has been extensively applied in market basket transaction analysis and is efficient at finding frequent and meaningful relations, positive associations, and stochastic and asymmetric patterns in large relational data warehouses. Association rule mining has been adapted to spatial analysis by simply including spatial predicates (Koperski and Han 1995). Expressed linguistically, spatial predicates allow flexibility in representing not only explicit spatial relations among objects in terms of distance, direction, and topology, but also implicit spatial dependencies, i.e. spatial autocorrelation, or more generally the spatial structure embedded in the studied phenomenon. However, the dynamics and complexity of the spatial component captured by spatial predicates are often overlooked. Moreover, there exists no comprehensive procedure for generating predicates that capture spatial dependencies. Short of this, adopting association rule mining in spatial analysis is rather problematic. Interestingly, a very similar predicament afflicts spatial regression analysis, where a spatial weight matrix is assigned a priori, without validation on the specific domain of application.

This study aims to remedy this deficiency by introducing a complete geospatial knowledge discovery framework using spatial association rule (SAR) mining algorithms for the detection of spatial patterns. The emphasis is on developing a methodology to incorporate spatial relations and spatial dependencies in the form of spatial predicates. In addition, novel visualization techniques and geographical knowledge-based evaluation schemes are proposed.
The approach is illustrated on crime analysis in the city of Charlotte, NC.

2. Fundamentals

A SAR mining problem is stated as follows. Let D be a spatial database containing unique objects O, and let P be the set of all possible predicates, both non-spatial and spatial, that can be derived from D. Each object O possesses X, a set of predicates in P, if X ⊆ P and X represents attribute-related or spatial-related properties of O. A SAR is an implication of the form X → Y, where X ⊆ P, Y ⊆ P, X ∩ Y = ∅, and at least one predicate in X ∪ Y is a spatial predicate. The rule X → Y holds with confidence c if c% of the objects in D that contain X also contain Y. It has support s if s% of the objects in D contain X ∪ Y. The problem of SAR mining can be decomposed into three steps: extract non-spatial and spatial predicates, find all frequent sets of predicates, and generate strong association rules.

The literature on SAR shows limited effort on the question of how spatial structure is modeled and represented by spatial predicates. Spatial predicates are often restricted to representing spatial relations rather than spatial dependencies. Because the number of spatial relations in large databases is large, much research focuses on developing algorithms to extract them efficiently. For example, Koperski and Han (1995) proposed a top-down progressive refinement method towards spatial query results. Other approaches include pre-computing spatial relations and storing them with inductive logic, as in SPADA (Malerba and Lisi 2001), or partitioning the database to reduce the number of spatial relations to be computed, as in SPIN! (May and Savinov 2003).

Geographical knowledge is incorporated in SAR mining in a limited way. For instance, Bogorny et al. (2010) refer to the concept of “well-known geographic dependence”, defined as the obvious geographical relations such as gasStation-intersects-street, for pruning the input space of spatial predication.
However, this concept is inadequate, as it only refers to spatial relations derived from geo-ontologies, which are assumed to be given. One can find implementations of SAR mining related to specific spatial phenomena such as urban socioeconomic and land cover change (Mennis and Liu 2005), location of convenience stores (Jung and Sun 2006), or crime (Lee and Phillips 2008). Nevertheless, the spatial dimension remains overlooked or is treated in a very simple manner, such as using distances among spatial features or feature layers overlaid within GIS environments.

3. Framework

A framework prefixed by the term “spatial” should be centered on its capability to handle the spatial component of the problem at hand, namely heterogeneity and dependency of the phenomenon under study across space, as well as spatial interactions among features. Moreover, from a knowledge discovery perspective, it is essential to perform an evaluation, especially because association rule mining is well known for producing a large number of rules. In this paper, we propose a geospatial knowledge discovery process using SAR mining, as depicted in Figure 1, aimed at disentangling the above issues by consecutively accomplishing the following tasks:

(a) Identify associative variables
(b) Select and transform attribute information; derive non-spatial predicates
(c) Identify and quantify spatial components involved; derive spatial predicates
(d) Mine spatial association rules
(e) Visualize and evaluate intermediate mined results for interestingness using a geospatial knowledge base; update the geospatial knowledge base

While tasks (a), (b), and (d) are straightforward or well documented, the elements that distinguish this framework are (c) and (e). Regarding task (c), quantifying and representing spatial relations and dependencies is always an issue because semantics and vagueness are often involved.
Popular but modest means to account for spatial dependency, such as global indicators based on simple statistics (e.g. average of differences to the mean) or regular neighbourhood structures, are undesirable compared to robust data-driven methodologies. Regarding task (e), as a large number of rules is generated in text format, it is impossible to evaluate them without visual analytics. The objective of the visualization-evaluation system is the capability to quickly detect strong and useful associations.

Figure 1. A comprehensive framework for SAR mining. [Diagram labels: Spatial Databases; Spatial Data Warehouse (spatial and non-spatial data modeling for data aggregation and user navigation); Geographic Knowledge Base (information selection and transformation; input space pruning; identifying and quantifying spatial components); Spatial/Non-spatial Predication; Frequent Set and Association Rule Generation; Visualization; Evaluation; Intermediate Results; Spatial Associations; Knowledge; Spatial Scientists; Decision Maker]

4. Algorithms and Technical Specification

Several computational algorithms are coupled to implement the framework described above. A Multi-directional Optimum Ecotope-Based Algorithm (AMOEBA) (Aldstadt and Getis 2006) is utilized to search locally and multi-directionally for autocorrelation with irregular structures. Although AMOEBA was designed for areal analysis, it can be deployed on spatial point processes by aggregating points to meaningful areal or network-segment units. Local G statistics are used to test for spatial dependency. For large databases, a parallel implementation is possible to increase the computational efficiency of the search for significant spatial neighbourhoods and dependencies. For predication, a fuzzy-set mapping mechanism is used that transforms quantitative measurements of spatial components into linguistic ones while prioritizing automatic procedures.
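The fuzzy-set mapping described above can be illustrated with a minimal Python sketch. It assumes trapezoidal membership functions over network distance and three linguistic terms. The term names, the breakpoints (in metres), the predicate string format, and the helper names trapezoid, memberships, and to_predicate are hypothetical choices made for illustration; they are not taken from the study.

```python
from typing import Dict, Tuple

# Hypothetical trapezoidal fuzzy sets over network distance (metres).
# Each tuple is (a, b, c, d): membership rises on [a, b], equals 1 on
# [b, c], and falls on [c, d]. Breakpoints are illustrative assumptions.
DISTANCE_TERMS: Dict[str, Tuple[float, float, float, float]] = {
    "close": (0.0, 0.0, 200.0, 400.0),
    "moderate": (200.0, 400.0, 800.0, 1000.0),
    "far": (800.0, 1000.0, float("inf"), float("inf")),
}


def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Degree of membership of x in the trapezoidal fuzzy set (a, b, c, d)."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


def memberships(distance_m: float) -> Dict[str, float]:
    """Membership degree of a distance in every linguistic term."""
    return {
        term: trapezoid(distance_m, *abcd)
        for term, abcd in DISTANCE_TERMS.items()
    }


def to_predicate(feature: str, distance_m: float) -> str:
    """Turn a measured distance into a linguistic spatial predicate.

    The term with the highest membership wins; on a tie, the term listed
    first in DISTANCE_TERMS is chosen.
    """
    degrees = memberships(distance_m)
    best_term = max(degrees, key=degrees.get)
    return f"distance_to({feature})={best_term}"


if __name__ == "__main__":
    for d in (150.0, 600.0, 5000.0):
        print(d, memberships(d), to_predicate("bus_stop", d))
```

Predicates produced this way could populate the single relational table used for frequent-set mining. In an automated setting, the breakpoints would presumably be derived from the data rather than fixed by hand as they are in this sketch.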
An Apriori-based algorithm implemented in the Weka data mining software (Bouckaert et al. 2011) is used to find frequent sets from a single relational table. A multi-dimensional visual analytic system for evaluation is developed as a stand-alone platform, aiming to display the strongest rules and allow interactive sub-group visualization. Libraries of known associations are constructed based on domain theories and ontologies to detect potentially unknown and interesting rules.

5. Case Study

An analysis of crime in the city of Charlotte, NC in 2009 is presented for illustration. American Community Survey five-year estimates for 2005-2009 at the block group level, as well as ancillary data such as business data, bus stops, and park-and-ride facilities, are used as associates. As incidents are recorded at street addresses, it is reasonable to consider the distribution of crime as constrained to the street network. Street segments are thus used as analysis units. They are split where they cross block group boundaries or where their length exceeds a predefined threshold. Different segmentation strategies are tested. Measurements are population-weighted counts of incidents along segments. Network distance is used to estimate distance-based relations.

6. Conclusions

SAR mining has potential for extracting unknown patterns within large spatial databases only if spatial components are well addressed. This study proposes a comprehensive framework and a library of spatial analysis and visual analytics algorithms to resolve this fundamental challenge. The framework is the first attempt at delivering a complete geospatial knowledge discovery framework using spatial association rule mining.

References

Aldstadt J, and Getis A, 2006, Using AMOEBA to create a spatial weights matrix and identify spatial clusters. Geographical Analysis, 38, 327-343.
Bogorny V, Valiati J, and Alvares L, 2010, Semantic-based pruning of redundant and uninteresting frequent geographic patterns. GeoInformatica, 14, 201-220.
Bouckaert R R, Frank E, Hall M, Kirkby R, Reutemann P, Seewald A, and Scuse D, 2011, Weka Manual for Version 3-7-5. The University of Waikato.
Jung C, and Sun C-H, 2006, Development of a GIService based on spatial data mining for location choice of convenience stores in Taipei City. In Geoinformatics 2006: Geospatial Information Technology, eds. H. Wu and Q. Zhu.
Koperski K, and Han J, 1995, Discovery of spatial association rules in geographic information databases. In Advances in Spatial Databases, eds. M. Egenhofer and J. Herring, 47-66. Springer Berlin / Heidelberg.
Lee I, and Phillips P, 2008, Urban Crime Analysis through Areal Categorized Multivariate Associations Mining. Applied Artificial Intelligence, 22, 483-499.
Malerba D, and Lisi F A, 2001, An ILP Method for Spatial Association Rule Mining. In The 1st Workshop on Multi-Relational Data Mining, 18-29.
May M, and Savinov A, 2003, SPIN!–an Enterprise Architecture for Spatial Data Mining. Lecture Notes in Computer Science, 2773, 510-517.
Mennis J, and Liu J W, 2005, Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS, 9, 5-17.