Survey
Spatial association analysis: A literature review
Group 8: Daniel Hess, Yun Zhang

Introduction

The explosion of geographically referenced data calls for efficient discovery of spatial knowledge. Spatial association analysis is a typical data mining approach for discovering such knowledge. Association rules are patterns of the form X→Y, where pattern Y is likely to occur when pattern X occurs. One of the most famous patterns, Diapers → Beer, is a typical example of an association rule. Spatial association analysis adds spatial features to association analysis, which distinguishes it from traditional association analysis.

Previous research on spatial association analysis mainly focuses on collocation pattern mining. Collocation is the presence of two or more spatial objects at the same location or close to each other. Spatial collocation patterns associate the co-existence of non-spatial features in a spatial neighborhood. They have applications in many fields, such as ecology, medical science and crime data analysis. A typical example of a spatial collocation pattern is bird and house: birds and houses tend to appear together. Figure 1 gives a graphical example of collocation patterns.

Figure 1: Sample collocation patterns [7]

Classification One

Spatial collocation pattern mining approaches can be divided into two categories: the statistical approach and the data mining approach. The data mining approach can be further divided into the clustering based approach, the association rule based approach and the classification based approach.

The statistical approach is often based on spatial correlation. The key idea of spatial correlation is that whatever causes an observation in one location also causes similar observations in nearby locations. Existing approaches such as the cross-K function [9] and regression models [8] belong to this category. The problem with the statistical approach is that its computational cost is often high because of the large volume of data.
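
Returning to the rule form X→Y defined in the introduction, the following minimal Python sketch shows one common way to quantify how likely Y is to occur when X occurs, using the standard support and confidence measures of association rule mining. The basket contents and the function names `support` and `confidence` are hypothetical and chosen only for illustration; they are not taken from any of the surveyed papers.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Estimate of P(Y | X), computed as support(X and Y) / support(X)."""
    x_items, y_items = frozenset(x), frozenset(y)
    support_x = support(transactions, x_items)
    if support_x == 0.0:
        return 0.0
    return support(transactions, x_items | y_items) / support_x


# Hypothetical toy baskets (illustrative data, not from any cited paper).
baskets = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

In this toy data, two of the four baskets contain both items and two of the three diaper baskets also contain beer, so the rule Diapers → Beer has support 0.5 and confidence 2/3. Spatial association analysis has no natural notion of a basket, which is why the approaches surveyed here must first define neighborhoods or transactions over space.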
Because of this computational cost, data mining based approaches are now more popular, and our survey focuses on them.

Figure 2: Classification One (spatial association analysis divides into the statistical approach and the data mining approach; the data mining approach divides into classification based, association rule based and clustering based approaches)

Classification based approach

The classification based approach uses a distance or topological measure to identify the spatial relationships between objects. Based on a spatial relationship such as Euclidean distance, network distance or adjacency, the objects are classified into different patterns. A method such as a threshold value can then be applied to find collocation patterns.

Y. Morimoto [1] uses a classification based approach and defines a distance based pattern named k-neighboring class sets. In that paper, the number of instances of each pattern is used as the prevalence measure. A non-overlapping-instance constraint is introduced to obtain the anti-monotone property. The limitation is that the author assumes instances are non-overlapping, which is not always true in real applications, so the anti-monotone property of the approach is not guaranteed.

Yan Huang et al. [2] also propose an approach that belongs to this category. They introduce a new prevalence measure called the participation index, which has the desired anti-monotone property. Based on this interest measure, the paper proposes an event centric model that removes the non-overlapping constraint and finds collocation patterns using threshold based pruning. These are the major contributions of that paper. The limitation is that the approach applies only to point data, so collocations involving line and region objects cannot be addressed.

Hui Xiong et al. [6] extend the event centric model. Their paper generalizes the concept of collocation pattern to extended spatial objects, which include points, lines and regions.
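
The participation index of Huang et al. [2] discussed above can be illustrated with a small Python sketch. It follows the usual definition of the measure: the minimum, over the features in a pattern, of the fraction of each feature's instances that take part in at least one instance of the pattern. The sketch is deliberately restricted to a two-feature pattern on point data with a Euclidean distance threshold `d`; the function names and the coordinates are hypothetical, and this is not the mining algorithm of [2].

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def participation_ratio(src: List[Point], other: List[Point], d: float) -> float:
    """Fraction of points in `src` that have a point of `other` within distance d."""
    if not src:
        return 0.0
    participating = sum(
        1 for p in src if any(math.dist(p, q) <= d for q in other)
    )
    return participating / len(src)


def participation_index(a: List[Point], b: List[Point], d: float) -> float:
    """Participation index of the pair pattern {A, B}: the smaller of the two ratios."""
    return min(participation_ratio(a, b, d), participation_ratio(b, a, d))


# Hypothetical coordinates (illustrative only).
birds = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
houses = [(0.5, 0.0), (1.0, 1.5)]

pi = participation_index(birds, houses, d=1.0)
print(pi)
```

Here two of the three birds and both houses have a neighbor of the other type within distance 1.0, so the index is min(2/3, 1) = 2/3. A pattern is reported as prevalent when its index reaches a user-chosen threshold. Because adding a feature to a pattern can only reduce the set of participating instances, the index cannot increase as patterns grow, which is the anti-monotone property that makes threshold pruning possible.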
By introducing the notion of a buffer, which is a zone within a specified distance of a fixed spatial object, the paper of Xiong et al. [6] provides a more general algorithm. This is its major contribution. A possible improvement is to extend the algorithm to include time intervals, making it applicable to spatio-temporal collocation pattern mining.

Clustering based approach

The clustering based approach treats every spatial attribute as a map layer and considers the spatial regions in each layer as candidates for mining collocation patterns. For instance, given sets of layers X and Y, where X and Y are disjoint, an association rule is written X→Y (CS, CC%). CS is the support value, defined as the ratio of the area of the clusters that satisfy both X and Y. CC% can be viewed as the confidence value, defined as the intersection area of X and Y. Given these concepts and this notation, the clustering based approach can apply various techniques, such as traditional data mining algorithms, to find collocation patterns.

Ding, W. et al. [3] use a clustering based approach. The major contribution of their paper is its focus on regional collocation patterns, whereas previous literature studied only global collocation patterns. A reward-based regional discovery methodology is introduced, and a new divisive, grid-based supervised clustering algorithm is presented to identify interesting sub-regions. The interesting sub-regions are then merged into the largest possible regions. Figure 3 illustrates how this works. The approach has limitations, however. The paper assumes that a region is contiguous: for each pair of objects belonging to the same region, there must always be a path within the region that connects them. In real life this assumption does not always hold, which limits the applicability of the approach.

Figure 3: regional association analysis

Zhang, X. et al. [4] focus on the performance of collocation pattern mining algorithms.
They aim to design an algorithm that speeds up the mining process while minimizing the storage cost. To this end, the paper proposes a framework that combines the discovery of spatial neighborhoods with the mining process. This combination reduces the computational cost and is the major contribution of the paper. The limitation is similar to that of paper [2]: because the approach is designed only for point data, its applicability is limited.

Association rule based approach

The association rule based approach typically relies on transactions. Work in this category focuses on defining transactions over space so that an algorithm similar to Apriori can be applied. The algorithms used in this category are therefore often derived from Apriori.

Celik, M. et al. [5] propose an association rule based approach. The major contribution of their paper, however, is not the association rule algorithm but its handling of dynamic parameters. The paper allows users to change the interest measure value according to their preference. To achieve this, it designs an indexing structure for collocation patterns and proposes an algorithm that efficiently discovers zonal collocation patterns under dynamic parameters. A possible improvement is to extend the approach to global collocation pattern mining. In addition, when considering the efficiency of the pattern mining process, the storage cost should also be taken into account.

Paper / approach used        Classification based   Association rule based   Clustering based   Statistic
Collocation miner [2]        ×                      -                        -                  -
SCMRG [3]                    -                      -                        ×                  -
Fast mining algorithm [4]    -                      -                        ×                  -
Zoloc Miner [5]              -                      ×                        -                  -
Apriori-Gen [1]              ×                      -                        -                  -
EXCOM [6]                    ×                      -                        -                  -
Regression model [8]         -                      -                        -                  ×

Table 1: Classification of papers

Classification Two

A second classification scheme applies to the data types used in the papers. The data used in the experimentation and validation of the algorithms in each paper falls into one of three categories:
1. Real-world data;
2. Synthetic data;
3. A combination of real-world and synthetic data.

Real-World Data

The authors of Paper [1] use real-world data to test their algorithm. They use a real-world digital road map dataset and evaluate its accuracy.

Paper [2] also falls into this category. Its authors used data about arsenic concentrations in Texas obtained from the Texas Ground Water Database (GWDB). The authors sorted through and cleaned this data. This was necessary because the data goes back 25 years and different data collection procedures were used over that period, leading to missing, inconsistent and repeated values.

The other paper in this category, among those included in the survey, is Paper [5]. Its author used real-world data from a telephone directory database. The author identified the X and Y coordinate values for each record and created a database of point records. The algorithm was then tested on this data to determine the time it took to compute the collocation patterns.

Synthetic Data

The authors of Paper [6] use synthetic datasets to validate their proposed algorithm. These datasets, created using a spatial data generator, are used to compare the behavior of the naïve approach with that of their Zoloc-Miner algorithm, specifically the computation time, under the following effects:
1. Number of zones;
2. Size of a zone;
3. Amount of overlap in a zone.

Combination

Paper [3] falls under this category. Its authors use real-world NASA climate datasets, which contain monthly measurements of numeric climate variables, for example precipitation and sea surface temperature, over a period of twelve years. The performance of the proposed algorithm on the real-world dataset is compared against its performance on a synthetic dataset.
The synthetic dataset used in this paper was generated to allow better control when studying the effects of parameters.

The other paper in this survey that uses the combination approach is Paper [4]. Its authors use real-world data from the Digital Chart of the World (DCW). Eight layers for the state of Minnesota are included in this real-world data: Drainage, Drainage Supplemental, Hypsography, Hypsography Supplemental, Population places, Aeronautical, Land Cover and Cultural Landmarks. In this paper, both the real-world and the synthetic data were used to compare the running time of the proposed algorithm against that of previous algorithms.

Paper / data type   [1]   [2]   [3]   [4]   [5]   [6]
Real-World          X     X     -     -     X     -
Synthetic           -     -     -     -     -     X
Combination         -     -     X     X     -     -

Table 2: classification based on data types

Alternative classification scheme

Another possible classification is based on the scale of the spatial association relationships the papers deal with. Some papers focus on global spatial collocation patterns, while others look for regional spatial collocation patterns. The classification can be represented as follows:

Figure 4: classification based on scale (spatial association analysis divides into regional patterns and global patterns)

The papers can therefore be classified as follows:

Paper / scale focused        Global pattern   Regional pattern
Collocation miner [2]        ×                -
SCMRG [3]                    -                ×
Fast mining algorithm [4]    ×                -
Zoloc Miner [5]              -                ×
Apriori-Gen [1]              ×                -
EXCOM [6]                    ×                -

Table 3: classification of papers based on scale

Reference:

[1] Y. Morimoto. Mining Frequent Neighboring Class Sets in Spatial Databases. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[2] Yan Huang, Shashi Shekhar, and Hui Xiong. Discovering Co-location Patterns from Spatial Datasets: A General Approach. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(12), pp. 1472-1485, December 2004.
[3] Ding, W., Eick, C. F., Wang, J., and Yuan, X. 2006. A Framework for Regional Association Rule Mining in Spatial Datasets. In Proceedings of the Sixth International Conference on Data Mining (December 18 - 22, 2006). ICDM. IEEE Computer Society, Washington, DC, 851-856.
[4] Zhang, X., Mamoulis, N., Cheung, D. W., and Shou, Y. 2004. Fast mining of spatial collocations. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 22 - 25, 2004). KDD '04. ACM, New York, NY, 384-393.
[5] Celik, M., Kang, J. M., and Shekhar, S. 2007. Zonal Co-location Pattern Discovery with Dynamic Parameters. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (October 28 - 31, 2007). ICDM. IEEE Computer Society, Washington, DC, 433-438.
[6] Hui Xiong, Shashi Shekhar, Yan Huang, Vipin Kumar, Xiaobin Ma, and Jin Soung Yoo. A Framework for Discovering Co-location Patterns in Data Sets with Extended Spatial Objects. In Proc. of SIAM International Conf. on Data Mining (SDM), Florida, USA, 2004.
[7] Shashi Shekhar and Sanjay Chawla. Spatial Databases: A Tour. Prentice Hall, 2003.
[8] Y. Chou. Exploring Spatial Analysis in Geographic Information System. Onward Press, ISBN: 1566901197, 1997.
[9] N.A.C. Cressie. Statistics for Spatial Data. Wiley and Sons, ISBN: 0471843369, 1991.