ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
International Journal of Computer Science And Technology (IJCST), Vol. 4, Issue 2, April - June 2013

Database Primitives, Algorithms and Implementation Procedure: A Study on Spatial Data Mining

1J Rajanikanth, 2Dr. T.V. Rajinikanth
1Dept. of CSE, S.R.K.R Engg. College, Affiliated to Andhra University, Bhimavaram, AP, India
2Dept. of IT, GRIET, Hyderabad, AP, India

Abstract
Spatial data mining is the extraction of useful knowledge from large amounts of spatial data. It is a highly demanding field because huge amounts of spatial data have been collected in applications ranging from remote sensing to geographical information systems, computer cartography, and environmental assessment and planning. The explosive growth of collected data challenges the research community's ability to analyze it. Recent studies have extended the scope of data mining from traditional relational and transactional databases to spatial databases. This study presents data mining primitives, typical algorithms, profile procedures and their performance under various situations.

Keywords
Spatial Data Mining, Spatial Data, Geographical Information System, Data Mining Primitives, Profile Procedures

I. Introduction
The computerization of many business and government transactions and the advances in scientific data collection tools provide us with a huge and continuously increasing amount of data. This explosive growth of databases has far outpaced the human ability to interpret the data, creating an urgent need for new techniques and tools that support researchers in transforming data into useful information and knowledge. Knowledge Discovery in Databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns from data [1-2]. The KDD process is interactive and iterative, and involves several steps such as the following:

A. Selection
Selecting a subset of all attributes and a subset of all data from which the knowledge should be discovered.

B. Data Reduction
Using dimensionality reduction or transformation techniques to reduce the effective number of attributes to be considered.

C. Data Mining
Applying appropriate algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

D. Evaluation
Interpreting and evaluating the discovered patterns with respect to their usefulness in the given application.

Spatial Database Systems (SDBS) [3] are database systems for the management of spatial data. Spatial data mining algorithms are very important for finding implicit regularities, rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies [4]. Most existing data mining algorithms run on separate and specially prepared files, but integrating them with a database management system (DBMS) has the following advantages. Redundant storage and potential inconsistencies can be avoided [5]. Furthermore, commercial database systems offer various index structures to support different types of database queries. This functionality can be used without extra implementation effort to speed up the execution of data mining algorithms (which, in general, have to perform many database queries). As with the relational standard language SQL, the use of standard primitives will speed up the development of new data mining algorithms and will also make them more portable.

II. Primitives of Spatial Data Mining
A. Rules
Several kinds of rules can be discovered from databases in general. For example, characteristic rules, discriminant rules, association rules, or deviation and evaluation rules can be used for mining [6]. A spatial characteristic rule is a general description of the spatial data.
For example, a rule describing the general price range of houses in various geographic regions of a city is a spatial characteristic rule. A discriminant rule is a general description of the features discriminating or contrasting a class of spatial data from other classes, such as a comparison of the price ranges of houses in different geographical regions. A spatial association rule describes the implication of one set of features by another set of features in spatial databases. For example, a rule associating the price range of houses with nearby spatial features, like beaches, is a spatial association rule.

B. Thematic Maps
A thematic map is a map primarily designed to show a theme, a single spatial distribution or a pattern, using a specific map type. These maps show the distribution of features over limited geographic areas [6]. Each map defines a partitioning of the area into a set of closed and disjoint regions, each of which includes all the points with the same feature value. Thematic maps present the spatial distribution of a single attribute or a few attributes. This differs from general or reference maps, where the main objective is to present the position of an object in relation to other spatial objects. Thematic maps may be used for discovering different rules. For example, we may want to look at a temperature thematic map while analyzing the general weather pattern of a geographic region.

There are two ways to represent thematic maps: raster and vector. In the raster image form, thematic maps have pixels associated with the attribute values. For example, a map may have the altitude of the spatial objects coded as the intensity (or the color) of the pixel. In the vector representation, a spatial object is represented by its geometry, most commonly a boundary representation, along with the thematic attributes. For example, a park may be represented by its boundary points and the corresponding elevation values.

III. Spatial Data Mining Tasks
As shown in Table 1, spatial data mining tasks are generally an extension of data mining tasks in which spatial data and criteria are combined [7-9]. These tasks aim to:
• Summarize data,
• Find classification rules,
• Make clusters of similar objects,
• Find associations and dependencies to characterize data, and
• Detect deviations after looking for general trends.
They are carried out using different methods, some of which are derived from statistics and others from the field of machine learning.

Table 1: Comparison Between Statistical and Machine Learning Methods

SDM Tasks               Statistics                                Machine learning
Summarization           Global autocorrelation,                   Generalization,
                        density analysis,                         characteristic rules
                        smooth and contrast analysis,
                        factorial analysis
Class identification    Spatial classification                    Decision trees
Clustering              Point pattern analysis                    Geometric clustering
Dependencies            Local autocorrelation,                    Association rules
                        correspondence analysis
Trends and deviations   Kriging                                   Trend rules

IV. Spatial Data Summarization
The main goal is to describe data in a global way, which can be done in several ways. One involves extending statistical methods such as variance or factorial analysis to spatial structures. Another entails applying the generalization method to spatial data.

A. Statistical Analysis of Contiguous Objects
1. Global Autocorrelation
The most common way of summarizing a dataset is to apply elementary statistics, such as the calculation of the average, variance, etc., and graphic tools like histograms and pie charts. New methods have been developed for measuring neighborhood dependency at a global level, such as local variance and local covariance, and the spatial autocorrelation indices of Geary and Moran [10-12]. These methods are based on the notion of a contiguity matrix that represents the spatial relationships between objects.
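To make the role of the contiguity matrix concrete, the following is a minimal sketch of Moran's index in plain Python. It assumes a binary, symmetric contiguity matrix stored as nested lists; the function name `morans_i` and the four-region sample data are hypothetical and chosen only for illustration.

```python
def morans_i(values, contiguity):
    """Global Moran's I of `values` under a contiguity (weight) matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]  # deviation of each object from the mean
    total_weight = sum(sum(row) for row in contiguity)
    # Cross-products of deviations, counted only for contiguous pairs.
    cross = sum(
        contiguity[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Assumed example: four regions in a row, each contiguous with its
# immediate neighbours only.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], chain)         # neighbours have similar values
alternating = morans_i([1, -1, 1, -1], chain)  # neighbours have opposite values
```

In this sketch a positive result indicates that contiguous objects tend to carry similar values, and a negative result indicates that they tend to differ. Replacing the 0/1 entries with distance-based weights changes only the matrix, not the function.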
It should be noted that this contiguity can correspond to different spatial relationships, such as adjacency, a distance gap, and so on.

2. Density Analysis
This method forms part of Exploratory Spatial Data Analysis (ESDA) which, contrary to the autocorrelation measure, does not require any prior knowledge about the data. The idea is to estimate the density by computing the intensity within each small circular window over the space and then to visualize the point pattern. It can be described as a graphical method.

V. Class Identification
This task, also called supervised classification, provides a logical description that yields the best partitioning of the database. Classification rules constitute a decision tree where each node contains a criterion on an attribute. The difference in spatial databases is that this criterion can be a spatial predicate and, because spatial objects depend on their neighborhood, a rule involving the non-spatial properties of an object should be extended to neighborhood properties.

In spatial statistics, classification has essentially served to analyze remotely sensed data, and aims to assign each pixel to a particular category. Homogeneous pixels are then aggregated in order to form a geographic entity [13].

In the spatial database approach [14], classification is seen as an arrangement of objects using both their properties (non-spatial values) and their neighbors' properties, not only for direct neighbors but also for the neighbors of neighbors and so on, up to degree N. Let us take as an example the classification of areas by their economic power. A classification rule is described as follows:

High population ∧ neighbor = road ∧ neighbor of neighbor = airport => high economic power (95%).

In GeoMiner, a classification criterion can also be related to a spatial attribute, in which case it reflects its inclusion in a wider zone.
These zones could be determined by the algorithm, either by clustering or by merging adjacent objects, or they could arise from a predefined spatial hierarchy. A newer algorithm [15] extends this classification method in GeoMiner to spatial predicates. For example, to determine high wholesale profits, a decision factor can be the proximity to densely populated districts.

A. Clustering
This task is an automatic or unsupervised classification that yields a partition of a given dataset depending on a similarity function.

1. Database Approach
Paradoxically, clustering methods for spatial databases do not appear to be very different from those applied to relational databases (automatic classification). Clustering is performed using a similarity function, which has already been characterized as a semantic distance. Hence, in spatial databases it appears natural to use the Euclidean distance in order to group neighboring objects. Research studies have focused on the optimization of algorithms.

Geometric clustering generates new classes, such as the location of houses in terms of residential areas [16]. This stage is often performed before other data mining tasks, such as association detection between groups or other geographic entities, or the characterization of a group.

GeoMiner combines geometric clustering applied to a point set distribution with generalization based on non-spatial attributes. For example, we may want to characterize groups of major cities in the United States and see how they are grouped. Cluster results are represented by new areas, which correspond to the convex hull of a group of towns. A few points may stay outside the clusters and represent noise. A description of each group may be generated for each attribute specified.

Many algorithms have been proposed for performing clustering, such as CLARANS [17], DBSCAN [18] or STING [19]. They usually focus on cost optimization.
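As an illustration of how a density-based method can group neighboring points by Euclidean distance and leave isolated points as noise, the following is a minimal sketch in the spirit of DBSCAN. It is a simplified, quadratic-time version without any spatial index, not the implementation from the cited papers; the parameter values for `eps` and `min_pts` and the sample points are assumptions chosen for illustration.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Assumed sample data: two dense groups of towns and one isolated point.
towns = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(towns, eps=1.5, min_pts=3)
```

With these assumed parameters the first three towns form one cluster, the next three form a second cluster, and the isolated point is labeled as noise. A database-integrated version would replace `region_query` with an index-supported range query.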
Recently, a method that is more specifically applicable to spatial data, GDBSCAN, was outlined in [21]. It applies to any spatial shape, not only to point data, and incorporates attribute data.

2. Statistical Approach
Clustering arises from point pattern analysis [20-21] and was mainly applied to epidemiological research. This is implemented in Openshaw's well-known Geographical Analysis Machine (GAM) and can be tested by using the K-function [22]. Clusters can also be detected by the ratio of two density estimates: one of the studied subset and the other of the whole reference dataset.

(i). Database Approach
Using the process described in [18], which is based on central place theory, the analysis is performed in four stages. The first involves discovering centers by computing local maxima of particular attributes. In the second, the theoretical trend of these attributes is determined by moving away from the centers. The third stage determines the deviations in relation to these trends. Finally, these trends are explained by analyzing the properties of the zones concerned. One example is the trend analysis of the unemployment rate in comparison with the distance to a metropolis like Munich. Another example is the trend analysis of the development of house construction.

(ii). Geo-Statistical Approach
Geo-statistics is a tool used for spatial analysis and for the prediction of spatiotemporal phenomena. It was first used for geological applications (the geo prefix comes from geology). Nowadays, geo-statistics encompasses a class of techniques used to analyze and predict the unknown values of variables distributed in space and/or time. These values are supposed to be connected to the environment. The study of such a correlation is called structural analysis.
The prediction of values at locations outside the sample is then performed by the "kriging" technique [23]. It is important to remember that geo-statistics is limited to point set analysis or polygonal subdivisions and deals with a single variable or attribute. Under those conditions, it constitutes a good tool for spatial and spatiotemporal trend analysis.

VI. Implementation Procedure
Join indices were proposed by Valduriez [24] as a technique to accelerate joins in the relational database framework. Their extension to spatial data was proposed by Zeitouni et al. [25]. This extension consists in adding a third attribute representing the spatial relationship between two objects, as shown in Fig. 1. Each tuple (ID1, Spatial Relationship, ID2) expresses the existence of a spatial relationship between the pair of spatial objects identified by ID1 and ID2. This relationship can be topological or metric. In the case of a topological relationship, the Spatial Relationship attribute contains a negative code. Otherwise, it stores the exact distance value. For performance reasons, the calculation of distances is limited to a given useful perimeter around objects.

Fig. 1: Spatial Join Index

The spatial join index has two main advantages: (i) storing the spatial join index avoids recomputing the spatial relationships for every application, and (ii) the same index allows the join operation to be optimized for any topological or metric spatial predicate.

Spatial data mining methods can directly use the relational schema instead of a set of predicates. Indeed, the methods use a target table, the join index table, and neighbor tables describing other themes. However, this data organization cannot be exploited directly by conventional data mining methods, because these consider only one input table with one observation per row. Recently, some work in relational data mining [26] has addressed this problem. These methods are based on inductive logic programming (ILP). Their drawback is that they require expensive transformations of the relational data into a set of first-order logic assertions. The alternatives below work on the relational tables themselves.

A. Querying the Different Tables on the Fly
Unlike the conventional algorithms, the proposed method takes three tables as input: the table of objects to analyze, the neighborhood objects table and the spatial join index table. Whenever the attribute to analyze is a neighborhood attribute, the algorithm performs two join operations between the target table, the spatial join index table and the neighbors table (this is where the existing algorithms have to be modified). Otherwise, the conventional algorithm is applied without modification. An example of an algorithm using this alternative is given in [27].

B. Join Materialization
This alternative materializes, once and for all, the joins on keys between the three tables. This avoids the repeated join queries of the first alternative. However, these joins duplicate the analyzed objects, so the existing data mining algorithms must be modified to take this duplication into account in the different calculations [27-28]. For example, the calculation of the information gain for an observation represented on several rows should count the observation only once. Thanks to the materialization of the joins, this alternative is faster than the previous one.

C. Reorganizing the Data
This alternative reorganizes the data into a single table by joining the three tables without duplicating the analyzed objects. The idea here is to complete, and not to join, the target table with the data present in the other tables [27-28]. Its principle is to generate, for each attribute value of the linked table, an attribute in the result table.
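The reorganization principle can be sketched as follows. This is a hypothetical illustration and not the implementation from [27]: the function name `complete`, the dictionary-based tables, the assumption that every join index tuple stores a distance, and the use of `min` to combine several links to the same attribute value are all assumptions made for this example.

```python
def complete(target, join_index, neighbors, attribute, aggregate=min):
    """Add one column per distinct value of `attribute` to each target row.

    target:     {object id: {column: value}}   objects to analyze
    join_index: [(id1, weight, id2), ...]      weight assumed to be a distance
    neighbors:  {object id: {column: value}}   linked (neighbor) table
    aggregate:  combines the weights of several links to the same value
    """
    values = sorted({row[attribute] for row in neighbors.values()})
    result = {}
    for obj_id, row in target.items():
        completed = dict(row)  # left part: the target row, never duplicated
        for v in values:
            completed[f"{attribute}={v}"] = None  # no link found so far
        result[obj_id] = completed

    # Group the weights by (target object, neighbor attribute value).
    grouped = {}
    for id1, weight, id2 in join_index:
        if id1 in target and id2 in neighbors:
            key = (id1, neighbors[id2][attribute])
            grouped.setdefault(key, []).append(weight)

    # Right part: one aggregated weight per generated column.
    for (id1, v), weights in grouped.items():
        result[id1][f"{attribute}={v}"] = aggregate(weights)
    return result


# Assumed sample data: two districts, three neighboring objects.
districts = {1: {"population": "high"}, 2: {"population": "low"}}
objects = {10: {"type": "road"}, 11: {"type": "airport"}, 12: {"type": "road"}}
index = [(1, 50.0, 10), (1, 20.0, 12), (1, 300.0, 11), (2, 80.0, 10)]
table = complete(districts, index, objects, "type")
```

Each district appears exactly once in `table`, so a conventional single-table learner could consume it directly. The number of generated columns equals the number of distinct values of the chosen neighbor attribute, which is why this kind of reorganization suits attributes with few distinct values.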
Applying this operator beforehand has the advantage of avoiding the duplication of the analyzed objects and of allowing the use of any data mining method without modification.

D. Examples
Here, table I is a weighted correspondence table that joins the target table R with the table V, which contains the other dimension considered. This operator is only recommended when the attributes Bi of V do not contain many distinct values. The main property of the COMPLETE operator is that the result always includes, as its left part, all objects of R without duplication, and that this is completed, in the right part, by the weights of the "dimension" coming from V and I. We note that for the same tuple of R and the same value bij of an attribute of V, there may be several links in I with possibly different weights. This case usually occurs in relational data where the correspondence table represents links with N-M cardinalities. This operator can be seen as a means of preparing relational data for data mining [27-30].

Fig. 2: Example 1

Fig. 3: Example 2

VII. Benchmark Results
The tests aim to compare the performance of each alternative. We retained three criteria: the size of the target table, the size of the linked table (the neighbors table in the SDM case) and the size of the correspondence table (the spatial join index in the SDM case). Fig. 4 gives the execution time of the three algorithms according to the size of the target table [27-30].
Table 2: Performance Based Comparisons of Target Table, Linked Table and Correspondence Table

A = Size of R (objects); B = Size of I (objects); C = Size of V (objects)
1st alternative: D = Total time (seconds)
2nd alternative: E = Total time (seconds); F = Execution time of the first step (seconds); G = Execution time of the second step (seconds)
3rd alternative: H = Total time (seconds); I = Execution time of the first step (seconds); J = Execution time of the second step (seconds)

A       B       C     D       E    F   G    H      I      J
122     147     37    240     3    1   2    3      1      2
204     148     52    300     4    1   3    3      1      2
1594    6330    869   16860   7    1   6    9      5      4
3437    20180   869   24799   8    1   7    76     71     5
8668    20180   869   35205   18   2   16   157    150    7
15574   27054   869   48403   19   2   17   640    626    14
21892   53631   869   70020   56   2   54   925    882    43
29810   74302   869   88320   32   2   30   1372   1330   42

Fig. 4: Comparative Study

VIII. Conclusion
We have shown that spatial data mining is a promising field of research with wide applications in GIS, medical imaging, robot motion planning, etc. Although the field is quite young, a number of algorithms and techniques have been proposed to discover various kinds of knowledge from spatial data. This leads to future directions and suggestions for the spatial data mining field in general. Finally, this study will help researchers to develop better techniques in the field of spatial data mining.

References
[1] Fayyad U. M., Piatetsky-Shapiro G., Smyth P., "From Data Mining to Knowledge Discovery: An Overview", In: Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, 1996, pp. 1-34.
[2] S. Shekhar, P. Zhang, Y. Huang, R. Vatsavai, "Trend in Spatial Data Mining", as a chapter to appear in Data Mining: Next Generation Challenges and Future Directions, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), AAAI/MIT Press, 2003.
[3] Gueting R. H., "An Introduction to Spatial Database Systems", Special Issue on Spatial Database Systems of the VLDB Journal, Vol. 3, No. 4, October 1994.
[4] Koperski K., Adhikary J., Han J., "Knowledge Discovery in Spatial Databases: Progress and Challenges", Proc. SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Technical Report 96-08, University of British Columbia, Vancouver, Canada, 1996.
[5] Zhang P., Steinbach M., Kumar V., Shekhar S., Tan P., Klooster S., Potter C., "Discovery of Patterns of Earth Science Data Using Data Mining", as a chapter to appear in Next Generation of Data Mining Applications, Mehmed M. Kantardzic and Jozef Zurada (eds.), IEEE Press, 2004.
[6] Hemalatha M., Naga Saranya N., "A Recent Survey on Knowledge Discovery in Spatial Data Mining", IJCI International Journal of Computer Science, Vol. 8, Issue 3, No. 2, May 2011.
[7] M. Ester, H. P. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, Portland, OR, USA, August 1996.
[8] R. Ng, J. Han, "Efficient and effective clustering method for spatial data mining", Proc. 1994 Int. Conf. Very Large Databases, pp. 144-155, Santiago, Chile, September 1994.
[9] E. M. Knorr, R. Ng, "Extraction of spatial proximity patterns by concept generalization", Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD-96), pp. 347-350, Portland, OR, USA, August 1996.
[10] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi, "Modeling Spatial Dependencies for Mining Geospatial Data: An Introduction", as a chapter of Geographic Data Mining and Knowledge Discovery, Harvey J. Miller and Jiawei Han (eds.), Taylor and Francis, 2001, ISBN 0-415-23369-0.
[11] S. Shekhar, C.T. Lu, P. Zhang, "A Unified Approach to Spatial Outliers Detection", to appear in Geoinformatica, 7(2), June 2003.
[12] S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications", Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.
[13] Longley P. A., Goodchild M. F., Maguire D. J., Rhind D. W., "Geographical Information Systems - Principles and Technical Issues", John Wiley & Sons, Inc., Second Edition, 1999.
[14] Ester M., Kriegel H.-P., Sander J., "Spatial Data Mining: A Database Approach", Proc. 5th Symp. on Spatial Databases, Berlin, Germany, 1997.
[15] Koperski K., Han J., Stefanovic N., "An Efficient Two-Step Method for Classification of Spatial Data", In Proc. International Symposium on Spatial Data Handling (SDH'98), pp. 45-54, Vancouver, Canada, July 1998.
[16] S. Chawla, S. Shekhar, W. Wu, U. Ozesmi, "Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations", Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
[17] Ng R., Han J., "Efficient and Effective Clustering Method for Spatial Data Mining", In Proc. of 1994 Int'l Conf. on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, pp. 144-155.
[18] Ester M., Kriegel H.-P., Sander J., Xu X., "Density-Connected Sets and their Application for Trend Detection in Spatial Databases", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, Newport Beach, CA, 1997, pp. 10-15.
[19] Wang W., Yang J., Muntz R., "STING: A Statistical Information Grid Approach to Spatial Data Mining", Technical Report CSD-97006, Computer Science Department, University of California, Los Angeles, February 1997.
[20] Openshaw S., Charlton M., Wymer C., Craft A., "A mark 1 geographical analysis machine for the automated analysis of point data sets", International Journal of Geographical Information Systems, Vol. 1, No. 4, pp. 335-358, 1987.
[21] Fotheringham S., Zhan B., "A comparison of three exploratory methods for cluster detection in spatial point patterns", Geographical Analysis, Vol. 28, No. 3, pp. 200-218, 1996.
[22] Diggle P.J., "Point process modeling in environmental epidemiology", In: Barnett V., Turkman K. (eds), Statistics for the Environment, Chichester, John Wiley & Sons, pp. 89-110, 1993.
[23] Isobel C., "Practical geostatistics", Applied Science Publisher, Reprinted 1987. [Online] Available: http://curie.ej.jrc.it/faq/introduction.html
[24] Valduriez P., "Join indices", ACM Transactions on Database Systems, 12(2), pp. 218-246, 1990.
[25] Zeitouni K., Yeh L., Aufaure M-A., "Join indices as a tool for spatial data mining", Int. Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, Lecture Notes in Artificial Intelligence No. 2007, Springer, pp. 102-114, Lyon, France, September 12-16, 2000.
[26] Dzeroski S., Lavrac N., "Relational Data Mining", Springer, 2001.
[27] Chelghoum N., Zeitouni K., "A decision tree for multi-layered spatial data", In: Joint International Symposium on Geospatial Theory, Processing and Applications, Ottawa, Canada, July 8-12, 2002.
[28] Shashi S., Sanjay C., "Spatial Databases: A Tour", Prentice Hall, pp. 237, 2003.
[29] Breiman L., Friedman J.H., Olshen R.A., Stone C.J., "Classification and Regression Trees", Wadsworth & Brooks, Monterey, California, 1984.
[30] Koperski K., Han J., Stefanovic N., "An Efficient Two-Step Method for Classification of Spatial Data", In Proc. International Symposium on Spatial Data Handling (SDH'98), pp. 45-54, Vancouver, Canada, July 1998.