Spatial data mining as a tool for improving geographical models

HELSINKI UNIVERSITY OF TECHNOLOGY
Department of Surveying
Institute of Cartography and Geoinformatics

Věra Karasová

Spatial data mining as a tool for improving geographical models

Master’s Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Technology.

Espoo, May, 2005

Supervisor: Prof. Kirsi Virrantaus
Instructor: M.Sc. (Tech.) Jussi Ahola, Lic. Tech. Jukka Matthias Krisp

HELSINKI UNIVERSITY OF TECHNOLOGY
ABSTRACT OF THE MASTER’S THESIS

Author: Věra Karasová
Title: Spatial data mining as a tool for improving geographical models
Date: May, 2005
Department: Department of Surveying
Number of pages: 63 + 2
Professorship: Maa-123 Cartography and Geoinformatics
Supervisor: Prof. Kirsi Virrantaus
Instructor: M.Sc. (Tech.) Jussi Ahola, Lic. Tech. Jukka M. Krisp

Spatial data mining is a new and rapidly developing technique for analyzing geographical data. In this master’s thesis, the usability of the technique is examined for the improvement of an existing geographical model regarding rescue operations. The main focus of spatial data mining is the discovery of interesting patterns of information embedded in large geographical databases. Due to its ability to operate without a previously formulated hypothesis, spatial data mining is becoming a popular tool for spatial data analysis. After a short explanation of the best known spatial data mining techniques, this thesis concentrates on association rule mining in more detail. Discovered spatial association rules may detect useful relationships among spatially distributed objects. Once the relations are identified, the existing spatial model can be extended by the variables with the strongest relations to the modeled phenomenon. The behavior of association rule mining is studied by applying it to sample data representing incident locations within the Helsinki city center. The core data is provided by the Fire and Rescue department in Espoo.
To observe the interaction of an incident with its neighbourhood, information on geographical objects situated within the study area is obtained from the SeutuCD geographical database. Although spatial data mining does not yet belong to the most commonly used methods of spatial data analysis, it was found effective for detecting strong relationships among geographical objects.

Key words: knowledge discovery from databases, spatial data mining, association rules, risk model

Acknowledgements

I would like to thank the Ministry of Agriculture and Forestry in Finland for financially supporting this research project and therefore giving me the opportunity to finish my Master’s degree at HUT.

Many thanks go to my brilliant supervisor Professor Kirsi Virrantaus, for her encouragement and guidance during my whole studies in Finland. Her open, familial manner and the patience with which she always carefully listened to all my troubles and problems (not always study related) made my time in Finland easier and unforgettable.

My gratitude also goes out to Jussi Ahola for familiarizing me with the concepts of data mining and contributing valuable comments to my thesis, as well as to Jukka Matthias Krisp for endless debates on disaster management and risk assessment procedures.

I also want to thank all my colleagues from the Institute of Cartography and Geoinformatics for an inimitable working atmosphere and their friendship.

My time in Finland would never have been as fulfilling without the extensive care of my boyfriend Huib.

Finally, I would like to express my greatest thanks to my dearest parents and other members of our family, who have always been there with their immense support!

Espoo, May 2005
Věra Karasová

Contents

Abbreviations
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Research objectives
  1.3 Research approach
  1.4 Structure of the thesis
2 Literature survey
  2.1 Definition of SDM and KDD
  2.2 Spatial data characteristics
  2.3 Spatial data mining techniques
    2.3.1 Clustering and outlier detection
    2.3.2 Association and co-location
    2.3.3 Classification
    2.3.4 Trend detection
3 Association rules and geographic data
  3.1 Spatial association rules
    3.1.1 Definition
  3.2 Apriori algorithm
    3.2.1 Discovering large itemsets
    3.2.2 Extraction of association rules
  3.3 Evaluation of the rules
  3.4 Mining multivariate associations using clustering
4 Disaster management in Finland
  4.1 Risk assessment procedure
  4.2 General model
  4.3 Risk model in the city of Espoo
5 Data
  5.1 Dataset of incidents
  5.2 SeutuCD
  5.3 Data for the case study
6 Method
  6.1 Process description
  6.2 Data pre-processing
    6.2.1 Grid approach
    6.2.2 Buffer approach
  6.3 Transformation to transaction format
    6.3.1 Grid data integration
    6.3.2 Buffer data integration
  6.4 Mining association rules
    6.4.1 Constraints definition
7 Results
  7.1 Grid approach
  7.2 Buffer approach
  7.3 General results
8 Discussion
  8.1 Unsolved problems
  8.2 Further research
9 Conclusion
A Cells containing railway
B Sample of the railway data in the text format

Abbreviations

DW   Data Warehouses
KDD  Knowledge Discovery from Databases
LR   Linear Regression
SAR  Spatial Autoregressive Regression
SDM  Spatial Data Mining
YTV  Helsinki Metropolitan Area Council

List of Figures

2.1 Co-location patterns
2.2 Trend detection
3.1 Apriori Algorithm
3.2 Vertical-view approach
4.1 Risk classification process
5.1 Map of the study area
5.2 Data representing the study area
6.1 Schema of the association rule mining process
6.2 Grid cells numbering
6.3 Cell evaluation
6.4 Buffer zones around incidents
6.5 Integration algorithm for grid data
6.6 Integration algorithm for buffer data
8.1 Problems with grid division of the space
8.2 Reduction of selected objects
8.3 The hierarchy of topological relations
8.4 Object hierarchies

List of Tables

3.1 Example of basket data
3.2 4 x 4 relational table
7.1 Relative frequencies of all object types for grid approach

Chapter 1 Introduction

We are often interested in analyzing complex situations to more precisely predict the effect of some spatial phenomenon. Once its behavior is approximated by a model, the spatial phenomenon can be understood more correctly. However, currently used spatial models are usually created in a very simple way and represent only the general trend. To give the model a more realistic form, advanced methods of spatial data analysis should be employed. When a more accurate representation of a spatial phenomenon exists, more can be discovered about its possible impact.

Recently, the number of natural and man-made disasters has increased. Therefore, actions concentrating on the prediction and assessment of possible consequences for nature as well as human lives are becoming more important. Consequently, principal changes to the existing risk models for rescue operations are essential. The fast development of geo-information technologies opens a variety of new opportunities, so more accurate analyses can be performed on spatial data. In this research the possible use of spatial data mining (SDM) methods is investigated for identifying factors that may influence occurrences of incidents within the Helsinki city center.

1.1 Background

Traditional statistical analysis tools have difficulty handling the huge volumes of data collected in recent years.
Moreover, the statistical methods require a broader knowledge of the test data in order to define a principal hypothesis for the analysis. As a consequence, the analyses are getting more expensive and time consuming. Therefore, classical statistics becomes an unsuitable tool for analysis performed in data-rich environments. [Miller and Han, 2001], [Shekhar and Chawla, 2003], [Mannila, 2002]

Data mining is introduced as a discipline concentrating on the manipulation of extensive databases. The main goal of data mining is to search for deeply hidden information that can be turned into knowledge for strategic decision making and for answering fundamental research questions [Miller and Han, 2001]. Due to its ability to extract implicit knowledge without any a priori stated hypothesis, data mining is becoming a popular tool. However, collected data is not always randomly distributed, independent or stored in relational databases. The core question regarding SDM is how to deal with the complex characteristics and spatial relations embedded in geographical databases. [Shekhar and Chawla, 2003]

Although the requirements of SDM often differ from classical data mining in principle, some SDM researchers try to adjust classical data mining techniques instead of designing new algorithms. Each SDM technique is developed for the analysis of different spatial phenomena. The most often used SDM methods, such as clustering, trend detection and classification, are derived from spatial statistics. The only method that is not yet commonly used for geographical data analysis is association rule mining, which identifies information, not explicitly stored, about unexpected and possibly useful relationships. ([Ester et al., 2001], [Koperski and Han, 1995], [Miller and Han, 2001] etc.) This thesis concentrates on the application of spatial association rule mining to detect interesting spatial relationships among geographical objects.

1.2 Research objectives

Throughout this thesis, the existing knowledge about SDM and its most commonly used techniques is discussed. Since the scientific literature in this area is very limited, this thesis should serve as a survey of the research recently done in this field.

The biggest challenges of SDM are the spatial attributes of geographic data. Every object situated in geographical space is always related to another. This fact should be tracked and recorded in an appropriate place in the geographical database. However, every geographical database keeps this record in a different format. Due to the variety and complexity of those records, applications implemented for SDM analysis are mostly case dependent. It is not always feasible to design a new SDM tool; sometimes a small but well-considered modification of the data can enable the use of an already existing application. This thesis intends to demonstrate the use of the Gnome Data Mine tool, originally implemented for classical data mining by Borgelt [Borgelt and Kruse, 2002], on geographical data.

To test whether SDM is a useful method for analyzing spatial data stored in an extensive geographical database, this research continues with the application of association rule mining to a case study. The core of the data selected for the case study consists of records of incidents located within the Helsinki city center. The goal of the case study is to discover the possible influence of geographical objects on incidents. Therefore, the only subjects to mine are the spatial relationships among selected objects. Since those relations are not known in advance, they first have to be identified. Clearly, many operations have to be carried out before association rule mining can be applied. The main goal of the case study is to present a solution covering the whole process of operations that are necessary for obtaining the desired results.
It must be emphasized that the complexity of this process and the detailed description of each of its steps are of main importance to this thesis.

Although SDM is not based on any a priori given hypotheses, a general idea about the aim of the research should be known. Such an assumption facilitates the whole process and leads to the discovery of valuable information. At first, the core objects the study relates to should be identified and extracted from the provided databases. Once the amount of data is restricted, the identification of spatial relationships is faster and less expensive. This thesis offers two methods for deriving the spatial relations from the geographical coordinates of selected objects. By applying association rule mining to the extracted relations, some interesting dependencies among selected objects are discovered. The knowledge of those dependencies enables a more accurate selection of variables for improving currently used geographical models.

1.3 Research approach

The research is based on a literature survey which identifies the core concepts and methods of SDM. This background is a starting point for further theoretical and conceptual analysis. To test the interaction of SDM with real data, association rule mining is applied in a case study. Because the implementation of SDM tools is usually data dependent, there is no generally applicable ready-made software. However, various programs for classical data mining exist. The aim of the case study is to test the possible use of classical data mining software for geographical data analysis, thereby easing the problems connected with program implementation. On the other hand, the use of classical data mining methods requires extensive data pre-processing. The case study represents a constructive research approach.

1.4 Structure of the thesis

Previously conducted research related to the topic is described in the following Chapter 2.
It contains the definition of SDM and introduces some of the main techniques and algorithms. A more detailed description of association rule mining, including a definition of the most commonly used algorithm, is given in Chapter 3. Chapter 4 explains the risk assessment procedure and the currently used risk model of the Fire and Rescue services in Espoo. The Fire and Rescue services provided a database of incidents, which is described together with the SeutuCD geographical database in Chapter 5. This chapter also contains a description of the data selected for the case study. Chapter 6 illustrates two methods for data pre-processing. In addition, the whole process of association rule mining is presented in detail and demonstrated on the selected data. The results are evaluated in Chapter 7, where the most interesting association rules are also listed. The unsolved problems and ideas for further development are discussed in Chapter 8. The work is concluded in Chapter 9.

Chapter 2 Literature survey

This chapter outlines the theoretical background of the research. A general overview of Knowledge Discovery from Databases (KDD) and Spatial Data Mining (SDM) is provided. Since SDM deals with geographical data, their typical characteristics are described later in this chapter. This chapter concludes with a possible classification of SDM techniques.

2.1 Definition of SDM and KDD

Due to advanced data collection techniques such as remote sensing, census data acquisition, weather and climate monitoring etc., contemporary geographical datasets contain an enormous amount of data of various types and attributes. Analyzing this data is challenging for traditional data analysis methods, which are mainly based on extensive statistical operations. Since classical data mining methods enable us to detect valuable information in extensive relational databases, SDM can be an appropriate technique for detecting possibly interesting patterns in geographical datasets.
Spatial data mining is a knowledge discovery process of extracting implicit interesting knowledge, spatial relations, or other patterns not explicitly stored in databases. [Koperski et al., 1996], [Koperski and Han, 1995]

Knowledge discovery from databases is a complex concept integrating several research fields including machine learning, database systems, statistics, visualization etc. Data mining is a core component of the KDD process. The KDD process assumes that interesting and unexpected patterns in very large databases are deeply hidden and often difficult or impossible to specify a priori. Consequently, traditional database queries and statistical methods do not reveal any implicit information from a large database. KDD is a tool for exploring domains that are too difficult to perceive with unaided human abilities [Miller and Han, 2001].

2.2 Spatial data characteristics

Extracting implicit information from geographical databases appears to be more challenging than extracting it from traditional non-spatial databases. Together with non-spatial attributes, spatially referenced objects also carry information concerning their representation in space through geometrical and topological properties. [Koperski et al., 1996]

Topology covers the geographical properties which are not closely connected to the actual position of objects, i.e. it represents the spatial relationships among objects. [Helokunnas, 1995] According to [Hutchinson, 1991], topology is a branch of geometry that deals with those properties of a figure (object) that remain unchanged even when the figure is transformed. On the other hand, the geometric characteristics of data concern information related to the actual location of the object in space. [Kraak and Ormeling, 2003] The location is usually described by Euclidean coordinates or latitude and longitude.
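The distinction between geometric and topological properties can be illustrated with a minimal sketch. It assumes a hypothetical park polygon and two hypothetical point objects (none of them taken from the thesis data), and it uses a standard ray-casting containment test. When all objects are scaled and shifted together, their coordinates (geometry) change, whereas the relation "lies inside" (topology) does not.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if p lies inside the polygon."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through p.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def transform(p: Point, scale: float, dx: float, dy: float) -> Point:
    """Uniform scaling followed by a translation (changes the geometry)."""
    return (p[0] * scale + dx, p[1] * scale + dy)


# Hypothetical objects, used for illustration only.
park = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
school = (1.0, 1.0)
station = (6.0, 1.0)

# The same objects after scaling and shifting the whole map.
park_t = [transform(v, 2.5, 100.0, -50.0) for v in park]
school_t = transform(school, 2.5, 100.0, -50.0)
station_t = transform(station, 2.5, 100.0, -50.0)

print(school, "->", school_t)  # the coordinates differ
print(point_in_polygon(school, park), point_in_polygon(school_t, park_t))
print(point_in_polygon(station, park), point_in_polygon(station_t, park_t))
```

In this sketch the containment results are identical before and after the transformation, which is the sense in which a topological relation is independent of the actual position of the objects.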
Besides the core spatial characteristics dealing with geometry and topology, geographical data also contains information about the behavior of the phenomenon the data represents. In particular, the notion of spatial autocorrelation is fundamental to any spatially related operation. [Shekhar et al., 2003] Ignoring the fact that nearby items tend to be more similar than items situated farther apart causes inconsistent results in spatial data analysis.

Another important characteristic of geographic data is spatial heterogeneity. Spatial data is not identically distributed in space; therefore, data properties are location dependent. Local trends can sometimes contradict the global trends. [Shekhar and Chawla, 2003] In other words, global parameters estimated from a geographic database do not sufficiently describe the geographic phenomenon at any particular location. [Miller and Han, 2001]

Due to the diversity of spatial data, the composition of geographical databases is crucial. Moreover, the data integration process has to deal with very complicated data transformations, because the collected data often come from different sources. [Bédard et al., 2001] Therefore, a good database design makes it possible to analyze geographical data with maximum processing efficiency while at the same time considering their spatial characteristics.

2.3 Spatial data mining techniques

There is no unique way of classifying SDM techniques. Various kinds of patterns can be discovered from databases and can be presented in different forms. The categorization often depends on the background field of a particular researcher. A person interested in data visualization will probably base the classification criteria on various visualization techniques, whereas a computer science researcher might see the main difference in the utilization of different algorithms.
An illustrative overview of the various possibilities for classifying data mining techniques is given in [Demšar, 2004].

Based on [Han, 1999], general data mining tasks can be classified into two main categories: descriptive data mining and predictive data mining. The former concisely describes the behavior of datasets and presents interesting general properties of the data, whereas the latter attempts to construct models that help to predict the behavior of new datasets. Forecasting an employee’s potential salary based on the salary distribution of similar employees can be seen as an example of a predictive data mining task, while descriptive methods may be used for comparing sales between a European and an Asian branch of a certain company.

Ester [Ester et al., 1997] divides spatial data mining techniques into four general groups: spatial association rules, spatial clustering, spatial trend detection and spatial classification. The categorization is based on the KDD algorithms. According to Shekhar and Chawla [Shekhar and Chawla, 2003], the three least controversial techniques would be classification, clustering and association rules. However, some of those algorithms can be accompanied by supporting methods. For example, to identify so-called hot spots, which are areas with a high value of a certain activity within a large area of low activity, a clustering technique is performed together with outlier detection. Similarly, the basic idea of co-location is derived from a spatial association technique.

In this thesis, the particular spatial data mining techniques are organized as a combination of Ester’s and Shekhar and Chawla’s categorizations:

- clustering and outlier detection
- association and co-location
- classification
- trend detection

2.3.1 Clustering and outlier detection

Spatial clustering is a process of grouping a set of spatial objects into groups called clusters.
Objects within a cluster show a high degree of similarity, whereas the clusters are as dissimilar as possible. [Ester et al., 2001], [Shekhar et al., 2003] Unlike classification, clustering is an unsupervised process. This means that clustering does not rely on predefined class labels or an a priori given number of classes. [Han et al., 2001] Clustering is a very well known technique in statistics, and the role of data mining is to scale a clustering algorithm to deal with large geographical datasets. [Shekhar and Chawla, 2003]

Clustering algorithms can be separated into four general categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods. [Han et al., 2001] The categorization is based on different cluster definition techniques.

Partitioning method

A partitioning algorithm organizes the objects into clusters such that the total deviation of each object from its cluster center is minimized. [Han et al., 2001] At the beginning each object is classified as a single cluster. In the next steps, all data points are iteratively reallocated among the clusters until a stopping criterion is met. Because it minimizes the distance to the center of the cluster, this method tends to find clusters of spherical shape. [Shekhar and Chawla, 2003]

K-Means and K-Medoids are commonly used fundamental partitioning algorithms. The K-Medoids method uses the most centrally located object in the cluster as the cluster center. Some of the recent algorithms that are based on the K-Medoids method are Partitioning Around Medoids (PAM), Clustering LARge Applications (CLARA) and Clustering Large Applications based upon RANdomized Search (CLARANS). [Han et al., 2001]

Hierarchical method

These clustering methods hierarchically decompose the dataset by splitting or merging all clusters until a stopping criterion is met.
The result of the decomposition is a dendrogram tree, which can be formed in two ways: "bottom-up" or "top-down". The bottom-up approach, also called the agglomerative approach, starts with each object forming a separate group. In every step, groups are successively merged until all of them congregate into one, the topmost level of the hierarchy. In the top-down approach, also called divisive, all objects are at the beginning united into one general cluster. In every iteration each cluster is split into several smaller ones, until eventually each object forms a separate cluster. Some of the recently used hierarchical clustering algorithms are Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Clustering Using REpresentatives (CURE). [Han et al., 2001]

Density-based method

This method regards clusters as dense regions of objects that are separated by regions of low density (representing noise). In contrast to partitioning methods, clusters of arbitrary shapes can be discovered. Density-based methods can be used to filter out noise and outliers. An example of a density-based algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a clustering method based on connected regions with sufficiently high density. [Han et al., 2001]

Grid-based method

Grid-based clustering algorithms first quantize the clustering space into a finite number of cells and then perform the required operations on the grid structure. Cells that contain more than a certain number of points are treated as dense. The main advantage of the approach is its fast processing time, since the time is independent of the number of data objects and depends only on the number of cells. Some of the grid-based algorithms are STatistical INformation Grid (STING) and CLustering In QUEst (CLIQUE). [Shekhar and Chawla, 2003], [Han et al., 2001]

A clustering method is sometimes accompanied by outlier detection.
The goal of outlier detection is to discover a small subset of data points, which are often viewed as noise, errors, deviations or exceptions. A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood [Shekhar and Chawla, 2003]. The identification of global outliers can lead to the discovery of unexpected knowledge and has a number of practical applications including transportation, public safety, climatology etc. For example, outlier detection and clustering techniques can help to discover areas named hot spots, which may represent areas of high crime density. [Shekhar and Chawla, 2003] The possible utilization of hot spot analysis is further explained in Chapter 3.

2.3.2 Association and co-location

When performing clustering methods on the data, we can find only characteristic rules, describing spatial objects according to their non-spatial attributes. In many situations we want to discover spatial rules that associate one or more spatial objects with others. A spatial association rule is of the form $X \Rightarrow Y$ ($c\%$), where $X$ and $Y$ are sets of spatial or non-spatial predicates and $c\%$ is the confidence of the rule. [Koperski et al., 1996]

An association rule is characterized by two parameters: support and confidence. The former expresses the ratio of the number of transactions that satisfy both $X$ and $Y$ to the total number of transactions in the dataset, whereas the latter is the conditional probability that $Y$ is true under the condition that $X$ is true.

A large number of associations may be extracted from an extensive geographical database. However, a majority of those rules are applicable to only a small number of objects, and the extraction of all rules is computationally very expensive. Often the confidence of rules is low.
Therefore, the concepts of minimum support and minimum confidence are used to guarantee that only important rules are discovered. We state that a rule is strong when the support is large, i.e., no less than the minimum support threshold, and the confidence is large, i.e., no less than the minimum confidence threshold. [Koperski et al., 1996] However, one of the biggest research challenges in mining association rules is the development of methods for selecting potentially interesting rules from among the mass of all discovered rules. [Mannila, 2002]

The following rule can be obtained from a geographic database:

$\text{is\_a}(x, \text{school}) \wedge \text{close\_to}(x, \text{sport\_center}) \Rightarrow \text{close\_to}(x, \text{park}) \quad (80\%)$

This rule states that 80% of schools which are close to sport centers are also close to parks. [Koperski et al., 1998]

Compared to spatial association, the co-location technique tends to discover only relations concerning the spatial proximity of objects. Therefore, the number of transactions is reduced to spatial transactions concerning object neighborhood only. Consequently, attributes and their values do not influence the result. An example of co-location can be seen in Figure 2.1. This image represents an analysis of habitats of animals and plants. Co-location of predator-prey species, symbiotic species and fire events with ignition sources may be identified. In Figure 2.1, two co-location patterns can be observed: a fire is often located close to a dry tree, and a bird is often seen in the neighbourhood of a house. [Shekhar and Chawla, 2003], [Shekhar et al., 2003]

2.3.3 Classification

Every data object stored in a database is characterized by its attributes. Classification is a technique whose aim is to find rules that describe the partition of the database into an explicitly given set of classes. Objects with similar attribute values are integrated into the same class.
In spatial classification the attribute values of neighboring objects may also be relevant for the membership of objects in a certain group. Therefore, we have to include the neighbourhood factor in the calculation.

A classification method consists of two parts. First the user defines the number of classes. To test whether the number of classes was chosen correctly, a set of training data is selected and the classification is performed on it. Classification rules are then derived from the training dataset. Next, those rules are applied to the test dataset. Classification is considered predictive spatial data mining, because we first create a model according to which the whole dataset is analyzed. [Ester et al., 1997]

A classification process can be performed in many different ways. A method offered by Shekhar and Chawla [Shekhar and Chawla, 2003] is based on Linear Regression (LR). To account for the spatial dependencies of objects, a Spatial Autoregressive Regression (SAR) technique has been proposed by spatial statisticians.

Figure 2.1: Example of co-location spatial data mining. [Shekhar and Chawla, 2003]

2.3.4 Trend detection

A spatial trend is a regular change of one or more non-spatial attributes when moving away in space from a start object. Therefore, spatial trend detection is a technique for finding patterns of attribute changes with respect to the neighborhood of some spatial object. [Ester et al., 1997] Let us consider the statement: "When moving away from a big city, real estate property is cheaper." The trend is characterized by the detection of a regular change of the real estate property price, dependent on the distance from a big city. The city in this case represents the start object.

Ester [Ester et al., 1998] has presented an algorithm based on Linear Regression.
In each step of the algorithm, the local change of the specified attribute and the distance to the neighbors are calculated. In the next step an LR is applied to the selected pairs of values. When the resulting correlation coefficient is larger than a specified threshold, we can state that a trend is discovered. An illustrative example can be seen in Figure 2.2, originally created by Ester, where the attribute average rent from the BAVARIA dataset is depicted. A significant trend can be observed for the city of Munich: the average rent decreases quite regularly when moving away from Munich. [Ester et al., 1997]

Figure 2.2: Average rent for the communities of Bavaria. [Ester et al., 1997]

Chapter 3 Association rules and geographic data

This chapter gives a general overview of the spatial association rule mining technique. After a short introduction and definition of the rule, the most commonly used algorithm for mining association rules, the Apriori algorithm, is briefly described. Since geographical databases deal with spatial data, the mining process tends to be more difficult. To simplify the process, geographical data are transformed into a format understandable to classical association rule mining. The chapter concludes with a case study concerning the application of association rule mining to a geographical database representing geo-referenced crime data. [Estivill-Castro and Lee, 2001]

3.1 Spatial association rules

For the case study, we decided to use the association rule mining technique because it makes it possible to detect interesting relationships among objects representing the study area. The association rule was originally introduced for the so-called market basket analysis. The basic idea is to find regularities in the shopping behavior of supermarket customers. Typical business decisions, for example about possible sales, are usually based on past transaction data analysis.
This analysis tends to improve the quality of such decisions. Since the progress in bar-code technologies has made it possible to collect massive amounts of basket data and to store them in a database, the necessary functionality for taking advantage of this process should be provided. [Agrawal et al., 1993]

A spatial association rule is a rule denoting certain association relationships among a set of spatial and possibly some non-spatial attributes of geographical objects, which for the analysis are expressed as predicates. The spatial predicates may represent topological relationships between spatial objects, such as disjoint, intersects, adjacent to etc.; they can also hold information about spatial orientation or ordering, like left, north, east etc., or specify a distance, e.g. close to. [Koperski and Han, 1995] For better understanding of the spatial association mining technique, some preliminary concepts are introduced in the following sections.

3.1.1 Definition

Let χ = {I1, I2, ..., Im} be a set of binary attributes called items. Let T be a database of transactions. Each transaction t is represented as a binary vector, with t[k] = 1 if t contains the item Ik, and t[k] = 0 otherwise. There is one tuple in the database for each transaction. Let X be a set of some items in χ. We say that a transaction t satisfies X if for all items Ik in X, t[k] = 1. [Agrawal et al., 1993]

A spatial association rule is an implication of the form X ⇒ Ij (c%), where X is a set of some items in χ, and Ij is a single item in χ that is not present in X. [Agrawal et al., 1993] The item set X is called the antecedent and the part behind the implication arrow is the consequent. The most often used measure of a rule's strength is confidence (c%), which indicates that c percent of the transactions satisfying the antecedent of the rule also satisfy the consequent of the rule.
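As a minimal sketch of the definition above, the following Python fragment represents transactions as binary vectors and evaluates the confidence of a rule X ⇒ Ij. The item names, the four toy transactions and the helper names `satisfies` and `confidence` are illustrative assumptions and are not taken from [Agrawal et al., 1993].

```python
# Items I1..Im are identified by their position k in the binary vector.
ITEMS = ["I1", "I2", "I3"]  # hypothetical item names

# Toy database T: one binary vector per transaction (assumed data).
T = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]


def satisfies(t, X):
    """True if transaction t contains every item index in X (t[k] == 1)."""
    return all(t[k] == 1 for k in X)


def confidence(T, X, j):
    """Share of the transactions satisfying X that also contain item j."""
    matching = [t for t in T if satisfies(t, X)]
    if not matching:
        return 0.0
    return sum(t[j] for t in matching) / len(matching)


# Rule {I1} => I2: three transactions contain I1, two of them also contain I2.
c = confidence(T, X={0}, j=1)
print(f"{{{ITEMS[0]}}} => {ITEMS[1]} ({c:.0%})")
```

Indexing items by position keeps the sketch close to the binary-vector notation t[k] used in the definition; a set-based representation would work equally well.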
Following the definition, a large number of spatial association rules can be derived from a large spatial database. However, only a few of these rules are useful. Therefore, the generated rules are restricted to those that satisfy certain additional constraints, which are of two different forms:

1. Syntactic Constraints: These constraints involve restrictions on items that can appear in a rule. For example, we may be interested only in rules that have a specific item Ix appearing in the consequent.

2. Support Constraints: These constraints concern the number of transactions in T that support a rule. The support for a rule is defined to be the fraction of transactions in T that satisfy the union of items in the consequent and antecedent of the rule.

The aim of association rule mining is to generate all combinations of items whose support is at or above a certain threshold minsupport. Those combinations of items are called large itemsets. Consequently, all combinations of items that have support below the given minsupport threshold are called small itemsets. [Agrawal et al., 1993]

For a given large itemset Y = {I1, I2, ..., Ik}, k ≥ 2, the association rules are generated afterwards. The number of rules is at most k and the rules only contain items from the set {I1, I2, ..., Ik}. The antecedent of each of these rules is a subset X of Y such that X has k-1 items, and the consequent is the item Y-X. To generate an interesting rule X ⇒ Ij (c%), where X = {I1, I2, ..., Ij−1, Ij+1, ..., Ik}, the confidence of the rule also has to reach a certain minconfidence threshold. [Agrawal et al., 1993]

The rule is strong when the support is large, i.e., no less than the minimum support threshold, and the confidence is large, i.e., no less than the minimum confidence threshold.
[Koperski et al., 1996]

3.2 Apriori algorithm

The main problem of mining association rules is the fact that a large number of rules can be derived from a large database. However, most people are only interested in patterns that occur relatively frequently, i.e. strong rules. Since it is not possible to inspect each rule separately, efficient algorithms are needed to restrict the search space and check only a subset of important rules. One of the best known algorithms for mining spatial associations is called the Apriori algorithm and was developed by Agrawal et al. [Agrawal et al., 1993].

Table 3.1: Example of basket data

basket-id  A  B  C  D  E
t1         1  0  0  0  0
t2         1  1  1  1  0
t3         1  0  1  0  1
t4         0  0  1  0  0
t5         0  1  1  1  0
t6         1  1  1  0  0
t7         1  0  1  1  0
t8         0  1  1  0  1
t9         1  0  0  1  0
t10        0  1  1  0  1

This algorithm works in two steps. In the first step the large itemsets are determined. The second step represents the actual generation of association rules from the large itemsets detected in the first step. The first step is the more important one, because it accounts for the greater part of the processing time. [Borgelt and Kruse, 2002]

3.2.1 Discovering large itemsets

The large itemsets are very simple patterns telling us that the variables in the set occur reasonably often together. Fortunately, only relatively few large itemsets are typically generated from real databases. To discover large itemsets, we need to find all itemset patterns that are frequent, i.e. whose occurrence reaches the minsupport threshold. Discovering large itemsets is demonstrated in the following example. Table 3.1, taken from [Mannila, 2002], represents transactions of several customers in a supermarket. Every line ti indicates a single customer transaction, which consists of items placed in the customer's shopping basket. Each column represents a single supermarket item.
The purchase of a specific item is denoted by a 1, and a 0 means that the item is not bought. We can easily see that customer t1 placed only item A in his shopping basket. Let us set the support, i.e. frequency threshold, to 0.4. From the example in Table 3.1, the large itemsets satisfying the given constraint are {A}, {B}, {C}, {D}, {A, C} and {B, C}. [Mannila, 2002]

An itemset can be large only if all of its subsets are large. Therefore, we can find all frequent itemsets by first identifying all large 1-itemsets, i.e. sets consisting of one variable like {A}, {C}. In the next step we build candidate itemsets of size 2 by joining two large 1-itemsets, e.g. {A, C}. Each candidate itemset is then tested against the database and accepted as large if its support reaches the minsupport threshold. We can similarly create candidate itemsets of size 3 and larger.

Figure 3.1 illustrates the steps of the algorithm, where Lk is the set of large k-itemsets. Each member of this set has two fields: i) itemset and ii) support count. Ck represents the set of candidate k-itemsets. Every member of this set is also characterized by the same two fields as Lk. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. Every other pass k consists of two phases. First, the large itemsets Lk−1 detected in the (k-1)th pass are used to generate the candidate itemsets Ck by the apriori-gen function. Next, the database is scanned and the support of the candidates in Ck is counted. To make the counting faster, the candidates in Ck that are contained in a given transaction t should be efficiently determined. The subset function is used for this purpose. Further explanation of the apriori-gen and the subset functions is given in [Agrawal and Srikant, 1994].

3.2.2 Extraction of association rules

After all the large itemsets are defined, the extraction of association rules is rather straightforward.
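Before the rules themselves are derived, the level-wise search for large itemsets described in Section 3.2.1 can be sketched in Python on the basket data of Table 3.1. This is a simplified illustration under stated assumptions, not the implementation from [Agrawal and Srikant, 1994]: candidates are built by joining pairs of large (k-1)-itemsets and pruned by their subsets, and support is counted by a plain scan instead of the subset function. The names `BASKETS`, `support` and `large_itemsets` are introduced only for this example.

```python
from itertools import combinations

# Basket data of Table 3.1, one set of purchased items per transaction t1..t10.
BASKETS = [
    {"A"}, {"A", "B", "C", "D"}, {"A", "C", "E"}, {"C"}, {"B", "C", "D"},
    {"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "E"}, {"A", "D"}, {"B", "C", "E"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def large_itemsets(baskets, minsupport):
    """Return {itemset: support} for all itemsets with support >= minsupport."""
    items = sorted(set().union(*baskets))
    level = [frozenset([i]) for i in items]
    level = [s for s in level if support(s, baskets) >= minsupport]
    result = {}
    k = 1
    while level:
        result.update({s: support(s, baskets) for s in level})
        k += 1
        # Join step: unions of two large (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep the candidates that reach the support threshold.
        level = [c for c in candidates if support(c, baskets) >= minsupport]
    return result


found = large_itemsets(BASKETS, 0.4)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), found[itemset])
```

With the threshold 0.4 the sketch returns the six large itemsets listed in the text. The candidate {A, B, C} is pruned without a database scan because its subset {A, B} is not large.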
If we consider the example in Table 3.1, two association rules can be discovered: A ⇒ C with 67% confidence (c = 4/6 = 2/3) and B ⇒ C with 100% confidence (c = 5/5 = 1). [Mannila, 2002]

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);          // New candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);          // Candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;

Figure 3.1: Apriori Algorithm

3.3 Evaluation of the rules

In this chapter we have so far focused primarily on the formalism of association rules. However, an important part of association rule mining is the evaluation of the generated rules. We can obtain hundreds of rules representing a dataset. Therefore, we need to validate the rules and select only those which present important patterns. It is obvious that evaluating all rules one by one is impossible for a human expert. Some automated techniques are needed to support the selection of interesting rules and facilitate the work.

The previous sections discussed methods that are used to find all rules that fulfill simple frequency and accuracy criteria. However, we should not restrict the selection of interesting rules only to those reaching the minsupport and minconfidence thresholds. Some rules with low support or low confidence may still hold very interesting patterns. On the other hand, not all rules with high confidence and support are interesting. [Borgelt and Kruse, 2002]

The structure of interesting rules can be described by a simple rule template which represents the syntactic constraints. The template gives users the possibility to specify both interesting and uninteresting rules by describing which attributes should occur in the antecedent and the consequent. [Klemettinen et al., 1994]
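Rule extraction and template-based selection can be sketched together on the data of Table 3.1. The fragment below derives single-consequent rules from the two large 2-itemsets and then applies a very simple template that bans chosen items from the antecedent. The minconfidence value of 0.65 and the names `rules_from` and `matches_template` are assumptions made for illustration; they are not taken from [Mannila, 2002] or [Klemettinen et al., 1994].

```python
# Basket data of Table 3.1, one set of purchased items per transaction t1..t10.
BASKETS = [
    {"A"}, {"A", "B", "C", "D"}, {"A", "C", "E"}, {"C"}, {"B", "C", "D"},
    {"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "E"}, {"A", "D"}, {"B", "C", "E"},
]


def count(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in BASKETS)


def rules_from(itemset, minconfidence):
    """Yield (antecedent, consequent, confidence) with one item as consequent."""
    for item in sorted(itemset):
        antecedent = itemset - {item}
        conf = count(itemset) / count(antecedent)
        if conf >= minconfidence:
            yield antecedent, item, conf


def matches_template(rule, banned_in_antecedent):
    """Template check: the antecedent must not contain any banned item."""
    antecedent, _, _ = rule
    return not (antecedent & banned_in_antecedent)


large_2 = [frozenset("AC"), frozenset("BC")]  # large 2-itemsets at minsupport 0.4
rules = [r for y in large_2 for r in rules_from(y, minconfidence=0.65)]
interesting = [r for r in rules if matches_template(r, {"B"})]

for antecedent, consequent, conf in interesting:
    print(sorted(antecedent), "=>", consequent, f"({conf:.0%})")
```

Under these assumed settings the reverse rules C ⇒ A and C ⇒ B fall below the confidence threshold, so only A ⇒ C and B ⇒ C remain before the template is applied.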
In the example of the basket data introduced in Table 3.1, we discovered two rules with support of at least 0.4: A ⇒ C (67%) and B ⇒ C (100%). The two rules are not necessarily equally interesting. Let us assume that, for our purpose, rules with B in the antecedent are not important. In this case we design a template in order to restrict the selection. From the dataset example, we then discover one interesting rule, A ⇒ C (67%).

3.4 Mining multivariate associations using clustering

The aim of the research carried out by Estivill-Castro and Lee is to examine and analyze crime data. [Estivill-Castro and Lee, 2001] Since crime is a complex phenomenon, there is a great need for a sophisticated tool to facilitate the data analyses. One of the popular techniques is crime hot spot analysis. In their article, the authors identify the hot spots by clustering. After the crime clusters are identified, the detection of possible cause-effect relations follows, for which association rule mining seems to be a feasible tool. The research proposed a vertical-view approach for cluster association rule mining. Since this research deals with data similar to our case study, it is the main inspiration for this thesis.

The vertical-view approach detects the relationships among layers by dividing space into regular cells, similarly to a raster. The input for the analysis consists of several geographical layers of the same area. Each layer represents a different item (a different attribute of the geographical location). The layers include only point or polygon data. Every pinpointed location is assigned a value of each attribute, corresponding to the selected layers. The value of an attribute is true (1) if the location lies within the regions (clusters) of the corresponding layer and false (0) otherwise. The vertical-view approach tries to discover interesting associations from the whole set of attributes.
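The idea can be sketched as follows: each grid cell becomes one transaction and each layer one item. The cell sets below are toy values chosen to agree with the 2 x 2 example that this section works through (Table 3.2); the names `LAYER_CELLS`, `relational_table` and `rule_stats` are hypothetical and do not come from [Estivill-Castro and Lee, 2001].

```python
# For each layer, the set of grid cells in which the layer is present (toy data).
LAYER_CELLS = {
    "a": {1, 2, 3, 4},  # railway station clusters
    "b": {1, 2, 4},     # crime incident clusters
    "c": {2, 3, 4},     # parks
    "d": {2, 3, 4},     # urban areas
}
N_CELLS = 4  # a 2 x 2 grid


def relational_table(layer_cells, n_cells):
    """Build {loc: {layer: 0 or 1}}, one binary row per grid cell."""
    layers = sorted(layer_cells)
    return {loc: {name: int(loc in layer_cells[name]) for name in layers}
            for loc in range(1, n_cells + 1)}


def rule_stats(table, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    rows = list(table.values())
    ante = [r for r in rows if all(r[name] for name in antecedent)]
    both = [r for r in ante if r[consequent]]
    support = len(both) / len(rows)
    confidence = len(both) / len(ante) if ante else 0.0
    return support, confidence


table = relational_table(LAYER_CELLS, N_CELLS)
s, c = rule_stats(table, antecedent=["a", "b"], consequent="d")
print(f"layer(a) AND layer(b) => layer(d): support {s:.2f}, confidence {c:.3f}")
```

Because the table has the same shape as a basket table, any classical association rule miner could replace the hand-written `rule_stats` function.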
Figure 3.2: The first four pictures represent four geographical layers. Picture a) displays railway stations, picture b) crime incidents. Parks are displayed in picture c) and picture d) depicts urban areas. Pictures e) and f) show the point data after the cluster analysis, overlaid by a four-cell grid. The last two pictures, g) and h), represent the polygon data, i.e. parks and urban areas, overlaid by the same grid. [Estivill-Castro and Lee, 2001]

The algorithmic procedure of the vertical-view approach is as follows:

1. Find spatial clusters for point-data layers.
2. Segment all the layers with a finite number of regular cells (rectangles).
3. Construct an m x n relational table with the binary {0;1} values.
4. Apply association rule mining to the table.

The first step is to find homogeneous groups of spatial concentrations of point data layers by applying cluster analysis. Noise points are ignored. The space, in every layer, is then divided into rectangular cells of an arbitrary size. The size of the cells is identical for each layer. After that the relational table is computed. The size of the table depends on the number of cells in the grid. Finally, association rule mining is applied to find the correlation between a set of layers. Since the relational table of the locations is exactly the same as a table created from transactional databases, except that layers replace items and locations replace transactions, it is now straightforward to discover associations among layers using traditional association rule mining.

Table 3.2: 4 x 4 relational table

         layer(a)  layer(b)  layer(c)  layer(d)
loc(1)   1         1         0         0
loc(2)   1         1         1         1
loc(3)   1         0         1         1
loc(4)   1         1         1         1

The algorithm is demonstrated by an example, illustrated in Figure 3.2 and described in the following text. The geographical database consists of four data layers.
The first layer (a) shows railway stations as point data, the second layer (b) displays incidents, and the last two layers contain polygons depicting parks (c) and urban areas (d). The first step is to identify homogeneous groups of spatial concentrations of point data layers. Two clusters of railway stations (e) and one cluster of crime incidents (f) have been formed. Subsequently, a 2 x 2 grid is overlaid on all data layers. The relational Table 3.2 is derived from the grid. In this table, the column layer(j) (0 ≤ j ≤ n) represents the j-th geographical layer and the row loc(i) (1 ≤ i ≤ m) denotes the i-th cell in the grid, numbered in Morton order. The entry t[loc(i), layer(j)] is 1 if an event in the j-th layer occurs in cell loc(i), and 0 otherwise. For instance, t[loc(1), layer(a)] = 1, because the cluster of railway stations lies within the top-left cell as depicted in Figure 3.2 e). The association rules can be directly mined from the relational table. One of the association rules is as follows:

layer(a) ∧ layer(b) ⇒ layer(d) (c = 2/3 = 66.7%)

With 50% support (s = 2/4 = 0.5), 66.7% of the locations that are situated near railway stations and have recorded crime incidents fall within urban areas.

In the vertical-view approach, the granularity of cells plays a critical role. However, a big advantage of this approach is its simplicity and the possibility of applying classical association rule mining techniques to a geographical database.

Chapter 4 Disaster management in Finland

For better understanding of the case study, this chapter gives an overview of the risk assessment procedure in Finland. In section 4.2, the general model for fire and rescue operations, proposed by the Ministry of Interior in cooperation with the Federation for Fire Brigade Chiefs, is described. An explanation of the specific model used in the Helsinki Metropolitan area concludes this chapter.
4.1 Risk assessment procedure

Accident prevention and prevention of disasters have been recognized as a fundamental topic in the areas of civil protection and emergency services. Risk assessment plays an essential role in the improvement of a risk model developed for the fire and rescue services. This model aims to provide crucial information for planning rescue operations. All Finnish municipalities are responsible for rescue operations in their respective areas. In 1992 the Finnish Ministry of the Interior defined the Guidelines on Preparedness of Municipal Fire Brigades. These Guidelines state that preparedness in the fire brigades must be based on municipal risk analysis. To assist the municipal fire brigades in obtaining the risk assessments, a handbook was published in 1994 in co-operation with the Ministry of the Interior and the Federation for Fire Brigade Chiefs in Finland [Alliniemi J., 1994], [Lonka, 1999].

4.2 General model

According to [Lonka, 1999], the probabilities of different risks are estimated in the risk assessment procedure. In order to get a numerical risk estimate for each possible risk, a simple calculation method is used:

R = (L + F + P + E) × Pb

In this formula R represents the risk, L the consequences for life and health, F the rapidity of the development of the accident, P the consequences for property, E the consequences for the environment and Pb the probability of the risk occurrence. The consequences can be deaths, injuries, property losses, interruption losses and environmental damages. These calculations give only very rough estimates of the risk; therefore a more sophisticated evaluation of the risk assessment is needed. The model for risk assessment should take into consideration the subjective sensitivity of municipalities to different risk categories.
For example, one of the most considerable risks in a harbour area can be the transit transport of liquid hazardous materials, while this risk has no significance inland.

4.3 Risk model in the city of Espoo

In the Helsinki Metropolitan area (municipalities of Helsinki, Espoo, Vantaa and Kauniainen), the fire and rescue department focuses on two-level protection plans. The first level is related to the normal situation, whereas the second level is created for extreme situations, e.g. wartime. However, the general interest lies in the enhancement of risk analysis for the normal, i.e. everyday, situations. To carry out the risk assessment in its area of responsibility, a model was developed by the Rescue Office of Espoo. This model is used for the calculation of the risk zones. The zone identification supports decisions on assigning the rescue service levels that are mentioned in the law (Ministry of Interior 2000). The implementation of this model is based on the expertise of people working in the rescue services and on the national statistics. Simple spatio-statistical analyses form the core methods of the model. Those analyses were implemented in a GIS system and provided to the municipalities of the Metropolitan area. This tool depicts the risk zones within the municipalities' areas of responsibility.

The tool was designed for raster operations. The rescue services identified three factors with possibly the strongest effect on risk occurrence. Each factor is represented in a separate raster layer with a cell size of 250 m x 250 m. The first raster contains information about the population distribution, the second represents the floor area, and the probability of traffic accidents is displayed on the third raster layer. The analysis proceeds in two steps. In the first step all three rasters are combined. Every resulting grid cell is then classified according to the values in all three layers and assigned a final risk level from 1 to 4.
The level defines the time within which the area has to be reached by the rescue service unit. For instance, level 1 indicates that the rescue unit has to be on site within 6 minutes. In the second step the cells of the resulting raster are joined into spatially connected regions. [Krisp et al., 2005] The joining process is based on simple rules defined by the Ministry of Interior. One of the rules states: if there are at least ten risk class 1 cells within an area of 10 km², then the whole area is classified as risk level 1. Figure 4.1 illustrates the two steps of classifying the area into the specific risk regions.

Figure 4.1: This figure represents the two basic steps in the process of identifying different risk regions. Picture a) displays the result of the raster overlay of the three data layers. Each pixel is assigned a colour according to one of the four risk classes: red, yellow, blue and white. The second step is depicted in picture b). According to certain rules, the neighboring pixels are connected into bigger regions. [Ihamäki, 2000]

Chapter 5 Data

The study of spatial association relationships is confined to the center of the Finnish capital, Helsinki. The map of the area is presented in Figure 5.1. The data for the case study are derived from two specific datasets. The first one is acquired by the Finnish fire and rescue services and the second dataset, SeutuCD, is provided by the Helsinki Metropolitan Area Council (YTV). The two datasets and the data for the case study are described in the following three sections.

5.1 Dataset of incidents

The Helsinki Fire and Rescue department maintains an up-to-date register of all incidents the fire brigade has been called to. The register data indicates the point locations of incidents concerning all the fire alarms, rescue missions and automated fire alarm systems missions within the Helsinki city area.
Every record also includes a more detailed description of the incident. This information is stored as attribute data. Detailed understanding of the incident properties plays a crucial role in the decision making. For example, the knowledge of an occurrence time can help the rescue services with planning the future distribution of their resources.

Figure 5.1: Map of the study area (Helsinki city center, scale 1:25000)

5.2 SeutuCD

All Municipalities in Finland are obligated to gather register data on their population, buildings and land use plans.
The municipalities can benefit more from the obligation if the form of the data is standardized, because data analyses can then be carried out independently of municipal boundaries. The Metropolitan municipalities transferred some of the responsibility for urban planning and development of the area to the Helsinki Metropolitan Area Council (YTV). Since 1997 the YTV has been working on the production of a database covering the whole Metropolitan area with data from the municipality registers. The outcome is a data package called SeutuCD. SeutuCD includes register data on population, buildings, agencies and enterprises and data related to land use planning and real estate. [YTV, 2005]

5.3 Data for the case study

The case study area covers the center of Helsinki and adjacent water areas as depicted in Figure 5.1. The data for the analysis is extracted from both of the provided datasets. Because the whole SeutuCD database appeared to be unnecessarily extensive, only a representative sample is utilized. Analysis of sample data only seems to be sufficient and has no effect on the basic behavior of association rule mining.

The core of the sample data consists of incident records, obtained from the Fire and Rescue service Office in Espoo between the years 2002 and 2003. All incidents are located within the study area. To simplify the analysis, all nonspatial attribute information is omitted. Every incident is characterized only by its unique ID, geographic coordinates and a definition of its data type, which in the case of incidents is point.

Figure 5.2: Data representing the study area

To observe possible interaction of the incident with its neighborhood, information on geographical objects situated within the study area is needed. The source of the additional information is SeutuCD. For the study area, the following geographical layers were extracted from the database: bars and restaurants, kindergartens, parks and cemeteries, water areas and road network.
These data are likewise described by a unique ID, geographical coordinates and a data type definition. This particular selection is made based on the diversity of the data types. By analyzing points, lines and polygons, the behavior of association rule mining with respect to all data types can be studied. The only objects selected on the recommendation of the rescue services are bars and restaurants, which are suspected of being connected to incidents. Because some of the incidents occurred at sea, the layer containing water areas is included in the data selection. Since the center of Helsinki is situated on a peninsula, the sea plays an important role in this area. The parks and cemeteries data is used following Estivill-Castro and Lee's research [Estivill-Castro and Lee, 2001], described more closely in Chapter 3. The road network is a representative example of line data. The roads are widely distributed over the whole city center. To keep the same character of all data, i.e. every layer representing only one object type, the road network is divided into several layers according to the category of the roads in the road classification. The last selected layer represents kindergartens. The main reason is to include another type of point object. In Figure 5.2 all the final layers are named and presented together with their symbol representation on the map.

Chapter 6 Method

This chapter explains the extensive process of mining geographical data. Since the format of geographical data is very complex and therefore not directly usable by association rule mining applications, extensive pre-processing is required before association rule mining is applied. This chapter offers a procedure which leads to the identification of objects influencing the occurrence of incidents. The aim of this thesis is not to implement any new algorithm, but to apply an already existing tool. This chapter concludes with information about the program used and the settings of the required parameters.
6.1 Process description

The whole process is outlined in the schema in Figure 6.1. The three core steps are data pre-processing, transformation to the transaction format and association rule mining. Those steps are symbolized by red ellipses in the schema and, in the text, they correspond to sections of this chapter. The black boxes connected by arrows represent the particular actions that have to be performed in each of the three steps. The left side of the schema illustrates the changes of the format of the extracted data needed for the analysis. Additional operations are depicted on the right side of the schema together with the program used for the extraction of association rules.

Figure 6.1: Schema of the association rule mining process. [Labels in the schema: SeutuCD and Fire and Rescue services database; relevant data extraction; 11 geographical vector layers; neighbourhood specification (grid approach, buffer approach); identification of objects situated within the neighbourhood area; export of the selection to text files; 11 text files; transformation to the transaction format; transaction file; association rule mining; Genome; template; selection of interesting rules; interesting association rules.]

6.2 Data pre-processing

Association rule mining is a rather straightforward process. However, the format of the data can cause problems when applying this data mining technique. This issue becomes a challenge when the goal is to detect associations among spatial objects. In spatial databases, data are seldom stored in the form of transactions. To be able to apply association rule mining to spatially referenced objects, some changes to the data have to be made. In this case, the definition of every object from the study area is restricted to only two basic attributes.
The first one relates to the data type and the second one characterizes the location of the object in space. Every object is also given an ID number, which is unique within the geographical layer it belongs to. For instance, let us select an incident with IDinc = 2 and a bar with IDb = 2. Although those two different objects are assigned the same ID number, they are still distinguishable, because they belong to different geographical layers (the incidents layer and the bars and restaurants layer). Since only the spatial representation of the objects is known, the only reasonable subject to mine is the object's geographical position. Therefore, we decided to mine spatial relations among data situated in different geographical layers. However, spatial relationships are not always easy to discover without using efficient algorithms such as plane sweep [de Berg et al., 2000]. To avoid the problems of defining various topological relationships, the only spatial relation identified among the sample data is the proximate neighbourhood of the objects. In this thesis, the neighborhood area is defined in two different ways. The first approach divides the study area into regular square grid cells, whereas the second considers the neighbourhood as a regular buffer around the objects.

Figure 6.2: Space filling curve for grid cells numbering.

6.2.1 Grid approach

The division of space into a regular grid was introduced in Estivill-Castro and Lee's vertical-view approach [Estivill-Castro and Lee, 2001], explained in Chapter 3. Crime data analyses face similar problems to risk assessment modelling for the rescue services. Therefore, the basic idea of this approach is adopted and adjusted to our case study. However, cluster analysis is not applied to the data before the association rule mining. By generalizing the original data into clusters, some important patterns may be lost.
The objects within the study area are already selected by extracting specific geographical layers from the two available databases. Therefore, there is no need for further generalization of the information. In [Estivill-Castro and Lee, 2001], however, the clustering analysis was a necessary step to focus the research only on the interesting hot spot areas.

In our case, the data is organized into 11 geographical layers. A regular square grid is placed over the whole area, layer by layer. The size of one grid cell is set to 50 m x 50 m. Every grid cell identifies a neighbourhood and is characterized by a unique ID number. The cell numbering starts at the bottom left corner of the area. After all cells in a row have been assigned a number, the numbering continues from the leftmost cell of the next row. The schema of this process is shown in Figure 6.2. After all cells have obtained an ID, the pre-processing can start.

The method is the same for all eleven layers; therefore, it is explained for only one of them, the railway layer. The pre-processing technique is rather simple and consists of two basic steps:

1. Grid cells selection
2. Extraction of the data

In the first step, all cells that are intersected by a railway are highlighted. Since the ID number of each cell is known, the selected cells can be easily identified. In the next step, all the highlighted cells are extracted from the grid layer and stored in a separate text file. The two steps are illustrated in Appendices A and B. Appendix A shows all highlighted cells on a map of the study area. A fragment of the same data after the second step is displayed in Appendix B, where every row carries the ID number of a selected cell.

In many cases, one grid cell contains several objects from the same geographical layer. However, we are not interested in the number of objects of one type belonging to one grid cell.
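The cell numbering and the selection of occupied cells described above can be sketched as follows. This is a minimal illustration for point objects only, not the GIS procedure used in the thesis (selecting cells intersected by a line, such as a railway, requires a proper intersection test). The grid width `N_COLS`, the grid origin `x_min`, `y_min` and the first ID `first_id` are hypothetical values chosen for the example.

```python
CELL_SIZE = 50.0  # metres, the cell size stated in the text
N_COLS = 93       # hypothetical number of cells per row


def cell_id(x, y, x_min=0.0, y_min=0.0, first_id=1):
    """Return the ID of the grid cell containing the point (x, y).

    Cells are numbered row by row, starting at the bottom left corner.
    x_min, y_min (grid origin) and first_id are assumptions.
    """
    col = int((x - x_min) // CELL_SIZE)
    row = int((y - y_min) // CELL_SIZE)
    if col < 0 or col >= N_COLS or row < 0:
        raise ValueError("point lies outside the grid")
    return first_id + row * N_COLS + col


def occupied_cells(points):
    """IDs of all cells holding at least one point.

    A set is used, so several objects in one cell yield a single ID.
    """
    return sorted({cell_id(x, y) for x, y in points})
```

Because a set is used, a cell containing several objects of the same layer appears only once, which matches the statement that the number of objects per cell is not of interest.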
The cell is selected when at least one object of a particular layer is found within the cell area. For instance, two cells (5157 and 5252) are highlighted in Figure 6.3. By visual comparison alone we can state that cell number 5252 contains more than one segment of railway, while only one railway segment intersects cell 5157. However, both cells carry the same information. For the purpose of association rule mining, the count of objects of the same type within the neighbourhood area is not essential. The main goal of the data pre-processing is to identify the different object types lying within one grid cell. After applying this process to all geographical layers, eleven text files are obtained, each named after the explored objects.

Figure 6.3: Evaluation of the cell.

6.2.2 Buffer approach

The neighbourhood of objects can also be identified by a buffer. The buffer is placed only around the points representing incidents, because we are only interested in discovering possible relations between incidents and selected geographical objects. The relations among the geographical objects from the SeutuCD and the relations among incidents themselves are outside the scope of this study.

The buffer has a circular shape with the incident located in the center. The radius of the circle equals the size of a grid cell, i.e. 50 m. Everything located within a buffer is considered to be a neighbouring object. Figure 6.4 shows an example, where a minor road and a bar or restaurant are adjacent to an incident situated in the center of the highlighted buffer zone. Similar to the grid approach, all buffers are identified by a unique ID number. The numbering is in this case random, since the order of the buffers is not important.

The basic idea of the buffer approach resembles the grid approach. Instead of placing a grid over the whole area, we study the intersection of all layers with the created buffer zones.
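The buffer neighbourhood can be sketched in the same spirit. The example below treats all objects as points and tests whether they lie within 50 m of an incident; it is an illustration under that simplifying assumption, not the GIS buffer operation used in the thesis. The function name `buffer_neighbours` and the input structures are hypothetical.

```python
import math

RADIUS = 50.0  # metres, the buffer radius stated in the text


def buffer_neighbours(incidents, layers, radius=RADIUS):
    """Map each buffer ID to the set of layer names found inside the buffer.

    incidents: dict {buffer_id: (x, y)} with one buffer per incident
    layers:    dict {layer_name: [(x, y), ...]} of point objects (assumption)
    """
    result = {}
    for buffer_id, (ix, iy) in incidents.items():
        found = set()
        for name, points in layers.items():
            # One object inside the circle is enough to select the layer.
            if any(math.hypot(px - ix, py - iy) <= radius for px, py in points):
                found.add(name)
        result[buffer_id] = found
    return result
```

As in the grid approach, only the presence of a layer inside the neighbourhood is recorded, not the number of objects.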
All objects that belong to a particular layer and are situated within a buffer are extracted and saved to a text file. As a result, we obtain ten text files (the incident file is excluded), where each file represents a particular geographical layer. The structure of the files is the same as for the grid approach.

Figure 6.4: The picture depicts the buffer areas (yellow) around incidents. The highlighted circle identifies the neighbourhood of the incident located in the center.

6.3 Transformation to transaction format

In the previous steps, all the relevant data is extracted and stored in text format. This pre-processing is necessary, because the complex geographical information needs to be simplified. However, further adjustments have to be made before the mining process can be applied. All the data are now stored in separate files. In the next step, those files have to be integrated and transformed into a suitable format, i.e. transactions containing itemsets. The basic idea of the integration is the same in both approaches; however, there are slight differences, which need to be explained.

6.3.1 Grid data integration

As a result of the neighbourhood detection, eleven text files are obtained. Each file stores the IDs of the cells where objects of a certain type are located. However, each file represents only a single object type. In the following step, we need to unite all files according to the cell identification. The process is depicted in Figure 6.5. The top part of the figure shows three separate text files in three columns. For easier identification, the name of each file is added. The numbers in the columns represent the cell ID numbers. A number is assigned to each file according to its input order in the integration algorithm. For instance, the file representing railway is given number 1, because it is read first. The steps of the integration algorithm are:

1. Check every cell ID number of the grid.
2. If the ID number occurs in a file, classify the cell according to the file of origin, in our example 1, 2 or 3.
3. Add the classified cell as an item to the Results file.
4. If the same ID number exists in a different file, add the number of that file as an item to the already created transaction in the Results file.
5. Save the Results file.

RAILWAY(1):   2 45 48 50 132 145 159
INCIDENTS(2): 10 45 46 50 133
BARS(3):      1 8 16 23 50 133 165 181
RESULTS (one transaction per row, rows separated here by ";"): 3; 1; 3; 2; 3; 3; 1 2; 1; 2; 1 2 3; 1; 2 3; 1; 1; 3; 3

Figure 6.5: Explanation of the integration algorithm for the grid approach. The top part depicts the three files containing the ID numbers of the selected cells. The Results file is the output of the algorithm.

Let us demonstrate the algorithm on the example highlighted in red in Figure 6.5. After several passes through the files, cell number 50 is detected in the railway file. The cell is classified as number 1, because 1 is the label of the railway file. Consequently, a new transaction is created in the Results file. The same number (50) is found in file number 2, i.e. incidents. Therefore, the algorithm adds the item to the already existing transaction. Now the transaction row contains two items, 1 and 2, i.e. railway and incident. Finally, the same cell number is also detected in the third file, representing bars and restaurants. The cell is again classified by the number of the file and the item is added to the transaction. The final itemset is of the form 1, 2, 3 and states: in one location within the study area, a railway, an incident and a bar or restaurant are identified as adjacent objects. The entire Results file lists all transactions discovered from the example input files.

6.3.2 Buffer data integration

Obtaining the transaction file for the buffer approach is very similar to the grid approach. Every text file extracted from the database includes buffer ID numbers. Each number represents a buffer that contains or intersects a particular object.
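The grid integration algorithm of Section 6.3.1 can be sketched as follows, using the three example files of Figure 6.5. This is an illustrative implementation, not the program used in the thesis; the function names `integrate` and `to_transaction_lines` are hypothetical, and writing the rows in ascending ID order is an arbitrary choice of this sketch.

```python
def integrate(files):
    """Merge per-layer ID lists into transactions.

    files: list of (label, ids) pairs in input order, where label is the
           number assigned to the file and ids are cell (or buffer) IDs.
    Returns a dict {id: [labels of all files in which the id occurs]}.
    """
    transactions = {}
    for label, ids in files:
        for object_id in ids:
            # Create the transaction on first sight, extend it afterwards.
            transactions.setdefault(object_id, []).append(label)
    return transactions


def to_transaction_lines(transactions):
    """One transaction per row, items separated by a blank."""
    return [" ".join(str(label) for label in transactions[key])
            for key in sorted(transactions)]


# Example input taken from Figure 6.5.
railway = [2, 45, 48, 50, 132, 145, 159]
incidents = [10, 45, 46, 50, 133]
bars = [1, 8, 16, 23, 50, 133, 165, 181]

result = integrate([(1, railway), (2, incidents), (3, bars)])
lines = to_transaction_lines(result)
```

For cell 50 the sketch yields the itemset 1, 2, 3, in agreement with the walkthrough above.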
The buffer zones are created only around incidents; therefore, the incident layer is extracted from the database but is not used in the further operations. Similar to the grid approach, all the files need to be integrated. The integration algorithm is performed in the same way as in the grid approach, with one additional step. After the Results file is filled, one more item is added to every itemset. The neighbourhood is restricted to the proximity of incidents, but the Results file does not yet contain any information about them. Therefore, the additional item, which stands for the incidents or, more precisely, for the buffers around incidents, makes the itemsets complete. The whole process is illustrated in Figure 6.6, where the number 10 represents the additional information about incidents. The highlighted itemset contains representatives from all three files and expresses: in one location in the Helsinki center, an incident happened in the proximity of a railway, a bar or restaurant and a park or cemetery.

RAILWAY(1): 2 45 48 50 132 145 159
PARKS(2):   10 45 46 50 133
BARS(3):    1 8 16 23 50 133 165 181
RESULTS (one transaction per row, rows separated here by ";"): 10 3; 10 1; 10 3; 10 2; 10 3; 10 3; 10 1 2; 10 1; 10 2; 10 1 2 3; 10 1; 10 2 3; 10 1; 10 1; 10 3; 10 3

Figure 6.6: Explanation of the integration algorithm for the buffer approach. The top part represents the three input files containing the IDs of the selected buffers. The Results file is the output of the algorithm.

6.4 Mining association rules

So far, only the data pre-processing methods have been described. By integrating all files, the spatial information describing the neighbourhood of the selected geographical objects is transformed into a simple set of transactions. Once the transaction file is obtained, the application of association rule mining is simple and straightforward. The Apriori algorithm, described in Chapter 3, is the best known among the algorithms for association rule induction.
The aim of this thesis is not to implement the algorithm in a working program. We concentrate more on the possible application of the algorithm. Therefore, we decided to analyze our data with an already existing program. The program is designed by Borgelt and its implementation is explained in [Borgelt and Kruse, 2002]. A graphical user interface for this program was developed by Togaware [Togaware, 2005] as part of the Gnome Data Mine tool, and can be downloaded from [Gnome, 2005].

The input of the program is a transaction file. Each record, i.e. one row, must contain one transaction, i.e. a list of items separated by blanks. An empty record is interpreted as an empty transaction. Both our Results files, from the grid and the buffer approach, are in the required format. Therefore, the selected program can be applied to our data.

6.4.1 Constraints definition

Both transaction files are in the format suitable for the Gnome Data Mine tool, selected for extracting the association rules. However, we are aware that a large number of rules can be discovered. To obtain only valuable rules, the extraction has to be restricted. Therefore, three constraints are defined: minsupport, minconfidence and a syntactic constraint (template).

Because rules with low confidence de facto represent negation, which can carry interesting information, the minconfidence threshold is set to zero. In line with the minconfidence, the minsupport is also set to zero. With these settings, all existing rules are extracted from the transactions. However, we are not interested in all of them. The other way of restricting the extraction to important rules only is to apply syntactic constraints. By designing a simple template, which states which items may appear in a rule, only the rules fulfilling the constraint are selected, independently of the values of confidence and support.
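To make the three constraints concrete, the following sketch enumerates all rules with one item in the antecedent and one in the consequent and filters them by minsupport, minconfidence and a template. It is a brute-force illustration and neither the Apriori algorithm nor the Gnome Data Mine tool. The function name `mine_pair_rules` and the parameter `must_contain` are hypothetical, and support is taken here as the fraction of transactions containing both items of the rule, which is an assumption about the definition. The example transactions are those of the Results file in Figure 6.5 (1 = railway, 2 = incidents, 3 = bars and restaurants).

```python
from itertools import permutations


def mine_pair_rules(transactions, must_contain=None,
                    min_support=0.0, min_confidence=0.0):
    """Return rules (antecedent, consequent, support, confidence).

    must_contain plays the role of the template: if given, only rules
    with this item in the antecedent or consequent are kept.
    """
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))
    rules = []
    for a, c in permutations(items, 2):
        n_a = sum(1 for s in sets if a in s)
        n_ac = sum(1 for s in sets if a in s and c in s)
        support = n_ac / n        # share of transactions with both items
        confidence = n_ac / n_a   # share of antecedent transactions with c
        if must_contain is not None and must_contain not in (a, c):
            continue
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, c, support, confidence))
    return rules


# Transactions of the Results file in Figure 6.5.
example = [[3], [1], [3], [2], [3], [3], [1, 2], [1], [2],
           [1, 2, 3], [1], [2, 3], [1], [1], [3], [3]]

incident_rules = mine_pair_rules(example, must_contain=2)
```

With both thresholds at zero, the template alone decides which rules remain, which mirrors the setting chosen in this section.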
The designed template limits the number of rules to only those containing incidents. The three constraints are applied to both transaction files with the same values.

Chapter 7

Results

This chapter evaluates the generated rules. The rules fulfilling the set constraints are described. Since association rule mining is applied to both transaction files, the results are listed separately. The general results obtained by the whole process conclude this chapter.

7.1 Grid approach

Since we divided the study area into regular grid cells of size 50 m x 50 m, the number of transactions in the file is equal to the number of cells, i.e. 6510. Obviously, not every transaction contains an incident. By employing our predefined template, the number of relevant transactions decreases rapidly.

Before we start to evaluate the discovered rules, it is reasonable to calculate the relative frequencies of all possible consequents, i.e. rules with empty antecedents [Borgelt, 2005]. The knowledge of the relative frequency of every object type helps to decide whether a strong rule can be evaluated as interesting. The list of all relative frequencies r can be found in Table 7.1, where the first column gives the object type and the second column shows the calculated value of r. The biggest value of r belongs to the water areas. This reflects the fact that water covers a large part of the study area.

Table 7.1: Relative frequencies of all object types for the grid approach.

rule                      r [%]
⇒ motorway                 0.6
⇒ kindergartens            0.8
⇒ bars and restaurants     4.3
⇒ railway                  6.0
⇒ incidents                7.2
⇒ waterway                 8.9
⇒ paths                    9.7
⇒ main roads              14.6
⇒ parks                   17.7
⇒ minor roads             29.6
⇒ water                   54.1

It is probable that rules containing water have a high confidence value. If objects of another type are also densely distributed over the study area, the rules containing those objects and water become strong. But those rules are not necessarily interesting.
Since the relative frequencies of water and of the other object type are high, the confidence of the rules containing those objects is also high. However, the high confidence is in this case not based on a detected relation between the two object types, but on the fact that both are common within the study area. In this case study, only water has a high relative frequency; therefore, this problem does not have to be considered.

After all the pre-defined constraints are set in the Gnome Data Mine program, association rule mining is applied to the grid transaction file. We discovered about twenty potentially interesting rules. One of the most significant is rule 7.1.

bars and restaurants ⇒ incidents (1.7%; 40.0%)    (7.1)

The first number in brackets represents the support and the second the confidence of the rule. However, we also extracted rule 7.2, with swapped objects in the antecedent and consequent.

incidents ⇒ bars and restaurants (1.7%; 24.1%)    (7.2)

These rules carry similar information, but only the one representing the more interesting pattern is selected. The confidence of rule 7.1 is considerably higher than the confidence of rule 7.2. But the confidence value is not the only factor that influences the rule selection. Table 7.1 shows that the relative frequency of incidents is higher than the relative frequency of bars and restaurants. Rule 7.1 demonstrates that, although bars and restaurants are not densely distributed over the area, they are often located close to incidents. Rule 7.2, in turn, shows that only about a quarter of all incident locations is situated near a bar or restaurant. Consequently, a significantly larger share of incidents happens near other objects. Therefore, we consider rule 7.1 to be more interesting. This means that there is a high probability that the presence of bars and restaurants strongly affects the occurrence of an incident.
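The relation between the reported measures can be checked with a short calculation. The sketch below assumes that the confidence of a rule equals its support divided by the relative frequency of the antecedent from Table 7.1; the small deviations from the reported 40.0% and 24.1% would then be explained by rounding of the published percentages, which is an assumption. The ratio of confidence to the relative frequency of the consequent (often called lift) is not used in the thesis and is added here only as an illustration of how a confidence can be compared with the values in Table 7.1.

```python
def confidence(support_pct, antecedent_freq_pct):
    """Confidence in percent, from support and antecedent frequency."""
    return 100.0 * support_pct / antecedent_freq_pct


def lift(confidence_pct, consequent_freq_pct):
    """Ratio of rule confidence to the relative frequency of the consequent."""
    return confidence_pct / consequent_freq_pct


freq = {"bars and restaurants": 4.3, "incidents": 7.2}  # Table 7.1
support = 1.7                                           # rules 7.1 and 7.2

conf_7_1 = confidence(support, freq["bars and restaurants"])  # about 39.5
conf_7_2 = confidence(support, freq["incidents"])             # about 23.6

# Using the reported confidences: both ratios are about 5.6, i.e. incidents
# are found next to bars far more often than their overall frequency suggests.
lift_7_1 = lift(40.0, freq["incidents"])
lift_7_2 = lift(24.1, freq["bars and restaurants"])
```

The two ratios are nearly identical because the measure is symmetric in antecedent and consequent; the choice between rules 7.1 and 7.2 therefore rests on the confidence values and on the interpretation given above.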
The following two rules show an association between incidents and two specific road classes:

incidents ⇒ main roads (2.2%; 30.4%)    (7.3)
incidents ⇒ minor roads (1.7%; 24.1%)    (7.4)

These rules show that incidents also occur in the neighbourhood of minor and main roads. We can combine both rules. In that case, we obtain the more general rule incidents ⇒ roads. The confidence of the more general rule is higher, and so is the relative frequency of the combined road classes. The decision whether to keep two separate rules or to combine them depends on the aim of the further analysis.

Until now, only rules with high confidence were introduced. Since we set the minsupport and minconfidence thresholds to zero, we also obtained several rules that describe negation between incidents and other geographical objects. Some representatives of those rules are:

motorway ⇒ incidents (0%; 2.9%)    (7.5)
incidents ⇒ kindergarten (0.1%; 1.7%)    (7.6)
incidents ⇒ water (0.4%; 5.7%)    (7.7)

Rule 7.5 shows that incidents on a motorway are not very common. However, this rule has no important meaning for this particular area, because the motorway passes only through a negligible part of the Helsinki center. A similar pattern can be seen in rule 7.6. The distribution of kindergartens in the center is rather sparse; consequently, the rule has no significant importance. An illustrative example of negation between incidents and a spatial object is rule 7.7. Although water covers more than 50% of the study area, the association between water and incidents is not prominent. Therefore, water does not have a strong impact on incidents.

7.2 Buffer approach

By creating buffers around incidents, 1547 transactions are generated. Since incidents appear in every transaction, all rules with incidents as the consequent become very strong, with a confidence of 100%. But those associations are not representative, because they are heavily influenced by the way the transactions are constructed.
Therefore, the rules obtained from the buffer approach only have an informative purpose. However, the information is valuable. We can compare those rules with the associations discovered by the grid method. Because both methods are independent of each other, the results are not correlated. Consequently, once an interesting association is detected by one method, its existence is confirmed when the same rule is discovered by the second method. The following interesting rules are detected by mining the buffer transaction file.

bars and restaurants ⇒ incidents (35.2%)    (7.8)
main roads ⇒ incidents (49.8%)    (7.9)
minor roads ⇒ incidents (67.7%)    (7.10)
water ⇒ incidents (15.1%)    (7.11)

The number in brackets represents the support of the rule.

7.3 General results

The goal of the case study is to demonstrate the use of the association rule mining technique on geographical data. Although this technique is originally designed for relational databases and uncorrelated data samples, it is possible to extend its applicability to more complicated spatial objects. After extensive data pre-processing, several interesting association rules are discovered from the two available databases.

The rules discovered by mining the transaction file of the grid approach describe the general behavior of the sample data. Moreover, the obtained transaction file contains all relations existing within the studied area. Therefore, those results represent the real relations among the studied objects and can be regarded as reliable information about the studied data.

The circular buffer describes the neighbourhood more exactly than a grid. However, in this study the buffer is created only around incidents. Therefore, the results are closely related only to the incidents. The transaction file does not contain all existing itemsets. The set of transactions is restricted to those containing incidents.
By mining this file, the existing relationships between incidents and other objects are correctly detected; however, the measures of the extracted rules cannot be compared to the results obtained by the grid approach.

Chapter 8

Discussion

This research demonstrates the possible use of association rule mining on the provided databases. Since the goal of this study is to introduce the basic concepts of the mining process, the presented application is kept simple. However, to obtain more exact results, some further improvement of the method is advisable. In this chapter, some possible improvements are mentioned. The chapter concludes with potential directions for further research.

8.1 Unsolved problems

This thesis covers the whole process of identifying relevant data for the improvement of a risk model. Since the process of data identification is very complex, we proposed a rather simple method. Although this method employs only basic operations, interesting results are obtained. However, some further work can be done to improve the identification of the object neighbourhood.

The grid approach depends on the grid granularity. To ensure better results, the best fitting grid has to be identified. Several grid sizes were tested on the study area; however, this task is very time consuming. Although this approach is very simple, its application has many disadvantages. The explicitly given division of the space causes problems especially on the edges of a cell. Figure 8.1 shows an illustrative example. An incident is located in grid cell number 3391. The only identified object in the incident neighbourhood is a main road. Although a bar or restaurant is located closer to the incident and probably has a stronger influence on its occurrence, it is not detected, because it is situated in a different grid cell.

Figure 8.1: The neighbourhood problem on the edges of a cell. (The figure shows the grid cells 3483 to 3486, 3390 to 3393 and 3297 to 3300.)
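The edge problem can be reproduced with three made-up points. In the sketch below, the coordinates are hypothetical and chosen so that the bar lies 2 m from the incident but across a cell boundary, while the road point lies about 48 m away inside the same cell; the helper `grid_cell` is likewise an illustration and not part of the thesis.

```python
import math

CELL_SIZE = 50.0  # metres, as in the grid approach


def grid_cell(point):
    """(column, row) index of the 50 m cell containing the point."""
    x, y = point
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))


def distance(p, q):
    """Euclidean distance between two points in metres."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical coordinates in metres.
incident = (99.0, 25.0)
bar = (101.0, 25.0)        # 2 m away, but in the next cell to the right
main_road = (55.0, 45.0)   # about 48 m away, in the same cell

same_cell_road = grid_cell(incident) == grid_cell(main_road)  # True
same_cell_bar = grid_cell(incident) == grid_cell(bar)         # False
```

With a 50 m buffer around the incident, both the bar and the road would be selected, which illustrates why the buffer describes the neighbourhood more exactly than the grid.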
Compared to the grid approach, the buffer approach is more flexible for further development. Besides changing the size of the neighbourhood, we can also modify its shape. The currently used buffer has a regular shape. The circular buffer is very simple to define, but it does not always identify the neighbouring objects properly. In Figure 8.2 a), every object located within the highlighted buffer is considered a neighbour of the incident situated in the center. Therefore, water and a minor road have been identified as factors influencing the incident. It is obvious that the incident happened closer to the minor road than to the sea. There is also a high probability that the sea has no impact on the incident. By restricting the identification of the neighbourhood to the minor road only (Figure 8.2 b)), the resulting associations become more accurate. This can be done by defining some basic constraints that allow the selection of only certain object types according to the position of the incident or according to a specific relevance between objects.

Figure 8.2: Reduction of selected neighbourhood objects according to the location of an incident. Image a) represents the neighbourhood detected by the buffer approach. The possible change of the neighbourhood is illustrated in image b), where the water area is omitted.

8.2 Further research

Starting from the beginning of the process, more automated tools for data retrieval from geographical databases can be developed. The tools should be able to acquire the relevant data together with their spatial and non-spatial properties. The implementation of those tools is probably case and data dependent.

Once the data are available, topological relations can be identified directly from their spatial attributes. The topological relations in this thesis are represented by a simple regular neighbourhood.
The neighbourhood relations can also be defined by the neighbourhood graphs and paths introduced by Ester et al. in [Ester et al., 1998]. This definition reduces the processing time and can be implemented using relational tables and indexes.

When the neighbourhood does not define spatial relations sufficiently, more exact topological relations among objects can be identified in order to obtain more accurate results. Koperski and Han [Koperski and Han, 1995] arrange several topological relations into a hierarchical tree, illustrated in Figure 8.3.

Figure 8.3: The hierarchy of topological relations [Koperski and Han, 1995]. (Node labels in the figure: g_close_to, not_disjoint, close_to, intersects, adjacent_to, inside, covered_by, equal, contains, covers.)

The hierarchical structure can also be utilized for geographical objects. In this thesis, a similar classification was made only for the road network. However, the defined road classes are not linked together and each of them carries its own object characteristics. Gathering similar objects into general groups and classifying them hierarchically according to their non-spatial attributes not only reduces the processing time but also gives more exact results. A possible set of object hierarchies can be seen in Figure 8.4, which represents a division of an urban area and of a road network.

Figure 8.4: Object hierarchies of urban areas and roads.

Moreover, hierarchical structures of objects and their spatial relations enable multiple-level association rule mining [Koperski and Han, 1995] and [Malerba et al., 2001]. By multiple-level association mining, we can discover rules that express a certain association on different levels of detail. For instance, rule 8.1 indicates an association among different hierarchical levels of topological relations (intersects, close_to) from the hierarchical tree in Figure 8.3 and objects (large town, national motorway) from the object hierarchies in Figure 8.4.
is_a(X, large town) ∧ intersects(X, national motorway) ⇒ close_to(X, us boundary) (72%)    (8.1)

In the previous text, possible improvements of the used method and ideas for further development are outlined. In order to stress their importance, they are briefly summarized:

1. Adaptation of the neighbourhood area for obtaining more accurate results.
2. Implementation of automated data retrieval from a geographical database.
3. Hierarchical structure of objects according to their non-spatial attributes.
4. Hierarchical structure of spatial relations among neighbouring objects.
5. Multiple-level association mining.

Chapter 9

Conclusion

This research outlines the basic concepts of spatial data mining (SDM), which is a rapidly developing area of spatial data analysis. SDM provides techniques for discovering unexpected patterns in large geographical databases. Those techniques benefit from, e.g., database management, spatial statistics and artificial intelligence. Although this discipline brings new possibilities, it also faces many challenging research problems, especially related to the characteristics of spatial data. To obtain relevant results, spatial autocorrelation and spatial heterogeneity have to be taken into consideration.

Previous research in this area identified various SDM techniques. Although all techniques have the same goal, i.e. discovering information not explicitly stored in the database, the way of obtaining this information differs. After studying the behavior of the best known SDM techniques, this thesis concentrates on association rule mining in more detail. The discovery of spatial associations may detect interesting relationships among spatially distributed objects. Therefore, this technique is well suited for analyses that deal with the identification of factors that may have an impact on the occurrence of a certain phenomenon.
To test whether association rule mining can be used in risk management, this thesis shows how to identify factors that possibly influence the location of incidents within the Helsinki city center. However, the detection of the relevant factors is very complex and requires reasonable knowledge of the available databases and of the area. No new application is implemented; the Gnome Data Mine tool, based on the program originally designed for classical data mining by Borgelt [Borgelt and Kruse, 2002], was used for the analysis.

We provide a study that covers the whole process of association rule mining, from the identification of the data to the evaluation of potentially interesting rules. In order to keep the process simple, the only subjects to mine are spatial relations of objects represented by their neighbourhood. Two different approaches are introduced for the neighbourhood definition, the grid and the buffer approach. The biggest advantage of the first one is its simplicity and applicability to any kind of data. However, this approach is heavily dependent on the granularity of the grid. Moreover, the distribution of the data has no effect on how the neighbourhood is defined. This problem is solved by applying the buffer approach, where the definition of the neighbourhood originates from the object locations. Therefore, the extracted relations are more exact. Compared to the grid approach, the buffer approach is more flexible and open to further improvements.

Even though this research is based on simple and not always flexible operations, it describes the whole process of association rule mining. We are aware that the proposed process does not offer a general solution applicable to any geographical database. Moreover, the identification of the object neighbourhood is rather cumbersome and time consuming; therefore, the amount of data in the case study is heavily restricted. However, the process is open to further improvements.
Some of them are proposed in the Chapter 8. Although spatial data mining does not yet belong to the most commonly used spatial data analyzes, it was found useful for exploring enormous amounts of data stored in geographical databases. In this research, possible use of association rule induction, one of the most commonly known technique of spatial data mining, was demonstrated. Its application was found effective for detecting strong relationships among geographical objects. References [Agrawal and Srikant, 1994] Agrawal R. and Srikant R., Fast Algorithms for Mining Association Rules, proceedings of 20th International conference on Very Large Databases, 1994 [Agrawal et al., 1993] Agrawal R., Imielinski T. and Swami A., Mining association rules between sets of items in large databases, proceedings of ACMSIGMOD International Conference Management of Data, pages 207-016, 1993 [Alliniemi J., 1994] Alliniemi J., Threats and possibilities - a way to study accidents and their effects, 1994, in Finnish [Bédard et al., 2001] Bédard Y., Merrett T. and Han J., Fundamentals of spatial data warehousing for geographic knowledge discovery, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0, 2001 [de Berg et al., 2000] de Berg M., van Kreveld M., Overmars M. and Schwarzkopf O., Computational Geometry Algorithms and Applications, Chapter 2, Springer-Verlag, ISBN 3-540-65620-0, 2000 [Borgelt, 2005] Borgelt Ch., Apriori, Finding Association Rules/Hyperedges with the Apriori Algorithm, URL: http://fuzzy.cs.unimagdeburg.de/ borgelt/doc/apriori/apriori.html (accessed 19.1.2005) [Borgelt and Kruse, 2002] Borgelt Ch. 
and Kruse R., Induction of Association Rules: Apriori Implementation, proceedings of 14th Conference on Computational Statistics, 2002 60 REFERENCES 61 [Demšar, 2004] Demšar U., Exploring geographical metadata by automatic and visual data mining, Licentiate Thesis, Royal Institute of Technology, Stockholm, 2004 [Ester et al., 1998] Ester M., Frommelt A., Kriegel H.-P., Sander J., Algorithms for Characterization and Trend Detection in Spatial Databases, proceedings of 4th International Conference on knowledge Discovery and Data Mining, pages 44-50, 1998 [Ester et al., 1997] Ester M., Kriegel H.-P. and Sander J., Spatial Data Mining: A Database Approach, proceedings of 5th International Symposium on Advances in Spatial Databases, pages 47-66, 1997 [Ester et al., 2001] Ester M., Kriegel H.-P. and Sander J., Algorithms and applications for spatial data mining, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-41523369-0, 2001 [Estivill-Castro and Lee, 2001] Estivill-Castro V. and Lee I., Data Mining Techniques for Autonomous Exploration of Large Volumes of Georeferenced Crime Data, proceedings 6th International Conference on Geocomputaion, GeoComputation CD-ROM, ISBN 1864995637, 2001 [Gnome, 2005] URL: http://www.togaware.com/datamining/gdatamine/ gdmapriori.html (accessed 22.3.2005) [Han, 1999] Han J., Data Mining, in Encyclopedia of Distributed Computing, Urban J. and Dasgupta P. (eds.), Kluwer Academic Publishers,1999 [Han et al., 2001] Han J., Kamber M. and Tung A. K. H., Spatial clustering methods in data mining, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0, 2001 [Helokunnas, 1995] Helokunnas T., Object-Oriented Approaches Applied to GIS Development, Acta Polytechnica Scandinavica, Mathematics and computing in engineering series No. 
75, 1995

[Hutchinson, 1991] Hutchinson, The Hutchinson Encyclopedic Dictionary, Helicon, ISBN 0091749980, 1991

[Ihamäki, 2000] Ihamäki V.-P., Geographic information in the planning of rescue services, The Emergency Services College, Espoo, 2000

[Klemettinen et al., 1994] Klemettinen M., Mannila H., Ronkainen P., Toivonen H. and Verkamo A. I., Finding Interesting Rules from Large Sets of Discovered Association Rules, proceedings of 3rd International Conference on Information and Knowledge Management, pages 401-408, 1994

[Koperski and Han, 1995] Koperski K. and Han J., Discovery of Spatial Association Rules in Geographic Information Databases, proceedings of 4th International Symposium on Large Spatial Databases, pages 47-66, 1995

[Koperski et al., 1998] Koperski K., Han J. and Adhikary J., Mining Knowledge in Geographical Data, accepted by IEEE Computer, 1998

[Koperski et al., 1996] Koperski K., Adhikary J. and Han J., Spatial Data Mining: Progress and Challenges, in SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996

[Kraak and Ormeling, 2003] Kraak M.-J. and Ormeling F., Cartography, Visualization of Geospatial Data, Second edition, Prentice Hall, ISBN 0-13-088890-7, 2003

[Krisp et al., 2005] Krisp J. M., Virrantaus K. and Jolma A., Using Explorative Spatial Analysis to Improve Fire and Rescue Services, proceedings of 1st International Symposium on Geo-information for Disaster Management, 2005

[Lonka, 1999] Lonka H., Risk Assessment Procedures Used in the Field of Civil Protection and Rescue Services in Different European Union Countries and Norway, prepared in the framework of EU co-operation in the field of civil protection, 1999

[Malerba et al., 2001] Malerba D., Esposito F.
and Lisi F. A., Mining Spatial Association Rules in Census Data, proceedings of the Joint Conferences on New Techniques and Technologies for Statistics and Exchange of Technology and Know-how, pages 541-550, 2001

[Mannila, 2002] Mannila H., Local and Global Methods in Data Mining: Basic Techniques and Open Problems, proceedings of 29th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, pages 57-68, Springer-Verlag, 2002

[Miller and Han, 2001] Miller H. J. and Han J., Geographic data mining and knowledge discovery, An overview, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0, 2001

[Shekhar and Chawla, 2003] Shekhar S. and Chawla S., Introduction to Spatial Data Mining, in Spatial Databases: A Tour, Prentice Hall, ISBN 0-13-017480-7, 2003

[Shekhar et al., 2003] Shekhar S., Zhang P., Huang Y. and Vatsavai R., Trends in Spatial Data Mining, to appear in Data Mining: Next Generation Challenges and Future Directions, Kargupta H., Joshi A., Sivakumar K. and Yesha Y. (eds.), AAAI/MIT Press, 2003

[Togaware, 2005] Togaware, URL: http://www.togaware.com/index.html (accessed 22.3.2005)

[YTV, 2005] Helsinki Metropolitan Area Council, Developing Regional Data Utility, URL: http://www.ytv.fi/english/data/index.html (accessed 15.3.2005)

Appendix A

Cells containing railway

Appendix B

Sample of the railway data in the text format

"Id" 775 868 869 871 941 942 962 964 965 1035 1036 1037 1055 1056 1058 1130 1131 1132 1142
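
As a rough illustration of how an Id list like the one in Appendix B could feed association rule mining, the sketch below parses the text format into a set of grid cells that contain a railway and computes the support and confidence of one rule, "railway -> incident". It is only a sketch under stated assumptions: the railway ids are the first ten values of the sample above, while the incident cell ids, the total number of cells, and the function names `parse_id_list` and `rule_metrics` are hypothetical and do not come from the thesis data or its implementation.

```python
# Minimal sketch, not the thesis implementation.
# Railway ids: first ten values of the Appendix B sample.
# Incident ids and the cell count below are invented for illustration.

RAILWAY_TEXT = '"Id" 775 868 869 871 941 942 962 964 965 1035'


def parse_id_list(text):
    """Return the set of integer cell ids, skipping the quoted "Id" header."""
    return {int(token) for token in text.split() if token.isdigit()}


def rule_metrics(antecedent_cells, consequent_cells, n_cells):
    """Support and confidence of the rule antecedent -> consequent.

    support    = cells containing both / all cells
    confidence = cells containing both / cells containing the antecedent
    """
    both = antecedent_cells & consequent_cells
    support = len(both) / n_cells
    confidence = len(both) / len(antecedent_cells) if antecedent_cells else 0.0
    return support, confidence


railway_cells = parse_id_list(RAILWAY_TEXT)
incident_cells = {868, 869, 941, 1200, 1201}  # hypothetical incident cells
N_CELLS = 2000                                # hypothetical grid size

support, confidence = rule_metrics(railway_cells, incident_cells, N_CELLS)
print(f"railway -> incident: support={support:.4f}, confidence={confidence:.2f}")
```

Working on sets of cell ids keeps the sketch independent of any particular Apriori implementation; a full analysis would pass one transaction per cell, listing every feature present in it, to such a tool.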
