HELSINKI UNIVERSITY OF TECHNOLOGY
Department of Surveying
Institute of Cartography and Geoinformatics
Věra Karasová
Spatial data mining as a tool for
improving geographical models
Master’s Thesis submitted in partial fulfillment of the requirements for the
degree of Master of Science in Technology.
Espoo, May, 2005
Supervisor:
Prof. Kirsi Virrantaus
Instructor:
M.Sc. (Tech.) Jussi Ahola, Lic. Tech. Jukka Matthias Krisp
HELSINKI UNIVERSITY
OF TECHNOLOGY
ABSTRACT OF THE
MASTER’S THESIS
Author:
Věra Karasová
Title:
Spatial data mining as a tool for improving geographical models
Date:
May, 2005
Department:
Department of Surveying
Number of pages: 63 + 2
Professorship: Maa-123 Cartography and Geoinformatics
Supervisor:
Prof. Kirsi Virrantaus
Instructor:
M.Sc. (Tech.) Jussi Ahola, Lic. Tech. Jukka M. Krisp
Spatial data mining is a new and rapidly developing technique for analyzing geographical data. In this master’s thesis, the usability of the technique is examined
for the improvement of an existing geographical model regarding rescue operations. The main focus of spatial data mining is set on the discovery of interesting
patterns of information embedded in large geographical databases. Due to its
ability to operate without a previously formulated hypothesis, spatial data mining is becoming a popular tool for spatial data analysis.
After a short explanation of the best known spatial data mining techniques, this
thesis concentrates on association rule mining in more detail. Discovered spatial
association rules may detect useful relationships among spatially distributed objects. Once the relations are identified, the existing spatial model can be extended
by the variables with the strongest relations to the modeled phenomenon.
The behavior of association rule mining is studied by applying it to sample data
representing incident locations within the Helsinki city center. The core data is
provided by the Fire and Rescue department in Espoo. To observe the interaction of
an incident with its neighbourhood, information on geographical objects situated
within the study area is obtained from the SeutuCD geographical database.
Although spatial data mining does not yet belong to the most commonly used
spatial data analysis methods, it was found effective for detecting strong relationships
among geographical objects.
Key words: knowledge discovery from databases, spatial data mining, association rules, risk model
Acknowledgements
I would like to thank the Ministry of Agriculture and Forestry in Finland for
financially supporting this research project and therefore giving me the opportunity to finish my Master’s degree at HUT.
Many thanks go to my brilliant supervisor, Professor Kirsi Virrantaus, for her
encouragement and guidance throughout my studies in Finland. Her open,
family-like manner and the patience with which she always listened carefully
to all my troubles and problems (not always study related) made my time in
Finland easier and unforgettable.
My gratitude also goes out to Jussi Ahola for familiarizing me with the concepts
of data mining and for contributing valuable comments to my thesis, as well as
to Jukka Matthias Krisp for endless debates on disaster management and risk
assessment procedures.
I also want to thank all my colleagues from the Institute of Cartography and
Geoinformatics for an inimitable working atmosphere and their friendship.
My time in Finland would never have been fulfilled without the extensive care
of my boyfriend Huib.
Finally I would like to express my greatest thanks to my dearest parents and
other members of our family, who have been always there with their immense
support!
Espoo, May 2005
Věra Karasová
Contents
Abbreviations
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Research objectives
  1.3 Research approach
  1.4 Structure of the thesis
2 Literature survey
  2.1 Definition of SDM and KDD
  2.2 Spatial data characteristics
  2.3 Spatial data mining techniques
    2.3.1 Clustering and outlier detection
    2.3.2 Association and co-location
    2.3.3 Classification
    2.3.4 Trend detection
3 Association rules and geographic data
  3.1 Spatial association rules
    3.1.1 Definition
  3.2 Apriori algorithm
    3.2.1 Discovering large itemsets
    3.2.2 Extraction of association rules
  3.3 Evaluation of the rules
  3.4 Mining multivariate associations using clustering
4 Disaster management in Finland
  4.1 Risk assessment procedure
  4.2 General model
  4.3 Risk model in the city of Espoo
5 Data
  5.1 Dataset of incidents
  5.2 SeutuCD
  5.3 Data for the case study
6 Method
  6.1 Process description
  6.2 Data pre-processing
    6.2.1 Grid approach
    6.2.2 Buffer approach
  6.3 Transformation to transaction format
    6.3.1 Grid data integration
    6.3.2 Buffer data integration
  6.4 Mining association rules
    6.4.1 Constraints definition
7 Results
  7.1 Grid approach
  7.2 Buffer approach
  7.3 General results
8 Discussion
  8.1 Unsolved problems
  8.2 Further research
9 Conclusion
A Cells containing railway
B Sample of the railway data in the text format
Abbreviations
DW   Data Warehouses
KDD  Knowledge Discovery from Databases
LR   Linear Regression
SAR  Spatial Autoregressive Regression
SDM  Spatial Data Mining
YTV  Helsinki Metropolitan Area Council
List of Figures
2.1 Co-location patterns
2.2 Trend detection
3.1 Apriori Algorithm
3.2 Vertical-view approach
4.1 Risk classification process
5.1 Map of the study area
5.2 Data representing the study area
6.1 Schema of the association rule mining process
6.2 Grid cells numbering
6.3 Cell evaluation
6.4 Buffer zones around incidents
6.5 Integration algorithm for grid data
6.6 Integration algorithm for buffer data
8.1 Problems with grid division of the space
8.2 Reduction of selected objects
8.3 The hierarchy of topological relations
8.4 Object hierarchies
List of Tables
3.1 Example of basket data
3.2 4 x 4 relational table
7.1 Relative frequencies of all object types for grid approach
Chapter 1
Introduction
We are often interested in analyzing complex situations to more precisely predict the effect of some spatial phenomenon. Once its behavior is approximated
by a model, the spatial phenomenon can be understood more correctly. However, currently used spatial models are usually created in a very simple way
and represent only the general trend. To give the model a more realistic form,
advanced methods for spatial data analysis should be employed. When a more
accurate representation of a spatial phenomenon exists, more can be discovered about its possible impact.
Recently, the number of natural and man-made disasters has increased. Therefore, actions concentrating on the prediction and assessment of possible consequences for nature as well as human lives are becoming more important. Consequently, principal changes to the existing risk models for rescue operations
are essential. Due to the fast development of geo-information technologies, a
variety of new opportunities arises, and more accurate analyses can be
performed on spatial data. In this research the possible use of spatial data mining (SDM) methods is investigated for identifying factors that may influence
occurrences of incidents within the Helsinki city center.
1.1 Background
Traditional statistical analysis tools have difficulties handling the huge
volumes of data collected in recent years. Moreover, statistical methods require a broader knowledge of the test data in order to define a principal hypothesis
for the analysis. As a consequence, the analyses are becoming more expensive
and time consuming. Therefore, classical statistics becomes an unsuitable tool for analysis performed in data-rich environments.
[Miller and Han, 2001], [Shekhar and Chawla, 2003], [Mannila, 2002]
Data mining is introduced as a discipline concentrating on the manipulation
of extensive databases. The main goal of data mining is to search for deeply
hidden information that can be turned into knowledge for strategic decision
making and for answering fundamental research questions [Miller and Han, 2001].
Due to its ability to extract implicit knowledge without any a priori stated
hypothesis, data mining is becoming a popular tool.
However, collected data is not always randomly distributed, independent or
stored in relational databases. The core question regarding SDM is how to
deal with the complex characteristics and spatial relations embedded in geographical databases. [Shekhar and Chawla, 2003] Although the requirements
of SDM often differ from classical data mining in principle, some of the SDM
researchers try to adjust classical data mining techniques instead of designing
new algorithms.
Each SDM technique is developed for the analysis of different spatial phenomena. The most often used SDM methods, such as clustering, trend detection and
classification, are derived from spatial statistics. The only method that is not
yet commonly used for geographical data analysis is association rule mining, which identifies information that is not explicitly stored about unexpected and
possibly useful relationships ([Ester et al., 2001], [Koperski and Han, 1995],
[Miller and Han, 2001] etc.). This thesis concentrates on the application of spatial association rule mining to detect interesting spatial relationships among
geographical objects.
1.2 Research objectives
Throughout this thesis, the existing knowledge about SDM and its most commonly used techniques is discussed. Since the amount of scientific literature
in this area is very limited, this thesis should serve as a survey of the research that has recently been done in this field.
The biggest challenge of SDM is posed by the spatial attributes of geographic data.
Every object situated in geographical space is always related to other objects.
This fact should be tracked and recorded in an appropriate place in the geographical database. However, every geographical database keeps this record
in a different format. Due to the variety and complexity of those records, applications implemented for SDM analysis are mostly case dependent. It is not
always feasible to design a new SDM tool; sometimes a small but well-considered
modification of the data can enable the use of an already existing application.
This thesis intends to demonstrate the use of the Gnome Data Mine tool, originally
implemented for classical data mining by Borgelt [Borgelt and Kruse, 2002],
on geographical data.
To test whether SDM is a useful method for analyzing spatial data stored
in an extensive geographical database, this research continues with the application of association rule mining to a case study. The core of the data selected
for the case study consists of records of incidents located within
the Helsinki city center. The goal of the case study is to discover the possible
influence of geographical objects on incidents. Therefore, the only subjects to
mine are the spatial relationships among the selected objects. Since those relations are not known in advance, they first have to be identified.
Many operations have to be carried out before association
rule mining can be applied. The main goal of the case study is to present a
solution covering the whole chain of operations that are necessary for obtaining the desired results. It must be emphasized that the complexity of this
process and the detailed description of each of its steps are of main importance to
this thesis.
Although SDM is not based on any a priori given hypotheses, a general idea
about the aim of the research should be known. Such an assumption facilitates
the whole process and leads to the discovery of valuable information. At first,
the core objects the study relates to should be identified and extracted from
the provided databases. Once the amount of data is restricted, the identification
of spatial relationships is faster and less expensive. This thesis offers two
methods by which the spatial relations can be derived from the geographical
coordinates of the selected objects.
By applying association rule mining to the extracted relations, some interesting dependencies among the selected objects are discovered. The knowledge of
those dependencies enables a more accurate selection of variables for improving
currently used geographical models.
1.3 Research approach
The research is based on a literature survey which identifies the core concepts
and methods of SDM. This background is a starting point for further theoretical and conceptual analysis. To test the interaction of SDM with real data,
association rule mining is applied in a case study. Because the implementation
of SDM tools is usually data dependent, there is no generally applicable ready-made software. However, various programs for classical data mining exist.
The aim of the case study is to test the possible use of classical data mining
software for geographical data analysis, thereby easing the problems
connected to program implementation. On the other hand, the use of classical
data mining methods requires extensive data pre-processing. The case study
represents a constructive research approach.
1.4 Structure of the thesis
Previously conducted research related to the topic is described in the following
Chapter 2. It contains the definition of SDM and introduces some of the main
techniques and algorithms. A more detailed description of the association rule
mining including a definition of the most commonly used algorithm is given
in Chapter 3. Chapter 4 explains the risk assessment procedure and currently
used risk model of Fire and Rescue services in Espoo. The Fire and Rescue
services provided a database of incidents, which is described together with the
SeutuCD geographical database in Chapter 5. This chapter also contains a
description of the data selected for the case study. Chapter 6 illustrates two
methods for data pre-processing. In addition, the whole process of association
rule mining is presented in detail and demonstrated on the selected data. The
results are evaluated in Chapter 7, where the most interesting association
rules are also listed. The unsolved problems and ideas for further development
are discussed in Chapter 8. The work is concluded in Chapter 9.
Chapter 2
Literature survey
This chapter outlines the theoretical background of the research. A general
overview of Knowledge Discovery from Databases (KDD) and Spatial Data
Mining (SDM) is provided. Since SDM deals with geographical data, their
typical characteristics are described later in this chapter. This chapter concludes with a possible classification of SDM techniques.
2.1 Definition of SDM and KDD
Due to advanced data collection techniques such as remote sensing, census data
acquisition, and weather and climate monitoring, contemporary geographical
datasets contain an enormous amount of data of various types and attributes.
Analyzing this data is challenging for traditional data analysis methods which
are mainly based on extensive statistical operations. Since classical data mining methods enable us to detect valuable information from extensive relational
databases, SDM can be an appropriate technique for detecting possible interesting patterns in geographical datasets. Spatial data mining is a knowledge
discovery process of extracting implicit interesting knowledge, spatial relations,
or other patterns not explicitly stored in databases. [Koperski et al., 1996],
[Koperski and Han, 1995]
Knowledge discovery from databases is a complex concept integrating several
research fields including machine learning, database systems, statistics, visualization etc. Data mining is a core component of the KDD process. The
KDD process assumes that interesting and unexpected patterns in very large
databases are deeply hidden and often difficult or impossible to specify a priori. Consequently, traditional database queries and statistical methods do not
reveal any implicit information from a large database. KDD is a tool for exploring domains that are too difficult to perceive with unaided human abilities
[Miller and Han, 2001].
2.2 Spatial data characteristics
Extracting implicit information from geographical databases appears, in comparison to traditional non-spatial databases, to be more challenging. Together
with non-spatial attributes, spatially referenced objects also carry information concerning their representation in space by geometrical and topological
properties. [Koperski et al., 1996] Topology covers the geographical properties which are not closely connected to the actual position of objects, i.e. it
represents the spatial relationships among objects. [Helokunnas, 1995] According to [Hutchinson, 1991] the topology is a branch of geometry that deals
with those properties of a figure (object) that remain unchanged even when
the figure is transformed. On the other hand, the geometric characteristics of data
concern information related to the actual location of the object in space.
[Kraak and Ormeling, 2003] The location is usually described by Euclidean
coordinates or by latitude and longitude.
Besides the core spatial characteristics dealing with geometry and topology,
geographical data also contains information about the behavior of the phenomenon the data represents. In particular, the notion of spatial autocorrelation is
fundamental to any spatially related operation. [Shekhar et al., 2003] Ignoring
the fact that nearby items tend to be more similar than items situated farther
apart causes inconsistent results in spatial data analysis.
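Spatial autocorrelation can be quantified in several ways. The sketch below uses Moran's I, a standard measure from spatial statistics that this thesis does not itself introduce, so it should be read as an illustration only. The function name and the binary neighbourhood matrix are assumptions made for the example.

```python
def morans_i(values, weights):
    """Moran's I for a list of attribute values.

    weights[i][j] is the spatial weight between locations i and j
    (here assumed to be 1 for neighbours and 0 otherwise).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only for neighbouring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Four locations along a line; each is a neighbour of the adjacent ones.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]

clustered = morans_i([1, 1, 5, 5], chain)    # similar values lie side by side
alternating = morans_i([1, 5, 1, 5], chain)  # neighbours always differ
```

A positive value indicates that nearby items tend to be similar, a negative value that they tend to differ.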
Another important characteristic of geographic data is spatial heterogeneity.
Spatial data is not identically distributed in space, therefore data properties
are location dependent. It is possible that local trends can sometimes contradict the global trends. [Shekhar and Chawla, 2003] In other words, global
parameters estimated from a geographic database do not sufficiently describe
the geographic phenomenon at any particular location. [Miller and Han, 2001]
Due to the diversity of spatial data, the composition of geographical databases
is crucial. Moreover, the data integration process has to deal with very complicated data transformations, because the collected data often come from different sources. [Bédard et al., 2001] Therefore, good database design makes it
possible to analyze geographical data with maximum processing efficiency
while at the same time respecting their spatial characteristics.
2.3 Spatial data mining techniques
There is no unique way of classifying SDM techniques. Various kinds of patterns can be discovered from databases and can be presented in different forms.
The categorization often depends on the background field of a particular researcher. A person interested in data visualization will
probably base the classification on various visualization
techniques, whereas a computer science researcher might see the main distinction
in the utilization of different algorithms. An illustrative overview of the various
possibilities for classifying data mining techniques is given in [Demšar, 2004].
Based on [Han, 1999], general data mining tasks can be classified into two
main categories: descriptive data mining and predictive data mining. The
former concisely describes the behavior of datasets and presents interesting
general properties of the data, whereas the latter attempts to construct models that help to predict the behavior of new datasets. Forecasting
an employee’s potential salary based on the salary distribution of similar employees can be seen as an example of a predictive data mining task, while
descriptive methods may be used for comparing the sales of a European
and an Asian branch of a certain company.
Ester [Ester et al., 1997] divides spatial data mining techniques into four general groups: spatial association rules, spatial clustering, spatial trend detection
and spatial classification. The categorization is based on the KDD algorithms.
According to Shekhar and Chawla [Shekhar and Chawla, 2003], the three least
controversial techniques are classification, clustering and association
rules. However, some of those algorithms can be accompanied by supporting
methods. For example, to identify so-called hot spots, which are areas
of high activity within a large area of low activity, a clustering technique is performed together with outlier detection. Similarly, the
basic idea of co-location is derived from the spatial association technique.
In this thesis, the particular spatial data mining techniques are organized
as a combination of Ester’s and Shekhar and Chawla’s categorizations:
• clustering and outlier detection
• association and co-location
• classification
• trend detection
2.3.1 Clustering and outlier detection
Spatial clustering is a process of grouping a set of spatial objects into groups
called clusters. Objects within a cluster show a high degree of similarity,
whereas the clusters are as dissimilar as possible. [Ester et al., 2001],
[Shekhar et al., 2003]
Unlike classification, clustering is an unsupervised process. This means that
clustering does not rely on predefined class labels or an a priori given number of classes. [Han et al., 2001] Clustering is a very well-known technique in
statistics, and the role of data mining is to scale a clustering algorithm to deal
with large geographical datasets. [Shekhar and Chawla, 2003]
Clustering algorithms can be separated into four general categories: partitioning method, hierarchical method, density-based method and grid-based
method. [Han et al., 2001] The categorization is based on different cluster
definition techniques.
Partitioning method
A partitioning algorithm organizes the objects into clusters such that the total
deviation of each object from its cluster center is minimized. [Han et al., 2001]
At the beginning, the objects are divided into an initial set of clusters. In the next steps,
the data points are iteratively reallocated among the clusters until a stopping criterion is met. Because each object is assigned to the cluster with the nearest center, this
method tends to find clusters of spherical shape. [Shekhar and Chawla, 2003]
K-Means and K-Medoids are commonly used fundamental partitioning algorithms. The K-Medoids method uses the most centrally located object in the
cluster as the cluster center. Some of the recent algorithms that are based on
the K-Medoids method are Partitioning Around Medoids (PAM), Clustering
LARge Applications (CLARA) and Clustering Large Applications based
upon RANdomized Search (CLARANS). [Han et al., 2001]
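The iterative reallocation idea can be sketched in a few lines of Python. This is a simplified alternating K-Medoids variant, not PAM, CLARA or CLARANS themselves. The function names, the naive choice of the first k points as initial medoids, and the sample coordinates are assumptions made for illustration; the points are assumed to be distinct.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def k_medoids(points, k, max_iter=100):
    """Group distinct 2-D points around k medoids (simplified sketch)."""
    medoids = list(points[:k])  # assumption: naive initialisation
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(p, m))
            clusters[nearest].append(p)
        # Update step: the most centrally located member becomes the medoid.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):  # stopping criterion
            break
        medoids = new_medoids
    return clusters


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = k_medoids(sample, k=2)
```

The returned dictionary maps each medoid to the points of its cluster. Because every cluster center is an actual data object, the result is less sensitive to extreme values than a mean-based center would be.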
Hierarchical method
These clustering methods hierarchically decompose the dataset by splitting or
merging clusters until a stopping criterion is met. The result of the decomposition is a dendrogram, a tree which can be formed in two ways: "bottom-up" or "top-down". The bottom-up approach, also called the agglomerative
approach, starts with each object forming a separate group. In every step,
groups are successively merged until all of them congregate into one, the
topmost level of the hierarchy. In the top-down approach, also called divisive,
all objects are at the beginning united into one general cluster. In every iteration, a cluster is split into smaller ones, until eventually each object
forms a separate cluster. Some of the recently used hierarchical clustering algorithms are Balanced Iterative Reducing and Clustering using Hierarchies
(BIRCH) and Clustering Using REpresentatives (CURE). [Han et al., 2001]
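A minimal sketch of the bottom-up (agglomerative) approach is given below. It uses single-linkage distance as the merge criterion, which is only one of several possible choices and is an assumption of this example; it is not BIRCH or CURE. The function names and the `target_clusters` stopping criterion are likewise hypothetical.

```python
import math


def single_link(a, b):
    """Smallest distance between any member of cluster a and any member of b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until target_clusters remain.

    Returns the final clusters and the list of merges, which
    together describe the dendrogram from the bottom up.
    """
    clusters = [[p] for p in points]  # each object starts as its own group
    merges = []
    while len(clusters) > target_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


final, history = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)],
                               target_clusters=2)
```

Running the function with `target_clusters=1` continues merging up to the topmost level of the hierarchy.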
Density-based method
The method regards clusters as dense regions of objects that are separated by
regions of low density (representing noise). In contrast to partitioning methods, clusters of arbitrary shape can be discovered. Density-based methods
can also be used to filter out noise and outliers. An example of a density-based algorithm is DBSCAN, a density-based clustering method based on connected regions
with sufficiently high density. [Han et al., 2001]
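The following is a compact, unoptimized sketch of the DBSCAN idea. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum number of points in a dense neighbourhood, counting the point itself) follow common usage and are assumptions here, as are the sample coordinates. A label of -1 marks noise.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep growing
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

In the sample data the two squares of points form two clusters, while the isolated point at (50, 50) is labelled as noise.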
Grid-based method
Grid-based clustering algorithms first quantize the clustering space into a finite
number of cells and then perform the required operations on the grid structure.
Cells that contain more than a certain number of points are treated as dense.
The main advantage of the approach is its fast processing time, since the time
is independent of the number of data objects and depends only on the number of
cells. Some of the grid-based algorithms are STatistical INformation Grid
(STING) and CLustering In QUEst (CLIQUE). [Shekhar and Chawla, 2003],
[Han et al., 2001]
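The quantization step can be illustrated as follows. This sketch only identifies dense cells; it is not an implementation of STING or CLIQUE. The function name, the square cell shape and the threshold parameter are assumptions made for the example.

```python
from collections import Counter


def dense_cells(points, cell_size, min_points):
    """Quantize 2-D points into square cells and return the dense ones.

    A cell is treated as dense when it contains more than min_points points.
    """
    counts = Counter((int(x // cell_size), int(y // cell_size))
                     for x, y in points)
    return {cell for cell, n in counts.items() if n > min_points}


pts = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1), (3.2, 3.3)]
dense = dense_cells(pts, cell_size=1.0, min_points=2)
```

After this single pass over the data, all further operations work on the cell counts alone, which is why the processing time depends on the number of cells rather than on the number of objects.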
A clustering method is sometimes accompanied by outlier detection. The
goal of outlier detection is to discover a small subset of data points, which
are often viewed as noise, error, deviations or exceptions. A spatial outlier
is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial
neighborhood [Shekhar and Chawla, 2003]. The identification of global outliers can lead to the discovery of unexpected knowledge and has a number of
practical applications, including transportation, public safety and climatology.
For example, outlier detection and clustering techniques can help to discover
areas named hot spots, which may represent, for instance, areas of high crime
density. [Shekhar and Chawla, 2003] Possible utilization of hot spot analysis
is further explained in Chapter 3.
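One simple way to operationalize the definition of a spatial outlier is to compare each object's attribute value with the average of its spatial neighbours and to standardize the differences. The sketch below follows that idea; the function name, the threshold of 2.0 and the sample data are assumptions for illustration and are not taken from the thesis.

```python
import statistics


def spatial_outliers(values, neighbours, threshold=2.0):
    """Flag objects whose value differs strongly from their neighbourhood.

    values:     dict mapping object id -> non-spatial attribute value
    neighbours: dict mapping object id -> list of neighbouring object ids
    """
    # Difference between each value and the mean of its spatial neighbours.
    diffs = {k: values[k] - statistics.mean(values[n] for n in neighbours[k])
             for k in values}
    mu = statistics.mean(diffs.values())
    sigma = statistics.pstdev(diffs.values())
    if sigma == 0:
        return []
    return [k for k, d in diffs.items() if abs((d - mu) / sigma) > threshold]


# Nine measurement sites along a line; site 4 has an unusually high value.
readings = dict(enumerate([10, 10, 10, 10, 50, 10, 10, 10, 10]))
adjacent = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
outliers = spatial_outliers(readings, adjacent)
```

Note that the comparison is local: a value is judged against its neighbourhood, not against the global distribution of the attribute.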
2.3.2 Association and co-location
When performing clustering methods on the data, we can find only characteristic rules, describing spatial objects according to their non-spatial attributes.
In many situations we want to discover spatial rules that associate one or more
spatial objects with others.
A spatial association rule is of the form X ⇒ Y (c%), where X and Y are
sets of spatial or non-spatial predicates and c% is the confidence of the rule.
[Koperski et al., 1996] An association rule is characterized by two parameters:
support and confidence. The former expresses the ratio of the number of transactions that satisfy both X and Y to the total number of transactions in a dataset, whereas the
latter is the conditional probability that Y is true under the condition
that X is true.
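The two measures can be computed directly from their definitions. In the sketch below each transaction is represented as a set of predicate names; this representation, the function names and the sample transactions are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is undefined
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical transactions, one per spatial object.
data = [
    {"school", "close_to_sport_center", "close_to_park"},
    {"school", "close_to_sport_center", "close_to_park"},
    {"school", "close_to_sport_center"},
    {"school", "close_to_park"},
    {"close_to_park"},
]
x = {"school", "close_to_sport_center"}
y = {"close_to_park"}
rule_support = support(data, x | y)
rule_confidence = confidence(data, x, y)
```

Here two of the five transactions satisfy both X and Y, and two of the three transactions that satisfy X also satisfy Y.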
A large number of associations may be extracted from an extensive geographical database. However, the majority of those rules are applicable to only a
small number of objects, and the extraction of all rules is computationally very
expensive. Often the confidence of the rules is low. Therefore, the concepts of
minimum support and minimum confidence are used to guarantee that only
important rules are discovered. We state that a rule is strong when
the support is large, i.e., no less than the minimum support threshold, and
the confidence is large, i.e., no less than the minimum confidence threshold.
[Koperski et al., 1996] However, one of the biggest research challenges in mining association rules is the development of methods for selecting potentially
interesting rules from among the mass of all discovered rules. [Mannila, 2002]
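The effect of the two thresholds can be shown with a brute-force enumeration. The sketch below tests every possible itemset, which is feasible only for very small examples and is not an efficient mining algorithm. The function name and the sample transactions are assumptions.

```python
from itertools import combinations


def strong_rules(transactions, min_support, min_confidence):
    """Enumerate all rules meeting both thresholds (brute force)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s_xy = supp(set(itemset))
            if s_xy == 0 or s_xy < min_support:
                continue  # not frequent enough: no rule can be strong
            for r in range(1, size):
                for antecedent in combinations(itemset, r):
                    conf = s_xy / supp(set(antecedent))
                    if conf >= min_confidence:
                        consequent = tuple(i for i in itemset
                                           if i not in antecedent)
                        rules.append((antecedent, consequent, s_xy, conf))
    return rules


baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
found = strong_rules(baskets, min_support=0.5, min_confidence=0.6)
```

Each returned tuple holds the antecedent, the consequent, the support and the confidence of one strong rule. Raising either threshold shrinks the result, which is the purpose of the two parameters.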
The following rule can be obtained from a geographic database:
is_a(x, school) ∧ close_to(x, sport_center) ⇒ close_to(x, park) (80%)
This rule states that 80% of the schools that are close to sport centers are
also close to parks. [Koperski et al., 1998] Compared to spatial association,
the co-location technique tends to discover only relations concerning the spatial
proximity of objects. Therefore, the set of transactions is reduced to
spatial transactions concerning object neighborhood. Consequently, attributes
and their values do not influence the result. An example of co-location can
be seen in Figure 2.1. This image represents an analysis of habitats of animals
and plants. Co-location of predator-prey species, symbiotic species and fire
events with ignition sources may be identified. In Figure 2.1, two co-location
patterns can be observed: a fire is often located close to a dry tree and a bird
is often seen in the neighbourhood of a house. [Shekhar and Chawla, 2003],
[Shekhar et al., 2003]
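A very simple proximity measure in the spirit of co-location is the share of instances of one feature type that have an instance of another type nearby. The sketch below computes this share; it is an illustrative simplification rather than a co-location mining algorithm from the cited literature. The function name, the distance threshold and the coordinates are hypothetical.

```python
import math


def colocation_ratio(instances, feature_a, feature_b, max_dist):
    """Share of feature_a instances with a feature_b instance within max_dist.

    instances: list of (feature_type, (x, y)) tuples
    """
    a_pts = [p for f, p in instances if f == feature_a]
    b_pts = [p for f, p in instances if f == feature_b]
    if not a_pts:
        return 0.0
    near = sum(any(math.dist(a, b) <= max_dist for b in b_pts)
               for a in a_pts)
    return near / len(a_pts)


# Hypothetical observations loosely following the example of Figure 2.1.
observed = [
    ("fire", (0, 0)), ("dry_tree", (0, 1)),
    ("fire", (5, 5)), ("dry_tree", (5, 6)),
    ("fire", (20, 20)),
    ("bird", (9, 9)), ("house", (9, 10)),
]
fire_tree = colocation_ratio(observed, "fire", "dry_tree", max_dist=2)
bird_house = colocation_ratio(observed, "bird", "house", max_dist=2)
```

Only the locations and feature types enter the calculation, which reflects the statement above that attribute values do not influence the result.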
2.3.3 Classification
Every data object stored in a database is characterized by its attributes. Classification is a technique whose aim is to find rules that describe the partition
of the database into an explicitly given set of classes. Objects with similar
attribute values are assigned to the same class. In spatial classification, the
attribute values of neighboring objects may also be relevant for the membership
of objects in a certain class. Therefore, we have to include the neighbourhood
factor in the calculation.
A classification method consists of two parts. First, the user defines the number of classes. To test whether the number of classes was chosen correctly, a
set of training data is selected and the classification is performed on it. In this way, classification rules are derived from the training dataset. Next, those
rules are applied to the test dataset. Classification is considered predictive
spatial data mining, because we first create a model according to which the
whole dataset is analyzed. [Ester et al., 1997]
A classification process can be performed in many different ways. A method
offered by Shekhar and Chawla [Shekhar and Chawla, 2003] is based on
linear regression (LR). To account for the spatial dependencies of objects, a
Spatial Autoregressive Regression (SAR) technique has been proposed by spatial statisticians.
Figure 2.1: Example of co-location spatial data mining. [Shekhar and Chawla, 2003]
2.3.4 Trend detection
A spatial trend is a regular change of one or more non-spatial attributes when
spatially moving away from a start object. Therefore, spatial trend detection
is a technique for finding patterns of the attribute changes with respect to the
neighborhood of some spatial object. [Ester et al., 1997]
Let us consider the statement: "When moving away from a big city, real
estate becomes cheaper." The trend is characterized by a regular change of the real estate price, dependent on the distance from
a big city. The city in this case represents the start object.
Ester et al. [Ester et al., 1998] have presented an algorithm based on linear regression. In each step of the algorithm, the local change of the specified attribute and
the distance to the neighbors are calculated. In the next step a linear regression is applied to
the selected pairs of values. When the resulting correlation coefficient is larger
than a specified threshold, we can state that a trend is discovered. An illustrative example can be seen in Figure 2.2, originally created by Ester, where an
attribute average rent from the BAVARIA dataset is depicted. A significant
trend can be observed for the city of Munich: the average rent decreases quite
regularly when moving away from Munich. [Ester et al., 1997]
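The regression step described above can be illustrated with a minimal sketch. It is not the algorithm of Ester et al.: it only fits a least-squares line to (distance, attribute value) pairs and compares the correlation coefficient with a threshold. The function names, the threshold of 0.8, the use of the absolute value of the coefficient and the rent figures are all assumptions made for illustration.

```python
from math import sqrt

def linear_trend(distances, values):
    """Fit values ~ slope * distance + intercept and return (slope, intercept, r)."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_v = sum(values) / n
    s_dd = sum((d - mean_d) ** 2 for d in distances)
    s_vv = sum((v - mean_v) ** 2 for v in values)
    s_dv = sum((d - mean_d) * (v - mean_v) for d, v in zip(distances, values))
    slope = s_dv / s_dd
    intercept = mean_v - slope * mean_d
    r = s_dv / sqrt(s_dd * s_vv)  # Pearson correlation coefficient
    return slope, intercept, r

def has_trend(distances, values, min_abs_r=0.8):
    """Report a trend when |r| reaches the (hypothetical) threshold."""
    _, _, r = linear_trend(distances, values)
    return abs(r) >= min_abs_r

# Hypothetical observations: distance from the start object (km) and average rent.
distances_km = [0, 5, 10, 20, 30, 40]
avg_rent = [14.0, 12.5, 11.8, 10.1, 9.0, 8.2]

slope, intercept, r = linear_trend(distances_km, avg_rent)
print(f"slope={slope:.3f}, r={r:.3f}, trend={has_trend(distances_km, avg_rent)}")
```

A negative slope together with a strong correlation corresponds to the situation described for Munich, where the rent decreases with growing distance from the start object.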
Figure 2.2: Average rent for the communities of Bavaria. [Ester et al., 1997]
Chapter 3
Association rules and geographic data
This chapter gives a general overview of the spatial association rule mining
technique. After a short introduction and definition of the rule, the most
commonly used algorithm for mining association rules, the Apriori algorithm,
is briefly described. Since geographical databases deal with spatial data, the
mining process tends to be more difficult. To simplify the process, geographical
data are transformed to a format understandable to classical association rule
mining. The chapter concludes with a case study concerning the application
of association rules to a geographical database representing geo-referenced
crime data. [Estivill-Castro and Lee, 2001]
3.1 Spatial association rules
For the case study, we decided to use the association rule mining technique because it makes it possible to detect interesting relationships among objects
representing the study area. Association rules were originally introduced
for so-called market basket analysis. The basic idea is to find regularities
in the shopping behavior of supermarket customers. Typical business decisions, for example about possible sales, are usually based on the analysis of past transaction
data. This analysis tends to improve the quality of such decisions.
Since progress in bar-code technology has made it possible to collect
massive amounts of basket data and to store them in a database, the necessary functionality for taking advantage of this data should be provided.
[Agrawal et al., 1993]
A spatial association rule is a rule denoting certain association relationships
among a set of spatial and possibly some non-spatial attributes of geographical
objects, which for the analysis are expressed as predicates. The spatial predicates may represent topological relationships between spatial objects, such
as disjoint, intersects or adjacent to; they can also hold information about
spatial orientation or ordering, like left, north or east, or specify a distance,
e.g. close to. [Koperski and Han, 1995] For better understanding of the spatial association mining technique, some preliminary concepts are introduced
in the following sections.
3.1.1 Definition
Let χ = {I1, I2, ..., Im} be a set of binary attributes called items. Let T be a
database of transactions. Each transaction t is represented as a binary vector,
with t[k] = 1 if t contains the item Ik, and t[k] = 0 otherwise. There is one
tuple in the database for each transaction. Let X be a set of some items in
χ. We say that a transaction t satisfies X if for all items Ik in X, t[k] = 1.
[Agrawal et al., 1993]
A spatial association rule is an implication of the form X ⇒ Ij (c%), where
X is a set of some items in χ, and Ij is a single item in χ that is not present
in X. [Agrawal et al., 1993] The item set X is called the antecedent, and the part
behind the implication arrow is the consequent. The most often used measure of
a rule's strength is confidence (c%), which indicates that c percent of the transactions
satisfying the antecedent of the rule also satisfy the consequent of the rule.
Following the definition, a large number of spatial association rules can be
derived from a large spatial database. However, only a few rules are
useful. Therefore, the set of generated rules is restricted to those
that satisfy certain additional constraints, which are of two different forms:
1. Syntactic Constraints: These constraints involve restrictions on items
that can appear in a rule. For example, we may be interested only in
rules that have a specific item Ix appearing in the consequent.
2. Support Constraints: These constraints concern the number of transactions in T that support a rule. The support for a rule is defined to be
the fraction of transactions in T that satisfy the union of items in the
consequent and antecedent of the rule.
The aim of association rule mining is to generate all combinations of items
that have the support above a certain threshold minsupport. Those combinations of items are called large itemsets. Consequently, all the combinations
of items, that have support below the given minsupport threshold are called
small itemsets. [Agrawal et al., 1993]
For a given large itemset Y = {I1, I2, ..., Ik}, k ≥ 2, the association rules are
generated afterwards. The number of rules is at most k, and the rules
contain only items from the set {I1, I2, ..., Ik}. The antecedent of each of these
rules is a subset X of Y such that X has k-1 items, and the consequent
is the item Y-X. To generate an interesting rule X ⇒ Ij (c%), where
X = {I1, I2, ..., Ij−1, Ij+1, ..., Ik}, the confidence of the rule also has to exceed
a certain minconfidence threshold. [Agrawal et al., 1993] A rule is strong
when the support is large, i.e., no less than the minimum support threshold,
and the confidence is large, i.e., no less than the minimum confidence threshold.
[Koperski et al., 1996]
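The definitions of support, confidence and strong rules can be made concrete with a short sketch. The transactions are the basket data that this thesis lists in Table 3.1; the function names and the thresholds passed to is_strong are illustrative assumptions, not part of the cited definitions.

```python
# Basket data from Table 3.1 of this thesis, one set of items per transaction.
BASKETS = [
    {"A"}, {"A", "B", "C", "D"}, {"A", "C", "E"}, {"C"}, {"B", "C", "D"},
    {"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "E"}, {"A", "D"}, {"B", "C", "E"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions satisfying the antecedent that also contain the consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    antecedent = set(antecedent)
    return (support(transactions, antecedent | {consequent})
            / support(transactions, antecedent))

def is_strong(transactions, antecedent, consequent, minsupport, minconfidence):
    """A rule is strong when support and confidence are no less than their thresholds."""
    rule_items = set(antecedent) | {consequent}
    return (support(transactions, rule_items) >= minsupport
            and confidence(transactions, antecedent, consequent) >= minconfidence)

print(support(BASKETS, {"A", "C"}))               # support of the rule A => C
print(round(confidence(BASKETS, {"A"}, "C"), 3))  # confidence of the rule A => C
```

Note that the support of a rule is computed over the union of antecedent and consequent, exactly as in the support constraint above.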
3.2 Apriori algorithm
The main problem of mining association rules is the fact that a large number
of rules can be derived from a large database. However, most people are only
interested in patterns that occur relatively frequently, i.e. strong rules. Since
it is not possible to inspect each rule separately, efficient algorithms are needed
to restrict the search space and check only a subset of important rules. One
of the best known algorithms for mining spatial associations is the Apriori
algorithm, developed by Agrawal et al. [Agrawal et al., 1993].
Table 3.1: Example of basket data
basket-id  A  B  C  D  E
t1         1  0  0  0  0
t2         1  1  1  1  0
t3         1  0  1  0  1
t4         0  0  1  0  0
t5         0  1  1  1  0
t6         1  1  1  0  0
t7         1  0  1  1  0
t8         0  1  1  0  1
t9         1  0  0  1  0
t10        0  1  1  0  1
This algorithm works in two steps. In the first step the large itemsets are
determined. The second step represents the actual generation of association
rules from the large itemsets detected in the first step. The first step is the
more important one, because it accounts for the greater part of the processing
time. [Borgelt and Kruse, 2002]
3.2.1 Discovering large itemsets
The large itemsets are very simple patterns telling us that the variables in the set
occur together reasonably often. Fortunately, only relatively few large itemsets
are typically generated from real databases. To discover large itemsets, we need
to find all itemset patterns that are frequent, i.e. whose frequency of occurrence
reaches the minsupport threshold. Discovering large itemsets is demonstrated
in the following example. Table 3.1, taken from [Mannila, 2002], represents
transactions of several customers in a supermarket. Every line ti indicates a
single customer transaction, which consists of the items placed in the customer's
shopping basket. Each column represents a single supermarket item. The purchase of a specific item is denoted by a 1, and a 0 means that the item was not
bought. We can easily see that customer t1 placed only item A in his shopping
basket. Let us set the support, i.e. frequency threshold, to 0.4. From the
example in Table 3.1, all the large itemsets satisfying the given constraint are
{A}, {B}, {C}, {D}, {AC} and {BC}. [Mannila, 2002]
An itemset can be large only if all of its subsets are large. Therefore, we
can find all frequent itemsets by first identifying all large 1-itemsets, i.e. sets
consisting of one variable, like {A} and {C}. In the next step we build candidate
itemsets of size 2 by joining two large 1-itemsets, e.g. {A, C}. This candidate
itemset is tested and approved as large if its support reaches the threshold. We can similarly create candidate itemsets of size 3 and larger.
Figure 3.1 illustrates the steps of the algorithm, where Lk is the set of large
k-itemsets. Each member of this set has two fields: i) itemset and ii) support
count. Ck represents the set of candidate k-itemsets. Every member of this set is
also characterized by the same two fields as Lk. The first pass of the algorithm
simply counts item occurrences to determine the large 1-itemsets. Every other
pass k consists of two phases. First, the large itemsets Lk−1 detected in the
(k-1)th pass are used to generate the candidate itemsets Ck by the apriori-gen
function. Next, the database is scanned and the support of the candidates in Ck is
counted. To make the counting faster, the candidates in Ck that are contained
in a given transaction t should be determined efficiently. The subset function
is used for this purpose. Further explanation of the apriori-gen and the subset
functions is given in [Agrawal and Srikant, 1994].
3.2.2 Extraction of association rules
After all the large itemsets are defined, the extraction of association rules is
rather straightforward. If we consider the example in Table 3.1, two association
rules can be discovered: A ⇒ C with 67% confidence (c = 4/6 = 2/3) and the rule
B ⇒ C with 100% confidence (c = 5/5 = 1). [Mannila, 2002]
L1 = {large 1-itemsets};
for (k = 2; Lk−1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk−1);    // New candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);    // Candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
Figure 3.1: Apriori Algorithm
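The loop of Figure 3.1 can be sketched in Python as follows. This is a simplified illustration rather than the implementation of Agrawal and Srikant: the subset function is replaced by a plain containment test over all candidates, apriori_gen is a naive join-and-prune step, and the support threshold is given as a fraction of transactions. The transactions are the basket data of Table 3.1.

```python
from itertools import combinations

# Basket data from Table 3.1 of this thesis.
BASKETS = [
    {"A"}, {"A", "B", "C", "D"}, {"A", "C", "E"}, {"C"}, {"B", "C", "D"},
    {"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "E"}, {"A", "D"}, {"B", "C", "E"},
]

def apriori_gen(prev_large, k):
    """Join large (k-1)-itemsets into k-itemsets, then prune candidates
    that have a (k-1)-subset which is not large."""
    joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

def apriori(transactions, minsupport):
    """Return a dict mapping every large itemset to its support count."""
    n = len(transactions)
    # First pass: count single items to obtain the large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c / n >= minsupport}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is contained in the transaction
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c / n >= minsupport}
        answer.update(large)
        k += 1
    return answer

large_itemsets = apriori(BASKETS, 0.4)
for itemset, count in sorted(large_itemsets.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

With the threshold 0.4, the candidate {A, B, C} is pruned in the third pass because its subset {A, B} is not large, so the search stops there.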
3.3 Evaluation of the rules
In this chapter we have so far focused primarily on the formalism of association rules. However, an important part of association rule mining is the evaluation
of the generated rules. We can obtain hundreds of rules representing a dataset.
Therefore, we need to validate the rules and select only those that represent
important patterns. Evaluating all rules one by one
is not feasible for a human expert. Automated techniques are needed to
support the selection of interesting rules and facilitate the work.
The previous sections discussed methods that are used to find all rules that fulfill simple frequency and accuracy criteria. However, we should not restrict the
selection of interesting rules to only those exceeding the minsupport and minconfidence thresholds. Some rules with low support or low confidence
may still hold very interesting patterns. On the other hand, not all rules with
high confidence and support are interesting. [Borgelt and Kruse, 2002] The
structure of interesting rules can be described by a simple rule template, which
represents the syntactic constraints. The template gives users the possibility to
specify both interesting and uninteresting rules by describing which attributes
should occur in the antecedent and the consequent. [Klemettinen et al., 1994]
In the example of the basket data, introduced in Table 3.1, we discovered two rules with support of at least 0.4: A ⇒ C (67%) and B ⇒ C (100%).
The two rules are not necessarily equally interesting. Let us assume that, for our
purpose, rules with B in the antecedent are not important. In this case we
design a template in order to restrict the selection. From the example dataset,
we then obtain one interesting rule, A ⇒ C (67%).
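A rule template of this kind can be sketched as a simple filter. The representation of a rule as a tuple (antecedent, consequent, confidence) and the parameter names are assumptions made for this illustration; they are not the template language of Klemettinen et al.

```python
# The two rules found in the basket data of Table 3.1.
RULES = [
    (frozenset({"A"}), "C", 4 / 6),
    (frozenset({"B"}), "C", 5 / 5),
]

def matches_template(rule, excluded_antecedent=frozenset(), allowed_consequents=None):
    """Check a rule against a (hypothetical) template.

    excluded_antecedent: items that must not appear in the antecedent.
    allowed_consequents: if given, the consequent must be one of these items.
    """
    antecedent, consequent, _ = rule
    if antecedent & set(excluded_antecedent):
        return False
    if allowed_consequents is not None and consequent not in allowed_consequents:
        return False
    return True

# Template from the text: rules with B in the antecedent are not important.
interesting = [r for r in RULES if matches_template(r, excluded_antecedent={"B"})]
for antecedent, consequent, conf in interesting:
    print(f"{sorted(antecedent)} => {consequent} ({conf:.0%})")
```

The same filter can express the syntactic constraint mentioned earlier, for example keeping only rules with a specific item in the consequent via allowed_consequents.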
3.4 Mining multivariate associations using clustering
The aim of the research carried out by Estivill-Castro and Lee is to examine
and analyze crime data. [Estivill-Castro and Lee, 2001] Since crime is a
complex phenomenon, there is a great need for sophisticated tools to facilitate
the data analysis. One of the popular techniques is crime hot spot analysis.
In their article, the authors identify the hot spots by clustering. After the crime
clusters are identified, the detection of possible cause-effect relations follows.
Association rule mining seems to be a feasible tool for this. The research proposed
a vertical-view approach for cluster association rule mining. Since this
research deals with data similar to our case study, it is the main inspiration
for this thesis.
The vertical-view approach detects the relationships among layers by dividing the space into regular cells, similarly to a raster. The input for the analysis
consists of several geographical layers of the same area. Each layer represents a different item (a different attribute of the geographical location). The
layers include only point or polygon data. Every location is assigned a value of each attribute, corresponding to the selected layers. The value
of an attribute is true (1) if the location lies within regions (clusters) of
the corresponding layer and false (0) otherwise. The vertical-view approach tries
to discover interesting associations from the whole set of attributes.
Figure 3.2: The first four pictures represent four geographical layers. Picture
a) displays railway stations, picture b) crime incidents. Parks are displayed
in picture c), and picture d) depicts urban areas. Pictures e) and f)
show the point data after the cluster analysis, overlaid by a four-cell grid.
The last two pictures, g) and h), represent the polygon data, i.e. parks and urban
areas, overlaid by the same grid. [Estivill-Castro and Lee, 2001]
The algorithmic procedure of the vertical-view approach is as follows:
1. Find spatial clusters for point-data layers.
2. Segment all the layers with a finite number of regular cells (rectangles).
3. Construct an m x n relational table with the binary {0;1} values.
4. Apply association rule mining to the table.
The first step is to find homogeneous groups of spatial concentrations in the point
data layers by applying cluster analysis. Noise points are ignored. The space
in every layer is then divided into rectangular cells of an arbitrary size. The
size of the cells is identical for each layer. After that, the relational table is
computed. The size of the table depends on the number of cells in the grid.
Finally, association rule mining is applied to find the correlation between a set
of layers. Since the relational table of the locations is exactly the same as
a table created from transactional databases, except that layers replace items and
locations replace transactions, it is now straightforward to discover associations among layers using traditional association rule mining. The algorithm is
demonstrated by an example, illustrated in Figure 3.2 and described in the
following text.
Table 3.2: 4 x 4 relational table
        layer(a)  layer(b)  layer(c)  layer(d)
loc(1)  1         1         0         0
loc(2)  1         1         1         1
loc(3)  1         0         1         1
loc(4)  1         1         1         1
The geographical database consists of four data layers. The first layer (a)
shows railway stations as point data, the second layer displays incidents (b),
the last two layers contain polygons depicting parks (c) and urban areas (d).
The first step is to identify homogeneous groups of spatial concentrations of
point data layers. Two clusters of railway stations (e) and one cluster of crime
incidents (f) have been formed. Subsequently, a 2 x 2 grid is overlaid on all
data layers. The relational table, Table 3.2, is derived from the grid. In this table, column layer(j) (0 ≤ j ≤ n) represents the j-th geographical layer, and row loc(i) (1 ≤ i ≤ m)
denotes the i-th cell in the grid, numbered in Morton order. The transaction t[loc(i), layer(j)] is 1 if an event in the j-th layer occurs in cell loc(i), and
0 otherwise. For instance, t[loc(1), layer(a)] = 1, because the cluster of railway stations lies within the top-left cell as depicted in
Figure 3.2 e). The association rules can be mined directly from the relational
table. One of the association rules is as follows:
layer(a) ∧ layer(b) ⇒ layer(d) (c = 2/3, i.e. c = 66.7%)
With 50% support (s = 2/4 = 0.5), 66.7% of the locations that are situated near
railway stations and have recorded crime incidents fall within urban areas.
In the vertical-view approach, the granularity of cells plays a critical role.
However, a big advantage of this approach is its simplicity and the possibility of applying classical association rule mining techniques to a geographical
database.
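The last step of the vertical-view approach can be reproduced with a few lines of code. The sketch below takes the relational table of Table 3.2 as given (it does not perform the clustering or the grid overlay) and computes the support and confidence of the rule quoted above. The variable and function names are illustrative.

```python
LAYERS = ("a", "b", "c", "d")

# Table 3.2: one row per grid cell loc(i), one binary value per layer.
RELATIONAL_TABLE = {
    1: (1, 1, 0, 0),
    2: (1, 1, 1, 1),
    3: (1, 0, 1, 1),
    4: (1, 1, 1, 1),
}

# Each location becomes a transaction holding the layers present in that cell.
transactions = [
    {layer for layer, value in zip(LAYERS, row) if value == 1}
    for row in RELATIONAL_TABLE.values()
]

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent = set(antecedent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if antecedent <= t and consequent in t)
    return n_both / len(transactions), n_both / n_antecedent

s, c = rule_stats(transactions, {"a", "b"}, "d")
print(f"layer(a) ∧ layer(b) ⇒ layer(d): support={s:.2f}, confidence={c:.3f}")
```

Changing the grid resolution changes the rows of the table and therefore the resulting rules, which is why the granularity of the cells is critical.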
Chapter 4
Disaster management in Finland
For better understanding of the case study, this chapter gives an overview of
the risk assessment procedure in Finland. Section 4.2 describes the general model for
fire and rescue operations proposed by the Ministry of the Interior in cooperation
with the Federation for Fire Brigade Chiefs. An explanation of the
specific model used in the Helsinki Metropolitan area concludes this chapter.
4.1 Risk assessment procedure
Accident prevention and prevention of disasters have been recognized as a
fundamental topic in the areas of civil protection and emergency services. Risk
assessment plays an essential role for improvement of a risk model, developed
for the fire and rescue services. This model aims to provide crucial information
for planning rescue operations. All Finnish municipalities are responsible for
rescue operations in their respective areas. In 1992 the Finnish Ministry of the
Interior defined the Guidelines on Preparedness of Municipal Fire Brigades.
These Guidelines state, that preparedness in the fire brigades must be based on
municipal risk analysis. To assist the municipal fire brigades in obtaining the
risk assessments, a handbook was published in 1994 in co-operation with the
Ministry of the Interior and the Federation for Fire Brigade Chiefs in Finland
[Alliniemi J., 1994], [Lonka, 1999].
4.2 General model
According to [Lonka, 1999], the probabilities of different risks are estimated in
the risk assessment procedure. In order to get a numerical risk estimate for
each possible risk, a simple calculation method is used:
R = (L + F + P + E) × Pb
In this formula R represents the risk, L the consequences for life and health, F
the rapidity of the development of the accident, P the consequences
for property, E the consequences for the environment, and Pb the
probability of the risk occurrence. The consequences can be deaths, injuries,
property losses, interruption losses and environmental damage. These calculations give only very rough estimates of the risk; therefore a more sophisticated
evaluation is needed. The model for risk assessment
should take into consideration the subjective sensitivity of municipalities to
different risk categories. For example, one of the most considerable risks in
a harbour area can be the transit transport of liquid hazardous materials, while
this risk has no significance inland.
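The calculation can be written out as a small function. The formula is the one given above; the scoring scales behind L, F, P, E and Pb are not specified in this section, so the example values below are purely hypothetical.

```python
def risk_estimate(life, rapidity, property_loss, environment, probability):
    """R = (L + F + P + E) * Pb, with all inputs given as numeric scores."""
    return (life + rapidity + property_loss + environment) * probability

# Hypothetical scores for one risk, for illustration only.
r = risk_estimate(life=3, rapidity=2, property_loss=2, environment=1, probability=2)
print(r)
```

Because the four consequence terms are simply added, a risk with severe consequences in a single category and a risk with moderate consequences in all categories can receive the same estimate, which illustrates why the text calls the result a rough estimate.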
4.3 Risk model in the city of Espoo
In the Helsinki Metropolitan area (the municipalities of Helsinki, Espoo, Vantaa
and Kauniainen), the fire and rescue department focuses on two-level protection plans. The first level relates to the normal situation, whereas the
second level is created for extreme situations, e.g. wartime. However, the
general interest lies in the enhancement of risk analysis for normal, i.e.
everyday, situations.
To carry out the risk assessment in its area of responsibility, a model was developed
by the Rescue Office of Espoo. This model is used for the calculation of risk
zones. The zone identification supports the decision on assigning the rescue service levels that are mentioned in the law (Ministry of Interior 2000). The
implementation of this model is based on the expertise of people working in
the rescue services and on the national statistics. Simple spatio-statistical
analyses form the core methods of the model.
These analyses were implemented in a GIS and provided to the municipalities of the Metropolitan area. The tool depicts the risk zones within the
municipalities' areas of responsibility. The tool was designed for raster operations.
The rescue services identified the three factors likely to have the strongest influence
on risk occurrence. Each factor is represented in a separate raster layer with a
cell size of 250 m x 250 m. The first raster contains information about the population distribution, the second represents the floor area, and the probability of
traffic accidents is displayed on the third raster layer.
The analysis proceeds in two steps. In the first step all three rasters are
combined. Every resulting grid cell is then classified according to the values in
all three layers, and assigned a final risk level from 1 to 4. The level defines
the time within which the area has to be reached by a rescue service unit. For
instance, level 1 indicates that the rescue unit has to be on site within
6 minutes. In the second step the cells of the resulting raster are joined into
spatially connected regions. [Krisp et al., 2005] The joining process is based
on simple rules defined by the Ministry of the Interior. One of the rules states:
If there are at least ten risk class 1 cells within an area of 10 km², then the
whole area is classified as risk level 1. Figure 4.1 illustrates the two steps of
classifying the area into the specific risk regions.
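The quoted joining rule can be sketched as a sliding-window count over the classified raster. This is only an illustration of the rule as stated above, not the GIS tool itself: the text does not say how the 10 km² area is delimited, so the rectangular window of 10 x 16 cells (2.5 km x 4 km with 250 m cells) is an assumption, as are the function and parameter names.

```python
def promote_regions(levels, win_rows=10, win_cols=16, min_cells=10, target=1):
    """Set every cell of a window to `target` when the window contains
    at least `min_cells` cells of level `target` in the input raster.

    levels: list of rows, each a list of integer risk levels (1 to 4).
    The default window of 10 x 16 cells covers 10 km² for 250 m cells.
    """
    rows, cols = len(levels), len(levels[0])
    result = [row[:] for row in levels]  # counts always use the original raster
    for r0 in range(rows - win_rows + 1):
        for c0 in range(cols - win_cols + 1):
            window = [(r, c)
                      for r in range(r0, r0 + win_rows)
                      for c in range(c0, c0 + win_cols)]
            if sum(1 for r, c in window if levels[r][c] == target) >= min_cells:
                for r, c in window:
                    result[r][c] = target
    return result

# Tiny hypothetical raster with a 2 x 2 window and a threshold of three cells,
# chosen only so that the effect is visible at this size.
raster = [
    [1, 1, 4],
    [1, 4, 4],
    [4, 4, 4],
]
regions = promote_regions(raster, win_rows=2, win_cols=2, min_cells=3)
for row in regions:
    print(row)
```

In the small example only the top-left window contains three class 1 cells, so the single level 4 cell inside it is promoted to level 1 while the rest of the raster is unchanged.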
Figure 4.1: This figure represents the two basic steps in the process of identifying different risk regions. Picture a) displays the result of the raster
overlay of the three data layers. Each pixel is assigned a colour according
to one of the four risk classes: red, yellow, blue and white. The second step is
depicted in picture b). According to certain rules, the neighboring pixels are
connected into bigger regions. [Ihamäki, 2000]
Chapter 5
Data
The study of spatial association relationships is confined to the center of the
Finnish capital, Helsinki. The map of the area is presented in Figure 5.1.
The data for the case study are derived from two specific datasets. The first
one is acquired by the Finnish fire and rescue services and the second dataset,
SeutuCD, is provided by the Helsinki Metropolitan Area Council (YTV). Both
datasets and the data for the case study are described in the following three sections.
5.1 Dataset of incidents
The Helsinki Fire and Rescue department maintains an up-to-date register
of all incidents the fire brigade has been called to. The register data
indicate the point locations of incidents concerning all fire alarms, rescue
missions and automated fire alarm system missions within the Helsinki city
area. Every record also includes a more detailed description of the incident.
This information is stored as attribute data. Detailed understanding of the
incident properties plays a crucial role in decision making. For example,
knowledge of the time of occurrence can help the rescue services plan
the future distribution of their resources.
Figure 5.1: Map of the study area (Helsinki city center, scale 1:25000)
5.2 SeutuCD
All municipalities in Finland are obliged to gather register data on their
population, buildings and land use plans. The municipalities can benefit more from
this obligation if the form of the data is standardized, because data
analyses can then be carried out independently of the municipal boundaries.
The Metropolitan municipalities transferred some of the responsibility for urban planning and development of the area to the Helsinki Metropolitan Area
Council (YTV). Since 1997 the YTV has been working on the production of a database covering the whole Metropolitan area with data from the municipality
registers. The outcome is a data package called SeutuCD. SeutuCD includes
register data on population, buildings, agencies and enterprises, and data related to land use planning and real estate. [YTV, 2005]
5.3 Data for the case study
The case study area covers the center of Helsinki and adjacent water areas as
depicted in Figure 5.1. The data for the analysis is extracted from both of
the provided datasets. Because the whole SeutuCD database appeared to be
unnecessarily extensive, only a representative sample is utilized. Analysis of
only sample data seems to be sufficient and has no effect on the basic behavior
of association rule mining.
The core of the sample data consists of incident records, obtained from the
Fire and Rescue service Office in Espoo between the years 2002 and 2003. All
incidents are located within the study area. To simplify the analysis, all non-spatial attribute information is omitted. Every incident is characterized only
by its unique ID, geographic coordinates and a definition of its data type, which
in the case of incidents is point.
To observe possible interaction of an incident with its neighborhood, information about geographical objects situated within the study area is needed. The
source of this additional information is SeutuCD. For the study area, the following
geographical layers were extracted from the database: bars and restaurants,
kindergartens, parks and cemeteries, water areas and the road network. These data
are also described by a unique ID, geographical coordinates and a data type definition.
Figure 5.2: Data representing the study area
This particular selection is based on the diversity of the data types.
By analyzing points, lines and polygons, the behavior of association rules with
respect to all data types can be studied. The only objects selected on the
recommendation of the rescue services are bars and restaurants, which are
suspected of being connected to incidents. Because some of the incidents
occurred at sea, the layer containing water areas is included in the data selection. Since the center of Helsinki is situated on a peninsula, the sea plays an
important role in this area.
The parks and cemeteries data are used following Estivill-Castro
and Lee's research [Estivill-Castro and Lee, 2001], described more closely in
Chapter 3. The road network is a representative example of line data.
The roads are widely distributed over the whole city center. To keep the same
character for all data, i.e. every layer representing only one object type, the
road network is divided into several layers according to the category of the roads in the
road classification. The last selected layer represents kindergartens. The main
reason is the inclusion of further point objects. In Figure 5.2 all the final layers are
named and presented together with their symbol representation on the map.
Chapter 6
Method
This chapter explains the extensive process of mining geographical data. Since
the format of geographical data is very complex and therefore not directly usable by data mining applications, extensive pre-processing is required before association rule mining is applied. This chapter presents a procedure that leads
to the identification of objects influencing the occurrence of incidents. The aim of
this thesis is not to implement a new algorithm, but to apply an already existing tool. The chapter concludes with information about the program used
and the settings of the required parameters.
6.1 Process description
The whole process is outlined in the schema in Figure 6.1. The three core
steps are data pre-processing, transformation to the transaction format and
association rule mining. These steps are symbolized by red ellipses in the
schema, and in the text they correspond to sections of this chapter. The black
boxes connected by arrows represent the particular actions that have to be
performed in each of the three steps. The left side of the schema illustrates the
changes in the format of the extracted data needed for the analysis. Additional
operations are depicted on the right side of the schema together with the
program used for the extraction of association rules.
Figure 6.1: Schema of the association rule mining process. [Labels recoverable from the schema: SeutuCD; Fire and Rescue services database; relevant data extraction; 11 geographical vector layers; neighbourhood specification (grid approach, buffer approach); identification of objects situated within the neighbourhood area; 6.1 Data pre-processing; export selection to text files; 11 text files; transformation to the transaction format; 6.2 Transformation to the transaction format; transaction file; association rule mining (Genome); template; selection of interesting rules; 6.3 Association rule mining; interesting association rules.]
6.2 Data pre-processing
Association rule mining is a rather straightforward process. However, the
format of the data can cause problems when applying this data mining
technique. The issue becomes a challenge once we consider the detection of
associations among spatial objects. In spatial databases, data are seldom
stored in the form of transactions. To be able to apply association rule mining to
spatially referenced objects, the data have to be restructured.
In this case, the definition of every object from the study area is restricted
to only two basic attributes. The first one relates to the data type and the
second one characterizes the location of the object in space. Every object is also
given an ID number, which is unique within the geographical layer it belongs
to. For instance, let’s select an incident with IDinc = 2 and a bar with IDb = 2.
Although these two different objects are assigned the same ID number, they
are still distinguishable, because they belong to different geographical layers
(the incidents layer and the bars and restaurants layer).
Since only the spatial representation of the objects is known, the only reasonable subject to mine is the object’s geographical position. Therefore, we
decided to mine spatial relations among data situated in different geographical
layers. However, spatial relationships are not always easy to discover without
efficient algorithms such as plane sweep [de Berg et al., 2000]. To avoid the problem of defining various topological relationships, the only spatial
relation identified among the sample data is the proximate neighbourhood of
the objects. In this thesis, the neighbourhood area is defined in two different
ways. The first approach divides the study area into regular square grid
cells, whereas the second treats the neighbourhood as a regular buffer around
the objects.
6.2.1 Grid approach
The division of the space into a regular grid was introduced in Estivill-Castro
and Lee’s vertical-view approach [Estivill-Castro and Lee, 2001], explained in
Chapter 3. Crime data analyses face problems similar to those of risk assessment modelling for the Rescue services. Therefore, the basic idea of this
approach is adopted and adjusted to our case study. However, cluster analysis is not applied to the data before the association rule mining, because generalizing the original data into clusters may cause some important patterns to be
lost. The objects within the study area are already selected by extracting specific geographical layers from the two available databases, so
there is no need for a further generalization of the information. In
[Estivill-Castro and Lee, 2001], by contrast, the cluster analysis was a necessary step to
focus the research only on the interesting hot spot areas.
Figure 6.2: Space filling curve for grid cells numbering.
In our case, the data are organized into 11 geographical layers. A regular square
grid is placed over the whole area, layer by layer. The size of one grid cell is set to 50 m x 50 m. Every grid cell identifies a neighbourhood and is characterized
by a unique ID number. The cell numbering starts at the bottom left corner
of the area. After all cells in a row have been assigned a number, the numbering of
the next row continues from the leftmost cell. The schema of this process is
shown in Figure 6.2. After all cells have obtained an ID, the pre-processing method
can start.
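The numbering scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in the thesis: the function name `cell_id`, the origin arguments `x_min` and `y_min`, the column count `n_cols` and the choice to start numbering at 1 (`first_id`) are all hypothetical.

```python
def cell_id(x, y, x_min, y_min, n_cols, cell_size=50.0, first_id=1):
    """Return the ID of the grid cell containing point (x, y).

    Rows are counted from the bottom of the study area upwards and cells
    within a row from left to right, as described in the text.
    """
    col = int((x - x_min) // cell_size)   # 0-based column, counted from the left edge
    row = int((y - y_min) // cell_size)   # 0-based row, counted from the bottom edge
    if not 0 <= col < n_cols or row < 0:
        raise ValueError("point lies outside the grid")
    return row * n_cols + col + first_id


# A toy grid with 4 columns whose lower-left corner is at (0, 0).
print(cell_id(10.0, 10.0, 0.0, 0.0, n_cols=4))    # first cell of the bottom row
print(cell_id(160.0, 10.0, 0.0, 0.0, n_cols=4))   # last cell of the bottom row
print(cell_id(10.0, 60.0, 0.0, 0.0, n_cols=4))    # first cell of the second row
```

With such a function, any point object can be assigned to its neighbourhood cell by arithmetic alone, without a spatial query.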
The method is the same for all eleven layers, so it is explained for only one of them, the railway layer. The pre-processing technique is
rather simple and consists of two basic steps:
1. Grid cells selection
2. Extraction of the data
In the first step, all cells that are intersected by a railway are marked.
Since the ID number of each cell is known, the selected cells can be easily
identified. In the next step, all the marked cells are extracted from the
grid layer and stored in a separate text file. The two steps are illustrated in
Appendices A and B. Appendix A shows all marked cells on a map of
the study area. A fragment of the same data after the second step is displayed in Appendix B, where every row carries the ID number of a selected cell.
In many cases, one grid cell contains several objects from the same geographical layer. However, we are not interested in the number of objects of one type
belonging to one grid cell. The cell is selected when at least one object of a
particular layer is found within the cell area. For instance, two cells (5157,
5252) are highlighted in Figure 6.3. By visual comparison alone we can already
state that cell number 5252 contains more than one segment of railway, while
only one railway segment intersects cell 5157. However, both cells carry the
same information. For the purpose of association rule mining, the count of
objects of the same type within the neighbourhood area is not essential. The main
goal of the data pre-processing is to identify the different object types lying
within one grid cell. After applying this process to all geographical layers,
eleven text files are obtained, each named after the explored objects.
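The presence-only selection can be illustrated with a short Python sketch. It assumes that the cell hit by each object of a layer has already been determined; the helper names `selected_cells` and `write_layer_file` and the input list are hypothetical and only mimic the text-file layout described above.

```python
def selected_cells(object_cells):
    """Collapse the per-object cell IDs of one layer into a sorted list of distinct cell IDs."""
    return sorted(set(object_cells))


def write_layer_file(path, cell_ids):
    """Write one cell ID per row, mirroring the layout described for Appendix B."""
    with open(path, "w", encoding="utf-8") as handle:
        for cid in cell_ids:
            handle.write(f"{cid}\n")


# Hypothetical input: the cell hit by each railway segment; cell 5252 is hit twice.
railway_hits = [5252, 5157, 5252]
print(selected_cells(railway_hits))   # each cell appears once, whatever the object count
```

Using a set makes the count of objects per cell irrelevant by construction, which matches the requirement that both cells carry the same information.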
6.2.2 Buffer approach
The neighbourhood of objects can also be identified by a buffer. The buffer is
placed only around the points representing incidents, because we are only interested
in discovering possible relations between incidents and selected geographical
objects. The study of all existing relations among geographical objects from the
SeutuCD and of the relations among incidents is outside the scope of this study.
Figure 6.3: Evaluation of the cell.
The buffer has a circular shape with the incident located in the center. The
radius of the circle has the same length as the side of a grid cell, i.e. 50 m.
Everything located within a buffer is considered to be a neighbouring object.
Figure 6.4 shows an example where a minor road and a bar or restaurant are
adjacent to an incident situated in the center of the highlighted buffer zone.
Similar to the grid approach, all buffers are identified with a unique ID number. The numbering is in this case random, since the order of the buffers is not
important. The basic idea of the buffer approach resembles the grid approach.
Instead of placing a grid over the whole area, we study the intersection of all
layers with the created buffer zones. All objects belonging to a particular
layer and situated within a buffer are extracted and saved to a text file. As a
result, we obtain ten text files (the incident file is excluded), where each file represents a particular geographical layer. The structure of the files is the same
as for the grid approach.
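The buffer test itself reduces to a distance comparison. The following Python sketch shows it under a simplifying assumption that is not made in the thesis: every object is reduced to a point, whereas roads and water areas are lines and polygons that would need a point-to-geometry distance. The function name `layers_near` and all coordinates are hypothetical.

```python
import math

BUFFER_RADIUS_M = 50.0  # radius stated in the text


def layers_near(incident, layers, radius=BUFFER_RADIUS_M):
    """Return the names of the layers with at least one object inside the circular buffer."""
    ix, iy = incident
    found = []
    for name, points in layers.items():
        if any(math.hypot(px - ix, py - iy) <= radius for px, py in points):
            found.append(name)
    return found


# Hypothetical coordinates in metres.
layers = {
    "minor roads": [(20.0, 10.0)],
    "bars and restaurants": [(30.0, 30.0), (400.0, 400.0)],
    "water": [(200.0, 0.0)],
}
print(layers_near((0.0, 0.0), layers))
```

As in the grid approach, a layer is reported once per buffer regardless of how many of its objects fall inside.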
Figure 6.4: The picture depicts the buffer areas (yellow) around incidents.
The highlighted circle identifies the neighborhood of the incident located in
the center.
6.3 Transformation to transaction format
In the previous steps, all the relevant data are extracted and stored in text format. This pre-processing is necessary because the complex geographical information needs to be simplified. However, before applying the mining process,
further adjustments have to be made. All the data are now stored in separate files. In the next step, those files have to be integrated and transformed
into a suitable format, i.e. transactions containing itemsets. The basic idea
of the integration is the same in both approaches; however, there are slight
differences, which need to be explained.
6.3.1 Grid data integration
As a result of the neighbourhood detection, eleven text files are obtained.
Each file stores the IDs of the cells where objects of a certain type are located. However, each file represents only a single object type. In the following step we need
to unite all files according to the cell identification. The process is depicted in
Figure 6.5. The top part of the figure shows three separate text files in three
columns. For easier identification, the name of each file is added. The numbers in
the columns represent the cell ID numbers. Each file is assigned a number
according to its position in the input order of the integration algorithm. For instance, the file
representing railway is given number 1, because it is read first. The steps
of the integration algorithm are:
1. Check every cell ID number of the grid.
2. If the ID number occurs in a file, classify the cell according to the file of
origin, in our example (1, 2 or 3).
3. Add the classified cell as an item to the Results file.
4. If the same ID number item exists in a different file, add the file number
item to the already created transaction in the Results file.
5. Save the Results file.
Let’s demonstrate the algorithm on the example highlighted in red in Figure
6.5. After several passes through the files, cell number 50 is detected in the
railway file. The cell is classified as number 1, because 1 is the label of the railway
file. Consequently, a new transaction is created in the Results file. The same
number (50) is found in file number 2, i.e. incidents. Therefore, the algorithm
adds the item to the already existing transaction. Now the transaction row
contains two items, 1 and 2, railway and incident. Finally, the same cell number
is also detected in the third file, representing bars and restaurants. The cell is
again classified by the number of the file and added to the transaction. The
final itemset is of the form 1, 2, 3 and states: In one location within the
study area, a railway, an incident and a bar or restaurant are identified as adjacent
objects. The entire Results file lists all transactions discovered from the
example input files.
RAILWAY(1): 2, 45, 48, 50, 132, 145, 159
INCIDENTS(2): 10, 45, 46, 50, 133
BARS(3): 1, 8, 16, 23, 50, 133, 165, 181
RESULTS (transactions in file order): 3; 1; 3; 2; 3; 3; 1 2; 1; 2; 1 2 3; 1; 2 3; 1; 1; 3; 3
Figure 6.5: Explanation of the integration algorithm for the grid approach.
The top part depicts the three files containing the ID numbers of the selected
cells. The Results file is the output of the algorithm.
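The integration steps can be sketched in Python as follows. This is a minimal illustration, not the program used in the thesis: the function name `integrate` is hypothetical, and the sketch loops over the files instead of checking every cell ID of the grid, which yields the same itemsets. The input values are the cell IDs of the three example files in Figure 6.5.

```python
def integrate(files):
    """Merge per-layer lists of cell IDs into one itemset per cell.

    `files` maps a file number (the item) to the cell IDs stored in that file.
    """
    transactions = {}
    for item in sorted(files):               # files in input order: 1, 2, 3, ...
        for cid in files[item]:
            transactions.setdefault(cid, []).append(item)
    return transactions


# Cell IDs from the example files in Figure 6.5.
files = {
    1: [2, 45, 48, 50, 132, 145, 159],        # railway
    2: [10, 45, 46, 50, 133],                 # incidents
    3: [1, 8, 16, 23, 50, 133, 165, 181],     # bars and restaurants
}
transactions = integrate(files)
for cid in sorted(transactions):
    print(" ".join(str(item) for item in transactions[cid]))   # one transaction per row
```

Cell 50 ends up with the itemset 1 2 3, as in the worked example. The cell ID only serves as the grouping key and is not written to the output.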
6.3.2 Buffer data integration
The transaction file for the buffer approach is obtained in much the same way as for
the grid approach. Every text file extracted from the database includes buffer
ID numbers. Each number represents a buffer that contains or intersects a
particular object. The buffer zones are created only around incidents; therefore
the incident layer is taken out of the data set and is no longer used
in the further operations. Similar to the grid approach, all the files need to
be integrated. The integration algorithm is performed in the same way as in
the grid approach, but with one additional step. After the Results
file is filled, one more item is added to every itemset. The neighbourhood is
restricted to the proximity of incidents, but the Results file does
not yet contain any information about them. Therefore, the additional
item, standing for the incidents, or rather for the buffers around them, makes
the itemsets complete. The whole process is illustrated in Figure 6.6, where
the number 10 represents the additional information about incidents. The
highlighted itemset contains representatives from all three files and expresses:
At one location in the Helsinki center, an incident happened in the
proximity of a railway, a bar or restaurant, and a park or cemetery.
RAILWAY(1): 2, 45, 48, 50, 132, 145, 159
PARKS(2): 10, 45, 46, 50, 133
BARS(3): 1, 8, 16, 23, 50, 133, 165, 181
RESULTS (transactions in file order): 10 3; 10 1; 10 3; 10 2; 10 3; 10 3; 10 1 2; 10 1; 10 2; 10 1 2 3; 10 1; 10 2 3; 10 1; 10 1; 10 3; 10 3
Figure 6.6: Explanation of the integration algorithm for the buffer approach. The
top part represents the three input files containing the IDs of the selected buffers.
The Results file is the output of the algorithm.
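The additional step of the buffer integration is small enough to show in a few lines of Python. The sketch assumes that the itemsets have already been built as in the grid approach; the function name `add_incident_item` is hypothetical, while the item number 10 is the one used in Figure 6.6.

```python
INCIDENT_ITEM = 10  # item number used for incidents in Figure 6.6


def add_incident_item(itemsets, incident_item=INCIDENT_ITEM):
    """Prepend the incident item to every itemset, since every buffer surrounds an incident."""
    return [[incident_item] + list(items) for items in itemsets]


# Three itemsets as they would look before the additional step.
before = [[3], [1, 2], [1, 2, 3]]
for row in add_incident_item(before):
    print(" ".join(map(str, row)))
```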
6.4 Mining association rules
So far, only data pre-processing methods have been described. By integrating
all files, the spatial information describing the neighbourhood of selected geographical objects is transformed into a simple set of transactions. Once the
transaction file is obtained, the application of association rule mining is simple and straightforward.
The Apriori algorithm, described in Chapter 3, is the best known among the
algorithms for association rule induction. The aim of this thesis is not to implement the algorithm in a working program; we concentrate on the possible application of the algorithm. Therefore, we decided to analyze our data
with an already existing program. The program was designed by Borgelt, and its
implementation is explained in [Borgelt and Kruse, 2002]. A graphical user
interface for this program was developed by Togaware [Togaware, 2005] as part
of the Gnome Data Mine tool, and can be downloaded from [Gnome, 2005].
The input format of the program is a transaction file. Each record, i.e. one
row, must contain one transaction, i.e. a list of items, which are separated
by a blank. An empty record is interpreted as an empty transaction. Both
our Results files from the grid and buffer approaches, are in the recommended
format. Therefore, the selected program can be applied on our data.
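To make the record format concrete, the following Python sketch parses such records into lists of items. It is an illustrative reader written for this text, not part of Borgelt's program; the function name `read_transactions` and the sample records are hypothetical.

```python
def read_transactions(lines):
    """Parse one transaction per record; items are separated by blanks."""
    transactions = []
    for line in lines:
        # str.split() without arguments splits on any run of whitespace and
        # returns [] for an empty record, which then stands for an empty transaction.
        transactions.append(line.split())
    return transactions


sample = ["10 1 2 3\n", "10 3\n", "\n"]
print(read_transactions(sample))
```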
6.4.1 Constraints definition
Both transaction files are in the format suitable for the Gnome Data Mine
tool, selected for extracting the association rules. However, we are aware that
a large number of rules can be discovered. To obtain only valuable rules, the
extraction has to be restricted. Therefore, three constraints are defined:
• Minsupport
• Minconfidence
• Syntactic constraint (template)
Because rules with low confidence de facto represent negation, which can hold
interesting information, the minconfidence threshold is set to zero. In line with
the minconfidence, the minsupport threshold is also set to zero. With these
settings, all existing rules are extracted from the data transactions. We are,
however, not interested in all of them. Another way of restricting the
extraction to only the important rules is to apply syntactic constraints. By designing a simple template, in which the possible appearance of
certain items is stated, only rules fulfilling the constraint are selected from
the database, independently of the values of confidence and support. The
designed template limits the rules to only those containing incidents. The three constraints are applied to both transaction files with the
same values.
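The effect of the three constraints can be illustrated with a small Python sketch that enumerates single-item rules by brute force. It is not the Apriori implementation used in the thesis: the function name `pair_rules` and the toy transactions are hypothetical, both thresholds are implicitly zero, and support is taken here as the share of transactions containing both sides of the rule.

```python
from itertools import permutations


def pair_rules(transactions, required_item):
    """Return (antecedent, consequent, support, confidence) for all single-item rules
    that contain `required_item`, without any support or confidence threshold."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))
    rules = []
    for a, c in permutations(items, 2):
        if required_item not in (a, c):
            continue                                   # syntactic constraint (template)
        n_a = sum(1 for s in sets if a in s)
        n_ac = sum(1 for s in sets if a in s and c in s)
        if n_ac == 0:
            continue                                   # the rule does not occur in the data
        rules.append((a, c, n_ac / n, n_ac / n_a))
    return rules


# Toy transactions, not taken from the case study.
transactions = [
    ["incidents", "bars"], ["incidents", "bars"], ["incidents", "roads"],
    ["incidents"], ["bars"], ["roads"], ["water"], ["water"],
]
for a, c, support, conf in pair_rules(transactions, "incidents"):
    print(f"{a} => {c} ({support:.1%}; {conf:.1%})")
```

The template removes rules such as bars ⇒ roads regardless of their strength, while weak rules that involve incidents are kept.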
Chapter 7
Results
This chapter evaluates the generated rules. The rules fulfilling the set constraints are described. Since the association rule mining is applied to both
transaction files, the results are listed separately. The general results obtained
by the whole process conclude this chapter.
7.1 Grid approach
Since we divided the study area into regular grid cells of size 50 m x 50 m, the
number of transactions in the file equals the number of cells, which is 6510.
Obviously, not every transaction contains an incident. By employing our predefined template, the number of relevant transactions decreases sharply.
It is reasonable to calculate the relative frequencies of all possible consequents,
i.e. rules with empty antecedents [Borgelt, 2005], before we start to evaluate
the discovered rules. Knowing the relative frequency of every object type
helps to decide whether a strong rule can be evaluated as interesting. The
list of all relative frequencies r can be found in Table 7.1, where the first
column gives the object type and the second column shows the calculated value
of r. The largest value of r belongs to the water areas. This reflects the
fact that water covers a large part of the study area.
Table 7.1: Relative frequencies of all object types for grid approach.
rule                        r [%]
⇒ motorway                  0.6
⇒ kindergartens             0.8
⇒ bars and restaurants      4.3
⇒ railway                   6.0
⇒ incidents                 7.2
⇒ waterway                  8.9
⇒ paths                     9.7
⇒ main roads                14.6
⇒ parks                     17.7
⇒ minor roads               29.6
⇒ water                     54.1
Rules containing water are therefore likely to have a high confidence value. If objects of another type were also densely distributed over
the study area, the rules containing those objects and water would become strong.
But such rules are not necessarily interesting. Since the relative
frequencies of water and of the other object type are high, the confidence of the
rules containing those objects is also high. In this case, however, the high
confidence is not based on a detected relation between the two
object types, but on the fact that both are common within the study area.
In this case study, only water has a high relative frequency, so
this problem does not have to be considered.
After all the pre-defined constraints are set in the Gnome data mine program,
the association rule mining is applied to the grid transaction file. We discovered about twenty potentially interesting rules. One of the most significant is
rule 7.1.
bars and restaurants ⇒ incidents (1.7%; 40.0%)    (7.1)
The first number between brackets represents the support and the second the
confidence of the rule. However, we also extracted rule 7.2 with swapped
objects in antecedent and consequent.
incidents ⇒ bars and restaurants (1.7%; 24.1%)    (7.2)
These rules carry similar information, but only the one representing the more interesting pattern is selected. The confidence of rule 7.1 is
nearly twice as high as the confidence of rule 7.2. But the confidence value is
not the only factor that influences the rule selection. Table 7.1 of relative
frequencies shows that incidents are more frequent than
bars and restaurants. Rule 7.1 demonstrates that, although bars
and restaurants are not densely spread over the area, they are often located
close to incidents, whereas rule 7.2 shows that only about a quarter of all incident locations are situated near a bar or restaurant. Consequently,
a significantly larger number of incidents happen near other objects.
Therefore we consider rule 7.1 to be more interesting. This means that there
is a high probability that the presence of bars and restaurants strongly affects
the occurrence of an incident.
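The relation between the reported measures can be checked with a few lines of Python. The sketch uses only the percentages from Table 7.1 and rules 7.1 and 7.2; the function name `confidence` and the variable names are hypothetical. It also computes the ratio of a rule's confidence to the relative frequency of its consequent, a measure often called lift, which is not used in the thesis and is shown here only as an assumption-labelled aid for comparing a rule with its baseline.

```python
# Values in percent, taken from Table 7.1 and rules 7.1 and 7.2.
freq = {"bars and restaurants": 4.3, "incidents": 7.2}
support_both = 1.7   # cells containing both an incident and a bar or restaurant


def confidence(support, antecedent_freq):
    """Confidence in percent: share of antecedent cells that also contain the consequent."""
    return 100.0 * support / antecedent_freq


conf_71 = confidence(support_both, freq["bars and restaurants"])   # bars and restaurants => incidents
conf_72 = confidence(support_both, freq["incidents"])              # incidents => bars and restaurants

# Ratio of confidence to the relative frequency of the consequent; a value above 1
# means the consequent is more frequent next to the antecedent than in the area overall.
ratio_71 = conf_71 / freq["incidents"]
ratio_72 = conf_72 / freq["bars and restaurants"]
print(round(conf_71, 1), round(conf_72, 1), round(ratio_71, 1), round(ratio_72, 1))
```

The recomputed confidences come out slightly below the reported 40.0% and 24.1%, which is expected because the inputs are rounded percentages. The ratio is the same in both directions, so it expresses the strength of the co-occurrence but not its direction; the choice between rules 7.1 and 7.2 still rests on the confidence values discussed above.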
The following two rules show an association between incidents and two specific
road classes:
incidents ⇒ main roads (2.2%; 30.4%)    (7.3)
incidents ⇒ minor roads (1.7%; 24.1%)    (7.4)
These rules show that incidents also occur in the neighbourhood of minor
and main roads. We can combine both rules. In that case, we obtain
the more general rule incidents ⇒ roads. The confidence of the more general
rule is higher, and so is the relative frequency of the combined road classes.
The decision whether to keep two separate rules or to combine them
depends on the aim of further analysis.
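Combining the two road classes can be sketched in Python as a relabelling of items before the confidence is computed. The mapping `ROAD_CLASSES`, the function names and the toy transactions are hypothetical and are not taken from the case study.

```python
ROAD_CLASSES = {"main roads": "roads", "minor roads": "roads"}   # hypothetical hierarchy


def generalize(transaction, hierarchy=ROAD_CLASSES):
    """Replace each item by its parent class, if it has one, and drop duplicates."""
    return {hierarchy.get(item, item) for item in transaction}


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing `antecedent` that also contain `consequent`."""
    with_a = [t for t in transactions if antecedent in t]
    return sum(1 for t in with_a if consequent in t) / len(with_a)


# Toy transactions, not taken from the case study.
cells = [
    {"incidents", "main roads"},
    {"incidents", "minor roads"},
    {"incidents", "main roads", "minor roads"},
    {"incidents"},
    {"minor roads"},
]
general = [generalize(t) for t in cells]
print(confidence(cells, "incidents", "main roads"))    # 0.5
print(confidence(cells, "incidents", "minor roads"))   # 0.5
print(confidence(general, "incidents", "roads"))       # 0.75
```

In this toy example the confidence of the general rule is higher than that of either specific rule, but lower than their sum, because cells containing both road classes are counted only once.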
So far, only rules with high confidence have been introduced. Since we set
the minsupport and minconfidence thresholds to zero, we also obtained several rules
that describe negation between incidents and other geographical objects. Some
representatives of those rules are:
motorway ⇒ incidents (0%; 2.9%)    (7.5)
incidents ⇒ kindergarten (0.1%; 1.7%)    (7.6)
incidents ⇒ water (0.4%; 5.7%)    (7.7)
Rule 7.5 shows that accidents on a motorway are not very common. However,
this rule has little meaning for this particular area, because
the motorway passes only through a negligible part of the Helsinki center. A
similar pattern can be seen in rule 7.6. The kindergarten distribution in the
center is rather sparse; consequently, the rule has no significant importance.
An illustrative example of negation between incidents and a spatial
object is rule 7.7. Although water covers more than 50% of the study
area, the association between water and incidents is not prominent. Therefore,
water does not have a strong impact on incidents.
7.2 Buffer approach
By creating buffers around incidents, 1547 transactions are generated. Since
incidents appear in every transaction, all rules with incidents as the consequent become
very strong, with a confidence of 100%. But those associations are not representative, because they are heavily influenced by the way the transactions are constructed. Therefore, the rules obtained from the
buffer approach serve only an informative purpose. However, the information
is valuable. We can compare those rules with the associations discovered by the
grid method. Because the two methods are independent of each other, their results
are not correlated. Consequently, once an interesting association is detected
by one method, its existence is confirmed when the same rule is discovered by
the second method. The following interesting rules are detected by mining the
buffer transaction file.
bars and restaurants ⇒ incidents (35.2%)    (7.8)
mainroads ⇒ incidents (49.8%)    (7.9)
minorroads ⇒ incidents (67.7%)    (7.10)
water ⇒ incidents (15.1%)    (7.11)
The number between brackets represents support of the rule.
7.3 General results
The goal of the case study is to demonstrate the use of the association rule
mining technique on geographical data. Although this technique was originally
designed for relational databases and uncorrelated data samples, it is possible
to extend its applicability to more complicated spatial objects. After extensive
data pre-processing, several interesting association rules are discovered from
the two available databases. The rules discovered by mining the transaction file of the grid
approach describe the general behavior of the sample
data. Moreover, the obtained transaction file contains all relations existing
within the studied area. Therefore, those results represent the real relations
among the study objects and can be regarded as reliable information about
the studied data.
The circular buffer describes the neighbourhood more exactly than a grid.
However, in this study the buffer is created only around incidents. Therefore,
the results are closely related only to the incidents. The transaction file does
not contain all existing itemsets. The transactions are restricted
to those containing incidents. By mining this file, the existing relationships between incidents and other objects are correctly detected; however, the
measures of the extracted rules cannot be compared to the results obtained
by the grid approach.
Chapter 8
Discussion
This research demonstrates the possible use of association rule mining on the provided databases. Since the goal of this study is to introduce the basic concepts
of the mining process, the presented application is kept simple. However, to
obtain more exact results, some further improvement of the method is advisable. In this chapter, some possible improvements are mentioned. The
chapter concludes with potential directions for further research.
8.1 Unsolved problems
This thesis covers the whole process of identification of relevant data for improvement of a risk model. Since the process of data identification is very
complex, we proposed a rather simple method. Although this method employs
basic operations, interesting results are obtained. However, some further work
can be done to improve the identification of the object neighbourhood.
The grid approach is dependent on the grid granularity. To ensure better
results, the best fitting grid has to be identified. Several grid sizes were tested
on the study area; however, this task is very time consuming. Although the
approach is very simple, its application has many disadvantages. The explicitly
given division of the space causes problems especially at the edges of a cell.
Figure 8.1 shows an illustrative example. An incident is located in grid
cell number 3391. The only identified object in the incident neighbourhood is
a main road. Although a bar or restaurant is located closer to the incident and
probably has a stronger influence on its occurrence, it is not detected, because
it is situated in a different grid cell.
3483 3484 3485 3486
3390 3391 3392 3393
3297 3298 3299 3300
Figure 8.1: The neighbourhood problem on the edges of a cell
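The edge problem can be reproduced numerically with a short Python sketch. The coordinates are hypothetical, and the row length of 93 cells is an assumption read off the cell numbers in Figure 8.1, where vertically adjacent cells differ by 93 (for example 3391 and 3298); this is consistent with the 6510 cells reported in Chapter 7, since 93 x 70 = 6510. Starting the numbering at 1 is also an assumption.

```python
import math

CELL_SIZE = 50.0   # grid cell size in metres, as in the case study
N_COLS = 93        # assumption: row length read off Figure 8.1 (3391 lies directly above 3298)


def cell_id(x, y, first_id=1):
    """Row-major cell ID, counted from the lower-left corner of the grid."""
    return int(y // CELL_SIZE) * N_COLS + int(x // CELL_SIZE) + first_id


incident = (48.0, 25.0)   # hypothetical point close to the right edge of its cell
bar = (53.0, 25.0)        # hypothetical point just across the cell boundary

distance = math.hypot(bar[0] - incident[0], bar[1] - incident[1])
print(distance, cell_id(*incident), cell_id(*bar))
```

The two points are only a few metres apart, yet they receive different cell IDs, so the grid approach does not record them as neighbours.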
Compared to the grid approach, the buffer approach is more flexible for further development. Besides changing the size of the neighbourhood, we can
also modify its shape. The currently used buffer has a regular shape. The
circular buffer is very simple to define, but it does not always identify the
neighbouring objects properly. In Figure 8.2 a), every object located within
the highlighted buffer is considered to be in the neighbourhood of the incident situated in the center. Therefore, water and a minor road have been identified
as influence factors of the incident. It is obvious that the incident happened
closer to the minor road than to the sea. There is also a high probability
that the sea has no impact on the incident. By restricting the identification
of the neighbourhood to the minor road only (Figure 8.2 b)), the resulting
associations become more accurate. This can be done by defining some basic
constraints that allow the selection of only certain object types according to the
position of the incident or according to a specific relevance between objects.
Figure 8.2: Reduction of selected neighbourhood objects according to the
location of an incident. Image a) represents the neighbourhood detected by the
buffer approach. The possible change of the neighbourhood is illustrated in
image b), where the water area is omitted.
8.2 Further research
Starting from the beginning of the process, more automated tools for data retrieval from geographical databases can be developed. The tools should be able
to acquire the relevant data together with their spatial and non-spatial properties. The implementation of those tools is probably case and data dependent.
Once the data are available, topological relations can be identified directly
from their spatial attributes. The topological relations in this thesis are represented by a simple regular neighbourhood. The neighbourhood relations can
also be defined by the neighbourhood graphs and paths introduced by Ester et al.
[Ester et al., 1998]. This definition speeds up the processing and can
be implemented using relational tables and indexes.
When the neighbourhood does not define spatial relations sufficiently, more exact topological relations among
objects can be identified in order to obtain more accurate results. Koperski and Han [Koperski and Han, 1995] arrange
several topological relations into a hierarchical tree, illustrated in Figure 8.3.
The hierarchical structure can also be utilized for geographical objects. In
this thesis, a similar classification was made only for the road network. However, the defined road classes are not linked together, and each of them carries
its own object characteristics. Gathering similar objects into a general group and
classifying them hierarchically according to their non-spatial attributes improves
not only the processing time but also the exactness of the results. A possible set
of object hierarchies can be seen in Figure 8.4, which represents a division of
an urban area and of a road network. Moreover, hierarchical structures of objects and their spatial relations enable multiple-level association rule mining
[Koperski and Han, 1995] and [Malerba et al., 2001]. By multiple-level association mining, we can discover rules that express certain associations at diverse
levels of detail. For instance, rule 8.1 indicates an association among different
hierarchical levels of topological relations (intersects, close_to) from the hierarchical tree in Figure 8.3 and objects (large town, national motorway) from
the object hierarchies in Figure 8.4.
is_a(X, large town) ∧ intersects(X, national motorway) ⇒ close_to(X, us boundary) (72%)    (8.1)
Figure 8.3: The hierarchy of topological relations [Koperski and Han, 1995]. [Tree diagram; node labels: g_close_to, not_disjoint, close_to, intersects, adjacent_to, inside, covered_by, equal, contains, covers.]
Figure 8.4: Object hierarchies of urban areas and roads
In the previous text, possible improvements of the method used and ideas
for further development are outlined. In order to stress their importance, they
are briefly summarized:
• Adaptation of the neighbourhood area for obtaining more accurate results.
• Implementation of automated data retrieval from a geographical database.
• Hierarchical structure of objects according to their non-spatial attributes.
• Hierarchical structure of spatial relations among neighbouring objects.
• Multiple-level association mining.
Chapter 9
Conclusion
This research presents the basic concepts of spatial data mining (SDM), which
is a rapidly developing area of spatial data analysis. SDM provides techniques
for discovering unexpected patterns in large geographical databases. Those
techniques draw on, e.g., database management, spatial statistics
and artificial intelligence. Although this discipline brings new possibilities,
it also faces many challenging research problems, especially related to spatial
data characteristics. To obtain relevant results, spatial autocorrelation and
spatial heterogeneity have to be taken into consideration.
Previous research in this area has identified various SDM techniques.
Although all techniques have the same goal, i.e. discovering information not
explicitly stored in the database, the way of obtaining this information differs.
After studying the behavior of the best known SDM techniques, this thesis
concentrates on association rule mining in more detail. The discovery of spatial
associations may detect interesting relationships among spatially distributed
objects. Therefore, this technique is suitable for analyses
that deal with the identification of factors that may have an impact on the
occurrence of a certain phenomenon.
To test whether association rule mining can be used in risk management,
this thesis shows how to identify factors that possibly influence the
location of incidents within the Helsinki city center. However, the detection of
the relevant factors is very complex and requires reasonable knowledge of the
available databases and of the area. No new application is implemented; instead, the
Gnome Data Mine tool, originally designed for classical data mining by Borgelt
[Borgelt and Kruse, 2002], was used for the analysis.
We provide a study that covers the whole process of association rule mining,
from the identification of the data to the evaluation of potentially interesting rules. In order to keep the process simple, the only subjects to mine are
spatial relations of objects represented by their neighbourhood. Two different
approaches are introduced for the neighbourhood definition, the grid and the
buffer approach. The biggest advantage of the first one is its simplicity and
applicability to any kind of data. However, this approach is heavily dependent
on the granularity of the grid. Moreover, the distribution of the data does not
have any effect on the definition of the neighbourhood. This problem is solved by applying the buffer
approach, where the definition of the neighbourhood originates from the object
locations. Therefore, the extracted relations are more exact. Compared to
the grid approach, the buffer approach is more flexible and open to further
improvements.
Even though this research is based on simple and not always flexible operations, it describes the whole process of association rule mining. We are aware
that the proposed process does not offer a general solution that is
applicable to any geographical database. Moreover, the identification of the
object neighbourhood is rather cumbersome and time consuming; therefore, the
amount of data in the case study is heavily restricted. However, the process
is open to further improvements. Some of them are proposed in Chapter 8.
Although spatial data mining does not yet belong to the most commonly used
spatial data analyses, it was found useful for exploring enormous amounts of
data stored in geographical databases. In this research, a possible use of association rule induction, one of the best-known techniques of spatial data
mining, was demonstrated. Its application was found effective for detecting
strong relationships among geographical objects.
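How the strength of such a relationship is measured can be sketched as follows, assuming the common definitions of support and confidence introduced in [Agrawal et al., 1993]. The transactions are invented for illustration, the function names are hypothetical, and individual tools may define support slightly differently.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical neighbourhood transactions (one set of object types each).
TRANSACTIONS = [
    {"incident", "restaurant"},
    {"incident", "railway"},
    {"restaurant"},
    {"incident", "restaurant", "railway"},
]

if __name__ == "__main__":
    # Rule: restaurant -> incident
    print(support(TRANSACTIONS, {"restaurant", "incident"}))
    print(confidence(TRANSACTIONS, {"restaurant"}, {"incident"}))
```

In this toy data the rule "restaurant -> incident" holds in two of four transactions (support 0.5) and in two of the three transactions that contain a restaurant (confidence about 0.67).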
References
[Agrawal and Srikant, 1994] Agrawal R. and Srikant R., Fast Algorithms for
Mining Association Rules, proceedings of 20th International Conference
on Very Large Databases, 1994
[Agrawal et al., 1993] Agrawal R., Imielinski T. and Swami A., Mining association rules between sets of items in large databases, proceedings of ACM SIGMOD International Conference on Management of Data, pages 207-216,
1993
[Alliniemi J., 1994] Alliniemi J., Threats and possibilities - a way to study
accidents and their effects, 1994, in Finnish
[Bédard et al., 2001] Bédard Y., Merrett T. and Han J., Fundamentals of spatial data warehousing for geographic knowledge discovery, in Geographic
data mining and knowledge discovery, Miller H. J. and Han J., Taylor &
Francis, ISBN 0-415-23369-0, 2001
[de Berg et al., 2000] de Berg M., van Kreveld M., Overmars M. and
Schwarzkopf O., Computational Geometry Algorithms and Applications,
Chapter 2, Springer-Verlag, ISBN 3-540-65620-0, 2000
[Borgelt, 2005] Borgelt Ch., Apriori, Finding Association Rules/Hyperedges
with the Apriori Algorithm, URL:
http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/apriori/apriori.html (accessed 19.1.2005)
[Borgelt and Kruse, 2002] Borgelt Ch. and Kruse R., Induction of Association Rules: Apriori Implementation, proceedings of 14th Conference on
Computational Statistics, 2002
[Demšar, 2004] Demšar U., Exploring geographical metadata by automatic
and visual data mining, Licentiate Thesis, Royal Institute of Technology,
Stockholm, 2004
[Ester et al., 1998] Ester M., Frommelt A., Kriegel H.-P. and Sander J., Algorithms for Characterization and Trend Detection in Spatial Databases,
proceedings of 4th International Conference on Knowledge Discovery and
Data Mining, pages 44-50, 1998
[Ester et al., 1997] Ester M., Kriegel H.-P. and Sander J., Spatial Data Mining: A Database Approach, proceedings of 5th International Symposium
on Advances in Spatial Databases, pages 47-66, 1997
[Ester et al., 2001] Ester M., Kriegel H.-P. and Sander J., Algorithms and applications for spatial data mining, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0, 2001
[Estivill-Castro and Lee, 2001] Estivill-Castro V. and Lee I., Data Mining
Techniques for Autonomous Exploration of Large Volumes of Georeferenced Crime Data, proceedings of 6th International Conference on GeoComputation, GeoComputation CD-ROM, ISBN 1864995637, 2001
[Gnome, 2005] URL:
http://www.togaware.com/datamining/gdatamine/gdmapriori.html (accessed 22.3.2005)
[Han, 1999] Han J., Data Mining, in Encyclopedia of Distributed Computing,
Urban J. and Dasgupta P. (eds.), Kluwer Academic Publishers, 1999
[Han et al., 2001] Han J., Kamber M. and Tung A. K. H., Spatial clustering
methods in data mining, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0,
2001
[Helokunnas, 1995] Helokunnas T., Object-Oriented Approaches Applied to
GIS Development, Acta Polytechnica Scandinavica, Mathematics and
computing in engineering series No. 75, 1995
[Hutchinson, 1991] Hutchinson, The Hutchinson Encyclopedic Dictionary, Helicon, ISBN 0091749980, 1991
[Ihamäki, 2000] Ihamäki V.-P., Geographic information in the planning of rescue services, The Emergency Services College, Espoo, 2000
[Klemettinen et al., 1994] Klemettinen M., Mannila H., Ronkainen P., Toivonen H. and Verkamo A. I., Finding Interesting Rules from Large Sets
of Discovered Association Rules, proceedings of 3rd International Conference on Information and Knowledge Management, pages 401-408, 1994
[Koperski and Han, 1995] Koperski K. and Han J., Discovery of Spatial Association Rules in Geographic Information Databases, proceedings of 4th
International Symposium on Large Spatial Databases, pages 47-66, 1995
[Koperski et al., 1998] Koperski K., Han J. and Adhikary J., Mining Knowledge
in Geographical Data, accepted by IEEE Computer, 1998
[Koperski et al., 1996] Koperski K., Adhikary J. and Han J., Spatial Data
Mining: Progress and Challenges, in SIGMOD'96 Workshop on Research
Issues on Data mining and Knowledge Discovery, 1996
[Kraak and Ormeling, 2003] Kraak M.-J. and Ormeling F., Cartography, Visualization of Geospatial Data, Second edition, Prentice Hall, ISBN 0-13-088890-7, 2003
[Krisp et al., 2005] Krisp J.M., Virrantaus K. and Jolma A., Using Explorative Spatial Analysis to Improve Fire and Rescue Services, proceedings
of 1st International Symposium on Geo-information for Disaster Management, 2005
[Lonka, 1999] Lonka H., Risk Assessment Procedures Used in the Field of Civil
Protection and Rescue Services in Different European Union Countries
and Norway, prepared in the framework of EU co-operation in the field
of civil protection, 1999
[Malerba et al., 2001] Malerba D., Esposito F. and Lisi F.A., Mining Spatial
Association Rules in Census Data, proceedings of the Joint Conferences
on New Techniques and Technologies for Statistics and Exchange of Technology and Know-how, pages 541-550, 2001
[Mannila, 2002] Mannila H., Local and Global Methods in Data Mining: Basic
Techniques and Open Problems, proceedings of 29th International Colloquium on Automata, Languages and Programming, Lecture Notes in
Computer Science, pages 57-68, Springer-Verlag, 2002
[Miller and Han, 2001] Miller H. J. and Han J., Geographic data mining and
knowledge discovery, An overview, in Geographic data mining and knowledge discovery, Miller H. J. and Han J., Taylor & Francis, ISBN 0-415-23369-0, 2001
[Shekhar and Chawla, 2003] Shekhar S. and Chawla S., Introduction to Spatial Data Mining, in Spatial Databases: A tour, Prentice Hall, ISBN
0-13-017480-7, 2003
[Shekhar et al., 2003] Shekhar S., Zhang P., Huang Y. and Vatsavai R., Trends
in Spatial Data Mining, to appear in Data Mining: Next Generation
Challenges and Future Directions, Kargupta H., Joshi A., Sivakumar K.,
Yesha Y. (eds.), AAAI/MIT Press, 2003
[Togaware, 2005] Togaware URL: http://www.togaware.com/index.html (accessed 22.3.2005)
[YTV, 2005] Helsinki Metropolitan Area Council, Developing Regional
Data Utility, URL: http://www.ytv.fi/english/data/index.html (accessed
15.3.2005)
Appendix A
Cells containing railway
Appendix B
Sample of the railway data in the text format
"Id"
775
868
869
871
941
942
962
964
965
1035
1036
1037
1055
1056
1058
1130
1131
1132
1142