Download SPATIAL DATA MINING TECHNIQUES M.Tech

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
SPATIAL DATA MINING TECHNIQUES
M.Tech Seminar Report
Submitted by:
Subhasmita Mahalik
113050073
CSE,M.Tech1
Under the guidance of:
Prof.N.L. Sarda
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my guide Prof. N. L. Sarda for his constant
encouragement and guidance. He has been my primary source of motivation and advice during my
entire study of seminar. I would like to thank my friends for their support , suggestions and
feedback they have given me.
Subhasmita Mahalik
113050073
CSE, M.Tech
IIT Bombay
I
Contents
1. Introduction
1
2. Database primitives
2.1 Neighbourhood Relations
2.2 Neighbourhood graphs and their operation
2.3 Filter predicates
4
4
5
6
3. Knowledge Discovery
3.1 Classification
3.2 Clustering Methods
3.2.1 Complete Spatial Randomness, Cluster, and Decluster
3.2.2 DECODE : a new method of discovering cluster
3.3 Association Rules
3.4 Outlier Detection
7
7
9
9
10
12
13
4. Efficient polygon amalgamation method
4.1 Adjacency-based method
4.2 Occupancy-based method
16
16
17
5. Challenges and Applications
5.1 Challenges
5.1 Landslide-monitoring data mining
5.2 In visual data mining
19
19
19
19
6. Conclusion
22
II
ABSTRACT
Spatial data mining is the process of discovering interesting and previously unknown, but
potentially useful patterns from large spatial datasets. Extracting interesting and useful patterns
from spatial datasets is more difficult than extracting the corresponding patterns from traditional
numeric and categorical data due to the complexity of spatial data types, spatial relationships, and
spatial autocorrelation. The requirements of mining spatial databases are different from those of
mining classical relational databases. The spatial data mining techniques are often derived from
spatial statistics, spatial analysis, machine learning, and databases, and are customized to analyze
massive data sets. In this report some of the spatial data mining techniques have discussed along
with some applications in real world.
III
Chapter 1
Introduction
The explosive growth of spatial data and widespread use of spatial databases emphasize the
need for the automated discovery of spatial knowledge. Spatial data mining is the process of
discovering interesting and previously unknown, but potentially useful patterns from spatial
databases. The complexity of spatial data and intrinsic spatial relationships limits the usefulness of
conventional data mining techniques for extracting spatial patterns.Efficient tools for extracting
information from geo-spatial data are crucial to organizations which make decisions based on large
spatial datasets, including NASA, the National Imagery and Mapping Agency (NIMA), the National
Cancer Institute (NCI), and the United States Department of Transportation (USDOT). These
organizations are spread across many application domains including ecology and environmental
management, public safety, transportation, Earth science, epidemiology, and climatology[2].
General purpose data mining tools such as Clementine and Enterprise Miner, are designed to
analyze large commercial databases. Although these tools were primarily designed to identify
customer-buying patterns in market basket data, they have also been used in analyzing scientific and
engineering data, astronomical data, multi-media data, genomic data, and web data. Extracting
interesting and useful patterns from spatial data sets is more difficult than extracting corresponding
patterns from traditional numeric and categorical data due to the complexity of spatial data types,
spatial relationships, and spatial autocorrelation.
Specific features of geographical data that preclude the use of general purpose data mining
algorithms are: i) rich data types(e.g., extended spatial objects) ii) implicit spatial relationships
among the variables, iii) observations that are not independent, and iv) spatial autocorrelation
among the features.
The spatial data mining can be used to understand spatial data, discover the relation
between pace and the non- space data, set up the spatial knowledge base, excel the query, reorganize
spatial database and obtain concise total characteristic etc. The system structure of the spatial data
mining can be divided into three layer structures mostly, such as the Fig 1[1] show .The
customer interface layer is mainly used for input and output, the miner layer is mainly used to
manage data, select algorithm and storage the mined knowledge, the data source layer, which
mainly includes the spatial database (camalig) and other related data and knowledge bases, is
original data of the spatial data mining.
1
The data inputs of spatial data mining are more complex than the inputs of classical data
mining because they include extended objects such as points, lines, and polygons. The data inputs
of spatial data mining have two distinct types of attributes: non-spatial attribute and spatial attribute.
Non-spatial attributes are used to characterize non-spatial features of objects, such as name,
population, and unemployment rate for a city. They are the same as the attributes used in the data
inputs of classical data mining. Spatial attributes are used to define the spatial location and extent of
spatial objects. The spatial attributes of a spatial object most often include information related to
spatial locations, e.g., longitude, latitude and elevation, as well as shape.
Relationships among non-spatial objects are explicit in data inputs, e.g.,arithmetic relation,
ordering, is instance of, subclass of, and membership of. In contrast, relationships among spatial
objects are often implicit, such as overlap, intersect, and behind. One possible way to deal with
implicit spatial relationships is to materialize the relationships into traditional data input columns
and then apply classical data mining techniques. However,the materialization can result in loss of
information. Another way to capture implicit spatial relationships is to develop models or
techniques to incorporate spatial information into the spatial data mining process.
Fig1:Thesystematic structure of spatial data mining.
One of the fundamental assumptions of statistical analysis is that the data samples are
2
independently generated: like successive tosses of coin, or the rolling of a die. However, in the
analysis of spatial data, the assumption about the independence of samples is generally false. In
fact, spatial data tends to be highly self correlated. For example, people with similar characteristics,
occupation and background tend to cluster together in the same neighborhoods. The economies of a
region tend to be similar. Changes in natural resources, wildlife, and temperature vary gradually
over space. The property of like things to cluster in space is so fundamental that geographers have
elevated it to the status of the first law of geography: “Everything is related to everything else but
nearby things are more related than distant things”.In spatial statistics, an area within statistics
devoted to the analysis of spatial data, this property is called spatial autocorrelation.
Knowledge discovery techniques which ignore spatial autocorrelation typically perform
poorly in the presence of spatial data. Often the spatial dependencies arise due to the inherent
characteristics of the phenomena under study, but in particular they arise due to the fact that the
spatial resolution of imaging sensors are finer than the size of the object being observed.For
example, remote sensing satellites have resolutions ranging from 30 meters (e.g., the Enhanced
Thematic Mapper of the Landsat 7 satellite of NASA) to one meter (e.g., the IKONOS satellite
from SpaceImaging), while the objects under study (e.g., Urban, Forest, Water) are often much
larger than 30 meters. As a result, per-pixel-based classifiers, which do not take spatial context into
account, often produce classified images with salt and pepper noise. These classifiers also suffer in
terms of classification accuracy.
The spatial relationship among locations in a spatial framework is often modeled via a
contiguity matrix. A simple contiguity matrix may represent a neighborhood relationship defined
using adjacency, Euclidean distance, etc. Example definitions of neighborhood using adjacency
include a four-neighborhood and an eight-neighborhood. Given a gridded spatial framework, a fourneighborhood assumes that a pair of locations influence each other if they share an edge. An eightneighborhood assumes that a pair of locations influence each other if they share either an edge or a
vertex.
Fig 2:A Spatial Framework and Its Four-neighborhood Contiguity Matrix.
3
In figure2(a) above shows a gridded spatial framework with four locations, A,B, C, and D. A
binary matrix representation of a four-neighborhood relationship is shown in Fig2(b). The row
normalized representation of this matrix is called a contiguity matrix, as shown in Fig2(c).
The prediction of events occurring at particular geographic locations is very important in
several application domains. Examples of problems which require location prediction include crime
analysis, cellular networking, and natural disasters such as fires, floods, droughts, vegetation
diseases, and earthquakes. Two spatial data mining techniques for predicting locations, namely the
Spatial Autoregressive Model (SAR) and Markov Random Fields (MRF).
An Application Domain We begin by introducing an example to illustrate the different
concepts related to location prediction in spatial data mining. We are given data about two wetlands,
named Darr and Stubble, on the shores of Lake Erie in Ohio USA in order to predict the spatial
distribution of a marsh breeding bird, the red-winged blackbird (Agelaius phoeniceus). The data
was collected from April to June in two successive years, 1995 and 1996.
4
Chapter 2
Database primitives
In this section, we introduce a small set of database primitives for spatial data mining . The
major difference between mining in relational databases and mining in spatial databases is that
attributes of the neighbors of some object of interest may have an influence on the object itself.
Therefore, our database primitives are based on the concept of spatial neighborhood relations.
2.1 Neighborhood Relations
The mutual influence between two objects depends on factors such as the topology, the
distance or the direction between the objects. Three basic types of spatial relations: topological,
distance and direction relations which are binary relations, i.e. relations between pairs of objects.
Spatial objects may be either points or spatially extended objects such as lines, polygons or
polyhedrons. In general, the points p = (p1, p2, . . ., pd) are elements of a d-dimensional Euclidean
vector space called Points[1]. Topological relations are those relations which are invariant under
topological transformations, i.e. they are preserved if both objects are rotated, translated or scaled
simultaneously.
Definition 1: (topological relations) The topological relations between two objects A and B are
derived from the nine intersections of the interiors, the boundaries and the complements of A and B
with each other. The relations are: A disjoint B, A meets B, A overlaps B, A equals B, A covers B, A
covered-by B, A contains B, A inside B.
Definition 2: (distance relations) Distance relations are those relations comparing the distance of
two objects with a given constant using one of the arithmetic operators. The distance dist between
two objects, i.e. sets of points, can then simply be defined by the minimum distance between their
points.
Definition 3: (direction relations) To define direction relations O2 R O1, we distinguish between
the source object O1 and the destination object O2 of the direction relation R. There are several
5
possibilities to define direction relations depending on the number of points they consider in the
source and the destination object.
Fig3:Illustrates some of the topological, distance and direction relations using 2D polygons.
We define the direction relation of two spatially extended objects using one representative
point rep(O1) of the source object O1 and all points of the destination object O2. The representative
point of a source object may, e.g., be the center of the object. This representative point is used as the
origin of a virtual coordinate system and its quadrants define the directions.
Definition 4: (complex neighborhood relations) If r1 and r2 are neighborhood relations, then r1 ∧
r2 and r1 ∨ r2 are also neighborhood relations - called complex neighborhood relations.
2.2 Neighborhood Graphs and Their Operations
Based on the neighborhood relations, we introduce the concepts of neighborhood graphs and
neighborhood paths and some basic operations for their manipulation.
Definition 5: (neighborhood graphs and paths) Let neighbor be a neighborhood relation and DB
⊆ 2Points be a database of objects.
a) A neighborhood graph G DBneighbor = ( N, E ) is a graph with the set of nodes N which we
identify with the objects o ∈ DB and the set of edges E ⊆ N × N where two nodes n1 and n2 ∈ N
are connected via some edge of E iff neighbor(n 1,n2) holds. Let n denote the cardinality of N and let
e denote the cardinality of E. Then, f:= e / n denotes the average number of edges of a node,i.e. f is
called the “fan out” of the graph.
b) A neighborhood path is a sequence of nodes [n1, n2, . . ., nk], where neighbor(ni, ni+1) holds
for all n ∈ N, 1 ≤ i < k . The number k of nodes is called the length of the neighborhood path.
6
c) A neighborhood path [n1, n2, . . ., nk] is valid iff ∀ i ≤ k, j < k: i ≠ j ⇔ n i ≠ n j .
2.3 Filter Predicates for Neighborhood Paths
Neighborhood graphs will in general contain many paths which are irrelevant if not “mislead
ing” for spatial data mining algorithms. For finding significant spatial patterns, we have to consider
only certain classes of paths which are “leading away” from the starting object in some
straightforward sense. Such spatial patterns are most often the effect of some kind of influence of an
object on other objects in its neighborhood. Furthermore, this influence typically decreases or
increases continuously with increasing or decreasing distance. The task of spatial trend analysis, i.e.
Finding patterns of systematic change of some non-spatial attributes in the neighborhood of certain
database objects, can be considered as a typical example. Detecting such trends would be
impossible if we do not restrict the pattern space in a way that paths changing direction in arbitrary
ways or containing cycles are eliminated[1].
Fig4:Filter Starlike and filter variable starlike
Definition 6: (filter starlike and filter variable starlike) Let p = [n1,n2,...,nk] be a neighborhood
path and let reli be the exact direction for ni and n i+1, i.e. ni+1 reli ni holds. The predicates starlike and
variable-starlike for paths p are defined as follows:
starlike(p) :⇔ (∃ j < k: ∀ i > j: ni+1 reli ni ⇔ reli ⊆ relj), if k > 1; TRUE, if k=1
variable-starlike(p) :⇔ (∃ j < k: ∀ i > j: ni+1 reli ni ⇔ reli ⊆ rel1), if k > 1; TRUE, if k=1.
7
Chapter 3
Knowledge Discovery
Knowledge discovery in databases (KDD) has been defined as the non-trivial process of
discovering valid, novel, and potentially useful, and ultimately understand able patterns from
data .The process of KDD is interactive and iterative, involving several steps such as the following
ones:
• Selection: selecting a subset of all attributes and a subset of all data from which the
knowledge
should be discovered.
• Data reduction: using dimensionality reduction or transformation techniques to reduce the
effective number of attributes to be considered.
• Data mining: the application of appropriate algorithms that, under acceptable computational
efficiency limitations, produce a particular enumeration of patterns over the data.
• Evaluation: interpreting and evaluating the discovered patterns with respect to their
usefulness
in the given application.
Spatial Database Systems (SDBS) are database systems for the management of spatial data.
To find implicit regularities, rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies, spatial data mining algorithms are very
important .Most existing data mining algorithms run on separate and specially prepared files, but
integrating them with a database management system (DBMS) has the following advantages.
Redundant storage and potential inconsistencies can be avoided. Furthermore, commercial database
systems offer various index structures to support different types of database queries. This
functionality can be used without extra implementation effort to speed-up the execution of data
mining algorithms (which, in general, have to perform many database queries)[4]. Similar to the
relational standard language SQL, the use of standard primitives will speed-up the development of
new data mining algorithms and will also make them more portable.
3.1 Spatial Classification
The task of classification is to assign an object to a class from a given set of classes based on
the attribute values of this object. In spatial classification the attribute values of neighboring objects
are also considered.
8
The classification algorithm works as follows:The relevant attributes are extracted by
comparing the attribute values of the target objects with the attribute values of their nearest
neighbors. The determination of relevant attributes is based on the concepts of the nearest hit (the
nearest neighbor belonging to the same class) and the nearest miss (the nearest neighbor belonging
to a different class). In the construction of the decision tree, the neighbors of target objects are not
considered individually. Instead, so-called buffers are created around the target objects and the nonspatial attribute values are aggregated over all objects contained in the buffer. For instance, in the
case of shopping malls a buffer may represent the area where its customers live or work[4]. The size
of the buffer yielding the maximum information gain is chosen and this size is applied to compute
the aggregates for all relevant attributes.
The task of classification is to assign an object to a class from a given set of classes based on
the attribute values of the object. In spatial classification the attribute values of neighbouring
objects may also be relevant for the membership of objects and therefore have to be considered as
well.
The extension to spatial attributes is to consider also the attribute of objects on a
neighbourhood path starting from the current object. Thus, we define generalized attributes for a
neighbourhood path p = [o1, . . ., ok] as tuples (attribute-name, index) where index is a valid position
in p representing the attribute with attribute name of object o index. The generalized attribute
(economic-power,2), e.g., represents the attribute economic-power of some (direct) neighbour of
object o1.
Because it is reasonable to assume that the influence of neighbouring objects and their
attributes decreases with increasing distance, we can limit the length of the relevant neighbourhood
paths by an input parameter max-length. Furthermore, the classification algorithm allows the input
of a predicate to focus the search for classification rules on the objects of the database fulfilling this
predicate. Figure 5 depicts a sample decision tree and two rules derived from it. Economic power
has been chosen as the class attribute and the focus is on all objects of type city.
9
Fig 5:Sample decision tree and rules discovered by the classification algorithm
3.2 Spatial Clustering
Spatial clustering is a process of grouping a set of spatial objects into clusters so that objects
within a cluster have high similarity in comparison to one another, but are dissimilar to objects in
other clusters. For example, clustering is used to determine the “hot spots” in crime analysis and
disease tracking. Hot spot analysis is the process of finding unusually dense event clusters across
time and space. Many criminal justice agencies are exploring the benefits provided by computer
technologies to identify crime hot spots in order to take preventive strategies such as deploying
saturation patrols in hot spot areas[6].
Spatial clustering can be applied to group similar spatial objects together; the implicit
10
assumption is that patterns in space tend to be grouped rather than randomly located. However, the
statistical significance of spatial clusters should be measured by testing the assumption in the data.
The test is critical before proceeding with any serious clustering analyses.
3.2.1 Complete Spatial Randomness, Cluster, and Decluster
In spatial statistics, the standard against which spatial point patterns are often compared is a
completely spatially random point process, and departures indicate that the pattern is not distributed
randomly in space. Complete spatial randomness (CSR) [Cressie1993] is synonymous with a
homogeneous Poisson process. The patterns of the process are independently and uniformly
distributed over space, i.e., the patterns are equally likely to occur anywhere and do not interact
with each other. However, patterns generated by a non-random process can be either cluster
patterns(aggregated patterns) or decluster patterns(uniformly spaced patterns).
Fig 6:Illustration of CSR, Cluster, and Decluster Patterns
Several statistical methods can be applied to quantify deviations of patterns from a complete
spatial randomness point pattern. One type of descriptive statistics is based on quadrats (i.e., well
defined area, often rectangle in shape)[6]. Usually quadrats of random location and orientations in
the quadrats are counted, and statistics derived from the counters are computed.
Another type of statistics is based on distances between patterns; one such type is Ripley’s
K-function . After the verification of the statistical significance of the spatial clustering, classical
clustering algorithms can be used to discover interesting clusters.
3.2.2 DECODE (DiscovEring Clusters Of Different dEnsities)
Discovering clusters in complex spatial data, in which clusters of different densities are
superposed, severely challenges existing data mining methods. For instance, in seismic research,
11
foreshocks (which indicate forthcoming strong earthquakes) or aftershocks (which may help to
elucidate the mechanism of major earthquakes) are often interfered by background earthquakes. In
this context, data are presumed to consist of various spatial point processes in each of which points
are distributed at a constant, but different intensity. DECODE is a new density-based cluster
method(DECODE) to discover clusters of different densities in spatial data[8].
The novelties of DECODE are 2-fold:
(1) It can identify the number of point processes with little prior knowledge.It can
automatically estimate the thresholds for separating point processes and clusters.
Fig 7:Flowchart of the method for discovering clusters of different densities in spatial data
Two strategies have been adopted for finding density homogeneous clusters in density-based
methods:
(1) Grid-based clustering method
Map data into a mesh grid and identify dense regions according to the density in cells. The
main advantage : detection capability for finding arbitrary shaped clusters and their high
efficiency in dealing with complex data sets which are characterized by large amounts of
data, high dimensionality and multiple densities.
(2) Distance-based clustering method
Often based on the mth nearest neighbor distance.
When clusters with different densities and noise coexist in a data set, few existing methods
can determine the number of processes and precisely estimate the parameters. Therefore,
DECODE is a solution for it.
12
DECODE is based upon a reversible jump Markov Chain Monte Carlo( MCMC) strategy and
divided into three steps:
(1) Map each point in the data to its mth nearest distance.
(2)Classification thresholds are determined via a reversible jump MCMC strategy.
(3) Clusters are formed by spatially connecting the points whose mth nearest distances fall into a
particular bin defined by the thresholds.
Some aspects of the model
1) Analysis of sensitivity to prior specification
Two strategies can be applied to the selection of β.
a. to update β constantly during the process.
b. To fix β. Where λmax is the intensity corresponding to min(Xm)
Fig 8: The cumulative occupancy fractions of j at different f b : (a) fb = 2; (b) fb = 10; (c)
fb = 50; (d) fb = 200; (e) fb = 1,000; (f) fb = 5,000; (g) fb = 20,000; (h) fb = 250,000; (i) updating β
(with δ = 1, α = 1, h = 10/(max(Xm ) − min(Xm )), g = 0.2)
13
As shown above, we find that updating β produces the highest posterior probability.
From the results, it is observed that: updating β tends to produce a model with more point processes.
fb is the parameter which significantly influences the results when fixing β.
Total complexity of the algorithm is O(T(Mk+(k-1)k!))where T is the sweep times.
M is the number of points of the data set.
k is the number of point processes that the algorithm determines.
DECODE is a robust and automatic cluster method which needs less prior knowledge of the target
data.
3.3 Association rules
Spatial association rule is a rule indicating certain association relationship among a set of
spatial and possibly some nonspatial predicates. A strong rule indicates that the patterns in the rule
have relatively frequent occurrences in the database and strong implication relationships. raditional
data organization and retrieval tools can only handle the storage and retrieval of explicitly stored
data. The extraction and comprehension of the knowledge implied by the huge amount of spatial
data, though highly desirable, pose great challenges to currently available spatial database
technologies.
A spatial characteristic rule is a general description of a set of spatial-related data. For
example, the description of the general weather patterns in a set of geographic regions is a spatial
characteristic rule. A spatial discriminant rule is the general description of the contrasting or
discriminating features of a class of spatial-related data from other class(es). For example, the
comparison of the weather patterns in two geographic regions is a spatial discriminant rule. A
spatial association rule is a rule which describes the implication of one or a set of features by
another set of features in spatial databases. For example, a rule like most big cities in Canada are
close to the Canada-U.S. border" is a spatial association rule.
A strong rule indicates that the patterns in the rule have relatively frequent occurrences in
the database and strong implication relationships.A spatial association rule is a rule in the form of
P1 ∧ … ∧ Pm → Q1 ∧ … ∧Qn
(c%) ,
where at least one of the predicates P1 ,…, Pm , Q1 ,…, Qn is a spatial predicate, and c% is the
confidence of the rule.
A rule “P  Q/S” is strong if predicate “P ∧ Q” is large in set S and the confidence of
“P → Q/S” is high.
14
Steps for extracting the association rules:
STEP 1: Task_relevant_DB := extract task relevant objects(SDB ,RDB);
(Relevant objects are collected into one database)
STEP 2: Coarse_predicate_DB := coarse spatial computation(Task relevant DB);
(Spatial algorithm is executed at the coarse resolution level)
STEP 3: Large_Coarse_predicate_DB :=
filtering_with_mininmum_support(Coarse_predicate_DB);
(computes the support for each predicate in Coarse_predicate_DB,and filters out those
entries whose support is below the minimum support threshold at the top level.)
STEP 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB);
(Algorithm is executed at fine resolution level.)
STEP 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);
3.4 Outlier Detection
Spatial outlier represent locations which are significantly different from their neighborhoods
even though they may not be significantly different from the entire population. Identification of
spatial outliers can lead to the discovery of unexpected, interesting and implicit knowledge, such as
local instability[3].
A spatial outlier is a spatially referenced object whose non-spatial attribute values are
significantly different from those of other spatially referenced objects in its spatial neighborhood.
Informally, a spatial outlier is a local instability or a spatially referenced object whose non-spatial
attributes are extreme relative to its neighbors,even they may not be significantly different from the
entire population. For example, a new house in an old neighborhood of a growing metropolitan area
is a spatial outlier based on the non-spatial attribute house age. So, the problem is given the
components of the S-outlier definition, the objective is to design a computationally efficient
algorithm to detect the S-outliers.
Techniques of outlier detection:Graphical methods include Variogram clouds and Moran
Scatterplots. The algebraic method includes Scatterplot and Z(S(x)).
A variogram cloud displays data points related by neighborhood relationships. For each pair
of loactaions,the square-root of the absolute difference between attribute values at the locations
versus the Euclidean distance between the loacations are plotted. In data sets exhibiting strong
spatial dependence,the variance in the attribute differences will increase with increasing distance
15
between locations.Locations that are near to one another ,but with large attribute differences,might
indicate spatial outlier, even though the values at both locations may appear to be reasonable when
examining the dataset non-spatially .
Fig 9:A Dataset for Outlier Detection.
In Fig9(a), the X-axis is the location of data points in one-dimensional space; the Y-axis is
the attribute value for each data point. Global outlier detection methods ignore the spatial location
of each data point and fit the distribution model to the values of the non-spatial attribute. The outlier
detected using this approach is the data point G, which has an extremely high attribute value 7.9,
exceeding the threshold of μ + 2σ = 4.49 + 2 ∗ 1.61 = 7.71, as shown in Figure 9(b). This test
assumes a normal distribution for attribute values. On the other hand, S is a spatial outlier whose
observed value is significantly different than its neighbors P and Q[3].
Fig 10:Variogram cloud
16
This plot shows that two pairs (P,S ) and (Q,S ) in the left hand side lie above the main group
of pairs, and are possibly related to outliers. The poin S may be identified as a spatial outlier since it
occurs in both pairs ( Q,S) and ( P, S). This technique requires non-trivial post-processing of
highlighted pairs to separate spatial outliers from their neighbors, particualrly when multiple
outliers are present or density varies greatly.
Fig 11:Scatter plot
A scatterplot shows attribute values on the X-axis and the average of the attribute values in
the neighborhood on the Y-axis. A least square regression line is used to identify spatial outliers. A
scatter sloping upward to the right indicates a positive spatial autocorrelation; a scatter sloping
upward to the left indicates a negative spatial auto -correlation .The residual is defined as the vertical
distance (Y- axis) between a point P with location(Xp,Yp) to the regression line Y = mX +b ,that is,
residual ∈ =Yp -(mXp +b).
A scatterplot shows the attribute values plotted against the average of the attribute values in
neighboring areas for the given dataset.
17
Chapter 4
Efficient polygon amalgamation method
The polygon amalgamation operation computes the boundary of the union of a set polygons.
It’s a fundamental operation for emerging new applications such as spatial OLAP and spatial data
mining.Two efficient methods for identifying internal polygons without retrieving them from
databases[7].
1) Adjacency-based method
2) Occupancy-based method
4.1 Adjacency-based approach
Two polygons are adjacent if they have at least one pair of identical line segments.The
adjacency table of a set of polygons P is defined as a two column table ADJACENCY (p, p’) where
p,p’ are identifiers of polygons in P. A tuple (p,p’) is in the table ADJACENCY if and only if p and
p’ are adjacent. The basic idea is to use adjacency table to identify boundary polygons. Then
remove identical line segments in boundary polygons to get t(S). Hence, most of the internal
polygons will not be retrieved. A polygon is on the boundary of S if it’s adjacent to some polygons
which don’t belong to S[7].
Fig 12:An example of shadow ring
18
The adjacency table for the four polygons in Figure 12(a) where P = {p1 , p2 , p3 , p4 }:
{(p1;p2);p1;p3); (p1;p4); (p2;p3); (p2;p4); (p3;p4); (p2;D); (p3;D);(p4;D)}
where (p; D ) means p has at least one line segment adjacent to no P polygons (in this case we say p
is adjacent to a dummy polygon also labeled as D ). For S ⊆ P , by definition we have
δS = { p|p ∈S, ∃(p,p’) ∈ ADJACENCY, p’ ∉ S }
So, δS :p2, p3, p4 are boundary polygons. P1 will not be involved in the computation. So
those line segments of p2, p3 and p4 which were originally shared with p1 have lost their
counterpart. These line segments couldn’t be removed and form a shadow ring.
Identify δS +: the set of sub-boundary polygons which are internal polygons adjacent to boundary
polygons.
We extract both δS and δS +, remove all identical line segments in them. We get the target
polygon and possibly a shadow ring. But this time we know the shadow ring is formed by line
segments of sub-boundary polygons δS +. So we then remove the line segments belonging to δS +.
The remaining line segments will form the boundary polygon.
4.2 Occupancy-based approach
Z-values: a spatial data access method which establishes certain relationship between the
data space and spatial objects or their approximation bounding rectangles. It approximates a given
object’s shape by recursively decomposing the embedding data space into smaller data space known
as Peano cells.
Z-values is commonly used spatial indexing mechanism[7]. Our occupancy-based approach
is built on top of a simple extension of Z-values.
Fig 13: Z-order and object approximation using z-values
19
One way to assign Z-values.
1. The z-value of the initial space, D, is 1;
2. The z-values of the four quadrants of a Peano cell whose z-value is z = z1 ...zn ,
1 ≤ zi ≤ 4, 1 ≤ i ≤ n, are z1, z2, z3 and z4 respectively following the z-order;
The accuracy of approximation can be improved by assigning multiple Z-values to a polygon.
Occupancy-based approach will not produce shadow rings. Because:
Any line segment which doesn’t overlap with boundary cells is not part of the target polygon, and
thus can be discarded.
Any line segment inside a boundary cell either is part of the target polygon, or its counterpart from
another polygon must also be inside this boundary cell.
Let C be the set of all Peano cells with which S polygons overlap with. If c C is not
completely occupied by S polygons, we call c boundary cell. An S polygon overlapping with a
boundary cell is likely to be a boundary polygon.
The spatial indices using z-values associate objects with Peano cells. That is,each index
entry is of the form (z, p), stating that object p overlaps with Peano cell z. There is no information
about what percentage of the cell is occupied by p, thus it is not sufficient to determine if a Peano
cell is completely occupied by a set of objects. Therefore, we extend the index entry to the form of
(z, p, α) where α is the occupancy ratio. Let p ∩ z be the polygon produced from clipping polygon p
by the Peano cell z, then
α=area(p ∩ z)/area(z)
20
Chapter 5
Challenges and Applications
5.1 Challenges
A noteworthy trend is the increasing size of data sets in common use, such as records of
business transactions, environmental data and census demographics. These data sets often contain
millions of records, or even far more. This situation creates new challenges in coping with scale.
The challenges and impacts can be classified into three main areas, namely, geographic information
in knowledge discovery, geographic knowledge discovery in geographic information science and
geographic knowledge discovery in geographic research[2] .
The huge volume of spatial data . So retrieval and storage is difficult.The spatial data
types/structures are complex .Expensive spatial processing operations .
5.2 Application in land use dynamic monitoring
The mass data storaged in spatial database includes spatial topological, nospatial properties
and objects appearing variety on the time .The main knowledge types that can be discovered in the
spatial database are: general geometric knowledge, spatial distribution rules, spatial association
rules, spatial clustering rules, spatial characteristic rules, spatial discriminate rules, spatial evolution
rules etc.. For land use dynamic monitoring, according to the knowledge mined in the spatial
database, there are following several applications[4]:
a. Make prediction of land variety
According to geographic location, soil characters, geologic circumstance, prevent or control
flood information, transportation circumstance etc. of the land, Making use of the spatial
distribution rules, spatial clustering rules, spatial characteristic rules, spatial discriminate
rules to analyze can get the distributing and the future development of the land .
b.Provide decision support for the city planning
Spatial data mining technique makes use of general geometric knowledge, spatial
distribution rules, spatial association rules, spatial evolution rules to get many factors about
terrain, prevent or control flood, preventive pollution during the city planning for providing
good data environment in city construction.
21
c.Valid management and analysis of remote sensing monitoring result
According to algorithm of spatial data mining, based on knowledge discovery, it can validly
output various statistical charts, images, query result and analysis result for decision.
5.3 Application in Visual Data Mining
For data mining of large data sets to be effective, it is also important to include humans in
the data exploration process and combine their flexibility, creativity, and general knowledge with
the enormous storage capacity and computational power of today’s computers. Visual data mining
applies human visual perception to the exploration of large data sets. Presenting data in an
interactive, graphical form often fosters new insights, encouraging the formation and validation of
new hypotheses to the end of better problem-solving and gaining deeper domain knowledge. Visual
data mining often follows a three step process: Overview first, zoom and filter, and then details-ondemand [5].
Some of they key advantages of visual data exploration over automatic data mining
techniques alone are:
• yields results more quickly, with a higher degree of user satisfaction and confidence in findings .
• are especially useful when little is known about the data and exploration goals are vague, because
the analyst guides the search and can shift or adjust goals on the fly .
• can deal with highly non-homogeneous and noisy data .
• can be intuitive and require less understanding of complex mathematical or statistical algorithms
or parameters .
• can provide a qualitative overview of the data, allowing unexpected phenomena .
22
Chapter 6
Conclusion
Spatial Data Mining extends relational data mining with respect to special features of spatial
data, like mutual influence of neighboring objects by certain factors (topology, distance, direction).
It is based on techniques like generalization, clustering and mining association rules. Some
algorithms require further expert knowledge that can not be mined from the data, like concept
hierarchies. Spatial data mining is a niche area within data mining for the rapid analysis of spatial
data. Spatial data can potentially influence major scientific challenges, including the study of global
climate change and genomics. The distinguishing characteristics of spatial data mining can be
netaly summarized by the first law of geography:All things are related, but nearby things are more
related than distant things. Spatial data mining is being used in various fields like remote sensing
sattelite, Visyal data mining to mine data.
23
Bibliography
[1] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Jörg Sander . Spatial Data
Mining:Database Primitives, Algorithms and Efficient DBMS Support . "Integration of Data Mining
with Database Technology", Data Mining and Knowledge Discovery, an International -Journal,
Kluwer Academic Publishers, 1999.
[2] Shashi Shekhar , Pusheng Zhang , Yan Huang , Ranga Raju Vatsavai. Trends in Spatial Data
Mining,Department of Computer Science and Engineering, University of Minnesota 4-192, 200
Union ST SE, Minneapolis, MN 55455
[3] Sashi Sekhar ,Chang-Tien Lu and Pusheng Zhang. A Unified Approach To Detecting Spatial
Outlier. Published in:Journal Geoinformatica Volume 7 Issue 2, June 2003
[4] Zhong yong a, Zhang jixian a, Yan qin. Research on spatial data mining technique applied in
land use dynamic monitoring.Proceedings of International Symposium on Spatio-temporal
Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion. Volume XXXVI-2/W25,
2005
[5] Daniel A. Keim Christian Panse Mike Sips . Pixel Based Visual Mining of Geo-Spatial Data .
Published in:Journal IEEE Computer Graphics and Applications Volume 24 Issue 5, September
2004
[6] Krzysztof Koperski and Jiawei Han . Discovery of Spatial Association Rules in Geographic
Information database.Proceeding SSD '95 Proceedings of the 4th International Symposium on
Advances in Spatial Databases Springer-Verlag London, UK ©1995
[7] Xiaofang Zhou1 , David Truffet2 , and Jiawei Han3 . Efficient Polygon Amalgamation Methods
for Spatial OLAP and Spatial Data Mining . Published in:Proceeding SSD '99 Proceedings of the
6th International Symposium on Advances in Spatial Databases Springer-Verlag London, UK
©1999
[8] Pei T, Jasra A, Hand Dj, et al. DECODE: A new method for discovering clusters of different
densities in spatial data.Springer Science+Business Media, LLC 2008
[9] Raymond T. Ng , Jiawei Han . Efficient and Effective Methods for Spatial Data
Mining.Published in:Proceedings VLDB '94 Proceedings of the 20 th international conference on
very large databases, San Franscisco, CA,USA ©1994
24