Download spatial data mining techniques

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
SPATIAL DATA MINING TECHNIQUES
Submitted by:
Akhil Killawala
BE Computer Engineering
KJ Somaiya College
Contents
1. Introduction
1
2. Database primitives
2.1 Neighbourhood Relations
2.2 Neighbourhood graphs and their operation
2.3 Filter predicates
4
4
5
6
3. Knowledge Discovery
3.1 Classification
3.2 Clustering Methods
3.2.1 Complete Spatial Randomness, Cluster, and Decluster
3.2.2 DECODE : a new method of discovering cluster
3.3 Association Rules
3.4 Outlier Detection
7
7
9
9
10
12
13
4. Efficient polygon amalgamation method
4.1 Adjacency-based method
4.2 Occupancy-based method
16
16
17
5. Challenges and Applications
5.1 Challenges
5.1 Landslide-monitoring data mining
5.2 In visual data mining
19
19
19
19
6. Conclusion
22
II
ABSTRACT
Spatial data mining is the process of discovering interesting and previously unknown, but
potentially useful patterns from large spatial datasets. Extracting interesting and useful
patterns from spatial datasets is more difficult than extracting the corresponding patterns from
traditional numeric and categorical data due to the complexity of spatial data types, spatial
relationships, and spatial autocorrelation. The requirements of mining spatial databases are
different from those of mining classical relational databases. The spatial data mining
techniques are often derived from spatial statistics, spatial analysis, machine learning, and
databases, and are customized to analyze massive data sets. In this report some of the spatial
data mining techniques have discussed along with some applications in real world.
III
Chapter 1
Introduction
The explosive growth of spatial data and widespread use of spatial databases emphasize the
need for the automated discovery of spatial knowledge. Spatial data mining is the process of
discovering interesting and previously unknown, but potentially useful patterns from spatial databases.
The complexity of spatial data and intrinsic spatial relationships limits the usefulness of conventional
data mining techniques for extracting spatial patterns.Efficient tools for extracting information from geospatial data are crucial to organizations which make decisions based on large spatial datasets,
including NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute
(NCI), and the United States Department of Transportation (USDOT). These organizations are spread
across many application domains including ecology and environmental management, public safety,
transportation, Earth science, epidemiology, and climatology[2].
General purpose data mining tools such as Clementine and Enterprise Miner, are designed
to analyze large commercial databases. Although these tools were primarily designed to identify
customer-buying patterns in market basket data, they have also been used in analyzing scientific
and engineering data, astronomical data, multi-media data, genomic data, and web data.
Extracting interesting and useful patterns from spatial data sets is more difficult than extracting
corresponding patterns from traditional numeric and categorical data due to the complexity of
spatial data types, spatial relationships, and spatial autocorrelation.
Specific features of geographical data that preclude the use of general purpose
data mining algorithms are: i) rich data types(e.g., extended spatial objects) ii) implicit
spatial relationships among the variables, iii) observations that are not independent, and
iv) spatial autocorrelation among the features.
The spatial data mining can be used to understand spatial data, discover the relation
between pace and the non- space data, set up the spatial knowledge base, excel the query,
reorganize spatial database and obtain concise total characteristic etc. The system structure
of the spatial data mining can be divided into three layer structures mostly, such as the Fig
1[1] show .The customer interface layer is mainly used for input and output, the miner layer is
mainly used to manage data, select algorithm and storage the mined knowledge, the data
source layer, which mainly includes the spatial database (camalig) and other related data and
knowledge bases, is original data of the spatial data mining.
1
The data inputs of spatial data mining are more complex than the inputs of classical data
mining because they include extended objects such as points, lines, and polygons. The data inputs
of spatial data mining have two distinct types of attributes: non-spatial attribute and spatial
attribute. Non-spatial attributes are used to characterize non-spatial features of objects, such as
name, population, and unemployment rate for a city. They are the same as the attributes used in
the data inputs of classical data mining. Spatial attributes are used to define the spatial location
and extent of spatial objects. The spatial attributes of a spatial object most often include
information related to spatial locations, e.g., longitude, latitude and elevation, as well as shape.
Relationships among non-spatial objects are explicit in data inputs, e.g.,arithmetic relation,
ordering, is instance of, subclass of, and membership of. In contrast, relationships among spatial
objects are often implicit, such as overlap, intersect, and behind. One possible way to deal with
implicit spatial relationships is to materialize the relationships into traditional data input columns
and then apply classical data mining techniques. However,the materialization can result in loss of
information. Another way to capture implicit spatial relationships is to develop models or
techniques to incorporate spatial information into the spatial data mining process.
Fig1:Thesystematic structure of spatial data mining.
One of the fundamental assumptions of statistical analysis is that the data
samples are 2
independently generated: like successive tosses of coin, or the rolling of a die. However, in the analysis
of spatial data, the assumption about the independence of samples is generally false. In fact, spatial
data tends to be highly self correlated. For example, people with similar characteristics, occupation and
background tend to cluster together in the same neighborhoods. The economies of a region tend to be
similar. Changes in natural resources, wildlife, and temperature vary gradually over space. The
property of like things to cluster in space is so fundamental that geographers have elevated it to the
status of the first law of geography: “Everything is related to everything else but nearby things are more
related than distant things”.In spatial statistics, an area within statistics devoted to the analysis of
spatial data, this property is called spatial autocorrelation.
Knowledge discovery techniques which ignore spatial autocorrelation typically
perform poorly in the presence of spatial data. Often the spatial dependencies arise due to
the inherent characteristics of the phenomena under study, but in particular they arise due
to the fact that the spatial resolution of imaging sensors are finer than the size of the object
being observed.For example, remote sensing satellites have resolutions ranging from 30
meters (e.g., the Enhanced Thematic Mapper of the Landsat 7 satellite of NASA) to one
meter (e.g., the IKONOS satellite from SpaceImaging), while the objects under study (e.g.,
Urban, Forest, Water) are often much larger than 30 meters. As a result, per-pixel-based
classifiers, which do not take spatial context into account, often produce classified images
with salt and pepper noise. These classifiers also suffer in terms of classification accuracy.
The spatial relationship among locations in a spatial framework is often modeled via
a contiguity matrix. A simple contiguity matrix may represent a neighborhood relationship
defined using adjacency, Euclidean distance, etc. Example definitions of neighborhood
using adjacency include a four-neighborhood and an eight-neighborhood. Given a gridded
spatial framework, a four-neighborhood assumes that a pair of locations influence each
other if they share an edge. An eight-neighborhood assumes that a pair of locations
influence each other if they share either an edge or a vertex.
Fig 2:A Spatial Framework and Its Four-neighborhood Contiguity Matrix.
3
In figure2(a) above shows a gridded spatial framework with four locations, A,B, C, and
D. A binary matrix representation of a four-neighborhood relationship is shown in Fig2(b). The
row normalized representation of this matrix is called a contiguity matrix, as shown in Fig2(c).
The prediction of events occurring at particular geographic locations is very important in
several application domains. Examples of problems which require location prediction include
crime analysis, cellular networking, and natural disasters such as fires, floods, droughts,
vegetation diseases, and earthquakes. Two spatial data mining techniques for predicting
locations, namely the Spatial Autoregressive Model (SAR) and Markov Random Fields (MRF).
An Application Domain We begin by introducing an example to illustrate the different
concepts related to location prediction in spatial data mining. We are given data about two
wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio USA in order to predict the
spatial distribution of a marsh breeding bird, the red-winged blackbird (Agelaius phoeniceus). The
data was collected from April to June in two successive years, 1995 and 1996.
4
Chapter 2
Database primitives
In this section, we introduce a small set of database primitives for spatial data mining . The
major difference between mining in relational databases and mining in spatial databases is that
attributes of the neighbors of some object of interest may have an influence on the object itself.
Therefore, our database primitives are based on the concept of spatial neighborhood relations.
2.1 Neighborhood Relations
The mutual influence between two objects depends on factors such as the topology,
the distance or the direction between the objects. Three basic types of spatial relations:
topological, distance and direction relations which are binary relations, i.e. relations between
pairs of objects. Spatial objects may be either points or spatially extended objects such as
lines, polygons or polyhedrons. In general, the points p = (p1, p2, . . ., pd) are elements of a ddimensional Euclidean vector space called Points[1]. Topological relations are those relations
which are invariant under topological transformations, i.e. they are preserved if both objects
are rotated, translated or scaled simultaneously.
Definition 1: (topological relations) The topological relations between two objects A and
B are derived from the nine intersections of the interiors, the boundaries and the
complements of A and B with each other. The relations are: A disjoint B, A meets B, A
overlaps B, A equals B, A covers B, A covered-by B, A contains B, A inside B.
Definition 2: (distance relations) Distance relations are those relations comparing the
distance of two objects with a given constant using one of the arithmetic operators. The
distance dist between two objects, i.e. sets of points, can then simply be defined by the
minimum distance between their points.
Definition 3: (direction relations) To define direction relations O2 R O1, we distinguish between the
source object O1 and the destination object O2 of the direction relation R. There are several
5
possibilities to define direction relations depending on the number of points they consider
in the source and the destination object.
Fig3:Illustrates some of the topological, distance and direction relations using 2D polygons.
We define the direction relation of two spatially extended objects using one representative
point rep(O1) of the source object O1 and all points of the destination object O2. The
representative point of a source object may, e.g., be the center of the object. This representative
point is used as the origin of a virtual coordinate system and its quadrants define the directions.
Definition 4: (complex neighborhood relations) If r1 and r2 are neighborhood relations, then r1
∧ r2 and r1 ∨ r2 are also neighborhood relations - called complex neighborhood relations.
2.2 Neighborhood Graphs and Their Operations
Based on the neighborhood relations, we introduce the concepts of neighborhood
graphs and neighborhood paths and some basic operations for their manipulation.
Definition 5: (neighborhood graphs and paths) Let neighbor be a neighborhood relation and DB
⊆ 2Points be a database of objects.
a) A neighborhood graph G
DB
neighbor
= ( N, E ) is a graph with the set of nodes N which we
identify with the objects o ∈ DB and the set of edges E ⊆ N × N where two nodes n1 and n2 ∈ N
are connected via some edge of E iff neighbor(n 1,n2) holds. Let n denote the cardinality of
N and let e denote the cardinality of E. Then, f:= e / n denotes the average number of
edges of a node,i.e. f is called the “fan out” of the graph.
b) A neighborhood path is a sequence of nodes [n1, n2, . . ., nk], where neighbor(ni, ni+1)
holds for all n ∈ N, 1 ≤ i < k . The number k of nodes is called the length of the neighborhood path.
6
c) A neighborhood path [n1, n2, . . ., nk] is valid iff ∀ i ≤ k, j < k: i ≠ j ⇔ n i ≠ n j .
2.3 Filter Predicates for Neighborhood Paths
Neighborhood graphs will in general contain many paths which are irrelevant if not
“mislead ing” for spatial data mining algorithms. For finding significant spatial patterns, we
have to consider only certain classes of paths which are “leading away” from the starting
object in some straightforward sense. Such spatial patterns are most often the effect of some
kind of influence of an object on other objects in its neighborhood. Furthermore, this influence
typically decreases or increases continuously with increasing or decreasing distance. The task
of spatial trend analysis, i.e. Finding patterns of systematic change of some non-spatial
attributes in the neighborhood of certain database objects, can be considered as a typical
example. Detecting such trends would be impossible if we do not restrict the pattern space in
a way that paths changing direction in arbitrary ways or containing cycles are eliminated[1].
Fig4:Filter Starlike and filter variable starlike
Definition 6: (filter starlike and filter variable starlike) Let p = [n1,n2,...,nk] be a
neighborhood path and let reli be the exact direction for ni and ni+1, i.e. ni+1 reli ni holds. The
predicates starlike and variable-starlike for paths p are defined as follows:
starlike(p) :⇔ (∃ j < k: ∀ i > j: ni+1 reli ni ⇔ reli ⊆ relj), if k > 1; TRUE, if k=1 variablestarlike(p) :⇔ (∃ j < k: ∀ i > j: ni+1 reli ni ⇔ reli ⊆ rel1), if k > 1; TRUE, if k=1.
7
Chapter 3
Knowledge Discovery
Knowledge discovery in databases (KDD) has been defined as the non-trivial
process of discovering valid, novel, and potentially useful, and ultimately understand able
patterns from data .The process of KDD is interactive and iterative, involving several steps
such as the following ones:
• Selection: selecting a subset of all attributes and a subset of all data from which the
knowledge should be discovered.
• Data reduction: using dimensionality reduction or transformation techniques to reduce
the effective number of attributes to be considered.
• Data mining: the application of appropriate algorithms that, under acceptable computational
efficiency limitations, produce a particular enumeration of patterns over the data.
• Evaluation: interpreting and evaluating the discovered patterns with respect to their usefulness
in the given application.
Spatial Database Systems (SDBS) are database systems for the management of spatial
data. To find implicit regularities, rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies, spatial data mining algorithms are very
important .Most existing data mining algorithms run on separate and specially prepared files, but
integrating them with a database management system (DBMS) has the following advantages.
Redundant storage and potential inconsistencies can be avoided. Furthermore, commercial
database systems offer various index structures to support different types of database queries.
This functionality can be used without extra implementation effort to speed-up the execution of
data mining algorithms (which, in general, have to perform many database queries)[4]. Similar to
the relational standard language SQL, the use of standard primitives will speed-up the
development of new data mining algorithms and will also make them more portable.
3.1 Spatial Classification
The task of classification is to assign an object to a class from a given set of
classes based on the attribute values of this object. In spatial classification the attribute
values of neighboring objects are also considered.
8
The classification algorithm works as follows:The relevant attributes are extracted by
comparing the attribute values of the target objects with the attribute values of their nearest
neighbors. The determination of relevant attributes is based on the concepts of the nearest hit (the
nearest neighbor belonging to the same class) and the nearest miss (the nearest neighbor
belonging to a different class). In the construction of the decision tree, the neighbors of target
objects are not considered individually. Instead, so-called buffers are created around the target
objects and the non-spatial attribute values are aggregated over all objects contained in the buffer.
For instance, in the case of shopping malls a buffer may represent the area where its customers
live or work[4]. The size of the buffer yielding the maximum information gain is chosen and this
size is applied to compute the aggregates for all relevant attributes.
The task of classification is to assign an object to a class from a given set of
classes based on the attribute values of the object. In spatial classification the attribute
values of neighbouring objects may also be relevant for the membership of objects and
therefore have to be considered as well.
The extension to spatial attributes is to consider also the attribute of objects on a
neighbourhood path starting from the current object. Thus, we define generalized
attributes for a neighbourhood path p = [o 1, . . ., ok] as tuples (attribute-name, index) where
index is a valid position in p representing the attribute with attribute name of object o index.
The generalized attribute (economic-power,2), e.g., represents the attribute economicpower of some (direct) neighbour of object o1.
Because it is reasonable to assume that the influence of neighbouring objects and their
attributes decreases with increasing distance, we can limit the length of the relevant neighbourhood
paths by an input parameter max-length. Furthermore, the classification algorithm allows the input of a
predicate to focus the search for classification rules on the objects of the database fulfilling this
predicate. Figure 5 depicts a sample decision tree and two rules derived from it. Economic power has
been chosen as the class attribute and the focus is on all objects of type city.
9
Fig 5:Sample decision tree and rules discovered by the classification algorithm
3.2 Spatial Clustering
Spatial clustering is a process of grouping a set of spatial objects into clusters so that
objects within a cluster have high similarity in comparison to one another, but are dissimilar to
objects in other clusters. For example, clustering is used to determine the “hot spots” in crime
analysis and disease tracking. Hot spot analysis is the process of finding unusually dense
event clusters across time and space. Many criminal justice agencies are exploring the
benefits provided by computer technologies to identify crime hot spots in order to take
preventive strategies such as deploying saturation patrols in hot spot areas[6].
Spatial clustering can be applied to group similar spatial objects together; the
implicit 10
assumption is that patterns in space tend to be grouped rather than randomly located. However,
the statistical significance of spatial clusters should be measured by testing the assumption in the
data. The test is critical before proceeding with any serious clustering analyses.
3.2.1 Complete Spatial Randomness, Cluster, and Decluster
In spatial statistics, the standard against which spatial point patterns are often compared is
a completely spatially random point process, and departures indicate that the pattern is not
distributed randomly in space. Complete spatial randomness (CSR) [Cressie1993] is synonymous
with a homogeneous Poisson process. The patterns of the process are independently and
uniformly distributed over space, i.e., the patterns are equally likely to occur anywhere and do not
interact with each other. However, patterns generated by a non-random process can be either
cluster patterns(aggregated patterns) or decluster patterns(uniformly spaced patterns).
Fig 6:Illustration of CSR, Cluster, and Decluster Patterns
Several statistical methods can be applied to quantify deviations of patterns from a
complete spatial randomness point pattern. One type of descriptive statistics is based on quadrats
(i.e., well defined area, often rectangle in shape)[6]. Usually quadrats of random location and
orientations in the quadrats are counted, and statistics derived from the counters are computed.
Another type of statistics is based on distances between patterns; one such type is
Ripley’s K-function . After the verification of the statistical significance of the spatial
clustering, classical clustering algorithms can be used to discover interesting clusters.
3.2.2 DECODE (DiscovEring Clusters Of Different dEnsities)
Discovering clusters in complex spatial data, in which clusters of different densities are
superposed, severely challenges existing data mining methods. For instance, in seismic research,
11
foreshocks (which indicate forthcoming strong earthquakes) or aftershocks (which may help to
elucidate the mechanism of major earthquakes) are often interfered by background earthquakes.
In this context, data are presumed to consist of various spatial point processes in each of which
points are distributed at a constant, but different intensity. DECODE is a new density-based cluster
method(DECODE) to discover clusters of different densities in spatial data[8].
The novelties of DECODE are 2-fold:
(1) It can identify the number of point processes with little prior knowledge.It can
automatically estimate the thresholds for separating point processes and clusters.
Fig 7:Flowchart of the method for discovering clusters of different densities in spatial data
Two strategies have been adopted for finding density homogeneous clusters in densitybased methods:
(1) Grid-based clustering method
Map data into a mesh grid and identify dense regions according to the density in
cells. The main advantage : detection capability for finding arbitrary shaped clusters
and their high efficiency in dealing with complex data sets which are characterized
by large amounts of data, high dimensionality and multiple densities.
(2) Distance-based clustering method
Often based on the mth nearest neighbor distance.
When clusters with different densities and noise coexist in a data set, few existing
methods can determine the number of processes and precisely estimate the
parameters. Therefore, DECODE is a solution for it.
12
DECODE is based upon a reversible jump Markov Chain Monte Carlo( MCMC) strategy and divided into three steps:
(1) Map each point in the data to its mth nearest distance.
(2)Classification thresholds are determined via a reversible jump MCMC strategy.
(3) Clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the
thresholds.
Some aspects of the model
1)
Analysis of sensitivity to prior specification
Two strategies can be applied to the selection of β.
a. to update β constantly during the process.
b. To fix β. Where λmax is the intensity corresponding to min(Xm)
Fig 8: The cumulative occupancy fractions of j at different f b : (a) fb = 2; (b) fb = 10; (c) fb = 50; (d) fb = 200; (e) fb = 1,000; (f)
fb = 5,000; (g) fb = 20,000; (h) fb = 250,000; (i) updating β (with δ = 1, α = 1, h = 10/(max(Xm ) − min(Xm )), g = 0.2)
13
As shown above, we find that updating β produces the highest posterior probability.
From the results, it is observed that: updating β tends to produce a model with more point
processes. fb is the parameter which significantly influences the results when fixing β.
Total complexity of the algorithm is O(T(Mk+(k-1)k!))where T is the sweep
times. M is the number of points of the data set.
k is the number of point processes that the algorithm determines.
DECODE is a robust and automatic cluster method which needs less prior knowledge of
the target data.
3.3 Association rules
Spatial association rule is a rule indicating certain association relationship among a
set of spatial and possibly some nonspatial predicates. A strong rule indicates that the
patterns in the rule have relatively frequent occurrences in the database and strong
implication relationships. raditional data organization and retrieval tools can only handle
the storage and retrieval of explicitly stored data. The extraction and comprehension of the
knowledge implied by the huge amount of spatial data, though highly desirable, pose great
challenges to currently available spatial database technologies.
A spatial characteristic rule is a general description of a set of spatial-related data. For
example, the description of the general weather patterns in a set of geographic regions is a
spatial characteristic rule. A spatial discriminant rule is the general description of the
contrasting or discriminating features of a class of spatial-related data from other class(es).
For example, the comparison of the weather patterns in two geographic regions is a spatial
discriminant rule. A spatial association rule is a rule which describes the implication of one or a
set of features by another set of features in spatial databases. For example, a rule like most
big cities in Canada are close to the Canada-U.S. border" is a spatial association rule.
A strong rule indicates that the patterns in the rule have relatively frequent occurrences in
the database and strong implication relationships.A spatial association rule is a rule in the form of
P1  …  Pm  Q1  … Qn
(c%) ,
where at least one of the predicates P1 ,…, Pm , Q1 ,…, Qn is a spatial predicate, and c% is
the confidence of the rule.
A rule “P Q/S” is strong if predicate “P  Q” is large in set S and the
confidence of “P  Q/S” is high.
14
Steps for extracting the association rules:
STEP 1: Task_relevant_DB := extract task relevant objects(SDB ,RDB);
(Relevant objects are collected into one database)
STEP 2: Coarse_predicate_DB := coarse spatial computation(Task relevant
DB); (Spatial algorithm is executed at the coarse resolution level)
STEP 3: Large_Coarse_predicate_DB :=
filtering_with_mininmum_support(Coarse_predicate_DB);
(computes the support for each predicate in Coarse_predicate_DB,and filters out
those entries whose support is below the minimum support threshold at the top level.)
STEP 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB);
(Algorithm is executed at fine resolution level.)
STEP 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);
3.4 Outlier Detection
Spatial outlier represent locations which are significantly different from their
neighborhoods even though they may not be significantly different from the entire
population. Identification of spatial outliers can lead to the discovery of unexpected,
interesting and implicit knowledge, such as local instability[3].
A spatial outlier is a spatially referenced object whose non-spatial attribute values
are significantly different from those of other spatially referenced objects in its spatial
neighborhood. Informally, a spatial outlier is a local instability or a spatially referenced
object whose non-spatial attributes are extreme relative to its neighbors,even they may not
be significantly different from the entire population. For example, a new house in an old
neighborhood of a growing metropolitan area is a spatial outlier based on the non-spatial
attribute house age. So, the problem is given the components of the S-outlier definition,
the objective is to design a computationally efficient algorithm to detect the S-outliers.
Techniques of outlier detection:Graphical methods include Variogram clouds and
Moran Scatterplots. The algebraic method includes Scatterplot and Z(S(x)).
A variogram cloud displays data points related by neighborhood relationships. For each pair of
loactaions,the square-root of the absolute difference between attribute values at the locations versus
the Euclidean distance between the loacations are plotted. In data sets exhibiting strong spatial
dependence,the variance in the attribute differences will increase with increasing distance
15
between locations.Locations that are near to one another ,but with large attribute differences,might indicate spatial outlier, even
though the values at both locations may appear to be reasonable when examining the dataset non-spatially .
Fig 9:A Dataset for Outlier Detection.
In Fig9(a), the X-axis is the location of data points in one-dimensional space; the Y-axis is the attribute value for each
data point. Global outlier detection methods ignore the spatial location of each data point and fit the distribution model to the
values of the non-spatial attribute. The outlier detected using this approach is the data point G, which has an extremely high
attribute value 7.9, exceeding the threshold of μ + 2σ = 4.49 + 2 ∗ 1.61 = 7.71, as shown in Figure 9(b). This test assumes a normal
distribution for attribute values. On the other hand, S is a spatial outlier whose observed value is significantly different than its
neighbors P and Q[3].
Fig 10:Variogram cloud
16
This plot shows that two pairs (P,S ) and (Q,S ) in the left hand side lie above the
main group of pairs, and are possibly related to outliers. The poin S may be identified as a
spatial outlier since it occurs in both pairs ( Q,S) and ( P, S). This technique requires nontrivial post-processing of highlighted pairs to separate spatial outliers from their neighbors,
particualrly when multiple outliers are present or density varies greatly.
Fig 11:Scatter plot
A scatterplot shows attribute values on the X-axis and the average of the attribute
values in the neighborhood on the Y-axis. A least square regression line is used to identify
spatial outliers. A scatter sloping upward to the right indicates a positive spatial
autocorrelation; a scatter sloping upward to the left indicates a negative spatial auto correlation .The residual is defined as the vertical distance (Y- axis) between a point P with
location(Xp,Yp) to the regression line Y = mX +b ,that is, residual ∈ =Yp -(mXp +b).
A scatterplot shows the attribute values plotted against the average of the attribute
values in neighboring areas for the given dataset.
17
Chapter 4
Efficient polygon amalgamation method
The polygon amalgamation operation computes the boundary of the union of a set polygons.
It’s a fundamental operation for emerging new applications such as spatial OLAP and
spatial data mining.Two efficient methods for identifying internal polygons without
retrieving them from databases[7].
1) Adjacency-based method
2) Occupancy-based method
4.1 Adjacency-based approach
Two polygons are adjacent if they have at least one pair of identical line
segments.The adjacency table of a set of polygons P is defined as a two column table
ADJACENCY (p, p’) where p,p’ are identifiers of polygons in P. A tuple (p,p’) is in the table
ADJACENCY if and only if p and p’ are adjacent. The basic idea is to use adjacency table
to identify boundary polygons. Then remove identical line segments in boundary polygons
to get t(S). Hence, most of the internal polygons will not be retrieved. A polygon is on the
boundary of S if it’s adjacent to some polygons which don’t belong to S[7].
Fig 12:An example of shadow ring
18
The adjacency table for the four polygons in Figure 12(a) where P = {p1 , p2 , p3 , p4 }: {(p1;p2);p1;p3); (p1;p4); (p2;p3);
(p2;p4); (p3;p4); (p2;D); (p3;D);(p4;D)}
where (p; D ) means p has at least one line segment adjacent to no P polygons (in this case we say p is adjacent to a dummy
polygon also labeled as D ). For S ⊆ P , by definition we have
δS = { p|p ∈S, ∃(p,p’) ∈ ADJACENCY, p’ ∉ S }
So, δS :p2, p3, p4 are boundary polygons. P1 will not be involved in the computation. So those line segments of p2, p3 and
p4 which were originally shared with p1 have lost their counterpart. These line segments couldn’t be removed and form a shadow
ring.
Identify δS +: the set of sub-boundary polygons which are internal polygons adjacent to boundary polygons.
We extract both δS and δS +, remove all identical line segments in them. We get the target polygon and possibly a shadow
ring. But this time we know the shadow ring is formed by line segments of sub-boundary polygons δS +. So we then remove the line
segments belonging to δS +. The remaining line segments will form the boundary polygon.
4.2 Occupancy-based approach
Z-values: a spatial data access method which establishes certain relationship between the data space and spatial objects or
their approximation bounding rectangles. It approximates a given object’s shape by recursively decomposing the embedding data
space into smaller data space known as Peano cells.
Z-values is commonly used spatial indexing mechanism[7]. Our occupancy-based approach is built on top of a simple
extension of Z-values.
Fig 13: Z-order and object approximation using z-values
19
One way to assign Z-values.
1. The z-value of the initial space, D, is 1;
2. The z-values of the four quadrants of a Peano cell whose z-value is z = z1 ...zn , 1
≤ zi ≤ 4, 1 ≤ i ≤ n, are z1, z2, z3 and z4 respectively following the z-order;
The accuracy of approximation can be improved by assigning multiple Z-values to a
polygon. Occupancy-based approach will not produce shadow rings. Because:
Any line segment which doesn’t overlap with boundary cells is not part of the target
polygon, and thus can be discarded.
Any line segment inside a boundary cell either is part of the target polygon, or its
counterpart from another polygon must also be inside this boundary cell.
Let C be the set of all Peano cells with which S polygons overlap with. If c C is not
completely occupied by S polygons, we call c boundary cell. An S polygon overlapping
with a boundary cell is likely to be a boundary polygon.
The spatial indices using z-values associate objects with Peano cells. That is,each
index entry is of the form (z, p), stating that object p overlaps with Peano cell z. There is
no information about what percentage of the cell is occupied by p, thus it is not sufficient to
determine if a Peano cell is completely occupied by a set of objects. Therefore, we extend
the index entry to the form of (z, p, α) where α is the occupancy ratio. Let p ∩ z be the
polygon produced from clipping polygon p by the Peano cell z, then
α=area(p ∩ z)/area(z)
20
Chapter 5
Challenges and Applications
5.1 Challenges
A noteworthy trend is the increasing size of data sets in common use, such as records of
business transactions, environmental data and census demographics. These data sets often
contain millions of records, or even far more. This situation creates new challenges in coping with
scale. The challenges and impacts can be classified into three main areas, namely, geographic
information in knowledge discovery, geographic knowledge discovery in geographic information
science and geographic knowledge discovery in geographic research[2] .
The huge volume of spatial data . So retrieval and storage is difficult.The spatial
data types/structures are complex .Expensive spatial processing operations .
5.2 Application in land use dynamic monitoring
The mass data storaged in spatial database includes spatial topological, nospatial
properties and objects appearing variety on the time .The main knowledge types that can be
discovered in the spatial database are: general geometric knowledge, spatial distribution rules,
spatial association rules, spatial clustering rules, spatial characteristic rules, spatial
discriminate rules, spatial evolution rules etc.. For land use dynamic monitoring, according to
the knowledge mined in the spatial database, there are following several applications[4]:
a. Make prediction of land variety
According to geographic location, soil characters, geologic circumstance, prevent or control
flood information, transportation circumstance etc. of the land, Making use of the spatial
distribution rules, spatial clustering rules, spatial characteristic rules, spatial discriminate rules
to analyze can get the distributing and the future development of the land .
b.Provide decision support for the city planning
Spatial data mining technique makes use of general geometric knowledge, spatial
distribution rules, spatial association rules, spatial evolution rules to get many
factors about terrain, prevent or control flood, preventive pollution during the city
planning for providing good data environment in city construction.
21
c.Valid management and analysis of remote sensing monitoring result
According to algorithm of spatial data mining, based on knowledge discovery, it can validly
output various statistical charts, images, query result and analysis result for decision.
5.3 Application in Visual Data Mining
For data mining of large data sets to be effective, it is also important to include
humans in the data exploration process and combine their flexibility, creativity, and general
knowledge with the enormous storage capacity and computational power of today’s
computers. Visual data mining applies human visual perception to the exploration of large
data sets. Presenting data in an interactive, graphical form often fosters new insights,
encouraging the formation and validation of new hypotheses to the end of better problemsolving and gaining deeper domain knowledge. Visual data mining often follows a three
step process: Overview first, zoom and filter, and then details-on-demand [5].
Some of they key advantages of visual data exploration over automatic data mining
techniques alone are:
• yields results more quickly, with a higher degree of user satisfaction and confidence in findings .
• are especially useful when little is known about the data and exploration goals are vague,
because the analyst guides the search and can shift or adjust goals on the fly .
• can deal with highly non-homogeneous and noisy data .
• can be intuitive and require less understanding of complex mathematical or statistical
algorithms or parameters .
• can provide a qualitative overview of the data, allowing unexpected phenomena .
22
Chapter 6
Conclusion
Spatial Data Mining extends relational data mining with respect to special features of
spatial data, like mutual influence of neighboring objects by certain factors (topology, distance,
direction). It is based on techniques like generalization, clustering and mining association
rules. Some algorithms require further expert knowledge that can not be mined from the data,
like concept hierarchies. Spatial data mining is a niche area within data mining for the rapid
analysis of spatial data. Spatial data can potentially influence major scientific challenges,
including the study of global climate change and genomics. The distinguishing characteristics
of spatial data mining can be netaly summarized by the first law of geography:All things are
related, but nearby things are more related than distant things. Spatial data mining is being
used in various fields like remote sensing sattelite, Visyal data mining to mine data.
23
Bibliography
[1] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Jörg Sander . Spatial Data
Mining:Database Primitives, Algorithms and Efficient DBMS Support . "Integration of Data
Mining with Database Technology", Data Mining and Knowledge Discovery, an
International -Journal, Kluwer Academic Publishers, 1999.
[2] Shashi Shekhar , Pusheng Zhang , Yan Huang , Ranga Raju Vatsavai. Trends in
Spatial Data Mining,Department of Computer Science and Engineering, University of
Minnesota 4-192, 200 Union ST SE, Minneapolis, MN 55455
[3] Sashi Sekhar ,Chang-Tien Lu and Pusheng Zhang. A Unified Approach To Detecting
Spatial Outlier. Published in:Journal Geoinformatica Volume 7 Issue 2, June 2003
[4] Zhong yong a, Zhang jixian a, Yan qin. Research on spatial data mining technique
applied in land use dynamic monitoring.Proceedings of International Symposium on
Spatio-temporal Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion.
Volume XXXVI-2/W25, 2005
[5] Daniel A. Keim Christian Panse Mike Sips . Pixel Based Visual Mining of Geo-Spatial Data .
Published in:Journal IEEE Computer Graphics and Applications Volume 24 Issue 5,
September 2004
[6] Krzysztof Koperski and Jiawei Han . Discovery of Spatial Association Rules in Geographic
Information database.Proceeding SSD '95 Proceedings of the 4th International Symposium on
Advances in Spatial Databases Springer-Verlag London, UK ©1995
[7] Xiaofang Zhou1 , David Truffet2 , and Jiawei Han3 . Efficient Polygon Amalgamation
Methods for Spatial OLAP and Spatial Data Mining . Published in:Proceeding SSD '99
Proceedings of the 6th International Symposium on Advances in Spatial Databases
Springer-Verlag London, UK ©1999
[8] Pei T, Jasra A, Hand Dj, et al. DECODE: A new method for discovering clusters of
different densities in spatial data.Springer Science+Business Media, LLC 2008
[9] Raymond T. Ng , Jiawei Han . Efficient and Effective Methods for Spatial Data
Mining.Published in:Proceedings VLDB '94 Proceedings of the 20th international
conference on very large databases, San Franscisco, CA,USA ©1994
24