Download Spatial Data Mining - KGISL Institute of Information Management

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
SPATIAL DATA MINING
Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM
INTRODUCTION
The main difference between data mining in relational DBS and in spatial DBS is
that attributes of the neighbors of some object of interest may have an influence on the
object and therefore have to be considered as well. The explicit location and extension of
spatial objects define implicit relations of spatial neighborhood (such as topological,
distance and direction relations), which are used by spatial data mining algorithms.
Therefore, new techniques are required for effective and efficient data mining.
SPATIAL DATA
Spatial data are data that have a spatial or location component. Spatial data can
be viewed as data about objects that they are located in a physical space. This may be
implemented with a specific location attributed such as address or latitude/longitude or
may be more implicitly included such as by a portioning of the database based on
location.
In addition, spatial data me be accessed using queries containing spatial
operators such as near, north and south adjacent and contained in. Spatial data are
stored in spatial databases that contain spatial and non-spatial data about objects,
because of the inherent distance information associated with spatial data, spatial
databases are often stored using special data structures or indices built using distance or
topological information.
Spatial data are required for many current information technology systems,
Geographic Information Systems (GIS) are used to store information related to
geographic locations on the surface of the earth. This includes applications related to
weather, community infrastructure needs, disaster management, and hazardous waste.
Data mining activities include prediction of environmental catastrophes, biomedical
applications including imaging and illness diagnosis, also require spatial systems.
SPATIAL DATA MINING
Spatial data mining i.e., mining knowledge from large amounts of spatial data, is
a highly demanding field because huge amounts of spatial data have been collected in
various applications, ranging from remote sensing to Geographical Information Systems
(GIS), Computer Cartography, Environmental Assessment and planning, etc.
The collected data far exceeded human’s ability to analyze. Recent studies on
data mining have extended the scope of data mining from relational and transactional
databases to spatial databases. This paper summarizes recent works on spatial data
mining from spatial data generalization, to spatial data clustering, mining spatial
association rules etc. It shows that spatial data mining is a promising field, with fruitful
research results and many challenging issues.
Spatial Data Mining Background
Statistical spatial analysis has been the most common approach for analyzing
spatial data. Statistical analysis is a well-studied area and therefore there exist a large
number of algorithms including various optimization techniques. It handles very well
numerical data and usually comes up with realistic models of spatial phenomena. The
major disadvantage of this approach is the assumption of statistical independence
among the spatially distributed data. This causes problems as many spatial data are in
fact interrelated, i.e. spatial objects are influenced by their neighboring objects.
Furthermore, the statistical approach cannot model non-linear rules very well and
symbolic values like names are handled poorly. Statistical methods also do not work
well with incomplete or inconclusive data. Another problem related to statistical spatial
analysis is the expensive computation of the results.
With the advent of data mining researchers proposed various methods for
discovering knowledge from large databases. Most of them concentrate on relational or
transaction databases. These methods strived to combine the already mature areas like
machine learning, databases and statistics. Studies laid a foundation for spatial data
mining.
Machine learning techniques learning from examples and generalization and
specialization are widely used in spatial data mining. It did not take long before the
statistical cluster analysis technique was modified for the use in spatial data mining.
Also other methods were extended toward knowledge discovery in spatial databases. In
the next section we define some commonly used terms in spatial data mining.
PRIMITIVES OF SPATIAL DATA MINING
We have developed a set of database primitives for mining in spatial databases
which are sufficient to express most of the algorithms for spatial data mining and which
can be effectively supported by a Database Management Systems (DBMS). We believe
that the use of these database primitives will enable the integration of spatial data
mining with existing Database Management Systems and will speed-up the
development of new spatial data mining algorithms. The database primitives are based
on the concepts of neighborhood graphs and neighborhood paths.
RULES
Various kinds of rules can be discovered from databases in general. For example,
characteristic rules, discriminant rules, association rules, or deviation and evolution
rules can be mined.
A spatial characteristic rule is a general description of spatial data. For example,
a rule describing the general price range of houses in various geographic regions in a
city is a spatial characteristic rule.
A spatial discriminant rule is a general description of the features discriminating
or contrasting a class of spatial data from other classes like the comparison of price
ranges of houses in different geographical regions.
Finally, a spatial association rule is a rule, which describes the implication of
one or a set of features by another set of features in spatial databases. For example, a rule
associating the price range of the houses with nearby spatial features, like beaches, is a
spatial feature, like beaches is a spatial association rule.
THEMATIC MAPS
Thematic maps present the spatial distribution of a single or a few attributes.
This differs from general or reference maps where the main objective is to present the
position of objects in relation to other spatial objects. Thematic maps may be used for
discovering different rules. For example, we may want to look at temperature thematic
map while analyzing the general weather pattern of a geographic region.
There are two ways to represent thematic maps: raster and vector. In the raster
image from thematic maps have pixels associated with the attribute values. For example,
a map may have the altitude of the spatial objects coded as the intensity of the pixel (or
the color).
In the vector representation a spatial object is represented by its geometry most
commonly being the boundary representation along with thematic attributes.
IMAGE DATABASES
There is special kind of spatial databases where data almost entirely consists of
images or pictures. Image databases are used in remote sensing, medical imaging, etc.
They are usually stored in form of grid arrays representing the image intensity in one or
more spectral ranges.
TOPOLOGICAL RELATIONSHIPS
These relationships are based on the ways in which two objects are placed in a
geographic domain.
•
Disjoint: A is disjoint from B if there are no points in A that are contained
in B.
•
Overlaps or Intersects: A overlaps with B if there is at least one point in
A that is also in B.
•
Equals: A equals B if all points in the two objects are in common.
•
Covered By or Inside or Contained In: A is contained in B if all points in
A are in B. There may be points in B that are not in A.
•
Covers or Contains: A contains B if B is contained in A.
Spatial Data Structures, Computations and Queries
Algorithms for spatial data mining involve the use of spatial operations like
spatial data joins, map overlays, nearest neighbor queries and others. Therefore, efficient
Spatial Access Methods (SAM) and the data structures for such computation is also a
concern in spatial data mining.
SPATIAL DATA STRUCTURES
Spatial data structures consist of points, lines, rectangles etc. In order to build
indices for these data, multidimensional trees have been proposed. These include quad
trees, k-d trees, R-trees, R*-trees etc. one of the prominent SAMs which was much
discussed in the literature recently is R-tree and its modification R*-tree.
R-Tree
One approach to indexing spatial data represented as MBRs is an R-tree. Each
successive layer in the tree identifies smaller rectangles. In an R-tree, cells may actually
overlap. An object is represented by an MBR that is located within one cell. Basically, a
cell is the MBR that contains the related set of objects at a lower level of decomposition.
Objects stored in R-trees are approximated by Minimum Bounding Rectangles
(MBR). R-tree in every node stores as a set of rectangles. At the leaves there are stored
pointers to representation of polygon’s boundaries and polygon’s MBRs. At the internal
nodes each rectangle is associated with a pointer to a child and represents minimum
bounding rectangle of all rectangles stored in the child. The illustration of the use of
MBRs by looking at a lake is shown in figure1.
Figure 1 : Minimum Bounding Rectangles (MBR)
Figure 2 illustrates the use of R-Tree:
Figure 2 : R-Tree
Quad Tree
One of the original data structures proposed for spatial data is that of a quad
tree. A quad tree represents a spatial object by a hierarchical decomposition of the space
into quadrants (cells). The spatial area has been divided into two layers of quadrant
divisions.
The number of layers needed depends on the precision desired. Obviously, the
more layers the more overhead is required for the data structure. Each level in the quad
tree corresponds to one of the hierarchical layers. The quad tree example is shown in
figure 3.
Figure 3 : Quad Tree
Spatial Computations
Spatial Join is one of the most expensive spatial operations. In order to make
spatial queries efficient spatial join has to be efficient as well. Brinkhoff et al proposed an
efficient multilevel processing of spatial joins using R*-Trees and various approximation
of spatial objects. The first step - filter - finds possible pairs of intersecting objects using
their MBRs and later other approximations. In the second step - refinement - detailed
geometric procedure is performed to check for intersection. Another important spatial
operation, map overlay, is especially important in Geographic Information Systems.
Spatial Query Processing
The complexity of spatial operations, much work has been performed to examine
spatial query processing and its optimization. A traditional selection query accessing
non-spatial data uses the standard comparison operations: <, >, <=, >=, !=. A spatial
selection is a selection on spatial data that may use other selection comparison
operations.
The types of spatial comparators that could be used to include near, north, south,
west and east contained in and overlap or intersect. The following are examples of
several spatial selection queries. * Find all houses near to Big Temple of Tanjore *. The
architecture for spatial database called SAND (spatial and non-spatial data) architecture,
which is a model of the extended relational database with spatial operations. This
architecture provides both spatial and non-spatial components of spatial database to
participate in query processing and optimization.
SPATIAL DATA MINING ARCHITECTURE
Various architectures (models) have been proposed for data mining. They
include Han’s architecture for general data mining prototype DBLEARN/DBMINER,
Holsheimer et al’s parallel architecture, and Matheus et al’s
multi component
architecture. Almost all of those architectures have been used or extended to handle
spatial data mining. Matheus et al’s architecture seems to be very general and has been
used by other researchers in spatial data mining, including Ester et al. This architecture
comparable to others is presented in the below figure 4.
In this architecture, the user may control every step of the mining process.
Background knowledge like spatial and non-spatial concept hierarchies or information
about database is stored in a knowledge base. Data is fetched from the storage using the
DB Interface which enables optimization of queries.
User
Controller
DBM
S
DB
Interface
Focus
Domain
Knowledge
Pattern
Extraction
Discoveries
Evaluatio
n
Knowledge base
Figure 4 : Spatial Data Mining Architecture
Spatial data index structures like R-Trees may be used for efficient processing.
The Focusing Component decides which parts of data are useful for pattern recognition.
For example, it may decide that only some attributes are relevant to the knowledge
discovery task, or it may extract objects whose usage promises good results. Rules and
patterns are discovered by the Pattern Extraction module. This module may use
statistical, machine learning, and data mining techniques on conjunction with
computational geometry algorithms to perform the task of finding rules and relations.
The interestingness and significance of these patterns is then processed by the Evaluation
module to possibly eliminate obvious and redundant knowledge. The four last
components may interact between themselves through the Controller part.
ARCHITECTURE: KDD DESIGN
Methods for Knowledge Discovery in Spatial Databases
Geographic data consist of spatial objects and non-spatial description of these
objects. Non-spatial description of spatial objects can be stored in a traditional relational
database where one attribute is a pointer to spatial description of the object. Spatial data
can be described using two different properties, geometric and topological. For example,
geometric properties can be spatial location, area, perimeter etc., whereas topological
properties can be adjacency (object A is neighbor of object B), inclusion (object A is inside
in object B), and others. Thus, the methods for discovering knowledge can be focused on
the non-spatial and/or spatial properties of spatial objects.
The algorithms for spatial data mining include generalization-based methods for
mining special characteristic and discriminate rules, two-step spatial computation
technique for mining spatial association rules, aggregate proximity technique for finding
characteristics of spatial clusters etc., In the following sections, we categorize and
describe a number of these algorithms.
Generalization-Based Knowledge Discovery
One of the widely used techniques in machine learning is learning from examples.
This method is often combined with generalization. This approach cannot be directly
adopted for large spatial databases because:
1. The algorithms are exponential in the number of examples and
2. It does not handle noise and in consistent data very well.
The generalization-based knowledge discovery requires the existence of
background in the form of concept hierarchies. In case of spatial databases, there can be
two kinds of concept hierarchies, non-spatial databases; there can be two kinds of
concept hierarchies, non-spatial and spatial. Concept hierarchies can be explicitly given
by experts, or in some cases they can be generated automatically by data analysis.
It can be done on non-spatial data by (a) Climbing the concept hierarchy when
attribute values in a tuple are changed to generalized values, (b) Removing attributes
when further generalization is impossible and there are too many different values for an
attribute, and (c) Merging identical tuples. Induction is continued until values for every
attribute are generalized to the desired level. The desired level is reached when the
number of different values for the attribute in the generalized table is no greater than the
generalization threshold for this attribute. During the process of merging of identical
tuples the number of merged tuples is stored in additional attribute count to enable
quantitative presentation of acquired knowledge. Other aggregate values for merge
tuples may be stored well. The two generalization based algorithms are spatial-datadominant and non-spatial-data-dominant generalizations. Both algorithms assume that the
rules to be mined are general data characteristics and that the discovery process is
initiated by the user who provides a learning request (query) explicitly, in syntax similar
to SQL.
Spatial-Data-Dominant Generalization
In the first step all data described in the query are collected. Given the spatial
data hierarchy, generalization can be performed first on the spatial data by merging the
spatial regions according to the description stored in the concept hierarchy.
Generalization of the spatial objects continues until the spatial generalization threshold is
reached.
Non-Spatial-Data-Dominant Generalization
This method also starts with collecting all data relevant to the user query. In the
example presented in Figure 4 the DB interface extracts the perception data relevant to
the province and time period specified in the query. In the second step the algorithm
performs attribute-oriented induction on the non-spatial attribute, generalizing them to
a higher (more general) concept level. For example, the precipitation value in the large
(10 in., 15 in.) Can be generalized to the concept wet. The generalization threshold is used
to determine whether to continue or stop the generalization process. In this step the
pointers to spatial object are collected as a set and put with the generalized non-spatial
data.
In the third and last step of the step of the algorithm, neighboring areas with the
same generalized attribute are merged together based on the spatial function adjacent to
For example, if in one area the precipitation value was 17 in., and in neighboring area it
was 18 in. Both precipitation values are generalized to the concept very wet and both
areas are merged. Approximation can used to ignore all the small regions with different
non-spatial description. For example, if the majority of area land can be described as
industrial, but a few gas stations exist in this area the whole area can be described as
industrial one. The query may be presented in the form of a map with a small number of
regions with high level description.
Methods Using Clustering
Cluster analysis is a branch of statistic that has been studied extensively for
many years. The many advantage of using this technique is that has been studied
extensively for many years. The main advantage of using these techniques is that
interesting structures or clusters can be found directly from the data without using any
background knowledge, like concept hierarchies. A similar approach in machine
learning is known as unsupervised learning. We can exploit the results of research on
clustering techniques in the spatial data mining process as proposed in.
Clustering algorithms used in statistic, like PAM or CLARA, are reported to be
inefficient from the computational complexity point of view. As for the efficiency
concern, a new algorithm, called CLARANS (Clustering Large Application based upon
RANdomized Search), was developed for cluster analysis. Experimental evidence
showed that CLARANS out performs the two existing cluster analysis algorithm, PAM
(Partitioning Around Mediods) and CLARA (Clustering Large Application). Ng and
Han used CLARANS in spatial data mining algorithms, SD (CLARANS). First, we will
briefly describe the three cluster analysis algorithms.
The difference between the PAM and CLARA algorithms is that the latter one is
based upon sampling. Only a small portion of the real data is chosen as a representative
of the data and Mediods are chosen from this sample using PAM. The idea is that if the
sample is selected in a fairly random manner, then it correctly represents the whole data
set and therefore, the representative objects (Mediods) chosen, will be similar as if
chosen from the whole data set. CLARA draws multiple samples and outputs the best
clustering out of these samples. As expected, CLARA can deal with larger data sets than
PAM.
Algorithm SD (CLARANS)
In this spatial dominant approach, spatial component(s) of the relevant data item
are collected and collected and clustered using CLARANS. Then, the algorithm
performs an attribute –oriented induction on non-spatial description of the object in each
cluster. The result of the query presents high-level non-spatial description of objects in
every cluster. For example, one can find that in Vancouver expensive housing units are
clustered in 3 clusters. In the downtown cluster there are mainly expensive
condominiums; in the waterfront cluster mansions and single house are located; and the
third cluster consists mainly of single houses.
Algorithm NSD (CLARANS)
This non-spatial dominant approach first applies non-spatial generalizations.
Attribute – oriented generalization is performed on the non spatial generalization is
performed on the non-spatial generalized tuples. For example, the description of
expensive housing units can be generalized to a single houses, mansions and condo
minimums.
For each such generalized tuples, all spatial components are collected and
clustered using CLARANS to find clusters. In the final step, the clusters obtained that
way are checked to see if they overlap with clusters describing other types of objects. If
so, then the clusters are merged, and the corresponding generalized non-spatial
description of tuples is merged as well.
Depending upon the rules or the form of knowledge that user wants to discover,
it may be better to choose one or the other of the other of the above two algorithms.
Usually SD (CLARANS) is more efficient than NSD (CLARANS). But, when the
distribution of points is mainly determined by their non-spatial attributes NSD
(CLARANS) may have an edge.
CLARANS in Large Spatial Databases
Focusing methods pointed some of the drawbacks of the drawbacks of the
CLARANS clustering algorithm. First of all, CLARANS assumes that the objects to be
clustered are all stored in the main memory. This assumption may not be valid for large
database and that is why disk-based methods could be required. Secondly, the efficiency
of the algorithm can be substantially improved by modifying the focusing component of
the algorithm.
The other technique to reduce the computations is to restrict the access to certain
object that does not actually contribute the computation. The two different focusing
techniques which try to exploit this approach is focus on the relevant clusters, and focus on
a cluster.
MINING IN IMAGE AND RASTER DATABASES
Knowledge mining from Image and Raster Databases can be viewed as a case of
spatial data mining. Data mining in image database may be seen as similar to image
processing. However, In the case of data mining studies large data was processed, while
image processing usually concentrate on analysis of single or a few images.
The system is composed of three basic components: Data Focusing, Feature
extraction, classification learning. Like all other data focusing techniques, the first
component increase the overall efficiency of the system by first identifying the portion of
the image being analyzed that is most likely to contain a volcano. This is achieved by
comparing the intensity of the central pixel of a region to the region to the estimated
mean background intensity of its neighborhood pixels.
The second component of the system extracts interesting features from the data.
Standard method used in pattern recognition like edge detection or Hough transform,
deal poorly with the variability and noise presented in the case of natural data. Since it is
difficult to find attribute describing volcanoes exactly, matrices containing volcanoes
images were decomposed into eigenvectors. Eigenvalues were treated as attribute
describing volcanoes. Then the final task, which is performed by the rest of the system,
is to discriminate between volcanoes and other objects looking like volcanoes.
APPLICATIONS
Spatial Trend Detection in GIS
Spatial trend described a regular change of non-spatial attribute when moving
away from certain start object. Global and local trends can be distinguished. To detect
and explain such spatial trends, e.g. with respect to the economic power, is an important
issue in economic geography.
Spatial Characterization of Interesting Regions
Another important task of economic geography is to characterize certain target
region such as areas with a high percentage of retirees. Spatial characterization does
not only consider the attribute of the target regions but also neighboring regions and
their properties.