Chapter 7 : Spatial Data Mining:
Some key notes before reading the Book Chapter Assignment:
1. Currently, only those sections have been added for which new literature has to be
incorporated into the existing framework. The new literature sections are marked with a
different color in the index.
2. Sections where major changes have been incorporated include association analysis and
clustering (where hotspot detection is added).
3. Some of the mathematical symbols, verbatim text and figures have been added directly
from their respective sources in the first draft; these have to be changed and are subject
to revision in subsequent versions.
4. Illustrative examples and use cases for the added material are being searched for. These will
be incorporated shortly.
5. The section on spatial outliers has not been changed much from the first organization
proposed. Some of the material discovered while writing is being looked into. The
topic might be extended in the next version.
1.Introduction
1.1 Data Mining Introduction.
1.2 Spatial Data Mining.
1.3 Motivation for Spatial Data Mining.
2.1 Spatial Statistics.
2.1.1 Point process.
2.1.2 Lattice.
2.1.3 Geo Statistics
2.1.4 Spatial Auto Correlation.
3.Spatial Data Mining Tasks
3.1 Spatial Classification and Regression
3.1.1 Linear Regression
3.1.2 Spatial Regression
3.1.3 Model Evaluation
3.1.4 Predicting Location using Map similarity
3.1.5 Markov Random Fields
3.2 Spatial Association Rules
3.2.1 Association Rules in Data Mining (includes Apriori).
3.2.2 Spatial Association
3.2.3 Co-location pattern discovery
3.2.3.1 Global Co-location Miner Algorithm
3.2.3.2 Local Co-location Miner Algorithm
3.3 Spatial Clustering
3.3.2 Clustering Algorithms.
3.3.3 Limitations of Clustering Algorithms.
3.3.4 Hotspot Analysis
3.4 Spatial outlier detection
3.5 Spatio-Temporal Data Mining
3.5.1 Recent trends in spatio-temporal Data Mining
1. Introduction
1.1 Data Mining:
The amount of data is growing exponentially every day, yet with all this data around us we are
still starving for information. Data mining is the process of discovering useful information from large
datasets. It is an integral part of knowledge discovery in databases.
Data Mining tasks:
There are two major categories of data mining tasks.
Predictive tasks: The objective of these tasks is to predict the value of an unknown
attribute using the values of other, known attributes.
Descriptive tasks: The objective of these tasks is to derive patterns in the form of clusters,
trajectories and anomalies that give an understanding of the underlying relationships in the data.
Figure 1 summarizes the four broad data mining tasks that are used.
Fig1: Figure showing various Data Mining Tasks
Predictive modeling refers to the task of building a model for the target variable as a function of
explanatory variables. Regression is the technique used for continuous target variables, whereas
classification is used for discrete target variables.
Association analysis is used to discover patterns that describe strongly associated features in the
data. The discovered patterns are typically represented in the form of implication rules or
feature subsets.
Cluster analysis seeks to partition observations into groups such that the items in each group are
similar to each other but different from the items in other groups.
Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are often called outliers.
In this chapter we extend the ideas and techniques presented above to the
spatial domain.
1.2 Spatial Data Mining:
Spatial data mining (SDM) consists of extracting knowledge, spatial relationships and other
properties of spatial data. SDM is used to find implicit regularities and relations between spatial data
and/or non-spatial data. Traditional analysis assumes that samples are independent, but
spatial data is highly correlated in nature. For example, people with similar characteristics,
occupations and backgrounds tend to live in the same neighborhoods. A geographical database constitutes a spatio-temporal continuum in which properties concerning a particular place are generally linked to and
explained in terms of the properties of its neighborhood. We can thus see the great importance of
spatial relationships in the analysis process.
3.2.3 Colocation Pattern Discovery:
3.2.3.1 Global Colocation patterns:
Co-location patterns represent subsets of Boolean spatial features whose instances are often
located in close geographic proximity. Examples include symbiotic species and crime attractors
(e.g., bars, misdemeanors, etc.). Boolean spatial features describe the presence or absence of
geographic object types at different locations in a two-dimensional or three-dimensional metric
space.
Spatial co-location: Co-location rules are models to infer the presence of Boolean spatial
features in the neighborhood of instances of other Boolean spatial features. For example, ‘Nile
Crocodiles → Egyptian Plover’ predicts the presence of Egyptian Plover birds in areas with Nile
Crocodiles.
Fig : 2 Illustration of point spatial colocation patterns.
Figure 2 shows a data set consisting of instances of several Boolean spatial features, each
represented by a distinct shape. The shapes ‘+’, ‘×’, ‘o’ and ‘∗’ represent different spatial
features. Spatial features in the sets {‘+’, ‘×’} and {‘o’, ‘∗’} tend to be located together. A careful
review reveals two co-location patterns, namely (‘+’, ‘×’) and (‘o’, ‘∗’). Co-location rule
discovery is the process of identifying co-location patterns from large spatial data sets with a
large number of Boolean features. The spatial co-location rule discovery problem looks similar
to association rule mining but is in fact very different, because spatial data lacks
transactions. In market basket data sets, transactions represent sets of item types bought together
by customers. The support of an association is defined to be the fraction of transactions
containing the association. Association rules are derived from all the associations with support
values larger than a user-given threshold.
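To make the notion of support concrete, the following is a minimal sketch of how it could be computed for market basket data. The function name `support` and the four sample transactions are illustrative assumptions, not part of the chapter's sources.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical market basket data: each transaction is a set of item types.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# {bread, milk} occurs in 2 of the 4 transactions.
print(support({"bread", "milk"}, transactions))  # 0.5
```

The computation depends entirely on the list of transactions, which is exactly what a spatial data set does not provide.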
Spatial co-location rule mining approaches can be grouped into two broad categories:
approaches that use spatial statistics and algorithms that use association-rule-mining style
primitives. Spatial statistics based approaches utilize statistical measures such as the cross-K
function, mean nearest-neighbor distance, and spatial autocorrelation. However, these approaches
are computationally expensive.
Global colocation pattern mining algorithm:
This section defines the event-centric model, our
approach to modeling co-location patterns. Consider Figure 3 as an example spatial dataset to
illustrate the model. In the figure, each instance is uniquely identified by T.i, where T is the
spatial feature type and i is the unique id inside each spatial feature type. For example, B.2
represents instance 2 of spatial feature B. Two instances are connected by an edge if they have
a spatial neighbor relationship. A co-location is a subset of Boolean spatial features. A co-location
rule is of the form C1 => C2 (p, cp), where C1 and C2 are co-locations with C1 ∩ C2 = ∅,
p is a number representing the prevalence measure, and cp
is a number measuring conditional probability.
Terminology
R-proximity Neighborhood:
Given a reflexive and symmetric neighbor relation R over a set of instances, an R-proximity
neighborhood is a set I of instances that form a clique under the relation R. The definition of the
neighbor relation R is an input and should be based on the semantics of the application domain.
The neighbor relation R may be defined using spatial relationships, metric relationships (e.g.
Euclidean distance) or a combination (e.g. shortest-path distance in a graph such as a road map).
The R-proximity neighborhood concept is different from the neighborhood concept in topology,
since some supersets of an R-proximity neighborhood may not qualify as R-proximity
neighborhoods.
Row Instance:
Two R-proximity neighborhoods I1 and I2 are R-reachable to each other if I1 ∪ I2 is an R-proximity neighborhood. An R-proximity neighborhood I is a row instance (denoted by
row_instance(C)) of a co-location C if I contains instances of all the features in C and no proper
subset of I does. For example, {A.3, B.4, C.1} is a row instance of co-location {A, B, C} in the
spatial data set shown in Figure 3. But {A.2, A.3, B.4, C.1} is not a row instance of co-location {A, B, C}
because its proper subset {A.3, B.4, C.1} contains instances of all the features in {A, B, C}. In another
example, {A.2, A.4} is not a row instance of co-location {A} because its proper subset {A.2} contains instances of all
the features in {A}. The table instance of a co-location C is the collection of all row instances of C.
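The row-instance test can be sketched in a few lines. The check relies on the observation that a set contains all features of C while no proper subset does exactly when it holds one instance of each feature in C and nothing else. The edge set below is a hypothetical stand-in for the neighbor relation (it is not read off Figure 3), and the names `is_neighborhood` and `is_row_instance` are illustrative.

```python
from itertools import combinations


def is_neighborhood(instances, neighbors):
    """True if every pair of instances is related, i.e. they form a clique under R."""
    return all(frozenset(pair) in neighbors for pair in combinations(instances, 2))


def is_row_instance(instances, colocation, neighbors):
    """True if `instances` is an R-proximity neighborhood holding exactly one
    instance of each feature in `colocation` and no other instances."""
    features = sorted(feature for feature, _ in instances)
    return is_neighborhood(instances, neighbors) and features == sorted(colocation)


# Hypothetical neighbor relation, stored as unordered pairs of (feature, id).
neighbors = {
    frozenset({("A", 3), ("B", 4)}),
    frozenset({("A", 3), ("C", 1)}),
    frozenset({("B", 4), ("C", 1)}),
    frozenset({("A", 2), ("A", 3)}),
}

print(is_row_instance([("A", 3), ("B", 4), ("C", 1)], {"A", "B", "C"}, neighbors))  # True
print(is_row_instance([("A", 2), ("A", 3), ("B", 4), ("C", 1)], {"A", "B", "C"}, neighbors))  # False
```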
Fig:3 Spatial Data set to Illustrate event based model.
Participation Ratio: The participation ratio pr(c, fi) for feature type fi in a size-k co-location c = {f1, f2, ..., fk} is the
fraction of instances of feature fi that are R-reachable to some row instance of the co-location c − {fi}. The
participation index pi(c) of co-location c = {f1, f2, ..., fk} is the minimum of pr(c, fi) over i = 1, ..., k.
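As a small illustration of these two measures, the sketch below computes them from a table instance. It assumes a table instance is stored as a list of row instances, each a dict from feature type to instance id, and that the total number of instances per feature is known; the data and function names are hypothetical.

```python
def participation_ratio(table_instance, feature, total_instances):
    """Fraction of all instances of `feature` that occur in some row instance."""
    participating = {row[feature] for row in table_instance}
    return len(participating) / total_instances[feature]


def participation_index(table_instance, features, total_instances):
    """Minimum participation ratio over the features of the co-location."""
    return min(participation_ratio(table_instance, f, total_instances) for f in features)


# Hypothetical data: 4 instances of A, 5 instances of B, and three
# row instances of the co-location {A, B}.
total_instances = {"A": 4, "B": 5}
table_AB = [
    {"A": 1, "B": 1},
    {"A": 2, "B": 4},
    {"A": 3, "B": 4},
]

print(participation_ratio(table_AB, "A", total_instances))  # 0.75 (A.1, A.2, A.3 of 4)
print(participation_ratio(table_AB, "B", total_instances))  # 0.4  (B.1, B.4 of 5)
print(participation_index(table_AB, ["A", "B"], total_instances))  # 0.4
```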
Conditional Probability: The conditional probability of a co-location rule c1 => c2 is the fraction of
row instances of c1 that are R-reachable to some row instance of c2. It is computed as
$$\frac{|\pi_{c_1}(\mathrm{table\_instance}(c_1 \cup c_2))|}{|\mathrm{table\_instance}(c_1)|}$$
where $\pi$ is the relational projection operation with duplicate elimination.
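The conditional probability of a rule c1 => c2 can be sketched as a projection followed by a count: project the table instance of c1 ∪ c2 onto the features of c1, eliminate duplicates, and divide by the number of row instances of c1. The representation (rows as dicts) and all data below are assumptions made for illustration.

```python
def conditional_probability(table_c1, table_union, c1_features):
    """Distinct projections of table_instance(c1 U c2) onto c1, divided by the
    number of row instances of c1."""
    projected = {tuple(row[f] for f in c1_features) for row in table_union}
    return len(projected) / len(table_c1)


# Hypothetical table instances for c1 = {A} and c1 U c2 = {A, B}.
table_A = [{"A": 1}, {"A": 2}, {"A": 3}, {"A": 4}]
table_AB = [
    {"A": 1, "B": 1},
    {"A": 2, "B": 4},
    {"A": 3, "B": 4},
    {"A": 3, "B": 5},
]

# A.1, A.2 and A.3 take part in some row instance of {A, B}; A.4 does not.
print(conditional_probability(table_A, table_AB, ["A"]))  # 0.75
```

Using a set for the projection is what implements duplicate elimination: A.3 appears in two row instances but is counted once.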
The Colocation Mining Algorithm:
Input:
1) K Boolean spatial feature types and their instances
2) A symmetric and reflexive neighbor relation R
3) A user-specified threshold on the prevalence measure (min_prevalence)
4) A user-specified minimum conditional probability (min_cond_prob)
Output:
Co-location rule set with participation index > min_prevalence and
conditional probability > min_cond_prob
Steps:
1) Prevalent size-1 co-location set along with their table instances = P
2) Generate size-2 co-location rules
3) For size of co-locations from (2 to K-1) do
4) Generate candidate prevalent co-locations using the generalized
apriori algorithm
5) Generate table instances and prune based on neighborhood
6) Prune based on prevalence of co-locations
7) Generate co-location rules
8) END
Step 1 initializes the prevalent size-1 co-location set with the input P of the algorithm. The
participation indexes of singleton co-locations are 1, so all singleton co-locations are prevalent.
Step 2 generates prevalent co-location rules of size 2. Due to the lack of pruning for singleton co-locations, it is more efficient to use a spatial join on the neighbor relationship in place of the
generalized apriori algorithm followed by neighbor-based pruning, which is used in the generation of co-location
rules of size 3 or more. The spatial inner join of the instances of all spatial features produces
pairs of instances with neighbor relation R.
Steps 3 to 8 loop through sizes 2 to K-1 to generate prevalent co-locations of size 3 or
more, iterating on increasing co-location sizes. The loop breaks whenever an empty
co-location set of some size is generated.
In Step 4, apriori_gen is used to generate size k+1 candidate
co-locations from the size-k prevalent co-locations. In Step 5, the table instances of the candidates are
generated using a join query. The join can be computed using a geometric approach, a
combinatorial approach or a hybrid approach.
In Step 6, the candidate co-locations generated are pruned using the threshold on the prevalence
measure.
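Step 2 can be illustrated with a deliberately naive sketch: a nested-loop neighbor join that pairs instances of different features, followed by pruning on the participation index. It assumes a purely distance-based neighbor relation (two instances are neighbors when they lie within `radius` of each other) and a hypothetical tuple layout `(feature, id, x, y)`. A real implementation would use a spatial index rather than comparing all pairs.

```python
from itertools import combinations
from math import dist


def size2_colocations(points, radius, min_prevalence):
    """Return {(f1, f2): participation index} for prevalent size-2 co-locations."""
    counts = {}
    for feature, _, _, _ in points:
        counts[feature] = counts.get(feature, 0) + 1

    # Neighbor join: pairs of instances of different features within `radius`.
    tables = {}
    for p, q in combinations(points, 2):
        if p[0] != q[0] and dist(p[2:], q[2:]) <= radius:
            (f1, i1), (f2, i2) = sorted([(p[0], p[1]), (q[0], q[1])])
            tables.setdefault((f1, f2), set()).add((i1, i2))

    # Prune on the participation index (minimum of the two participation ratios).
    prevalent = {}
    for (f1, f2), rows in tables.items():
        pr1 = len({row[0] for row in rows}) / counts[f1]
        pr2 = len({row[1] for row in rows}) / counts[f2]
        if min(pr1, pr2) > min_prevalence:
            prevalent[(f1, f2)] = min(pr1, pr2)
    return prevalent


# Hypothetical instances: A and B occur together, C is isolated.
points = [
    ("A", 1, 0.0, 0.0), ("B", 1, 0.5, 0.0),
    ("A", 2, 5.0, 5.0), ("B", 2, 5.0, 5.5),
    ("C", 1, 20.0, 20.0),
]
print(size2_colocations(points, radius=1.0, min_prevalence=0.5))  # {('A', 'B'): 1.0}
```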
3.2.3.2 Local Colocation Algorithm:
Global statistics seldom provide useful insight, because most relationships in spatial datasets are
geographically regional. Robust tools capable of extracting local co-location
patterns from large spatial datasets are therefore critical for advancing scientific research.
In this section we present an algorithm called CLEVER (CLustEring using
representatiVEs and Randomized hill climbing) that finds local co-location patterns in
datasets.
Consider a dataset O = {o1, ..., on} that is a subset of F, where F is the feature space
of the dataset. The objects belonging to O are tuples characterized by attributes S ∪ N,
where S = {S1, …, Sp} is a set of spatial and temporal attributes and N = {A1, …, Aq} is a set of other,
non-geo-referenced attributes. Dom(S) and Dom(N) describe the possible values the attributes in
S and N can take; that is, each object o ∈ O is characterized by a single tuple that takes values in
F = Dom(S) × Dom(N). Datasets that have this structure are called geo-referenced, and O
is assumed to be a geo-referenced dataset. The purpose of the framework is to find interesting
places, called regions in the following, in geo-referenced datasets. Regions are assumed to be
contiguous areas in the spatio-temporal space Dom(S), which is a subspace of F. A region has an
extension, which is the set of objects in O it contains, and an intension, which describes the area it
occupies.
The region discovery framework employs additive, plug-in fitness functions q that capture what
kind of regions are of interest to the domain expert; moreover, fitness functions are assumed to
have the following structure for a set of regions X:
$$q(X) = \sum_{c \in X} i(c)\,|c|^{\beta}$$
where i(c) denotes the interestingness of region c, a quantity that reflects the degree to which the region
is "newsworthy". It is important to find regions at different levels of granularity. The amount of
premium put on the size of the extension of a region (|c| denotes the cardinality of c) is
controlled by the value of the parameter β. A region's reward is proportional to its interestingness, but
rewards increase with region size non-linearly (β > 1) to encourage merging neighboring regions
with similar characteristics.
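The effect of the size premium can be seen in a short sketch of an additive fitness function, in which each region contributes its interestingness times its size raised to a power. The interestingness function here is a constant placeholder and β = 1.5 is an arbitrary choice; both are assumptions for illustration only.

```python
def fitness(regions, interestingness, beta=1.5):
    """Additive fitness: each region contributes interestingness * size ** beta."""
    return sum(interestingness(c) * len(c) ** beta for c in regions)


def constant_interest(region):
    """Placeholder interestingness: every region is equally interesting."""
    return 1.0


# Two hypothetical neighboring regions of four objects each, and their merger.
r1 = ["o1", "o2", "o3", "o4"]
r2 = ["o5", "o6", "o7", "o8"]

separate = fitness([r1, r2], constant_interest)  # 2 * 4 ** 1.5
merged = fitness([r1 + r2], constant_interest)   # 8 ** 1.5
print(separate)  # 16.0
print(merged > separate)  # True
```

With β > 1 the merged region scores higher than the two separate regions even though their interestingness is identical, which is the merging incentive described above.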
Given a geo-referenced dataset O, there are many possible algorithms to seek interesting
regions in O with respect to a plug-in fitness function q, subject to the following specification:
Given: O, q, and possibly other input parameters
Find: X = {r1, ..., rk} that maximizes q({r1, ..., rk}) subject to the following constraints:
(1) ri ⊆ O (i = 1, …, k)
(2) r1, r2, …, rk are each contiguous in Dom(S)
(3) ri ∩ rj = ∅ (i ≠ j)
Neighboring solutions of the current solution are created using three operators: ‘Insert’ inserts
a new representative into the current solution, ‘Delete’ deletes a representative from the current
solution, and ‘Replace’ replaces a representative with a non-representative. Each operator has a
certain selection probability, and the representatives to be manipulated are chosen at random. The algorithm also
allows for larger neighborhood sizes; the experiments in this paper were run for neighborhood
size 3. In this case, sampled solutions are generated by applying three randomly selected
operators to the current solution. Moreover, to battle premature convergence, CLEVER re-samples p’ > p solutions before terminating.
Pseudo-Code for CLEVER:
CLEVER
Inputs: k’, neighborhood-size, p, p’
Outputs: regions, region representatives, number of representatives (k), fitness,
interestingness, etc.
Algorithm:
Step 1: Create a current solution by randomly selecting k’ representatives from O.
Step 2: Create p neighbors of the current solution randomly using the given
neighborhood definition.
Step 3: If the best neighbor improves the fitness, it becomes the current solution.
Go back to step 2.
Step 4: If the fitness does not improve, the solution neighborhood is re-sampled
by generating p’ more neighbors. If re-sampling does not improve the current
solution, terminate. Otherwise, go back to step 2, replacing the current solution by
the best solution found by re-sampling.
By adding and deleting representatives and by using a neighborhood size larger than one,
CLEVER samples from a much larger neighborhood of the current solution. This characteristic
distinguishes CLEVER from other prototype-based clustering algorithms.
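The pseudo-code can be turned into a compact sketch. This is not the authors' implementation: it assumes that regions are formed by assigning each object to its closest representative, picks the three operators with equal probability, and uses a toy fitness function that rewards compact regions. All names (`assign`, `neighbor`, `clever`, `q`) and the sample points are hypothetical.

```python
import random
from math import dist


def assign(points, reps):
    """Form regions by assigning each point to its closest representative."""
    clusters = {r: [] for r in reps}
    for i, p in enumerate(points):
        nearest = min(reps, key=lambda r: dist(p, points[r]))
        clusters[nearest].append(i)
    return [c for c in clusters.values() if c]


def neighbor(reps, n_points, size, rng):
    """Apply `size` randomly chosen operators (insert, delete, replace)."""
    reps = set(reps)
    for _ in range(size):
        non_reps = [i for i in range(n_points) if i not in reps]
        op = rng.choice(["insert", "delete", "replace"])
        if op == "insert" and non_reps:
            reps.add(rng.choice(non_reps))
        elif op == "delete" and len(reps) > 1:
            reps.remove(rng.choice(sorted(reps)))
        elif op == "replace" and non_reps:
            reps.remove(rng.choice(sorted(reps)))
            reps.add(rng.choice(non_reps))
    return frozenset(reps)


def clever(points, q, k_prime, neighborhood_size, p, p_prime, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial set of k' representatives.
    current = frozenset(rng.sample(range(len(points)), k_prime))
    best_fit = q(assign(points, current))
    while True:
        improved = False
        # Step 2 samples p neighbors; step 4 re-samples p' neighbors if needed.
        for n_samples in (p, p_prime):
            candidates = [neighbor(current, len(points), neighborhood_size, rng)
                          for _ in range(n_samples)]
            best = max(candidates, key=lambda r: q(assign(points, r)))
            fit = q(assign(points, best))
            if fit > best_fit:  # Step 3: accept only an improving neighbor.
                current, best_fit, improved = best, fit, True
                break
        if not improved:
            return assign(points, current), sorted(current), best_fit


# Toy data: two tight groups of points.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]


def q(regions, beta=1.5, max_diameter=2.0):
    """Toy fitness: a region is interesting (1.0) only if it is compact."""
    total = 0.0
    for c in regions:
        diameter = max(dist(points[a], points[b]) for a in c for b in c)
        total += (1.0 if diameter <= max_diameter else 0.0) * len(c) ** beta
    return total


regions, representatives, fit = clever(points, q, k_prime=3,
                                       neighborhood_size=3, p=10, p_prime=20)
print(len(regions), representatives, fit)
```

Because the search is randomized hill climbing, the result depends on the seed and is not guaranteed to be the global optimum; the loop terminates because the fitness strictly increases with every accepted solution.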
3.3 Spatial Clustering:
Challenges of clustering algorithms:
Although there are similarities between spatial and non-spatial clustering, large databases, and
spatial databases in particular, have unique requirements that create special needs for clustering
algorithms.
1. The algorithms must be scalable and efficient, since they have to deal with large data sets.
2. Algorithms need to be able to identify irregular shapes, including those with lacunae or
concave sections and nested shapes. (See figure below)
3. The clustering mechanism should be insensitive to large amounts of noise.
4. Algorithms should not be sensitive to the order of input. That is, clustering results should
be independent of data order.
5. No a-priori knowledge of the data or the number of clusters to be created should be
required, and therefore no domain knowledge input should be required from the user.
6. Algorithms should handle data with large numbers of features, that is, higher dimensionality.
PAM (Partitioning Around Medoids) uses k-clustering on medoids to identify clusters. It works
efficiently on small data sets, but is extremely costly for larger ones. This led to the development
of CLARA. CLARA (Clustering Large Applications) creates multiple samples of the data set and
then applies PAM to each sample. CLARA chooses the best clustering as the output, basing
quality on the similarity and dissimilarity of objects in the entire set, not just the samples. One of
the first clustering algorithms specifically designed for spatial databases was CLARANS, which
uses the k-medoid method of clustering. CLARANS was followed by DBSCAN, a locality-based
algorithm relying on the density of objects for clustering. DBCLASD is also a locality-based algorithm, but it
allows for random distribution of the points. Other density or locality-based algorithms include
STING, an enhancement of DBSCAN; WaveCluster, a method based on wavelets; and
DENCLUE, which is a generalization of several locality-based algorithms. Three other
algorithms, BIRCH, CURE and CLIQUE, are hybrid algorithms, making use of both hierarchical
techniques and grouping of related items.
3.3.4 Hotspot Analysis:
Hotspots are a special kind of clustered pattern. As in clustered patterns, objects in hotspot
regions have high similarity to one another and are quite dissimilar to all the
objects outside the hotspot. One important feature that distinguishes a hotspot from a general
cluster is that the objects in the hotspot area are more active than all others
(in density, appearance, etc.). Spatial correlation of the attribute values within a hotspot could be
high and may drop dramatically at the boundary, whereas in traditional clustering the
attribute values within a cluster could be i.i.d. Hotspot discovery/detection in SDM is the process
of identifying spatial regions where more events are likely to happen, or more objects are likely
to appear, in comparison to other areas.
Hotspot detection is mainly used in the analysis of crime and disease data. Crime data
analysis aims at finding areas that have greater than average numbers of criminal or disorderly
events, or areas where people have a higher than average risk of victimization. Figure 4 shows
two types of hotspots, namely, point hotspots and area hotspots. The design of hotspot maps is
primarily oriented toward helping law enforcement place their
resources appropriately for crime investigation. For example, Figure 4(a) shows locations of bars in seven
different colors obtained by using LISA (Local Indicators of Spatial Association); the red squares
in the center and on the periphery of the map show the bars with high crime activity. Maps such as the
one shown in Figure 4(a) show specific bars or hotspots where increased attention to crime
mitigation is necessary. On the other hand, if an analyst is interested in the geographic
distribution of a particular crime type (e.g., vandalism) based on an underlying baseline variable,
one can make use of techniques such as kernel density estimation, which is part of tools such as
CrimeStat. For example, Figure 4(b) shows the hotspots of vandalism incidents from the same
city; the red cells indicate areas where there is a significantly high clustering of vandalism
reports, the blue cells indicate areas where there is a significantly low concentration of
vandalism, and grey indicates areas where there is no significant concentration. This map shows
that there is a significant clustering of vandalism incidents in the center of the
city around the downtown areas.
Hotspot analysis also finds applications in cancer and disease data analysis, where hotspots of locations
in which a disease is reported intensively are detected; these may indicate a potential outbreak of
the disease or suggest an underlying cause of the disease. Other domains of application include
transportation (to identify unusual rates of accidents along highways) and ecology (to conduct
geoinformatic surveillance for geospatial hotspot detection).
Many of the standard clustering algorithms have been adapted for spatial hotspot analysis.
These include K-Means, hierarchical clustering, etc. Many other methods such as STAC (spatio-temporal analysis of crime) and LISA (local indicators of spatial association) have been
developed to aid law enforcement agencies in crime mitigation. Spatial hotspot analysis
methods of particular utility in public health applications such as syndromic surveillance and
outbreak detection have also been proposed. These methods include various frequentist and Bayesian
statistical measures such as the spatial scan and space-time scan statistics.
Fig:4 Spatial Hotspots from the city of Lincoln
Spatial Network analysis:
A spatial network is a network of spatial elements. A transportation network is a prime example of a
spatial network. Spatial network hotspot analysis has various applications and is particularly
important for crime analysis (high-crime density street discovery) and law enforcement
(planning effective and efficient patrolling strategies). In urban areas, many human activities are
centered on spatial infrastructure networks, such as roads and highways, oil/gas pipelines,
and utilities (e.g., water, electricity, telephone). Thus, activity reports such as crime logs may
often use network-based location references (e.g., street addresses). In addition, spatial
interaction among activities at nearby locations may be constrained by network connectivity and
network distances (e.g., shortest paths along roads or train networks) rather than the geometric
distances used in traditional spatial analysis. Traditional methods that employ a geometric
summarization scheme to identify concentrations of crime may miss large crime
concentrations that are normally captured by network-based methods. For example,
Figure 5(a) and (b) show a comparison between an ellipse-based geometric hotspot method and a
network-based hotspot method for a data set from the recent Haiti earthquake. Crime prevention
may focus on identifying subsets of spatio-temporal (ST) networks with high activity levels, understanding
underlying causes in terms of network properties, and designing network control policies.
Identifying and quantifying spatial network hotspots is a challenging task due to the need to
choose the correct statistical model. In addition, the discovery process in large spatial networks is
computationally very expensive.
Fig : 5 Comparison between geometric and network based hotspot for requests during the Haiti
earthquake
3.5 Spatio-Temporal Data Mining:
Spatio-temporal data are often modeled using events and processes, both of which generally
represent change of some kind. Processes refer to ongoing phenomena that represent activities of
one or more types without a specified endpoint. Events refer to individual occurrences of a
process with a specified beginning and end. Event-types and event-instances are distinguished.
For example, a hurricane event-type may occur at many different locations and times, for
example, Katrina (New Orleans, 2005) and Rita (Houston, 2005). Each event-instance is
associated with a particular occurrence time and location. The ordering may be total if event-instances have disjoint occurrence times. Otherwise, ordering is based on spatio-temporal
semantics such as partial order, and spatio-temporal patterns can be modeled as partially ordered
subsets. These unique characteristics create new and interesting challenges for discovering
spatio-temporal patterns. For example, in contrast to spatial outliers, a spatio-temporal outlier is a
spatio-temporal object whose thematic (non-spatial and non-temporal) attributes are significantly
different from those of other objects in its spatial and temporal neighborhoods.
A spatio-temporal object is defined as a time-evolving spatial object whose evolution or history
is represented by a set of instances $(o_{id}, s_i, t_i)$, where the space stamp $s_i$ is the location of the object $o_{id}$
at timestamp $t_i$.
3.5.1 Recent trends in spatio-temporal Data Mining:
Flow anomalies: Given a percentage threshold and a set of observations across multiple spatial
locations, flow anomaly discovery aims to identify dominant time intervals where the fraction of
time instants with significantly mismatched sensor readings exceeds the given percentage threshold.
Figure 6 gives a simple example of flow anomalies. In Figure 6(a), the input to the flow anomaly (FA) problem
consists of two spatial locations [i.e., an upstream (up) and a downstream (down) sensor], 10 time
instants, and the notion of travel time or flow between the locations. For simplicity, the travel
time is set to a constant of 1, but it can be a variable. The output contains two flow anomalies
using the time instants at the upstream sensor, periods 1–3 and 6–9, where the majority of time
points show significant differences between the two sensors (Figure 6(b)). Discovering flow anomalies is
important for water treatment systems, transportation networks, and video surveillance systems.
However, mining flow anomalies is computationally expensive due to the large (potentially
infinite) number of time instants across a spatial network of locations. Traditional outlier
detection methods (e.g. t-test) are suited for detecting transient flow anomalies (i.e., time instants
of significant mismatches across consecutive sensors) but cannot detect persistent flow
anomalies (i.e., long variable time windows with a high fraction of transient flow
anomalies) due to the lack of a predetermined window size. Spatial outlier detection techniques do
not consider the flow (i.e., travel time) between spatial locations and cannot detect any type of
flow anomaly.
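A brute-force sketch makes the problem statement concrete. It compares each upstream reading with the downstream reading one travel time later, flags instants whose difference exceeds a threshold, and then scans all windows for those whose fraction of flagged instants reaches the percentage threshold. Treating "dominant" intervals as qualifying windows not contained in a longer qualifying window is an assumption of this sketch, as are the function name, parameters and sample readings. The quadratic scan also shows why the problem is expensive on long time series.

```python
def flow_anomalies(up, down, travel_time, diff_threshold, pct_threshold, min_len=2):
    """Return maximal windows (start, end) of upstream time instants in which the
    fraction of significantly mismatched readings is at least pct_threshold."""
    n = len(up) - travel_time
    # Transient mismatch: upstream reading vs. downstream reading one travel time later.
    mismatch = [abs(up[t] - down[t + travel_time]) > diff_threshold for t in range(n)]

    # Prefix sums give the mismatch count of any window in constant time.
    prefix = [0]
    for m in mismatch:
        prefix.append(prefix[-1] + m)

    qualifying = []
    for s in range(n):
        for e in range(s + min_len - 1, n):
            fraction = (prefix[e + 1] - prefix[s]) / (e - s + 1)
            if fraction >= pct_threshold:
                qualifying.append((s, e))

    # Keep only windows that are not contained in another qualifying window.
    return [w for w in qualifying
            if not any(o != w and o[0] <= w[0] and w[1] <= o[1] for o in qualifying)]


# Hypothetical readings: the downstream sensor disagrees at upstream instants 0-2.
up = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
down = [10, 4, 4, 4, 10, 10, 10, 10, 10, 10]
print(flow_anomalies(up, down, travel_time=1, diff_threshold=3, pct_threshold=0.75))
# [(0, 3)]
```

The window (0, 3) contains three mismatched instants out of four (75%), so it qualifies even though instant 3 itself is not a mismatch; this is the kind of persistent anomaly that a per-instant test cannot report.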
Fig6 : Example of flow anomalies
Teleconnected flow anomalies: A teleconnection represents a strong interaction between
paired events that are spatially distant from each other. Its discovery builds on flow anomalies. Identifying
teleconnected flow events is computationally hard due to the large number of time instants of
measurement, sensors, and locations. For example, a well-known teleconnected event pair
involves the warming of the eastern Pacific region (i.e., El Niño) and unusual weather patterns
throughout the world. Recently, the RAD (Relationship Analysis of Dynamic-neighborhoods)
technique has been proposed, which models flow networks to identify teleconnected events.
Mixed-drove co-occurrence patterns: Mixed-drove spatio-temporal co-occurrence patterns
(MDCOPs) represent subsets of two or more different object-types whose instances are often
located in spatial and temporal proximity. Discovering MDCOPs is potentially useful in
identifying tactics in battlefields and games, understanding predator–prey interactions, and in
transportation (road and network) planning. However, mining MDCOPs is computationally very
expensive because the interest measures are computationally complex, data sets are larger due to
the archival history, and the set of candidate patterns is exponential in the number of object-types. Preliminary work has produced a monotonic composite interest measure for discovering
MDCOPs and novel MDCOP mining algorithms.
Cascading spatio-temporal patterns: Cascading spatio-temporal patterns (CSTPs) are partially
ordered subsets of event-types whose instances are located together and occur in stages. An
example is shown in Figure 7. These are some interesting partially ordered patterns that were
discovered from real spatio-temporal crime data sets from the city of Lincoln, Nebraska. In the
domain of public safety, events such as bar closings and football games are considered
generators of crime. Preliminary analysis revealed that football games and bar closing events do
indeed generate CSTPs. CSTP discovery can play an important role in disaster planning, climate
change science (e.g., understanding the effects of climate change and global warming), and
public health (e.g., tracking the emergence, spread, and reemergence of multiple infectious
diseases).
Fig7: Example of cascading spatio-temporal pattern of public safety