Spatial Databases: Lecture 7
Spatial Statistics
DT249-4 DT228-4 Semester 2
2010
Pat Browne
Outline
• Statistical spatial data
• Review of standard statistical concepts
• Unique features of spatial data statistics
• Spatial autocorrelation
• Spatial regression (SR) and geographically weighted
regression (GWR)
• Data mining
• Association rules
• Co-location
Statistical Spatial Data
• In this lecture we consider spatial data that
carries an attribute, e.g. house prices,
occurrences of disease, occurrences of
accidents, crop yield, poverty patterns,
crime rates, etc. Earlier parts of the course
covered the representation of physical
objects such as houses, counties, and
roads. These objects were arranged by
theme. Here we consider attributes of
those objects, e.g. the population of an ED.
Definitions
• Spatial statistics is the statistical study of spatial
data that varies over discrete space, e.g. crime
rates broken down by neighbourhood. Spatial
statistical models, which are based on probability
theory (not covered), can be used for estimation,
description, and prediction.
• Geostatistics is the statistical study of spatial
data sets that vary over continuous space, e.g.
soil quality. Interpolation and prediction
techniques include kriging and variograms (not
covered on this course).
Standard statistical concepts: Independent
Events
• Two events A and B are statistically independent
if the probability that they both occur is equal to
the product of the probabilities of the two
individual events, i.e.
• P(AB) = P(A) × P(B)
• This is equivalent to saying that learning that
one event occurred gives no information
about whether the other event occurred too.
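The product rule can be checked by enumerating a small sample space. The sketch below is illustrative only: the two dice and the events a, b and c are assumptions chosen for the example, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on an outcome), all outcomes equally likely."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Hypothetical events chosen for illustration.
a = lambda o: o[0] % 2 == 0       # first die is even
b = lambda o: o[1] >= 5           # second die shows 5 or 6
c = lambda o: o[0] + o[1] >= 10   # total is at least 10

p_a, p_b, p_c = prob(a), prob(b), prob(c)
p_ab = prob(lambda o: a(o) and b(o))
p_ac = prob(lambda o: a(o) and c(o))

# Independence holds exactly when P(AB) = P(A) x P(B).
independent_ab = p_ab == p_a * p_b
independent_ac = p_ac == p_a * p_c
print(independent_ab, independent_ac)
```

Events a and b concern different dice, so the product rule holds. Event c depends on the first die, so learning that a occurred changes the probability of c and the rule fails.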
Standard statistical concepts: Identically
Distributed
• Two random variables are identically
distributed if they have the same probability
distribution; for two events A and B this means P(A) = P(B).
Standard statistical concepts: Identically
Distributed variable
[Figure: identically distributed variables have the same probability distributions.]
Standard statistical concepts: i.i.d.
• A collection of two or more random
variables {X1, X2, …, Xn} is independent
and identically distributed (i.i.d.) if the variables
have the same probability distribution and
are independent.
Standard statistical concepts: Examples
• Example of i.i.d.: all other things being equal, a
sequence of dice rolls is i.i.d.
• Example of non-i.i.d.: bird nesting patterns in
wetlands, where the independent variables are
distance from water, length of grass, and depth of
water, and the dependent variable is the
presence of a nest site. A uniform distribution of
these variables on a map would indicate an
even distribution of nests; however, a more complex
pattern emerges where the variables are spatially
dependent.
Standard statistical concepts: Correlation
• Correlation: A correlation is a single number that
describes the degree of relationship between two
normally distributed variables. The variables are not
designated as dependent or independent. The value of a
correlation coefficient can vary from minus one to plus
one. A minus one indicates a perfect negative
correlation, while a plus one indicates a perfect positive
correlation. A correlation of zero means there is no
linear relationship between the two variables. When there is a
negative correlation between two variables, as the value
of one variable increases, the value of the other variable
decreases, and vice versa.
Standard statistical concepts: Variance and
covariance
• Variance is a measure of variation equal to the mean of the
squared deviations from the mean. The variance
is a measure of the amount of variation within
the values of a variable, taking account of all
possible values and their probabilities or
weightings.
• Covariance is a measure of the joint variation between
variables, say X and Y. The range of covariance
values is unrestricted. However, if the X and Y
variables are first standardized, then covariance
is the same as correlation and the range of
covariance (correlation) values is from -1 to +1.
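The claim that the covariance of standardized variables equals the correlation can be checked numerically. This is a minimal sketch using population (divide by n) formulas. The altitude and snowfall figures are made up for illustration.

```python
from math import sqrt

# Hypothetical sample: altitude (m) and snowfall (cm) at six sites.
x = [100.0, 250.0, 400.0, 600.0, 800.0, 1000.0]
y = [5.0, 9.0, 16.0, 22.0, 35.0, 41.0]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance: mean of the products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def variance(a):
    """Population variance: mean of the squared deviations from the mean."""
    return covariance(a, a)

def standardize(a):
    """Rescale to zero mean and unit variance (z-scores)."""
    m, s = mean(a), sqrt(variance(a))
    return [(ai - m) / s for ai in a]

# Pearson correlation from its definition.
r = covariance(x, y) / sqrt(variance(x) * variance(y))

# Covariance of the standardized variables.
cov_z = covariance(standardize(x), standardize(y))

print(round(r, 4), round(cov_z, 4))
```

The two printed numbers agree up to floating-point rounding, whereas the raw covariance of x and y depends on the units of measurement.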
Standard statistical concepts: Correlation
• Correlation is a measure of the degree of linear
relationship between two variables, say X and Y. While in
regression the emphasis is on predicting one variable
from the other, in correlation the emphasis is on the
degree to which a linear model may describe the
relationship between two variables. In regression the
interest is directional: one variable is predicted and the
other is the predictor. In correlation the interest is
non-directional: the relationship is the critical aspect. The
correlation coefficient may take on any value between
minus one and plus one (-1 ≤ r ≤ 1).
Standard statistical concepts: Regression
• Regression takes a numerical dataset
and develops a mathematical formula that
fits the data. The results can be used to
predict future behaviour. It works well with
continuous quantitative data like weight,
speed or age. It is not good for categorical
data where order is not significant, like
colour, name, gender, nest/no nest.
Example: plotting snowfall against height
above sea level.
Standard statistical concepts:
Regression
Y = A + BX, where Y is the response variable and X is the
continuous explanatory variable. Parameter A is the
intercept and parameter B is the slope. The difference
between each data point and the value predicted by
the line (the model) is called a residual.
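The line Y = A + BX and its residuals can be computed with ordinary least squares. The sketch below assumes made-up height and snowfall values, and the function name fit_line is hypothetical.

```python
# Hypothetical data: height above sea level (m) and snowfall (cm).
height = [100.0, 250.0, 400.0, 600.0, 800.0, 1000.0]
snowfall = [5.0, 9.0, 16.0, 22.0, 35.0, 41.0]

def fit_line(x, y):
    """Ordinary least squares estimates (A, B) for y = A + B*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return a, b

a, b = fit_line(height, snowfall)

# Residual: observed value minus the value predicted by the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(height, snowfall)]

print(f"intercept A = {a:.3f}, slope B = {b:.4f}")
print("residuals:", [round(e, 2) for e in residuals])
```

For a least squares fit with an intercept the residuals sum to zero, which is a quick sanity check on the result.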
Standard statistical concepts: Null
hypothesis
• The null hypothesis, H0, represents a theory that has
been put forward, either because it is believed to be true
or because it is to be used as a basis for argument,
but has not been proved. For example, in a clinical trial
of a new drug, the null hypothesis might be that the new
drug is no better, on average, than the current drug, i.e. H0:
there is no difference between the two drugs on average.
• In general, the null hypothesis for spatial data is that
either the features themselves or the values
associated with those features are randomly distributed
(i.e. there is no spatial pattern or bias).
Relation of i.i.d., regression, and correlation with
spatial phenomena.
• The first law of geography according to Waldo Tobler is
"Everything is related to everything else, but near things
are more related than distant things." In statistical terms
this is called autocorrelation. Because the traditional i.i.d.
assumption is not valid for spatially dependent variables
(e.g. temperature or crime rate), we need special
techniques to handle this type of data (e.g. Moran's I).
These techniques usually involve a weight
matrix which contains location information. The non-i.i.d.
nature of spatially dependent variables carries over into
regression and correlation, which require spatial weights.
Relation of i.i.d., regression, and
correlation with spatial database
• Spatial databases are used for spatial data mining (SDM),
which includes statistical techniques and more
specialised DM techniques such as association rules. In
this case the data mining algorithms need to have a
spatial context. We must explicitly include location
information where previously, with the i.i.d. assumption, it
was not required. Typical generic data mining activities
such as clustering, regression, classification, and association
rules all need a spatial context. Spatial DM is used in a
broad range of scientific disciplines, such as analysis of
crime, modelling land prices, poverty mapping,
epidemiology, air pollution and health, natural and
environmental sciences, etc. The analyst must be aware
of the special techniques required for SDM.
Relation of i.i.d., regression, and
correlation with spatial database
• Spatial databases are also used for pure
statistical research (e.g. environmental
studies). Those variables that are spatially
dependent (e.g. the pH of the soil) need to
be clearly identified and special
techniques applied to take into account
their spatial bias.
Unique features of spatial data Statistics
• General statistics assumes the samples
are independently generated, which
may not be the case with spatially dependent
data.
 Like things tend to cluster together.
 Change tends to be gradual over space.
Unique features of spatial data Statistics
Spatial dependent values
• The previous maps illustrate two important
features of spatial data:
• Spatial autocorrelation: values are not independent.
Independence would require that the probability that two
events both occur equals the product of the probabilities
of the two individual events, i.e.
P(AB) = P(A) × P(B),
and this does not hold for spatially dependent data.
• Spatial data is not identically distributed.
Two events A and B are identically distributed if P(A)
= P(B), i.e. they have the same probability distribution;
this does not hold either.
Unique features of spatial data Statistics
Autocorrelation & Spatial Heterogeneity.
• Spatial autocorrelation is detected when the value
of a variable in a location is correlated with values of
the same variable in the neighbourhood (it can be
measured with Moran's I).
• Spatial heterogeneity is characterized by different
values or behaviours through space, which can be
measured by Local Indicators of Spatial Association
(LISA). It characterizes the non-stationarity of most
geographic processes, meaning that global
parameters may not accurately reflect the process
occurring at a particular location.
Spatial Autocorrelation
• Autocorrelation: degree of correlation between
neighbouring values.
• Spatial dependency: neighbouring values are
similar (i.e. positive spatial autocorrelation).
• Moran's I enables assessment of the degree to
which values tend to be similar to neighbouring
values. We can observe how autocorrelation
varies with distance.
• The Moran scatter plot relates individual values
to weighted averages of neighbouring values.
The slope of a regression line fitted to the points
in the scatter plot gives the global Moran's I.
Spatial Autocorrelation: Moran's I
• Moran's I measures the average correlation
between the value of a variable at one location
and the value at nearby locations. The essential
idea is to specify pairs of locations that influence
each other, along with the relative intensity of
interaction. Moran's I provides a global view of
spatial autocorrelation. We will look
at the details later.
• The range of the Moran's I statistic depends on
the spatial weight matrix.
• When Moran's I is scaled by its bounds the
statistic is restricted to the range ±1.
Spatial Autocorrelation: Case Study
[Figure: maps of nest locations, distance to open water, vegetation durability, and water depth.]
Spatial Autocorrelation
Classical Statistical Assumptions
(i.i.d) do not hold for spatially
dependent data
Unique features of spatial data Statistics
First Law of Geography
 First law of geography [Tobler]:
 Everything is related to everything, but nearby
things are more related than distant things.
 People with similar backgrounds tend to live
in the same area
 Economies of nearby regions tend to be
similar
 Changes in temperature occur gradually over
space (and time) (equator V poles).
Spatial Autocorrelation: Moran's I example
Figure 7.5, p. 190
• The pixel value sets in (b) and (c) are the same, but their Moran's I values are different.
• Q: Which dataset, (b) or (c), has the higher spatial autocorrelation?
Spatial Autocorrelation: Moran Scatterplot Map
[Figure: Moran scatterplot map of the old-aged population of São Paulo. Axes: z (horizontal) and Wz (vertical). Quadrants: Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH.]
Spatial Heterogeneity.
• Spatial heterogeneity: is there such a thing as an
average place with respect to some property (e.g.
vegetation)? It is difficult to imagine any subset of the
Earth's surface being a representative sample of the
whole. GWR (later) addresses the localness of
spatial data.
Neighbourhood relationship
contiguity matrix
Spatial autocorrelation
 Spatial autocorrelation is determined both by
similarities in position, and by similarities in
attributes
 Sampling interval
 Self-similarity
 Auto = self
 Correlation = degree of relatedness
correspondence
Spatial autocorrelation
• In the following slide, each diagram contains 32
white cells and 32 blue cells (64 cells in total).
 BB = Blue beside Blue
 BW = Blue beside White
 WW = White beside White.
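Spatial autocorrelation
[Figure: cell patterns ranging from negative spatial autocorrelation (dispersed), through spatial independence, to positive spatial autocorrelation (spatial clustering).]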
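The BB, BW and WW joins can be counted directly for any pattern of blue and white cells. The sketch below assumes an 8 × 8 layout and rook (edge-sharing) joins. The two example patterns are constructed for illustration and are not necessarily the ones on the slide.

```python
def join_counts(grid):
    """Count rook (edge-sharing) joins by colour pair in a grid of 'B'/'W' cells."""
    counts = {"BB": 0, "BW": 0, "WW": 0}
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    pair = "".join(sorted(grid[r][c] + grid[rr][cc]))
                    counts[pair] += 1
    return counts

n = 8
# Dispersed pattern: a checkerboard (negative spatial autocorrelation).
checkerboard = [["B" if (r + c) % 2 == 0 else "W" for c in range(n)] for r in range(n)]
# Clustered pattern: left half blue, right half white (positive autocorrelation).
clustered = [["B" if c < n // 2 else "W" for c in range(n)] for r in range(n)]

print(join_counts(checkerboard))
print(join_counts(clustered))
```

Both patterns contain 32 blue and 32 white cells, yet the checkerboard has only BW joins while the clustered pattern has almost none. The join counts therefore separate the two arrangements even though the cell totals are identical.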
Spatial regression (SR)
• Spatial regression (SR) is a global spatial modeling
technique in which spatial autocorrelation among the
regression parameters is taken into account. SR is
usually performed for spatial data obtained from spatial
zones or areas. The basic aim in SR modeling is to
establish the relationship between a dependent variable
measured over a spatial zone and other attributes of the
spatial zone, for a given study area, where the spatial
zones are subsets of the study area. While SR is
known as a modeling method in the spatial data analysis
literature, in the spatial data-mining literature it is considered
to be a classification technique.
Geographically weighted
regression (GWR)
• Geographically weighted regression (GWR) is a powerful
exploratory method in spatial data analysis. It serves for
detecting local variations in spatial behavior and
understanding local details, which may be masked by
global regression models. Unlike SR, where a regression
coefficient for each independent variable and the
intercept are obtained for the whole study region, in
GWR regression coefficients are computed for every
spatial zone. Therefore, the regression coefficients can
be mapped and the appropriateness of the stationarity
assumption in conventional regression analyses can
be checked.
Geographically weighted
regression (GWR)
• GWR is an effective technique for exploring
spatial nonstationarity, which is characterized by
changes in relationships across the study region
leading to varying relations between dependent
and independent variables. The need for a
better understanding of such spatial
processes has led to the emergence of local modeling
techniques. GWR has been implemented in
various disciplines such as the natural,
environmental, social and earth sciences.
Exploring spatial patterning in
spatial data values
• Two issues
• 1. How do variables change from place to
place? Is a zone similar to its neighbours?
• 2. How are variables related? How does the
relationship between rainfall and altitude vary
from place to place?
Local Statistics: moving window
Geographical Weights
• Binary: rook or queen neighbours
• Distance based
• Boundary or perimeter based
• Weights can be row-normalized using the number of adjacent cells
Local Univariate measures: moving window
• Standard univariate statistics can be computed for a
moving window, supplying the degree and
nature of variation in summary statistics
across a region of interest (e.g. we could
compute the standard deviation for several
windows and assess the degree of
variability from place to place).
• Geographical weighting schemes can be
used for the calculation of local statistics.
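A moving-window standard deviation can be sketched as below. The 5 × 5 raster, the 3 × 3 window and the function name window_std are assumptions made for illustration, and every cell in the window is weighted equally.

```python
from math import sqrt

# Hypothetical 5 x 5 raster: smooth on the left, rough on the right.
grid = [
    [10, 10, 11, 20, 5],
    [10, 11, 10, 4, 22],
    [11, 10, 10, 21, 6],
    [10, 10, 11, 5, 20],
    [10, 11, 10, 22, 4],
]

def window_std(grid, row, col, radius=1):
    """Population standard deviation of the cells within `radius` of (row, col).

    The window is clipped at the grid edge, so edge cells use fewer values.
    """
    rows, cols = len(grid), len(grid[0])
    values = [
        grid[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    ]
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

# One local statistic per cell: a map of local variability.
local_std = [[window_std(grid, r, c) for c in range(5)] for r in range(5)]
for row in local_std:
    print(" ".join(f"{v:6.2f}" for v in row))
```

The printed surface is low over the smooth left-hand columns and high over the rough right-hand columns, which a single global standard deviation would hide.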
Local spatial autocorrelation
• Global statistics such as Moran's I can mask
local spatial structure. The local Moran's I can be
used to measure local spatial autocorrelation.
Only if there is little or no variation in the local
observations do the global observations provide
any reliable information on the local areas within
the study area. As the spatial variation of the
local observations increases, the reliability of the
global observation as representative of local
conditions decreases.
Local spatial autocorrelation
The weights could be based on rook, queen, distance, or perimeter, and normalized
by the number of neighbours (slide 28).
Local spatial autocorrelation
Spatial autocorrelation
[Figure: Map A and Map B, shown against a scale running from negative spatial autocorrelation (dispersed), through spatial independence, to positive spatial autocorrelation (spatial clustering).]
Map A and Map B each represent a distinct geographic region. The number in the
regions (cells) represents the number of leukaemia cases in that region. These
two sets of values have the same mean and standard deviation. In contrast,
Moran's I statistic is -0.269 for the data on Map A and 0.041 for the data on Map B.
They differ because values in the regions have a different spatial arrangement.
The contiguity (or weight) matrix used by the Moran's I calculation will be different
and hence we get a different result.
A visual inspection of both maps suggests that A has negative autocorrelation
(negative Moran's I): the neighbouring values tend to be dissimilar, so no clustering
of like values is suggested. B has little autocorrelation because its Moran's I is near zero.
Spatial autocorrelation
[Figure: Grid A and Grid B, shown against a scale running from negative spatial autocorrelation (dispersed), through spatial independence, to positive spatial autocorrelation (spatial clustering).]
The grids A and B represent two different spatial resolutions over the same area.
Grid A contains 16 cells and Grid B contains 64 cells.
The strength of spatial autocorrelation is often a function of scale or spatial
resolution, as illustrated above using black and white cells. High negative
spatial autocorrelation is exhibited in A since each cell has a different colour from
its neighbouring cells. In B each cell of A is subdivided into four half-size cells,
assuming the cell's homogeneity. The strength of spatial autocorrelation
among the black and white cells then increases, while the same cell arrangement
is maintained. This illustrates that spatial autocorrelation varies with
the study scale, increasing from the 4-by-4 case to the 8-by-8 case.
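The scale effect can be reproduced numerically. The sketch below assumes Grid A is a 4 × 4 checkerboard and uses binary rook weights with the usual global form I = (n / S0) Σ w_ij z_i z_j / Σ z_i². The slides do not state which weight matrix was used, so the exact values are illustrative.

```python
def morans_i(grid):
    """Global Moran's I for a rectangular grid using binary rook weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    z = [[v - mean for v in row] for row in grid]
    cross = 0.0   # sum over ordered neighbour pairs of w_ij * z_i * z_j
    s0 = 0        # sum of all weights w_ij
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += z[r][c] * z[rr][cc]
                    s0 += 1
    sum_sq = sum(v * v for row in z for v in row)
    return (n / s0) * cross / sum_sq

# Grid A: 4 x 4 checkerboard (1 = black, 0 = white).
grid_a = [[(r + c) % 2 for c in range(4)] for r in range(4)]
# Grid B: each cell of A split into four identical half-size cells (8 x 8).
grid_b = [[grid_a[r // 2][c // 2] for c in range(8)] for r in range(8)]

i_a = morans_i(grid_a)
i_b = morans_i(grid_b)
print(round(i_a, 3), round(i_b, 3))
```

Under these assumptions Grid A gives I = -1 and Grid B gives I = 1/7 ≈ 0.143. The same pattern therefore moves from perfectly dispersed to mildly clustered purely because the cell size changed.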
Calculate local Moran's I for the central
cell i (value 42), where z_j = (x_j - mean)

Original data: a 3 × 3 grid of values, listed cell by cell in the table below.
Values, differences from the mean, and rook weights standardized to sum to 1:

y_j     z_j       w_ij     w_ij z_j
45       4.889    0.000     0.000
43       2.889    0.250     0.722
38      -2.111    0.000     0.000
44       3.889    0.250     0.972
42       1.889    0.000     0.000
32      -8.111    0.250    -2.028
44       3.889    0.000     0.000
39      -1.111    0.250    -0.278
34      -6.111    0.000     0.000
sum               1.000    -0.611

Mean = 40.111
Variance = 21.861
I_i = (1.889 / 21.861) × (-0.611) = -0.053
The local Moran's I has a low negative value:
neighbouring values tend to be
dissimilar.
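The arithmetic of this worked example can be reproduced in a few lines. The sketch assumes the variance uses the n - 1 divisor, which is what reproduces the 21.861 shown. The index positions of the centre and its rook neighbours follow the order in which the values are listed.

```python
# Values of the 3 x 3 example, in the order listed in the table.
values = [45, 43, 38, 44, 42, 32, 44, 39, 34]
centre = 4                  # index of the central cell (value 42)
neighbours = [1, 3, 5, 7]   # rook neighbours of the centre (43, 44, 32, 39)

n = len(values)
mean = sum(values) / n
z = [v - mean for v in values]

# Sample variance (divisor n - 1), which reproduces the 21.861 in the table.
variance = sum(d * d for d in z) / (n - 1)

# Row-standardized rook weights: each of the four neighbours gets 1/4.
w = 1 / len(neighbours)
spatial_lag = sum(w * z[j] for j in neighbours)

# Local Moran's I for the central cell.
local_i = (z[centre] / variance) * spatial_lag
print(round(mean, 3), round(variance, 3), round(spatial_lag, 3), round(local_i, 3))
```

The weighted sum of neighbouring deviations comes out as -0.611, so the printed local statistic matches the -0.053 in the example.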
[Figure: map of local Moran's I values. Global Moran's I = 0.665. Local I has large positive values in rural areas and is more patchy around Belfast.]
Spatial Regression
• The assumption of i.i.d. underlying
ordinary least squares regression rarely
holds for spatial data. There are several
techniques that handle the spatial case:
• Moving window regression
• Geographically Weighted Regression (GWR)
• We will look at GWR
Geographically Weighted Regression (GWR)
• The steps are:
1. Go to a location.
2. Conduct a regression using the raw data and
a geographic weighting scheme.
3. Move to the next location and go back to step 2
until all locations have been visited.
• The output is a set of regression
coefficients (e.g. slope and intercept) at
each location.
Coordinates of observations, variables, distance from the first
observation, and geographic weights

point   x    y    Var 1   Var 2   dist   Geo w
1       25   45   12      6       0      1
2       25   44   34      52      1      0.995
3       21   48   32      41      5      0.8825
4       27   52   12      25      8      0.7261
5       16   31   11      22      16     0.278
6       42   35   14      9       20     0.0889
7       9    65   56      43      26     0.034
8       29   76   75      67      32     0.006
9       61   66   43      32      42     0.0002

[Figure: location of the points in the table above.]
[Figure: regression using the table above and the point locations. The geographic weighting pulls the
line towards the points with larger weights.]
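Step 2 of GWR at the first location can be sketched as a weighted least squares fit that uses the tabulated geographic weights. Treating Var 2 as the response and Var 1 as the predictor is an assumption, since the slides do not say which is which. The function name weighted_line is hypothetical.

```python
# Var 1, Var 2 and geographic weights (relative to point 1) from the table.
var1 = [12, 34, 32, 12, 11, 14, 56, 75, 43]
var2 = [6, 52, 41, 25, 22, 9, 43, 67, 32]
geo_w = [1, 0.995, 0.8825, 0.7261, 0.278, 0.0889, 0.034, 0.006, 0.0002]

def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b

# Local fit at point 1: nearby points dominate the estimate.
a_local, b_local = weighted_line(var1, var2, geo_w)
# Global fit: equal weights reduce to ordinary least squares.
a_global, b_global = weighted_line(var1, var2, [1.0] * len(var1))

print(f"local : Var2 = {a_local:.2f} + {b_local:.3f} * Var1")
print(f"global: Var2 = {a_global:.2f} + {b_global:.3f} * Var1")
```

Repeating the local fit with a fresh set of weights centred on each observation in turn yields one intercept and slope per location, which is the output described in the steps above.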
Summary of spatial stats
• Moran's I measures the average correlation between
the value of a variable at one location and the value at
nearby locations.
• The local Moran statistic measures spatial dependence on a
local basis, allowing the researcher to see its variation
over space.
• Geographically Weighted Regression allows the
parameters of a regression analysis to vary spatially.
GWR helps in detecting local variations in spatial
behavior and understanding local details, which may be
masked by global regression models. In GWR, regression
coefficients are computed for every spatial zone.
© Oxford University Press, 2010. All rights reserved. Lloyd: Spatial Data Analysis
Two scatter plots and fitted lines for different aggregations of same value
Moran's I
• A contiguity matrix may represent a
neighbourhood relationship defined using
adjacency or Euclidean distance. There are
several definitions of adjacency, including a
four-neighbourhood and an eight-neighbourhood. Given
a gridded spatial framework, a four-neighbourhood
assumes that a pair of locations
influence each other if they share an edge
(rook). An eight-neighbourhood assumes that a
pair of locations influence each other if they
share either an edge or a vertex (queen).
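The two adjacency definitions can be turned into contiguity matrices for a grid as sketched below. The row-by-row cell numbering and the function name contiguity_matrix are choices made for this illustration.

```python
# Offsets to the neighbouring cells under each adjacency definition.
ROOK = ((-1, 0), (1, 0), (0, -1), (0, 1))
QUEEN = ROOK + ((-1, -1), (-1, 1), (1, -1), (1, 1))

def contiguity_matrix(rows, cols, offsets, row_normalize=True):
    """Contiguity matrix W for a rows x cols grid, cells numbered row by row."""
    n = rows * cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i][rr * cols + cc] = 1.0
            total = sum(w[i])
            if row_normalize and total > 0:
                # Each row sums to 1, so W z is an average of neighbouring values.
                w[i] = [v / total for v in w[i]]
    return w

rook = contiguity_matrix(3, 3, ROOK)
queen = contiguity_matrix(3, 3, QUEEN)

centre = 4  # middle cell of the 3 x 3 grid
print([round(v, 3) for v in rook[centre]])
print([round(v, 3) for v in queen[centre]])
```

For the centre of a 3 × 3 grid the rook row has four entries of 0.25 and the queen row has eight entries of 0.125. Corner cells have two rook neighbours and three queen neighbours.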
Moran's I
• Using a normalised weight matrix, the
values of I range from -1 to 1.
• Value = 1: perfect positive correlation
• Value = 0: no autocorrelation
• Value = -1: perfect negative correlation
• A Moran's I value may appear low (say 0.17) yet
still be statistically significant; because the
index is above 0, the pattern is clustered.
Moran’s I
• Global Moran’s I
• What is the extent of clustering in the total area?
• Is this clustering significantly different from a
random spatial distribution?
• Local Moran’s I
• Do local clusters (high-high or low-low) or local
spatial outliers (high-low or low-high) exist?
• Are these local clusters and spatial outliers
statistically significant?
Moran's I: A measure of spatial
autocorrelation
• Given $x = (x_1, \ldots, x_n)$ sampled over n locations, Moran's I is defined as
$$I = \frac{z W z^{T}}{z z^{T}}$$
where $z = (x_1 - \bar{x}, \ldots, x_n - \bar{x})$
and W is a normalized contiguity matrix.
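The matrix definition can be evaluated directly. The sketch below assumes a row-normalized rook contiguity matrix on a 3 × 3 grid and two made-up arrangements of the same nine values. It shows that identical value sets can give different Moran's I.

```python
def rook_weights(rows, cols):
    """Row-normalized rook contiguity matrix W, cells numbered row by row."""
    n = rows * cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for rr, cc in nbrs:
                w[r * cols + c][rr * cols + cc] = 1.0 / len(nbrs)
    return w

def morans_i(x, w):
    """Moran's I = (z W z^T) / (z z^T) for values x and weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]
    numerator = sum(z[i] * w[i][j] * z[j] for i in range(n) for j in range(n))
    denominator = sum(zi * zi for zi in z)
    return numerator / denominator

w = rook_weights(3, 3)
# Hypothetical data: the same nine values in two spatial arrangements.
clustered = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # values change gradually across the grid
dispersed = [1, 9, 2, 8, 3, 7, 4, 6, 5]   # high and low values alternate

i_clustered = morans_i(clustered, w)
i_dispersed = morans_i(dispersed, w)
print(round(i_clustered, 3), round(i_dispersed, 3))
```

The gradual arrangement gives a positive I and the alternating arrangement a negative I, although the mean and variance of the two value sets are identical.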
Fig. 7.5, pp. 190
Second Law of Geography
• Second law of geography: Spatial heterogeneity [Goodchild]
• Spatial heterogeneity describes geographic variation
in the constants or parameters of relationships.
• When it is present, the outcome of an analysis
depends on the area over which the analysis is made.
• Spatial heterogeneity depends on the spatial
resolution.
• A global model might be inconsistent with respect to
regional model(s).
Second Law of Geography
 Spatial heterogeneity definitions:
 quantitative information characterizing the
ground spatial structure
 spatial variance distribution of the variable
considered, within the coarse sample
resolution (e.g. pixel or grid)
 The patterning or patchiness in important
landscape properties such as vegetation
cover.
Second Law of Geography
 Spatial heterogeneity has been quantified from
remote sensing images by using two basic
approaches:
 (a) the direct image approach, where straight
reflectance or reflectance indices of remote
sensing images are used to quantify spatial
heterogeneity, using the original pixel size of the
image
 (b) the cartographic or patch mosaic approach,
where the image is subdivided into
homogeneous mapping units through
classification.
Second Law of Geography
• Suppose there is a relationship between the number of AIDS
cases and the number of people living in an area.
• The form of this relationship will vary spatially:
• in some areas the number of cases per capita will be higher
than in others
• we could map the constant of proportionality
• Spatial heterogeneity describes this geographic variation
in the constants or parameters of relationships. When it
is present, the outcome of an analysis depends on the
area over which the analysis is made. Often this area is
arbitrarily determined by a map boundary or political
jurisdiction.
Second Law of Geography
 Second law of geography [Goodchild]
 Spatial heterogeneity
 Global model often inconsistent with
regional models (e.g. the average does
not hold anywhere).
How to decide the weight wij?
The weight indicates the spatial interaction between entities.
1) Binary wij, also called absolute adjacency. This covers the general
case, answering the question: is a value in a region similar or
different to its neighbours?
wij = 1 if two geographic entities are adjacent; otherwise, wij = 0.
The adjacency definition can be queen's (8 neighbours) or rook's (4 neighbours).
How to decide the weight wij?
The weight indicates the spatial interaction between entities.
2) The distance between geographic entities. Often the inverse
distance is used: farther objects get less weight and nearer objects get
more weight, e.g. distance from the centre of an epidemic.
wij = f(dist(i,j)), where dist(i,j) is the distance between i and j.
3) The length of common boundary for area entities, e.g. policing
borders: shorter borders get less weight.
wij = f(leng(i,j)), where leng(i,j) is the length of common boundary
between i and j.
How to decide the weight wij?
The choice of weights should ultimately be driven by a rationale for including
those areas as neighbors that have a spatial effect on a given location. This
rationale can be derived from theory or be the result of using exploratory
spatial data analysis (ESDA) to experiment with different weights and
connectivity orders. Since weights
matrices are used to create spatial lags that average neighboring values, the
choice of a weights matrix will determine which neighboring values will be
averaged. For instance, since rook weights will usually have fewer neighbors
than queen weights, on average, each neighboring observation has more
influence.
How to decide the weight wij?
The question of which weights to choose is more pertinent in the context of
modeling than ESDA since modeling is based on substantive notions of
spatial effects while ESDA prioritizes the rejection of spatial randomness.
Therefore, if there are no substantive reasons to guide the choice of weights
in ESDA, using a weights file with as few neighbors as possible (such as
rook) makes sense. Especially with irregular areal units (as opposed to
grids), the difference between rook and queen weights is often minimal.
However, it is advisable to test how sensitive your results are to your weights
specifications by comparing multiple weights matrices.
Spatial Outlier Detection
 Global outliers are observations which
appear inconsistent with the remainder of
that data set.
 Global outliers deviate so much from other
observations that it may be possible that
they were generated by a different
mechanism.
 Spatial outliers are observations that
appear inconsistent with their neighbours.
Spatial Outlier Detection
 Detecting spatial outliers has important
applications in transportation, ecology,
public safety, public health, climatology
and location based services.
 Geographic objects have a spatial
(location, shape, metric & topological
properties) & non-spatial component
(house owner, sensor id., soil type).
Spatial Outlier Detection
 Spatial neighbourhoods may be defined using
spatial attributes & spatial relations.
 Comparisons between spatially referenced
objects can be based on non-spatial attributes.
 A spatial outlier is a spatially referenced object
whose non-spatial attribute values differ from
those of other spatially referenced objects in its
spatial neighbourhood.
Data for Outlier detection
In the diagram on the left, G, P, S and Q show a big change in attribute for a small change in
location. The right-hand diagram shows a normal distribution (corresponding to the
attribute axis in the left diagram).
Spatial Outlier Detection
 The upper left & lower
right quadrants of
figure 7.17 indicate a
spatial association of
dissimilar values; low
values surrounded by
high value neighbours
(P & Q) and high
values surrounded by
low values (S).
Spatial Outlier Detection
• A Moran outlier is a
point located in the
upper left or lower
right quadrant of a
Moran scatter plot.
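The quadrant test can be sketched by computing each location's deviation z and its spatial lag Wz, then labelling the point HH, LL, HL or LH. The five-location transect, its weights and the function name moran_quadrants are assumptions made for illustration.

```python
def moran_quadrants(x, w):
    """Label each observation HH, LL, HL or LH on the Moran scatter plot.

    x: attribute values; w: row-normalized weight matrix (list of lists).
    The first letter is the location's own value, the second its neighbours'.
    """
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]
    # Spatial lag: weighted average of the neighbours' deviations.
    wz = [sum(w[i][j] * z[j] for j in range(n)) for i in range(n)]
    labels = []
    for zi, wzi in zip(z, wz):
        own = "H" if zi >= 0 else "L"
        nbr = "H" if wzi >= 0 else "L"
        labels.append(own + nbr)
    return labels

# Hypothetical transect of five locations; interior cells have two neighbours.
values = [10, 11, 30, 12, 11]
weights = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

labels = moran_quadrants(values, weights)
# Upper left (LH) and lower right (HL) quadrants hold the Moran outliers.
outliers = [i for i, lab in enumerate(labels) if lab in ("HL", "LH")]
print(labels, outliers)
```

The spike at the middle location falls in the HL quadrant. Its two neighbours fall in LH, because their neighbourhood average is pulled up by the spike.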
Spatial Outlier Detection
[Figure: Moran scatter plot. Axes: z, values in a given location (horizontal), and Wz (vertical). Quadrants: Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH.]
Model Evaluation
• Consider the two-class classification problem
'nest' or 'no-nest'. The four possible outcomes
(or predictions) are shown on the next slide. The
desired predictions are:
• 1) where the model says there should be a nest and
there is an actual nest (True Positive)
• 2) where the model says there is no nest and there is
no nest (True Negative)
• The other outcomes are not desirable and point
to a flaw in the model.
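The four outcomes can be tallied as below. The eight cells of actual and predicted nest presence are invented for illustration, and the function name confusion_counts is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a two-class (nest / no-nest) prediction."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1   # model says nest, nest present
        elif not p and not a:
            counts["TN"] += 1   # model says no nest, no nest present
        elif p and not a:
            counts["FP"] += 1   # model says nest, but there is none
        else:
            counts["FN"] += 1   # model misses an actual nest
    return counts

# Hypothetical cells: True = nest, False = no nest.
actual    = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, True, True, False, False, False]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, accuracy)
```

Accuracy here is the share of desired predictions (true positives plus true negatives) among all cells.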
Model Evaluation
Spatial Statistical Models
• A point process is a model for the spatial
distribution of points in a point pattern.
Examples: the position of trees in a forest,
the location of petrol stations in a city.
• Actual real-world point patterns can be
compared (using distance) with a
randomly distributed point pattern.
Calculating the Local Moran I
Where the variance = 667.32 and mean = 55.82 from the entire
population
Calculating the Local Moran I
Calculating the Global Moran I
Statistics versus Data Mining
• Do we know the statistical properties of the data? Is the data
spatially clustered, dispersed, or random?
• Data mining is strongly related to statistical analysis.
• Data mining can be seen as a filter (exploratory data
analysis) applied before a rigorous statistical tool.
• Data mining generates hypotheses that are then
verified.
• The filtering process does not guarantee
completeness (wrong elimination or missing data).
• "Drowning in Data yet Starving for
Knowledge"
Data Mining: Outline
 Background to data mining & spatial data mining.
 The data mining process
 Spatial autocorrelation i.e. the non independence of
phenomena in a contiguous geographic area.
 Spatial independence
 Classical data mining concepts:
 Classification
 Clustering
 Association rules
 Spatial data mining, e.g. Co-location Rules
 Summary
Data Mining
 Data mining is the process of discovering
interesting and potentially useful patterns of
information embedded in large databases.
 Spatial data mining has the same goals as
conventional data mining but requires additional
techniques that are tailored to the spatial
domain.
 A key goal of spatial data mining is to partially
automate knowledge discovery, i.e., search for
“nuggets” of information embedded in very large
quantities of spatial data.
Data Mining
• Data mining lies at the intersection of database
management, statistics, machine learning and
artificial intelligence. DM provides semi-automatic
techniques for discovering
unexpected patterns in very large data sets.
• We must distinguish between operational
systems (e.g. bank account transactions) and
decision support systems (e.g. data mining).
Data Mining
• Spatial DM can be characterised by
Tobler's first law of geography (near things
tend to be more related than far things).
This means that the standard DM
assumption that values are independently
and identically distributed does not hold in
spatially dependent data (SDD). The term
spatial autocorrelation captures this
property, which needs to be included in DM
techniques.
Data Mining
• The important techniques in conventional
DM are association rules, clustering,
classification, and regression. These
techniques need to be modified for spatial
DM. Two approaches are used when adapting
DM techniques to the spatial domain:
• 1) Correct the underlying (i.i.d.) statistical model.
• 2) Modify the objective function which drives the
search so that it includes a spatial
term.
Data Mining
• Size of spatial data sets:
• NASA's Earth Orbiting Satellites capture about a
terabyte (10^12 bytes) a day, YouTube 2008 = 6 terabytes.
• Environmental agencies, utilities (e.g. ESB), the Central
Statistics Office, government departments such as
health/agriculture, and local authorities all have large
spatial data sets.
• It is very difficult to analyse such large data sets
manually.
• For examples see Chapter 7 from SDT
Data Mining: Sub-processes
 Data mining involves many sub-processes:
 Data collection: usually data was collected as
part of the operational activities of an
organization, not for the data mining task. It is
unlikely that the data mining requirements were
considered during data collection.
 Data extraction/cleaning: hence data must be
extracted & cleaned for the specific data mining
task.
Data Mining: Sub-processes
 Feature selection.
 Algorithm design.
 Analysis of output
 Level of aggregation at which the data is
being analysed must be decided. Identical
experiments at different levels of scale can
sometimes lead to contradictory results
(e.g. the choice of basic spatial unit can
influence the results of a social survey).
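The effect of the aggregation level can be shown with a small made-up data set. In the sketch below the six households, their zone ids, and the helper names `pearson` and `zone_means` are all assumptions for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def zone_means(rows):
    """Average x and y within each zone."""
    grouped = {}
    for zone, x, y in rows:
        grouped.setdefault(zone, []).append((x, y))
    xs = [sum(x for x, _ in pts) / len(pts) for pts in grouped.values()]
    ys = [sum(y for _, y in pts) / len(pts) for pts in grouped.values()]
    return xs, ys

# Hypothetical households: (zone id, attribute x, attribute y)
households = [
    (1, 1, 5), (1, 5, 1),
    (2, 2, 6), (2, 6, 2),
    (3, 3, 7), (3, 7, 3),
]

r_household = pearson([x for _, x, _ in households],
                      [y for _, _, y in households])
r_zone = pearson(*zone_means(households))

print(round(r_household, 3))  # negative at household level
print(round(r_zone, 3))       # positive once aggregated to zones
```

The same data give a negative relationship at household level and a positive one at zone level, so the unit of analysis has to be chosen deliberately.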
Geographic Data mining process
Close interaction between Domain Expert & Data-Mining Analyst
The output consists of hypotheses (data patterns) which can be verified with
statistical tools and visualised using a GIS.
The analyst can interpret the patterns and recommend appropriate actions.
Statistics versus Data Mining
 Do we know the statistical properties of data? Is data
spatially clustered, dispersed, or random?
 Data mining is strongly related to statistical analysis.
 Data mining can be seen as a filter (exploratory data
analysis) before applying a rigorous statistical tool.
 Data mining generates hypotheses that are then
verified.
 The filtering process does not guarantee
completeness (wrong elimination or missing data).
Unique features of spatial data
mining
 The difference between classical & spatial
data mining parallels the difference
between classical & spatial statistics.
 Statistics assumes the samples are
independently generated, which is
generally not the case with SDD.
 Like things tend to cluster together.
 Change tends to be gradual over space.
Non-Spatial Descriptive Data
Mining
 Descriptive analysis is an analysis that results in some description or
summarization of data. It characterizes the properties of the data by
discovering patterns in the data, which would be difficult for the human
analyst to identify by eye or by using standard statistical techniques.
Description involves identifying rules or models that describe data. Both
clustering and association rules are employed by supermarket chains.
 Clustering (unsupervised learning) is a descriptive data mining technique.
Clustering is the task of assigning cases into groups of cases (clusters) so
that the cases within a group are similar to each other and are as different
as possible from the cases in other groups. Clustering can identify groups
of customers with similar buying patterns and this knowledge can be used
to help promote certain products. Clustering can help locate what are the
crime ‘hot spots’ in a city.
 Association Rules. Association rule discovery identifies the relationships
within data. The rule can be expressed as a predicate in the form (IF x
THEN y ). ARD can identify product lines that are bought together in a
single shopping trip by many customers and this knowledge can be used
by a supermarket chain to help decide on the layout of the product lines.
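A minimal sketch of clustering follows, using a one-dimensional k-means procedure as one possible clustering method. The customer spend figures, the starting centres, and the function name `k_means_1d` are assumptions for illustration.

```python
def k_means_1d(points, centres, iterations=10):
    """Very small k-means for one-dimensional data with given starting centres."""
    groups = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            groups[nearest].append(p)
        # Update step: each centre moves to the mean of its group
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups

# Hypothetical weekly spend of six customers
spend = [1, 2, 3, 10, 11, 12]
centres, groups = k_means_1d(spend, [1.0, 12.0])
print(centres)  # [2.0, 11.0]
print(groups)   # [[1, 2, 3], [10, 11, 12]]
```

The two groups correspond to low-spend and high-spend customers; no class labels were supplied, which is why clustering is called unsupervised.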
Non-Spatial Predictive Data Mining
 Predictive DM results in some description or summarization of a
sample of data which predicts the form of unobserved data.
Prediction involves building a set of rules or a model that will enable
unknown or future values of a variable to be predicted from known
values of another variable.
 Classification is a predictive data mining technique. Classification is
the task of finding a model that maps (classifies) each case into one
of several predefined classes. Classification is used in risk
assessment in the insurance industry.
 Regression analysis is a predictive data mining technique that uses
a model to predict a value. Regression can be used to predict sales
of new product lines based on advertising expenditure.
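The regression example can be sketched with an ordinary least-squares line. The advertising and sales figures and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical advertising expenditure and resulting sales
advertising = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 5 + intercept   # predict sales for an unseen expenditure
print(slope, intercept, predicted)  # 2.0 1.0 11.0
```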
Case Study
 Data from 1995 & 1996 concerning two wetlands
on the shores of Lake Erie, USA.
 Using this information we want to predict the
spatial distribution of a marsh-breeding bird called
the red-winged blackbird. Where will they build
nests? What conditions do they favour?
 A uniform grid (pixel = 5 square metres) was
superimposed on the wetland.
 Seven attributes were recorded.
 See the link to Spatial Databases: A Tour for details.
Case Study
 Significance of three key variables
established with statistical analysis:
 Vegetation durability
 Distance to open water
 Water depth
 The spatial distribution is shown in 7.3.
Case Study
Maps shown: nest locations, water depth, distance to open water, vegetation durability.
Example showing different predictions: (a) the actual locations of nests; (b) pixels with actual nests;
(c) locations predicted by one model; and (d) locations predicted by another model. Prediction (d) is
spatially more accurate than (c).
Classical statistical assumptions do
not hold for spatially dependent
data
Case Study
 The previous maps illustrate two important
features of spatial data:
 Spatial Autocorrelation (not independent)
 Spatial data is not identically distributed.
 Two random variables are identically
distributed if and only if they have the
same probability distribution.
Why spatial DBs do not use
classical DM
 Rich data types (e.g., extended spatial
objects)
 Implicit spatial relationships among the
variables,
 Observations that are not independent,
 Spatial autocorrelation exists among the
features.
Classical Data Mining
 Association rules: Determination of interaction between attributes. For
example:
 X → Y
 Classification: Estimation of the attribute of an entity in terms of
attribute values of another entity. Some applications are:
 Predicting locations (shopping centers, habitat, crime zones)
 Thematic classification (satellite images)
 Clustering: Unsupervised learning, where classes and the number
of classes are unknown. Uses similarity criterion. Applications:
Clustering pixels from a satellite image on the basis of their spectral
signature, identifying hot spots in crime analysis and disease
tracking.
 Regression: takes a numerical dataset and develops a
mathematical formula that fits the data. The results can be used to
predict future behavior. Works well with continuous quantitative data
like weight, speed or age. Not good for categorical data where order
is not significant, like color, name, gender, nest/no nest.
Determining the Interaction among
Attributes
 We wish to discover relationships
between attributes of a relation.
is_close(house,beach) -> is_expensive(house)
low(vegetationDurability) -> high(stem density)
 Associations & association rules are often
used to select subsets of features for more
rigorous statistical correlation analysis.
How does data mining differ from
conventional methods of data analysis?
 Using conventional data analysis the analyst formulates
and refines the hypothesis. This is known as hypothesis
verification, which is an approach to identifying patterns
in data where a human analyst formulates and refines
the hypothesis. For example "Did the sales of cream
increase when strawberries were available?"
 Using data mining the hypothesis is formulated and
refined without human input. This approach, known as
hypothesis generation, identifies
patterns in the data where the hypotheses are
automatically formulated and refined. Knowledge
discovery is where the data mining tool formulates and
refines the hypothesis by identifying patterns in the data.
For example, "What are the factors that determine the
sales of cream?"
Association rules
 An association rule is a pattern that can
be expressed as a predicate in the form
(IF x THEN y ), where x and y are
conditions (about cases). The rule states that if x
(the antecedent) occurs then, in most
cases, so will y (the consequent). The
antecedent may contain several
conditions but the consequent usually
contains only one term.
Association rules
 Association rules need to be discovered. Rule
discovery is a data mining technique that identifies
relationships within data. In the non-spatial case
rule discovery is usually employed to discover
relationships within transactions or between
transactions in operational data. The relative
frequency with which an antecedent appears in
a database is called its support. The level at
which this relative frequency is considered
significant is called the support threshold
(say 70%); an antecedent at or above it has high support.
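Support and the support threshold can be sketched as follows. The five transactions and the function name `support` are made up for illustration; the 70% threshold is the figure suggested above.

```python
# Hypothetical transactions, each a set of purchased items
transactions = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "sugar"},
    {"strawberries"},
    {"strawberries", "bread"},
    {"bread"},
]

def support(itemset, transactions):
    """Relative frequency of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

SUPPORT_THRESHOLD = 0.7  # "say 70%"

s_strawberries = support({"strawberries"}, transactions)
s_bread = support({"bread"}, transactions)
print(s_strawberries, s_strawberries >= SUPPORT_THRESHOLD)  # 0.8 True
print(s_bread, s_bread >= SUPPORT_THRESHOLD)                # 0.4 False
```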
Association rules
 Example: Market basket analysis is a form
of association rule discovery that
discovers relationships in the purchases
made by a customer during a single
shopping trip. An itemset in the context of
market basket analysis is the set of items
found in a customer’s shopping basket.
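A market basket rule such as strawberries -> cream can be scored as sketched below. Two definitions are assumed here because this slide does not spell them out: rule support is taken as the fraction of baskets containing both sides, and confidence as the fraction of antecedent baskets that also contain the consequent. The baskets and function names are made up.

```python
# Hypothetical shopping baskets (itemsets)
baskets = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "sugar"},
    {"strawberries"},
    {"strawberries", "bread"},
    {"bread"},
]

def count(itemset, baskets):
    """Number of baskets containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets)

def rule_stats(antecedent, consequent, baskets):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = count(set(antecedent) | set(consequent), baskets)
    support = both / len(baskets)
    confidence = both / count(antecedent, baskets)
    return support, confidence

rule_support, rule_confidence = rule_stats({"strawberries"}, {"cream"}, baskets)
print(rule_support, rule_confidence)  # 0.4 0.5
```

Here cream appears in half of the baskets that contain strawberries, so the rule holds in only 50% of the cases where its antecedent occurs.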
Association rules & Spatial
Domain
 Differences with respect to the spatial domain:
1. The notion of transaction or case does not exist, since data
are immersed in a continuous space. Partitioning the
space may introduce errors, leading to over- or
under-estimation of confidences. The notion of transaction is
replaced by neighbourhood.
2. Itemsets are smaller in the spatial domain. Thus, the
cost of generating candidates is not a dominant factor. The
enumeration of neighbours dominates the final
computational cost.
3. In most cases, the spatial items are discrete versions of
continuous variables.
Spatial Association Rules
 Table 7.5 shows examples of association
rules, with their support and confidence, that were
discovered in the Darr 1995 wetland data.
Co-Location rules
 Co-location rules attempt to generalise association rules to
point collection data sets that are indexed by space. The
co-location pattern discovery process finds frequently co-located
subsets of spatial event types given a map of their
locations, see Figure 7.12.
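One way to score how strongly two event types are co-located is the participation ratio used in the co-location literature, sketched below. The point coordinates, the neighbourhood radius, and the function names are assumptions for illustration and are not taken from Figure 7.12.

```python
from math import dist

def participation_ratio(points, a, b, radius):
    """Fraction of type-a instances with at least one type-b instance within radius."""
    a_pts = [(x, y) for kind, x, y in points if kind == a]
    b_pts = [(x, y) for kind, x, y in points if kind == b]
    near = sum(any(dist(p, q) <= radius for q in b_pts) for p in a_pts)
    return near / len(a_pts)

def participation_index(points, a, b, radius):
    """Smaller of the two participation ratios for the pair {a, b}."""
    return min(participation_ratio(points, a, b, radius),
               participation_ratio(points, b, a, radius))

# Hypothetical map: (event type, x, y)
points = [
    ("+", 0, 0), ("x", 0, 1),
    ("+", 5, 5), ("x", 5, 6),
    ("+", 10, 0), ("o", 10, 1),
]

pi_plus_x = participation_index(points, "+", "x", radius=1.5)
pi_plus_o = participation_index(points, "+", "o", radius=1.5)
print(round(pi_plus_x, 3))  # 0.667
print(round(pi_plus_o, 3))  # 0.333
```

On this toy map {+, x} is the stronger co-location candidate, and no transactions were needed: only neighbourhoods.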
Co-location Examples
(a) Illustration of Point Spatial Co-location Patterns. Shapes represent different
spatial feature types. Spatial features in sets {+,x} and {o,*} tend to be
located together.
(b) Illustration of Line String Co-location Patterns. Highways and frontage
roads are co-located, e.g., Hwy100 is near frontage road Normandale
Road.
Two co-location patterns
Spatial Association Rules
 A spatial association rule is a rule indicating an
association relationship among a set of spatial and possibly
some non-spatial predicates.
 Spatial association rules (SPAR) are defined in terms of
spatial predicates rather than items.
 P1 ∧ P2 ∧ .. ∧ Pn -> Q1 ∧ .. ∧ Qm
 Where at least one of the terms (P or Q) is a spatial
predicate.
is(x,country) ∧ touches(x,Mediterranean) ->
is(x,wine-exporter)
Co-location V Association Rules
 Transactions are disjoint while spatial co-locations
are not, so a substitute for the transaction is needed.
Three main options:
 1. Divide the space into areas and treat them
as transactions
 2. Choose a reference point pattern and treat
the neighbourhood of each of its points as a
transaction
 3. Treat all point patterns as equal
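Option 2 can be sketched as follows: each reference point yields one transaction made of the feature types found in its neighbourhood. The coordinates, feature names, radius, and the function name `neighbourhood_transactions` are made up for illustration.

```python
from math import dist

def neighbourhood_transactions(reference, others, radius):
    """One transaction per reference point: the set of feature types within radius."""
    return [
        {kind for kind, x, y in others if dist((rx, ry), (x, y)) <= radius}
        for rx, ry in reference
    ]

# Hypothetical reference pattern (e.g. houses) and other point features
reference = [(0, 0), (10, 10)]
others = [
    ("school", 1, 0),
    ("park", 0, 1),
    ("park", 10, 11),
    ("factory", 50, 50),
]

transactions = neighbourhood_transactions(reference, others, radius=2)
print(transactions)
```

The resulting sets can then be fed to an ordinary association rule miner. Note that a feature lying near two reference points would appear in both transactions, which is one reason the choice between the three options matters.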
Co-location V Association Rules
 Spatial Association Rules Mining (SARM) is similar to
the raster view in the sense that it tessellates a study
region S into discrete groups based on spatial or aspatial
predicates derived from concept hierarchies. For
instance, a spatial predicate close_to(α, β) divides S
into two groups, locations close to β and those not. So,
close_to(α, β) can be either true or false depending on α’s
closeness to β. A spatial association rule is a rule that
consists of a set of predicates in which at least one spatial
predicate is involved. For instance, is_a(α, house) ∧
close_to(α, beach) -> is_expensive(α). This approach
efficiently mines large datasets using a progressive
deepening approach.
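The rule is_a(α, house) ∧ close_to(α, beach) -> is_expensive(α) can be evaluated on toy data as sketched below. The house locations, prices, the distance threshold, and the price limit are all assumptions for illustration; the progressive deepening search itself is not shown.

```python
from math import dist

beach = (0.0, 0.0)  # hypothetical beach location

# Hypothetical houses, so is_a(α, house) holds for every record
houses = [
    {"location": (1.0, 0.0), "price": 900},
    {"location": (0.0, 2.0), "price": 800},
    {"location": (3.0, 0.0), "price": 300},
    {"location": (20.0, 0.0), "price": 350},
    {"location": (0.0, 30.0), "price": 700},
]

def close_to(a, b, threshold=5.0):
    """Spatial predicate: true when a lies within threshold of b."""
    return dist(a, b) <= threshold

def is_expensive(house, limit=500):
    """Non-spatial predicate on the price attribute."""
    return house["price"] >= limit

near_beach = [h for h in houses if close_to(h["location"], beach)]
near_and_expensive = [h for h in near_beach if is_expensive(h)]

rule_support = len(near_and_expensive) / len(houses)
rule_confidence = len(near_and_expensive) / len(near_beach)
print(rule_support, round(rule_confidence, 3))  # 0.4 0.667
```

The predicate close_to splits the houses into two groups, near the beach and not, which is the tessellation of S described above.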