Spatial Statistics and Spatial Knowledge Discovery

Spatial Databases
First law of geography [Tobler]: Everything is related to everything else, but
near things are more related than distant things.
Lecture 8 : Spatial Statistics
Autocorrelation & Geographically
Weighted Regression
Pat Browne
Correlation
The correlation coefficient is a measure of the degree of linear
relationship between two variables, X and Y. Correlation measures
the strength of a relationship between data. The correlation
coefficient ranges from -1 to +1. Unlike regression (discussed
later), correlation does not treat one variable as the response
and the other as explanatory. A high correlation does not mean
that one thing causes the other (there could be other reasons the
data has high correlation).
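As an illustration of the definition above, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the height/snowfall numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: height above sea level (m) and snowfall (cm).
height = [100, 200, 300, 400, 500]
snowfall = [12, 15, 21, 24, 30]
r = pearson_r(height, snowfall)

# A perfectly linear decreasing relationship sits at the lower bound.
perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])
```

A value of `r` close to +1 says the points lie near an increasing straight line; it says nothing about why.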
Regression
• Regression: takes a numerical dataset
and develops a mathematical formula that
fits the data. The results can be used to
predict future behaviour. Works well with
continuous quantitative data like weight,
speed or age. Not good for categorical
data where order is not significant, like
colour, name, gender, nest/no nest.
Example: plotting snowfall against height
above sea level.
Standard statistical concepts: Regression
Y = A + BX; The response variable is Y, and X is the
continuous explanatory variable. Parameter A is the
intercept. Parameter B is the slope coefficient. The
difference between each data point and the value
predicted by the line (the model) is called a residual.
Regression
Where $\bar{X}$ and $\bar{Y}$ are the means of X and Y.
Alternative terminology for linear regression equation:
Y = a + bX where
•Y is the dependent variable
•a is the intercept
•b is the slope or regression coefficient
•X is the independent variable
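The line Y = a + bX can be fitted by hand. The sketch below assumes the standard least-squares estimates, b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical illustrative data.
xs = [0, 1, 2, 3, 4]
ys = [5.0, 5.2, 6.1, 6.7, 7.0]
a, b = fit_line(xs, ys)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
```

For a least-squares fit with an intercept, the residuals sum to zero, which is a quick sanity check on the estimates.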
Regression Model in R (see Lab)
Moving the line to get a best fit
Changing the slope of the line to get a best fit
R can calculate the maximum likelihood estimate of
the intercept and slope giving: y = 4.8 + (0.6 * x)
Local Versus Global Statistics.
From “Geographically Weighted Regression” by Fotheringham, Brunsdon, Charlton
The ecological fallacy and the modifiable areal
unit problem
From “Spatial data analysis” by Christopher D. Lloyd
We often need to use spatially aggregated
data, for example census zones or cells in
remotely sensed images. Such zones are
unlikely to be internally homogeneous. A cell in
a remotely sensed image has only one value,
but in the real world there may be several
features in the area covered by the cell. The
variation within an area is lost if the area is
larger than the individual features it contains.
Ecological fallacy / Modifiable areal unit problem (MAUP)
From “Spatial data analysis” by Christopher D. Lloyd
The ecological fallacy refers to the problem of
making inferences about individuals from
aggregate data. For example, not all people in
one census zone are likely to share the same
characteristics. The majority of people in a
census zone may be wealthy, but if there is a
housing estate (high density) just inside one
edge of the zone then clearly generalizations
about the population of the zone may be
unsound.
Modifiable areal unit problem (MAUP)
From “Spatial data analysis” by Christopher D. Lloyd
• The MAUP is composed of two parts:
– The scale effect: Statistical analyses based on
data aggregated over areas of different sizes will
produce different results.
– The zoning effect: Two sets of zones can have
the same or similar areas but very different forms
and analyses based on two such sets of zones
may vary.
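The scale effect can be sketched with synthetic data. The example below assumes two variables on a 16 x 16 grid that share a smooth trend but carry independent noise, and compares their correlation before and after averaging over 4 x 4 zones. All names and numbers are hypothetical, and only the scale effect is shown, not the zoning effect.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def aggregate(grid, block):
    """Average a square grid over non-overlapping block x block zones."""
    n = len(grid)
    return [
        [
            sum(grid[i + di][j + dj] for di in range(block) for dj in range(block))
            / block ** 2
            for j in range(0, n, block)
        ]
        for i in range(0, n, block)
    ]

def flatten(grid):
    return [v for row in grid for v in row]

random.seed(0)
n = 16
# Synthetic variables: a shared trend (i + j) plus independent noise.
x = [[i + j + random.gauss(0, 5) for j in range(n)] for i in range(n)]
y = [[i + j + random.gauss(0, 5) for j in range(n)] for i in range(n)]

# Correlation for the raw cells (block 1) and for 4 x 4 zones (block 4).
r_by_zone_size = {
    block: pearson_r(flatten(aggregate(x, block)), flatten(aggregate(y, block)))
    for block in (1, 4)
}
```

Averaging over larger zones cancels much of the cell-level noise, so the same underlying data yields a noticeably higher correlation at the coarser scale.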
Moving Window
From “Spatial data analysis” by Christopher D. Lloyd
Moving windows (MW) map how values
change from place to place. MW are used in
many contexts, including finding the local
gradient of the terrain.
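A minimal sketch of a moving window, assuming a square window whose statistic is the mean and which is clipped at the grid edges. The function name `moving_window_mean` and the sample grid are hypothetical.

```python
def moving_window_mean(grid, radius=1):
    """Mean of each cell's (2*radius+1)^2 window, clipped at the grid edges."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Collect the values that fall inside the window and the grid.
            window = [
                grid[r][c]
                for r in range(max(0, i - radius), min(rows, i + radius + 1))
                for c in range(max(0, j - radius), min(cols, j + radius + 1))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

# Hypothetical elevation grid.
elevation = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
smoothed = moving_window_mean(elevation)
```

Replacing the mean with another local statistic (range, variance, a slope estimate) gives other moving-window maps.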
Spatial autocorrelation
• Spatial autocorrelation (SA) is the degree of
correlation between neighbouring values of some
property of a region (e.g. population). SA occurs
when the value of a variable in a location is
correlated with values of the same variable in the
neighbourhood. SA is measured with Moran’s I.
• Moran’s I measures the average correlation
between the value of a variable at one location and
the value at nearby locations. The essential idea is
to specify pairs of locations that influence each other
along with the relative intensity of interaction.
Moran’s I provides a global view of spatial
autocorrelation.
Moran’s I
• The range of the Moran's I statistic
depends on the spatial weight matrix.
• When Moran's I is scaled by its bounds the
statistic is restricted to the range ±1
• Moran’s I can serve as a tool for modeling
spatial dependencies in many data mining
techniques.
Same Mean and SD but different
Moran’s I
Spatial Autocorrelation: Moran’s I example
Figure 7.5, p. 190
• The pixel values in (b) and (c) are the same, but their Moran’s I values differ.
• Q: Which of the datasets (b) and (c) has the higher spatial autocorrelation?
Neighbours
Immediate neighbours can be considered using either the rook’s or the queen’s case.
The neighbour relation can be weighted with simple adjacency or with more complex
calculations, such as boundary length.
Geographical Weights
• Binary: rook or queen neighbours
• Distance based
• Boundary or perimeter based
• Weights can be row-normalized using the number of adjacent cells
Neighbourhood relationship contiguity matrix
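The binary weights and the row normalization listed above can be sketched for a regular grid. This assumes cells are numbered row by row starting at 0; the function names `contiguity_matrix` and `row_normalize` are hypothetical.

```python
def contiguity_matrix(rows, cols, case="rook"):
    """Binary contiguity matrix for a rows x cols grid, cells numbered row by row."""
    if case == "rook":
        # Rook: cells sharing an edge (up, down, left, right).
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif case == "queen":
        # Queen: cells sharing an edge or a corner.
        offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0)]
    else:
        raise ValueError("case must be 'rook' or 'queen'")
    n = rows * cols
    w = [[0] * n for _ in range(n)]
    for i in range(rows):
        for j in range(cols):
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    w[i * cols + j][ni * cols + nj] = 1
    return w

def row_normalize(w):
    """Divide each row by its sum so that every row's weights add up to 1."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in w]

rook = contiguity_matrix(3, 3, "rook")
queen = contiguity_matrix(3, 3, "queen")
rook_norm = row_normalize(rook)
```

In a 3 x 3 grid the centre cell has 4 rook neighbours and 8 queen neighbours, while a corner cell has 2 and 3 respectively, so after row normalization each corner neighbour carries more weight than each centre neighbour.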
Spatial Lag Example
[Figure: sample regions, with region IDs at top left and values in the centre]
• Spatial lag = sum of spatially weighted values of neighboring cells
Lag for cell 2 = 1/3(7) + 1/3(5) + 1/3(4) = 5.3
Spatial Lag
• Map 1 and Map 2 represent a set of
rainfall readings for regions labelled A to I.
For both maps the mean is 10, and the
standard deviation is 3.8.
• Lag for E in Map1=(6+7+13+14)/4=10
• Lag for E in Map2=(7+8+6+5)/4 =6.5
• In Map 1 the lag equals the value at E; in Map 2 the lag is
less than the value at E. Hence E is more like its neighbours
in Map 1 than in Map 2 (rook’s case).
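The lag arithmetic above can be reproduced with a short sketch. It assumes row-normalized weights, so each of a cell's k neighbours gets weight 1/k; the function name `spatial_lag` is hypothetical, and the neighbour values are the ones quoted in the slides.

```python
def spatial_lag(neighbour_values):
    """Spatial lag with row-normalized weights: each neighbour weighs 1/k."""
    weight = 1 / len(neighbour_values)
    return sum(weight * v for v in neighbour_values)

# Rook neighbours of region E in Map 1 and Map 2.
lag_e_map1 = spatial_lag([6, 7, 13, 14])
lag_e_map2 = spatial_lag([7, 8, 6, 5])

# Cell 2 of the earlier example, which has three neighbours.
lag_cell2 = spatial_lag([7, 5, 4])
```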
Spatial autocorrelation
[Figure: patterns ranging from negative SA (dispersed), through spatial independence, to positive SA (spatial clustering)]
Moran’s I
• Global Moran’s I
• What is the extent of clustering in the total area?
• Is this clustering significantly different from a
random spatial distribution?
• Local Moran’s I
• Do local clusters (high-high or low-low) or local
spatial outliers (high-low or low-high) exist?
• Are these local clusters and spatial outliers
statistically significant?
• Local Moran is a special case of Local indicators
of spatial association (LISA)
Moran Scatter Plot
Scatter Diagram between X and Lag-X, the “spatial lag” of X
formed by averaging all the values of X for the neighboring
polygons
Identifies which type of spatial autocorrelation exists.
Low/High: negative SA
Low/Low: positive SA
High/High: positive SA
High/Low: negative SA
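The quadrant labels can be assigned programmatically. The sketch below assumes that both a value and its spatial lag are classed as High or Low relative to the mean of X, with the first label describing the location and the second its neighbours. Function names and data are hypothetical.

```python
def moran_quadrants(values, lags):
    """Label each location as '<own>/<neighbours>' relative to the mean of values."""
    mean = sum(values) / len(values)
    labels = []
    for value, lag in zip(values, lags):
        own = "High" if value >= mean else "Low"
        neighbours = "High" if lag >= mean else "Low"
        labels.append(f"{own}/{neighbours}")
    return labels

def sa_type(label):
    """Positive SA when a location resembles its neighbours, negative otherwise."""
    own, neighbours = label.split("/")
    return "positive SA" if own == neighbours else "negative SA"

# Hypothetical values and their spatial lags (mean of values is 5).
values = [2, 9, 8, 1]
lags = [8, 8.5, 1.5, 2]
labels = moran_quadrants(values, lags)
types = [sa_type(label) for label in labels]
```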
Briggs Henan University 2010
Moran’s I index
Spatial Autocorrelation: Case Study
[Figure panels: nest locations, distance to open water, vegetation durability, water depth]
Spatial Autocorrelation
Classical statistical assumptions
(i.i.d.) do not hold for spatially
dependent data
Moran’s I - example
• Moran I statistic for map 1 is 0.55316092
• Moran I statistic for map 2 is -0.76724138
Spatial Autocorrelation: Moran Scatterplot Map
São Paulo, old-aged population
[Figure: Moran scatter plot with z on the horizontal axis and Wz on the vertical axis; quadrants Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH]
Moran’s I: A measure of spatial autocorrelation
• Given $x = \{x_1, \ldots, x_n\}$ sampled over n locations, Moran’s I is defined as
$$I = \frac{z W z^t}{z z^t}$$
where $z = \{x_1 - \bar{x}, \ldots, x_n - \bar{x}\}$, $\bar{x}$ is the mean of x,
and W is a normalized contiguity matrix.
Fig. 7.5, p. 190
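The definition of Moran’s I can be sketched directly in Python. The example assumes a row-normalized rook contiguity matrix on a regular grid; the function names and both test patterns are hypothetical.

```python
def rook_weights(rows, cols):
    """Row-normalized rook contiguity matrix, cells numbered row by row."""
    n = rows * cols
    w = [[0.0] * n for _ in range(n)]
    for i in range(rows):
        for j in range(cols):
            nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < rows and 0 <= j + dj < cols]
            for ni, nj in nbrs:
                w[i * cols + j][ni * cols + nj] = 1.0 / len(nbrs)
    return w

def morans_i(x, w):
    """Moran's I = (z W z^t) / (z z^t), with z the deviations from the mean."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    # Spatial lag of the deviations: (W z)_i = sum_j w_ij * z_j.
    wz = [sum(w[i][j] * z[j] for j in range(n)) for i in range(n)]
    numerator = sum(z[i] * wz[i] for i in range(n))
    denominator = sum(v * v for v in z)
    return numerator / denominator

w = rook_weights(4, 4)

# Same set of values (eight 0s and eight 1s), arranged in two different ways.
checkerboard = [(i + j) % 2 for i in range(4) for j in range(4)]
clustered = [0 if j < 2 else 1 for i in range(4) for j in range(4)]

i_checkerboard = morans_i(checkerboard, w)
i_clustered = morans_i(clustered, w)
```

Both patterns have the same mean and standard deviation, yet the checkerboard gives a strongly negative Moran’s I and the clustered pattern a positive one.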
How to decide the weight wij ?
The weight indicates the spatial interaction between entities.
1) Binary wij, also called absolute adjacency. Covers the general
case, answering the question: is a value in a region similar or
different to its neighbours?
wij = 1 if two geographic entities are adjacent; otherwise, wij = 0.
The adjacency definition can be the queen’s case (8 neighbours) or the rook’s case (4 neighbours).
2) The distance between geographic entities. Often the inverse
distance is used: farther objects get less weight and nearer objects get
more weight, e.g. distance from the centre of an epidemic.
wij = f(dist(i,j)), where dist(i,j) is the distance between i and j.
3) The length of the common boundary for area entities, e.g. policing
borders: shorter shared borders get less weight.
wij = f(leng(i,j)), where leng(i,j) is the length of the common boundary
between i and j.
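A minimal sketch of distance-based weights, assuming f is the inverse of the Euclidean distance raised to a power and that all points are distinct. The function name `inverse_distance_weights`, the `power` parameter and the coordinates are hypothetical.

```python
from math import hypot

def inverse_distance_weights(points, power=1.0):
    """w_ij = 1 / dist(i, j)**power for i != j, and 0 on the diagonal.

    Assumes all points are distinct; coincident points would divide by zero.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j:
                w[i][j] = 1.0 / hypot(xi - xj, yi - yj) ** power
    return w

# Hypothetical point locations.
points = [(0, 0), (1, 0), (0, 2)]
w = inverse_distance_weights(points)
```

Here point 1 is closer to point 0 than point 2 is, so `w[0][1]` is larger than `w[0][2]`. A larger `power` makes the weights fall off faster with distance.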
The choice of weights should ultimately be driven by a rationale for including
those areas as neighbors that have a spatial effect on a given location. This
rationale can be derived from theory or be the result of using exploratory spatial data analysis (ESDA) to
experiment with different weights and connectivity orders. Since weights
matrices are used to create spatial lags that average neighboring values, the
choice of a weights matrix will determine which neighboring values will be
averaged. For instance, since rook weights will usually have fewer neighbors
than queen weights, on average, each neighboring observation has more
influence.
The question of which weights to choose is more pertinent in the context of
modeling than ESDA since modeling is based on substantive notions of
spatial effects while ESDA prioritizes the rejection of spatial randomness.
Therefore, if there are no substantive reasons to guide the choice of weights
in ESDA, using a weights file with as few neighbors as possible (such as
rook) makes sense. Especially with irregular areal units (as opposed to
grids), the difference between rook and queen weights is often minimal.
However, it is advisable to test how sensitive your results are to your weights
specifications by comparing multiple weights matrices.
Spatial Outlier Detection
• Global outliers are observations which
appear inconsistent with the remainder of
the data set.
• Global outliers deviate so much from other
observations that it may be possible that
they were generated by a different
mechanism.
• Spatial outliers are observations that
appear inconsistent with their neighbours.
Spatial Outlier Detection
• Detecting spatial outliers has important
applications in transportation, ecology,
public safety, public health, climatology
and location based services.
• Geographic objects have a spatial
(location, shape, metric & topological
properties) & non-spatial component
(house owner, sensor id., soil type).
Spatial Outlier Detection
• Spatial neighbourhoods may be defined using
spatial attributes & spatial relations.
• Comparisons between spatially referenced
objects can be based on non-spatial attributes.
• A spatial outlier is a spatially referenced object
whose non-spatial attribute values differ from
those of other spatially referenced objects in its
spatial neighbourhood.
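One way to operationalise this definition is sketched below. It assumes a simple test: take the difference between each value and the mean of its neighbours, standardize those differences, and flag locations whose standardized difference exceeds a threshold. The function name, the threshold of 2 and the data (ten locations along a line) are hypothetical.

```python
from statistics import mean, pstdev

def spatial_outliers(values, neighbours, threshold=2.0):
    """Indices whose value differs unusually from the mean of their neighbours."""
    # Difference between each value and its neighbourhood average.
    diffs = [values[i] - mean([values[j] for j in neighbours[i]])
             for i in range(len(values))]
    mu, sigma = mean(diffs), pstdev(diffs)
    # Flag locations whose standardized difference is large in magnitude.
    return [i for i, d in enumerate(diffs) if abs((d - mu) / sigma) > threshold]

# Hypothetical non-spatial attribute along a line of ten locations.
values = [1, 2, 3, 4, 5, 6, 7, 2, 9, 10]
# Spatial neighbourhood: the locations immediately to the left and right.
neighbours = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
    for i in range(len(values))
}
flagged = spatial_outliers(values, neighbours)
```

The value 2 at index 7 is unremarkable globally (it lies within the overall range 1 to 10) but is inconsistent with its neighbours 7 and 9, so it is a spatial outlier without being a global one.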
Spatial Outlier Detection
• The upper left & lower
right quadrants of
figure 7.17 indicate a
spatial association of
dissimilar values; low
values surrounded by
high value neighbours
(P & Q) and high
values surrounded by
low values (S).
Spatial Outlier Detection
• A Moran outlier is a
point located in the
upper left or lower
right quadrant of a
Moran scatter plot.
[Figure: Moran scatter plot with z (values in a given location) on the horizontal axis and Wz on the vertical axis; quadrants Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH]
Model Evaluation
• Consider the two-class classification problem
‘nest’ or ‘no-nest’. The four possible outcomes
(or predictions) are shown on the next slide. The
desired predictions are:
– 1) where the model says there should be a nest and
there is an actual nest (True Positive)
– 2) where the model says there is no nest and there is
no nest (True Negative)
• The other outcomes are not desirable and point
to a flaw in the model.
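The four outcomes can be counted with a short sketch. It assumes the two undesired outcomes carry their usual names, false positive (nest predicted, none present) and false negative (no nest predicted, nest present). The function name and the data are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a two-class (nest / no-nest) prediction."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_nest, says_nest in zip(actual, predicted):
        if says_nest and is_nest:
            counts["TP"] += 1   # model says nest, and there is a nest
        elif not says_nest and not is_nest:
            counts["TN"] += 1   # model says no nest, and there is none
        elif says_nest and not is_nest:
            counts["FP"] += 1   # model says nest, but there is none
        else:
            counts["FN"] += 1   # model says no nest, but there is one
    return counts

# Hypothetical ground truth and model output for six locations (True = nest).
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)

# Share of locations where the prediction matches reality.
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
```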
Model Evaluation