1 Introduction
Spatial analysis is the quantitative study of phenomena that are located in space.
Spatial data analysis usually refers to an analysis of the observations in which the spatial
locations of sites are taken into account, and includes the reduction of spatial patterns to a
few clear and useful summaries.
Spatial statistics goes beyond this in that these summaries are compared with what might
be expected from theories of how the pattern might have originated and developed, i.e.,
inferential statistics. Spatial statistics thus involves the inferential level of analysis: model
building, testing, and interpretation. It is a vast subject, in large part because spatial data
come in so many different types.
Spatial data: Data that are location specific and that vary in space.
The observations may be:
- univariate or multivariate
- categorical or continuous
- real-valued (numerical) or not real-valued
- observational or experimental
The data locations may
- be points, regions, line segments, or curves
- be regularly or irregularly spaced
- be regularly or irregularly shaped
- belong to Euclidean or non-Euclidean space
The mechanism that generates the data locations may be:
- known or unknown
- random or non-random
- related or unrelated to the processes that govern the observations
Typical data
- sample of observations from the process of interest
- often very noisy, NOT independent
Three prototypes of data:
1. Geostatistical data
The components of geostatistical data are the locations, and the measurements at each
location.
e.g. Rainfall measurements in Tampere, Temperature for weather stations in Finland,
Air pollutants measurements, Soil pH in water, etc.
2. Lattice data (Areal or aggregate data)
Counts or averages of a quantity on subregions that make up a larger region.
e.g. Presence or absence of a plant species in square quadrats over a study area, number
of deaths due to SIDS in the counties of North Carolina, Pixel values from remote sensing
(satellites)
3. Spatial point patterns
e.g. Location of bird nests in a suitable habitat (evidence of territoriality), location of
lunar craters (meteor impacts or volcanism) etc.
Note that the distinction between these types is not always clear-cut; in particular, geostatistical data and lattice data have many similarities.
Spatial Structure
• Large-scale structure (Global)
- Mean function of geostatistical process
- Intensity of spatial point process
- Mean vector of lattice data
• Small-scale structure (Local)
- Variogram, covariance function of geostatistical process (and lattice process)
- Ripley’s K function, second-order intensity, nearest-neighbor functions for spatial
point process
- Neighbor weights for lattice process
Stationarity implies a constant large-scale structure and a small-scale structure that depends on the spatial locations only through their relative positions (formal definitions
will be given later).
Main objectives of Spatial Statistics
- Inference for spatial structure
- Inference for non-spatial structure
- Prediction of unobserved variables
- Design issues, such as where to take observations or how to arrange treatments in a
spatial experiment.
Temporal Statistics, Spatial Statistics, and Spatio-Temporal Statistics
The inherent difference between temporal statistics and spatial statistics is due to the
fact that time flows in one direction only, from past to present to future.
- In spatial statistics, observations are often irregularly spaced and models must be
more flexible.
- In geostatistics and lattice data analysis, observations are usually assumed to be
dependent and non-identically distributed; in particular, models usually include a
trend.
- In space, interaction regarding each observation generally occurs in all directions and
many geostatistical/lattice models incorporate omnidirectional interaction.
- In time series, prediction usually consists of extrapolating to a future time point.
In geostatistics/lattice analysis, interpolation is as important as extrapolation.
- Geostatistics and lattice data analysis are most similar to that subfield of modern
longitudinal data analysis which explicitly models the temporal correlation among
observations.
- Spatial point pattern analysis is most similar to failure time data analysis.
Spatiotemporal statistics: data are observations with identifiable and observed spatial
and temporal labels.
e.g. Earthquakes (locations random in time and space), change in locations of trees
over time, environmental monitoring of water quality, etc.
Space-time data can be modeled either as a collection of spatially correlated time series
or a collection of temporally correlated spatial random fields, lattice processes, or spatial point processes. There are many possibilities to combine spatial data types with
temporal data types and the interaction between them. We will focus on “pure” spatial
statistics in this course but occasionally we will discuss spatiotemporal extensions of
certain issues, topics, or methods.
Basic Notation and Statistical Model
- Space S, which for concreteness we assume to be Euclidean:
S = Rd where d = 1, 2, or 3.
- Study region, A ⊂ S
- Spatial data (or point) locations s1 , . . . , sn , si ∈ D, D an index set
- Observations Z(s1 ), Z(s2 ), . . . , Z(sn )
- Covariates X(s1 ), X(s2 ), . . . , X(sn )
- Model: {Z(s), s ∈ D, D ⊆ Rd }
This is a stochastic process, i.e. a collection of random variables, indexed by points or
regions in D.
Either the Z values or s values or both are random. The X values are usually assumed
to be nonrandom.
1.1 Visualization
1. Visualizing Geostatistical (point referenced) Data
The best way to visualize these data is to display them on a map, differentiating the values
of the measurements of interest by colour or size.
Example: Field observations of air pollution measurements in the northeast US.
The points are air pollution monitors; the monthly average PM2.5 concentration (µg m⁻³) is colour coded (a gradient from blue (low) to red (high)).
Alternatively, we can display the points with a size gradient (second figure): larger circles represent higher monthly average PM2.5 concentrations, smaller circles lower concentrations.
Note that the choice of colour and size gradient can lead to different conclusions!
[Figure: monthly average PM2.5 concentrations (µg m⁻³) at monitoring sites in the northeast US, colour coded; axes are Longitude and Latitude.]
• Goals of spatial statistics applied to geostatistical data
- Explore the spatial pattern in the observations. (Often called spatial “structure”).
- Quantify the spatial pattern with a function.
- Model the spatial correlation/covariance in the observations.
- Make predictions at unobserved locations: interpolation, smoothing.
Additional considerations: account for spatial structure in regression models and/or test a null hypothesis of no spatial structure.
[Figure: monthly average PM2.5 concentrations displayed with a size gradient; axes are Longitude and Latitude.]
2. Areal data
Areal units are often referenced as polygons. The centroids of the areal units may be
useful for a spatial reference, in combination with the area of the polygon. The best
way to visualize these data is to display as a map, differentiating the areal units by
colour.
Areal data (lattices) use neighbor relationships.
Examples:
- Median household income in Los Angeles neighborhoods
- State-specific (or county-, census tract-, zip code-specific) election results
- County hospital admission rates for influenza
Information collected in areal units may be census related, health related, environmental (satellite estimates of pollution, land cover).
• Goals of spatial statistics applied to areal data
- Understand the linkage between areal units.
- We want to determine spatial patterns of areal units within a region.
- If there is a spatial pattern, how strong is it? A pattern through visualization is often
subjective. Independent measurements will usually have no pattern.
[Figure: Visualizing Areal Data example maps.]
3. Point Pattern Data
A spatial point process is a stochastic mechanism that generates events in 2D.
An event is an observation (e.g. presence/absence) and the point is its location.
Mapped point pattern: Events in a study area D have been recorded.
Sampled point pattern: Events are recorded after taking samples in an area D.
Examples:
- Locations of homeless in Los Angeles
- Cases of malaria in Nairobi
- Locations of a specific tree species in a forest
If there are different categories of a point pattern, such as with the homeless data, then
these categories may be coloured separately. Often conclusions cannot be drawn from
visual inspection alone.
(Example: http://graphics.latimes.com/homeless-los-angeles-2015/)
• Goals about point pattern data:
Model some spatial pattern and determine if our observed point pattern fits this model.
Measure of intensity: mean number of events per unit area
Questions we would like to answer:
- Is there a regular pattern in the points?
- Is there clustering of the points?
- Can we define a point process that our events follow?
- Is there an underlying population distribution from which events arise in a region?
4. Spatio-temporal data
All three types of data we have described may be referenced in space and in time. That
is, data that are location specific can have replicates in time:
- Each observation has a location, time and value
Geostatistical: Relationship between daily air pollution measured at discrete locations
in the US Northeast and hospital admissions.
Areal: Examining birth rates from year to year in US states.
[Figure: crude birth rates by state based on equal-interval cut points. From Monmonier, M., Lying with Maps, Statistical Science 2005, 20(3), 215–222.]
Point process: Changes in spatial clustering of homeless individuals from 2015 to 2016.
2 Geostatistics
The (stochastic) process varies continuously over the space, but data is measured only at
discrete locations.
- Process (random field) {Z(s), s ∈ D, D ⊆ R^d}
- Observations: z1 = Z(s1 ), z2 = Z(s2 ), . . . , zn = Z(sn )
First law of geography: Nearby quantities tend to be more alike than those far apart
The usual model for many kinds of data is
Datum = Mean + Residual
In a geostatistical context, the basic model takes the form
Z(s) = m(s) + ε(s)   (i.e. large-scale variation + small-scale variation)
     = m(s) + W(s) (smooth) + δ(s) (white noise)
     = signal + noise
where m(s) ≡ E[Z(s)] is the mean function, which is usually a nonrandom quantity. When
we specify the distribution of ε(s) sufficiently, the distribution of {Z(s), s ∈ D} will be
specified.
However, random sampling assumptions are generally not appropriate: geostatistical data generally represent an incomplete sampling of a single realization. Some further
assumption about Z(·) must be made for inference to be possible, and one such assumption
is stationarity (to be discussed later in detail).
2.1 Exploratory Data Analysis
1. Non-spatial summaries
- Numerical summaries: Mean, median, standard deviation, range, etc.
- Graphic tools: stem-and-leaf, box plots, etc.
2. Descriptive statistics for spatial information
a) Methods mainly to explore large-scale variation:
• Plot of Zi versus each marginal coordinate
• Plot of mean or median of Zi versus row index or column index (data locations on a
regular grid)
• 2-D or 3-D scatterplots: a plot of Zi vs. data location (for d = 3)
• Indicator maps: assign each data point to one of only two classes using two symbols
• contour plots, greyscale maps, proportional symbol maps
• Spatial moving averages: estimation by averaging the values at neighboring sampled
data points
• Nonparametric smoothing: kernel estimation (Bailey and Gatrell, section 2.3.2),
LOESS (locally weighted polynomial regression)
• Mean or median polish
- Requires a rectangular grid, say p × q
- Decomposes data: data = overall + row effect + col effect + residuals (i.e. removes
some trend, large scale variation)
- Alternately subtract row means (medians) and column means and accumulate these
in extra cells. Repeat this procedure until another iteration produces virtually no
change.
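To make the procedure concrete, here is a minimal Python sketch of median polish on a complete p × q grid (the data and the fixed number of sweeps are illustrative; a real implementation would iterate until the decomposition stabilizes):

```python
import numpy as np

def median_polish(z, n_sweeps=10):
    """Decompose a p x q grid as overall + row effects + column effects + residuals
    by alternately sweeping out row and column medians."""
    resid = np.asarray(z, dtype=float).copy()
    p, q = resid.shape
    overall, row_eff, col_eff = 0.0, np.zeros(p), np.zeros(q)
    for _ in range(n_sweeps):
        rmed = np.median(resid, axis=1)   # subtract row medians...
        resid -= rmed[:, None]
        row_eff += rmed                   # ...and accumulate them
        m = np.median(row_eff)            # move the median of the row effects
        row_eff -= m                      # into the overall term
        overall += m
        cmed = np.median(resid, axis=0)   # same for columns
        resid -= cmed[None, :]
        col_eff += cmed
        m = np.median(col_eff)
        col_eff -= m
        overall += m
    return overall, row_eff, col_eff, resid

# hypothetical 3 x 4 grid of observations
z = [[6.0, 9.0, 6.0, 5.0],
     [7.0, 4.0, 4.0, 2.0],
     [2.0, 5.0, 8.0, 3.0]]
overall, rows, cols, res = median_polish(z)
print(overall, rows, cols, res, sep="\n")
```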
b) Methods to explore small-scale variation:
• h-scatterplots (or same-lag scatterplots)
- Methods to explore dependence
- Requires regular spacing between data locations
- for a fixed vector e of unit length and a scalar h, plot Z(si + he) vs. Z(si ) for all i
- May reveal direction of dependence, outliers or the existence of nonstationarity in
the mean and/or variance
• 3-D plot of standard deviation vs. spatial location, computed from a moving window
• Scatterplot of standard deviation vs. mean, computed from a moving window
• Semivariogram cloud
- Plot (Z(sᵢ) − Z(sⱼ))² or |Z(sᵢ) − Z(sⱼ)|^{1/2} vs. ‖sᵢ − sⱼ‖ for all possible pairs of observations
- Note that this implicitly assumes some kind of stationarity
e.g. Coal Ash Data (Cressie)
The data contains 208 coal ash core samples collected on a grid.
Suppose X = % coal ash, Y1 = % coal ash of the nearest neighbor to the East, and Y2 = % coal ash of the second nearest neighbor to the East.
Let D1² = (X − Y1)², D2² = (X − Y2)², etc.
Make boxplots of D1², D2², · · · and put them side by side.
D1² small ⇒
D1² large ⇒
• Empirical (or sample, or experimental) semivariogram (Matheron, 1962)
(Assume that large scale variation for Z(·) is removed or ignorable for now.)
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|}\sum_{N(h)}\{Z(s_i) - Z(s_j)\}^2$$
where
$$N(h) = \{(s_i, s_j) : s_i - s_j = h;\ i, j = 1, 2, \ldots, n\}$$
and |N(h)| is the number of distinct pairs in N(h).
• Sample covariance function
The usual estimator is
$$\hat{C}(h) = \frac{1}{|N(h)|}\sum_{N(h)}(Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z})$$
which is the spatial generalization of the sample autocovariance function used in time
series analysis. (This will be discussed more in depth later.)
2.2 Models
Stationarity
a) Strict stationarity
- requires that the joint probability distribution of the data depends only on the relative positions of
the sites at which the data were taken.
b) Second-order stationarity
i) the variate's mean is constant;
ii) the covariance between variates at two sites depends only on the sites' relative positions:
C(s, t) = C(s + h, t + h) for all h.
c) Intrinsic stationarity
i) the mean is constant: E[Z(s)] = µ for all s ∈ D;
ii) ½ Var[Z(s) − Z(t)] depends only on the lag difference s − t for all s, t ∈ D.
Trend surface (Mean functions)
The first requirement for stationarity (that the spatial variate have constant mean) does
not seem reasonable in many cases. What seems more reasonable is that sites close to one
another should have similar means, but sites far apart need not. This kind of local stationarity
rather than global stationarity leads to the postulation of a continuous, relatively smooth
but nonconstant function for the mean.
- The conventional multiple regression model:
Z(s) = X(s)β + ε(s)
- A very useful class of mean functions are the polynomials:
e.g. m(x, y) = β0 + β1 x + β2 y
- Another kind of continuous (but less smooth) function is the surface that results from
performing a median polish.
- An alternative to a parametric approach to modeling the mean function is a nonparametric approach using splines or LOESS or a kernel estimator.
Recall from the model above that if we assume the distribution of {ε(s), s ∈ D} is a Gaussian
process, then the distribution of {Z(s), s ∈ D} is completely specified. The convention
in geostatistics is that the distribution of {Z(s), s ∈ D} is specified through its covariance
function, as a function of the coordinates of the two corresponding sites.
Covariance functions
The function needs to satisfy the following properties:
a) Evenness
C(h) = C(−h) for all h
b) Nonnegative definiteness
$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j C(s_i - s_j) \ge 0$$
for all n, all sequences {aᵢ, i = 1, . . . , n}, and all sequences of spatial locations {sᵢ, i = 1, . . . , n}.
a) and b) ⇒ C(0) ≥ 0 and |C(h)| ≤ C(0) for all h.
Bochner’s theorem: a function is nonnegative definite iff (if and only if) it is the Fourier
transform of a positive Borel measure.
Isotropy and Anisotropy
A stationary covariance function is called isotropic if the covariance between any two
values depends only on the Euclidean distance ‖s − t‖ between the locations, i.e., C(h) = C(‖h‖).
When the covariance depends on direction as well, it is called anisotropic.
Isotropic, parametric (valid) covariance function models
Let r = ‖h‖ for convenience.
• Tent (triangular, piecewise linear) model (valid in R¹ only)
$$C(r; \theta) = \begin{cases} \theta_1(1 - r/\theta_2) & \text{for } 0 \le r \le \theta_2 \\ 0 & \text{for } \theta_2 < r \end{cases}$$
• Spherical model
$$C(r; \theta) = \begin{cases} \theta_1\left(1 - \dfrac{3r}{2\theta_2} + \dfrac{r^3}{2\theta_2^3}\right) & \text{for } 0 \le r \le \theta_2 \\ 0 & \text{for } \theta_2 < r \end{cases}$$
• Exponential model
$$C(r; \theta) = \theta_1 \exp(-\theta_2 r), \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Gaussian model
$$C(r; \theta) = \theta_1 \exp(-\theta_2 r^2), \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Rational quadratic model
$$C(r; \theta) = \theta_1\left(\theta_2 - \frac{r^2}{1 + r^2/\theta_2}\right), \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Matérn class of models
$$C(r; \theta) = \frac{\theta_1}{2^{\theta_3 - 1}\Gamma(\theta_3)}\left(\frac{2\theta_3^{1/2}\, r}{\theta_2}\right)^{\theta_3} K_{\theta_3}\!\left(\frac{2\theta_3^{1/2}\, r}{\theta_2}\right), \qquad \theta_1 \ge 0,\ \theta_2 \ge 0,\ \theta_3 > 0$$
where $K_{\theta_3}$ is the modified Bessel function of the third kind of order θ₃.
• Cosine model
$$C(r; \theta) = \theta_1 \cos(r/\theta_2), \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Wave or hole-effect model
$$C(r; \theta) = \theta_1 \theta_2\,\frac{\sin(r/\theta_2)}{r}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
Note that we can construct more complicated models using the following rules:
- If C₁(·) and C₂(·) are valid covariance functions in R^d, then so is C(·) ≡ C₁(·) + C₂(·).
- If C₀(·) is a valid covariance function in R^d and b > 0, then C(·) ≡ bC₀(·) is a valid covariance function in R^d.
- If C₁(·) and C₂(·) are valid covariance functions in R^{d₁} and R^{d₂} respectively, then C(·) ≡ C₁(·)C₂(·) is a valid covariance function in R^{d₁+d₂}.
- A valid isotropic covariance function in R^{d₁} may not be a valid isotropic covariance function in R^{d₂}, where d₂ > d₁; however, the converse is true. With the exception of
the tent model, all the models listed above are valid in R² and R³.
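For illustration, a small Python sketch of three of the models above and of the sum rule (scipy is assumed for the Bessel function K_ν; parameter values are arbitrary):

```python
import numpy as np
from scipy.special import gamma, kv   # kv(nu, x) = modified Bessel K_nu

def c_exponential(r, t1, t2):
    return t1 * np.exp(-t2 * r)

def c_spherical(r, t1, t2):
    r = np.asarray(r, dtype=float)
    inside = t1 * (1 - 1.5 * r / t2 + 0.5 * (r / t2) ** 3)
    return np.where(r <= t2, inside, 0.0)

def c_matern(r, t1, t2, t3):
    r = np.asarray(r, dtype=float)
    u = 2.0 * np.sqrt(t3) * r / t2
    with np.errstate(invalid="ignore"):          # K_nu diverges at u = 0
        c = t1 / (2 ** (t3 - 1) * gamma(t3)) * u ** t3 * kv(t3, u)
    return np.where(r > 0, c, t1)                # C(0) = theta_1

# a sum of valid models is again a valid covariance function
r = np.linspace(0.0, 3.0, 7)
print(c_exponential(r, 1.0, 2.0) + 0.5 * c_spherical(r, 1.0, 1.5))
print(c_matern(r, 1.0, 1.0, 0.5))                # theta_3 = 1/2: exponential shape
```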
Semivariogram
Traditionally, geostatistical practitioners have adopted a slightly more general kind of stationarity assumption (intrinsic stationarity) than second-order stationarity, and they modeled the small-scale dependence through a function (semivariogram) somewhat different than
the covariance function.
$$\gamma(s - t) = \frac{1}{2}\,\mathrm{Var}[Z(s) - Z(t)].$$
The function 2γ(·) is called the variogram. When the process is intrinsically stationary, it
can also be expressed as
$$\gamma(h) = \frac{1}{2}\,E[Z(s) - Z(t)]^2, \quad \text{where } h = s - t.$$
A second-order stationary random process with covariance function C(·) is intrinsically stationary, with semivariogram
γ(h) = C(0) − C(h)
but the converse is not true in general. That is, there exist processes that are intrinsically
stationary but not second-order stationary.
The semivariogram must satisfy the following properties:
a) It vanishes at 0, i.e. γ(0) = 0
b) Evenness
c) It needs to be conditionally negative-definite; that is, it must satisfy
$$\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j \gamma(s_i - s_j) \le 0$$
for each set of locations s₁, . . . , sₙ and all λ₁, . . . , λₙ such that $\sum_{i=1}^{n} \lambda_i = 0$.
d) $\lim_{\|h\|\to\infty} \{\gamma(h)/\|h\|^2\} = 0$
Attributes of the semivariogram
• Nugget effect: microscale variability
• Sill (= partial sill + nugget effect)
• Range or effective range
The range of an isotropic semivariogram (or covariance function) is defined as the
distance beyond which the correlation is equal to 0. Of the models listed, only the tent
and spherical models have a range (equal to θ₂). For isotropic models that do
not have a range, the effective range, if one exists, is defined as the distance beyond which
the covariance does not exceed 5% of C(0) (the partial sill); equivalently, the distance at which the semivariogram reaches 95% of its sill. The exponential,
Gaussian, rational quadratic, and Matérn models all have effective ranges; the
cosine model does not.
• Slope
Examples of valid isotropic semivariogram models
• Tent (valid in R¹ only)
$$\gamma(r; \theta) = \begin{cases} \theta_1 r/\theta_2 & \text{for } 0 \le r \le \theta_2 \\ \theta_1 & \text{for } \theta_2 < r \end{cases}$$
• Linear
$$\gamma(r; \theta_1) = \theta_1 r, \qquad \theta_1 \ge 0$$
• Power
$$\gamma(r; \theta) = \theta_1 r^{\theta_2}, \qquad \theta_1 \ge 0,\ 0 \le \theta_2 < 2$$
• Spherical
$$\gamma(r; \theta) = \begin{cases} \theta_1\left(\dfrac{3r}{2\theta_2} - \dfrac{r^3}{2\theta_2^3}\right) & \text{for } 0 \le r \le \theta_2 \\ \theta_1 & \text{for } \theta_2 < r \end{cases}$$
• Exponential
$$\gamma(r; \theta) = \theta_1\{1 - \exp(-\theta_2 r)\}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Gaussian model
$$\gamma(r; \theta) = \theta_1\{1 - \exp(-\theta_2 r^2)\}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Rational quadratic model
$$\gamma(r; \theta) = \theta_1\,\frac{r^2}{1 + r^2/\theta_2}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Cosine model
$$\gamma(r; \theta) = \theta_1\{1 - \cos(r/\theta_2)\}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Wave or hole-effect model
$$\gamma(r; \theta) = \theta_1\left\{1 - \theta_2\,\frac{\sin(r/\theta_2)}{r}\right\}, \qquad \theta_1 \ge 0,\ \theta_2 \ge 0$$
• Matérn class of models
$$\gamma(r; \theta) = \theta_1\left[1 - \frac{1}{2^{\theta_3 - 1}\Gamma(\theta_3)}\left(\frac{2\theta_3^{1/2}\, r}{\theta_2}\right)^{\theta_3} K_{\theta_3}\!\left(\frac{2\theta_3^{1/2}\, r}{\theta_2}\right)\right], \qquad \theta_1 \ge 0,\ \theta_2 \ge 0,\ \theta_3 > 0$$
- The exponential model is a special case of the Matérn model with θ₃ = 1/2; the Gaussian
model is the limiting case of the Matérn model as θ₃ → ∞.
[Figure: example plots of the spherical, exponential, Gaussian, and power semivariograms as functions of the lag distance h (the power model shown for ω = 0.3, 1, 1.5).]
Modeling anisotropy
a) Range anisotropy
- Most often seen in practice (sill and nugget are the same).
- Geometric anisotropy is easiest to model. Any valid isotropic model can be generalized
to make it geometrically anisotropic.
e.g.
$$C(h; \theta) = \theta_1 \exp[-\theta_2(h_1^2 + 2\theta_3 h_1 h_2 + \theta_4 h_2^2)^{1/2}]$$
b) Sill anisotropy
- Either the assumption of second-order stationarity is violated or there are measurement
errors which are correlated or do not have mean zero.
c) Nugget anisotropy
- Can be caused by correlated measurement errors.
- Typically occurs in one direction only.
d) Slope anisotropy
- Can be dealt with in a similar fashion to geometric anisotropy.
Other types of anisotropy
i) Geometric anisotropy: a covariance function is geometrically anisotropic if a positive
definite matrix A exists such that
$$C(h) = C([h'Ah]^{1/2}) \quad \text{for all } h$$
ii) Zonal anisotropy
Estimation of C(·) and γ(·) (revisited)
• Empirical (or sample, or experimental) semivariogram
For a given sample from a realization of Z(·), where the mean function is taken to
be constant, the empirical semivariogram is the unbiased estimator of an isotropic
semivariogram given by
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|}\sum_{s_i - s_j = h}\{Z(s_i) - Z(s_j)\}^2$$
When a non-constant trend is assumed, the sample semivariogram is computed from
the residuals:
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|}\sum_{s_i - s_j = h}\{\hat{\varepsilon}(s_i) - \hat{\varepsilon}(s_j)\}^2$$
where ε̂(sᵢ) = Z(sᵢ) − m(sᵢ; β̂), i = 1, . . . , n.
This estimator is unbiased for the semivariogram (assuming the correct mean function
has been adopted): method of moments type estimator.
When data locations are irregularly spaced, we partition the lag space H = {(s − t) :
s, t ∈ D} into lag classes or windows H₁, . . . , Hₖ, say, and assign each lag in the data
set to one of these classes. For non-regularly spaced data the estimator is then only approximately unbiased, because the grouping (binning) of lags into classes causes a blurring
effect.
We need to replace 'sᵢ − sⱼ = h' with 'sᵢ − sⱼ ∈ T(h)', where T(h) is a tolerance region
about h:
$$\Rightarrow\ \hat{\gamma}(h_l) = \tfrac{1}{2}\,\mathrm{AVG}\{[Z(s_i) - Z(s_j)]^2 : s_i - s_j \in T(h_l)\}$$
Two main types of partitions:
1. Polar partitioning, i.e. angle and distance classes
2. Rectangular partitioning
Rules of thumb to be considered (Journel and Huijbregts, 1978):
i) Empirical semivariogram should be considered only for distances for which the number
of pairs is greater than (about) 30.
ii) The distance of reliability is half the maximum distance over the field of data.
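A minimal sketch of the binned empirical semivariogram for irregularly spaced data, with distance classes as the tolerance regions T(hₗ) and the two rules of thumb above built in (bin choices and data are illustrative):

```python
import numpy as np

def empirical_semivariogram(coords, z, n_bins=10, max_frac=0.5):
    """Isotropic empirical semivariogram with distance-class binning.
    Uses only lags up to max_frac * (maximum distance), per the rule of thumb."""
    coords, z = np.asarray(coords, float), np.asarray(z, float)
    i, j = np.triu_indices(len(z), k=1)            # all distinct pairs
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (z[i] - z[j]) ** 2
    edges = np.linspace(0.0, max_frac * d.max(), n_bins + 1)
    h, gamma_hat, npairs = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (d > lo) & (d <= hi)
        if in_bin.sum() >= 30:                     # rule of thumb: >= 30 pairs
            h.append(d[in_bin].mean())
            gamma_hat.append(0.5 * sqdiff[in_bin].mean())
            npairs.append(in_bin.sum())
    return np.array(h), np.array(gamma_hat), np.array(npairs)

# usage with hypothetical irregularly spaced data
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
z = np.sin(coords[:, 0]) + rng.normal(0, 0.3, 200)
h, g, n = empirical_semivariogram(coords, z)
print(np.c_[h, g, n])
```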
• Robust semivariogram estimators
- Cressie and Hawkins (1984):
$$\bar{\gamma}(h) = \frac{\left\{\frac{1}{|N(h)|}\sum_{N(h)} |\hat{\varepsilon}(s_i) - \hat{\varepsilon}(s_j)|^{1/2}\right\}^4}{2\,\{0.457 + 0.494/|N(h)|\}}$$
- Genton 1998
• Sample covariance function
Recall that the estimator is given by
$$\hat{C}(h) = \frac{1}{|N(h)|}\sum_{N(h)}(Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z})$$
This estimator is biased even for regularly-spaced data and is meaningful only if the
process is second-order stationary.
NOTE: γ̂(h) ≠ Ĉ(0) − Ĉ(h).
Isotropy means that the semivariance depends only on the distance between points, not on direction; anisotropy means the semivariance depends on direction as well as distance. One can examine anisotropy with a directional semivariogram.
• Correlation function (Correlogram)
ρ(h) = C(h)/C(0)
• Checking for isotropy
- Superimposition of directional sample semivariogram
- Rose diagram: consists of smoothing the directional sample semivariograms, then
connecting with a smooth curve, in the lag space, those lag vectors h for which the
smoothed semivariograms are roughly equal. In effect, this plots estimated isocorrelation contours (in the case of a second-order stationary process).
[Figure: directional sample semivariograms at 0°, 45°, 90°, and 135° plotted against distance h.]
2.3 Estimation for geostatistical models
In summary, the general (or classical) model we use for our analysis of geostatistical data is
Z(s) = m(s; β) + ε(s)
where m(·; β) is a specified family of continuous functions, β is a vector of unknown parameters, {ε(s) : s ∈ D} is an intrinsically (or second-order) stationary process with mean zero
and semivariogram γ(·; θ) (or covariance function C(·; θ)), and θ is a vector of unknown
parameters.
Overview of the geostatistical method:
i) Using exploratory techniques, prior knowledge, etc., set up an appropriate model
(e.g. the model given above) with assumptions on the mean function and the stationarity of the
process that generated the data.
ii) Estimate β for the mean function (if it is not assumed to be constant): β̂ (e.g. by
ordinary least squares or median polish).
iii) Obtain the fitted residuals ε̂(sᵢ) = Z(sᵢ) − m(sᵢ; β̂) and compute the empirical semivariogram of the residuals.
iv) Select a valid semivariogram model that is compatible with the plot from the previous
step. Fit the chosen model to the empirical semivariogram to estimate the model's parameters.
v) Using the fitted semivariogram model, re-estimate β by generalized least squares (or
some other method which accounts for correlation among observations).
vi) Repeat steps iii)–v) if needed.
vii) Predict ('krige') unobserved values at sites (or over regions) of interest and estimate the corresponding variances of prediction error. Determine optimal locations at which to take
additional observations, and repeat the above steps if needed.
Semivariogram Model Fitting
Although the empirical semivariogram is unbiased for the semivariogram, it may not be
conditionally negative-definite. Neither the sample semivariogram nor the sample covariance function can
be used directly for statistical inference, e.g., spatial prediction (kriging).
⇒ Fit a valid semivariogram model to the sample semivariogram
Methods of fitting
i) By inspection (by eye).
ii) Ordinary nonlinear least squares (OLS):
$$\min_{\theta}\ \sum_h [\hat{\gamma}(h) - \gamma(h; \theta)]^2$$
Note, however, that the semivariogram estimates are correlated!
iii) Weighted nonlinear least squares (WNLS) (Cressie, 1985):
a weighted nonlinear estimator of γ(h; θ) is defined as a value θ̂ that minimizes the
weighted residual sum of squares
$$\min_{\theta}\ \sum_h |N(h)|\,\frac{[\hat{\gamma}(h) - \gamma(h; \theta)]^2}{[\gamma(h; \theta)]^2}$$
Note that the nonparametric estimates at large lags tend to receive relatively less weight (see the sketch after this list).
iv) Generalized nonlinear least squares (GLS):
$$\min_{\theta}\ [\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}(\theta)]'\,[\widehat{\mathrm{Var}}(\hat{\boldsymbol{\gamma}})]^{-1}\,[\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}(\theta)]$$
- Derivation and calculation of Var(γ̂)?
v) Maximum likelihood (ML) / restricted maximum likelihood (REML):
assuming normality for a model Z = Xβ + ε,
$$L(\beta, \theta; Z) = -\frac{1}{2}\log|V| - \frac{1}{2}(Z - X\beta)'V^{-1}(Z - X\beta)$$
where V = V(θ) denotes the covariance matrix of Z = (Z₁, . . . , Zₙ)′ and X is the model
matrix for covariates.
- Estimates θ and β simultaneously by finding the values that maximize L(β, θ).
- Applicable to processes with second-order stationary errors only.
- The restricted MLE (REML estimator) maximizes the log-likelihood function associated
with n − rank(X) linearly independent error contrasts. It is known to be less biased than
the MLE and is thus often preferred, especially when rank(X) is appreciable relative to n.
Model Selection Procedures
- Visual inspection of semivariogram plot
- Minimized weighted (or generalized) residual sum of squares function
- Maximized log-likelihood (restricted log-likelihood) function
- Penalized likelihood criteria
e.g., Akaike’s criterion AIC = L(β̂, θ̂)− no. of estimated parameters
Estimating the large-scale variation
If the mean function m(s; β) is a linear or nonlinear function of the elements of β, then
linear or nonlinear least squares can be used to fit the model to the data. This is called trend
surface analysis.
This approach is quite easy to implement owing to the wide availability of computing software
(e.g. PROC REG in SAS or lm in S-PLUS).
Other approaches:
• Median Polish
The mean function is taken to be m(xl , yk ; β) = a + rk + cl
• Locally weighted least squares (LOESS)
- Only assumes that the mean function is smooth.
- Estimates the smooth trend in a moving fashion by fitting a site-specific first-order
or second-order polynomial to only the most proximate data to a site.
- Fits using weighted least squares with weights inversely related to distance from the
site.
• Kernel estimator
It is a type of local smoother which calculates a weighted average of observations near
a target point s:
$$\frac{1}{b^2}\sum_{i=1}^{n} k\!\left(\frac{s - s_i}{b}\right) z_i$$
where k(·) is called a kernel function, or simply a kernel, satisfying some moment conditions (e.g. a quadratic or uniform kernel), and b is the bandwidth.
• Smoothing splines
It is an estimator which minimizes a functional criterion (penalized residual sum of
squares) to fit the data well and at the same time has some degree of smoothness.
Spatial Regression
i) Generalized least squares (GLS) with known covariance matrix
Model:
$$Z = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = V(\theta)$$
where V = V(θ) is a completely specified positive definite matrix.
- GLS estimator of β: $\hat{\beta}_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}Z$
ii) Estimated generalized least squares (EGLS)
In practice, the true value of θ, and consequently V, is rarely known and completely
specified. A natural way to deal with this problem is to replace θ in the evaluation of V by
an estimator θ̂, thereby obtaining V̂.
- EGLS estimator of β: $\hat{\beta}_{EGLS} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}Z$
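A numpy sketch of the GLS and EGLS formulas; the exponential covariance, the planar trend, and the plugged-in θ̂ are all illustrative assumptions:

```python
import numpy as np

def gls(X, Z, V):
    # beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} Z, via solves instead of inverses
    Vi_X = np.linalg.solve(V, X)
    Vi_Z = np.linalg.solve(V, Z)
    return np.linalg.solve(X.T @ Vi_X, X.T @ Vi_Z)

def exp_cov_matrix(coords, t1, t2):
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return t1 * np.exp(-t2 * d)

# EGLS: plug an estimate theta_hat (e.g. from the WNLS fit above) into V
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(50, 2))
X = np.column_stack([np.ones(50), coords])    # planar trend m(s) = b0 + b1 x + b2 y
Z = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(0, 0.5, 50)
theta_hat = (1.0, 0.7)                        # assumed covariance estimate
print(gls(X, Z, exp_cov_matrix(coords, *theta_hat)))   # beta_EGLS
```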
Example:
*** Mean structure or Covariance structure?
The issue that was mentioned previously is that in practice, a decomposition of the data
into large-scale and small-scale variation is not so clearcut. This problem is often addressed
as follows (Statistics for Spatial Data, Cressie): “One man’s mean structure is another man’s
covariance structure.”
If replications of a spatial process are available, statistical procedures exist for distinguishing between the two structures. In practice, however, geostatistical data are not usually
replicated, so we must settle for plausibility rather than a high degree of certainty.
2.4 Spatial Prediction (Kriging)
Goal: Predict a value of Z(s) at s0 (an arbitrary location in D)
Spatial prediction usually refers to ‘interpolating’ a value rather than extrapolation for
a random spatial process. The main idea relies on a form of weighted averaging in which
the weights are chosen such that the error associated with the predictor is less than for any
other linear sum.
The terminology ’kriging’ is from D.G. Krige, a South African mining engineer who in
the 1950’s developed empirical methods for predicting ore grades at unsampled locations
using the known grades of ore sampled at nearby sites.
For kriging,
i) First choose a parametric model for the semivariogram or covariance function.
ii) Estimate the semivariogram (covariance) parameters.
iii) Make predictions and uncertainty estimates given the parameter estimates.
The types of kriging:
a. Simple kriging: assumes a constant, known mean; it is not often used because in practice the expected value must typically be estimated, which requires the unbiasedness constraint in the kriging equations.
b. Ordinary kriging: assumes a constant unknown mean (the mean needs to be estimated).
c. Universal kriging: assumes a trend in x and y, and may include other spatially varying
covariates.
1. Ordinary Kriging (O.K.) by D.G. Krige
Basic assumptions:
i) The mean function is assumed to be constant.
ii) The semivariogram is assumed to be known.
Restrictions to obtain an ordinary kriging predictor:
i) It is a linear combination of the data values.
ii) It is unbiased.
iii) It minimizes the variance of prediction error among all functions satisfying the above
2 properties.
⇒
$$\min_{\lambda}\ \mathrm{Var}\!\left[\sum_{i=1}^{n}\lambda_i Z(s_i) - Z(s_0)\right] \quad \text{subject to} \quad \sum_{i=1}^{n}\lambda_i = 1$$
Then
$$\hat{Z}(s_0) = \sum_{i=1}^{n}\lambda_i Z(s_i), \qquad \text{with } E[\hat{Z}(s_0)] = \mu.$$
Kriging gives us the best linear unbiased predictor (BLUP) at any new location s0 .
With the method of Lagrange multiplier (from Calculus), it is shown that the optimal
coefficients λ1 , . . . , λn are the first n elements of the vector λo that satisfies the following
system of linear equations, known as the ordinary kriging equations:
$$\Gamma_o \lambda_o = \gamma_o$$
where
λ_o = (λ₁, . . . , λₙ, m)′,
γ_o = [γ(s₁ − s₀), . . . , γ(sₙ − s₀), 1]′,
and Γ_o is the symmetric (n + 1) × (n + 1) matrix with entries
γ(sᵢ − sⱼ) for i = 1, . . . , n; j = 1, . . . , n,
1 for i = n + 1; j = 1, . . . , n (and for j = n + 1; i = 1, . . . , n),
0 for i = n + 1; j = n + 1,
where m is a Lagrange multiplier.
The minimized variance, called the kriging variance, is given by
$$\sigma^2_{OK}(s_0) = \sum_{i=1}^{n}\lambda_i\gamma(s_i - s_0) + m = \lambda_o'\gamma_o.$$
Example:
[Figure: six sampled sites (points) and the prediction site s₀ in the (x, y) plane, x ∈ [0, 4], y ∈ [0, 3].]
Take γ(‖h‖) = 1 − exp(−‖h‖/2). Then
$$\gamma_o = [1 - e^{-\sqrt{5}/2},\ 1 - e^{-1/2},\ 1 - e^{-1},\ 1 - e^{-\sqrt{2}/2},\ 1 - e^{-1},\ 1 - e^{-\sqrt{2}/2},\ 1]'$$
and Γ_o is the corresponding 7 × 7 matrix of pairwise semivariogram values (bordered by 1's and a 0). Solving the kriging equations,
$$\lambda_o = \Gamma_o^{-1}\gamma_o = (0.017,\ 0.422,\ 0.065,\ 0.218,\ 0.031,\ 0.246,\ 0.004)'$$
so the kriging weights sum to one (the last entry is the Lagrange multiplier m), and
$$\sigma^2_{OK}(s_0) = \lambda_o'\gamma_o = 0.478.$$
Alternative expressions for the ordinary kriging predictor and prediction variance which do
not involve the unknown Lagrange multiplier are given below.
Define
λ = (λ₁, . . . , λₙ)′, γ = (γ(s₁ − s₀), . . . , γ(sₙ − s₀))′, Γ = {γ(sᵢ − sⱼ)}.
Then it can be shown that
$$m = -\frac{1 - \mathbf{1}'\Gamma^{-1}\gamma}{\mathbf{1}'\Gamma^{-1}\mathbf{1}}, \qquad \lambda = \Gamma^{-1}\left[\gamma + \mathbf{1}\,\frac{1 - \mathbf{1}'\Gamma^{-1}\gamma}{\mathbf{1}'\Gamma^{-1}\mathbf{1}}\right]$$
So the OK predictor can be obtained as
$$\hat{Z}(s_0) = \left[\gamma + \mathbf{1}\,\frac{1 - \mathbf{1}'\Gamma^{-1}\gamma}{\mathbf{1}'\Gamma^{-1}\mathbf{1}}\right]'\Gamma^{-1}Z$$
and the kriging variance is
$$\sigma^2_{OK}(s_0) = \gamma'\Gamma^{-1}\gamma - \frac{(1 - \mathbf{1}'\Gamma^{-1}\gamma)^2}{\mathbf{1}'\Gamma^{-1}\mathbf{1}}$$
A 100(1 − α)% prediction interval for Z(s₀), assuming the random field is Gaussian, is
$$\hat{Z}(s_0) \pm z_{\alpha/2}\,\sigma_{OK}(s_0)$$
where z_{α/2} is the upper α/2 percentage point of the standard normal distribution.
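A sketch of ordinary kriging based on the Lagrange-multiplier-free expressions above; the semivariogram model and the data are illustrative (not the worked example, whose coordinates are not reproduced here):

```python
import numpy as np

def ok_predict(coords, z, s0, gamma, return_var=True):
    """Ordinary kriging at a single site s0, given a semivariogram function gamma(r)."""
    n = len(z)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    G = gamma(d)                                    # Gamma = {gamma(s_i - s_j)}
    g = gamma(np.linalg.norm(coords - s0, axis=1))  # gamma vector for s0
    ones = np.ones(n)
    Gi_g = np.linalg.solve(G, g)
    Gi_1 = np.linalg.solve(G, ones)
    c = (1.0 - ones @ Gi_g) / (ones @ Gi_1)         # (1 - 1'G^{-1}g) / (1'G^{-1}1)
    lam = Gi_g + c * Gi_1                           # kriging weights (sum to 1)
    zhat = lam @ z
    if not return_var:
        return zhat
    var = g @ Gi_g - (1.0 - ones @ Gi_g) ** 2 / (ones @ Gi_1)
    return zhat, var

# usage with the semivariogram from the worked example, hypothetical data
gamma = lambda r: 1.0 - np.exp(-r / 2.0)
coords = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 2.0]])
z = np.array([1.2, 0.8, 1.0, 1.5])
print(ok_predict(coords, z, np.array([1.5, 1.0]), gamma))
```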
Remarks: Ordinary kriging is derived under the assumption of a constant mean. This assumption will be relaxed later in the discussion of universal kriging.
It is also derived under the assumption that the semivariogram is known. In practice, it is
rarely known and must be estimated, and the estimate γ̂(·) replaces γ(·) in the kriging equations
and kriging variance. It should be noted, however, that the resulting estimated kriging variance tends
to underestimate the prediction error variance of the estimated OK predictor, because it does
not account for the estimation error incurred in estimating θ.
Example (continued):
Suppose we wish to minimize the kriging variance at s0 (a new site inside the sampling
configuration), and we have sufficient resources to take an observation at any one of the
remaining unsampled sites (excluding s0 ).
Kriging variances at s0 corresponding to the addition of each of the sites, A,B,C and D are
Cross Validation
It is a method of evaluating the aptness of a spatial correlation model using only data
from the sample. It can be used for evaluating choices of search radius, lag tolerance, etc.
Procedure:
i) For location si , omit zi from the data set temporarily.
ii) Estimate Z(si ) = zi from the remaining points and call it ẑ−i .
iii) Compare the estimate ẑ−i to zi .
iv) Repeat the above steps for all points i = 1, . . . , n in the sample.
v) Compute the summary statistics and graphs of the cross-validation error distribution.
Summary statistics:
1. Average of the prediction sum of squares (PRESS):
$$\frac{1}{n}\sum_{i=1}^{n}(z_i - \hat{z}_{-i})^2$$
where ẑ₋ᵢ indicates the prediction of zᵢ from the rest of the data.
2. Mean of standardized PRESS residuals:
$$\frac{1}{n}\sum_{i=1}^{n}(z_i - \hat{z}_{-i})/\hat{\sigma}_{-i}$$
where σ̂²₋ᵢ is the mean squared prediction error for predicting zᵢ from the rest.
3. Root mean squared standardized prediction residuals:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{z_i - \hat{z}_{-i}}{\hat{\sigma}_{-i}}\right)^2}$$
4. Histograms, scatterplots, or maps of PRESS residuals or standardized PRESS residuals.
Cautions: the model that appears best may depend on which summary statistic you use.
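A sketch of the leave-one-out procedure, reusing the ok_predict function (and the coords, z, and gamma objects) from the ordinary kriging sketch above:

```python
import numpy as np

def loo_cross_validate(coords, z, gamma):
    """Leave-one-out cross-validation for ordinary kriging with semivariogram gamma."""
    n = len(z)
    zhat, s2 = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # omit z_i temporarily
        zhat[i], s2[i] = ok_predict(coords[keep], z[keep], coords[i], gamma)
    resid = z - zhat
    press = np.mean(resid ** 2)                  # average PRESS
    std = resid / np.sqrt(s2)                    # standardized PRESS residuals
    return press, np.mean(std), np.sqrt(np.mean(std ** 2))

press, mean_std, rms_std = loo_cross_validate(coords, z, gamma)
print(press, mean_std, rms_std)   # compare candidate models by these summaries
```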
2. Universal Kriging
A constant-mean assumption in ordinary kriging may not be reasonable in many practical
situations. Two extensions which allow for nonconstant mean are universal kriging and
median polish kriging.
Assume
Z(s) = β₀ + β₁f₁(s) + . . . + βₚfₚ(s) + ε(s)
where the fⱼ(·)'s are functions of spatial location (which can be any covariates measured at each
location) and ε(·) is assumed to be intrinsically stationary.
Again, we seek the linear unbiased predictor which minimizes the variance of prediction error:
$$\min_{\lambda}\ \mathrm{Var}\!\left[\sum_{i=1}^{n}\lambda_i Z(s_i) - Z(s_0)\right]$$
subject to
$$E\!\left[\sum_{i=1}^{n}\lambda_i Z(s_i)\right] = \beta_0 + \beta_1 f_1(s_0) + \ldots + \beta_p f_p(s_0) \quad \text{for all } \beta.$$
(This yields a set of p + 1 constraints.)
Then there are p + 1 Lagrange multipliers to be found, and the algebra is messier than in
the case of ordinary kriging. The optimal coefficients λ₁, . . . , λₙ are the first n elements of
the vector λ_U that satisfies the following system of linear equations (the UK equations):
$$\Gamma_U \lambda_U = \gamma_U$$
where
λ_U = (λ₁, . . . , λₙ, m₀, m₁, . . . , mₚ)′,
γ_U = [γ(s₁ − s₀), . . . , γ(sₙ − s₀), 1, f₁(s₀), . . . , fₚ(s₀)]′,
and Γ_U has entries
γ(sᵢ − sⱼ) for i = 1, . . . , n; j = 1, . . . , n,
f_{j−n−1}(sᵢ) for i = 1, . . . , n; j = n + 1, . . . , n + p + 1 (with f₀(·) ≡ 1),
0 for i = n + 1, . . . , n + p + 1; j = n + 1, . . . , n + p + 1,
and ΓU is a symmetric (n + p + 1) × (n + p + 1) matrix.
We should try to understand why the trend exists, based on the nature of our data, and
use a simple form for the trend if possible. Then we subtract this trend from the observed
data to obtain the residuals. We then use the residuals to compute the sample variogram,
fit a model variogram to it, predict the values at the unsampled locations ("krige" the
residuals), and finally add the kriged residuals back to the trend.
OTHER EXTENSIONS OF ORDINARY KRIGING:
We have considered point kriging so far, i.e. prediction at a single site. Sometimes it is
desirable to predict the average value over a region. This can be done by a straightforward
extension of OK called ordinary block kriging.
In some cases a quantity such as P(Z(s₀) ≥ z₀ | Z) (e.g. "ozone levels in air cannot exceed 2
ppm" in environmental monitoring) is of more importance, and a method for predicting such
a quantity is called indicator kriging, which utilizes 0–1 data (exceeds the standard or not).
In other situations, there are measurements for more than one variable at each data location. An extended method which utilizes dependence between variables as well as dependence
within variables to predict values at unsampled locations is called cokriging.
Block Kriging
Suppose that we want to predict the average value of Z over a region B, i.e.,
$$Z(B) \equiv \frac{\int_B Z(s)\,ds}{|B|}$$
where |B| is the area of the block.
The theoretical development is similar to that of ordinary kriging and yields the ordinary block
kriging equations
$$\Gamma_{OB}\,\lambda_{OB} = \gamma_{OB}$$
where
$$\gamma_{OB} = [\gamma(B, s_1), \ldots, \gamma(B, s_n), 1]', \qquad \gamma(B, s_i) = |B|^{-1}\int_B \gamma(u - s_i)\,du.$$
The ordinary block kriging predictor is given by
$$\hat{Z}(B) = \sum_{i=1}^{n}\lambda_{OB,i}\,Z(s_i)$$
where λ_{OB,1}, . . . , λ_{OB,n} are the first n elements of λ_{OB}. The kriging variance is given by
$$\lambda_{OB}'\,\gamma_{OB} - |B|^{-2}\int_B\int_B \gamma(u - v)\,du\,dv$$
In practice, it will generally be necessary to evaluate the integrals by a numerical integration
procedure.
-“Change of support” problem
Median Polish Kriging
First do a median polish fit of overall, row and column effects and compute the residuals
from this fit.
Then, perform ordinary kriging on those residuals to get, say, ε̂(s₀).
To get the median polish kriging predictor of Z(s₀), add the planar-interpolated
median polish fit at s₀ to the kriged residual:
Ẑ(s₀) = m(s₀; â, {r̂ₖ}, {ĉₗ}) + ε̂(s₀)
The kriging variance of the median polish kriging predictor is taken (with little modification) to be the ordinary kriging variance based on the median polish residuals.
Indicator Kriging
Define the indicator random field
$$I(s, z) = \begin{cases} 1 & \text{if } Z(s) \le z \\ 0 & \text{otherwise} \end{cases}$$
The indicator random field is intrinsically stationary if the following conditions hold:
i) E[I(s, z)] ≡ F (z) for all s and all z ∈ R.
ii) Var[ I(s, z) − I(s + h, z)] ≡ 2γI,z (h, z) for all h and all z ∈ R.
Indicator kriging proceeds as does ordinary kriging, but with I(sᵢ, z) in place of zᵢ and
γ_{I,z}(·) in place of γ(·). Prediction is often carried out at K levels z₁, . . . , z_K, which requires
the K corresponding semivariograms to be estimated and modeled.
- Other simple methods of spatial prediction:
1. Method of polygons.
2. Weighted average based on triangulation.
3. Inverse distance (k-nn) method.
Characterization of spatial cross-dependence
When sampling over a spatial domain, measurements are often collected on more than one
variable, say m variables, and we may also be interested in the correlations between them.
Consider for now the simplest case, where we confine our development to two
variables, i.e. m = 2. As before, there are several functions that can be used to characterize
the dependence between two variables.
• Cross-covariance function
Cij (s, t) = Cov(Zi (s), Zj (t)) i, j = 1, 2
Note that Cᵢⱼ(s, t) ≠ Cᵢⱼ(t, s) for i ≠ j, and
Cᵢⱼ(s, t) ≠ Cⱼᵢ(s, t) for i ≠ j, in general.
• “Traditional” cross-variogram
2νij (s, t) = Cov(Zi (s) − Zi (t), Zj (s) − Zj (t)) i, j = 1, 2
• “Pseudo” cross-variogram
2γij (s, t) = Var(Zi (s) − Zj (t)) i, j = 1, 2
Note that νij requires that data on both variables must be measured at the same
locations or at least at many of the same locations, whereas γij requires that the two
variables be measured in the same units in order to be meaningful. It is recommended
to standardize the variables before estimating this quantity.
Estimation: sample cross-covariance function (h = s − t):
$$\hat{C}_{ij}(h) = \frac{1}{|N(h)|}\sum_{N(h)} Z_i(s)Z_j(t) - \bar{Z}^{(i)}_{-h}\,\bar{Z}^{(j)}_{+h}$$
Sample cross-variograms:
$$2\hat{\nu}_{ij}(h) = \frac{1}{|N(h)|}\sum_{N(h)} (Z_i(s) - Z_i(t))(Z_j(s) - Z_j(t))$$
$$2\hat{\gamma}_{ij}(h) = \frac{1}{|N(h)|}\sum_{N(h)} (Z_i(s) - Z_j(t))^2 - (\bar{Z}^{(i)}_{-h} - \bar{Z}^{(j)}_{+h})^2$$
Example:
Cokriging
Suppose that the data are now m×1 vectors Z(s1 ), . . . , Z(sn ) and we may want to predict
one or more values of the variables at an unsampled location. Denote the jth element of the
ith of these vectors by Zj (si ). Let s0 denote the unsampled site.
First consider that we wish to predict, say, Z₁(s₀). We could merely do ordinary kriging
to get a predicted value. However, if the other variables are correlated with the first variable,
then a better predictor can be obtained by basing the prediction on all of the elements of
Z(s₁), . . . , Z(sₙ). The best linear unbiased predictor of Z₁(s₀) based on all of these
is called the (ordinary) cokriging predictor.
When we wish to predict the entire vector of variables at an unsampled site, i.e. Z(s0 ),
then it can be accomplished using similar ideas and is called multivariate spatial prediction.
For m = 2, define Z₁ = [Z₁(s₁), . . . , Z₁(sₙ)]′ and Z₂ = [Z₂(s₁), . . . , Z₂(sₙ)]′.
Then the cokriging predictor of Z₁(s₀) is given by
$$\boldsymbol{\lambda}_1'\mathbf{Z}_1 + \boldsymbol{\lambda}_2'\mathbf{Z}_2$$
whereas the multivariate spatial predictor of Z(s₀) is given by
$$\Lambda_1\mathbf{Z}_1 + \Lambda_2\mathbf{Z}_2$$
where Λ₁ and Λ₂ are matrices.
COKRIGING EQUATIONS:
Assume that m = 2 and that the two variables are jointly second-order stationary for
simplicity.
The model for the process is Z(s) = β + ε(s), or
$$\begin{pmatrix} \mathbf{Z}_1 \\ \mathbf{Z}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \end{pmatrix}.$$
Also,
$$Z(s_0) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1(s_0) \\ \varepsilon_2(s_0) \end{pmatrix}.$$
Define
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
where
$$\Sigma_{ij} = \mathrm{Cov}(\boldsymbol{\varepsilon}_i, \boldsymbol{\varepsilon}_j) = \begin{pmatrix} C_{ij}(s_1, s_1) & \cdots & C_{ij}(s_1, s_n) \\ \vdots & & \vdots \\ C_{ij}(s_n, s_1) & \cdots & C_{ij}(s_n, s_n) \end{pmatrix}$$
and
$$\mathbf{c}_1 = \mathrm{Cov}(\boldsymbol{\varepsilon}, \varepsilon_1(s_0)) = [C_{11}(s_1, s_0), \ldots, C_{11}(s_n, s_0),\ C_{21}(s_1, s_0), \ldots, C_{21}(s_n, s_0)]'$$
The cokriging equations to predict Z₁(s₀) are
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \mathbf{1} & \mathbf{0} \\ \Sigma_{21} & \Sigma_{22} & \mathbf{0} & \mathbf{1} \\ \mathbf{1}' & \mathbf{0}' & 0 & 0 \\ \mathbf{0}' & \mathbf{1}' & 0 & 0 \end{pmatrix}\begin{pmatrix} \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ m_1 \\ m_2 \end{pmatrix} = \begin{pmatrix} \mathbf{c}_1 \\ 1 \\ 0 \end{pmatrix}$$
The ordinary cokriging predictor of Z₁(s₀) is then
$$\boldsymbol{\lambda}_1'\mathbf{Z}_1 + \boldsymbol{\lambda}_2'\mathbf{Z}_2$$
and the associated cokriging variance is given by
$$(\boldsymbol{\lambda}_1', \boldsymbol{\lambda}_2')\,\mathbf{c}_1 + m_1.$$
Note that the symmetry condition Cij (s, t) = Cij (t, s) should be satisfied in order for
cokriging based on 2νij to give the optimal predictor. This condition is not required for 2γij
which always gives the same predictor as cokriging based on the cross-covariance function.
We can get the same results using the variance-based cross-variogram. The cokriging
equations in these terms are
$$\begin{pmatrix} \Gamma_{11} & \Gamma_{12} & \mathbf{1} & \mathbf{0} \\ \Gamma_{21} & \Gamma_{22} & \mathbf{0} & \mathbf{1} \\ \mathbf{1}' & \mathbf{0}' & 0 & 0 \\ \mathbf{0}' & \mathbf{1}' & 0 & 0 \end{pmatrix}\begin{pmatrix} \boldsymbol{\lambda}_1 \\ \boldsymbol{\lambda}_2 \\ m_1 \\ m_2 \end{pmatrix} = \begin{pmatrix} \boldsymbol{\gamma}_1 \\ 1 \\ 0 \end{pmatrix}$$
The ordinary cokriging predictor of Z₁(s₀) is then
$$\boldsymbol{\lambda}_1'\mathbf{Z}_1 + \boldsymbol{\lambda}_2'\mathbf{Z}_2$$
and the associated cokriging variance is given by
$$(\boldsymbol{\lambda}_1', \boldsymbol{\lambda}_2')\,\boldsymbol{\gamma}_1 + m_1.$$
In order to implement cokriging, we need to estimate the cross-covariance functions or
cross variograms, choose valid parametric models for these functions, and fit the model to
the estimates. Much research is still needed on these topics especially because of the scarcity
of known valid models.
EXAMPLE: m = 2
$$\gamma_{ij}(k, l) = \begin{cases} 1 - \exp(-|l - k|) & \text{for } i = j \\ 1 - 0.5\exp(-|l - k + 1|) & \text{for } i \ne j \end{cases}$$
Space-time Geostatistics
Suppose that we have observed spatial data at each of m time points, i.e.,
{Z(s1i , ti ), . . . , Z(sni , ti ) : i = 1, . . . , m}
where s1i , . . . , sni are the ni data locations at time i, and t1 < t2 < . . . < tm are the times
of observation. When ni ≡ n for all i, the data are said to be rectangular.
The data are usually assumed to be an incomplete sampling of one realization of the
stochastic process
{Z(s, t), s ∈ D(t), t ∈ T }.
If D(t) ≡ D and T = {1, 2, . . .}, then we can view this as a time series of spatial
processes. If the temporal correlation is non-negligible, then we generally need to assume
spatial and temporal stationarity of some kind.
The generic space-time problem is to use the data to predict Z(s₀, t₀), where s₀ ∈ D and
t₀ ∈ T. Typically, t₀ ≥ tₘ. In principle, we can use ideas from spatial kriging to perform
space-time kriging, but some differences arise:
- Data in time often reveal a cyclical or periodic component but data in space usually
do not. e.g. This can be dealt with by using a mean function model that contains some
periodic components.
- We must use a valid space-time covariance function or semivariogram.
i) Include an extra parameter to scale properly for time.
ii) Assume space-time additivity.
iii) Assume space-time separability.
3 Lattice Data
Recall the definition of lattice data: observations are taken at a finite
number of sites which together constitute the entire study region. For this type of data, there
is no possibility of a response between data locations. When the data locations are points,
geostatistical methods can be used to handle the data, so we shall focus on the cases where
the data locations are regions. For areal (lattice) data, we use neighbour information to define
spatial relationships.
Examples:
- Cancer rate in each city district
- Census data with zipcode division for a metropolitan area
- Remotely sensed data
Exploratory Data Analysis
Many of the EDA tools previously introduced for geostatistical data can also be applied
to lattice data. For data on a regular grid: median polish, plots of row or column mean
versus row or column index, same-lag scatterplots
Irregularly spaced regions: 3-D scatterplots, semivariogram cloud, plots of each datum
against the average of its nearest neighbors, gray-scale maps, plot of response versus area of
region, etc.
The data analysis involves: representation of spatial proximity, testing for spatial pattern
using Moran’s I or Geary’s c statistic, modeling with autoregressive models (SAR, CAR).
Measures of spatial autocorrelation
The study objective is mainly to measure how strong the tendency is for observations
from nearby regions to be more (or less) alike than observations from regions far apart, and
then to judge whether any apparent tendency is sufficiently strong that it is unlikely to be due
to chance alone.
- The data locations may be points or regions and response variables can be either discrete
or continuous.
Examples of spatial autocorrelation for binary (0-1) data:
[Three binary maps illustrating differing degrees of spatial autocorrelation; the grid layouts were lost in extraction.]
1. The general cross-product statistic
Notation:
- Let Zi denote the response at the ith location, i = 1, . . . , n.
- Let Yij be a measure of how similar or dissimilar the responses are at locations i and j.
- Let Wij be a measure of the spatial proximity of locations i and j.
- Define matrices (for future reference) Y = (Yij ) and W = (Wij ).
W is called a proximity matrix.
The general cross-product statistic is given by
$$C = \sum_i \sum_j W_{ij} Y_{ij}.$$
If C too small ⇒
If C too large ⇒
Example (hypothetical): Let Yij = (Zi − Zj )2 for binary Zi ’s.
$$W_{ij} = \begin{cases} 1 & \text{if locations } i \text{ and } j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$$
[Binary map for the worked example; grid layout lost in extraction.]
PAGE 41
c
HYON-JUNG
KIM, 2016
Testing the statistical significance of C:
H0 : no correlation
- Normal approximation
- Comparison to randomization distribution
- Monte Carlo approach
i) Normal approximation of C:
Let
$$S_0 = \sum_{i \ne j} W_{ij}, \qquad S_1 = \frac{1}{2}\sum_{i \ne j} (W_{ij} + W_{ji})^2, \qquad S_2 = \sum_i (W_{i\cdot} + W_{\cdot i})^2$$
and define T₀, T₁, and T₂ similarly, but for the Yᵢⱼ's. Then
$$E(C) = \frac{S_0 T_0}{n(n-1)}$$
and
$$\mathrm{Var}(C) = \frac{S_1 T_1}{2n(n-1)} + \frac{(S_2 - 2S_1)(T_2 - 2T_1)}{4n(n-1)(n-2)} + \frac{(S_0^2 + S_1 - S_2)(T_0^2 + T_1 - T_2)}{n(n-1)(n-2)(n-3)} - [E(C)]^2,$$
so that C ∼ approx. N(E(C), Var(C)). Compute
$$z = \frac{|C - E(C)| - 1}{\sqrt{\mathrm{Var}(C)}}$$
Example continued:
1 1 0
1 1 0
0 0 0
ii) Randomization distribution
- List all possible arrangements of the observed responses over the locations obtained by
permutation of responses.
- Compute C for each arrangement, and rank these.
- Determine where the data’s C values fits in; P -value for the test is the number of C
values in the randomization distribution as extreme or more extreme than the observed
C.
Example continued:
iii) Monte Carlo approach
- Observe that complete enumeration of the possible arrangements may be computationally prohibitive even for moderately-sized data sets.
- So instead, obtain a random sample from the randomization distribution and follow the
same type of procedure.
- In order to implement this random sampling, generate n random numbers (one for
each data location), rank these random numbers from smallest to largest, then rearrange
the observations in accordance with the ranking of random numbers. C is computed for this
arrangement, and repeat the whole process m times.
- The P-value estimates the proportion of C values as extreme or more extreme than
the observed C, and is given by
$$P = \frac{1 + \#\{C \text{ values} \ge \text{observed } C\}}{1 + m}$$
Example:
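As an illustration, a sketch of the Monte Carlo test for the general cross-product statistic with Yᵢⱼ = (Zᵢ − Zⱼ)²; the lattice, adjacency definition, and data are illustrative:

```python
import numpy as np

def cross_product_stat(W, z):
    # C = sum_ij W_ij (z_i - z_j)^2
    Y = (z[:, None] - z[None, :]) ** 2
    return np.sum(W * Y)

def monte_carlo_pvalue(W, z, m=999, seed=0):
    """Estimate the randomization P-value of C from m random permutations of z."""
    rng = np.random.default_rng(seed)
    c_obs = cross_product_stat(W, z)
    c_sim = np.array([cross_product_stat(W, rng.permutation(z)) for _ in range(m)])
    # P = (1 + #{C_sim >= C_obs}) / (1 + m); use <= instead for a small-C alternative
    return (1 + np.sum(c_sim >= c_obs)) / (1 + m)

def rook_W(r, c):
    """Binary rook adjacency matrix for an r x c grid (row-major site order)."""
    n = r * c
    W = np.zeros((n, n))
    for i in range(r):
        for j in range(c):
            k = i * c + j
            if j + 1 < c:
                W[k, k + 1] = W[k + 1, k] = 1
            if i + 1 < r:
                W[k, k + c] = W[k + c, k] = 1
    return W

# hypothetical 3 x 3 lattice with rook adjacency
z = np.array([6, 9, 6, 5, 7, 4, 4, 2, 2], dtype=float)
print(monte_carlo_pvalue(rook_W(3, 3), z))
```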
2. Join-Count Statistics
A subclass of general cross-product statistics which are for use with binary data.
- Code the data as either 1 (black) or 0 (white). The black-white classification is for the
purpose of making a map.
- Question of interest: Are neighboring locations more likely to display the same color
(or opposite colors) than what we would expect in the absence of spatial correlation?
Procedures:
• Classify the “joins” between contiguous regions as BB, BW , or W W .
• Define Wij = 1 if regions i and j share an edge, and 0 otherwise (using rook’s definition
of neighborhood). Other ways of defining neighborhoods: bishop’s, queen’s etc.
• Count the number of joins of a specified type, e.g. the # of BW joins= BW .
Note that if we define Yᵢⱼ = (Zᵢ − Zⱼ)², then
$$C = \sum_i \sum_j W_{ij} Y_{ij} = 2BW,$$
i.e. BW = C/2.
Likewise, BB = C ∗ /2 where C ∗ is the value of C obtained by defining Yij = Zi Zj .
If the total # of joins in the system is J, then W W = J − BB − BW .
BW statistic:
(There is some evidence that this statistic is slightly better than the other two.)
Let b = # of black regions and w = # of white regions; (b + w = n).
Note E(BW) = ½E(C) and Var(BW) = ¼Var(C).
It can be also shown that
T0 = 2bw,
T1 = 2T0 ,
T2 = 4nbw
If the regions form a rectangular r × c lattice, and the rook's contiguity definition is
used, then
S₀ = 2(2rc − r − c), S₁ = 2S₀, S₂ = 8(8rc − 7r − 7c + 4)
• Commonly used definitions of neighborhood
- Rook's: spatial correlation down rows and across columns; requires more than a single shared boundary point to constitute neighbours
- Bishop's: spatial correlation in the diagonal directions
- Queen's: omni-directional correlation; a single shared boundary point means the regions are neighbours
The same approach can be used for data at irregularly spaced and shaped locations, but
only the formulas given for T₀, T₁, and T₂ still apply; those for S₀, S₁, and S₂ do not.
- BB statistic:
T₀ = b(b − 1), T₁ = 2T₀, T₂ = 4b(b − 1)²
Extensions to polytomous categorical data (i.e. a multi-colored map) are possible.
3. Moran’s and Geary’s statistic (for continuous data)
Moran’s I (1950, Biometrika):
n
I=
S0
where Z =
P P
i
j
Wij (Zi − Z)(Zj − Z)
P
2
i (Zi − Z)
P Zi
.
i n
1
E(I) = − n−1
under independence.
1
I > − n−1
⇒
1
I < − n−1
⇒
- Normal approximation to distribution of I under independence (n > 25):
E (I) as before.
$$\mathrm{Var}(I) = \frac{n[(n^2 - 3n + 3)S_1 - nS_2 + 3S_0^2] - k[n(n-1)S_1 - 2nS_2 + 6S_0^2]}{(n-1)(n-2)(n-3)S_0^2} - \frac{1}{(n-1)^2}$$
where $k = \dfrac{n\sum_i (Z_i - \bar{Z})^4}{\left(\sum_i (Z_i - \bar{Z})^2\right)^2}$.
- For small sample sizes, can use randomization distribution or Monte Carlo approach to
evaluate significance.
Geary's c (1954, The Incorporated Statistician):
$$c = \frac{n-1}{2S_0}\,\frac{\sum_i \sum_j W_{ij}(Z_i - Z_j)^2}{\sum_i (Z_i - \bar{Z})^2}$$
Example:
6 9 6
5 7 4
4 2 2
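A sketch computing Moran's I and Geary's c for the 3 × 3 grid above with rook adjacency (the rook_W helper from the Monte Carlo sketch earlier is reused):

```python
import numpy as np

def morans_I(W, z):
    n, zc = len(z), z - z.mean()
    S0 = W.sum()
    return (n / S0) * (zc @ W @ zc) / (zc @ zc)

def gearys_c(W, z):
    n, zc = len(z), z - z.mean()
    S0 = W.sum()
    num = np.sum(W * (z[:, None] - z[None, :]) ** 2)
    return ((n - 1) / (2 * S0)) * num / (zc @ zc)

z = np.array([6, 9, 6, 5, 7, 4, 4, 2, 2], dtype=float)
W = rook_W(3, 3)                    # adjacency builder from the earlier sketch
I, c = morans_I(W, z), gearys_c(W, z)
print(I, c, -1 / (len(z) - 1))      # compare I with E(I) = -1/(n-1)
```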
4. Generalized Proximity Values
The join-count statistics and Moran's and Geary's statistics have assumed that the Wᵢⱼ's
are binary (0 or 1). This is rather crude. In many situations we may be able to measure
spatial proximity on a more refined scale (as we do for the Yᵢⱼ's in going from BW to I or c).
Possible refinements:
- Use lengths of common boundary.
- Use actual distance between locations or centroids of locations, e.g. the inverse of
Euclidean or city-block distance between locations.
- Incorporate directionality by allowing Wᵢⱼ ≠ Wⱼᵢ.
A side benefit of using non-binary Wij is that the distribution of the test statistic under
independence is better approximated by the normal distribution.
Example:
5. Spatial Autocorrelation Functions
The statistics considered so far attempt to express information about spatial autocorrelation in a single number. Alternatively, we could regard spatial autocorrelation as a function of distance.
- Divide the range of distances into q classes.
- Compute a previously considered spatial autocorrelation measure, e.g. I, once for each
of the q distance classes; in other words, for each class use only those pairs of locations
whose separation falls in that class.
- Plot the statistic, e.g. I_d vs. d. Such a plot is called the correlogram corresponding to
that statistic (just as done for geostatistical data).
MODELS
The most popular models for lattice data are similar to commonly used models for discrete
time series. In time series analysis, one of the most well-known models is the autoregressive
model of order one, AR(1):
    X_t = ρX_{t−1} + ε_t,    ε_t ∼ iid N(0, σ²),    t = 0, ±1, ±2, . . .
where ρ ∈ (−1, 1) is called the autoregressive coefficient. It can be shown that corr(X_t, X_{t−1}) = ρ, corr(X_t, X_{t−2}) = ρ², . . . , and in general, corr(X_t, X_{t−k}) = ρ^k.
Instead of assuming zero mean as in the AR(1) model above, we can allow for a trend by supposing that deviations of the responses from time-specific means, rather than the responses themselves, follow an AR(1) model:
    X_t − µ_t = ρ(X_{t−1} − µ_{t−1}) + ε_t,    ε_t ∼ iid N(0, σ²),    t = 0, ±1, ±2, . . .
where µt ≡ E(Xt ).
There are two ways to specify a first-order autoregressive model:
i) A simultaneous AR(1) as above
ii) A conditional AR(1):
    X_t | X_{t−1} ∼ independent N(µ_t + ρ(X_{t−1} − µ_{t−1}), σ²).
It turns out that these two specifications are equivalent here, i.e. they produce responses
X1 , . . . , Xn that have the same joint distribution. Moreover, they are both equivalent to a
"two-sided" specification,

    X_t − µ_t = (ρ/2)(X_{t−1} − µ_{t−1}) + (ρ/2)(X_{t+1} − µ_{t+1}) + ε_t,    ε_t ∼ iid N(0, σ²),  t = 0, ±1, ±2, . . .

We can write this using matrix notation as follows:

    X − µ = ρW(X − µ) + ε,    ε ∼ N(0, σ²I)

where W is an n × n matrix whose nonzero elements specify the neighboring times of each
time point. Specifically, after accounting for "edge effects", W is tridiagonal with entries
W_{t,t−1} = W_{t,t+1} = 1/2 and zeros elsewhere.
We generalize these ideas to spatial models for lattice data. Consider a spatial model in
which each response is a first-order autoregression on the average of its neighbors’ responses,
i.e.,
    Z_i = ρ · (Σ_{j∈N_i} Z_j) / |N_i| + ε_i,    ε_i ∼ iid N(0, σ²),  i = 1, . . . , n,

where N_i is the set of neighbors of location i and |N_i| is the number of those neighbors.
Example: data on a 3 × 3 regular rectangular lattice, with neighbors defined as adjacent
sites:

    Z_1  Z_2  Z_3
    Z_4  Z_5  Z_6
    Z_7  Z_8  Z_9
This kind of model is called a spatial autoregressive model.
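A small sketch (assumptions mine: a 3 × 3 rook lattice, ρ = 0.4, and zero mean) simulating this neighbor-average autoregression by solving Z = (I − ρW)⁻¹ε, where W is the row-standardized adjacency matrix.

```python
import numpy as np

n_side, rho, sigma = 3, 0.4, 1.0          # illustrative values
n = n_side * n_side
W = np.zeros((n, n))
for i in range(n_side):
    for j in range(n_side):
        for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):   # rook neighbors
            ni, nj = i + di, j + dj
            if 0 <= ni < n_side and 0 <= nj < n_side:
                W[n_side * i + j, n_side * ni + nj] = 1.0
W = W / W.sum(axis=1, keepdims=True)      # row i divided by |N_i|

rng = np.random.default_rng(0)
eps = rng.normal(scale=sigma, size=n)
Z = np.linalg.solve(np.eye(n) - rho * W, eps)   # Z = (I - rho W)^{-1} eps
print(Z.reshape(n_side, n_side))
```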
As in the time series case, we can specify two kinds of spatial autoregressive models. In
their most general forms they are:
1. Simultaneous autoregression (SAR) model
    Z_i − µ_i = Σ_j S_ij (Z_j − µ_j) + ε_i,    ε_i ∼ iid N(0, σ²),  i = 1, . . . , n,

where S ≡ {S_ij} is such that I − S is nonsingular. In matrix form,

    Z − µ = S(Z − µ) + ε,    ε ∼ N(0, σ²I)
2. Conditional autoregression (CAR) model
    (Z_i | Z_j, j ≠ i) ∼ N(µ_i + Σ_j C_ij (Z_j − µ_j), σ²),

where C ≡ {C_ij} is such that I − C is symmetric and positive definite.
However, in contrast to the time series case, the two specifications yield different models: if we take C_ij = S_ij, the CAR yields responses whose joint distribution is different from that of the SAR.
Examples:
INFERENCE
For the SAR, the distribution of the data is as follows:

    Z − µ ∼ N(0, σ²(I − S)⁻¹(I − S′)⁻¹).

For the CAR,

    Z − µ ∼ N(0, σ²(I − C)⁻¹).

The log-likelihood function associated with a Z from either process is

    −(n/2) log(2πσ²) + (1/2) log|B| − (1/(2σ²)) (Z − µ)′B(Z − µ)
where

    B = (I − S′)(I − S)   for a SAR,
    B = I − C             for a CAR.
Usually the mean is parametrized by a linear model µ = Xβ, and the MLEs of β, σ²,
and the parameters in B are given by

    β̂ = (X′B̂X)⁻¹X′B̂Z,
    σ̂² = (1/n)(Z − Xβ̂)′B̂(Z − Xβ̂),

and B̂ minimizes the "profile log-likelihood"

    L(B) = n log(σ̂²) − log|B|.
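A hedged sketch of this fitting recipe for a SAR with the common one-parameter choice S = ρW (an assumption; the notes leave S general): β and σ² are profiled out exactly as above, and ρ̂ is found by a grid search on the profile log-likelihood. The lattice and data are simulated for illustration.

```python
import numpy as np

def make_W(n_side):
    # row-standardized rook adjacency for an n_side x n_side lattice
    n = n_side * n_side
    W = np.zeros((n, n))
    for i in range(n_side):
        for j in range(n_side):
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_side and 0 <= nj < n_side:
                    W[n_side * i + j, n_side * ni + nj] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def profile_loglik(rho, Z, X, W):
    n = Z.size
    B = (np.eye(n) - rho * W.T) @ (np.eye(n) - rho * W)   # B = (I - S')(I - S)
    beta = np.linalg.solve(X.T @ B @ X, X.T @ B @ Z)      # GLS estimate of beta
    resid = Z - X @ beta
    sigma2 = (resid @ B @ resid) / n                      # MLE of sigma^2
    _, logdetB = np.linalg.slogdet(B)
    return -(n * np.log(sigma2) - logdetB)                # = -L(B) in the notes

rng = np.random.default_rng(2)
W = make_W(10)
n = W.shape[0]
Z = np.linalg.solve(np.eye(n) - 0.5 * W, rng.normal(size=n)) + 3.0  # true rho = 0.5
X = np.ones((n, 1))                                       # constant-mean model

rhos = np.linspace(-0.9, 0.9, 91)
rho_hat = rhos[np.argmax([profile_loglik(r, Z, X, W) for r in rhos])]
print("rho_hat =", rho_hat)
```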
4 Spatial Point Patterns
Terminology:
• Meaning of "pattern": a pattern is the characteristic of a set of points (events) that describes the locations of these points in terms of the relative distances and orientations of one point, or one group of points, to another, at one or more scales of observation.
• Spatial point pattern: locations of a finite number of events in a bounded region
A ⊆ Rd .
• Spatial point process (SPP): a set of random events in Rd - a random mechanism for
generating a countable set of points in A.
• Key point: locations are modeled as random variables.
Subsequently, we call the points in a spatial point pattern or process events, to distinguish them from arbitrary points in A.
If one or more additional variables other than location labels are measured at each
point, such a SPP is referred to as a marked point pattern.
• Objectives of statistical analysis of SPPs:
- What is the nature of the spatial pattern?
- What is the intensity of the underlying process?
- Can we model the process that we envisage has generated the data? Can we do
statistical inference on the parameters of the model?
• Appropriate statistical methods for addressing these questions depend on
- the extent of sampling (completely mapped or sparsely mapped)
- the type of sampling (quadrat or distance sampling)
• Four-way classification of SPPs (qualitative, single scale, simplistic classifications)
i) Completely random (complete spatial randomness: CSR)
No obvious structure, a random sample from the uniform distribution on A.
ii) Aggregated (clustered, clumped)
iii) Regular (overdispersed, inhibitory, superuniform)
iv) Heterogeneous
Exploratory Analysis
1. Quadrat methods
Partition A into subregions of equal size (quadrats) and summarize the spatial pattern
in each quadrat.
Quadrats are usually rectangular, but may or may not constitute an exhaustive partition of the study area.
2. Distance methods
Based on a reduction of the SPP to distances to events. These methods may utilize interevent distances, such as the distance from an event to its nearest neighbor, or point-to-event distances, or both.
Note:
- Size and shape of quadrats are arbitrary, and different choices can give you different
results.
- Two main problems with distance methods are edge effects and overlap effects.
• Edge effects
Distance measurements taken near the boundary of A will tend to be larger than those
taken in the interior, since points or events near the boundary are denied the possibility
of neighbors outside the boundary.
Possible remedies:
- Restrict attention to points or events in the interior, surrounded by a "buffer zone"
- If A is rectangular, connect opposite edges (toroidal edge correction)
- Censor the search distance and incorporate this into the distribution of distance
measurements.
- Base inferences on actual areas searched instead of distances
• Overlap effects
“Search areas” for the nearest event can overlap, resulting in dependent measurements.
Possible remedies:
- Use sparse sampling (undesirable however, for completely mapped patterns)
- Censor to prevent overlap
4.1 Models
• Notation and definitions:
- Process: {N (x) : x ∈ A}
- N (B): number of events in an arbitrary region B ⊆ A
- |B|: area of B
- dx: infinitesimal region containing x ∈ A
- Intensity function (first-order):

    λ(x) = lim_{|dx|→0} E[N(dx)] / |dx|

- Second-order intensity function:

    λ_2(x, y) = lim_{|dx|,|dy|→0} E[N(dx)N(dy)] / (|dx||dy|)
- Stationarity:
A process is stationary if all probability statements about it in any region B ⊆ A are
invariant under arbitrary translations of B.
- Isotropy:
A process is isotropic if the same invariance holds under rotation as well as translation.
- Orderliness:
A process is first-degree orderly if

    lim_{|dx|→0} P(N(dx) > 1) / |dx| = 0
A process is second-degree orderly if

    lim_{|dx|,|dy|→0} P(N(dx) > 1, N(dy) > 1) / (|dx||dy|) = 0
Note that first- and second-degree orderliness imply, respectively, that the probability of multiple coincident events at a single location, or at a pair of locations, is negligible.
1. Homogenous Poisson Process (HPP, CSR)
Two equivalent characterizations:
i) For every B ⊆ A, N (B) has a Poisson distribution with mean λ|B| for some λ > 0.
ii) Conditional on N (A), the events of the process are a random sample from a uniform
distribution on A.
Note that i) ⇔ ii)
- Stationary, isotropic, with intensity λ
- λ_2(t) = λ²
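A minimal sketch simulating an HPP on a rectangle directly from the characterizations above: draw N(A) ∼ Poisson(λ|A|), then place that many events uniformly on A (the window and λ are illustrative).

```python
import numpy as np

def simulate_hpp(lam, width, height, rng):
    # N(A) ~ Poisson(lam * |A|); conditional on N(A), events are uniform on A
    n = rng.poisson(lam * width * height)
    x = rng.uniform(0, width, n)
    y = rng.uniform(0, height, n)
    return np.column_stack([x, y])

rng = np.random.default_rng(3)
events = simulate_hpp(lam=50, width=1.0, height=1.0, rng=rng)
print(len(events), "events")
```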
2. Poisson Cluster Process (PCP)
Three postulates:
i) Cluster centers form an HPP with intensity ρ.
ii) The numbers of events per cluster are iid variates with mean µ.
iii) Positions of events within a cluster, relative to its center, are iid with pdf h(·).
- Stationary with intensity λ = ρµ
- Isotropic ⇔ h(·) is radially symmetric.
3. Simple Inhibition Process (SIP)
Two types
i) Static: Modify an HPP of intensity ρ by deleting all pairs of events less than δ units apart.
ii) Sequential (dynamic): First event is uniformly distributed in A. The distribution
of each subsequent event, conditional on all previously realized events, is uniform on
that portion of A that lies no closer than δ to any previously realized event.
- Stationary and isotropic
PAGE 53
4.1 Models
c
HYON-JUNG
KIM, 2016
- Static SIP has λ = ρ exp(−πρδ²) and

    λ_2(t) = 0 if 0 < t < δ;    λ_2(t) = ρ² exp(−ρU_δ(t)) if t ≥ δ,

where U_δ(t) is the area of the union of two circles of radius δ whose centers are a distance t apart.
- For the sequential SIP, for any fixed number of events desired, δ cannot be too
large or else it becomes impossible to add further events (related to the maximum
packing intensity). The maximum permissible value of δ is usually given by

    δ_max = √(2|A|√3 / (3N)).
4. Inhomogeneous Poisson Process (IPP)
This is a nonstationary process with variable intensity function λ(x).
i) For every B ⊂ A, N(B) ∼ Poisson with mean ∫_B λ(x) dx.
ii) Conditional on N(A), the events are a random sample from a continuous distribution with pdf ∝ λ(x).
Simulation:
- Generate an event from the uniform distribution on A. Call its coordinate vector x.
- Retain the event at x with probability λ(x)/λ_0, where λ_0 ≡ max_{x∈A} λ(x).
- Repeat the above steps until N events have been obtained.
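A sketch of this rejection scheme for an assumed intensity λ(x, y) = 100·exp(−2x) on the unit square; the function names and constants are illustrative.

```python
import numpy as np

def simulate_ipp(lam, lam0, n_events, rng):
    events = []
    while len(events) < n_events:
        x, y = rng.uniform(0, 1, 2)             # uniform proposal on A
        if rng.uniform() < lam(x, y) / lam0:    # retain w.p. lambda(x)/lambda_0
            events.append((x, y))
    return np.array(events)

rng = np.random.default_rng(4)
lam = lambda x, y: 100 * np.exp(-2 * x)         # assumed intensity surface
pts = simulate_ipp(lam, lam0=100.0, n_events=50, rng=rng)  # lam0 = max lambda
print(pts[:5])
```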
5. Cox Process (Doubly stochastic process)
{Λ(x) : x ∈ R2 } is a nonnegative-valued stochastic process.
Conditional on {Λ(x) = λ(x) : x ∈ R2 }, the events form an IPP with intensity function
λ(x).
6. Markov Process
A process is Markov of range δ if the conditional intensity at the point x, given the configuration of events in A \ {x}, depends only on the configuration in a well-defined neighborhood of x.
7. Strauss Process
This process belongs to the class of pairwise interaction processes.
The joint density function for n point locations (s_1, . . . , s_n) containing m distinct
pairs of neighbors is specified as

    f(s_1, . . . , s_n) = α βⁿ γᵐ,    β > 0, 0 ≤ γ ≤ 1,

where α is the normalizing constant, β reflects the intensity, and γ describes the interaction between neighbors.
8. Thinned process
Start with any “primary” process (an HPP, for example).
Bring in a stochastic process {Z(x)} such that 0 ≤ Z(x) ≤ 1 ∀x. This is called the
“thinning field”. Let {z(xi )} denote its realization.
Retain each event xi in the realized primary process with probability z(xi ).
4.2 Tests for CSR: Completely Mapped Patterns
1. Quadrat Methods
Let n_1, . . . , n_m denote the counts from a partitioning of A into m equally-sized quadrats.
Write n̄ = Σ_i n_i/m for the sample mean of the n_i's. Then compute the "index of
dispersion",

    X² = Σ_{i=1}^m (n_i − n̄)² / n̄.

If the pattern is completely random, then the distribution of X² is, to a good approximation, χ²_{m−1} (provided that n̄ is not too small, say n̄ ≥ 5).
The test is two-sided:
X² too large ⇒ aggregation
X² too small ⇒ regularity
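A hedged sketch of this index-of-dispersion test on the unit square, using a k × k grid of quadrats and the χ²_{m−1} approximation (SciPy is assumed for the chi-square cdf; the data here are simulated CSR).

```python
import numpy as np
from scipy import stats

def dispersion_test(events, k):
    # k x k grid of equally-sized quadrats on the unit square
    counts, _, _ = np.histogram2d(events[:, 0], events[:, 1],
                                  bins=[k, k], range=[[0, 1], [0, 1]])
    n = counts.flatten()
    nbar = n.mean()
    X2 = np.sum((n - nbar) ** 2) / nbar
    m = n.size
    p = stats.chi2.cdf(X2, m - 1)
    return X2, 2 * min(p, 1 - p)        # two-sided p-value

rng = np.random.default_rng(5)
events = rng.uniform(0, 1, size=(100, 2))   # CSR data for illustration
print(dispersion_test(events, k=4))
```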
Example: Analysis of Japanese black pines.
Examples of mapped patterns: [figure omitted: four completely mapped point patterns, labeled Pines, Redwoods, Cells, and Rushes]
- Quadrat methods are insensitive to regular departures from CSR.
- Conclusions can depend on quadrat size and shape, the choice of which is quite arbitrary.
- Too much information is lost by reducing the pattern to quadrat counts.
However, an analysis based on combining contiguous quadrats can be very useful for
characterizing pattern at different scales. For example,
i) Successively combine quadrats into 2 × 2, 4 × 4, . . . , blocks
ii) Plot X 2 for each block size vs. block size
iii) Peaks or troughs in the plot can be interpreted as evidence of scales of pattern.
iv) A problem with this approach is that the values of X 2 at different scales are not
independent. Mead (Biometrics, 1974) suggests a modification that yields a sequence
of independent tests.
2. Distance Methods
a) Clark-Evans test (Ecology, 1954)
- Based on the mean nearest-neighbor (NN) distance, Ȳ:
Ȳ too small ⇒ aggregation
Ȳ too large ⇒ regularity
The test statistic is given by

    CE = (Ȳ − 1/(2√λ)) / √[(4 − π)/(4λπN)]

where λ = N/|A|.
- Test tends to be powerful for detecting aggregation and regularity, weak at detecting
heterogeneity.
- Under CSR, and if edge and overlap effects are ignored, the distribution of CE is, to a fairly good approximation, N(0, 1).
Note that the above statistic ignores edge and overlap effects. There are various
modifications for these; one, given by Donnelly, is as follows:

    E(Ȳ) = 0.5 √(|A|/N) + 0.0514 l(A)/N + 0.041 l(A)/N^{3/2}

    Var(Ȳ) = 0.0703 |A|/N² + 0.037 l(A) √(|A|/N⁵)

where l(A) is the length of the study region's perimeter.
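A sketch of the uncorrected Clark-Evans statistic for events on the unit square (edge and overlap effects ignored, as in the first form above); the Donnelly-corrected version would substitute the corrected mean and variance.

```python
import numpy as np

def clark_evans(events, area):
    N = len(events)
    lam = N / area
    D = np.linalg.norm(events[:, None, :] - events[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    Ybar = D.min(axis=1).mean()                  # mean NN distance
    expected = 1.0 / (2.0 * np.sqrt(lam))        # E(Ybar) under CSR
    se = np.sqrt((4.0 - np.pi) / (4.0 * np.pi * lam * N))
    return (Ybar - expected) / se                # approx N(0,1) under CSR

rng = np.random.default_rng(6)
events = rng.uniform(0, 1, size=(100, 2))        # simulated CSR data
print(clark_evans(events, area=1.0))
```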
Example:
b) Diggle’s Refined NN analysis (Biometrics, 1979)
A test based on the entire empirical distribution function (EDF) of the NN distances
may be more sensitive than the Clark-Evans test, e.g. when there are fewer intermediate distances than expected under CSR.
Let

    Ĝ(y) = (1/N) #(Y_i ≤ y).

If CSR holds, Ĝ(y) should be close to G(y) = 1 − exp(−λπy²) for all y > 0, and a plot
of Ĝ(y) vs. G(y) should be nearly a straight line.
Ĝ(y) > G(y) for small y ⇒ aggregation (an excess of short NN distances)
Ĝ(y) < G(y) for small y ⇒ regularity
• Measures of discrepancy between Ĝ(·) and G(·):
- ∆G = max_y |Ĝ(y) − G(y)| (Kolmogorov-Smirnov type)
- ∫ {Ĝ(y) − G(y)}² dy (Cramer-von Mises type)
• For significance of tests, Monte Carlo testing is usually used because the distribution
theory is difficult. That is, we compare the measure's value for our data to the measure's
values for s simulations of an HPP (typically s = 99 or 999).
- Edge and overlap effect modification: using

    Ḡ_i(y) = [1/(s − 1)] Σ_{j≠i} Ĝ_j(y)
in place of G(y) is recommended. Koen (1990, Biometrical Journal) has tabulated the
distribution of ∆G using simulation.
Rather than reducing the EDF to a single summary statistic, it may be more informative
to look at a plot of the EDF. If the SPP is consistent with CSR, then a plot of Ĝ(y)
vs. G(y) should be nearly a straight line through the origin. Departures from CSR
can be detected by means of simulation envelopes, whose upper and lower endpoints
are defined as

    U(y) = max_{i=1,...,s} Ĝ_i(y),    L(y) = min_{i=1,...,s} Ĝ_i(y)
where s is the number of simulated HPP patterns having the same number of events
(s is usually taken to be 99), and Ĝi (·) is the NN-distance EDF for the ith simulation.
For each y > 0,

    P[Ĝ(y) > U(y)] = P[Ĝ(y) < L(y)] = 1/(s + 1).
Simulation envelopes also indicate the distance at which a deviation from CSR occurs,
if there is any.
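A minimal sketch of these envelopes for Ĝ with s = 99 HPP simulations on the unit square, ignoring edge corrections; the data are simulated CSR purely for illustration.

```python
import numpy as np

def nn_dist(events):
    D = np.linalg.norm(events[:, None, :] - events[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

def edf(dists, grid):
    return (dists[:, None] <= grid[None, :]).mean(axis=0)

rng = np.random.default_rng(7)
events = rng.uniform(0, 1, size=(60, 2))        # the "data" (CSR here)
N = len(events)
grid = np.linspace(0, 0.2, 100)

G_hat = edf(nn_dist(events), grid)
sims = np.array([edf(nn_dist(rng.uniform(0, 1, size=(N, 2))), grid)
                 for _ in range(99)])           # s = 99 HPP simulations
U, L = sims.max(axis=0), sims.min(axis=0)       # upper/lower envelopes
print("departure from CSR detected:", np.any((G_hat > U) | (G_hat < L)))
```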
We can do precisely the same kinds of tests using the EDF of the point-to-nearest
event distances X1 , . . . , Xm from m random or systematically placed sample points.
Let

    F̂(x) = (1/m) #(X_i ≤ x).

If CSR holds, F̂(x) should be close to F(x) = 1 − exp(−λπx²) for all x > 0, and a plot
of F̂(x) vs. F(x) should be nearly a straight line.
F̂(x) > F(x) for small x ⇒ regularity (little empty space)
F̂(x) < F(x) for small x ⇒ aggregation (large empty gaps)
The use of both Ĝ(y) and F̂ (x) is what Diggle calls refined NN analysis.
c) Ripley’s K-function approach (JRSS-B, 1979)
The "K-function" (second-moment cumulative function) is defined as

    K(t) = (1/λ) E(# of additional events within distance t of a randomly chosen event).
- It combines distance measurements with quadrat counting.
For an HPP, K(t) = πt². Equivalently,

    L(t) ≡ t − [K(t)/π]^{1/2} = 0.

L(t) < 0 (K(t) > πt²) for small t ⇒ aggregation
L(t) > 0 (K(t) < πt²) for small t ⇒ regularity
- Ripley proposes a nonparametric estimator K̂(t) of K(t) (whose exact form we will
not go into). He suggests looking at the plot of L̂(t) ≡ t − {K̂(t)/π}1/2 vs. t and
computing the test statistic

    L_max = max_{t ≤ t_0} |L̂(t)|.

The upper bound t_0 is used to account for the scarcity of information about K(t) at
"large" distances.
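Since the exact form of K̂(t) is not given here, the following sketch uses a naive estimator with no edge correction (Ripley's actual estimator is edge-corrected), just to show the L-plot and L_max computation; the data and t_0 are illustrative.

```python
import numpy as np

def K_naive(events, area, ts):
    # naive K estimator: no edge correction
    N = len(events)
    lam = N / area
    D = np.linalg.norm(events[:, None, :] - events[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    # mean number of additional events within t, divided by lambda
    return np.array([(D <= t).sum() / N / lam for t in ts])

rng = np.random.default_rng(8)
events = rng.uniform(0, 1, size=(100, 2))
ts = np.linspace(0.01, 0.25, 25)                # t_0 = 0.25
L = ts - np.sqrt(K_naive(events, 1.0, ts) / np.pi)
print("Lmax =", np.abs(L).max())
```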
• A Monte Carlo approach is preferred for assessing significance.
Example:
Modeling Completely Mapped Patterns
The methods used to fit models to patterns differ from model to model. For example,
for some models maximum likelihood estimation is possible, but for others the likelihood
function is intractable or computationally burdensome to evaluate.
1. Stationary Processes
If CSR is rejected, we may want to fit an alternative model to the data such as PCP,
or SIP.
Let K̂(t) be a nonparametric estimator of K(t), and suppose that we wish to fit a family
of stationary models whose K-function is a known function of a parameter vector θ.
A modified least squares estimator for θ is obtained by minimizing

    Q(θ) = ∫_0^{t_0} {[K̂(t)]^c − [K(t; θ)]^c}² dt

where c and t_0 are "tuning constants".
• Some computable K-functions
- PCP with Poisson number of offspring per parent:
    K(t) = πt² + H(t)/ρ

where H(t) is a nonnegative-valued function.
- SIP:

    K(t) = 2π exp(2πρδ²) ∫_δ^t exp{−ρU_δ(x)} x dx
• c is used to control for heterogeneity of variance of K̂(t); c = 1/4 is suggested for
aggregated patterns, and c = 1/2 for regular patterns.
• t0 is used as an upper limit since the pattern supplies increasingly limited information
as t increases.
2. Inhomogeneous Poisson Process (IPP)
a) Maximum likelihood estimation
Consider a parametric family of intensity functions {λ_θ(x, y) : θ ∈ Θ}. For this family,
the likelihood function is proportional to

    l(θ; A) = {Π_{i=1}^{N(A)} λ_θ(x_i, y_i)} exp{−∫_A λ_θ(u, v) du dv}.
An MLE of θ is a value θ̂ that maximizes l(θ; A).
A particularly useful family of intensity functions is

    λ(x, y; θ) = exp{θ′z(x, y)}

where z(x, y) is a vector whose components may be values of concomitant environmental
variables, known functions of the coordinates themselves, or distances to known
environmental features.
- Special case of HPP: taking z(x, y) ≡ 1 (intercept only) gives constant intensity λ = exp(θ), i.e., an HPP.
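A hedged sketch of ML estimation in this family with z(x, y) = (1, x)′ assumed for illustration: events are simulated by thinning, the integral over A is approximated on a midpoint grid, and θ̂ is found by a crude grid search.

```python
import numpy as np

rng = np.random.default_rng(9)
lam_true = lambda x, y: np.exp(4 + 2 * x)     # hypothetical true intensity
lam0 = np.exp(6)                               # its maximum on the unit square

# simulate the IPP by thinning an HPP of intensity lam0
n_prop = rng.poisson(lam0)
prop = rng.uniform(0, 1, size=(n_prop, 2))
keep = rng.uniform(size=n_prop) < lam_true(prop[:, 0], prop[:, 1]) / lam0
events = prop[keep]

def loglik(theta0, theta1, events):
    # sum of log-intensities at the events minus the integral of the
    # intensity over A (midpoint rule in x; the y-integral equals 1)
    g = (np.arange(200) + 0.5) / 200
    integral = np.exp(theta0 + theta1 * g).mean()
    return np.sum(theta0 + theta1 * events[:, 0]) - integral

grid = [(t0, t1) for t0 in np.linspace(3, 5, 21) for t1 in np.linspace(1, 3, 21)]
theta_hat = max(grid, key=lambda t: loglik(t[0], t[1], events))
print("theta_hat =", theta_hat)   # should be near the true (4, 2)
```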
b) Nonparametric estimation
As an alternative to parametric estimation, nonparametric methods for multivariate density estimation can be applied to the problem of estimating λ(·).
An edge-corrected kernel estimator of λ(x, y) is given by

    λ̂_h(x, y) = [1/p_h(x, y)] Σ_{i=1}^{N(A)} (1/h²) κ(√[(x − x_i)² + (y − y_i)²]/h)

where κ(·) is a probability density (kernel) function symmetric about the origin, h > 0
is a bandwidth that determines the amount of smoothing, and p_h(x, y) is an edge
correction.
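A minimal sketch with a Gaussian kernel and the edge correction omitted (p_h ≡ 1), which is adequate away from the boundary; the bandwidth and data are illustrative.

```python
import numpy as np

def lambda_hat(x, y, events, h):
    # Gaussian kernel: (1/h^2) * kappa(r/h) = exp(-r^2 / (2 h^2)) / (2 pi h^2)
    d2 = (x - events[:, 0]) ** 2 + (y - events[:, 1]) ** 2
    kern = np.exp(-d2 / (2 * h ** 2)) / (2 * np.pi * h ** 2)
    return kern.sum()

rng = np.random.default_rng(10)
events = rng.uniform(0, 1, size=(200, 2))
print(lambda_hat(0.5, 0.5, events, h=0.1))   # should be near 200 for CSR
```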
4.3 Testing for CSR: Sparsely Sampled Data
Now suppose that there were not sufficient resources to completely map the events and
that the pattern was sampled in some manner. The most common sampling methods
that are used are completely random sampling and systematic sampling. Note that the sampling must be carried out completely independently of the observed events.
- Advantages and disadvantages of systematic sampling:
1. Quadrat methods
Here n_1, . . . , n_m are the counts from m randomly placed, non-overlapping, equally-sized
quadrats in A, with relatively sparse coverage of A rather than a complete partition of A.
The same test statistic and its limiting distribution under CSR can be used as in the
completely mapped case:

    X² = Σ_{i=1}^m (n_i − n̄)² / n̄ ∼ χ²_{m−1} under CSR.
This approach is more competitive in this context and generally quite powerful against
aggregation and heterogeneity, but weak against regularity.
- Example: Lansing woods data
2. Distance methods
Under sparse sampling, the data generally cannot be a random sample of NN (event-to-event) distances. So consider methods based on m sample point-to-nearest-event distances, X_1, . . . , X_m.
- Then, each Xi has cdf F (x) = 1 − exp (−λπx2 ) under CSR (ignoring edge effects).
- So if X_1, . . . , X_m are independent (as is the case if the overlap effects are ignored),
then

    2λ Σ_{i=1}^m πX_i² ∼ χ²_{2m}.
However, an exact test for CSR cannot be based on this since λ is unknown.
• Hopkins’ test
Suppose that we could measure NN distances Y1 , . . . , Ym from a randomly selected
subset of m events for the sake of argument. Then by the same arguments,
    2λ Σ_{i=1}^m πY_i² ∼ χ²_{2m},

which is independent of 2λ Σ_{i=1}^m πX_i² under CSR, ignoring overlap effects. Then

    H ≡ [2λ Σ_{i=1}^m πX_i²/(2m)] / [2λ Σ_{i=1}^m πY_i²/(2m)] = Σ_i X_i² / Σ_i Y_i² ∼ F_{2m,2m}.

H small ⇒ regularity
H large ⇒ aggregation
But as noted above, Hopkins’ test is not quite sound as we cannot get a random sample
of Yi ’s.
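For illustration only, the following sketch computes Hopkins' H from a completely mapped simulated pattern, drawing the Y_i from a random subset of events exactly as the "for the sake of argument" derivation assumes; the data and m are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
events = rng.uniform(0, 1, size=(200, 2))                   # simulated CSR
m = 20
pts = rng.uniform(0, 1, size=(m, 2))                        # sample points
sub = events[rng.choice(len(events), m, replace=False)]     # sampled events

def nearest(from_pts, to_pts, exclude_self=False):
    D = np.linalg.norm(from_pts[:, None, :] - to_pts[None, :, :], axis=2)
    if exclude_self:
        D[D == 0] = np.inf        # drop zero self-distances
    return D.min(axis=1)

X = nearest(pts, events)                      # point-to-nearest-event
Y = nearest(sub, events, exclude_self=True)   # event-to-NN
H = np.sum(X ** 2) / np.sum(Y ** 2)           # ~ F(2m, 2m) under CSR
print("H =", H)
```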
• Alternatively, consider T-square sampling:
Let X_i be the distance from a sample point to the nearest event, and Z_i the distance
from that nearest event to its NN within the half-plane "perpendicular" to the chord
from the point to the nearest event. Thus, the search area associated with Z_i is a
semicircle, not a circle.
By an argument similar to one given before,

    λ Σ_{i=1}^m πZ_i² ∼ χ²_{2m}.
Then, we can test for CSR using

    t ≡ [2λ Σ_{i=1}^m πX_i²] / [λ Σ_{i=1}^m πZ_i²] = 2 Σ_i X_i² / Σ_i Z_i² ∼ F_{2m,2m}.
Several other distance-based tests for CSR have been proposed; see, for example, the book by Cressie for more of these.