Download Applied Statistics : Practical 11

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Theoretical computer science wikipedia , lookup

General circulation model wikipedia , lookup

Data analysis wikipedia , lookup

History of numerical weather prediction wikipedia , lookup

Pattern recognition wikipedia , lookup

Operational transformation wikipedia , lookup

Inverse problem wikipedia , lookup

Multidimensional empirical mode decomposition wikipedia , lookup

Corecursion wikipedia , lookup

Generalized linear model wikipedia , lookup

Computer simulation wikipedia , lookup

Data assimilation wikipedia , lookup

Transcript
Applied Statistics : Practical 11
This practical will introduce basic tools for geostatistics in R. You may need first to install and load a
few packages. The packages sp and lattice contain useful function and structures for the management
of spatially distributed data. The package gstat provides tools for the analysis of geostatistical data.
1. Exploratory data analysis of geostatistical data
In this item we consider a classical dataset in geostatistics which is avaliable in the sp package.
The data set consists of 155 samples of top soil heavy metal concentrations (ppm), along with a
number of soil and landscape variables. The samples were collected in a flood plain of the river
Meuse, near the village Stein (The Netherlands).
> library(sp)
> data(meuse)
> head(meuse)
##
##
##
##
##
##
##
##
##
##
##
##
##
##
1
2
3
4
5
6
1
2
3
4
5
6
x
y cadmium copper lead zinc elev
dist
om ffreq soil
181072 333611
11.7
85 299 1022 7.909 0.00135803 13.6
1
1
181025 333558
8.6
81 277 1141 6.983 0.01222430 14.0
1
1
181165 333537
6.5
68 199 640 7.800 0.10302900 13.0
1
1
181298 333484
2.6
81 116 257 7.655 0.19009400 8.0
1
2
181307 333330
2.8
48 117 269 7.480 0.27709000 8.7
1
2
181390 333260
3.0
61 137 281 7.791 0.36406700 7.8
1
2
lime landuse dist.m
1
Ah
50
1
Ah
30
1
Ah
150
0
Ga
270
0
Ah
380
0
Ga
470
You can see that the dataset reports the geographical coordinates (x and y) as well as the
measurements associated to each data point. We need first to instruct R about which column
correspond to the coordinates:
> coordinates(meuse) <- c('x','y')
This changes the nature of the dataset, which is now treated as a SpatialPointsDataFrame,
i.e. a data frame which associated spatial location. This is one of the data structure that are
available in R for spatial data and it is provided by the package sp. If we now try to plot the
data, R gives as the locations of the data point:
> plot(meuse)
Other types of plots are avalable to explore the other variables in the data set, for example the
quantity of zinc in the soil:
> spplot(meuse,'zinc',do.log=TRUE)
> bubble(meuse,'zinc',do.log=TRUE)
Can you interpret the output of this plot? (Check the help if needed) In these plots, each data
point is associated to an interval of zinc quantity and the coding is based either on color (spplot)
1
or the dimension of the dot (bubble). A logarithmic scale is used to determine the intervals in
which to partition the range of quantity of zinc. While the SpatialPointsDataFrame structure is
usually preferred for geostatistical data, other types of structure are available in the sp package.
For example, the river borders can be described (and plotted) using SpatialPolygons
>
>
>
>
>
data(meuse.riv)
meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
meuse.sr <- SpatialPolygons(meuse.lst)
plot(meuse.sr, col = "grey")
plot(meuse, add = TRUE)
Looking at the geography, can you suggest any interpretation for the variation in the quantity
of zinc? The quantity of zinc appears to be larger the closer the data points are to the river.
To explore the spatial dependence, we can first plot the semivariogram cloud, i.e. the empirical
semivarogram for all the distances observed in the dataset (we are transforming the quantity of
zinc on the log scale first):
> library(gstat)
> cld <- variogram(log(zinc) ~ 1, data=meuse, cloud = TRUE)
> plot(cld, main = 'Semivariogram cloud')
2
Semivariogram cloud
semivariance
3
2
1
500
1000
1500
distance
or better the binned semivariogram
> svgm <- variogram(log(zinc) ~ 1, width=100, data=meuse)
> plot(svgm, main = 'Binned Semivariogram',pch=19)
3
Binned Semivariogram
semivariance
0.6
0.4
0.2
500
1000
1500
distance
The parameter width controls the bandwidth, try to change the value and see what happens.
It is also possible to include covariates in the formula (in place of 1), to account for a drift
term in the model. Does the plot of the binned semivariogram suggests the presence of spatial
dependence? If yes, does the process appear to be (second order) stationary?
The binned semivariogram suggests the presence of spatial dependence beacuse its values increase
with the spatial distance. The process appears to be stationary beacuse the binned semivariogram
does not diverge to plus infinity for large distances.
We can then fit a model semivariogram to the binned semivariogram, via weighted least squares,
using the vgm and fit.variogram function. vgm provides the expressions for a galley of model
semivariograms, you can type vgm() for a complete list. Let us now fit a spherical semivariogram:
> sph.model<-fit.variogram(svgm, vgm(psill=0.6, "Sph", range=800, nugget=0.2))
> plot(svgm,sph.model)
> sph.model
We needed to specify a starting point for the model parameters in the optimization algorithm:
the partial sill (0.6), the range (800) and the nugget (0.2). Reasonable choices for these starting
values can be obtained looking at the binned variogram plot. In particular, the algorithm
may fail if you select completely unreasonable choices for the (effective) range. Try now to fit
an exponential semivariogram model, what are the estimated nugget, sill and effective range?
Which one best fits the binned semivariogram?
4
> exp.model<-fit.variogram(svgm, vgm(0.6, "Exp", 800, 0.2))
> plot(svgm,exp.model)
semivariance
0.6
0.4
0.2
500
1000
1500
distance
> exp.model
e = 3 ∗ 477.1374 = 1431.412. The spherical
τb2 = 0.0143, σ
b2 = 0.0143 + 0.7148 = 0.729 and R
semivariogram is a best fit for the data, because the fitted exponential model has a range larger
than the one observed in the data.
2. Simulated examples
In this item we consider a few simulated examples to explore the empirical and fitted variogram
in known situations. To simulate a spatially dependent random field, we can use the gstat
package.
>
>
>
>
>
>
>
+
>
library(gstat)
# define first a grid for simulations
xy <- expand.grid(1:100, 1:100)
names(xy)<-c("x","y")
# spherical example
sph.field<-gstat(formula=z~ 1,locations=~x+y, dummy=T, beta = 1,
model=vgm(psill=0.025, range=10, model='Sph'), nmax=20)
sph.sim<-predict(sph.field,newdata=xy,nsim=1)
## [using unconditional Gaussian simulation]
5
>
>
>
>
# to plot the simulated field
library(sp)
gridded(sph.sim) = ~x+y
spplot(sph.sim[1])
Try changing the sill and the range to appreciate how they influence the realization of the random
field.
Estimate the binned semivariogram and fit a spherical semivariogram model.
> sph.vario <- variogram( sim1~ 1, width=5, data=sph.sim)
> sph.model<-fit.variogram(sph.vario, vgm(0.02, "Sph", 10, 0.005))
> plot(sph.vario,sph.model)
semivariance
0.020
0.015
0.010
0.005
10
20
30
40
distance
Simulate then a new random field with a pure nugget (i.e. spatial white noise), by chosing the
Nug model. What happens if you try to fit a spherical semivariogram in this case?
> # nugget simulation
> noise.field<-gstat(formula=z~ 1,locations=~x+y, dummy=T, beta = 1, model=vgm(psill=0,ran
> noise.sim<-predict(noise.field,newdata=xy,nsim=1)
## [using unconditional Gaussian simulation]
> # to plot the simulated field
> gridded(noise.sim) = ~x+y
6
> spplot(obj=noise.sim[1])
> svgm <- variogram(sim1 ~ 1, width=5,data=noise.sim)
> sph.model<-fit.variogram(svgm, vgm(0.02, "Sph", 10, 0.005))
## Warning in fit.variogram(svgm, vgm(0.02, "Sph", 10, 0.005)):
variogram fit
singular model in
> plot(svgm,sph.model)
R gives a warning because the model is degenerate. However, the fit may be not too bad, because
if the sill and the nugget almost coincide whatever the range the semivariogram is essentially
constant. This estimates is also very dependent on the width of the bins in the binned semivariogram.
3. Kriging prediction
Let us consider again the Meuse dataset, now with the aim of reconstructing a smooth surface
of the (logarithm of) lead concentration.
>
>
>
>
>
>
>
>
>
>
>
>
>
library(gstat)
library(sp)
data(meuse)
coordinates(meuse) <- c('x','y')
# river meuse
data(meuse.riv)
meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
meuse.sr <- SpatialPolygons(meuse.lst)
# grid for prediction
data(meuse.grid)
coordinates(meuse.grid) <- c('x','y')
meuse.grid <- as(meuse.grid, 'SpatialPixelsDataFrame')
Let us start by assuming the process as stationary and thus estimating the semivariogram first:
>
>
>
>
>
>
>
>
>
>
>
v <- variogram(log(lead) ~ 1, meuse)
plot(v,pch=19)
#spherical
v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))
plot(v, v.fit, pch = 19)
#exponential
v.fit <- fit.variogram(v, vgm(1, "Exp", 800, 1))
plot(v, v.fit, pch = 19)
# gaussian
v.fit <- fit.variogram(v, vgm(1, "Gau", 800, 1))
plot(v, v.fit, pch = 19)
Which model provides the best fit? The spherical model is the one that fits best the binned
semivariogram.
> v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))
7
There are two alternative ways to obtain the kriging prediction, by using the krige function or
by fitting a model with gstat and then use the predict option for that model. Both are doing
the same mathematical operations (actually, krige is just a wrapper for gstat and predict ).
Let us consider the simple kriging prediction first (we to pretend to know the true mean, even
if we estimate it from the data).
>
>
>
>
#
beta.hat <- mean(log(meuse$lead)) # assumed known
#simple kriging prediction
lz.sk <- krige(log(lead)~1, meuse, meuse.grid, v.fit, beta = beta.hat)
## [using simple kriging]
>
>
>
>
plot(lz.sk)
# or
lz.mod<-gstat(formula=log(lead)~1, data=meuse, model=v.fit, beta = beta.hat)
lz.sk2<-predict(lz.mod,meuse.grid)
## [using simple kriging]
>
>
>
>
>
+
plot(lz.sk2)
# the spplot gives you prediction
# variance as well (although the colorscale is not great)
spplot(lz.sk, col.regions=bpy.colors(n = 100, cutoff.tails = 0.2),
main = 'Simple Kriging')
The ordinary kriging prediction can be easily obtained by not providing a mean value:
> ## ordinary kriging
> lz.ok <- krige(log(lead)~1, meuse, meuse.grid, v.fit)
## [using ordinary kriging]
Looking at the concentration of lead and the position of the observation with respect to the
river,
>
>
>
>
>
>
data(meuse.riv)
meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
meuse.sr <- SpatialPolygons(meuse.lst)
plot(meuse.sr, col = "grey")
plot(meuse, add = TRUE)
spplot(meuse,'lead',do.log=TRUE)
we can see that the lead concentration appears to change with the distance from the river (which
is one of the parameter in the dataset, dist). Let us consider first the scatterplot of log(lead)
and dist.
> plot(meuse$dist,log(meuse$lead))
8
6.5
6.0
5.5
5.0
4.5
3.5
4.0
log(meuse$lead)
0.0
0.2
0.4
0.6
0.8
meuse$dist
This suggests to fit a universal kriging model where the mean is a function of the distance from
the river (possibly a square root, looking at the scatterplot).
> lead.gstat <- gstat(id = 'lead', formula = log(lead) ~ sqrt(dist),
+
data = meuse, model=v.fit)
> lead.gstat
##
##
##
##
##
##
data:
lead : formula = log(lead)`~`sqrt(dist) ; data dim = 155 x 12
variograms:
model
psill
range
lead[1]
Nug 0.05156237
0.0000
lead[2]
Sph 0.51530657 965.1491
> lead.uk<-predict(lead.gstat,newdata = meuse.grid)
## [using universal kriging]
> plot(lead.uk,main="Universal kriging")
The gstat function automatically (and iteratively) estimates the drift, the residuals and the
residual variogram, then the predict function compute the universal kriging predictor for the
new locations. It is also possible to get an evaluation of the drift using the option BLUE=TRUE:
9
> lead.trend<-predict(lead.gstat,newdata = meuse.grid,BLUE=TRUE)
## [generalized least squares trend estimation]
> plot(lead.trend,main="Drift")
You can also get the predicted value of the field or the estimated trend in a specific new observation (but note that you need to provide also the correspondent value for the predictors in the
non-stationary mean model, in this case the distance from the river)
> new_obs<-data.frame(x=179660,y=331860,dist=0.124805)
> coordinates(new_obs)<-c('x','y')
> predict(lead.gstat,newdata=new_obs)
## [using universal kriging]
##
coordinates lead.pred lead.var
## 1 (179660, 331860)
4.59488 0.1460198
> predict(lead.gstat,newdata=new_obs,BLUE=TRUE)
## [generalized least squares trend estimation]
##
coordinates lead.pred
lead.var
## 1 (179660, 331860)
4.92977 0.03846024
10