Applied Statistics: Practical 11

This practical will introduce basic tools for geostatistics in R. You may need first to install and load a few packages. The packages sp and lattice contain useful functions and structures for the management of spatially distributed data. The package gstat provides tools for the analysis of geostatistical data.

1. Exploratory data analysis of geostatistical data

In this item we consider a classical dataset in geostatistics which is available in the sp package. The data set consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables. The samples were collected in a flood plain of the river Meuse, near the village of Stein (The Netherlands).

> library(sp)
> data(meuse)
> head(meuse)
##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil lime landuse dist.m
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1    1      Ah     50
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1    1      Ah     30
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1    1      Ah    150
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2    0      Ga    270
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2    0      Ah    380
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2    0      Ga    470

You can see that the dataset reports the geographical coordinates (x and y) as well as the measurements associated with each data point. We first need to instruct R about which columns correspond to the coordinates:

> coordinates(meuse) <- c('x','y')

This changes the nature of the dataset, which is now treated as a SpatialPointsDataFrame, i.e. a data frame with an associated spatial location. This is one of the data structures available in R for spatial data and it is provided by the package sp.
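If the packages are not already installed, they can be obtained from CRAN first; a minimal sketch (package names as in the text):

```r
# install once if needed (commented out to avoid re-installing on every run)
# install.packages(c("sp", "lattice", "gstat"))

# load the packages used throughout the practical
library(sp)
library(lattice)
library(gstat)
```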
If we now try to plot the data, R gives us the locations of the data points:

> plot(meuse)

Other types of plots are available to explore the other variables in the data set, for example the quantity of zinc in the soil:

> spplot(meuse,'zinc',do.log=TRUE)
> bubble(meuse,'zinc',do.log=TRUE)

Can you interpret the output of these plots? (Check the help if needed.) In these plots, each data point is associated with an interval of zinc quantity and the coding is based either on colour (spplot) or on the size of the dot (bubble). A logarithmic scale is used to determine the intervals in which to partition the range of the quantity of zinc.

While the SpatialPointsDataFrame structure is usually preferred for geostatistical data, other types of structure are available in the sp package. For example, the river borders can be described (and plotted) using SpatialPolygons:

> data(meuse.riv)
> meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
> meuse.sr <- SpatialPolygons(meuse.lst)
> plot(meuse.sr, col = "grey")
> plot(meuse, add = TRUE)

Looking at the geography, can you suggest any interpretation for the variation in the quantity of zinc? The quantity of zinc appears to be larger the closer the data points are to the river.

To explore the spatial dependence, we can first plot the semivariogram cloud, i.e. the empirical semivariogram for all the distances observed in the dataset (we transform the quantity of zinc to the log scale first):

> library(gstat)
> cld <- variogram(log(zinc) ~ 1, data=meuse, cloud = TRUE)
> plot(cld, main = 'Semivariogram cloud')

[Figure: semivariogram cloud, semivariance against distance]

or, better, the binned semivariogram:

> svgm <- variogram(log(zinc) ~ 1, width=100, data=meuse)
> plot(svgm, main = 'Binned Semivariogram', pch=19)

[Figure: binned semivariogram, semivariance against distance]

The parameter width controls the bandwidth; try to change the value and see what happens.
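As a sketch of the suggested experiment (the bin widths below are chosen purely for illustration), several bandwidths can be compared side by side:

```r
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- c('x','y')

# compare binned semivariograms for a few bin widths
for (w in c(50, 100, 200)) {
  svgm.w <- variogram(log(zinc) ~ 1, width = w, data = meuse)
  print(plot(svgm.w, main = paste('Binned semivariogram, width =', w), pch = 19))
}
```

Small widths give noisier bins (fewer pairs per bin); large widths give smoother but coarser estimates.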
It is also possible to include covariates in the formula (in place of 1), to account for a drift term in the model.

Does the plot of the binned semivariogram suggest the presence of spatial dependence? If yes, does the process appear to be (second order) stationary? The binned semivariogram suggests the presence of spatial dependence because its values increase with the spatial distance. The process appears to be stationary because the binned semivariogram does not diverge to plus infinity for large distances.

We can then fit a model semivariogram to the binned semivariogram, via weighted least squares, using the vgm and fit.variogram functions. vgm provides the expressions for a gallery of model semivariograms; you can type vgm() for a complete list. Let us now fit a spherical semivariogram:

> sph.model <- fit.variogram(svgm, vgm(psill=0.6, "Sph", range=800, nugget=0.2))
> plot(svgm, sph.model)
> sph.model

We needed to specify a starting point for the model parameters in the optimization algorithm: the partial sill (0.6), the range (800) and the nugget (0.2). Reasonable choices for these starting values can be obtained by looking at the binned variogram plot. In particular, the algorithm may fail if you select completely unreasonable choices for the (effective) range.

Try now to fit an exponential semivariogram model. What are the estimated nugget, sill and effective range? Which one best fits the binned semivariogram?

> exp.model <- fit.variogram(svgm, vgm(0.6, "Exp", 800, 0.2))
> plot(svgm, exp.model)
> exp.model

[Figure: binned semivariogram with the fitted exponential model]

The estimates are: nugget τ̂² = 0.0143, sill σ̂² = 0.0143 + 0.7148 = 0.7291, and effective range R̃ = 3 × 477.1374 = 1431.412. The spherical semivariogram is the better fit for the data, because the fitted exponential model has an effective range larger than the one observed in the data.

2. Simulated examples

In this item we consider a few simulated examples to explore the empirical and fitted variogram in known situations.
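For the exponential model the effective (practical) range is conventionally three times the fitted range parameter. A small sketch of how these estimates could be extracted from the fitted object (fit.variogram returns a data frame with columns model, psill and range):

```r
# assumes exp.model <- fit.variogram(svgm, vgm(0.6, "Exp", 800, 0.2)) as in the text
nugget <- exp.model$psill[exp.model$model == "Nug"]        # tau^2
partial.sill <- exp.model$psill[exp.model$model == "Exp"]  # partial sill
sill <- nugget + partial.sill                              # sigma^2
eff.range <- 3 * exp.model$range[exp.model$model == "Exp"] # effective range for "Exp"
c(nugget = nugget, sill = sill, effective.range = eff.range)
```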
To simulate a spatially dependent random field, we can use the gstat package.

> library(gstat)
> # define first a grid for simulations
> xy <- expand.grid(1:100, 1:100)
> names(xy) <- c("x","y")
> # spherical example
> sph.field <- gstat(formula=z~1, locations=~x+y, dummy=T, beta = 1,
+                    model=vgm(psill=0.025, range=10, model='Sph'), nmax=20)
> sph.sim <- predict(sph.field, newdata=xy, nsim=1)

## [using unconditional Gaussian simulation]

> # to plot the simulated field
> library(sp)
> gridded(sph.sim) = ~x+y
> spplot(sph.sim[1])

Try changing the sill and the range to appreciate how they influence the realization of the random field. Estimate the binned semivariogram and fit a spherical semivariogram model.

> sph.vario <- variogram(sim1 ~ 1, width=5, data=sph.sim)
> sph.model <- fit.variogram(sph.vario, vgm(0.02, "Sph", 10, 0.005))
> plot(sph.vario, sph.model)

[Figure: binned semivariogram of the simulated field with the fitted spherical model]

Simulate then a new random field with a pure nugget (i.e. spatial white noise), by choosing the Nug model. What happens if you try to fit a spherical semivariogram in this case?

> # nugget simulation
> noise.field <- gstat(formula=z~1, locations=~x+y, dummy=T, beta = 1,
+                      model=vgm(psill=0, range=0, model='Nug', nugget=0.025), nmax=20)
> noise.sim <- predict(noise.field, newdata=xy, nsim=1)

## [using unconditional Gaussian simulation]

> # to plot the simulated field
> gridded(noise.sim) = ~x+y
> spplot(obj=noise.sim[1])

> svgm <- variogram(sim1 ~ 1, width=5, data=noise.sim)
> sph.model <- fit.variogram(svgm, vgm(0.02, "Sph", 10, 0.005))

## Warning in fit.variogram(svgm, vgm(0.02, "Sph", 10, 0.005)): variogram fit singular model

> plot(svgm, sph.model)

R gives a warning because the model is degenerate. However, the fit may not be too bad: if the sill and the nugget almost coincide then, whatever the range, the semivariogram is essentially constant. This estimate is also very dependent on the width of the bins in the binned semivariogram.
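As a sketch of the suggested experiment (parameter values chosen purely for illustration), the same simulation pipeline can be re-run for a few ranges to see how the range controls the size of the spatial patches:

```r
library(sp)
library(gstat)
xy <- expand.grid(1:100, 1:100)
names(xy) <- c("x","y")

# one unconditional simulation per range, plotted side by side
for (r in c(5, 10, 25)) {
  fld <- gstat(formula = z ~ 1, locations = ~x+y, dummy = TRUE, beta = 1,
               model = vgm(psill = 0.025, range = r, model = 'Sph'), nmax = 20)
  sim <- predict(fld, newdata = xy, nsim = 1)
  gridded(sim) <- ~x+y
  print(spplot(sim[1], main = paste('Spherical, range =', r)))
}
```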
3. Kriging prediction

Let us consider again the Meuse dataset, now with the aim of reconstructing a smooth surface of the (logarithm of the) lead concentration.

> library(gstat)
> library(sp)
> data(meuse)
> coordinates(meuse) <- c('x','y')
> # river meuse
> data(meuse.riv)
> meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
> meuse.sr <- SpatialPolygons(meuse.lst)
> # grid for prediction
> data(meuse.grid)
> coordinates(meuse.grid) <- c('x','y')
> meuse.grid <- as(meuse.grid, 'SpatialPixelsDataFrame')

Let us start by assuming the process is stationary and thus estimating the semivariogram first:

> v <- variogram(log(lead) ~ 1, meuse)
> plot(v, pch=19)
> # spherical
> v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))
> plot(v, v.fit, pch = 19)
> # exponential
> v.fit <- fit.variogram(v, vgm(1, "Exp", 800, 1))
> plot(v, v.fit, pch = 19)
> # gaussian
> v.fit <- fit.variogram(v, vgm(1, "Gau", 800, 1))
> plot(v, v.fit, pch = 19)

Which model provides the best fit? The spherical model is the one that best fits the binned semivariogram.

> v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))

There are two alternative ways to obtain the kriging prediction: by using the krige function, or by fitting a model with gstat and then using the predict method on that model. Both do the same mathematical operations (actually, krige is just a wrapper for gstat and predict). Let us consider the simple kriging prediction first (we pretend to know the true mean, even though we estimate it from the data).
> beta.hat <- mean(log(meuse$lead)) # assumed known
> # simple kriging prediction
> lz.sk <- krige(log(lead)~1, meuse, meuse.grid, v.fit, beta = beta.hat)

## [using simple kriging]

> plot(lz.sk)
> # or
> lz.mod <- gstat(formula=log(lead)~1, data=meuse, model=v.fit, beta = beta.hat)
> lz.sk2 <- predict(lz.mod, meuse.grid)

## [using simple kriging]

> plot(lz.sk2)
> # the spplot gives you the prediction
> # variance as well (although the colorscale is not great)
> spplot(lz.sk, col.regions=bpy.colors(n = 100, cutoff.tails = 0.2),
+        main = 'Simple Kriging')

The ordinary kriging prediction can easily be obtained by not providing a mean value:

> ## ordinary kriging
> lz.ok <- krige(log(lead)~1, meuse, meuse.grid, v.fit)

## [using ordinary kriging]

Looking at the concentration of lead and the position of the observations with respect to the river,

> data(meuse.riv)
> meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
> meuse.sr <- SpatialPolygons(meuse.lst)
> plot(meuse.sr, col = "grey")
> plot(meuse, add = TRUE)
> spplot(meuse,'lead',do.log=TRUE)

we can see that the lead concentration appears to change with the distance from the river (which is one of the variables in the dataset, dist). Let us consider first the scatterplot of log(lead) and dist.

> plot(meuse$dist, log(meuse$lead))

[Figure: scatterplot of log(meuse$lead) against meuse$dist]

This suggests fitting a universal kriging model where the mean is a function of the distance from the river (possibly a square root, looking at the scatterplot).
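To see how much the simple and ordinary kriging surfaces actually differ, one could subtract the two prediction columns; a sketch assuming lz.sk and lz.ok have been computed as above (krige stores the predictions in the var1.pred column):

```r
# difference between simple and ordinary kriging predictions
# (assumes lz.sk and lz.ok as computed in the text)
lz.diff <- lz.sk["var1.pred"]
lz.diff$var1.pred <- lz.sk$var1.pred - lz.ok$var1.pred
spplot(lz.diff, main = "SK minus OK prediction")
```

The two surfaces agree closely near the data points and differ most where the local mean estimated by ordinary kriging departs from the global mean used by simple kriging.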
> lead.gstat <- gstat(id = 'lead', formula = log(lead) ~ sqrt(dist),
+                     data = meuse, model=v.fit)
> lead.gstat

## data: lead : formula = log(lead)`~`sqrt(dist) ; data dim = 155 x 12
## variograms:
##         model      psill    range
## lead[1]   Nug 0.05156237   0.0000
## lead[2]   Sph 0.51530657 965.1491

> lead.uk <- predict(lead.gstat, newdata = meuse.grid)

## [using universal kriging]

> plot(lead.uk, main="Universal kriging")

The gstat function automatically (and iteratively) estimates the drift, the residuals and the residual variogram; the predict function then computes the universal kriging predictor for the new locations. It is also possible to get an evaluation of the drift using the option BLUE=TRUE:

> lead.trend <- predict(lead.gstat, newdata = meuse.grid, BLUE=TRUE)

## [generalized least squares trend estimation]

> plot(lead.trend, main="Drift")

You can also get the predicted value of the field or the estimated trend at a specific new observation (but note that you also need to provide the corresponding value of the predictors in the non-stationary mean model, in this case the distance from the river):

> new_obs <- data.frame(x=179660, y=331860, dist=0.124805)
> coordinates(new_obs) <- c('x','y')
> predict(lead.gstat, newdata=new_obs)

## [using universal kriging]
##        coordinates lead.pred  lead.var
## 1 (179660, 331860)   4.59488 0.1460198

> predict(lead.gstat, newdata=new_obs, BLUE=TRUE)

## [generalized least squares trend estimation]
##        coordinates lead.pred   lead.var
## 1 (179660, 331860)   4.92977 0.03846024
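Beyond eyeballing the fitted semivariograms, competing kriging models can be compared with leave-one-out cross-validation via gstat's krige.cv function; a sketch (the choice of summary statistic is illustrative):

```r
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- c('x','y')

v <- variogram(log(lead) ~ 1, meuse)
v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))

# leave-one-out cross-validation of the ordinary kriging model
cv <- krige.cv(log(lead) ~ 1, meuse, v.fit)

# root mean squared prediction error: smaller is better
sqrt(mean(cv$residual^2))
```

The same computation with the exponential or Gaussian fit (or with the universal kriging formula log(lead) ~ sqrt(dist)) gives a quantitative basis for choosing between the models.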