Combining Judgment and Probability Sample Data Across Space and Time

Stephen L. Rathbun
Department of Statistics, University of Georgia
Athens, Georgia 30602, USA
[email protected]

Programs that monitor aquatic resources typically employ judgment sampling designs (Olsen et al. 1999), in which sample sites are selected by experts according to a number of often vaguely defined criteria (Gilbert 1987). However, expert choice often results in selection bias (e.g., Smith 1983). Only through the implementation of a probability sampling design can design-unbiased estimates be guaranteed (Hansen et al. 1983). Nevertheless, managers of water quality monitoring programs are reluctant to implement probability sampling designs. Much of this reluctance stems from the fear of losing information from the historical database. Therefore, probability sampling designs are not likely to be widely accepted unless statistical approaches for combining data from judgment and probability sampling designs are available.

Unfortunately, methods for combining such data have received very little attention in the statistical literature. Overton, Young, and Overton (1993) use sampling frame attributes to assign judgment sites to clusters of similar probability sites. Judgment sites assigned to a given cluster are assumed to be representative of that cluster, and are treated as though they were obtained from a probability sampling design. However, the representativeness of the judgment sites with respect to their assigned clusters is difficult to diagnose, and if the assumption is false, the combined data may yield biased estimates (Cox and Piegorsch 1996).

The following proposes an approach to combining data from historical judgment sites with new data from probability sites. This approach requires an interval of overlap in which both historical judgment sites and new probability sites are sampled.
The spatio-temporal correlation between the two collections of sites is then exploited to predict what data would have been obtained had a probability sampling design been implemented from the very beginning.

Spatio-Temporal Prediction

The following assumes that the data from both sampling designs are a partial realization of a spatio-temporal random field. In particular, assume that the data $Z(s,t)$ at site $s$ at time $t$ are realized from the model

$$Z(s,t) = \beta_0 + \beta_1 x_1(s,t) + \cdots + \beta_p x_p(s,t) + \varepsilon(s,t), \tag{1}$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are model parameters, and the explanatory variables $x_1(s,t), \ldots, x_p(s,t)$ may be functions of the spatial coordinates, time, distances to known geographic features, or environmental variables. The errors $\varepsilon(s,t)$ are assumed to be realized from a zero-mean random field with variogram

$$2\gamma(h,r) = \mathrm{var}\{Z(s,t) - Z(s+h,\,t+r)\}. \tag{2}$$

Typically, the variogram is assumed to take a parametric form $2\gamma(h,r;\theta)$.

Suppose that the judgment sample data are $\{Z(s_i,t) : i = 1, \ldots, n;\ t = 1, \ldots, T\}$ and the probability sample data are $\{Z(u_i,t) : i = 1, \ldots, m;\ t = M, \ldots, T\}$. Thus, data are collected from both sets of sites in the interval $M, \ldots, T$. Universal kriging (Cressie 1991) is used to back predict what data would have been obtained had a probability sampling design been implemented from the very beginning of the monitoring program; that is, to predict $\{Z(u_i,t) : i = 1, \ldots, m;\ t = 1, \ldots, M-1\}$. To reduce the problem to manageable dimensions, only data from the judgment sample at time $t$ and the probability sample at time $M$ are used to predict the unobserved values of the data from the probability sample at time $t$. Thus, our predictor is

$$\hat{Z}(u_k,t) = \sum_{i=1}^{n} \lambda_{1i} Z(s_i,t) + \sum_{i=1}^{m} \lambda_{2i} Z(u_i,M), \tag{3}$$

where the $\lambda$'s are chosen to minimize the mean squared prediction error subject to the constraint that the resulting predictor is unbiased.
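As a rough illustration of how the weights in predictor (3) might be computed, the following Python sketch solves the universal kriging system for a single prediction point. The exponential space-time semivariogram and its parameter values, the trend that is linear in the coordinates, the site counts, and all function and variable names are assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np

def semivariogram(h, r, sill=1.0, range_s=20.0, range_t=10.0):
    # Assumed exponential space-time semivariogram gamma(h, r); illustrative only.
    return sill * (1.0 - np.exp(-3.0 * h / range_s - 3.0 * r / range_t))

def uk_weights(coords, times, X, coord0, time0, x0):
    """Universal kriging weights for predicting at (coord0, time0).

    coords: (N, 2) data locations; times: (N,) data times;
    X: (N, p) trend covariates at the data points (first column of ones);
    x0: (p,) trend covariates at the prediction point.
    """
    n, p = X.shape
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    r = np.abs(times[:, None] - times[None, :])
    Gamma = semivariogram(h, r)
    gamma0 = semivariogram(np.linalg.norm(coords - coord0, axis=1),
                           np.abs(times - time0))
    # Augmented system: Gamma @ lam + X @ mu = gamma0 and X.T @ lam = x0.
    # The second block is the unbiasedness constraint.
    A = np.block([[Gamma, X], [X.T, np.zeros((p, p))]])
    b = np.concatenate([gamma0, x0])
    return np.linalg.solve(A, b)[:n]

# Hypothetical configuration: n judgment sites at time t, m probability sites at time M.
rng = np.random.default_rng(0)
n, m, t, M = 5, 4, 3.0, 10.0
s = rng.uniform(0, 100, size=(n, 2))   # judgment sites
u = rng.uniform(0, 100, size=(m, 2))   # probability sites
coords = np.vstack([s, u])
times = np.concatenate([np.full(n, t), np.full(m, M)])
X = np.column_stack([np.ones(n + m), coords])   # assumed trend: 1, x, y
target = u[0]                                   # back predict at u_1, time t
x0 = np.array([1.0, target[0], target[1]])
lam = uk_weights(coords, times, X, target, t, x0)
lam1, lam2 = lam[:n], lam[n:]   # weights on Z(s_i, t) and on Z(u_i, M)
```

Because the trend includes a constant term, the unbiasedness constraint forces the weights to sum to one. The first `n` weights multiply the judgment data at time `t` and the remaining `m` multiply the probability data at time `M`.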
SIMULATION STUDY RESULTS

Data were simulated from the spatio-temporal model

$$Z(s,t) = a_0 f(t) + a_s \eta_s(s) + a_t \eta_t(t) + a_{st} \eta_{st}(s,t) + a_\varepsilon \varepsilon(s,t). \tag{4}$$

The function

$$f(t) = \begin{cases} 0.05t, & t \le 10 \\ 0.5, & 10 \le t \le 30 \\ 0.5 + 0.025t, & t > 30 \end{cases} \tag{5}$$

models the background temporal trend. The zero-mean, unit-variance spatial random field $\eta_s(s)$ has an exponential spatial correlation function $\rho_s(h) = \exp\{-3\|h\|/\theta_s\}$ with a long range of spatial dependence of $\theta_s = 200$ km; for the current application, it can be considered to model the spatial trend in the data. Likewise, the zero-mean, unit-variance temporal process $\eta_t(t)$ has an exponential correlation function $\rho_t(r) = \exp\{-3r/\theta_t\}$ with a long range of temporal correlation of $\theta_t = 3000$ years. The spatio-temporal process $\eta_{st}(s,t)$ allows the temporal trend to depend on location; it has zero mean, unit variance, and exponential correlation function $\rho_{st}(h,r) = \exp\{-3\|h\|/\theta'_s - 3r/\theta'_t\}$ with relatively short ranges of spatial and temporal correlation set at $\theta'_s = 20$ km and $\theta'_t = 10$ years. All three of these processes were simulated using the spectral method (Shinozuka 1971). The error $\varepsilon(s,t)$ is Gaussian white noise with unit variance, and models the effects of measurement error.

The relative influence of the four component processes on the resulting data can be explored by varying the levels of the coefficients $a_0$, $a_s$, $a_t$, $a_{st}$, and $a_\varepsilon$. The following illustrates the proposed method of back prediction taking $a_0 = 5$, $a_s = 10$, $a_t = 3$, $a_{st} = 3$, and $a_\varepsilon = 0.5$.

Two samples of data were generated. A total of 100 probability sites were obtained from a simple random sample over a $100 \times 100$ km region. For the judgment sample, an additional 100 sites were independently selected from the density proportional to

$$p(s) = \frac{\exp\{-1 + 4\eta_s(s)\}}{1 + \exp\{-1 + 4\eta_s(s)\}}.$$

The latter design yields a sample that is biased in favor of high values.
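The two sampling designs described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the spatial field is drawn by Cholesky factorization on a finite pool of candidate locations rather than by the spectral method used in the paper, the judgment sites are drawn without replacement from that pool with probability proportional to p(s) (an approximation to sampling from the continuous density), and the pool size, seed, and all names are hypothetical.

```python
import numpy as np

def trend(t):
    # Piecewise background temporal trend f(t) of equation (5).
    t = np.asarray(t, dtype=float)
    return np.where(t <= 10, 0.05 * t,
                    np.where(t <= 30, 0.5, 0.5 + 0.025 * t))

def simulate_field(coords, corr_range, rng, jitter=1e-8):
    # Zero-mean, unit-variance Gaussian field with exponential correlation
    # exp(-3 * distance / corr_range), drawn by Cholesky factorization
    # (a stand-in for the spectral method; an assumption of this sketch).
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = np.exp(-3.0 * d / corr_range) + jitter * np.eye(len(coords))
    return np.linalg.cholesky(C) @ rng.standard_normal(len(coords))

rng = np.random.default_rng(1)

# Hypothetical pool of candidate locations in the 100 x 100 km region.
candidates = rng.uniform(0, 100, size=(1000, 2))
field = simulate_field(candidates, corr_range=200.0, rng=rng)

# Logistic inclusion density: exp(-1 + 4 * field) / (1 + exp(-1 + 4 * field)).
p = 1.0 / (1.0 + np.exp(-(-1.0 + 4.0 * field)))

# Judgment sample: 100 candidates drawn with probability proportional to p,
# which favors locations where the spatial field is high.
judgment_idx = rng.choice(len(candidates), size=100, replace=False, p=p / p.sum())
judgment_sites = candidates[judgment_idx]

# Probability sample: simple random sample of 100 locations over the region.
probability_sites = rng.uniform(0, 100, size=(100, 2))
```

Comparing `field[judgment_idx].mean()` with `field.mean()` gives a quick check of how strongly a particular draw favors high values of the spatial field.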
Data for both designs were generated for the years $t = 1, \ldots, 50$, but it is assumed that the probability-based points were only observed for years $t = 41, \ldots, 50$.

The sampling bias inherent to the judgment sampling design results in biased estimates of spatial trend parameters and of spatial and temporal variogram parameters, as well as biased back predictions. Ordinary least squares estimation yields the fitted models $\hat{Z}(x,y,t) = 15.7 + 0.0701x - 0.0155y$ and $\hat{Z}(x,y,t) = 18.2 + 0.0867x - 0.0269y$ for the judgment and probability-based designs, respectively. Notice that the estimates of the partial slopes were shrunk towards zero under the judgment sampling design.

Weighted least squares was used to fit an exponential variogram model to method of moments estimates of the spatial variogram (Cressie 1991) obtained by pooling the available data over time. The two sampling designs yielded similar estimated ranges of spatial correlation (31.7 km for the probability sites, and 30.6 km for the judgment sites). However, the judgment sites showed higher variance (12.02) than the probability sites (10.45). Probability sites were not sampled over a sufficient length of time to obtain reasonable estimates of temporal variogram parameters. Weighted least squares was therefore used to fit an exponential variogram model to method of moments estimates of the temporal variogram obtained by pooling data over the judgment sample sites. Here, the estimate of the range of temporal correlation was 12.2 years.

The universal kriging predictor (3) was computed for the unobserved data at the probability sites between years 1 and 40. A comparison with the unobserved sample mean values of these data shows that the predictor consistently overestimates the mean level of the simulated field over those years. This overestimation can be directly attributed to the positive bias in the judgment sampling design.
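The variogram estimation step described above (a method of moments estimate followed by a weighted least squares fit of an exponential model) can be sketched as follows. This is a sketch under stated assumptions rather than the author's implementation: it handles a single time slice instead of pooling over time, it uses weights of the form N(h)/gamma(h)^2 as one common choice for weighted least squares variogram fitting, it omits a nugget term, and the function names and demonstration values are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def empirical_variogram(coords, z, bin_edges):
    # Method of moments semivariogram: half the mean squared difference
    # of all pairs of observations whose separation falls in each lag bin.
    i, j = np.triu_indices(len(z), k=1)
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq = (z[i] - z[j]) ** 2
    which = np.digitize(d, bin_edges) - 1
    nb = len(bin_edges) - 1
    h = np.full(nb, np.nan)
    g = np.full(nb, np.nan)
    cnt = np.zeros(nb, dtype=int)
    for b in range(nb):
        mask = which == b
        cnt[b] = mask.sum()
        if cnt[b]:
            h[b], g[b] = d[mask].mean(), 0.5 * sq[mask].mean()
    return h, g, cnt

def exp_model(h, sill, corr_range):
    # Exponential semivariogram with range defined through the factor 3.
    return sill * (1.0 - np.exp(-3.0 * h / corr_range))

def fit_exponential_wls(h, g, cnt):
    # Minimize sum_j N_j * (g_j / gamma(h_j; theta) - 1)^2 over (sill, range).
    ok = cnt > 0
    def resid(theta):
        return np.sqrt(cnt[ok]) * (g[ok] / exp_model(h[ok], *theta) - 1.0)
    start = [g[ok].max(), 0.5 * h[ok].max()]
    return least_squares(resid, start, bounds=(1e-6, np.inf)).x

# Demonstration on noise-free values generated from the model itself
# (hypothetical sill 10 and range 30), so the fit should recover them.
h_demo = np.linspace(5.0, 50.0, 10)
g_demo = exp_model(h_demo, 10.0, 30.0)
cnt_demo = np.full(10, 50)
sill_hat, range_hat = fit_exponential_wls(h_demo, g_demo, cnt_demo)
```

To pool over time as in the text, one option would be to accumulate the squared differences and pair counts of each lag bin across years before dividing.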
The magnitude of this bias was estimated using data from the interval of overlap during which observations are obtained from both sampling designs. The results yield the following estimate of the bias attributed to back predicting $t$ years into the past:

$$\hat{b}_t = 0.87989 + 0.014164\,t. \tag{6}$$

Thus, the magnitude of the bias increases slightly with the number of years over which we attempt to back predict the data.

The bias corrected predicted mean values (x's) track the unobserved sample mean values (open circles) of the probability sites reasonably well, except in the first nine years, when the judgment sample sites (triangles) showed little sampling bias (Figure 1). The poor performance of the bias correction in these early years is a result of the violation of the assumed model (6) for the temporal trend in the bias. Nevertheless, the bias corrected predictor tracks the temporal trend function (5) very well. In general, the proposed bias corrected back predictor performs well when the coefficient $a_{st} = 0$ in the simulation model (4). However, as $a_{st}$ departs from zero, the performance of the proposed predictor degrades.

Figure 1. Comparison of bias corrected predicted (x's) and unobserved (open circles) sample means for probability sites in years 1 to 40. In addition, observed annual means for judgment (triangles) and probability (closed circles) sample sites are given. Vertical axis: mean; horizontal axis: time.

REFERENCES

Cox, L.H., and Piegorsch, W.W. (1996) Combining environmental information I: environmental monitoring, measurement and assessment. Environmetrics 7, 299-308.
Cressie, N. (1991) Statistics for Spatial Data, Wiley, New York.
Gilbert, R.O. (1987) Statistical Methods for Environmental Pollution Monitoring, Van Nostrand Reinhold, New York.
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983) An evaluation of model-dependent and probability-sampling inferences in sample surveys. Journal of the American Statistical Association 78, 776-793.
Olsen, A.R., Sedransk, J., Edwards, D., Gotway, C.A., Liggett, W., Rathbun, S., Reckhow, K.H., and Young, L.J. (1999) Statistical issues for monitoring ecological and natural resources in the United States. Environmental Monitoring and Assessment 54, 1-45.
Overton, J., Young, T., and Overton, W. (1993) Using found data to augment a probability sample: procedure and a case study. Environmental Monitoring and Assessment 26, 65-83.
Shinozuka, J.R. (1971) Simulation of multivariate and multidimensional random processes. Journal of the Acoustical Society of America 49, 357-367.
Smith, T.M.F. (1983) On the validity of inferences from non-random samples. Journal of the Royal Statistical Society A 146, 394-403.

SUMMARY

Consider the replacement of historical environmental monitoring sites, selected through judgment-based sampling designs, by new sites selected through probability-based sampling designs. A kriging predictor is constructed to predict the data that would have been obtained had a probability-based sampling design been applied from the beginning of the monitoring program. An interval of overlap, in which data are obtained from both sets of sites, is used to correct for the bias of the judgment design.