Combining Judgment and Probability Sample Data Across
Space and Time
Stephen L. Rathbun
Department of Statistics
University of Georgia
Athens, Georgia 30602 USA
[email protected]
Programs that monitor aquatic resources typically employ judgment sampling designs
(Olsen et al. 1999), in which sample sites are selected by experts according to a number of
often vaguely defined criteria (Gilbert 1987). However, expert choice often results in selection
bias (e.g., Smith 1983). Only through the implementation of a probability sampling design can
design-unbiased estimates be guaranteed (Hansen et al. 1983).
Managers of water quality monitoring programs are reluctant to implement probability
sampling designs. Much of this reluctance stems from the fear of losing information from the
historical data base. Therefore, probability sampling designs are not likely to be widely accepted unless statistical approaches for combining data from judgment and probability sampling
designs are available. Unfortunately, methods for combining such data have received very little
attention in the statistical literature. Overton, Young, and Overton (1993) use sampling frame
attributes to assign judgment sites to clusters of similar probability sites. Judgment sites
assigned to a given cluster are assumed to be representative of that cluster, and are treated as
though they were obtained from a probability sampling design. However, the representativeness
of the judgment sites with respect to their assigned clusters is difficult to diagnose, and if this assumption is false,
the combined data may yield biased estimates (Cox and Piegorsch 1996).
The following proposes an approach to combining data from historical judgment sites
with new data from probability sites. This approach requires an interval of overlap in which
both historical judgment sites and new probability sites are sampled. The spatio-temporal
correlation between the two collections of sites is then exploited to predict what data would have
been attained had a probability sampling design been implemented from the very beginning.
Spatio-Temporal Prediction
The following assumes that the data from both sampling designs are a partial realization
of a spatio-temporal random field. In particular, assume that the data $Z(s, t)$ at the site $s$ at
time $t$ are realized from the model
$$Z(s, t) = \beta_0 + \beta_1 x_1(s, t) + \cdots + \beta_p x_p(s, t) + \varepsilon(s, t), \qquad (1)$$
where $\beta_0, \beta_1, \ldots, \beta_p$ are model parameters, and the explanatory variables $x_1(s, t), \ldots, x_p(s, t)$
may be functions of the spatial coordinates, time, distances to known geographic features, or
environmental variables. The errors $\varepsilon(s, t)$ are assumed to be realized from a zero-mean random
field with variogram
$$2\gamma(h, r) = \mathrm{var}\{Z(s, t) - Z(s + h, t + r)\}. \qquad (2)$$
Typically, the variogram is assumed to take a parametric form $2\gamma(h, r; \theta)$.
Suppose that the judgment sample data are $\{Z(s_i, t) : i = 1, \ldots, n;\ t = 1, \ldots, T\}$ and the
probability sample data are $\{Z(u_i, t) : i = 1, \ldots, m;\ t = M, \ldots, T\}$. Thus, data are collected
from both sets of sites in the interval $M, \ldots, T$.
Universal kriging (Cressie 1991) is used to back predict what data would have been obtained had a probability sampling design been implemented from the very beginning of the
monitoring program; that is, predict $\{Z(u_i, t) : i = 1, \ldots, m;\ t = 1, \ldots, M - 1\}$. To reduce the
problem to manageable dimensions, only data from the judgment sample at time $t$ and the
probability sample at time $M$ shall be used to predict the unobserved values of the data from
the probability sample at time $t$. Thus, our predictor is
$$\hat{Z}(u_k, t) = \sum_{i=1}^{n} \lambda_{1i} Z(s_i, t) + \sum_{i=1}^{m} \lambda_{2i} Z(u_i, M), \qquad (3)$$
where the $\lambda$'s are chosen to minimize the mean squared prediction error subject to the constraint
that the resulting predictor is unbiased.
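The constrained minimization behind the predictor can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes a hypothetical space-time exponential semivariogram with no nugget (the function `semivariogram` and its default parameters are illustrative), a constant-mean trend in the usage lines (the ordinary kriging special case), and made-up site coordinates. The weights solve the standard universal kriging system, in which the semivariogram matrix is bordered by the trend regressors so that the unbiasedness constraint holds.

```python
import math

def semivariogram(h, r, sill=1.0, range_s=20.0, range_t=10.0):
    """Hypothetical space-time exponential semivariogram gamma(h, r), no nugget."""
    return sill * (1.0 - math.exp(-3.0 * h / range_s - 3.0 * abs(r) / range_t))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def kriging_weights(points, target, trend):
    """Universal kriging weights for space-time points given as (x, y, t).

    points: observation coordinates; target: prediction coordinate;
    trend: function mapping a coordinate to its list of regressors.
    """
    def gamma(p, q):
        return semivariogram(math.hypot(p[0] - q[0], p[1] - q[1]), p[2] - q[2])
    n = len(points)
    X = [trend(p) for p in points]
    k = len(X[0])
    # Block system [[Gamma, X], [X', 0]] [lambda, multipliers] = [gamma0, x0].
    a = [[gamma(points[i], points[j]) for j in range(n)] + X[i] for i in range(n)]
    a += [[X[i][j] for i in range(n)] + [0.0] * k for j in range(k)]
    b = [gamma(p, target) for p in points] + trend(target)
    return solve(a, b)[:n]

# Made-up coordinates: judgment sites at year t = 5, probability sites at year M = 41.
judgment = [(10.0, 20.0, 5), (40.0, 70.0, 5), (80.0, 30.0, 5)]
probability = [(25.0, 25.0, 41), (60.0, 50.0, 41)]
constant_mean = lambda p: [1.0]
weights = kriging_weights(judgment + probability, (25.0, 25.0, 5), constant_mean)
```

The first three entries of `weights` play the role of the $\lambda_{1i}$ and the last two the role of the $\lambda_{2i}$ in (3). With a constant mean the weights sum to one, which is the unbiasedness constraint in this special case.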
SIMULATION STUDY RESULTS
Data were simulated from the spatio-temporal model
$$Z(s, t) = a_0 f(t) + a_s \eta(s) + a_t \tau(t) + a_{st} \delta(s, t) + a_\varepsilon \varepsilon(s, t). \qquad (4)$$
The function
$$f(t) = \begin{cases} 0.05t, & t \le 10 \\ 0.5, & 10 \le t \le 30 \\ 0.5 + 0.025t, & t > 30 \end{cases} \qquad (5)$$
models the background temporal trend. The zero mean, unit variance spatial random field $\eta(s)$
has an exponential spatial correlation function $\rho_s(h) = \exp\{-3\|h\|/\theta_s\}$ with a long range of
spatial dependence of $\theta_s = 200$ km; for the current application, it can be considered to model the
spatial trend in the data. Likewise, the zero mean, unit variance temporal process $\tau(t)$ has an
exponential correlation function $\rho_t(r) = \exp\{-3r/\theta_t\}$ with a long range of temporal correlation
of $\theta_t = 3000$ years. The spatio-temporal process $\delta(s, t)$ allows the temporal trend to depend
on location; it has zero mean, unit variance, and exponential correlation function $\rho_{st}(h, r) =
\exp\{-3\|h\|/\theta'_s - 3r/\theta'_t\}$ with relatively short ranges of spatial and temporal correlation set at
$\theta'_s = 20$ km and $\theta'_t = 10$ years. All three of these processes were simulated using the spectral
method (Shinozuka 1971). The error $\varepsilon(s, t)$ is Gaussian white noise with unit variance, and
models the effects of measurement error.
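As a small sketch of the deterministic pieces of this simulation model, the functions below code the trend function (5) exactly as it is stated and the exponential correlation functions with the ranges quoted in the text. The function names are illustrative, and the random fields themselves (simulated in the paper by the spectral method) are not reproduced here.

```python
import math

def background_trend(t):
    """Temporal trend f(t), transcribed as given in (5)."""
    if t <= 10:
        return 0.05 * t
    if t <= 30:
        return 0.5
    return 0.5 + 0.025 * t

def exp_correlation(lag, corr_range):
    """Exponential correlation exp(-3 * |lag| / range) for one dimension."""
    return math.exp(-3.0 * abs(lag) / corr_range)

def st_correlation(h, r, range_s=20.0, range_t=10.0):
    """Space-time correlation exp(-3 h / range_s - 3 |r| / range_t)."""
    return math.exp(-3.0 * h / range_s - 3.0 * abs(r) / range_t)

# Correlation of the long-range spatial field at a lag equal to its 200 km range.
at_range = exp_correlation(200.0, 200.0)
```

With the factor of 3 in the exponent, the correlation at a lag equal to the range parameter is $e^{-3} \approx 0.05$, so the range can be read as the distance beyond which dependence is negligible.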
The relative influence of the four component processes on the resulting data can be explored by
varying the levels of the coefficients $a_0, a_s, a_t, a_{st}$, and $a_\varepsilon$. The following illustrates the proposed
method of back prediction taking $a_0 = 5$, $a_s = 10$, $a_t = 3$, $a_{st} = 3$, and $a_\varepsilon = 0.5$.
Two samples of data were generated. A total of 100 probability sites were obtained from a
simple random sample over a 100 × 100 km region. For the judgment sample, an additional
100 sites were independently selected from the density proportional to
$$p(s) = \frac{\exp\{-1 + 4\eta(s)\}}{1 + \exp\{-1 + 4\eta(s)\}},$$
where $\eta(s)$ is the spatial random field in model (4).
The latter design yields a sample that is biased in favor of high values. Data for both designs
were generated for the years $t = 1, \ldots, 50$, but it is assumed that the probability-based points
were only observed for years $t = 41, \ldots, 50$.
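The effect of this selection density can be illustrated with a short sketch. It makes several assumptions that are not in the paper: candidate locations are a finite set of random points, the spatial field is replaced by independent standard normal values (so there is no spatial correlation), and sites are drawn with replacement. The names `inclusion_weight` and `draw_judgment_sites` are hypothetical.

```python
import math
import random

def inclusion_weight(field_value):
    """Logistic weight exp(z) / (1 + exp(z)) with z = -1 + 4 * field_value."""
    z = -1.0 + 4.0 * field_value
    return 1.0 / (1.0 + math.exp(-z))

def draw_judgment_sites(field_values, n_sites, rng):
    """Draw site indices with probability proportional to the logistic weight."""
    weights = [inclusion_weight(v) for v in field_values]
    return rng.choices(range(len(field_values)), weights=weights, k=n_sites)

rng = random.Random(0)
# Candidate locations in a 100 x 100 km region (illustrative only).
candidates = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(5000)]
# Stand-in for the spatial field: independent N(0, 1) values, an assumption.
field_values = [rng.gauss(0.0, 1.0) for _ in candidates]

chosen = draw_judgment_sites(field_values, 100, rng)
population_mean = sum(field_values) / len(field_values)
judgment_mean = sum(field_values[i] for i in chosen) / len(chosen)
```

Because the weight increases with the field value, `judgment_mean` comes out well above `population_mean`, which is the sense in which the design favors high values.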
The sampling bias inherent to the judgment sampling design results in biased estimates of
spatial trend parameters, and spatial and temporal variogram parameters, as well as biased
back predictions. Ordinary least squares estimation yields the fitted models
$$\hat{Z}(x, y, t) = 15.7 + 0.0701x - 0.0155y \quad \text{and} \quad \hat{Z}(x, y, t) = 18.2 + 0.0867x - 0.0269y$$
for the judgment and probability-based designs, respectively. Notice that estimates of partial
slopes were shrunk towards zero under the judgment sampling design.
Weighted least squares was used to fit an exponential variogram model to method of moments
estimates of the spatial variogram (Cressie 1991) obtained by pooling the available data over
time. The two sampling designs yielded similar estimated ranges of spatial correlation (31.7
km for the probability sites, and 30.6 km for the judgment sites). However, the judgment sites
showed higher variance (12.02) than the probability sites (10.45). Probability sites were not
sampled over a sufficient length of time to obtain reasonable estimates of temporal variogram
parameters. Weighted least squares was used to fit an exponential variogram model to method
of moments estimates of the temporal variogram obtained by pooling data over the judgment
sample sites. Here, the estimate of the range of temporal correlation was 12.2 years.
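A minimal sketch of this two-step fit follows, under stated assumptions: the method of moments estimate is computed for a single time slice rather than pooled over time, the exponential model has no nugget and uses the same factor-of-3 parametrization as the simulation model, the weights are the commonly used $N(h)/\gamma(h; \theta)^2$ form, and the minimization is a plain grid search over candidate values. None of these details is confirmed by the text, and the function names are illustrative.

```python
import math

def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Method of moments estimate: half the mean squared difference per lag bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            b = int(h // bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    lags = [(b + 0.5) * bin_width for b in range(n_bins)]
    gammas = [s / (2 * c) if c else None for s, c in zip(sums, counts)]
    return lags, gammas, counts

def exponential_semivariogram(h, sill, corr_range):
    """Exponential model sill * (1 - exp(-3 h / range)), no nugget."""
    return sill * (1.0 - math.exp(-3.0 * h / corr_range))

def fit_wls(lags, gammas, counts, sills, ranges):
    """Grid search minimizing sum of N(h) * (gamma_hat / gamma_model - 1)^2."""
    best = None
    for sill in sills:
        for corr_range in ranges:
            loss = 0.0
            for h, g, n in zip(lags, gammas, counts):
                if g is None:
                    continue  # skip empty lag bins
                model = exponential_semivariogram(h, sill, corr_range)
                loss += n * (g / model - 1.0) ** 2
            if best is None or loss < best[0]:
                best = (loss, sill, corr_range)
    return best[1], best[2]

# Check on noise-free input: semivariogram values generated from sill 10, range 30.
lags = [5.0, 15.0, 25.0, 35.0]
gammas = [exponential_semivariogram(h, 10.0, 30.0) for h in lags]
fitted = fit_wls(lags, gammas, [10] * 4, [8.0, 10.0, 12.0], [20.0, 30.0, 40.0])
```

On noise-free input the grid search returns the generating sill and range, which is a basic sanity check before applying the fit to real binned estimates.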
The universal kriging predictor (3) was computed for the unobserved data at the probability
sites between years 1 and 40. A comparison with the unobserved sample mean values of these
data shows that the predictor consistently overestimates the mean level of the simulated field
over those years. This overestimation can be directly attributed to the positive bias in the
judgment sampling design. The magnitude of this bias was estimated using data from the
interval of overlap during which observations are obtained from both sampling designs. The
results yield the following estimate of the bias attributed to back predicting $t$ years into the
past:
$$\hat{b}_t = 0.87989 + 0.014164t. \qquad (6)$$
Thus, the magnitude of the bias increases slightly with the number of years over which we are
attempting to back predict the data.
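The fitted bias model (6) can be applied as a simple correction. The sketch below assumes that the correction is a subtraction of the estimated bias from the back-predicted mean, and that the argument `years_back` (a hypothetical name) is the number of years into the past being predicted; the text does not spell out either detail.

```python
def estimated_bias(years_back, intercept=0.87989, slope=0.014164):
    """Linear bias model (6): estimated bias when back predicting years_back years."""
    return intercept + slope * years_back

def bias_corrected(predicted_mean, years_back):
    """Subtract the estimated bias from a back-predicted mean (assumed form)."""
    return predicted_mean - estimated_bias(years_back)

# Estimated bias when predicting 10 and 40 years into the past.
bias_10 = estimated_bias(10)
bias_40 = estimated_bias(40)
```

Under (6), the estimated bias grows from about 1.02 at 10 years back to about 1.45 at 40 years back.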
The bias corrected predicted mean values (x's) track the unobserved sample mean values (open
circles) of the probability sites reasonably well, except in the first nine years, when the judgment
sample sites (triangles) showed little sampling bias (Figure 1). The poor performance of the
bias correction in these early years is a result of the violation of the assumed model (6) for
the temporal trend in the bias. Nevertheless, the bias corrected predictor tracks the temporal
trend function (5) very well. In general, the proposed bias corrected back predictor performs
well when the coefficient $a_{st} = 0$ in the simulation model (4). However, as $a_{st}$ departs from zero,
the performance of the proposed predictor degrades.
REFERENCES
Cox, L.H., and Piegorsch, W.W. (1996) Combining environmental information I: environmental
monitoring, measurement and assessment. Environmetrics 7, 299-308.
Cressie, N. (1991) Statistics for Spatial Data, Wiley, New York.
Gilbert, R.O. (1987) Statistical Methods for Environmental Pollution Monitoring, Van Nostrand
Reinhold, New York.
Figure 1. Comparison of bias corrected predicted (x's) and unobserved (open
circles) sample means for probability sites in years 1 to 40. In addition, observed
annual means for judgment (triangles) and probability (closed circles) sample
sites are given. Axes: mean (vertical, 6 to 18) against time (horizontal, 0 to 50).
Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983) An evaluation of model-dependent
and probability-sampling inferences in sample surveys. Journal of the American Statistical
Association 78, 776-793.
Olsen, A.R., Sedransk, J., Edwards, D., Gotway, C.A., Liggett, W., Rathbun, S., Reckhow,
K.H., and Young, L.J. (1999) Statistical issues for monitoring ecological and natural resources
in the United States. Environmental Monitoring and Assessment 54, 1-45.
Overton, J., Young, T., and Overton, W. (1993) Using found data to augment a probability
sample: procedure and a case study. Environmental Monitoring and Assessment 26, 65-83.
Shinozuka, J.R. (1971) Simulation of multivariate and multidimensional random processes.
Journal of the Acoustical Society of America 49, 357-367.
Smith, T.M.F. (1983) On the validity of inferences from non-random samples. Journal of the
Royal Statistical Society A 146, 394-403.
SUMMARY
We consider the replacement of historical environmental monitoring sites, selected under
judgment sampling designs, by new sites selected under probability sampling designs. A kriging
predictor is formed to predict the data that would have been obtained had a probability
sampling design been applied from the beginning of the monitoring program. An interval of
overlap, in which data are obtained from both sets of sites, is used to correct the bias of the
judgment design.