Download Bayesian Method for Groundwater Quality Monitoring Network

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Network tap wikipedia , lookup

Transcript
Bayesian Method for Groundwater Quality Monitoring
Network Analysis
Khalil Ammar1; Mac McKee2; and Jagath Kaluarachchi3
Abstract: A new methodology is developed to analyze existing monitoring networks. This methodology incorporates different aspects of
monitoring, including vulnerability/probability assessment, environmental health risk, the value of information, and redundancy reduction.
A conceptual framework for groundwater quality monitoring is formulated to represent the methodology’s context. Relevance vector
machine 共RVM兲 plays a basic role in this conceptual framework, and is employed to reduce redundancy and to create probability map of
contaminant distribution, and accordingly to estimate the expected value of sample information. Disability adjusted life years approach of
the global burden of disease is used for quantifying the health risk consequences. This is demonstrated through a case study application
to nitrate contamination monitoring in the West Bank, Palestine. The results obtained from the RVM analysis showed that an overlap error
of less than 30% were obtained based on using around 30% of the monitoring sites 共170 relevance vectors兲. This reflects the importance
of the RVM as a useful model for improving the efficiency of monitoring systems, both in terms of reducing redundancy and increasing
the information content of the collected data. However, in this application, the results of health risk assessment and the evaluation of
monitoring investments were less encouraging due to the minimal elasticity of the nitrate health effect with respect to monitoring
information and uncertainty.
DOI: 10.1061/共ASCE兲WR.1943-5452.0000043
CE Database subject headings: Groundwater management; Monitoring; Groundwater pollution; Bayesian analysis.
Author keywords: Groundwater; Monitoring; Pollution; Bayesian analysis; Relevance vector machine; Value of information.
Introduction
Efficient water management relies upon efficient monitoring systems that have the capability to provide information that are decision relevant. Unfortunately, existing monitoring systems do not
always fulfill this objective, where many monitoring systems are
designed to gather data that are redundant and do not add
management- or decision-relevant information of value. Therefore, the needs to acquire data that are decision-relevant, and
efficient, establish a need for the development of cost-effective
and flexible analytical methodology for water quality monitoring
networks.
Recently, increase in using supervised learning machines in
water resources management applications has drawn more attention in the research literature. Particularly, using the relevance
vector machine 共RVM兲 model in hydrological applications and
groundwater quality modeling, Bayesian deduction for redundancy detection in groundwater quality monitoring networks
共Ammar et al. 2008兲, and on selection of kernel parameters in
1
Hydrogeological Scientist, International Center for Biosaline Agriculture, Dubai, United Arab Emirates 共corresponding author兲. E-mail:
[email protected]
2
Director, Utah Water Research Laboratory, Utah State Univ., Logan,
UT.
3
Professor, Dept. of Civil and Environmental Engineering, Utah State
Univ., Logan, UT.
Note. This manuscript was submitted on September 5, 2008; approved
on July 31, 2009; published online on August 3, 2009. Discussion period
open until June 1, 2011; separate discussions must be submitted for individual papers. This paper is part of the Journal of Water Resources
Planning and Management, Vol. 137, No. 1, January 1, 2011. ©ASCE,
ISSN 0733-9496/2011/1-51–61/$25.00.
RVM for hydrologic applications 共Tripathi and Govindaraju
2006兲. The work on applicability of statistical learning algorithms
in groundwater quality modeling 共Khalil et al. 2005a兲, real-time
management of reservoir releases 共Khalil et al. 2005b兲, and modeling of chaotic hydrologic time series 共Khalil et al. 2006兲. Continuous advances in RVM performance have solved some of the
limitations of earlier formulations of these models, such as extending the binary classifier to a multiclass classifier with associated probability estimates as shown in the application of
multiclass RVMs for selecting shape features 共Zhang and Malik
2005兲. The extension of RVM to include multivariate outputs
共Thayananthan et al. 2006兲. The rationale for using the RVM
model here, is due to its strengths as a forecasting tool: 共1兲 its
design yields a sparse model 共redundancy reduction兲, which allows better generalization performance and avoids overfitting; 共2兲
it has the capability to infer information contained in the data due
to its Bayesian framework; and 共3兲 it derives accurate probabilistic prediction models and allows the computation of posterior
probabilities for a prediction.
A Bayesian decision analysis approach is employed here. This
approach incorporates prior knowledge about the possible state of
a system, and adds new data in a preposterior analysis to produce
posterior knowledge of full information about possible system
states. In the case of groundwater quality monitoring, this is done
to account for uncertainty in the predictions and to estimate the
value of information 共VOI兲 in terms of quantifying health risk
consequences. In this study, health risk assessment is quantified
using the disability adjusted life years 共DALY兲 metric. The main
characteristics of DALY approach over other methods are: 共1兲 it
enables a comprehensive evaluation of health gains and losses in
terms of established public health concepts 共quality and quantity
of life and social magnitude兲, using time as a unit of measure-
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 51
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
ment; 共2兲 it combines annual mortality rates and nonlethal end
points and explicitly addresses life and health expectancy; 共3兲
DALY measures loss of health, an inverse form of the more general concept of quality adjusted life years, which measures
equivalent healthy years lived at the individual level instead of
the health of a population; 共4兲 it has been adopted in many countries and by health development agencies as the standard health
accounting; and 共5兲 DALY measures the future stream of healthy
years of life lost due to each incident case of disease and for each
death.
Recently, the involvement of the VOI in decision-making process in many disciplines has increased dramatically. The VOI by
definition is the amount the decision maker would be willing to
pay for information prior to making a decision. In monetary
terms, the expected monetary value of the decision with information minus the expected monetary value without the information.
The concept of involving VOI in health risk assessment has
helped reduce uncertainty in the decision-making process through
the collection of new data that have value in terms of providing a
greater understanding of health consequences 共Yokota and Thompson 2004兲. They have conducted a review of VOI analysis in
environmental health risk decisions. They showed that the complexity of solving VOI problems with continuous probability distributions as inputs to models emerges as the main barrier to
greater use of the VOI concept. They suggested the need to standardize approaches and develop some prescriptive guidance for
VOI analysts. Kim and Bridges 共2006兲 proposed a structured approach to decision making that explicitly considers the risks and
uncertainties within an overall decision analysis framework to
provide a systematic process for evaluating how the predicted
performance of a management action and the uncertainty associated with that prediction affects objectives that stakeholders care
about most. Khadam and Kaluarachchi 共2003兲 proposed an integrated approach for the management of contaminated groundwater resources using health risk assessment and economic analysis
through a multicriteria decision analysis framework. The proposed decision analysis framework integrates probabilistic health
risk assessment into a cost-based multicriteria decision analysis
framework. The results showed the importance of using an integrated approach for decision making considering both costs and
risks. Bates et al. 共2003兲 used a special case of Bayesian modeling
to make inferences from deterministic models about environmental risk assessment. The results obtained were posterior distributions of contaminant concentrations in soil and vegetables taking
into account uncertainty.
The objective of this paper is to develop an improved methodology and a systematic approach for groundwater quality monitoring network analysis, using state-of-the-art supervised RVM,
and incorporating wide aspects of vulnerability assessment, health
risk, and VOI reflected in a multiobjective perspective. This multiobjective dimension includes cost of monitoring, the information content or level of uncertainty, and the risk to public and/or
environmental health. This will help evaluate the worth of collecting additional data in selecting the optimal monitoring network
configuration. In addition, this will assist in deciding whether
available data are sufficient to support decisions on monitoring
site selection 共monitoring strategy兲, or insufficient in terms of the
added VOI they would supply, and the amount of uncertainty
reduction inherent in the information obtained from the monitoring network.
The main research contribution is in the introduction of new
techniques as the RVM and incorporating integrated methodologies of vulnerability assessment, health risk, and economic VOI
for groundwater quality monitoring analysis, assessed with respect to their efficiency and applicability to a real groundwater
problem. This is an important contribution to the decision-making
process because of incorporating the multiobjective dimension
and quantification of possible tradeoffs between different objectives in groundwater monitoring.
Methodology
The methodology is described systematically as illustrated in the
conceptual framework for groundwater quality monitoring network analysis figure. The main steps include selection of constituent of concern 共COC兲, selection of a sparse model for creation of
vulnerability/probability maps, health risk assessment, and estimation of the value of information. The process of selecting the
COC is done by following the approach proposed by Harmancioglu et al. 共1999兲. Probability map is created using the RVM model
through integrating data on the natural system 共hydrogeological
information兲 and human activities 共land use and contaminant concentration兲 within a probabilistic framework 共Ammar 2007兲.
A Bayesian decision analysis is adopted here to estimate the
value of information, through calculating the expected value of
sample information 共EVSI兲 by incorporating current knowledge
共prior probability distribution兲 and adding newly collected data to
produce posterior probability distributions of the state of a system. The value of additional information is calculated as the difference between the expected loss 共or expected utility兲 that would
be achieved under posterior knowledge and the expected loss 共expected utility or payoff兲 under current 共prior兲 knowledge. This
should help in deciding whether available data are sufficient to
support decisions on monitoring site selection 共monitoring strategy兲 or insufficient in terms of the added VOI and uncertainty
reduction as will be described later.
The organization of the methodology is as follows: first, we
will describe the RVM method. Second, we will describe the
Bayesian decision analysis, and most importantly the preposterior
components: risk assessment and EVSI. For risk assessment we
will elaborate on the assumptions used in quantifying the risk,
starting with defining the dose-response relationship as reflected
in the hazard quotient, and accordingly, the hazard index to delineate the areas that has adverse effect or acceptable risk, and to
define the health loss based on the DALYs approach. Finally, this
will help define the EVSI.
Relevance Vector Machine
RVM plays a basic role in the conceptual framework as illustrated
in Fig. 1. The main features of RVM model are sparsity 共reducing
redundancy兲, good generalization performance, avoiding overfitting, and producing posterior probability as an output 共Tipping
2001兲. Sparsity within the Bayesian framework of the RVM is
achieved as follows: the weights of each input are governed by a
set of hyperparameters. These hyperparameters describe the posterior distribution of the weights and are estimated iteratively during training. This continues until convergence occurs by judging
if we have attained a local maximum of the marginal likelihood.
This implies that most hyperparameters approach infinity, and accordingly setting the posterior distributions of the weights to zero.
The remaining vectors with nonzero weights are called relevance
vectors. Details of the description are given in the following paragraphs. Note that we used here multivariate output extension algorithm which is an extension to the fast marginal likelihood
52 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
marginal likelihood over the hyperparameters. Then optimal
weight matrix is obtained using these optimal set of hyperparameters. Details are described in the following. The mathematical
representation of the likelihood is given below as
再
p共y兩w,␴2兲 = 共2␲␴2兲−N/2exp −
1
储y − ⌽w储2
2␴2
冎
共1兲
where y = 共y 1 . . . . y N兲T = targets; w = 共wo . . . . wN兲T = weights or relevance vector coefficients; ⌽ = design matrix that contains the
response of all basis functions to the inputs, with ⌽
= 关␾共x1兲 , ␾共x2兲 , . . . , ␾共xN兲兴T,
wherein
␾共xn兲 = 关1 , K共xn , x1兲 ,
K共xn , x2兲 , ¯ , K共xn , xN兲兴T. The kernel function 关K共xn , xN兲兴 is used
to introduce nonlinearity into the mapping function. The kernel
functions define a set of nonlinear fixed basis functions, ␺N共xn兲
= K共xn , xN兲, where the kernel function is centered on each of the N
training data points xn. An important feature of the kernel is the
kernel width, which is considered for development of RVM models as the key for forecast accuracy and consequently sparsity. For
example, the Gaussian kernel 共radial basis function kernel兲 width
is represented here by the standard deviation, ␴, or simply the
variance over the noise ␴2, as shown in the Gaussian kernel equation
冉
K共x,xi兲 = exp −
Fig. 1. Conceptual framework for groundwater quality monitoring
network analysis
maximization algorithm 共Tipping and Faul 2003兲. For simplicity,
the RVM methodology is demonstrated according to Tipping formulation 共Tipping 2001兲. Interested readers can read a multivariate paper by Thayananthan 共Thayananthan et al. 2006兲 regarding
the details of extending the univariate output to multivariate output. In his extension, he followed similar formulation of Tipping
and let the algorithm iterate in two steps: in the first step, to
calculate the probability 共likelihood兲 of each example belonging
to each class, and in the second step, to estimate the parameters
using the probability calculated in the first step. He used a matrix
target with number of columns equal to number of variates instead of a vector target, also extended the weight matrix, and used
a diagonal covariance matrix.
Bayesian learning in RVMs begins by observing a data set
consisting of N pairs, 兵xn , y n , n = 1 , 2 , 3 , . . . , N其, where the xn
= inputs and the y n = targets. The targets form an N by P matrix,
with P the number of regulatory classes, y nxP
= 关y共n , 1兲 , . . . , y共N , P兲兴. The likelihood or probability of membership in one of the regulatory classes given information about the
input x, written as P 共y 兩 x兲, can be written in terms of the mean or
weight w and the variance ␴2 assuming the error follows independent zero-mean Gausian 共normal兲 distribution 共Tipping 2001兲.
This conditional target or data likelihood incorporates the uncertainty of prediction and is obtained as a function of weight variables and hyperparameters. These hyperparameters are the
parameters that determine the relevance of the associated basis
function. The weight variables are then analytically integrated out
to obtain marginal likelihood as function of the hyperparameters.
An optimal set of hyperparameters is obtained by maximizing the
储x − xi储 2
␴2
冊
共2兲
where K共x , xi兲, i = 1 , . . . , n = kernel function for n by m matrix, and
the norm reflects the nearness of one point 共x兲 to other point 共xi兲,
while the variance controls the width of the bell shape of the
Gaussian distribution, i.e., smoothing parameter.
In the training process, the RVM considers the training data as
prior data and infers the information about class membership
from this data. Accordingly, it recognizes a pattern in the data.
The addition of new data updates the likelihood of class membership, which also updates the error between the observed data and
predicted data 共or regulatory classes兲. This Gaussian prior distribution is introduced to complement the likelihood function and is
given in the form as
M
p共w兩␣兲 = 共2␲兲
−M/2
兿 ␣m1/2 exp
m=1
冉
−
␣mwm2
2
冊
共3兲
where M = number of the independent hyperparameters, ␣
= 共␣1 , . . . , ␣ M 兲T, each hyperparameter is individually controlling
the strength of the prior over its associated weight. This form of
the prior is responsible for the sparsity properties of the RVM
model. Given the previously defined prior and likelihood distributions, the posterior parameter distribution conditioned on the
data over all unknowns is defined using Bayes’ rule
P共w,␣,␴2兩y兲 =
P共y兩w,␣,␴2兲P共w,␣,␴2兲
P共y兲
共4兲
The posterior in Eq. 共4兲 is intractable; an approximation is obtained by decomposing the posterior to P共w , ␣ , ␴2 兩 y兲
⬅ P共w 兩 y , ␣ , ␴2兲P共␣ , ␴2 兩 y兲. As a consequence, the posterior distribution over the weight becomes analytically solvable. Details
of analytically solving these equations are given in Tipping
共2001兲. The analytical posterior distribution over weights is
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 53
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
P共w兩y,␣,␴2兲 =
P共y兩w,␴2兲P共w兩␣兲
= 共2␲兲−共N+1兲/2兩⌺兩−1/2
P共y兩␣,␴2兲
再
1
exp − 共w − ␮兲T⌺−1共w − ␮兲
2
冎
HQ =
共5兲
with posterior covariance ⌺ = 共␴−2⌽T⌽ + A兲−1 and mean ␮
= ␴−2⌺⌽Ty, and A = diag共␣1 , ␣2 , . . . , ␣ M 兲. The estimated value of
the model weights is given by the mean of the posterior distribution Eq. 共5兲, which is also the maximum a posterior 共MP兲 estimate
of the weights using type-II maximum likelihood procedure 共i.e.,
maximizing the marginal likelihood兲. The optimal values of many
hyperparameters ␣MP are infinite. This leads to a parameter posterior distribution infinitely peaked at zero for many weights with
the consequence that ␮MP correspondingly comprises very few
nonzero elements. That is, sparse Bayesian learning is formulated
as the local maximization of the marginal likelihood with respect
to ␣.
A Bayesian decision analysis is conducted as mentioned before,
to evaluate the expected health loss function for each decision or
monitoring strategy from the range of all possible outcomes produced from running the RVM model. In addition, to select the
decision that satisfies the objectives of monitoring design 共multiobjective dimension兲, i.e., selecting the design that minimizes the
expected health loss or maximizing EVSI, minimizes cost of sampling 共roughly measured here in terms of the number of monitoring sites兲, and simultaneously improves uncertainty.
The prior state is considered here as the state of having the
minimum number of monitoring sites, and the associated probability as the prior probability produced by running the RVM
model. This prior probability is actually the posterior probability
of the training data used in the RVM model formulation.
Preposterior analysis is the core of the approach for calculating
the expected value of information. In this step, an assessment of
the risk based upon updated information is conducted and, accordingly, an estimation of the expected value of the sample information for each update is done as described below.
Risk Assessment
Risk is the probability of an adverse outcome and risk assessment
is the process of quantifying risk 共probability兲 and associated consequences. The main consequence that will be taken into account
in this study is the health loss consequence; other consequences,
such as economic or environmental, will not be covered due to
lack of data concerning these issues. Health risk assessment generally consists of hazard identification, exposure assessment,
dose-response relationship, and risk characterization.
The hazard quotient 共HQ兲 is used to define the dose-response
relationship of contaminant 共in this study is nitrate兲 exposure. The
HQ is the noncancer risk assessment of this exposure, defined as
the ratio of the exposure for some generalized or typical hypothetical member of the receptor population at a site, compared to
an appropriate toxicity reference value that equates to an acceptable level of risk for that receptor and chemical 共U.S. EPA 2001兲.
The main equation to calculate the HQ as defined by U.S. EPA
共2001兲 is
CDI =
C ⫻ IR ⫻ EF
BW ⫻ 365
共6兲
共7兲
where CDI= chronic daily intake of contaminant 共mg/kg-day兲;
C = contaminant concentration 共mg/L兲; IR= ingestion rate 共L/day兲;
EF= exposure frequency 共days/year兲; BW= bodyweight 共kg兲; and
RfD= reference dose for the contaminant.
The hazard index 共HI兲 which is the sum of HQs 共as defined
before兲 and is used to evaluate the fraction of the population with
a HQ above and below 1, where HQ values above 1 are interpreted as indicating the potential for adverse effects and HQ values below 1 are interpreted as indicating acceptable risk. This HI
will be used to determine the potential exposure in the population,
mainly the infants. The health loss 共HL兲 is the loss in health that
the individual will get due to exposure to hazard material. The HL
function is defined as
HL = DALYs
HL = 0
Bayesian Decision Analysis
CDI
RfD
for HI ⱖ 1
Otherwise
共8a兲
共8b兲
where DALY= disability adjusted life years as described next. In
order to characterize the risk, the concept of DALYs is adopted in
this study to quantify the HL, with time as the unit of measurement. This health impact measure combines years-of-life-lost
with years-lived-with-disability that is standardized by means of
severity weights using time as the metric. The importance of
DALY appears in combining different aspects of public health:
quantity of life 共measured by life expectancy and duration of
disease兲, quality of lie 共expressed through a severity weight for
the adverse health outcome兲, and social magnitude 共or number of
people affected兲.
HL is defined according to the Global Burden of Disease
共GBD兲 共Lopez et al. 2006兲 as time spent with reduced quality of
life aggregated over the population involved, combining years of
life lost 共combining mortality and age of death data兲 and years
lived with disability, standardized through application of severity
weights. Generally, HL using the DALY method is assessed by
estimating the number of people affected, estimating the average
duration of the adverse health response, including loss of life
expectancy as a consequence of premature mortality, assigning
weights for severity in unfavorable health conditions, and calculating the HL in DALYs. Therefore, DALYs for a certain cause is
calculated as the sum of the years of life lost due to premature
mortality 共YLL兲 from that cause and the years of healthy life lost
as a result of the disability 共YLD兲 for incident cases of the health
condition
DALY = YLL + YLD
共9兲
where YLL= number of life years lost due to mortality and
YLD= number of years lived with the disability, weighed with a
factor between 0 共perfect health兲 and 1 共dead兲 for the severity of
the disability for all relevant diseases
n
DALY =
共D ⫻ L + I ⫻ DW ⫻ T兲
兺
i=1
共10兲
where n = number of all relevant diseases, the first part of the
summation represents the mortality and YLL= DⴱL, with D being
the number of deaths, and L the standard life expectancy at the
age of death due to this cause. The second part of the summation
represents the morbidity, YLD= I ⫻ DW⫻ T, where I = number of
incident cases for both causes of disability; DW= disability
weight for both causes; and T = duration in years of the case until
54 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
remission or death. Due to the unavailability of data on mortality,
only morbidity will be taken into account in this study.
Risk assessment is expressed as the expected HL, which is
equal to the posterior probability of membership of contaminant
共here is nitrate兲 in one of the regulatory classes for each monitoring site times the HL consequence. The expected HL for each
monitoring strategy was calculated from
N
E共HL兲 =
posti · DALYs
兺
i=1
共11兲
where E共HL兲 = expected HL; N = number of monitoring sites; and
posti = posterior probability of the contaminant being in one of the
regulatory classes at each monitoring site; and DALYs
= quantification of the health consequences at each monitoring
site as described before.
EVSI
The expected value of sample information 共EVSI兲 is a measure of
the value of the reduction in uncertainty that may result from the
collection of new sample information. The EVSI is defined as the
difference between the expected HL given prior information and
the expected HL when provided updated information 共U.S. EPA
2001; Tappenden et al. 2004兲. The expected value of perfect information 共EVPI兲 is the difference between the expected HL of
the optimal management decision based on the prior analysis and
the expected HL of the optimal management decision if all uncertainty were to be eliminated; this provides the upper bound for the
EVSI. A detail of obtaining uncertainty is described in the following.
Local Uncertainty
The probabilistic output of the RVM can obtained as an error bar
or predictive variance ␴ˆ 2. This predictive variance 共error bars兲
consists of two variance components: the estimated noise vari2
共maximum a posterior兲 in the data and variance due to
ance ␴MP
uncertainty in the prediction of weights 关uncertainty about the
optimal value of the weights reflected by the posterior distribution
共4兲兴 represented in the second part of Eq. 共12兲 共⌽T⌺⌽兲, with F
and ⌺ the basis function and posterior covariance, respectively, as
defined in Eq. 共5兲 共Tipping 2001兲
2
␴ˆ 2 = ␴MP
+ ⌽T⌺⌽
The methodology is applied on the West Bank/Palestine case
study in the Middle East. The study area covers the whole West
Bank area which is approximately 5 , 660 km2. The total Palestinian population living in the West Bank is about 2.5 million according to projections made by the Palestinian Central Bureau of
Statistics 共PCBS兲 for the year 2005 关Palestinian Center Bureau of
Statistics 共PCBS兲 2002兴. The West Bank has a semiarid climate
with average annual rainfall of 600 mm/year 共24 in./year兲. Rainfall varies from north to south and from west to east, with the
least in the Jordan Valley close to the Dead Sea region 共less than
100 mm/year兲. Rainfall is the main natural source for groundwater recharge. Groundwater is the main natural source for all uses
in the West Bank.
The West Bank groundwater system consists of three main
basins: the eastern, northeastern, and western basins as shown in
Fig. 2. Generally, the aquifer system in these groundwater basins
is a Quaternary-Cretaceous geologic system of Holocene-Albian
age; it is composed of karstic and permeable limestone and dolomite. The aquifer systems consists mainly of three main aquifers:
共1兲 a shallow aquifer occurs in the Quaternay and Teriary systems
of Holocene-Paleocene age; 共2兲 an upper aquifer occurs in the
upper Cenomanian-Turonian age; and 共3兲 a lower aquifer occurs
in the lower Cenomanian age. For more details on the West Bank
groundwater system, the reader is advised to read CH2M HILL
report 共CH2M HILL 2001兲.
Land Use Pattern
Several local practices affect nitrate loading to West Bank aquifers. These include application of fertilizers and pesticides in irrigated areas, leakage from septic tanks and sewage systems in
urban areas, and industrial activities such as food processing and
textiles. For the purpose of this study, five land use categories
were defined in terms of their nitrate contribution: 共1兲 urban areas
representing the main cites of West Bank, including both domestic and industrial activities; 共2兲 rural areas covering all village
built-up areas, where most are not connected to sewage networks,
and septic tanks are the major source of nitrate loading; 共3兲 irrigated areas in Jenin, Tulkarm, Qalqilya, Nablus, Tubas, and Jericho; 共4兲 forests and natural reserves representing natural areas;
and 共5兲 rainfed and range land areas covering most of the area of
the West Bank as shown in Fig. 3.
共12兲
The local uncertainty of predicted maximum nitrate concentration is then quantified here based on the error bars as the width of
the confidence interval of a specified probability for the predicted
maximum nitrate concentration 共i.e., the response or output of the
RVM兲 at each individual location of the existing monitoring sites
共where, again, the location of a monitoring site is the input to the
RVM兲
CI = ⌽␻ ⫾ t␯,␣/2关␴ˆ 2/n兴1/2
Case Study
共13兲
where ⌽␻ = predicted maximum nitrate concentration 共mg/L兲,
i.e., the model output is calculated based on the relevance vectors
共RVs兲; t␯,␣c/2 = t-test statistic at ␯ = n − p degrees of freedom and
confidence level ␣c 共for 95% confidence level= 0.05兲; n
= number of monitoring sites in the existing network configuration; and p = number of model parameters 共p = 2, for weight and
bias兲.
Results and Discussion
To run the RVM model, the first step was to select the kernel
parameters, i.e., kernel type and kernel width. A kernel type that
best suits the structural properties of the data was selected by
judging the RVM performance against selected criteria 共e.g., accuracy兲 of different kernel types 共Gaussian, Laplace, spline,
Cauchy, cubic, thinplate spline, and bubble兲. Fig. 4 presents the
performance of the RVM model for different kernel types. As
shown in the figure, the best performance in terms of accuracy of
prediction was given by the Gaussian kernel, followed by the
Laplace kernel. A systematic procedure was followed to set the
kernel width or the variance to get an optimal monitoring network
configuration.
Several runs 共20 runs兲 of the RVM model produced a range of
possible sampling plans or monitoring network configurations in
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 55
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
Fig. 2. 共a兲 Location map and groundwater quality monitoring site locations in the West Bank; 共b兲 West Bank geological map; and 共c兲 hydrogeological cross section A-A⬘
terms of the number and location of monitoring sites. The RVM
model was run with inputs of well locations and aquifer characteristics 共monitoring well locations, and aquifer characteristic parameters including: depth to water level, recharge to aquifer,
aquifer media, soil characteristics, topography of the area, impact
of vadose zone, and hydraulic conductivity using DRASTIC
method, and regulatory nitrate classes as target. The DRASTIC
method uses index values derived from key aquifer properties to
generate an overall vulnerability distribution for the aquifer system. This is done following the rating and weighting procedure
described by Aller et al. 共1987兲 for defining the DRASTIC index.
The weights are used to determine the relative importance of each
parameter 共factor兲 with respect to the other. The regulatory classes
represent the corresponding log maximum concentrations of nitrate 共as NO−3 兲 classified into three regulatory classes and are assumed to be independent. The predicted regulatory nitrate classes
were used to estimate the probability maps. This produced the
predicted nitrate regulatory classes and the associated predicted
probability map for each class using multinomial logistic regression method. These maps illustrate the spatial distribution of nitrate occurrence in each specified class and reflect the possibility
of nitrate occurrence 共probability兲 represented in ranges for each
of the three classes. These three class membership probability
maps were specified as low probability, or Class 1 关Fig. 5共a兲兴,
moderate probability, or Class 2 关Fig. 5共b兲兴, and high probability,
or Class 3 关Fig. 5共c兲兴. Each of these maps was also classified into
low 共0–0.29兲, moderate 共0.3–0.39兲, and high 共0.4–0.575兲 vulnerability ranges 共classes兲. As is clear from these maps, low probability spots in one class are associated with moderate or high
probability spots in other classes, where the sum of probability
over the three classes at the same grid point is equal to 1. As will
be shown in the following discussion, the more accurate the class
membership prediction, the more reliable class membership probability will be.
The spatial distribution of predicted classes, misclassified
classes 共total incorrect class prediction percent or classification
error兲, and location of relevance vectors with associated prior
probability in each class based on prior information of 95 RVs are
presented in Fig. 5. From this figure, the locations of the RVs are
close to “hot-spot” areas. This reflects the capability of the RVM
model to understand the information contained in the data. The
methodology used for interpolating these results is the inverse
distance weight method of ArcGIS of ESRI.
Low nitrate concentration probability is noticed in the Senonian 共aquitard兲 outcrops in the southeastern part of the eastern
basin which confines the underlying upper aquifer. Few wells and
56 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
from the Eocene aquifer of the northeastern basin, and the communities of Tulkarm, Ramallah and Bethlehem, which obtain
water from the upper aquifer of the western basin.
Sparsity is a key feature of RVM model. This sparsity is illustrated in the ability of the model to achieve good generalization
performance and avoid overfitting while using small number of
basis functions or RVs. These important features of RVM models
共sparsity and generalization兲 reflect the potential of the RVM
model use in water resources decision making and research. In the
context of monitoring, the selection of the kernel width as described in the methodology section, prescribes the desired quality
of the resulting mapping function as well as the number of RVs
共i.e., the size of the monitoring network兲 used in the prediction
and characterization of contaminant distributions in the aquifer.
For example, specifying a desired sparsity level 共in terms of a
maximum allowable number of relevance vectors兲 of 95 RVs generated acceptable class prediction accuracy with associated error
or class misclassification of around 40%. Decreasing the sparsity
of the model by decreasing the kernel width and accordingly increasing the number of RVs to 170 resulted in significant improvements in prediction accuracy, with error reduction of 10%
共the error associated with 170 RVs is 30%兲. Bayesian decision
analysis was used to select an appropriate or optimal monitoring
strategy as explained in the following steps.
Expected Health Loss
Fig. 3. Land use map for the West Bank case study
springs tap the underlying aquifer in this particular area 关Fig.
5共a兲兴. Moderate nitrate contamination probability covers 60% of
the West Bank study area, mainly in the shallow aquifer of the
eastern basin and the upper aquifer of the eastern and western
basins, as shown in Fig. 5共b兲. High nitrate concentration probability covers 13% of the West Bank area and is observed mainly in
the vicinity of Hebron in the southern part of the west bank for
wells and springs that tap water from the upper aquifer in both the
eastern and western basins 关Fig. 5共c兲兴. Other high vulnerability
spots occur near the cities of Jenin and Nablus, which draw water
High nitrate concentrations in drinking water cause different
health outcomes, such as methemoglobinemia, cancer 共due to the
bacterial production of N-nitroso compounds兲, hypertension, increased infant mortality, and diabetes 共Fewtrell 2004兲. Methemoglobin is formed when nitrite 共due to bacterial conversion from
nitrate兲 oxidizes the ferrous iron in hemoglobin to the ferric form.
In this study we will consider only the methemoglobinemia health
effects due to high nitrate concentrations in drinking water. Debate about the carcigonic effect of nitrate is still ongoing and no
decision on the issue of possible carcigonic effects related to nitrates has been published in the U.S. EPA or World Health Organization 共WHO兲 1996 guidelines.
Due to the limited availability of records from the Palestinian
Ministry of Health 共MOH兲 regarding the number of methemoglobinemia cases in the West Bank, an approximate exposure assess-
Fig. 4. Kernel type selection based on RVM model performance
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 57
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
Fig. 5. Predicted probability map of nitrate NO3 based on RVM method 共95 RVs兲 according to regulatory classes: 共a兲 Class 1; 共b兲 Class 2; and
共c兲 Class 3
58 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
Table 1. Uncertainty Level or Misclassification Error, Health Loss,
Expected Health Loss, EVSI, and EVPI for Some Selected Monitoring
Strategies
Expected
Number of
Health
health
monitoring Misclassification
loss
loss
EVSI
EVPI
sites
error
共DALYs兲 共DALYs兲 共DALYs兲 共DALYs兲
95
170
200
250
300
350
413
540
Fig. 6. HI map for West Bank
ment was based on the nitrate concentration in drinking water
supplies using the HQ method of U.S. EPA. The populations that
could be considered at risk and exposed to water of high nitrate
concentration are those people drinking from a water source that
exceeds the HI regulatory limits of U.S. EPA. Since the infant
population is the most sensitive age category affected by methemoglobinemia caused by high nitrate concentrations, the infant
population was estimated equal to the growth rate 共2.6%兲 in communities drinking from polluted sources, and assuming only 30%
of infants are fed by formula. The number of incident cases from
methemoglobinemia caused by high nitrate concentrations is not
registered with the Palestinian MOH. Therefore, to estimate the
number of incident cases from methemoglobinemia, the number
of incident cases of infants from methemoglobinemia were taken
equal to the growth rate times the infected population associated
with HI greater than 1. As clear from Fig. 6, there is no adverse
effect due to nitrate exposure in most of the West Bank area
except in southern part near Hebron, and the main urban areas
共Jenin, Nablus, Tulkarm, Qalqilya, Ramallah, and Bethlehem兲.
The disability weight was estimated to be 0.02 according to
the GBD study 共Lopez et al. 2006兲. The duration of recovery from
methemoglobenimea was estimated as 6 days. HL for the morbidity part 共YLD兲 was quantified with the DALY unit of measurement for each sampling site using the estimated values of number
of incidences, disability weight, and duration to remission or
death, based on Eq. 共11兲.
216
162
144
106
76
44
36
10
38
35.2
34.2
33.5
33.5
33.5
33.5
33.5
20.3
19
18.3
18
18
18
18
18
1.3
2.0
2.3
2.3
2.3
2.3
2.3
tion conditions. The prior information is represented by the sparsest level and obtained by removing most of the redundancy from
the existing monitoring network. This results in the highest uncertainty or misclassification error 共a monitoring strategy having
95 monitoring sites兲. The updated state or information is the state
that has additional sampling sites 共i.e., 170 monitoring sites, more
than the number of monitoring sites of the prior case兲.
Estimation of the consequent improvement or gain in the EVSI
due to adding or updating new data 共by adding more monitoring
sites兲 is continued until an upper bound is reached. This upper
bound is the maximum of the EVSI, i.e., the EVPI, which was
estimated as the difference between the case of a high uncertainty
level 共the prior case兲 and the case of a minimum uncertainty level
共or misclassification兲. Table 1 presents the results of the EVSI and
EVPI according to different monitoring strategies.
The results obtained show the minimal health effect for the
cause of methemoglobenimia due to high nitrate concentration
and affecting only the most sensitive group, the infants. This ignores the issue of potential carcinogenicity and other causes of
disease associated with consumption of food irrigated with water
having high nitrate concentrations. This reduced the sensitivity of
methemoglobenimia for any change in the predicted nitrate concentration. In addition, using the HI method for quantifying the
dose-response relationship, which is more conservative than relying upon calculating the change in prediction in terms of predicted error, also affected this sensitivity to change.
Fig. 7 presents the quantified HL in DALY units for selected
monitoring strategy. As is clear from the figure, even with the
Expected Value of Sample Information
The EVSI was calculated as the difference between the expected
HL under prior information conditions and the updated informa-
Fig. 7. Expected health loss 共DALYs兲 for different monitoring strategies
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 59
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
minimal HL effect, we can still observe tradeoffs between different objectives: minimizing the cost versus improving accuracy
and minimizing the expected HL i.e., maximizing the EVSI. This
result hints at directions for applying the proposed methodology
and framework on disease causes such as carcinogenic diseases
that have greater public attention and more detailed data. This
Bayesian analysis could be extended to a simple multiobjective
problem by examining the possible tradeoffs among sampling
cost 共as represented by the number of sampling wells兲, the value
of added information 共quantified in DALYs兲, and accuracy 共in
terms of misclassification error兲.
Conclusions and Recommendations
The conceptual framework proposed in this paper is an important
contribution for efficient management of water information systems in terms of saving time and effort needed to collect, prepare,
and analyze data, and in understanding the physical structure of
the groundwater system. The importance of this conceptual
framework appears in its flexibility to suit a wide range of possible applications for similar monitoring networks using different
water quality monitoring parameters and hydrometric data collection networks for both groundwater and surface water monitoring
systems. In addition, the aspects that have been incorporated into
the conceptual framework reflect the multiobjective dimension of
monitoring network problems through integration of socioeconomic concerns in terms of health risk using the well-known
DALY approach, value of information, and cost of sampling. Possible tradeoffs among different objectives increase the potential
usefulness of the conceptual framework in the sense that the resulting information is more understandable in terms of the quantified health outcome and reflects a measurable unit that has
meaning for any decision taken.
The Bayesian framework of the RVM model features sparsity,
accuracy, and incorporation of both subjective judgment of prior
knowledge and statistical judgment of the posterior probability; it
uses these to produce probabilistic output. The results of the use
of the RVM model in this research illustrate a significant potential
for future application in the decision-making process. The Bayesian framework of the RVM model shows its capability to infer
information contained in the data and understand the physical
system of the case study, taking into account the uncertainty in
the forecast it produced.
The modeling results provided here show that RVMs can provide an attractive, simple to use, and straightforward analytical
forecasting model for application in monitoring network design.
The RVM model identified a reliable network configuration that is
pertinent to the information contained in the available monitoring
data, with the minimal number of sampling sites, and with the
most significant sampling locations identified in terms of their
information content. Such information could be used to reduce
unnecessary monitoring costs. Such savings could potentially be
used to develop new monitoring locations in areas not previously
monitored, which would be a research topic for further investigation. The proposed tool could potentially assist decision makers
and water resources planners for planning, developing, and protecting groundwater resources, and adopting best management
practices.
The work presented here concentrated on removing redundancy from an existing groundwater quality monitoring network
and selecting sampling sites based on spatial data. Therefore,
there is a need to explore the use of temporal data, or both spatial
and temporal data, to test if temporal data alone or a combination
of both spatial and temporal data produces different sampling
strategies.
More work is needed to improve the capability of RVMs to
illustrate the prior, likelihood, and posterior probability as an outcome of the model, and allow for the gradual addition of new data
to test for incremental improvements in prediction. This would
make the model easier to use and understand.
References
Aller, L., Bennett, T., Lehr, J. H., and Petty, R. J. 共1987兲. “DRASTIC: A
standardized system for evaluating groundwater pollution potential
using hydrogeologic settings.” Rep. No. EPA/600/2-85/0108, U.S.
EPA, Robert S. Kerr Environmental Research Laboratory, Ada, Okla.
Ammar, A. K. 共2007兲. “A Bayesian method for groundwater quality
monitoring network analysis.” Dissertation, Utah State Univ., Salt
Lake City, Utah.
Ammar, K., Khalil, A., McKee, M., and Kaluarachchi, J. 共2008兲. “Bayesian deduction for redundancy detection in groundwater quality monitoring networks.” Water Resour. Res., 44, W08412.
Bates, S., Cullen, A., and Raftery, A. 共2003兲. “Bayesian uncertainty assessment in multicompartment deterministic simulation models for
environmental risk assessment.” Environmetrics, 14, 355–371.
CH2M HILL. 共2001兲. “Aquifer modeling.” Water resources program
phase III, USAID, Washington, D.C.
Fewtrell, L. 共2004兲. “Drinking-water nitrate, methemoglobinemia, and
global burden of disease: A discussion.” Environ. Health Perspect.,
112共14兲, 1371–1374.
Harmancioglu, N. B., Fistikoglu, O., Ozkul, S. D., Singh, V. P., and
Alpaslan, M. N. 共1999兲. Water science and technology library: Water
quality monitoring network design, Kluwer Academic, Dordrecht, The
Netherlands.
Khadam, I., and Kaluarachchi, J. 共2003兲. “Multi-criteria decision analysis
with probabilistic risk assessment for the management of contaminated groundwater.” Environ. Impact. Asses. Rev., 23, 683–721.
Khalil, A., Almasri, M., McKee, M., and Kaluarachchi, J. 共2005a兲. “Applicability of statistical learning algorithms in groundwater quality
modeling.” Water Resour. Res., 41, W05010.
Khalil, A., McKee, M., Kemblowski, M., and Asefa, T. 共2005b兲. “Sparse
Bayesian learning machine for real-time management of reservoir releases.” Water Resour. Res., 41共11兲, W11401.
Khalil, A., McKee, M., Kemblowski, M., Asefa, T., and Bastidas, L.
共2006兲. “Multiobjective analysis of chaotic dynamic systems with
sparse learning machines.” Adv. Water Resour., 29共1兲, 72–88.
Kim, J., and Bridges, T. 共2006兲. “Risk, uncertainty, and decision analysis
applied to the management of aquatic nuisance species.” ERDC/TN
ANSRP-06-1, Aquatic Nuisance Species Research Program, Washington, D.C.
Lopez, A., Mathers, C., Ezzati, M., Jamison, D., and Murray, C. 共2006兲.
Global burden of disease and risk factors, The International Bank for
Reconstruction and Development/World Bank, Washington, D.C.
Palestinian Center Bureau of Statistics 共PCBS兲. 共2002兲. Results of population census of December 1997 in the West Bank and Gaza Strip,
1998, Ramallah, Palestine.
Tappenden, P., Chilcott, J., Eggington, S., Oakley, J., and MaCabe, C.
共2004兲. “Methods for expected value of information analysis in complex health economic models: Developments on the health economics
of interferon-␤ and glatiramer acetate for multiple sclerosis.” Health
Technol. Assess, 8共27兲, 1–92.
Thayananthan, A., Navaratnam, R., Stenger, B. Torr, P., and R. Cilpolla.
共2006兲. “Multivariate relevance vector machines for tracking.” Proc.,
European Conf. on Computer Vision, Dublin, Ireland.
Tipping, M. E. 共2001兲. “Sparse Bayesian learning and the relevance vector machine.” J. Mach. Learn. Res., 1, 211–244.
Tipping M.E., and Faul A.C. 共2003兲. “Fast marginal likelihood maximi-
60 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org
sation for sparse Bayesian models.” Proc., 9th Int. Workshop on Artificial Intelligence and Statistic, C. M. Bishop and B. J. Frey, eds., Key
West, Fla.
Tripathi, S., and Govindaraju, R. S. 共2006兲. “On selection of kernel parameters in relevance vector machines for hydrologic applications.”
Stochastic Environ. Res. Risk Assess., 21共6兲, 747–764.
U.S. EPA. 共2001兲. “Risk assessment guidance for superfund: Volume
III—Part A, Process for conducting probabilistic risk assessment.”
EPA 540-R-02-002 OSWER 9285.7-45 PB2002 963302, Washington,
D.C., 具http://www.epa.gov/superfund/RAGS3A/index.htm典.
World Health Organization 共WHO兲 共1996兲. “Health criteria and other
supporting information.” Guidelines for drinking water quality, 2nd.
Ed. Vol. 2, Geneva, Switzerland.
Yokota, F., and Thompson, K. 共2004兲. “Value of information analysis in
environmental health risk management decisions: Past, present, and
future.” Risk Anal., 24共3兲, 635–650.
Zhang, H., and Malik, J. 共2005兲. “Selecting shape features using multiclass relevance vector machines.” Technical Rep. No. UCB/EECS2005-6, EECS, University of California, Berkeley, Berkeley, Calif.
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 61
Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org