Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bayesian Method for Groundwater Quality Monitoring Network Analysis Khalil Ammar1; Mac McKee2; and Jagath Kaluarachchi3 Abstract: A new methodology is developed to analyze existing monitoring networks. This methodology incorporates different aspects of monitoring, including vulnerability/probability assessment, environmental health risk, the value of information, and redundancy reduction. A conceptual framework for groundwater quality monitoring is formulated to represent the methodology’s context. Relevance vector machine 共RVM兲 plays a basic role in this conceptual framework, and is employed to reduce redundancy and to create probability map of contaminant distribution, and accordingly to estimate the expected value of sample information. Disability adjusted life years approach of the global burden of disease is used for quantifying the health risk consequences. This is demonstrated through a case study application to nitrate contamination monitoring in the West Bank, Palestine. The results obtained from the RVM analysis showed that an overlap error of less than 30% were obtained based on using around 30% of the monitoring sites 共170 relevance vectors兲. This reflects the importance of the RVM as a useful model for improving the efficiency of monitoring systems, both in terms of reducing redundancy and increasing the information content of the collected data. However, in this application, the results of health risk assessment and the evaluation of monitoring investments were less encouraging due to the minimal elasticity of the nitrate health effect with respect to monitoring information and uncertainty. DOI: 10.1061/共ASCE兲WR.1943-5452.0000043 CE Database subject headings: Groundwater management; Monitoring; Groundwater pollution; Bayesian analysis. Author keywords: Groundwater; Monitoring; Pollution; Bayesian analysis; Relevance vector machine; Value of information. Introduction Efficient water management relies upon efficient monitoring systems that have the capability to provide information that are decision relevant. Unfortunately, existing monitoring systems do not always fulfill this objective, where many monitoring systems are designed to gather data that are redundant and do not add management- or decision-relevant information of value. Therefore, the needs to acquire data that are decision-relevant, and efficient, establish a need for the development of cost-effective and flexible analytical methodology for water quality monitoring networks. Recently, increase in using supervised learning machines in water resources management applications has drawn more attention in the research literature. Particularly, using the relevance vector machine 共RVM兲 model in hydrological applications and groundwater quality modeling, Bayesian deduction for redundancy detection in groundwater quality monitoring networks 共Ammar et al. 2008兲, and on selection of kernel parameters in 1 Hydrogeological Scientist, International Center for Biosaline Agriculture, Dubai, United Arab Emirates 共corresponding author兲. E-mail: [email protected] 2 Director, Utah Water Research Laboratory, Utah State Univ., Logan, UT. 3 Professor, Dept. of Civil and Environmental Engineering, Utah State Univ., Logan, UT. Note. This manuscript was submitted on September 5, 2008; approved on July 31, 2009; published online on August 3, 2009. Discussion period open until June 1, 2011; separate discussions must be submitted for individual papers. This paper is part of the Journal of Water Resources Planning and Management, Vol. 137, No. 1, January 1, 2011. ©ASCE, ISSN 0733-9496/2011/1-51–61/$25.00. RVM for hydrologic applications 共Tripathi and Govindaraju 2006兲. The work on applicability of statistical learning algorithms in groundwater quality modeling 共Khalil et al. 2005a兲, real-time management of reservoir releases 共Khalil et al. 2005b兲, and modeling of chaotic hydrologic time series 共Khalil et al. 2006兲. Continuous advances in RVM performance have solved some of the limitations of earlier formulations of these models, such as extending the binary classifier to a multiclass classifier with associated probability estimates as shown in the application of multiclass RVMs for selecting shape features 共Zhang and Malik 2005兲. The extension of RVM to include multivariate outputs 共Thayananthan et al. 2006兲. The rationale for using the RVM model here, is due to its strengths as a forecasting tool: 共1兲 its design yields a sparse model 共redundancy reduction兲, which allows better generalization performance and avoids overfitting; 共2兲 it has the capability to infer information contained in the data due to its Bayesian framework; and 共3兲 it derives accurate probabilistic prediction models and allows the computation of posterior probabilities for a prediction. A Bayesian decision analysis approach is employed here. This approach incorporates prior knowledge about the possible state of a system, and adds new data in a preposterior analysis to produce posterior knowledge of full information about possible system states. In the case of groundwater quality monitoring, this is done to account for uncertainty in the predictions and to estimate the value of information 共VOI兲 in terms of quantifying health risk consequences. In this study, health risk assessment is quantified using the disability adjusted life years 共DALY兲 metric. The main characteristics of DALY approach over other methods are: 共1兲 it enables a comprehensive evaluation of health gains and losses in terms of established public health concepts 共quality and quantity of life and social magnitude兲, using time as a unit of measure- JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 51 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org ment; 共2兲 it combines annual mortality rates and nonlethal end points and explicitly addresses life and health expectancy; 共3兲 DALY measures loss of health, an inverse form of the more general concept of quality adjusted life years, which measures equivalent healthy years lived at the individual level instead of the health of a population; 共4兲 it has been adopted in many countries and by health development agencies as the standard health accounting; and 共5兲 DALY measures the future stream of healthy years of life lost due to each incident case of disease and for each death. Recently, the involvement of the VOI in decision-making process in many disciplines has increased dramatically. The VOI by definition is the amount the decision maker would be willing to pay for information prior to making a decision. In monetary terms, the expected monetary value of the decision with information minus the expected monetary value without the information. The concept of involving VOI in health risk assessment has helped reduce uncertainty in the decision-making process through the collection of new data that have value in terms of providing a greater understanding of health consequences 共Yokota and Thompson 2004兲. They have conducted a review of VOI analysis in environmental health risk decisions. They showed that the complexity of solving VOI problems with continuous probability distributions as inputs to models emerges as the main barrier to greater use of the VOI concept. They suggested the need to standardize approaches and develop some prescriptive guidance for VOI analysts. Kim and Bridges 共2006兲 proposed a structured approach to decision making that explicitly considers the risks and uncertainties within an overall decision analysis framework to provide a systematic process for evaluating how the predicted performance of a management action and the uncertainty associated with that prediction affects objectives that stakeholders care about most. Khadam and Kaluarachchi 共2003兲 proposed an integrated approach for the management of contaminated groundwater resources using health risk assessment and economic analysis through a multicriteria decision analysis framework. The proposed decision analysis framework integrates probabilistic health risk assessment into a cost-based multicriteria decision analysis framework. The results showed the importance of using an integrated approach for decision making considering both costs and risks. Bates et al. 共2003兲 used a special case of Bayesian modeling to make inferences from deterministic models about environmental risk assessment. The results obtained were posterior distributions of contaminant concentrations in soil and vegetables taking into account uncertainty. The objective of this paper is to develop an improved methodology and a systematic approach for groundwater quality monitoring network analysis, using state-of-the-art supervised RVM, and incorporating wide aspects of vulnerability assessment, health risk, and VOI reflected in a multiobjective perspective. This multiobjective dimension includes cost of monitoring, the information content or level of uncertainty, and the risk to public and/or environmental health. This will help evaluate the worth of collecting additional data in selecting the optimal monitoring network configuration. In addition, this will assist in deciding whether available data are sufficient to support decisions on monitoring site selection 共monitoring strategy兲, or insufficient in terms of the added VOI they would supply, and the amount of uncertainty reduction inherent in the information obtained from the monitoring network. The main research contribution is in the introduction of new techniques as the RVM and incorporating integrated methodologies of vulnerability assessment, health risk, and economic VOI for groundwater quality monitoring analysis, assessed with respect to their efficiency and applicability to a real groundwater problem. This is an important contribution to the decision-making process because of incorporating the multiobjective dimension and quantification of possible tradeoffs between different objectives in groundwater monitoring. Methodology The methodology is described systematically as illustrated in the conceptual framework for groundwater quality monitoring network analysis figure. The main steps include selection of constituent of concern 共COC兲, selection of a sparse model for creation of vulnerability/probability maps, health risk assessment, and estimation of the value of information. The process of selecting the COC is done by following the approach proposed by Harmancioglu et al. 共1999兲. Probability map is created using the RVM model through integrating data on the natural system 共hydrogeological information兲 and human activities 共land use and contaminant concentration兲 within a probabilistic framework 共Ammar 2007兲. A Bayesian decision analysis is adopted here to estimate the value of information, through calculating the expected value of sample information 共EVSI兲 by incorporating current knowledge 共prior probability distribution兲 and adding newly collected data to produce posterior probability distributions of the state of a system. The value of additional information is calculated as the difference between the expected loss 共or expected utility兲 that would be achieved under posterior knowledge and the expected loss 共expected utility or payoff兲 under current 共prior兲 knowledge. This should help in deciding whether available data are sufficient to support decisions on monitoring site selection 共monitoring strategy兲 or insufficient in terms of the added VOI and uncertainty reduction as will be described later. The organization of the methodology is as follows: first, we will describe the RVM method. Second, we will describe the Bayesian decision analysis, and most importantly the preposterior components: risk assessment and EVSI. For risk assessment we will elaborate on the assumptions used in quantifying the risk, starting with defining the dose-response relationship as reflected in the hazard quotient, and accordingly, the hazard index to delineate the areas that has adverse effect or acceptable risk, and to define the health loss based on the DALYs approach. Finally, this will help define the EVSI. Relevance Vector Machine RVM plays a basic role in the conceptual framework as illustrated in Fig. 1. The main features of RVM model are sparsity 共reducing redundancy兲, good generalization performance, avoiding overfitting, and producing posterior probability as an output 共Tipping 2001兲. Sparsity within the Bayesian framework of the RVM is achieved as follows: the weights of each input are governed by a set of hyperparameters. These hyperparameters describe the posterior distribution of the weights and are estimated iteratively during training. This continues until convergence occurs by judging if we have attained a local maximum of the marginal likelihood. This implies that most hyperparameters approach infinity, and accordingly setting the posterior distributions of the weights to zero. The remaining vectors with nonzero weights are called relevance vectors. Details of the description are given in the following paragraphs. Note that we used here multivariate output extension algorithm which is an extension to the fast marginal likelihood 52 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org marginal likelihood over the hyperparameters. Then optimal weight matrix is obtained using these optimal set of hyperparameters. Details are described in the following. The mathematical representation of the likelihood is given below as 再 p共y兩w,2兲 = 共22兲−N/2exp − 1 储y − ⌽w储2 22 冎 共1兲 where y = 共y 1 . . . . y N兲T = targets; w = 共wo . . . . wN兲T = weights or relevance vector coefficients; ⌽ = design matrix that contains the response of all basis functions to the inputs, with ⌽ = 关共x1兲 , 共x2兲 , . . . , 共xN兲兴T, wherein 共xn兲 = 关1 , K共xn , x1兲 , K共xn , x2兲 , ¯ , K共xn , xN兲兴T. The kernel function 关K共xn , xN兲兴 is used to introduce nonlinearity into the mapping function. The kernel functions define a set of nonlinear fixed basis functions, N共xn兲 = K共xn , xN兲, where the kernel function is centered on each of the N training data points xn. An important feature of the kernel is the kernel width, which is considered for development of RVM models as the key for forecast accuracy and consequently sparsity. For example, the Gaussian kernel 共radial basis function kernel兲 width is represented here by the standard deviation, , or simply the variance over the noise 2, as shown in the Gaussian kernel equation 冉 K共x,xi兲 = exp − Fig. 1. Conceptual framework for groundwater quality monitoring network analysis maximization algorithm 共Tipping and Faul 2003兲. For simplicity, the RVM methodology is demonstrated according to Tipping formulation 共Tipping 2001兲. Interested readers can read a multivariate paper by Thayananthan 共Thayananthan et al. 2006兲 regarding the details of extending the univariate output to multivariate output. In his extension, he followed similar formulation of Tipping and let the algorithm iterate in two steps: in the first step, to calculate the probability 共likelihood兲 of each example belonging to each class, and in the second step, to estimate the parameters using the probability calculated in the first step. He used a matrix target with number of columns equal to number of variates instead of a vector target, also extended the weight matrix, and used a diagonal covariance matrix. Bayesian learning in RVMs begins by observing a data set consisting of N pairs, 兵xn , y n , n = 1 , 2 , 3 , . . . , N其, where the xn = inputs and the y n = targets. The targets form an N by P matrix, with P the number of regulatory classes, y nxP = 关y共n , 1兲 , . . . , y共N , P兲兴. The likelihood or probability of membership in one of the regulatory classes given information about the input x, written as P 共y 兩 x兲, can be written in terms of the mean or weight w and the variance 2 assuming the error follows independent zero-mean Gausian 共normal兲 distribution 共Tipping 2001兲. This conditional target or data likelihood incorporates the uncertainty of prediction and is obtained as a function of weight variables and hyperparameters. These hyperparameters are the parameters that determine the relevance of the associated basis function. The weight variables are then analytically integrated out to obtain marginal likelihood as function of the hyperparameters. An optimal set of hyperparameters is obtained by maximizing the 储x − xi储 2 2 冊 共2兲 where K共x , xi兲, i = 1 , . . . , n = kernel function for n by m matrix, and the norm reflects the nearness of one point 共x兲 to other point 共xi兲, while the variance controls the width of the bell shape of the Gaussian distribution, i.e., smoothing parameter. In the training process, the RVM considers the training data as prior data and infers the information about class membership from this data. Accordingly, it recognizes a pattern in the data. The addition of new data updates the likelihood of class membership, which also updates the error between the observed data and predicted data 共or regulatory classes兲. This Gaussian prior distribution is introduced to complement the likelihood function and is given in the form as M p共w兩␣兲 = 共2兲 −M/2 兿 ␣m1/2 exp m=1 冉 − ␣mwm2 2 冊 共3兲 where M = number of the independent hyperparameters, ␣ = 共␣1 , . . . , ␣ M 兲T, each hyperparameter is individually controlling the strength of the prior over its associated weight. This form of the prior is responsible for the sparsity properties of the RVM model. Given the previously defined prior and likelihood distributions, the posterior parameter distribution conditioned on the data over all unknowns is defined using Bayes’ rule P共w,␣,2兩y兲 = P共y兩w,␣,2兲P共w,␣,2兲 P共y兲 共4兲 The posterior in Eq. 共4兲 is intractable; an approximation is obtained by decomposing the posterior to P共w , ␣ , 2 兩 y兲 ⬅ P共w 兩 y , ␣ , 2兲P共␣ , 2 兩 y兲. As a consequence, the posterior distribution over the weight becomes analytically solvable. Details of analytically solving these equations are given in Tipping 共2001兲. The analytical posterior distribution over weights is JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 53 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org P共w兩y,␣,2兲 = P共y兩w,2兲P共w兩␣兲 = 共2兲−共N+1兲/2兩⌺兩−1/2 P共y兩␣,2兲 再 1 exp − 共w − 兲T⌺−1共w − 兲 2 冎 HQ = 共5兲 with posterior covariance ⌺ = 共−2⌽T⌽ + A兲−1 and mean = −2⌺⌽Ty, and A = diag共␣1 , ␣2 , . . . , ␣ M 兲. The estimated value of the model weights is given by the mean of the posterior distribution Eq. 共5兲, which is also the maximum a posterior 共MP兲 estimate of the weights using type-II maximum likelihood procedure 共i.e., maximizing the marginal likelihood兲. The optimal values of many hyperparameters ␣MP are infinite. This leads to a parameter posterior distribution infinitely peaked at zero for many weights with the consequence that MP correspondingly comprises very few nonzero elements. That is, sparse Bayesian learning is formulated as the local maximization of the marginal likelihood with respect to ␣. A Bayesian decision analysis is conducted as mentioned before, to evaluate the expected health loss function for each decision or monitoring strategy from the range of all possible outcomes produced from running the RVM model. In addition, to select the decision that satisfies the objectives of monitoring design 共multiobjective dimension兲, i.e., selecting the design that minimizes the expected health loss or maximizing EVSI, minimizes cost of sampling 共roughly measured here in terms of the number of monitoring sites兲, and simultaneously improves uncertainty. The prior state is considered here as the state of having the minimum number of monitoring sites, and the associated probability as the prior probability produced by running the RVM model. This prior probability is actually the posterior probability of the training data used in the RVM model formulation. Preposterior analysis is the core of the approach for calculating the expected value of information. In this step, an assessment of the risk based upon updated information is conducted and, accordingly, an estimation of the expected value of the sample information for each update is done as described below. Risk Assessment Risk is the probability of an adverse outcome and risk assessment is the process of quantifying risk 共probability兲 and associated consequences. The main consequence that will be taken into account in this study is the health loss consequence; other consequences, such as economic or environmental, will not be covered due to lack of data concerning these issues. Health risk assessment generally consists of hazard identification, exposure assessment, dose-response relationship, and risk characterization. The hazard quotient 共HQ兲 is used to define the dose-response relationship of contaminant 共in this study is nitrate兲 exposure. The HQ is the noncancer risk assessment of this exposure, defined as the ratio of the exposure for some generalized or typical hypothetical member of the receptor population at a site, compared to an appropriate toxicity reference value that equates to an acceptable level of risk for that receptor and chemical 共U.S. EPA 2001兲. The main equation to calculate the HQ as defined by U.S. EPA 共2001兲 is CDI = C ⫻ IR ⫻ EF BW ⫻ 365 共6兲 共7兲 where CDI= chronic daily intake of contaminant 共mg/kg-day兲; C = contaminant concentration 共mg/L兲; IR= ingestion rate 共L/day兲; EF= exposure frequency 共days/year兲; BW= bodyweight 共kg兲; and RfD= reference dose for the contaminant. The hazard index 共HI兲 which is the sum of HQs 共as defined before兲 and is used to evaluate the fraction of the population with a HQ above and below 1, where HQ values above 1 are interpreted as indicating the potential for adverse effects and HQ values below 1 are interpreted as indicating acceptable risk. This HI will be used to determine the potential exposure in the population, mainly the infants. The health loss 共HL兲 is the loss in health that the individual will get due to exposure to hazard material. The HL function is defined as HL = DALYs HL = 0 Bayesian Decision Analysis CDI RfD for HI ⱖ 1 Otherwise 共8a兲 共8b兲 where DALY= disability adjusted life years as described next. In order to characterize the risk, the concept of DALYs is adopted in this study to quantify the HL, with time as the unit of measurement. This health impact measure combines years-of-life-lost with years-lived-with-disability that is standardized by means of severity weights using time as the metric. The importance of DALY appears in combining different aspects of public health: quantity of life 共measured by life expectancy and duration of disease兲, quality of lie 共expressed through a severity weight for the adverse health outcome兲, and social magnitude 共or number of people affected兲. HL is defined according to the Global Burden of Disease 共GBD兲 共Lopez et al. 2006兲 as time spent with reduced quality of life aggregated over the population involved, combining years of life lost 共combining mortality and age of death data兲 and years lived with disability, standardized through application of severity weights. Generally, HL using the DALY method is assessed by estimating the number of people affected, estimating the average duration of the adverse health response, including loss of life expectancy as a consequence of premature mortality, assigning weights for severity in unfavorable health conditions, and calculating the HL in DALYs. Therefore, DALYs for a certain cause is calculated as the sum of the years of life lost due to premature mortality 共YLL兲 from that cause and the years of healthy life lost as a result of the disability 共YLD兲 for incident cases of the health condition DALY = YLL + YLD 共9兲 where YLL= number of life years lost due to mortality and YLD= number of years lived with the disability, weighed with a factor between 0 共perfect health兲 and 1 共dead兲 for the severity of the disability for all relevant diseases n DALY = 共D ⫻ L + I ⫻ DW ⫻ T兲 兺 i=1 共10兲 where n = number of all relevant diseases, the first part of the summation represents the mortality and YLL= DⴱL, with D being the number of deaths, and L the standard life expectancy at the age of death due to this cause. The second part of the summation represents the morbidity, YLD= I ⫻ DW⫻ T, where I = number of incident cases for both causes of disability; DW= disability weight for both causes; and T = duration in years of the case until 54 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org remission or death. Due to the unavailability of data on mortality, only morbidity will be taken into account in this study. Risk assessment is expressed as the expected HL, which is equal to the posterior probability of membership of contaminant 共here is nitrate兲 in one of the regulatory classes for each monitoring site times the HL consequence. The expected HL for each monitoring strategy was calculated from N E共HL兲 = posti · DALYs 兺 i=1 共11兲 where E共HL兲 = expected HL; N = number of monitoring sites; and posti = posterior probability of the contaminant being in one of the regulatory classes at each monitoring site; and DALYs = quantification of the health consequences at each monitoring site as described before. EVSI The expected value of sample information 共EVSI兲 is a measure of the value of the reduction in uncertainty that may result from the collection of new sample information. The EVSI is defined as the difference between the expected HL given prior information and the expected HL when provided updated information 共U.S. EPA 2001; Tappenden et al. 2004兲. The expected value of perfect information 共EVPI兲 is the difference between the expected HL of the optimal management decision based on the prior analysis and the expected HL of the optimal management decision if all uncertainty were to be eliminated; this provides the upper bound for the EVSI. A detail of obtaining uncertainty is described in the following. Local Uncertainty The probabilistic output of the RVM can obtained as an error bar or predictive variance ˆ 2. This predictive variance 共error bars兲 consists of two variance components: the estimated noise vari2 共maximum a posterior兲 in the data and variance due to ance MP uncertainty in the prediction of weights 关uncertainty about the optimal value of the weights reflected by the posterior distribution 共4兲兴 represented in the second part of Eq. 共12兲 共⌽T⌺⌽兲, with F and ⌺ the basis function and posterior covariance, respectively, as defined in Eq. 共5兲 共Tipping 2001兲 2 ˆ 2 = MP + ⌽T⌺⌽ The methodology is applied on the West Bank/Palestine case study in the Middle East. The study area covers the whole West Bank area which is approximately 5 , 660 km2. The total Palestinian population living in the West Bank is about 2.5 million according to projections made by the Palestinian Central Bureau of Statistics 共PCBS兲 for the year 2005 关Palestinian Center Bureau of Statistics 共PCBS兲 2002兴. The West Bank has a semiarid climate with average annual rainfall of 600 mm/year 共24 in./year兲. Rainfall varies from north to south and from west to east, with the least in the Jordan Valley close to the Dead Sea region 共less than 100 mm/year兲. Rainfall is the main natural source for groundwater recharge. Groundwater is the main natural source for all uses in the West Bank. The West Bank groundwater system consists of three main basins: the eastern, northeastern, and western basins as shown in Fig. 2. Generally, the aquifer system in these groundwater basins is a Quaternary-Cretaceous geologic system of Holocene-Albian age; it is composed of karstic and permeable limestone and dolomite. The aquifer systems consists mainly of three main aquifers: 共1兲 a shallow aquifer occurs in the Quaternay and Teriary systems of Holocene-Paleocene age; 共2兲 an upper aquifer occurs in the upper Cenomanian-Turonian age; and 共3兲 a lower aquifer occurs in the lower Cenomanian age. For more details on the West Bank groundwater system, the reader is advised to read CH2M HILL report 共CH2M HILL 2001兲. Land Use Pattern Several local practices affect nitrate loading to West Bank aquifers. These include application of fertilizers and pesticides in irrigated areas, leakage from septic tanks and sewage systems in urban areas, and industrial activities such as food processing and textiles. For the purpose of this study, five land use categories were defined in terms of their nitrate contribution: 共1兲 urban areas representing the main cites of West Bank, including both domestic and industrial activities; 共2兲 rural areas covering all village built-up areas, where most are not connected to sewage networks, and septic tanks are the major source of nitrate loading; 共3兲 irrigated areas in Jenin, Tulkarm, Qalqilya, Nablus, Tubas, and Jericho; 共4兲 forests and natural reserves representing natural areas; and 共5兲 rainfed and range land areas covering most of the area of the West Bank as shown in Fig. 3. 共12兲 The local uncertainty of predicted maximum nitrate concentration is then quantified here based on the error bars as the width of the confidence interval of a specified probability for the predicted maximum nitrate concentration 共i.e., the response or output of the RVM兲 at each individual location of the existing monitoring sites 共where, again, the location of a monitoring site is the input to the RVM兲 CI = ⌽ ⫾ t,␣/2关ˆ 2/n兴1/2 Case Study 共13兲 where ⌽ = predicted maximum nitrate concentration 共mg/L兲, i.e., the model output is calculated based on the relevance vectors 共RVs兲; t,␣c/2 = t-test statistic at = n − p degrees of freedom and confidence level ␣c 共for 95% confidence level= 0.05兲; n = number of monitoring sites in the existing network configuration; and p = number of model parameters 共p = 2, for weight and bias兲. Results and Discussion To run the RVM model, the first step was to select the kernel parameters, i.e., kernel type and kernel width. A kernel type that best suits the structural properties of the data was selected by judging the RVM performance against selected criteria 共e.g., accuracy兲 of different kernel types 共Gaussian, Laplace, spline, Cauchy, cubic, thinplate spline, and bubble兲. Fig. 4 presents the performance of the RVM model for different kernel types. As shown in the figure, the best performance in terms of accuracy of prediction was given by the Gaussian kernel, followed by the Laplace kernel. A systematic procedure was followed to set the kernel width or the variance to get an optimal monitoring network configuration. Several runs 共20 runs兲 of the RVM model produced a range of possible sampling plans or monitoring network configurations in JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 55 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org Fig. 2. 共a兲 Location map and groundwater quality monitoring site locations in the West Bank; 共b兲 West Bank geological map; and 共c兲 hydrogeological cross section A-A⬘ terms of the number and location of monitoring sites. The RVM model was run with inputs of well locations and aquifer characteristics 共monitoring well locations, and aquifer characteristic parameters including: depth to water level, recharge to aquifer, aquifer media, soil characteristics, topography of the area, impact of vadose zone, and hydraulic conductivity using DRASTIC method, and regulatory nitrate classes as target. The DRASTIC method uses index values derived from key aquifer properties to generate an overall vulnerability distribution for the aquifer system. This is done following the rating and weighting procedure described by Aller et al. 共1987兲 for defining the DRASTIC index. The weights are used to determine the relative importance of each parameter 共factor兲 with respect to the other. The regulatory classes represent the corresponding log maximum concentrations of nitrate 共as NO−3 兲 classified into three regulatory classes and are assumed to be independent. The predicted regulatory nitrate classes were used to estimate the probability maps. This produced the predicted nitrate regulatory classes and the associated predicted probability map for each class using multinomial logistic regression method. These maps illustrate the spatial distribution of nitrate occurrence in each specified class and reflect the possibility of nitrate occurrence 共probability兲 represented in ranges for each of the three classes. These three class membership probability maps were specified as low probability, or Class 1 关Fig. 5共a兲兴, moderate probability, or Class 2 关Fig. 5共b兲兴, and high probability, or Class 3 关Fig. 5共c兲兴. Each of these maps was also classified into low 共0–0.29兲, moderate 共0.3–0.39兲, and high 共0.4–0.575兲 vulnerability ranges 共classes兲. As is clear from these maps, low probability spots in one class are associated with moderate or high probability spots in other classes, where the sum of probability over the three classes at the same grid point is equal to 1. As will be shown in the following discussion, the more accurate the class membership prediction, the more reliable class membership probability will be. The spatial distribution of predicted classes, misclassified classes 共total incorrect class prediction percent or classification error兲, and location of relevance vectors with associated prior probability in each class based on prior information of 95 RVs are presented in Fig. 5. From this figure, the locations of the RVs are close to “hot-spot” areas. This reflects the capability of the RVM model to understand the information contained in the data. The methodology used for interpolating these results is the inverse distance weight method of ArcGIS of ESRI. Low nitrate concentration probability is noticed in the Senonian 共aquitard兲 outcrops in the southeastern part of the eastern basin which confines the underlying upper aquifer. Few wells and 56 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org from the Eocene aquifer of the northeastern basin, and the communities of Tulkarm, Ramallah and Bethlehem, which obtain water from the upper aquifer of the western basin. Sparsity is a key feature of RVM model. This sparsity is illustrated in the ability of the model to achieve good generalization performance and avoid overfitting while using small number of basis functions or RVs. These important features of RVM models 共sparsity and generalization兲 reflect the potential of the RVM model use in water resources decision making and research. In the context of monitoring, the selection of the kernel width as described in the methodology section, prescribes the desired quality of the resulting mapping function as well as the number of RVs 共i.e., the size of the monitoring network兲 used in the prediction and characterization of contaminant distributions in the aquifer. For example, specifying a desired sparsity level 共in terms of a maximum allowable number of relevance vectors兲 of 95 RVs generated acceptable class prediction accuracy with associated error or class misclassification of around 40%. Decreasing the sparsity of the model by decreasing the kernel width and accordingly increasing the number of RVs to 170 resulted in significant improvements in prediction accuracy, with error reduction of 10% 共the error associated with 170 RVs is 30%兲. Bayesian decision analysis was used to select an appropriate or optimal monitoring strategy as explained in the following steps. Expected Health Loss Fig. 3. Land use map for the West Bank case study springs tap the underlying aquifer in this particular area 关Fig. 5共a兲兴. Moderate nitrate contamination probability covers 60% of the West Bank study area, mainly in the shallow aquifer of the eastern basin and the upper aquifer of the eastern and western basins, as shown in Fig. 5共b兲. High nitrate concentration probability covers 13% of the West Bank area and is observed mainly in the vicinity of Hebron in the southern part of the west bank for wells and springs that tap water from the upper aquifer in both the eastern and western basins 关Fig. 5共c兲兴. Other high vulnerability spots occur near the cities of Jenin and Nablus, which draw water High nitrate concentrations in drinking water cause different health outcomes, such as methemoglobinemia, cancer 共due to the bacterial production of N-nitroso compounds兲, hypertension, increased infant mortality, and diabetes 共Fewtrell 2004兲. Methemoglobin is formed when nitrite 共due to bacterial conversion from nitrate兲 oxidizes the ferrous iron in hemoglobin to the ferric form. In this study we will consider only the methemoglobinemia health effects due to high nitrate concentrations in drinking water. Debate about the carcigonic effect of nitrate is still ongoing and no decision on the issue of possible carcigonic effects related to nitrates has been published in the U.S. EPA or World Health Organization 共WHO兲 1996 guidelines. Due to the limited availability of records from the Palestinian Ministry of Health 共MOH兲 regarding the number of methemoglobinemia cases in the West Bank, an approximate exposure assess- Fig. 4. Kernel type selection based on RVM model performance JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 57 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org Fig. 5. Predicted probability map of nitrate NO3 based on RVM method 共95 RVs兲 according to regulatory classes: 共a兲 Class 1; 共b兲 Class 2; and 共c兲 Class 3 58 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org Table 1. Uncertainty Level or Misclassification Error, Health Loss, Expected Health Loss, EVSI, and EVPI for Some Selected Monitoring Strategies Expected Number of Health health monitoring Misclassification loss loss EVSI EVPI sites error 共DALYs兲 共DALYs兲 共DALYs兲 共DALYs兲 95 170 200 250 300 350 413 540 Fig. 6. HI map for West Bank ment was based on the nitrate concentration in drinking water supplies using the HQ method of U.S. EPA. The populations that could be considered at risk and exposed to water of high nitrate concentration are those people drinking from a water source that exceeds the HI regulatory limits of U.S. EPA. Since the infant population is the most sensitive age category affected by methemoglobinemia caused by high nitrate concentrations, the infant population was estimated equal to the growth rate 共2.6%兲 in communities drinking from polluted sources, and assuming only 30% of infants are fed by formula. The number of incident cases from methemoglobinemia caused by high nitrate concentrations is not registered with the Palestinian MOH. Therefore, to estimate the number of incident cases from methemoglobinemia, the number of incident cases of infants from methemoglobinemia were taken equal to the growth rate times the infected population associated with HI greater than 1. As clear from Fig. 6, there is no adverse effect due to nitrate exposure in most of the West Bank area except in southern part near Hebron, and the main urban areas 共Jenin, Nablus, Tulkarm, Qalqilya, Ramallah, and Bethlehem兲. The disability weight was estimated to be 0.02 according to the GBD study 共Lopez et al. 2006兲. The duration of recovery from methemoglobenimea was estimated as 6 days. HL for the morbidity part 共YLD兲 was quantified with the DALY unit of measurement for each sampling site using the estimated values of number of incidences, disability weight, and duration to remission or death, based on Eq. 共11兲. 216 162 144 106 76 44 36 10 38 35.2 34.2 33.5 33.5 33.5 33.5 33.5 20.3 19 18.3 18 18 18 18 18 1.3 2.0 2.3 2.3 2.3 2.3 2.3 tion conditions. The prior information is represented by the sparsest level and obtained by removing most of the redundancy from the existing monitoring network. This results in the highest uncertainty or misclassification error 共a monitoring strategy having 95 monitoring sites兲. The updated state or information is the state that has additional sampling sites 共i.e., 170 monitoring sites, more than the number of monitoring sites of the prior case兲. Estimation of the consequent improvement or gain in the EVSI due to adding or updating new data 共by adding more monitoring sites兲 is continued until an upper bound is reached. This upper bound is the maximum of the EVSI, i.e., the EVPI, which was estimated as the difference between the case of a high uncertainty level 共the prior case兲 and the case of a minimum uncertainty level 共or misclassification兲. Table 1 presents the results of the EVSI and EVPI according to different monitoring strategies. The results obtained show the minimal health effect for the cause of methemoglobenimia due to high nitrate concentration and affecting only the most sensitive group, the infants. This ignores the issue of potential carcinogenicity and other causes of disease associated with consumption of food irrigated with water having high nitrate concentrations. This reduced the sensitivity of methemoglobenimia for any change in the predicted nitrate concentration. In addition, using the HI method for quantifying the dose-response relationship, which is more conservative than relying upon calculating the change in prediction in terms of predicted error, also affected this sensitivity to change. Fig. 7 presents the quantified HL in DALY units for selected monitoring strategy. As is clear from the figure, even with the Expected Value of Sample Information The EVSI was calculated as the difference between the expected HL under prior information conditions and the updated informa- Fig. 7. Expected health loss 共DALYs兲 for different monitoring strategies JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 59 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org minimal HL effect, we can still observe tradeoffs between different objectives: minimizing the cost versus improving accuracy and minimizing the expected HL i.e., maximizing the EVSI. This result hints at directions for applying the proposed methodology and framework on disease causes such as carcinogenic diseases that have greater public attention and more detailed data. This Bayesian analysis could be extended to a simple multiobjective problem by examining the possible tradeoffs among sampling cost 共as represented by the number of sampling wells兲, the value of added information 共quantified in DALYs兲, and accuracy 共in terms of misclassification error兲. Conclusions and Recommendations The conceptual framework proposed in this paper is an important contribution for efficient management of water information systems in terms of saving time and effort needed to collect, prepare, and analyze data, and in understanding the physical structure of the groundwater system. The importance of this conceptual framework appears in its flexibility to suit a wide range of possible applications for similar monitoring networks using different water quality monitoring parameters and hydrometric data collection networks for both groundwater and surface water monitoring systems. In addition, the aspects that have been incorporated into the conceptual framework reflect the multiobjective dimension of monitoring network problems through integration of socioeconomic concerns in terms of health risk using the well-known DALY approach, value of information, and cost of sampling. Possible tradeoffs among different objectives increase the potential usefulness of the conceptual framework in the sense that the resulting information is more understandable in terms of the quantified health outcome and reflects a measurable unit that has meaning for any decision taken. The Bayesian framework of the RVM model features sparsity, accuracy, and incorporation of both subjective judgment of prior knowledge and statistical judgment of the posterior probability; it uses these to produce probabilistic output. The results of the use of the RVM model in this research illustrate a significant potential for future application in the decision-making process. The Bayesian framework of the RVM model shows its capability to infer information contained in the data and understand the physical system of the case study, taking into account the uncertainty in the forecast it produced. The modeling results provided here show that RVMs can provide an attractive, simple to use, and straightforward analytical forecasting model for application in monitoring network design. The RVM model identified a reliable network configuration that is pertinent to the information contained in the available monitoring data, with the minimal number of sampling sites, and with the most significant sampling locations identified in terms of their information content. Such information could be used to reduce unnecessary monitoring costs. Such savings could potentially be used to develop new monitoring locations in areas not previously monitored, which would be a research topic for further investigation. The proposed tool could potentially assist decision makers and water resources planners for planning, developing, and protecting groundwater resources, and adopting best management practices. The work presented here concentrated on removing redundancy from an existing groundwater quality monitoring network and selecting sampling sites based on spatial data. Therefore, there is a need to explore the use of temporal data, or both spatial and temporal data, to test if temporal data alone or a combination of both spatial and temporal data produces different sampling strategies. More work is needed to improve the capability of RVMs to illustrate the prior, likelihood, and posterior probability as an outcome of the model, and allow for the gradual addition of new data to test for incremental improvements in prediction. This would make the model easier to use and understand. References Aller, L., Bennett, T., Lehr, J. H., and Petty, R. J. 共1987兲. “DRASTIC: A standardized system for evaluating groundwater pollution potential using hydrogeologic settings.” Rep. No. EPA/600/2-85/0108, U.S. EPA, Robert S. Kerr Environmental Research Laboratory, Ada, Okla. Ammar, A. K. 共2007兲. “A Bayesian method for groundwater quality monitoring network analysis.” Dissertation, Utah State Univ., Salt Lake City, Utah. Ammar, K., Khalil, A., McKee, M., and Kaluarachchi, J. 共2008兲. “Bayesian deduction for redundancy detection in groundwater quality monitoring networks.” Water Resour. Res., 44, W08412. Bates, S., Cullen, A., and Raftery, A. 共2003兲. “Bayesian uncertainty assessment in multicompartment deterministic simulation models for environmental risk assessment.” Environmetrics, 14, 355–371. CH2M HILL. 共2001兲. “Aquifer modeling.” Water resources program phase III, USAID, Washington, D.C. Fewtrell, L. 共2004兲. “Drinking-water nitrate, methemoglobinemia, and global burden of disease: A discussion.” Environ. Health Perspect., 112共14兲, 1371–1374. Harmancioglu, N. B., Fistikoglu, O., Ozkul, S. D., Singh, V. P., and Alpaslan, M. N. 共1999兲. Water science and technology library: Water quality monitoring network design, Kluwer Academic, Dordrecht, The Netherlands. Khadam, I., and Kaluarachchi, J. 共2003兲. “Multi-criteria decision analysis with probabilistic risk assessment for the management of contaminated groundwater.” Environ. Impact. Asses. Rev., 23, 683–721. Khalil, A., Almasri, M., McKee, M., and Kaluarachchi, J. 共2005a兲. “Applicability of statistical learning algorithms in groundwater quality modeling.” Water Resour. Res., 41, W05010. Khalil, A., McKee, M., Kemblowski, M., and Asefa, T. 共2005b兲. “Sparse Bayesian learning machine for real-time management of reservoir releases.” Water Resour. Res., 41共11兲, W11401. Khalil, A., McKee, M., Kemblowski, M., Asefa, T., and Bastidas, L. 共2006兲. “Multiobjective analysis of chaotic dynamic systems with sparse learning machines.” Adv. Water Resour., 29共1兲, 72–88. Kim, J., and Bridges, T. 共2006兲. “Risk, uncertainty, and decision analysis applied to the management of aquatic nuisance species.” ERDC/TN ANSRP-06-1, Aquatic Nuisance Species Research Program, Washington, D.C. Lopez, A., Mathers, C., Ezzati, M., Jamison, D., and Murray, C. 共2006兲. Global burden of disease and risk factors, The International Bank for Reconstruction and Development/World Bank, Washington, D.C. Palestinian Center Bureau of Statistics 共PCBS兲. 共2002兲. Results of population census of December 1997 in the West Bank and Gaza Strip, 1998, Ramallah, Palestine. Tappenden, P., Chilcott, J., Eggington, S., Oakley, J., and MaCabe, C. 共2004兲. “Methods for expected value of information analysis in complex health economic models: Developments on the health economics of interferon- and glatiramer acetate for multiple sclerosis.” Health Technol. Assess, 8共27兲, 1–92. Thayananthan, A., Navaratnam, R., Stenger, B. Torr, P., and R. Cilpolla. 共2006兲. “Multivariate relevance vector machines for tracking.” Proc., European Conf. on Computer Vision, Dublin, Ireland. Tipping, M. E. 共2001兲. “Sparse Bayesian learning and the relevance vector machine.” J. Mach. Learn. Res., 1, 211–244. Tipping M.E., and Faul A.C. 共2003兲. “Fast marginal likelihood maximi- 60 / JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org sation for sparse Bayesian models.” Proc., 9th Int. Workshop on Artificial Intelligence and Statistic, C. M. Bishop and B. J. Frey, eds., Key West, Fla. Tripathi, S., and Govindaraju, R. S. 共2006兲. “On selection of kernel parameters in relevance vector machines for hydrologic applications.” Stochastic Environ. Res. Risk Assess., 21共6兲, 747–764. U.S. EPA. 共2001兲. “Risk assessment guidance for superfund: Volume III—Part A, Process for conducting probabilistic risk assessment.” EPA 540-R-02-002 OSWER 9285.7-45 PB2002 963302, Washington, D.C., 具http://www.epa.gov/superfund/RAGS3A/index.htm典. World Health Organization 共WHO兲 共1996兲. “Health criteria and other supporting information.” Guidelines for drinking water quality, 2nd. Ed. Vol. 2, Geneva, Switzerland. Yokota, F., and Thompson, K. 共2004兲. “Value of information analysis in environmental health risk management decisions: Past, present, and future.” Risk Anal., 24共3兲, 635–650. Zhang, H., and Malik, J. 共2005兲. “Selecting shape features using multiclass relevance vector machines.” Technical Rep. No. UCB/EECS2005-6, EECS, University of California, Berkeley, Berkeley, Calif. JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT © ASCE / JANUARY/FEBRUARY 2011 / 61 Downloaded 14 May 2011 to 35.8.11.2. Redistribution subject to ASCE license or copyright. Visit http://www.ascelibrary.org