SUPPLEMENTARY INFORMATION

Estimating Perchlorate and Iodine Mean Concentrations

Based on the observed pattern of analyte concentrations in food and related research efforts¹, we assume that the analyte concentrations $X_{ij}$ of a given food $i$, $i = 1, \dots, I$, $j = 1, \dots, n_i$, follow a (potentially zero-inflated) log-normal distribution. Let $(1 - p_i)$ be the probability that a sample from food $i$ is truly zero, i.e., contains no perchlorate or iodine. Samples with analyte concentrations below the limit of detection (LOD) are therefore either samples of food $i$ that truly contain no analyte or samples whose true analyte concentration is positive but below the LOD. Let $m_i$ and $s_i$ be the parameters of the log-normal distribution for food $i$, such that $X_{ij} \sim LN(m_i, s_i)$ if and only if $\ln(X_{ij}) \sim N(m_i, s_i)$. The density of $X_{ij}$ is given by

$$ f(X_{ij} \mid p_i, m_i, s_i) = \left[ (1 - p_i) + p_i \, \Phi\!\left( \frac{\ln \mathrm{LOD} - m_i}{s_i} \right) \right]^{w_{ij}} \left[ p_i \, LN(X_{ij}; m_i, s_i) \right]^{1 - w_{ij}} $$

where $w_{ij}$ is an indicator variable with value 1 when $X_{ij} < \mathrm{LOD}$ and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Hence, if $X = \{X_{ij}\}$, $p = \{p_i\}$, $m = \{m_i\}$, and $s = \{s_i\}$, the likelihood function is given by

$$ L(X \mid p, m, s) = \prod_{i=1}^{I} \prod_{j=1}^{n_i} \left[ (1 - p_i) + p_i \, \Phi\!\left( \frac{\ln \mathrm{LOD} - m_i}{s_i} \right) \right]^{w_{ij}} \left[ p_i \, LN(X_{ij}; m_i, s_i) \right]^{1 - w_{ij}} $$

First step

We proceed by using Bayesian estimation to find clusters of the parameters $p = \{p_i\}$, $m = \{m_i\}$, and $s = \{s_i\}$ through the use of "Single-p" dependent Dirichlet process (DDP) priors (see MacEachern² and De Iorio et al.³), i.e.,

$$ (p_i, m_i, s_i) \mid H \sim H, \qquad H \sim DDP(M, G_0) $$

where $M$ is the concentration parameter and $G_0$ is the base measure composed of three distributions,

$$ \mathrm{logit}(p_i) \sim DP(M, G_0^1), \qquad m_i \sim DP(M, G_0^2), \qquad s_i \sim DP(M, G_0^3) $$

where $G_0^1 = N(\mu_p, \sigma_p^2)$, $G_0^2 = N(\mu_m, \sigma_m^2)$, and $G_0^3$ is half-normal, i.e., $G_0^3 = N(0, \sigma_s^2) \, I(0, \infty)$, and $H$ is constructed over the collection of random distributions $\{H^t, t = 1, 2, 3\}$. $G_0^t$, $t = 1, 2,$ or $3$, serves as the best guess for the underlying model, and the concentration parameter $M$ determines the a priori confidence in $G_0^t$.
Larger values of $M$ correspond to greater degrees of confidence in $G_0^t$. The Dirichlet process (DP) is implemented through a stick-breaking process⁴. That is, a set of independent and identically distributed (iid) atoms, $\mathrm{logit}(p_i) \sim G_0^1$, $m_i \sim G_0^2$, and $s_i \sim G_0^3$, and a set of weights,

$$ \pi_i = \beta_i \prod_{j < i} (1 - \beta_j), $$

where the $\beta_i$ are iid with $\beta_i \sim \mathrm{Beta}(1, M)$ for $i = 1, \dots, \infty$, are generated. Then

$$ H^t = \sum_{i=1}^{\infty} \pi_i \, \delta_{\theta_i^{(t)}}, \qquad t = 1, 2, 3, $$

where $\delta_x$ is a point mass at $x$ and $\theta_i^{(1)} = \mathrm{logit}(p_i)$, $\theta_i^{(2)} = m_i$, and $\theta_i^{(3)} = s_i$. It is important to note that this representation uses only one set of weights $\pi$ (the single "p" in the name of the prior) for the three base distributions, which simplifies computation and induces dependency among the parameters. Furthermore, the sum can be truncated to obtain a reasonable approximation to $H^t$. The effect of this truncation on the distribution of functionals of a DP has been studied in Ohlssen et al.⁵ and Ishwaran and Zarepour⁶. The prior is completed by specifying that $\mu_p \sim N(0, 20)$, $\mu_m \sim N(0.5, 20)$, $\mu_s \sim N(0.5, 20)$, and that $\sigma_p^2$, $\sigma_m^2$, and $\sigma_s^2$ all have $N(0, 100) \, I(0, \infty)$ priors. The model is fitted via Markov chain Monte Carlo (MCMC) methods⁷, where, at each iteration of the MCMC sampler, food profiles $(p_i, m_i, s_i)$ are assigned to clusters. Posterior inference on the parameters was conducted using JAGS (Just Another Gibbs Sampler⁸), run through rjags⁹, to generate 10,000 MCMC iterations.

Second Step

Since the clusters of foods are established within the MCMC run, the likelihood function of all the sample analyte concentrations within a cluster $c$ is formed at each iteration, such that

$$ L(X_c \mid p_c, m_c, s_c, C = c) = \prod_{\substack{i = 1 \\ i \in c}}^{I} \prod_{j=1}^{n_i} \left[ (1 - p_c) + p_c \, \Phi\!\left( \frac{\ln \mathrm{LOD} - m_c}{s_c} \right) \right]^{w_{ij}} \left[ p_c \, LN(X_{ij}; m_c, s_c) \right]^{1 - w_{ij}} $$

Then, classical maximum likelihood (ML) estimates or Bayesian estimates for $p_c$, $m_c$, and $s_c$ can be used. In our case, we used the ML estimate on 1,001 random iterations of the MCMC chain. This is implemented through classical general-purpose optimization methods in R¹⁰.
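As an illustration of the second step, the cluster-level likelihood can be maximized numerically. The following is a minimal sketch under stated assumptions, not the authors' R code: it is written in Python, it pools the samples of a single simulated cluster, it treats $s_c$ as the standard deviation on the log scale, and it replaces R's general-purpose optimizer with a simple compass (coordinate pattern) search. All function names, the simulated data, and the "true" values are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def neg_log_lik(theta, log_detects, n_censored, lod):
    """Negative log-likelihood of one cluster's pooled samples.

    theta = (logit p_c, m_c, log s_c), so the search is unconstrained.
    log_detects holds ln(X_ij) for samples at or above the LOD.
    """
    p = sigmoid(theta[0])
    m = theta[1]
    s = math.exp(theta[2])
    # P(sample < LOD): a true zero, or a log-normal draw below the LOD.
    p_cens = (1.0 - p) + p * norm_cdf((math.log(lod) - m) / s)
    ll = n_censored * math.log(max(p_cens, 1e-300))
    log_p = math.log(max(p, 1e-300))
    for y in log_detects:
        z = (y - m) / s
        # log of p_c * LN(x; m_c, s_c), with y = ln(x)
        ll += log_p - y - math.log(s) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    return -ll


def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=5000):
    """Derivative-free minimizer (assumed stand-in for R's optimizer)."""
    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for k in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[k] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5  # shrink the pattern when no direction helps
    return x, fx


# Hypothetical cluster: 1,000 samples, 30% true zeros, LOD = 1.
random.seed(1)
TRUE_P, TRUE_M, TRUE_S, LOD = 0.7, 1.0, 0.5, 1.0
draws = [random.lognormvariate(TRUE_M, TRUE_S) if random.random() < TRUE_P else 0.0
         for _ in range(1000)]
log_detects = [math.log(x) for x in draws if x >= LOD]
n_censored = sum(1 for x in draws if x < LOD)

theta_hat, nll = compass_search(
    lambda th: neg_log_lik(th, log_detects, n_censored, LOD), [0.0, 0.0, 0.0])
p_hat, m_hat, s_hat = sigmoid(theta_hat[0]), theta_hat[1], math.exp(theta_hat[2])
print(f"p_c = {p_hat:.3f}, m_c = {m_hat:.3f}, s_c = {s_hat:.3f}")
```

Working on the logit and log scales keeps $p_c$ in (0, 1) and $s_c$ positive without explicit constraints. In the procedure described above, a fit of this kind would be repeated for every cluster at each of the sampled MCMC iterations.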
Development of the Figures

Figures were developed to illustrate the clustering of the foods according to the pattern of perchlorate (Figure S-1) or iodine (Figure S-2) concentrations in different foods. These figures were developed by finding the partition that best represents the final average probability matrix. The posterior similarity matrix is constructed as follows: at each iteration of the MCMC sampler, a score matrix is formed in which element $(i, j)$ equals 1 if foods $i$ and $j$ belong to the same cluster, and zero otherwise. At the end of the estimation process, a probability matrix, $S$, is formed by averaging the score matrices obtained at each iteration, so element $S_{ij}$ denotes the probability that foods $i$ and $j$ are assigned to the same cluster. However, "label switching" prevents inference on the class-specific parameters, because draws of class-specific parameters may be associated with different class labels during the course of the MCMC run. Consequently, class-specific posterior summaries that average across the draws will be invalid. Dahl¹¹ suggests an approach to identify the best partition by choosing among all the partitions generated by the sampler, that is, the partition that minimizes the squared distance to the matrix $S$. In practice, the partition was selected by maximizing the posterior expected adjusted Rand index (PEAR) using the mcclust R library¹². The adjusted Rand index measures similarity between estimated and posterior expected clusters but is corrected for chance.

Figure S-1 shows a violin plot of perchlorate concentration on the y-axis (µg/kg) for each of the identified clusters, with the number of foods per cluster on the x-axis. Each dot represents a perchlorate concentration, with the dashed horizontal line indicating the perchlorate LOD of 1 µg/kg.
For some clusters (for example, where the number of foods = 3), all values are above the LOD, while for other clusters (for example, where the number of foods = 31), the majority of perchlorate values are at the LOD.

Figure S-2 shows a similar violin plot for iodine, where the y-axis is iodine concentration in mg/kg and the x-axis is the number of foods per cluster. The dashed horizontal line represents the LOD of 0.03 mg/kg (although iodine had a range of LODs from 0.03 to 0.06 mg/kg).

The code is available on request.

References for Supplemental Information

1. European Food Safety Authority. Management of left-censored data in dietary exposure assessment of chemical substances. EFSA Journal 2010; 8.
2. MacEachern SN. Dependent Dirichlet processes. Columbus, OH: The Ohio State University, Department of Statistics, 2000.
3. De Iorio M, Müller P, Rosner GL, MacEachern SN. An ANOVA model for dependent random measures. J Am Stat Assoc 2004; 99: 205-215.
4. Sethuraman J. A constructive definition of Dirichlet priors. Stat Sin 1994; 4: 639-650.
5. Ohlssen DI, Sharples LD, Spiegelhalter DJ. Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat Med 2007; 26: 2088-2112.
6. Ishwaran H, Zarepour M. Exact and approximate sum representations for the Dirichlet process. Can J Stat 2002; 30: 269-283.
7. Gilks WR, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC: London, 1996.
8. Plummer M. JAGS Version 3.4.0 User Manual, 2013.
9. Plummer M. rjags: Bayesian Graphical Models using MCMC. R Package Version 4-4. 2015. Available at http://CRAN.R-project.org/package=rjags (accessed March 4, 2016).
10. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria, 2015. Available at http://www.R-project.org/ (accessed March 4, 2016).
11. Dahl DB. Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model. In: Do KA, Müller P, Vannucci M (eds).
Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, 2006, pp 201-218.
12. Fritsch A, Ickstadt K. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal 2009; 4: 367-391.

Table S1. Total Diet Study sample collection dates and locations for perchlorate and iodine data.

| Market basket | Sample collection dates | Collection region and locations |
|---|---|---|
| 2008-1 | October-November 2007 | Central (Toledo, OH; Detroit, MI; Minneapolis-St. Paul, MN) |
| 2008-2 | January-February 2008 | West (Albuquerque, NM; Phoenix-Mesa, AZ; Reno, NV) |
| 2008-3 | March-May 2008 | South (Baltimore, MD; Houston, TX; Tampa, FL) |
| 2008-4 | July-August 2008 | North (Buffalo, NY; Voorhees, NJ; Philadelphia, PA) |
| 2009-1 | October-November 2008 | Central (Chicago, IL; Columbus, OH; Springfield, MO) |
| 2009-2 | January-February 2009 | West (Colorado Springs, CO; Oakland, CA; Spokane, WA) |
| 2009-3 | April-May 2009 | South (Greenville, NC; Austin, TX; Montgomery, AL) |
| 2009-4 | July-August 2009 | North (New York, NY; Newark, NJ; Concord, NH) |
| 2010-1 | October-November 2009 | Central (Lansing, MI; Des Moines, IA; Madison, WI) |
| 2010-2 | January-February 2010 | West (Riverside-San Bernardino, CA; San Francisco, CA; Yakima, WA) |
| 2010-3 | April-May 2010 | South (Charleston, WV; Tampa-St. Petersburg-Clearwater, FL; New Orleans, LA) |
| 2010-4 | July-August 2010 | North (Boston, MA; Syracuse, NY; Pittsburgh, PA) |
| 2011-1 | October-November 2010 | Central (Chicago, IL; Youngstown-Warren, OH; Kalamazoo-Battle Creek, MI) |
| 2011-2 | January-February 2011 | West (Salt Lake City-Ogden, UT; Los Angeles-Long Beach, CA; Boise, ID) |
| 2011-3 | April-May 2011 | South (Atlanta, GA; Roanoke, VA; San Antonio, TX) |
| 2011-4 | July-August 2011 | North (Hartford, CT; Morris-Passaic, NJ; Scranton-Wilkes-Barre, PA) |
| 2012-1 | October-November 2011 | Central (Peoria, IL; Wichita, KS; St. Cloud, MN) |
| 2012-2 | January-February 2012 | West (Boulder, CO; Las Vegas, NV; Seattle, WA) |
| 2012-3 | April-May 2012 | South (Raleigh, NC; West Palm Beach-Boca Raton, FL; Nashville, TN) |
| 2012-4 | July-August 2012 | North (Monmouth-Ocean, NJ; Albany, NY; Chester County, PA) |