SUPPLEMENTARY INFORMATION
Estimating Perchlorate and Iodine Mean Concentrations
Based on the observed pattern of analyte concentrations in food and related research efforts [1], we assume that the analyte concentrations $X_{ij}$ of a given food $i$, $i = 1, \ldots, I$, $j = 1, \ldots, n_i$, follow a (potentially zero-inflated) log-normal distribution. Let $(1 - p_i)$ be the probability that a sample from food $i$ is truly zero, that is, contains no perchlorate or iodine. Samples with analyte concentrations below the limit of detection (LOD) are either samples of food $i$ that truly contain no analyte or samples of food $i$ whose true analyte concentration is positive but below the LOD. Let $m_i$ and $s_i$ be the parameters of the log-normal distribution for food $i$, such that $X_{ij} \sim LN(m_i, s_i)$ if and only if $\ln(X_{ij}) \sim N(m_i, s_i)$. The density of $X_{ij}$ is given by

\[
f(X_{ij} \mid p_i, m_i, s_i) = \left[ (1 - p_i) + p_i \, \Phi\!\left( \frac{\ln(\mathrm{LOD}) - m_i}{s_i} \right) \right]^{w_{ij}} \left[ p_i \, LN(X_{ij}; m_i, s_i) \right]^{1 - w_{ij}}
\]

where $w_{ij}$ is an indicator variable with value 1 when $X_{ij} < \mathrm{LOD}$, $LN(\cdot\,; m_i, s_i)$ is the log-normal density, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Hence, if $X = \{X_{ij}\}$, $p = \{p_i\}$, $m = \{m_i\}$, and $s = \{s_i\}$, the likelihood function is given by

\[
L(X \mid p, m, s) = \prod_{i=1}^{I} \prod_{j=1}^{n_i} \left[ (1 - p_i) + p_i \, \Phi\!\left( \frac{\ln(\mathrm{LOD}) - m_i}{s_i} \right) \right]^{w_{ij}} \left[ p_i \, LN(X_{ij}; m_i, s_i) \right]^{1 - w_{ij}}
\]
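As an illustration of the density and likelihood for a single food, the following is a minimal Python sketch using only the standard library. It is not the authors' code. The function names and the `(x, below_lod)` sample layout are hypothetical, and it assumes that $s_i$ is the standard deviation on the log scale, as the standardized argument of $\Phi$ suggests.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: supplies Phi (cdf) and the normal pdf


def lognormal_pdf(x, m, s):
    """Density of LN(m, s) at x > 0, assuming s is the log-scale standard deviation."""
    z = (math.log(x) - m) / s
    return STD_NORMAL.pdf(z) / (x * s)


def log_density(x, below_lod, lod, p, m, s):
    """Log of f(X_ij | p_i, m_i, s_i) for one sample.

    below_lod plays the role of the indicator w_ij.
    """
    if below_lod:
        # w_ij = 1: either a true zero (prob. 1 - p) or a positive value under the LOD
        return math.log((1.0 - p) + p * STD_NORMAL.cdf((math.log(lod) - m) / s))
    # w_ij = 0: a detected, log-normally distributed concentration
    return math.log(p * lognormal_pdf(x, m, s))


def log_likelihood(samples, lod, p, m, s):
    """Sum of log densities over the samples [(x, below_lod), ...] of one food."""
    return sum(log_density(x, w, lod, p, m, s) for x, w in samples)


# Hypothetical data: two detected values and one sample reported below an LOD of 1.0
example = [(2.5, False), (4.0, False), (0.0, True)]
print(log_likelihood(example, lod=1.0, p=0.8, m=1.0, s=0.5))
```

The full likelihood $L(X \mid p, m, s)$ is the sum of such terms over all foods on the log scale.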
First Step
We proceed by using Bayesian estimation to find clusters of the parameters $p = \{p_i\}$, $m = \{m_i\}$, and $s = \{s_i\}$ through the use of “single-p” dependent Dirichlet process (DDP) priors (see MacEachern [2] and De Iorio et al. [3]), i.e.,

\[
(p_i, m_i, s_i) \mid H \sim H, \qquad H \sim DDP(M, G_0)
\]

where $M$ is the concentration parameter and $G_0$ is the base measure composed of three distributions,

\[
\operatorname{logit}(p_i) \sim DP(M, G_0^1), \qquad m_i \sim DP(M, G_0^2), \qquad s_i \sim DP(M, G_0^3)
\]

where $G_0^1 = N(\mu_p, \sigma_p^2)$, $G_0^2 = N(\mu_m, \sigma_m^2)$, and $G_0^3$ is half-normal, i.e., $G_0^3 = N(0, \sigma_s^2) I(0, \infty)$, and $H$ is constructed over the collection of random distributions $\{H_t, t = 1, 2, 3\}$. $G_0^t$, $t = 1, 2,$ or $3$, serves as the best guess for the underlying model $G_0$, and the concentration parameter $M$ determines the a priori confidence in $G_0^t$: larger values of $M$ correspond to greater degrees of confidence in $G_0^t$.

The Dirichlet process (DP) is implemented through a stick-breaking process [4]. That is, a set of independent and identically distributed (iid) atoms, $\operatorname{logit}(p_i) \sim G_0^1$, $m_i \sim G_0^2$, and $s_i \sim G_0^3$, and a set of weights, $p_i = \nu_i \prod_{j<i} (1 - \nu_j)$, where the $\nu_i$ are iid with $\nu_i \sim \mathrm{Beta}(1, M)$ for $i = 1, \ldots, \infty$, are generated. Then

\[
H_t = \sum_{i=1}^{\infty} p_i \, \delta_{\theta_i(t)}, \qquad t = 1, 2, 3
\]

where $\delta_{m_i}$ is a point mass at $m_i$ and $\theta_i(1) = \operatorname{logit}(p_i)$, $\theta_i(2) = m_i$, and $\theta_i(3) = s_i$. Note that $p_i$ is overloaded here: in the weights and in the sum it denotes the stick-breaking weight of atom $i$, whereas inside $\operatorname{logit}(\cdot)$ it is the probability parameter. It is important to note that this representation uses only one set of weights $p$ for the three base distributions, to simplify computations and to induce dependency among the parameters. Furthermore, the sum can be truncated to obtain a reasonable approximation. The effect of this truncation on the distribution of functionals of a DP has been studied in Ohlssen et al. [5] and Ishwaran and Zarepour [6]. The prior is completed by specifying that $\mu_p \sim N(0, 20)$, $\mu_m \sim N(0.5, 20)$, $\mu_s \sim N(0.5, 20)$, and that $\sigma_p^2$, $\sigma_m^2$, and $\sigma_s^2$ all have $N(0, 100) I(0, \infty)$ priors.
The model is fitted via Markov chain Monte Carlo (MCMC) methods [7], where, at each iteration of the MCMC sampler, food profiles $(p_i, m_i, s_i)$ are assigned to clusters. Posterior inference on the parameters was conducted using JAGS (Just Another Gibbs Sampler [8]), through rjags [9], to generate 10,000 MCMC iterations.
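The truncated stick-breaking construction used in this step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the JAGS model used for the analysis. The hyperparameter values are fixed, made-up numbers (the text instead places priors on them), the truncation level of 25 is arbitrary, and the last weight is set to the leftover stick length so that the truncated weights sum to one. All function names are hypothetical.

```python
import random


def stick_breaking_weights(M, truncation, rng):
    """Truncated stick-breaking weights with nu_i ~ Beta(1, M)."""
    weights = []
    remaining = 1.0  # prod_{j<i} (1 - nu_j): stick length not yet allocated
    for _ in range(truncation - 1):
        nu = rng.betavariate(1.0, M)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    weights.append(remaining)  # assumption: final weight absorbs the leftover mass
    return weights


def draw_atoms(truncation, rng, mu_p, sigma_p, mu_m, sigma_m, sigma_s):
    """iid atoms (logit p, m, s) from the three base distributions."""
    atoms = []
    for _ in range(truncation):
        logit_p = rng.gauss(mu_p, sigma_p)   # G_0^1: normal
        m = rng.gauss(mu_m, sigma_m)         # G_0^2: normal
        s = abs(rng.gauss(0.0, sigma_s))     # G_0^3: half-normal
        atoms.append((logit_p, m, s))
    return atoms


def assign_foods(n_foods, weights, atoms, rng):
    """Each food receives one whole atom, chosen with the single shared weight vector."""
    picks = rng.choices(range(len(atoms)), weights=weights, k=n_foods)
    return [(k, atoms[k]) for k in picks]


rng = random.Random(1)
weights = stick_breaking_weights(M=1.0, truncation=25, rng=rng)
atoms = draw_atoms(25, rng, mu_p=0.0, sigma_p=1.5, mu_m=0.5, sigma_m=1.0, sigma_s=1.0)
foods = assign_foods(10, weights, atoms, rng)
print(sorted({k for k, _ in foods}))  # cluster labels that ended up in use
```

Because one weight vector selects the whole triple, foods that share an atom share all three parameters, which is what produces the clusters of food profiles.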
Second Step
Since the clusters of foods have been established within the MCMC run, the likelihood function of all the sample analyte concentrations within a cluster $c$ is formed at each iteration, such that

\[
L(X_c \mid p_c, m_c, s_c, C = c) = \prod_{\substack{i = 1 \\ i \in c}}^{I} \prod_{j=1}^{n_i} \left[ (1 - p_c) + p_c \, \Phi\!\left( \frac{\ln(\mathrm{LOD}) - m_c}{s_c} \right) \right]^{w_{ij}} \left[ p_c \, LN(X_{ij}; m_c, s_c) \right]^{1 - w_{ij}}
\]

Then, classical maximum likelihood (ML) estimates or Bayesian estimates for $p_c$, $m_c$, and $s_c$ can be used. In our case, we used the ML estimate on 1,001 random iterations of the MCMC chain. This is implemented through classical general-purpose optimization methods in R [10].
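For one MCMC iteration, the pooling of samples by cluster and the cluster-level ML estimate could look like the Python sketch below. It is an illustration only. The analysis used general-purpose optimizers in R, whereas this sketch substitutes a plain grid search to stay within the standard library. The function names, the `(x, below_lod)` sample layout, and the grids are hypothetical, and $s_c$ is assumed to be the log-scale standard deviation.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def cluster_log_likelihood(samples, lod, p, m, s):
    """Log of L(X_c | p_c, m_c, s_c) for pooled samples [(x, below_lod), ...]."""
    if not 0.0 < p <= 1.0 or s <= 0.0:
        return float("-inf")
    censored_mass = (1.0 - p) + p * PHI((math.log(lod) - m) / s)
    total = 0.0
    for x, below_lod in samples:
        if below_lod:
            if censored_mass <= 0.0:
                return float("-inf")  # these parameters rule out a censored sample
            total += math.log(censored_mass)
        else:
            z = (math.log(x) - m) / s
            # log of p * LN(x; m, s)
            total += (math.log(p) - math.log(x * s)
                      - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)
    return total


def pool_by_cluster(food_samples, labels):
    """Pool the samples of all foods that share a cluster label in one iteration."""
    pooled = {}
    for food, samples in food_samples.items():
        pooled.setdefault(labels[food], []).extend(samples)
    return pooled


def grid_mle(samples, lod, p_grid, m_grid, s_grid):
    """Grid-search stand-in for a general-purpose optimizer; returns (p, m, s)."""
    best = None
    for p in p_grid:
        for m in m_grid:
            for s in s_grid:
                ll = cluster_log_likelihood(samples, lod, p, m, s)
                if best is None or ll > best[0]:
                    best = (ll, p, m, s)
    return best[1:]


# Hypothetical foods and cluster labels from a single MCMC iteration
food_samples = {
    "food_a": [(2.0, False), (3.5, False), (0.0, True)],
    "food_b": [(2.8, False), (0.0, True)],
    "food_c": [(40.0, False), (55.0, False)],
}
labels = {"food_a": 0, "food_b": 0, "food_c": 1}
p_grid = [0.1 * k for k in range(1, 11)]
m_grid = [-2.0 + 0.5 * k for k in range(15)]
s_grid = [0.2 * k for k in range(1, 11)]
for c, samples in pool_by_cluster(food_samples, labels).items():
    print(c, grid_mle(samples, 1.0, p_grid, m_grid, s_grid))
```

Repeating this for each sampled iteration gives one set of cluster-level estimates per iteration. A finer grid or a proper numerical optimizer would be needed for estimates of practical precision.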
Development of the Figures
Figures were developed to illustrate the clustering of the foods according to the pattern of perchlorate (Figure S-1) or iodine (Figure S-2) concentrations in different foods.
These figures were developed by finding the partition that best represents the final average probability matrix. The posterior similarity matrix is constructed as follows: at each iteration of the MCMC sampler, a score matrix is formed with element $(i, j)$ set equal to 1 if foods $i$ and $j$ belong to the same cluster, and zero otherwise. At the end of the estimation process, a probability matrix, $S$, is formed by averaging the score matrices obtained at each iteration, so element $S_{ij}$ denotes the probability that foods $i$ and $j$ are assigned to the same cluster. However, “label switching” prohibits making inference on the class-specific parameters, because draws of class-specific parameters may be associated with different class labels during the course of the MCMC run. Consequently, class-specific posterior summaries that average across the draws will be invalid. Dahl [11] suggests an approach to identify the best partition by choosing among all the partitions generated by the sampler, that is, the partition that minimizes the squared distance to the matrix $S$. This is accomplished by maximizing the posterior expected adjusted Rand index (PEAR) using the mcclust R library [12]. The adjusted Rand index measures similarity between estimated and posterior expected clusters but is corrected for chance.
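The construction of $S$ and the choice of a representative partition can be sketched in Python as follows. This is a minimal illustration of the least-squares criterion described above, applied to a made-up set of four draws for three foods. It does not reproduce the PEAR computation of the mcclust package, and the function names are hypothetical.

```python
def similarity_matrix(draws):
    """Average the 0/1 co-clustering score matrices over the MCMC draws."""
    n = len(draws[0])
    S = [[0.0] * n for _ in range(n)]
    for labels in draws:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    S[i][j] += 1.0
    return [[value / len(draws) for value in row] for row in S]


def least_squares_partition(draws):
    """Among the sampled partitions, return the one closest to S in squared distance."""
    S = similarity_matrix(draws)
    n = len(S)

    def squared_distance(labels):
        return sum(
            ((1.0 if labels[i] == labels[j] else 0.0) - S[i][j]) ** 2
            for i in range(n)
            for j in range(n)
        )

    return min(draws, key=squared_distance)


# Hypothetical cluster labels for three foods over four iterations.
# The second draw is the same partition as the first with the labels switched.
draws = [[0, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
S = similarity_matrix(draws)
best = least_squares_partition(draws)
```

Because the score matrices depend only on whether two foods share a label, and not on the label values, both $S$ and the selected partition are unaffected by label switching.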
Figure S-1 shows a violin plot of perchlorate concentration on the y-axis (µg/kg) for each of the identified clusters, with the number of foods per cluster on the x-axis. Each dot represents a perchlorate concentration, and the dashed horizontal line indicates the perchlorate LOD of 1 µg/kg. For some clusters (for example, where the number of foods=3), all values are above the LOD, while for other clusters (for example, where the number of foods=31), the majority of perchlorate values are at the LOD. Figure S-2 shows a similar violin plot for iodine, where the y-axis is iodine concentration in mg/kg and the x-axis is the number of foods per cluster. The dashed horizontal line represents the LOD of 0.03 mg/kg (although iodine had a range of LODs from 0.03 to 0.06 mg/kg).
The code is available on request.
References for Supplemental Information
1. European Food Safety Authority. Management of left-censored data in dietary exposure assessment of chemical substances. EFSA Journal 2010; 8.
2. MacEachern SN. Dependent Dirichlet processes. Columbus, OH: The Ohio State University, Department of Statistics, 2000.
3. De Iorio M, Müller P, Rosner GL, MacEachern SN. An ANOVA model for dependent random measures. J Am Stat Assoc 2004; 99: 205-215.
4. Sethuraman J. A constructive definition of Dirichlet priors. Stat Sin 1994; 4: 639-650.
5. Ohlssen DI, Sharples LD, Spiegelhalter DJ. Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat Med 2007; 26: 2088-2112.
6. Ishwaran H, Zarepour M. Exact and approximate sum representations for the Dirichlet process. Can J Stat 2002; 30: 269-283.
7. Gilks WR, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC: London, 1996.
8. Plummer M. JAGS Version 3.4.0 User Manual, 2013.
9. Plummer M. rjags: Bayesian Graphical Models using MCMC. R Package Version 4-4. 2015. Available at http://CRAN.R-project.org/package=rjags (accessed March 4, 2016).
10. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria, 2015. Available at http://www.R-project.org/ (accessed March 4, 2016).
11. Dahl DB. Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model. In: Do KA, Müller P, Vannucci M (eds). Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, 2006, pp 201-218.
12. Fritsch A, Ickstadt K. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal 2009; 4: 367-391.
Table S1. Total Diet Study sample collection dates and locations for perchlorate and iodine data.

Market basket  Sample collection dates  Collection region and locations
2008-1         October-November 2007    Central (Toledo, OH; Detroit, MI; Minneapolis-St. Paul, MN)
2008-2         January-February 2008    West (Albuquerque, NM; Phoenix-Mesa, AZ; Reno, NV)
2008-3         March-May 2008           South (Baltimore, MD; Houston, TX; Tampa, FL)
2008-4         July-August 2008         North (Buffalo, NY; Voorhees, NJ; Philadelphia, PA)
2009-1         October-November 2008    Central (Chicago, IL; Columbus, OH; Springfield, MO)
2009-2         January-February 2009    West (Colorado Springs, CO; Oakland, CA; Spokane, WA)
2009-3         April-May 2009           South (Greenville, NC; Austin, TX; Montgomery, AL)
2009-4         July-August 2009         North (New York, NY; Newark, NJ; Concord, NH)
2010-1         October-November 2009    Central (Lansing, MI; Des Moines, IA; Madison, WI)
2010-2         January-February 2010    West (Riverside-San Bernardino, CA; San Francisco, CA; Yakima, WA)
2010-3         April-May 2010           South (Charleston, WV; Tampa-St. Petersburg-Clearwater, FL; New Orleans, LA)
2010-4         July-August 2010         North (Boston, MA; Syracuse, NY; Pittsburgh, PA)
2011-1         October-November 2010    Central (Chicago, IL; Youngstown-Warren, OH; Kalamazoo-Battle Creek, MI)
2011-2         January-February 2011    West (Salt Lake City-Ogden, UT; Los Angeles-Long Beach, CA; Boise, ID)
2011-3         April-May 2011           South (Atlanta, GA; Roanoke, VA; San Antonio, TX)
2011-4         July-August 2011         North (Hartford, CT; Morris-Passaic, NJ; Scranton-Wilkes-Barre, PA)
2012-1         October-November 2011    Central (Peoria, IL; Wichita, KS; St. Cloud, MN)
2012-2         January-February 2012    West (Boulder, CO; Las Vegas, NV; Seattle, WA)
2012-3         April-May 2012           South (Raleigh, NC; West Palm Beach-Boca Raton, FL; Nashville, TN)
2012-4         July-August 2012         North (Monmouth-Ocean, NJ; Albany, NY; Chester County, PA)