Supplementary Information

Calculation of Parameters for Simulations

To simulate pooled-sample data that might result from individual samples, I considered a group of compounds that typically have a large percentage of non-detectable values when measured during the National Health and Nutrition Examination Survey (NHANES) of the Centers for Disease Control and Prevention (CDC), in conjunction with CDC’s ongoing biomonitoring of the U.S. population’s exposure to environmental chemicals. To ensure reliable estimates of the mean and variance of the measurements, I considered only the chemicals in this group with the highest percentages of detectable results. Because I expected my pooled-sample estimation method (which relies on variance estimates to correct biases) to be affected by the among-subject variance structure, I chose from this group two chemicals with different among-subject variance structures: 2,2’,4,4’,5,5’-hexachlorobiphenyl (PCB 153) and 1,1’-(2,2-dichloroethenylidene)-bis[4-chlorobenzene] (p,p’-DDE). The among-subject variance for PCB 153 measurements is uniform across all concentrations, whereas the among-subject variance for p,p’-DDE increases with increasing concentration.

To make the simulation study as realistic as possible and to evaluate whether pooled sampling might serve as an alternative to CDC’s current individual-sample approach to biomonitoring, I assumed that the pooled sampling design would correspond to the NHANES 2005–2006 stratified multistage probability survey design for polychlorinated and polybrominated compounds. In this design, 2,193 samples were collected and could be divided into subpopulations based on gender, race-ethnicity, and age.
To assure adequate sample sizes for subpopulation estimation and to correspond to the demographic domains of the NHANES complex multistage area probability design and the race-ethnicity categories used in the Third National Report on Human Exposure to Environmental Chemicals (NCEH, 2005), I limited race-ethnicity to Mexican-Americans, Non-Hispanic Blacks, and Non-Hispanic Whites, and used four age groups (12–19 years, 20–39 years, 40–59 years, and 60+ years). As a result, 2,034 samples were available for pooling from 24 subpopulations defined by 2 levels of gender, 3 levels of race-ethnicity, and 4 levels of age group. As shown in Table 1 (main text), the number of samples available varied across the 24 subpopulations and ranged from 36 to 145.

For this evaluation, I performed simulations to determine the optimum number of samples to combine in each pool by evaluating the bias of geometric mean and 95th percentile estimates for all possible designs. If a minimum of 2 pools per subpopulation is required, then the possible numbers of samples per pool and the range in the number of pools per subpopulation are as follows: samples per pool [minimum, maximum number of pools per subpopulation] = 2 [18, 72], 3 [12, 48], 4 [9, 36], 5 [7, 29], 6 [6, 24], 7 [5, 20], 8 [4, 18], 9 [4, 16], 10 [3, 14], 11 [3, 13], 12 [3, 12], 13 [2, 11], 14 [2, 10], 16 [2, 9], 18 [2, 8].

To simulate both individual-sample data and pooled-sample data, it was necessary to obtain separate estimates of the analytical standard deviation ($\sigma_{\text{Analytic}:(x)}$) and the among-subject standard deviation ($\sigma_{\text{Among-subject}:\ln(X)}$). To estimate $\sigma_{\text{Analytic}:(x)}$, I used repeat measurements of 10 PCB 153 calibration standards and of 7 p,p’-DDE calibration standards. For both chemicals, there was an approximately linear relationship between the natural logarithm of $\hat{\sigma}_{\text{Analytic}:(x)}$ and the natural logarithm of the corresponding calibration standard mean.
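The [minimum, maximum] pool counts listed earlier in this section appear to follow from integer division of the smallest (36) and largest (145) subpopulation sample sizes by the number of samples per pool. The short sketch below reproduces that arithmetic; the function names are illustrative and are not taken from the original analysis.

```python
# Smallest and largest subpopulation sample sizes reported in the text.
N_MIN, N_MAX = 36, 145
MIN_POOLS = 2  # design requirement: at least 2 pools per subpopulation


def pool_count_range(samples_per_pool, n_min=N_MIN, n_max=N_MAX):
    """Return (fewest, most) complete pools that a subpopulation can form."""
    return n_min // samples_per_pool, n_max // samples_per_pool


def feasible_pool_sizes(n_min=N_MIN, min_pools=MIN_POOLS):
    """Pool sizes for which the smallest subpopulation still yields min_pools pools."""
    return [k for k in range(2, n_min + 1) if n_min // k >= min_pools]


if __name__ == "__main__":
    for k in feasible_pool_sizes():
        fewest, most = pool_count_range(k)
        print(f"{k:2d} samples per pool: [{fewest}, {most}] pools per subpopulation")
```

Run as written, the enumeration also includes pool sizes 15 and 17, which the list in the text omits; these give the same pool-count ranges as sizes 16 and 18, respectively. The sketch does not attempt to explain why the text lists 16 and 18 rather than 15 and 17.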
I used least squares regression to model the relationship between the log analytical standard deviation and the log calibration standard mean. The total standard deviation ($\sigma_{\text{Total}:\ln(X)}$) was estimated from actual log-transformed data from individual subjects measured during NHANES 2003–2004. This estimate of $\sigma_{\text{Total}:\ln(X)}$ was computed separately for each of the 24 demographic groups described in Table 1 (main text). An estimate of the among-subject standard deviation ($\sigma_{\text{Among-subject}:\ln(X)}$) was calculated as follows:

\[
\hat{\sigma}_{\text{Among-subject}:\ln(X)} = \left( \hat{\sigma}^{2}_{\text{Total}:\ln(X)} - \hat{\sigma}^{2}_{\text{Analytic}:\ln(X)} \right)^{1/2} \tag{SI12}
\]

The value of $\hat{\sigma}^{2}_{\text{Analytic}:\ln(X)}$ used in Eq. SI12 was obtained from the following relationship between $\hat{\sigma}^{2}_{\ln(X)}$ and $\hat{\sigma}^{2}_{X}$, which can be derived from the equations given on page 58 of Helsel (2005):

\[
\hat{\sigma}^{2}_{\ln(X)} = \log\left\{ \left( \hat{\mu}^{2}_{X} + \hat{\sigma}^{2}_{X} \right) / \hat{\mu}^{2}_{X} \right\},
\]

where $\hat{\sigma}^{2}_{X}$ and $\hat{\mu}^{2}_{X}$ are estimates of the variance and the square of the mean, respectively, of non-transformed data and $\hat{\sigma}^{2}_{\ln(X)}$ is an estimate of the variance of log-transformed data.

The simulation of individual- and pooled-sample data started with the means of log-transformed individual sample values from NHANES 2003–2004 for each of the 24 subpopulations described in Table 1 (main text). These means were treated as the “true” subpopulation means of log-transformed individual samples. Then, separately for each of the 15 possible scenarios (i.e., combinations of numbers of samples per pool and pools per subpopulation), two sets of simulated individual values were generated by adding Gaussian random errors to each of the 24 “true” means of log-transformed results: one set with the among-subject variance only, and one with both the among-subject and analytic variances described in the previous paragraphs. Each result in these two sets was then exponentiated to represent, respectively, individual-sample values with among-subject variability only and individual-sample measurements with both among-subject and analytic variability.
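The variance calculations and the generation of individual values described above can be sketched as follows. This is a minimal illustration and not the original simulation code. It assumes a single constant analytic standard deviation on the log scale (in the text it depends on concentration through the fitted regression), and it assumes that each measurement is built from the corresponding value by adding a second, independent Gaussian error. All function names and the numbers in the usage example are hypothetical.

```python
import math
import random


def log_scale_variance(mean_x, sd_x):
    """Variance of ln(X) implied by the mean and SD of untransformed X."""
    return math.log((mean_x ** 2 + sd_x ** 2) / mean_x ** 2)


def among_subject_sd(total_sd_ln, analytic_sd_ln):
    """Eq. SI12: subtract analytic variance from total variance on the log scale."""
    diff = total_sd_ln ** 2 - analytic_sd_ln ** 2
    if diff < 0:
        raise ValueError("analytic variance exceeds total variance")
    return math.sqrt(diff)


def simulate_individuals(true_log_mean, among_sd, analytic_sd_ln, n, rng):
    """Return (values, measurements) on the original concentration scale.

    values:       among-subject variability only
    measurements: among-subject plus analytic variability (assumed additive
                  and independent on the log scale)
    """
    values, measurements = [], []
    for _ in range(n):
        log_value = true_log_mean + rng.gauss(0.0, among_sd)
        log_measurement = log_value + rng.gauss(0.0, analytic_sd_ln)
        values.append(math.exp(log_value))
        measurements.append(math.exp(log_measurement))
    return values, measurements


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    analytic_sd_ln = math.sqrt(log_scale_variance(mean_x=50.0, sd_x=5.0))
    sd_among = among_subject_sd(total_sd_ln=0.9, analytic_sd_ln=analytic_sd_ln)
    rng = random.Random(2005)
    values, measurements = simulate_individuals(math.log(20.0), sd_among,
                                                analytic_sd_ln, 7, rng)
    print(values)
    print(measurements)
```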
This process was repeated 5,000 times for every simulated individual-sample value or measurement in every pool of each of the 15 scenarios. For example, the scenario with 7 samples per pool and a range from 5 to 20 pools per subpopulation resulted in a pooled-sample design with 280 pools of 7 samples each, for a total of 1,960 samples. Thus, 9,800,000 (= 1,960 × 5,000) individual-sample values with only among-subject error and 9,800,000 individual-sample measurements with both among-subject and analytic error were generated for this scenario.

The simulated individual-sample values with only the among-subject variance were summed within each pool in each subpopulation and divided by the number of samples per pool, giving arithmetic means of individual samples that represent simulated pooled-sample values. Gaussian random errors based on the appropriate analytical variance were then added to these arithmetic means to simulate pooled-sample measurements. Thus, for the scenario with 7 samples per pool and a range from 5 to 20 pools per subpopulation, the original 9,800,000 values were reduced to 1,400,000 (= 280 × 5,000) pooled-sample measurements. At this point in the simulation for this scenario, there were 5,000 replicates of 280 simulated pooled-sample measurements and 5,000 replicates of 1,960 individual-sample measurements from the same 24 subpopulations.

Results and Discussion

In Figures SI1 and SI2, I demonstrate how well bias-corrected pooled-sample geometric mean estimates compare with the corresponding parametric and non-parametric individual-sample estimates under a pooled-sample design with 7 samples per pool and from 5 to 20 pools per demographic group. These figures also display the pooled-sample results that would be obtained if no bias correction had been made.
To facilitate the comparisons, I present the expected 5th, 50th, and 95th percentiles of percent biases for simulated PCB 153 (Figure SI1) and p,p’-DDE (Figure SI2) data for each of the 24 subpopulations described in Table 1 (main text). I calculate each percent bias as 100 × (geometric mean estimate − true geometric mean) / true geometric mean. The top frame in each figure displays this range in percent biases (i.e., the 5th, 50th, and 95th percentiles of geometric mean biases) for parametric estimates using individual samples. The second-from-top frame in each figure displays this range for non-parametric estimates using individual samples. Parametric estimates were computed from estimates of the mean of the natural logarithm of the individual sample results. Non-parametric estimates were computed from the empirical 50th percentiles of the individual sample results. The second-from-bottom frame in each figure displays the range in percent biases for my bias-corrected estimates using pooled samples. The bottom frame in each figure displays the range in percent biases for estimates using pooled samples without a bias correction. Even though the percent bias exceeded 210% for the uncorrected p,p’-DDE estimates (Figure SI2, bottom frame), I limited the vertical axis of each frame to an upper bias of 100% so that the range in biases of the corrected pooled-sample estimates could be more readily compared with those of the individual-sample estimates. It is clear from the bottom frame in both figures that uncorrected pooled-sample estimates from log-normally distributed results are extremely biased and should not be used.
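The upward bias of uncorrected pooled-sample geometric means can be illustrated with a small simulation. The sketch below is not the original simulation code: it uses a single hypothetical subpopulation, omits analytic error so that only the effect of pooling is visible, and uses made-up parameter values. Each pooled value is the arithmetic mean of log-normal individual values, so the geometric mean of pooled values tends to exceed the true geometric mean.

```python
import math
import random
import statistics


def percent_bias(estimate, true_value):
    """100 x (estimate - true value) / true value."""
    return 100.0 * (estimate - true_value) / true_value


def simulate_pooled_values(true_log_mean, among_sd, samples_per_pool, n_pools, rng):
    """Arithmetic mean of log-normal individual values within each pool."""
    pools = []
    for _ in range(n_pools):
        members = [math.exp(rng.gauss(true_log_mean, among_sd))
                   for _ in range(samples_per_pool)]
        pools.append(sum(members) / samples_per_pool)
    return pools


def uncorrected_geometric_mean(pooled_values):
    """Geometric mean of pooled values with no bias correction applied."""
    return math.exp(statistics.fmean([math.log(v) for v in pooled_values]))


def median_uncorrected_bias(true_log_mean, among_sd, samples_per_pool,
                            n_pools, n_reps, seed=0):
    """Median percent bias of the uncorrected estimate over n_reps replicates."""
    rng = random.Random(seed)
    true_gm = math.exp(true_log_mean)
    biases = []
    for _ in range(n_reps):
        pools = simulate_pooled_values(true_log_mean, among_sd,
                                       samples_per_pool, n_pools, rng)
        biases.append(percent_bias(uncorrected_geometric_mean(pools), true_gm))
    return statistics.median(biases)


if __name__ == "__main__":
    # Hypothetical subpopulation: true geometric mean 20, log-scale SD 1.0,
    # 7 samples per pool, 20 pools, 500 replicates.
    print(median_uncorrected_bias(math.log(20.0), 1.0, 7, 20, 500))
```

With the among-subject standard deviation set to zero, every individual value equals the true geometric mean and the bias disappears, which is a convenient sanity check on the sketch.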
By contrast, the second frame from the bottom in both figures shows that bias-corrected pooled-sample estimates of geometric means for compounds with uniform among-subject variance across all concentrations are expected to have a slight negative bias (Figure SI1), and that estimates of geometric means for compounds with among-subject variance that increases with concentration are expected to have biases that range from slightly negative to slightly positive as the true concentration increases from low to high (Figure SI2).

My simulations do not address the potential error associated with ignoring a complex survey design, such as the one employed in NHANES. Even though it is possible to obtain unbiased estimates with pooled samples by using volumetric weighting to incorporate the NHANES sampling weights, it is not possible to incorporate the stratification and clustering aspects of the survey design. The accuracy of standard errors and confidence limits may therefore be compromised by the use of pooled samples, and it may be advisable to inflate standard errors based on an estimate of the average design effect of the survey. One advantage of pooled samples over individual samples, however, is that geometric means and percentiles can be estimated within demographic sub-groups (e.g., a percentile estimate for non-Hispanic Black males over 60 years of age), whereas individual-sample measurements usually require estimation across demographic sub-groups to attain sufficient sample sizes.

References

Helsel DR. Nondetects and Data Analysis (John Wiley & Sons, Inc., New Jersey, 2005), p 58.
NCEH. Third National Report on Human Exposure to Environmental Chemicals (NCEH Publication 05-0570, 2005).

Figure SI1. Simulation results for PCB 153 using an unbalanced pooled-sample design consisting of 7 samples in each of 5 to 20 replicate pools for each of 24 demographic groups described in Table 1 (main text).
The following are plotted versus the true geometric mean: percent bias of parametric geometric mean estimates from individual samples constituting pools (top frame), percent bias of non-parametric geometric mean estimates from individual samples constituting pools (second frame from top), percent bias of geometric mean estimates from bias-corrected pooled samples (second frame from bottom), and percent bias of geometric mean estimates from uncorrected pooled samples (bottom frame). Horizontal cross marks represent median percent biases from 5,000 simulations. Vertical lines extending from the 5th to the 95th percentile are used to show the observed range of percent biases from all simulations.

Figure SI2. Simulation results for p,p’-DDE using an unbalanced pooled-sample design consisting of 7 samples in each of 5 to 20 replicate pools for each of 24 demographic groups described in Table 1 (main text). The following are plotted versus the true geometric mean: percent bias of parametric geometric mean estimates from individual samples constituting pools (top frame), percent bias of non-parametric geometric mean estimates from individual samples constituting pools (second frame from top), percent bias of geometric mean estimates from bias-corrected pooled samples (second frame from bottom), and percent bias of geometric mean estimates from uncorrected pooled samples (bottom frame). Horizontal cross marks represent median percent biases from 5,000 simulations. Vertical lines extending from the 5th to the 95th percentile are used to show the observed range of percent biases from all simulations.