Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bias Adjustment in Data Mining: Local Control Analysis of Radon and Ozone S. Stanley Young Robert Obenchain Goran Krstic Abstract Local control analysis of radon and ozone S. Stanley Young, CGStat LLC Robert L. Obenchain, Risk Benefit Statistics LLC Goran Krstic, Fraser Health Authority Large (observational) data sets typically present research opportunities, but also problems that can lead to false claims. In Big Data, the standard error of an effect estimate goes to zero as sample size increases, so even small biases can lead to declared (but false) claims. In addition, the average of treatment can be almost meaningless when there are interactions with confounders that create local variation in effect-sizes. Data miners need statistical methods that can deal simply and efficiently with these sources of bias. Here, we demonstrate use of a JMP add-in, Moving Median, and a new JMP platform, Local Control, for the analysis of two data sets. Our first case study illustrate reduction of bias in an environmental epidemiology data set. Our second study uses Local Control on a time series air quality example. By detecting interactions, data miners can produce more realistic and more relevant analyses that reduce the bias typically implied by the variety and heterogeneity of Big Data. 2 Plan for Radon Data Set 1. Radon background 2. Local Control analysis strategy 3. Analysis of 2881US counties 4. Review of Local Control Results 5. Summary 3 Figure 1. Spatial distribution of obesity, lung cancer, radon and smoking. Obesity Lung Cancer Source: http://www.maxmasnick.com/2011/11/15/obesity_by_county/ Ever Smoking 4 EPA cited meta analysis (1) 2005 5 EPA cited meta analysis (2) 2004 6 Local Control Analysis Process Large observational data set A vs B comparison The steps 1. 2. 3. 4. Aggregate – cluster, LTDs Confirm – randomization test Explore – sensitivity analysis Reveal – modeling, MLR, RP 7 Step 0: Select clustering variables 8 Local Control Analysis Radon “Most Typical” micro-aggregation of 2,881 US Counties on 3 primary X-confounders 1. Age Over 65 % 2. Obesity % 3. Currently Smoke % Y-outcome = Lung Cancer Mortality. Binary Treatment Indicator: Radon High ( > 2.1 pCi/L ) vs. Low 9 Variables selected for clustering. NB: Regression coefficient for radon is NEGATIVE. 10 Local Control Add-In Russ Wolfinger/Bob Obenchain 11 Step 1: Clustering 12 Within cluster statistics 1. Local Treatment Difference, LTD 2. Local Linear Regression (slope and intercept) 3. Local Survival Analysis (Failure times) 4. Etc. 13 Local Treatment Difference at the centroid of an informative cluster E[ (Y|t=1) - (Y|t=0)|X ] Single df comparison Given X, local effect. “Fair Treatment Comparison” 14 Aggregate Cycle Observed LTD Distribution (49 Informative Clusters) 15 Step 2: Confirm clustering matters Random Distribution Observed Distribution Observed LTD Distribution 16 Step 2: Confirm Cycle Observed LTD empirical Cumulative Distribution Function (CDF) “LTD-like” Random Permutation CDF 17 Step 3: Explore Cycles Tried using “Complete Linkage” as well as “Fast Ward” Tried using of 3 out of 5 potential X-confounders for clustering: Age Over 65 % Obesity % Currently Smoke % Ever Smoke % Median Household Income ($1,000s) Tried using between 50 and 100 clusters. 18 Reveal Cycle Fitted “Supervised Learning” Models for predicting observed LTDs: JMP 11 “Modeling” Platform -> “Partition” option single Tree (7 terminal nodes) Bootstrap Forest – Model Average of 100 Trees JMP “Fit Model” Platform – Multi-Variable Regression (Degree at most 2) Tried using 6 potential X-confounders for predicting observed LTDs: Age Over 65 % Obesity % Currently Smoke % Ever Smoke % Median Household Income ($1,000s) Radon ( or Ln[Rn] ) Level (as either ordinal or continuous measures) 19 Tree 20 Method Two (Bootstrap Forest), R^2 =0.80 21 Linear Regression 22 Regression Results 23 Partial Correlations 24 London Smog, 1952 25 EPA and ozone Bad 0.075 ppm Good ???? 0.20 ppm Ozone Generators that are Sold as Air Cleaners Reviewed by EPA 26 LC London Ozone 0. Variable selection 1. Cluster to 34/35 2. LR within cluster 3. Append to intercept and slope to data set 4. P-value plot 5. Histograms 6. RP on intercepts and slopes London time Series (outliers removed) Daily Deaths Smoothed Subtract Moving Median 21-5 28 Step 0: Variable selection 29 Time Series Smoother Add-In Paul Fogel Paris 30 Possible Ozone Effect 31 Step 1: Cluster 32 LR within clusters, Intercept 33 LR Within clusters Slope No evidence that daily deaths are affected by current ozone. 34 Predict Intercept 35 Predict Slope 36 P-value plot for intercepts and slopes ozone - 1 37 Summary of ozone Local Control Analysis 1. NB: Outlier time period was removed. 2. Ozone on day of death had no effect, LR. 3. Ozone on day -1, Intercepts and slopes within clusters were consistent with random, with one exception of a strong negative effect, pvalue plots. 4. Deaths on day -1, and temperature were the most important predictors for death deviation from time series. 38 Contact Information S. Stanley Young [email protected] Robert Obenchain [email protected] Goran Krstic [email protected] 39 Outlier 40 What is going on? 41 Plan for Radon Data Set 1. Radon background 2. Local Control analysis strategy 3. Analysis of ~3000 US counties 4. Review of Local Control Results 5. Summary 42 Step Zero: Selecting Variables 43 Puzzle, parts and fit 44