Download Local Control Analysis of Radon

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Regression analysis wikipedia , lookup

Linear regression wikipedia , lookup

Data assimilation wikipedia , lookup

Transcript
Bias Adjustment in Data Mining:
Local Control Analysis
of Radon and Ozone
S. Stanley Young
Robert Obenchain
Goran Krstic
Abstract
Local control analysis of radon and ozone
S. Stanley Young, CGStat LLC
Robert L. Obenchain, Risk Benefit Statistics LLC
Goran Krstic, Fraser Health Authority
Large (observational) data sets typically present research opportunities, but also problems
that can lead to false claims. In Big Data, the standard error of an effect estimate goes to
zero as sample size increases, so even small biases can lead to declared (but false) claims.
In addition, the average of treatment can be almost meaningless when there are
interactions with confounders that create local variation in effect-sizes. Data miners need
statistical methods that can deal simply and efficiently with these sources of bias. Here, we
demonstrate use of a JMP add-in, Moving Median, and a new JMP platform, Local Control,
for the analysis of two data sets. Our first case study illustrate reduction of bias in an
environmental epidemiology data set. Our second study uses Local Control on a time series
air quality example. By detecting interactions, data miners can produce more realistic and
more relevant analyses that reduce the bias typically implied by the variety and
heterogeneity of Big Data.
2
Plan for Radon Data Set
1. Radon background
2. Local Control analysis strategy
3. Analysis of 2881US counties
4. Review of Local Control Results
5. Summary
3
Figure 1. Spatial distribution of obesity, lung cancer, radon and smoking.
Obesity
Lung Cancer
Source: http://www.maxmasnick.com/2011/11/15/obesity_by_county/
Ever Smoking
4
EPA cited meta analysis (1)
2005
5
EPA cited meta analysis (2)
2004
6
Local Control Analysis Process
Large observational data set
A vs B comparison
The steps
1.
2.
3.
4.
Aggregate – cluster, LTDs
Confirm – randomization test
Explore – sensitivity analysis
Reveal – modeling, MLR, RP
7
Step 0: Select clustering variables
8
Local Control Analysis Radon
“Most Typical” micro-aggregation of 2,881 US Counties
on 3 primary X-confounders
1. Age Over 65 %
2. Obesity %
3. Currently Smoke %
Y-outcome = Lung Cancer Mortality.
Binary Treatment Indicator:
Radon High ( > 2.1 pCi/L ) vs. Low
9
Variables selected for clustering.
NB: Regression coefficient
for radon is NEGATIVE.
10
Local Control Add-In
Russ Wolfinger/Bob Obenchain
11
Step 1: Clustering
12
Within cluster statistics
1. Local Treatment Difference, LTD
2. Local Linear Regression (slope and intercept)
3. Local Survival Analysis (Failure times)
4. Etc.
13
Local Treatment Difference
at the centroid of an informative cluster
E[ (Y|t=1) - (Y|t=0)|X ]
Single df comparison
Given X, local effect.
“Fair Treatment Comparison”
14
Aggregate Cycle
Observed LTD Distribution
(49 Informative Clusters)
15
Step 2: Confirm clustering matters
Random
Distribution
Observed
Distribution
Observed LTD Distribution
16
Step 2: Confirm Cycle
Observed LTD empirical Cumulative Distribution Function (CDF)
“LTD-like” Random Permutation CDF
17
Step 3: Explore Cycles
Tried using “Complete Linkage” as well as “Fast Ward”
Tried using of 3 out of 5 potential X-confounders for clustering:
Age Over 65 %
Obesity %
Currently Smoke %
Ever Smoke %
Median Household Income ($1,000s)
Tried using between 50 and 100 clusters.
18
Reveal Cycle
Fitted “Supervised Learning” Models for predicting observed LTDs:
JMP 11 “Modeling” Platform -> “Partition” option
single Tree (7 terminal nodes)
Bootstrap Forest – Model Average of 100 Trees
JMP “Fit Model” Platform – Multi-Variable Regression (Degree at most 2)
Tried using 6 potential X-confounders for predicting observed LTDs:
Age Over 65 %
Obesity %
Currently Smoke %
Ever Smoke %
Median Household Income ($1,000s)
Radon ( or Ln[Rn] ) Level (as either ordinal or continuous measures)
19
Tree
20
Method Two (Bootstrap Forest), R^2 =0.80
21
Linear Regression
22
Regression Results
23
Partial Correlations
24
London Smog, 1952
25
EPA and ozone
Bad
0.075 ppm
Good ????
0.20 ppm
Ozone Generators that are
Sold as Air Cleaners
Reviewed by EPA
26
LC London Ozone
0. Variable selection
1. Cluster to 34/35
2. LR within cluster
3. Append to intercept and slope to data set
4. P-value plot
5. Histograms
6. RP on intercepts and slopes
London time Series (outliers removed)
Daily Deaths
Smoothed
Subtract Moving Median
21-5
28
Step 0: Variable selection
29
Time Series Smoother Add-In
Paul Fogel
Paris
30
Possible Ozone Effect
31
Step 1: Cluster
32
LR within clusters,
Intercept
33
LR Within clusters
Slope
No evidence that daily deaths are affected by current ozone.
34
Predict Intercept
35
Predict Slope
36
P-value plot for intercepts and
slopes ozone - 1
37
Summary of ozone
Local Control Analysis
1. NB: Outlier time period was removed.
2. Ozone on day of death had no effect, LR.
3. Ozone on day -1, Intercepts and slopes within
clusters were consistent with random, with
one exception of a strong negative effect, pvalue plots.
4. Deaths on day -1, and temperature were the
most important predictors for death deviation
from time series.
38
Contact Information
S. Stanley Young
[email protected]
Robert Obenchain
[email protected]
Goran Krstic
[email protected]
39
Outlier
40
What is going on?
41
Plan for Radon Data Set
1. Radon background
2. Local Control analysis strategy
3. Analysis of ~3000 US counties
4. Review of Local Control Results
5. Summary
42
Step Zero: Selecting Variables
43
Puzzle, parts and fit
44