Diabetes: A Case Study with SAS Enterprise Miner 5.3

Diabetes: A Case Study
with SAS© Enterprise Miner
Southern Methodist University
Dedman College
Dallas, Texas 75275
Subhojit Das | Gregory Johnson | Jacob Williamson
Faculty Advisor: Dr. Thomas B. Fomby
SAS© Data Mining Shootout 2010
Problem Statement and Data Stratification
Problem Statement


The incidence of diabetes in America continues to rise, increasing health care costs for both individuals and the government.
Based on the data set provided, USHETH has tasked us with the following:
– Determine the incidence rate of diabetes in America by age and gender.
– Estimate the medical costs associated with the disease from an individual and government perspective (Medicare and Medicaid).
– Derive the potential savings that may result from a treatment program that would reduce the body mass index (BMI) of participants in the treatment group.
– Lastly, assess the data set’s representativeness as it pertains to extrapolating these trends across the United States.
Data Stratification - I

The sample data set’s contents are as follows:
– 50,788 observations
– 45 variables
  – 37 class variables (various demographic, behavioral, and medical attributes)
  – 8 interval variables (BMI, Age, Number of Visits to Doctor, etc.)
Initial analysis of the response data indicated that not all questions were applicable to each participant.
– These responses were coded as -1 in the data set, and these “inapplicables” could be handled by further stratifying by age and gender.
– For example, mammograms, breast exams, and pap smear tests are only applicable to female respondents, while smoking status may only be applicable to “adult” respondents (i.e., individuals over 16 or 18 years of age).
– Because of these data phenomena, it was only natural to segment the data into adults and children, and then to divide the adult population further into men and women.
Data Stratification - II
[Figure: Last PSA Value Distribution by Gender. Bar chart of observation counts (Obs Count, 0 to 30,000) for LAST_PSA values of -1 versus not -1, shown for Male and Female.]

% of Participants within each Age Group: Currently Smoke?
Age Group    NA      Yes     No
0 to 18      98%     0%      2%
19 to 35     16%     19%     65%
36 to 54     11%     23%     66%
55 to 71     9%      18%     73%
72 plus      9%      9%      82%

% of Participants within each Age Group: Income Bands
Age Group    <= 0    > 0
0 to 18      91%     9%
19 to 35     11%     89%
36 to 54     6%      94%
55 to 71     6%      94%
72 plus      3%      97%
Data Stratification - III



Additional analysis of the data reveals inconsistencies in the age at which respondents are treated as adults (i.e., some questions are cut off at age 16, while others are cut off at age 18).
Since the BMI threshold varies linearly with age up to age 20, age 20 was chosen as the cutoff between children and adults.
In conclusion, three stratification groups were chosen for model execution:
– Male (all adult variables excluding female-related variables)
– Female (all adult variables excluding PSA)
– Children (excludes variables answered by participants 18 and above)
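The three-way split described above can be sketched as follows. This is a minimal illustration, not the team's SAS Enterprise Miner setup: the adult cutoff of 20 is taken from the BMI remark, and the field names other than LAST_PSA (which appears in the deck) are hypothetical.

```python
def assign_stratum(age, sex, adult_age=20):
    """Return 'Children', 'Male', or 'Female' for one participant.

    adult_age=20 is an assumption based on the BMI-threshold remark.
    """
    if age < adult_age:
        return "Children"
    return "Male" if sex == "M" else "Female"


def drop_inapplicable(record, excluded_fields):
    """Remove fields that do not apply to a stratum (coded -1 in the data)."""
    return {k: v for k, v in record.items() if k not in excluded_fields}


# Hypothetical records; MAMMOGRAM is an illustrative field name.
people = [
    {"age": 8, "sex": "F", "LAST_PSA": -1, "MAMMOGRAM": -1},
    {"age": 45, "sex": "M", "LAST_PSA": 2, "MAMMOGRAM": -1},
    {"age": 61, "sex": "F", "LAST_PSA": -1, "MAMMOGRAM": 1},
]
strata = [assign_stratum(p["age"], p["sex"]) for p in people]
male_view = drop_inapplicable(people[1], {"MAMMOGRAM"})
```

Dropping the inapplicable fields per stratum removes the -1 codes before modeling, so they are never read as real values.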
3 Stratification Groups
Modeling
Modeling of Diabetes - I


Diabetes diagnosis is treated as a binary target variable.
Due to the low incidence of diabetes (7% for the male and female groups), oversampling of the positive diabetes observations is needed in order to build a better predictive model.
– For the training set, the diabetes incidence rate was increased to 50%, while the validation and test sets were maintained at the 7% diabetes incidence rate.
In the children group, the incidence rate of diabetes was even lower at 0.24%, or 40 individuals.
– The technique applied to the male/female groups was not a viable option for the children group.
– Instead, observations from the non-diabetes segment were randomly eliminated until a 2% diabetes incidence rate was obtained for the group.
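The two rebalancing steps can be sketched as below. The slides do not say how the 50% training rate was reached, so duplicating positives with replacement is an assumption; the toy row counts are hypothetical, chosen only to match the stated 7% and 0.24% rates.

```python
import random


def oversample_positives(rows, target_rate=0.5, seed=0):
    """Resample positive rows with replacement until they reach target_rate."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["diabetes"] == 1]
    neg = [r for r in rows if r["diabetes"] == 0]
    # Positives needed so that pos / (pos + neg) equals target_rate.
    needed = round(target_rate * len(neg) / (1.0 - target_rate))
    extra = [rng.choice(pos) for _ in range(max(0, needed - len(pos)))]
    return pos + extra + neg


def undersample_negatives(rows, target_rate=0.02, seed=0):
    """Randomly drop negative rows until positives reach target_rate."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["diabetes"] == 1]
    neg = [r for r in rows if r["diabetes"] == 0]
    keep = round(len(pos) * (1.0 - target_rate) / target_rate)
    return pos + rng.sample(neg, min(keep, len(neg)))


# Hypothetical adult training data at a 7% incidence rate.
adults = [{"diabetes": 1}] * 7 + [{"diabetes": 0}] * 93
train = oversample_positives(adults)
train_rate = sum(r["diabetes"] for r in train) / len(train)

# Hypothetical children data: 40 positives at roughly 0.24% incidence.
children = [{"diabetes": 1}] * 40 + [{"diabetes": 0}] * 16627
reduced = undersample_negatives(children)
reduced_rate = sum(r["diabetes"] for r in reduced) / len(reduced)
```

Only the training partition would be rebalanced this way; the validation and test partitions keep the original incidence rate, as the slide states.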
Modeling of Diabetes - II


Four model frameworks were tested for each of the three groups: Logit, Decision Tree, Neural Network, and Ensemble (average of the previous three models).
The average squared errors of these models are detailed in the table below (* denotes the selected model, highlighted in yellow on the original slide):
Average Squared Error
            Logit       Decision Tree   Neural Network   Ensemble
Male        0.18008     0.17971         0.18492          0.17008*
Female      0.17491     0.17736         0.17325*         0.24506
Children    0.018968    0.015072*       0.019556         0.016679
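As a rough illustration of the comparison above, the sketch below averages three models' predicted probabilities into an ensemble and scores each model by average squared error. The probabilities are hypothetical, and Enterprise Miner's own Ensemble node may differ in detail.

```python
def average_squared_error(y_true, p_pred):
    """Mean squared difference between the 0/1 target and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def ensemble_average(*model_probs):
    """Average predicted probabilities across models, observation by observation."""
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Hypothetical targets and predicted probabilities for four observations.
y = [1, 0, 0, 1]
logit = [0.8, 0.3, 0.2, 0.6]
tree = [0.7, 0.1, 0.4, 0.5]
nnet = [0.9, 0.2, 0.3, 0.4]
ens = ensemble_average(logit, tree, nnet)

models = {"Logit": logit, "Decision Tree": tree, "Neural Network": nnet, "Ensemble": ens}
ase = {name: average_squared_error(y, p) for name, p in models.items()}
best = min(ase, key=ase.get)  # model with the lowest average squared error
```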
Modeling of Total Expenditure - I

Since Total Expenditure is a continuous variable, it can be partitioned in a more traditional manner:
– Training 50%, Validation 30%, Test 20%
The diabetes diagnosis variable switches from being a target variable to an input variable in the total expenditure model.
– Since diabetes is associated with various health conditions (heart, eye), the effect of diabetes in the total expenditure equation may not be purely additive.
– Thus, it is imperative that interaction and squared terms, such as diabetes crossed with heart disease, be introduced to model the nonlinearity in the data.
Recalling the problem statement objective of determining the effects of a 10% BMI reduction on health care costs, it was important to keep the model as parsimonious as possible and to exclude variables that would be highly correlated with BMI.
In the children group, a dummy variable was used to treat the BMI effect differently for children age 5 or under versus children over age 5.
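To show why an interaction term makes the diabetes effect non-additive, here is a small sketch. The regressor set and every coefficient are hypothetical; they are not estimates from the team's model.

```python
def design_row(diabetes, heart_disease, bmi):
    """Build regressors with an interaction term and a squared term."""
    return {
        "intercept": 1.0,
        "diabetes": diabetes,
        "heart_disease": heart_disease,
        "diabetes_x_heart": diabetes * heart_disease,  # interaction term
        "bmi": bmi,
        "bmi_sq": bmi ** 2,  # squared term
    }


def predict(coefs, row):
    """Linear prediction: sum of coefficient times regressor."""
    return sum(coefs[name] * value for name, value in row.items())


# Hypothetical coefficients, for illustration only.
coefs = {
    "intercept": 500.0,
    "diabetes": 1500.0,
    "heart_disease": 2000.0,
    "diabetes_x_heart": 1000.0,
    "bmi": 20.0,
    "bmi_sq": 0.5,
}


def diabetes_effect(heart_disease, bmi=30.0):
    """Predicted expenditure with diabetes minus without, all else equal."""
    with_d = predict(coefs, design_row(1, heart_disease, bmi))
    without_d = predict(coefs, design_row(0, heart_disease, bmi))
    return with_d - without_d


effect_no_heart = diabetes_effect(0)
effect_heart = diabetes_effect(1)
```

With the interaction term present, the predicted cost of diabetes is larger for a person who also has heart disease, which a purely additive model could not express.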
Modeling of Total Expenditure - II



Because expenditures cannot be negative, the model truncates negative predicted values to 0.
Once again, four model frameworks were tested for each of the three groups: Regression, Decision Tree, Neural Network, and Ensemble (average of the previous three models).
The average squared errors of these models are detailed in the table below (* denotes the selected model, highlighted in yellow on the original slide):
Average Squared Error
            Regression     Decision Tree   Neural Network   Ensemble
Male        40114725.47    41520275.20     39882430.77      39226403.10*
Female      57257862.33    58495994.70     57242916.66*     58917874.93
Children    32041994.47    11090024.83*    31046695.72      20103830.53
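The sketch below illustrates two steps from this slide: truncating negative predictions at zero, and picking the framework with the lowest average squared error. The error values are copied from the table above; choosing by lowest error is an assumption, though it agrees with the selections named in the conclusions.

```python
def truncate_at_zero(predictions):
    """Replace negative predicted expenditures with 0."""
    return [max(0.0, p) for p in predictions]


# Average squared errors copied from the slide's table.
ase_table = {
    "Male": {
        "Regression": 40114725.47,
        "Decision Tree": 41520275.20,
        "Neural Network": 39882430.77,
        "Ensemble": 39226403.10,
    },
    "Female": {
        "Regression": 57257862.33,
        "Decision Tree": 58495994.70,
        "Neural Network": 57242916.66,
        "Ensemble": 58917874.93,
    },
    "Children": {
        "Regression": 32041994.47,
        "Decision Tree": 11090024.83,
        "Neural Network": 31046695.72,
        "Ensemble": 20103830.53,
    },
}

# Assumed selection rule: lowest average squared error per group.
selected = {group: min(scores, key=scores.get) for group, scores in ase_table.items()}

# Hypothetical raw predictions, including one negative value.
clipped = truncate_at_zero([-120.5, 0.0, 843.2])
```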
Treatment Group


According to the problem statement, USHETH is interested in knowing how much savings can be realized from a treatment program that reduces the BMI of selected candidates by 10%.
This treatment group is defined as follows:
– BMI greater than 25 for adults
– BMI greater than 17 for children 5 and under
– BMI greater than the linear path of 17 to 25 BMI for children between the ages of 6 and 20 (see graph to the right)
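The treatment-group rule can be sketched as a small function. The graph did not survive extraction, so the endpoints of the linear path (BMI 17 at age 5 rising to BMI 25 at age 20) are an assumption based on the bullet text.

```python
def bmi_threshold(age):
    """Treatment-group BMI cutoff by age.

    Assumption: the linear path runs from BMI 17 at age 5 to BMI 25 at age 20.
    """
    if age <= 5:
        return 17.0
    if age >= 20:
        return 25.0
    return 17.0 + (25.0 - 17.0) * (age - 5) / (20 - 5)


def in_treatment_group(age, bmi):
    """True when the participant's BMI exceeds the cutoff for their age."""
    return bmi > bmi_threshold(age)


def treated_bmi(age, bmi, reduction=0.10):
    """Apply the 10% BMI reduction only to treatment-group members."""
    return bmi * (1.0 - reduction) if in_treatment_group(age, bmi) else bmi
```

For example, under these assumptions an adult with a BMI of 30 would be treated and modeled at a BMI of 27, while an adult with a BMI of 24 would be left unchanged.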
Expected Savings - I


The reduction in BMI for the treatment group impacts both the diabetes model and the expenditure model.
Framework for Calculating the Expected Savings:
– Diabetes Model: predict the probability of being diagnosed with diabetes using the original BMI as well as the reduced BMI. This gives:
   Probability of being diagnosed with diabetes before BMI treatment.
   Probability of being diagnosed with diabetes after BMI treatment.
Expected Savings - II
– Expenditure Model: predict the expenditure of each person under the assumption that they have diabetes as well as under the assumption that they do not. We do this for each person twice, once using the original BMI and a second time using the reduced BMI. This gives:
   Total Expenditure before BMI treatment under the assumption that the individual has diabetes.
   Total Expenditure before BMI treatment under the assumption that the individual does not have diabetes.
   Total Expenditure after BMI treatment under the assumption that the individual has diabetes.
   Total Expenditure after BMI treatment under the assumption that the individual does not have diabetes.
Expected Savings - III

Mathematically, this is equivalent to:
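The equation on this slide did not survive extraction. One reading that is consistent with the two probabilities and four expenditure predictions defined on the preceding slides is a probability-weighted expected expenditure before and after treatment. The symbols below (p for the predicted probability of diabetes, T^D and T^N for predicted total expenditure with and without diabetes) are our own notation and may not match the authors' original formula.

```latex
\begin{aligned}
E_{\text{before}} &= p_{\text{before}}\, T^{D}_{\text{before}} + (1 - p_{\text{before}})\, T^{N}_{\text{before}} \\
E_{\text{after}}  &= p_{\text{after}}\, T^{D}_{\text{after}} + (1 - p_{\text{after}})\, T^{N}_{\text{after}} \\
\text{Expected Savings} &= E_{\text{before}} - E_{\text{after}}
\end{aligned}
```

Under this reading, each person's expected savings is the drop in expected expenditure once the reduced BMI is fed through both models, and the sample total is the sum over the treatment group.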
Expected Savings - IV

The framework described on the prior slide was applied to all groups (Male, Female, and Children), and the results of this exercise are displayed below:
Expected Sample Savings
            Expected Total Savings   Maximum Expected Savings   Expected Average Savings
Male        $825,290.82              $905.68                    $77.59
Female      $1,080,604.93            $1,804.71                  $104.08
Children    $21,151.96               $94.70                     $3.99
Total       $1,927,047.70
This total equates to a savings of 1.6% of the total expenditure in the sample. However, this only reflects the savings expected from the change in the probability of getting diabetes.
Savings associated with the prevention of diabetes would be higher.
Expected Savings - V

The cost of diabetes can be calculated simply as the predicted expenditure with diabetes minus the predicted expenditure without diabetes:
Cost of Diabetes = Predicted Total Expenditure (with diabetes) – Predicted Total Expenditure (without diabetes)
The average cost of diabetes and the average expected savings from the change in the probability of diabetes are compared in the chart below for all three groups.
Keep in mind that the savings generated in both of these scenarios are based on models and data at the participant’s current age.
Lifetime savings from these effects would likely be much higher (especially for the Children group).
[Figure: Expected Average Savings from Change in Probability of Diabetes versus Average Cost of Diabetes. Bar chart by group (Male, Female, Children) with two series, Expected Average Savings and Average Cost of Diabetes, on a dollar axis from $0 to $2,500. Data labels: $1,932, $237, $78, $104, $4, $39.]
Expected Savings - VI


Lastly, as part of the savings-related section of the problem statement, it was important to show the percentage of the savings that falls within Medicare and Medicaid expenses as a result of treating the treatment group. For Medicare, this was estimated with a linear regression as follows:
AMOUNT_PAID_MEDICARE = α + β × TOTALEXP + error_term
Similar to the expected savings framework, the predicted cost to Medicare for the treatment group before and after treatment would be:
PRED_AMOUNT_PAID_MEDICARE(before) = α + β × P_TOTALEXP(before)
PRED_AMOUNT_PAID_MEDICARE(after) = α + β × P_TOTALEXP(after)
Medicare savings can then be calculated by subtracting the predicted amount paid by Medicare after treatment from the predicted amount paid by Medicare before treatment.
SAVINGS(medicare) = PRED_AMOUNT_PAID_MEDICARE(before) – PRED_AMOUNT_PAID_MEDICARE(after)
Expected Savings - VII

Since P_AMOUNT_PAID_MEDICARE can be expressed as a function of P_TOTALEXP, this is mathematically equivalent to:
SAVINGS(medicare) = [α + β × P_TOTALEXP(before)] – [α + β × P_TOTALEXP(after)] = β × [P_TOTALEXP(before) – P_TOTALEXP(after)]
The amount of savings to Medicare expressed as a percentage of the total expenditure savings is captured as follows:
SAVINGS(medicare) / [P_TOTALEXP(before) – P_TOTALEXP(after)] = β
Therefore, the coefficient β represents the percentage of savings from total expenditure that will go towards Medicare. This same process can be performed to determine the percentage of savings from total expenditure that will go towards Medicaid.
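A minimal sketch of this argument: fit the simple regression by ordinary least squares, then confirm that the predicted Medicare savings equals β times the change in predicted total expenditure. The data points are hypothetical, and the closed-form fit stands in for whatever regression node the team used.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical (TOTALEXP, AMOUNT_PAID_MEDICARE) pairs.
total_exp = [1000.0, 2000.0, 3000.0, 4000.0]
medicare = [350.0, 600.0, 950.0, 1200.0]
alpha, beta = fit_simple_ols(total_exp, medicare)


def predict_medicare(p_totalexp):
    """Predicted Medicare amount for a given predicted total expenditure."""
    return alpha + beta * p_totalexp


# Hypothetical predicted total expenditure before and after treatment.
before, after = 2500.0, 2300.0
savings = predict_medicare(before) - predict_medicare(after)
```

The intercept cancels in the subtraction, which is why only β matters for the share of savings attributed to Medicare.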
Expected Savings - VIII

The Medicare and Medicaid estimates are as follows:
– 28.1% of the savings goes towards Medicare
– 30.2% of the savings goes towards Medicaid
If we combine these findings and scale the results to represent the entire US population, the total expected savings can be illustrated as follows:
Expected Population Savings
                       Expected Savings
Total United States    $11,749,980,707
Medicare               $3,299,159,583
Medicaid               $3,547,319,175
Overall, this preventative measure of a 10% BMI reduction for the specific treatment group could save the US government over $6.8 billion.
Confirmation of the representativeness of the sample is shown next.
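The arithmetic behind the $6.8 billion figure can be checked directly from the numbers on this slide. The variable names are ours.

```python
# Dollar figures copied from the Expected Population Savings table.
total_us = 11_749_980_707
medicare = 3_299_159_583
medicaid = 3_547_319_175

# Government savings is the sum of the Medicare and Medicaid portions.
government = medicare + medicaid

# Each program's share of the total expected savings.
medicare_share = medicare / total_us
medicaid_share = medicaid / total_us
```

The sum comes to $6,846,478,758, and the two shares round to the 28.1% and 30.2% quoted above.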
Representativeness of the Sample - I


Additional research was performed in order to verify that the sample data set provided was representative of the entire US population from an age and gender perspective.
A Chi-Square Goodness of Fit Test was used to compare the age/gender distributions of the sample data set with the population distribution of the country. The null and alternative hypotheses for the goodness of fit test are as follows:
– H0: The sample is representative of the population.
– H1: The sample is not representative of the population.
The test statistic is defined as follows:
Χ² = Σ (O_i – E_i)² / E_i, where O_i is the observed sample count in category i and E_i is the count expected under the population distribution.
Representativeness of the Sample - II

Additionally, the degrees of freedom (DF) are equal to:
DF = (number of age categories) – 1
Males:
– Χ²(males) = 7.6347785 with DF = 84
– P(Χ² > 7.6347785) ≈ 1
– Since the p-value is greater than the significance level of 0.05, we do not reject the null hypothesis.
Females:
– Χ²(females) = 9.030266 with DF = 84
– P(Χ² > 9.030266) ≈ 1
– Since the p-value is greater than the significance level of 0.05, we do not reject the null hypothesis.
Based on the results of the goodness of fit test, we find no evidence that the sample differs from the US population by age and gender, and we therefore treat it as representative.
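The test can be sketched with the standard library alone. The statistic follows the usual Pearson formula; the p-value uses the Wilson-Hilferty normal approximation to the chi-square tail, which is our substitution for an exact chi-square routine. The age-band counts and population shares are hypothetical.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_upper_tail(x, df):
    """Approximate P(chi-square with df degrees of freedom > x).

    Uses the Wilson-Hilferty normal approximation, not an exact routine.
    """
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical sample counts and population shares for four age bands.
sample_counts = [120, 95, 140, 45]
population_shares = [0.30, 0.25, 0.35, 0.10]
n = sum(sample_counts)
expected = [n * s for s in population_shares]
stat = chi_square_statistic(sample_counts, expected)
p_value = chi_square_upper_tail(stat, df=len(sample_counts) - 1)

# Tail probabilities for the statistics reported on the slide (DF = 84).
p_males = chi_square_upper_tail(7.6347785, 84)
p_females = chi_square_upper_tail(9.030266, 84)
```

With 84 degrees of freedom the chi-square distribution is centered near 84, so statistics below 10 sit far in the left tail and the upper-tail probability is essentially 1, in line with the slide.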
EM© Project Description and Diagram
[Figure: Complete Diagram for Male Model, with five numbered stages:]
1 – Data Definition, Visualization, and Characterization
2 – Diabetes Model
3 – Total Expenditure Model
4 – Treatment Group
5 – Expected Savings
Conclusions
Conclusion - I

Utilizing the data set provided by USHETH, models were constructed to predict the incidence rate of diabetes.
– For the Male model, the Ensemble model produced the most favorable results.
– For the Female model, the Neural Network model produced the most favorable results.
– For the Children model, the Decision Tree model produced the most favorable results.
In addition to the modeling of the incidence rate of diabetes, models were produced to predict the health care expenditures of the participants within the data set.
– For the Male model, the Ensemble model produced the most favorable results.
– For the Female model, the Neural Network model produced the most favorable results.
– For the Children model, the Decision Tree model produced the most favorable results.
Conclusion - II

The expected savings in the sample from a 10% BMI reduction in the treatment group came to approximately $1.9 million.

Extrapolating these results across the entire U.S. population suggests total expected savings of $11.7 billion, of which $6.8 billion would be savings to Medicare and Medicaid.

This reflects only contemporaneous savings. Long-term savings, especially for children avoiding a life with diabetes, could be substantially higher.
Thank You

Our thanks go to SAS / DOW / CMU for sponsoring this competition and offering us this opportunity to present our work and attend this conference.