Diabetes: A Case Study with SAS© Enterprise Miner
Southern Methodist University, Dedman College, Dallas, Texas 75275
Subhojit Das | Gregory Johnson | Jacob Williamson
Faculty Advisor: Dr. Thomas B. Fomby
SAS© Data Mining Shootout 2010

Problem Statement and Data Stratification

Problem Statement
The incidence of diabetes continues to rise in America, increasing health care costs for both individuals and the government. Based on the data set provided, USHETH has tasked us with the following:
– Determine the incidence rate of diabetes in America by age and gender.
– Estimate the medical costs associated with the disease from an individual and government perspective (Medicare and Medicaid).
– Derive the potential savings that may result from a treatment program that would reduce the body mass index (BMI) of participants in the treatment group.
– Lastly, assess the data set's representativeness as it pertains to extrapolating these trends across the United States.

Data Stratification - I
The sample data set's contents are as follows:
– 50,788 observations
– 45 variables
  – 37 class variables (various demographic, behavioral, and medical attributes)
  – 8 interval variables (BMI, Age, Number of Visits to Doctor, etc.)
Initial analysis of the response data indicated that not all questions were applicable to each participant.
– These responses were coded as -1 in the data set, and these "inapplicables" could be handled by further stratifying by age and gender.
– For example, mammograms, breast exams, and pap smear tests are only applicable to female respondents, while an individual's smoking status may only be applicable to "adult" respondents (i.e. individuals over 16 or 18 years of age).
– Because of these data phenomena, it was only natural to segment the data into adults and children, and then to divide the adult population further into men and women.
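The segmentation described above can be sketched in a few lines of Python. This is only an illustration, not the team's SAS Enterprise Miner flow: the field names AGE, SEX, and LAST_PSA as dictionary keys, the "M"/"F" coding, and the sample records are assumptions made for the example. The adult cutoff is left as a parameter; its default of 20 anticipates the cutoff adopted on a later slide, and whether age 20 itself counts as adult is not stated in the deck.

```python
from collections import defaultdict

INAPPLICABLE = -1  # the deck's code for "question not applicable"


def assign_stratum(record, adult_age=20):
    """Return 'Children', 'Male', or 'Female' for one survey record.

    `record` is a dict with hypothetical keys 'AGE' and 'SEX' ('M' or 'F').
    Treating ages below `adult_age` as children is an assumption.
    """
    if record["AGE"] < adult_age:
        return "Children"
    return "Male" if record["SEX"] == "M" else "Female"


def stratify(records, adult_age=20):
    """Group records into the three modeling strata."""
    groups = defaultdict(list)
    for rec in records:
        groups[assign_stratum(rec, adult_age)].append(rec)
    return dict(groups)


def applicable_share(records, variable):
    """Fraction of records whose answer to `variable` is not coded -1."""
    if not records:
        return 0.0
    answered = sum(1 for rec in records if rec[variable] != INAPPLICABLE)
    return answered / len(records)


# Made-up records, for illustration only.
sample = [
    {"AGE": 45, "SEX": "M", "LAST_PSA": 2},
    {"AGE": 52, "SEX": "F", "LAST_PSA": -1},
    {"AGE": 9, "SEX": "F", "LAST_PSA": -1},
    {"AGE": 63, "SEX": "M", "LAST_PSA": -1},
]
strata = stratify(sample)
psa_share_male = applicable_share(strata["Male"], "LAST_PSA")
psa_share_female = applicable_share(strata["Female"], "LAST_PSA")
```

Comparing `applicable_share` across strata is one way to spot variables, such as a PSA reading, that are inapplicable for an entire group and can be dropped from that group's model.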
Data Stratification - II
[Figure: Last PSA Value Distribution by Gender. Observation counts (axis 0 to 30,000) by LAST_PSA value (-1 versus not -1), shown separately for Male and Female.]

% of Participants within each Age Group: Currently Smoke?
Age Group    NA     Yes    No
0 to 18      98%    0%     2%
19 to 35     16%    19%    65%
36 to 54     11%    23%    66%
55 to 71     9%     18%    73%
72 plus      9%     9%     82%

% of Participants within each Age Group: Income Bands
Age Group    <= 0   > 0
0 to 18      91%    9%
19 to 35     11%    89%
36 to 54     6%     94%
55 to 71     6%     94%
72 plus      3%     97%

Data Stratification - III
Additional analysis of the data reveals inconsistencies in the age split between child and adult (some responses are cut off at age 16, while others are cut off at age 18). Since the BMI threshold varies linearly with age up to age 20, age 20 was used as the age segmentation factor. In conclusion, three stratification groups were chosen for model execution:
– Male (all adult variables excluding female-related variables)
– Female (all adult variables excluding PSA)
– Children (excludes variables answered by participants 18 and above)

3 Stratification Groups

Modeling

Modeling of Diabetes - I
Diabetes diagnosis is treated as a binary target variable. Due to the low incidence of diabetes (7% for the male and female groups), oversampling of the positive diabetes observations is needed in order to build a better predictive model.
– For the training set, the diabetes incidence rate was increased to 50%, while the validation and test sets were maintained at the 7% diabetes incidence rate.
In the children group, the incidence rate of diabetes was even lower at 0.24%, or 40 individuals.
– The technique applied on the male/female groups was not a viable option for the children group.
– Instead, observations from the non-diabetes segment were randomly eliminated until a 2% diabetes incidence rate was obtained for the group.

Modeling of Diabetes - II
Four model frameworks were tested for each of the three groups: Logit, Decision Tree, Neural Network, and Ensemble (average of the previous three models). The average squared errors of these models are listed in the table below (the selected model, highlighted in yellow on the original slide, is marked with *):

Average Squared Error
Group      Logit      Decision Tree   Neural Network   Ensemble
Male       0.18008    0.17971         0.18492          0.17008*
Female     0.17491    0.17736         0.17325*         0.24506
Children   0.018968   0.015072*       0.019556         0.016679

Modeling of Total Expenditure - I
Since Total Expenditure is a continuous variable, it can be partitioned in a more traditional manner:
– Training – 50%, Validation – 30%, Test – 20%
The diabetes diagnosis variable switches from being a target variable to an input variable in the total expenditure model.
Since diabetes is associated with various health conditions (heart, eye), the effect of diabetes in the total expenditure equation may not be simply additive. Thus, interaction and squared terms, such as diabetes crossed with heart disease, need to be introduced so the model can capture the nonlinearity in the data.
Recalling the problem statement objective of determining the effects of a 10% BMI reduction on healthcare costs, it was important to keep the model as parsimonious as possible and to exclude variables that would be highly correlated with BMI.
In the children group, a dummy variable was used to treat the BMI effect differently for children aged 5 or under versus children older than 5.

Modeling of Total Expenditure - II
Because expenditures cannot be negative, the model truncates negative predicted values to 0.
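As a rough illustration of three pieces mentioned so far (the ensemble as an average of three models, truncation of negative expenditure predictions, and comparison by average squared error), here is a minimal Python sketch. The function names and all numbers are made up for the example, the error measure is a plain mean of squared residuals (SAS Enterprise Miner's own calculation may differ in detail), and applying the truncation after averaging rather than to each model separately is an assumption.

```python
def ensemble_prediction(*model_predictions):
    """Average several models' predictions, observation by observation."""
    return [sum(preds) / len(preds) for preds in zip(*model_predictions)]


def truncate_at_zero(predictions):
    """Replace negative predicted expenditures with 0."""
    return [max(0.0, p) for p in predictions]


def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up expenditures and predictions, for illustration only.
actual = [0.0, 100.0, 900.0]
regression = [-60.0, 150.0, 800.0]
tree = [30.0, 100.0, 950.0]
neural_net = [-30.0, 110.0, 950.0]

raw_ensemble = ensemble_prediction(regression, tree, neural_net)
ensemble = truncate_at_zero(raw_ensemble)
ase = average_squared_error(actual, ensemble)
```

In this toy case the first observation's averaged prediction is negative and is raised to 0 before the error is computed, which is the effect the truncation rule is meant to have.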
Once again, four model frameworks were tested for each of the three groups: Regression, Decision Tree, Neural Network, and Ensemble (average of the previous three models). The average squared errors of these models are listed in the table below (the selected model, highlighted in yellow on the original slide, is marked with *):

Average Squared Error
Group      Regression     Decision Tree   Neural Network   Ensemble
Male       40114725.47    41520275.20     39882430.77      39226403.10*
Female     57257862.33    58495994.70     57242916.66*     58917874.93
Children   32041994.47    11090024.83*    31046695.72      20103830.53

Treatment Group
According to the problem statement, USHETH is interested in knowing how much can be saved by a treatment program that reduces the BMI of selected candidates by 10%. The treatment group is defined as follows:
– BMI greater than 25 for adults
– BMI greater than 17 for children 5 and under
– BMI greater than the linear path from 17 to 25 for children between the ages of 6 and 20
[Figure: BMI threshold by age, shown to the right of this list on the original slide.]

Expected Savings - I
The reduction in BMI for the treatment group affects both the diabetes model and the expenditure model.
Framework for Calculating the Expected Savings:
– Diabetes Model: predict the probability of being diagnosed with diabetes using the original BMI as well as the reduced BMI. This yields:
  – Probability of being diagnosed with diabetes before BMI treatment.
  – Probability of being diagnosed with diabetes after BMI treatment.

Expected Savings - II
– Expenditure Model: predict the expenditure of each person under the assumption that they have diabetes as well as under the assumption that they do not. This is done twice for each person, once using the original BMI and once using the reduced BMI. This yields:
  – Total Expenditure before BMI treatment under the assumption that the individual has diabetes.
  – Total Expenditure before BMI treatment under the assumption that the individual does not have diabetes.
  – Total Expenditure after BMI treatment under the assumption that the individual has diabetes.
  – Total Expenditure after BMI treatment under the assumption that the individual does not have diabetes.

Expected Savings - III
Mathematically, this is equivalent to:
[Equation not preserved in the source.]

Expected Savings - IV
The framework described on the prior slide was applied to all groups (Male, Female, and Children), and the results are displayed below:

Expected Sample Savings
Group      Expected Total Savings   Maximum Expected Savings   Expected Average Savings
Male       $825,290.82              $905.68                    $77.59
Female     $1,080,604.93            $1,804.71                  $104.08
Children   $21,151.96               $94.70                     $3.99
Total      $1,927,047.70

This total equates to a savings of 1.6% of the total expenditure in the sample. However, this only captures the savings expected from the change in the probability of getting diabetes. Savings associated with the prevention of diabetes would be higher.

Expected Savings - V
The cost of diabetes can be calculated as the predicted expenditure with diabetes minus the predicted expenditure without diabetes.
The average cost of diabetes and the average expected savings from the change in the probability of diabetes are compared graphically on the right for all three groups.
Keep in mind that the savings generated in both of these scenarios are based on models and data at the participant's current age. Lifetime savings from these effects would likely be much higher (especially for the Children group).
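Because the savings equation itself did not survive in this copy, the Python sketch below shows one natural reading of the framework, stated as an assumption rather than as the team's formula: each person's expected expenditure is the probability-weighted mix of the with-diabetes and without-diabetes predictions, and savings are the expected expenditure before treatment minus the expected expenditure after. The cost-of-diabetes difference follows the description on the Expected Savings - V slide. The `Prediction` type, its field names, and the numbers are hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """Model outputs for one person at one BMI level (hypothetical container)."""
    p_diabetes: float   # predicted probability of a diabetes diagnosis
    exp_with: float     # predicted total expenditure assuming diabetes
    exp_without: float  # predicted total expenditure assuming no diabetes


def expected_expenditure(pred):
    """Probability-weighted expenditure (assumed form of the missing equation)."""
    return pred.p_diabetes * pred.exp_with + (1.0 - pred.p_diabetes) * pred.exp_without


def expected_savings(before, after):
    """Expected expenditure before BMI treatment minus expected expenditure after."""
    return expected_expenditure(before) - expected_expenditure(after)


def cost_of_diabetes(pred):
    """Predicted expenditure with diabetes minus predicted expenditure without."""
    return pred.exp_with - pred.exp_without


# Made-up predictions for one person, for illustration only.
before = Prediction(p_diabetes=0.10, exp_with=10000.0, exp_without=2000.0)
after = Prediction(p_diabetes=0.08, exp_with=9800.0, exp_without=1900.0)

savings = expected_savings(before, after)
cost_before = cost_of_diabetes(before)
```

Summing `expected_savings` over everyone in the treatment group would give a sample total comparable in spirit to the Expected Total Savings column, under the stated assumption about the formula.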
[Figure: Expected Average Savings from Change in Probability of Diabetes versus Average Cost of Diabetes, by group (Male, Female, Children), on an axis from $0 to $2,500. Expected average savings labels: $78 (Male), $104 (Female), $4 (Children). Average cost of diabetes labels: $1,932, $237, and $39; their assignment to groups is not recoverable from the source.]

Expected Savings - VI
Lastly, as part of the savings-related section of the problem statement, it was important to show what percentage of the treatment group's savings falls within Medicare and Medicaid expenses. For Medicare, this was estimated with a linear regression as follows:
AMOUNT_PAID_MEDICARE = α + β × TOTALEXP + error_term
Similar to the expected savings framework, the predicted Medicare payment for the treatment group is computed before and after treatment:
PRED_AMOUNT_PAID_MEDICARE(before)
PRED_AMOUNT_PAID_MEDICARE(after)
Medicare savings can then be calculated by subtracting the predicted Medicare payment after treatment from the predicted Medicare payment before treatment:
SAVINGS(medicare) = PRED_AMOUNT_PAID_MEDICARE(before) – PRED_AMOUNT_PAID_MEDICARE(after)

Expected Savings - VII
Since P_AMOUNT_PAID_MEDICARE (written PRED_AMOUNT_PAID_MEDICARE above) can be expressed as a function of P_TOTALEXP, this is mathematically equivalent to:
SAVINGS(medicare) = [α + β × P_TOTALEXP(before)] – [α + β × P_TOTALEXP(after)] = β × [P_TOTALEXP(before) – P_TOTALEXP(after)]
The amount of savings to Medicare expressed as a percentage of the savings in total expenditure is captured as follows:
SAVINGS(medicare) / [P_TOTALEXP(before) – P_TOTALEXP(after)] = β
Therefore, the coefficient β represents the percentage of savings from total expenditure that will go towards Medicare. The same process can be performed to determine the percentage of savings from total expenditure that will go towards Medicaid.
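The argument that the intercept cancels, leaving β as the Medicare share of any change in total expenditure, can be checked numerically. The Python sketch below fits a one-variable least-squares line by hand; the helper name `fit_line` and the toy data (constructed to lie exactly on a line with slope 0.3) are assumptions made for illustration and are not estimates from the competition data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = alpha + beta * x with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Made-up data: Medicare payment is exactly 50 + 0.3 * total expenditure.
totalexp = [1000.0, 2000.0, 3000.0, 4000.0]
amount_paid_medicare = [350.0, 650.0, 950.0, 1250.0]
alpha, beta = fit_line(totalexp, amount_paid_medicare)

# Predicted total expenditure for one person before and after treatment.
p_totalexp_before = 5000.0
p_totalexp_after = 4600.0

pred_medicare_before = alpha + beta * p_totalexp_before
pred_medicare_after = alpha + beta * p_totalexp_after
savings_medicare = pred_medicare_before - pred_medicare_after

# The intercept cancels, so the Medicare share of total savings equals beta.
medicare_share = savings_medicare / (p_totalexp_before - p_totalexp_after)
```

Whatever the intercept turns out to be, `medicare_share` matches the fitted slope up to floating-point rounding, which is why a single coefficient can be reported as the percentage of savings going to Medicare.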
Expected Savings - VIII
The results of the Medicare/Medicaid estimation are as follows:
– 28.1% of the savings goes towards Medicare
– 30.2% of the savings goes towards Medicaid
If we combine these findings and scale the results to represent the entire US population, the total expected savings are as follows:

Expected Population Savings
Total United States Expected Savings   $11,749,980,707
Medicare                               $3,299,159,583
Medicaid                               $3,547,319,175

Overall, this preventative measure of a 10% BMI reduction for the specific treatment group could save the US government over $6.8 billion. Confirmation of the representativeness of the sample is shown next.

Representativeness of the Sample - I
Additional research was performed in order to verify that the sample data set provided was representative of the entire US population from an age and gender perspective.
A Chi-Square Goodness of Fit Test was used to compare the age/gender distributions of the sample data set with the population distribution of the country.
The null and alternative hypotheses for the goodness of fit test are as follows:
– H0: The sample is representative of the population.
– H1: The sample is not representative of the population.
The test statistic is defined as follows:
Χ2 = Σ (O_i – E_i)^2 / E_i, summed over the age categories i, where O_i is the observed sample count in category i and E_i is the count expected under the population distribution.

Representativeness of the Sample - II
Additionally, the degrees of freedom (DF) are equal to:
DF = k – 1, where k is the number of categories.
Males:
– Χ2males = 7.6347785 with DF = 84
– P(Χ2 > 7.6347785) ≈ 1
– Since the p-value is greater than the significance level of 0.05, we do not reject the null hypothesis.
Females:
– Χ2females = 9.030266 with DF = 84
– P(Χ2 > 9.030266) ≈ 1
– Since the p-value is greater than the significance level of 0.05, we do not reject the null hypothesis.
Based on the results of the goodness of fit test, we fail to reject H0 for both genders and conclude that the sample is consistent with the US population by age and gender.
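For readers who want to see the mechanics of the goodness of fit test, the Python sketch below computes the standard Pearson chi-square statistic and its degrees of freedom from observed counts and population shares. The function name and the three-category toy data are assumptions for illustration; the deck's actual test used 85 categories per gender (DF = 84) and its inputs are not reproduced here.

```python
def chi_square_gof(observed_counts, population_shares):
    """Pearson chi-square goodness-of-fit statistic and degrees of freedom.

    observed_counts: sample counts per category.
    population_shares: population proportions per category (should sum to 1).
    """
    n = sum(observed_counts)
    # Expected count in each category if the sample followed the population.
    expected = [n * share for share in population_shares]
    statistic = sum(
        (o - e) ** 2 / e for o, e in zip(observed_counts, expected)
    )
    df = len(observed_counts) - 1
    return statistic, df


# Made-up counts for three age bands, for illustration only.
observed = [60, 25, 15]
shares = [0.5, 0.3, 0.2]
statistic, df = chi_square_gof(observed, shares)
```

The p-value is the upper-tail probability of a chi-square distribution with `df` degrees of freedom at `statistic`. It is left out of the sketch because the Python standard library has no chi-square distribution; a statistics package would be needed for that step.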
EM© Project Description and Diagram
[Figure: Complete Diagram for Male Model, with five labeled stages: 1 – Data Definition, Visualization, and Characterization; 2 – Diabetes Model; 3 – Total Expenditure Model; 4 – Treatment Group; 5 – Expected Savings.]

Conclusions

Conclusion - I
Utilizing the data set provided by USHETH, models were constructed to predict the incidence rate of diabetes.
– For the Male model, the Ensemble model produced the most favorable results.
– For the Female model, the Neural Network model produced the most favorable results.
– For the Children model, the Decision Tree model produced the most favorable results.
In addition to the modeling of the incidence rate of diabetes, models were produced to predict the health care expenditures of the participants within the data set.
– For the Male model, the Ensemble model produced the most favorable results.
– For the Female model, the Neural Network model produced the most favorable results.
– For the Children model, the Decision Tree model produced the most favorable results.

Conclusion - II
The expected savings from a 10% BMI reduction in the treatment group came to approximately $1.9 MM in the sample.
Extrapolating these results across the entire U.S. population suggests total expected savings of $11.7 B, of which $6.8 B would be tied to savings observed in the Medicare/Medicaid categories.
This reflects only contemporaneous savings. Long-term savings, especially for children avoiding a life with diabetes, could be substantially higher.

Thank You
Our thanks go to SAS / DOW / CMU for sponsoring this competition and offering us this opportunity to present our work and attend this conference.