Dental Data Mining: Practical Issues and Potential Pitfalls
Stuart A. Gansky
University of California, San Francisco
Center to Address Disparities in Children’s Oral Health
Support: US DHHS/NIH/NIDCR U54 DE14251

What is Knowledge Discovery and Data Mining (KDD)?
• “Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data” – MIT Tech Review (2001)
• Interface of
– Artificial Intelligence
– Computer Science
– Machine Learning
– Engineering
– Statistics
• Association for Computing Machinery Special Interest Group on Knowledge Discovery in Data and Data Mining (ACM SIGKDD, which sponsors the KDD Cup)

Data Mining as Alchemy
[Figure: lead (Pb) transmuted into gold (Au)]

Some Potential KDD Applications in Oral Health Research
• Large surveys (eg NHANES)
• Longitudinal studies (eg VA Aging Study)
• Disease registries (eg SEER)
• Digital diagnostics (radiographic & others)
• Molecular biology (eg PCR, microarrays)
• Health services research / claims data
• Provider and workforce databases

Supervised Learning
• Regression
• k nearest neighbor
• Trees (CART, MART, boosting, bagging)
• Random Forests
• Multivariate Adaptive Regression Splines (MARS)
• Neural Networks
• Support Vector Machines

Unsupervised Learning
• Hierarchical clustering
• k-means

KDD Steps
• Collect & Store – Sample, Merge, Warehouse
• Pre-Process – Clean, Impute, Transform, Standardize, Register
• Analyze – Supervised, Unsupervised, Visualize
• Validate – Internal (Split Sample, Cross-validate, Bootstrap), External
• Act – Intervene, Set Policy

Data Quality
[Figure]

Example – Caries
• Predicting disease with traditional logistic regression can run into modeling difficulties: nonlinearity (ANN better) & interactions (CART better) (Kattan et al, Comput Biomed Res, 1998)
• Goal: compare the performance of logistic regression with popular data mining techniques – tree and artificial neural network models – on dental caries data
• CART in caries (Stewart & Stamm, JDR, 1991)

Example Study – Child Caries
• Background: ~20% of children have ~80% of
caries (tooth decay)
• University of Rochester longitudinal study (Leverett et al, J Dent Res, 1993)
• 466 1st–2nd graders caries-free at baseline
• Saliva samples & exams every 6 months
• Goal: predict 24-month caries incidence (output)

18-month Predictors (Inputs)
• Salivary bacteria
– Mutans Streptococci (log10 CFU/ml)
– Lactobacilli (log10 CFU/ml)
• Salivary chemistry
– Fluoride (ppm)
– Calcium (mmol/l)
– Phosphate (ppm)

Modeling Methods
• Logistic Regression
• Neural Networks
• Decision Trees

Logistic Regression Models
[Figure: schematic surface of logit(primary dentition caries) vs fluoride (ppm) and log10 Mutans Streptococci]

Tree Models
[Figure: schematic surface of logit(primary dentition caries) vs fluoride (ppm) and log10 Mutans Streptococci]

Artificial Neural Networks
[Figure: schematic surface of logit(primary dentition caries) vs fluoride (ppm) and log10 Mutans Streptococci]

Artificial Neural Network (p-r-1)
[Figure: inputs x1 … xp connected by weights wij to a hidden layer of neurons h1 … hr, connected by weights wj to the output y]

Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)
• Too many parameters for sample size
• No validation
• No model complexity penalty (eg Akaike Information Criterion (AIC))
• Incorrect misclassification estimation
• Implausible function
• Incorrectly described network complexity
• Inadequate statistical competitors
• Insufficient comparison to statistical competitors

Validation
• Split sample (70% training / 30% validation)
– Validation estimates unbiased misclassification
• K-fold cross-validation
– Mean squared error (Brier score)

Why Validate?
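The p-r-1 multilayer perceptron sketched earlier can be written out as a short forward pass. This is a minimal sketch, assuming hyperbolic-tangent hidden activations and a logistic output; all weight values below are arbitrary illustrative numbers, not fitted coefficients from the study:

```python
import math

def mlp_p_r_1(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a p-r-1 multilayer perceptron.

    x        : list of p inputs (x1 .. xp)
    W_hidden : r rows of p input-to-hidden weights (the w_ij)
    b_hidden : r hidden-layer biases
    w_out    : r hidden-to-output weights (the w_j)
    b_out    : output bias
    Returns a predicted probability via a logistic output.
    """
    # Hidden neurons h1 .. hr: tanh of the weighted input sum
    h = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(W_hidden, b_hidden)]
    # Output y: logistic transform of the weighted hidden sum
    logit = sum(w * hj for w, hj in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))

# A 5-3-1 network like the one in the caries example:
# 5*3 + 3 hidden weights/biases + 3 + 1 output weights/bias = 22 df
p, r = 5, 3
W_hidden = [[0.1] * p, [-0.2] * p, [0.05] * p]  # arbitrary values
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.5, -0.3, 0.2]
prob = mlp_p_r_1([7.1, 3.0, 0.05, 1.2, 10.0], W_hidden, b_hidden, w_out, 0.0)
print(prob)
```

The parameter count explains the "5-3-1 = 22 df" notation in the ANN settings: 15 input-to-hidden weights, 3 hidden biases, 3 hidden-to-output weights, and 1 output bias.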
Example: Overfitting in 2 Dimensions
[Figure: scatterplot of response (0–15) vs predictor (0–25)]

Linear Fit to Data
[Figure: fitted line y = 0.3449x + 1.2802, R² = 0.9081]

High-Degree Polynomial Fit to Data
[Figure: fitted curve y = −0.0012x⁶ + 0.1196x⁵ − 4.8889x⁴ + 105.05x³ − 1250.4x² + 7811.5x − 19989, R² = 1]

10-Fold Cross-Validation
[Figure: the data are divided into 10 folds; each fold in turn is held out for validation while the remaining 9 train the model]

Caries Example Model Settings
• Logit
– Stepwise selection
– Alpha = .05 to enter, alpha = .20 to stay
– AIC to judge additional predictors
• Tree
– Splitting criterion: Gini index
– Pruning: proportion correctly classified

ANN Settings
• Artificial Neural Network (5-3-1 = 22 df)
– Multilayer perceptron
– 5 preliminary runs
– Levenberg-Marquardt optimization
– No weight decay parameter
– Average error selection
– 3 hidden nodes/neurons
– Activation function: hyperbolic tangent

ANN Sensitivity Analyses
• Random seeds: 5 values – no differences
• Weight decay parameters: 0, .001, .005, .01, .25 – only slight differences for .01 and .25
• Hidden nodes/neurons: 2, 3, 4 – 3 seems best

Tree Model
[Figure: decision tree (N = 322 training, N = 144 validation); overall primary caries prevalence 15%. Splits on log10 MS at 7.08 (<: 15%, ≥: 91%) and 3.91 (<: 3%, ≥: 14%), log10 LB at 3.05 (<: 10%, ≥: 23%), and fluoride at .056 ppm (<: 22%, ≥: 25%) and .110 ppm (<: 100%, ≥: 0%); nodes shaded by whether their prevalence is above or below the overall 15%]

Receiver Operating Characteristic (ROC) Curves
[Figure]

Cumulative Captured Response Curves
[Figure]

Lift Chart
[Figure]

Logistic Regression

            Beta   Std Err   Odds Ratio   95% CI
log10 MS    .238   .072      1.27         1.10 – 1.46
log10 LB    .311   .070      1.36         1.19 – 1.57

MARS – MS at 4 Times
[Figure]

Predicted Quintiles
[Figure: predicted values grouped into quintiles by rank of PR_ANN]
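The Gini splitting criterion named in the tree settings can be illustrated in a few lines: a candidate split is scored by how much it reduces node impurity. This is a sketch only; the parent counts echo the study's training set (48 cases of 322, about 15%), but the child-node counts are made up for illustration:

```python
def gini(pos, neg):
    """Gini impurity of a binary node: 2p(1-p), where p = pos/(pos+neg)."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity decrease from splitting a parent node into two children.

    Each argument is a (pos, neg) count pair; the decrease is the parent
    impurity minus the size-weighted average of the child impurities.
    CART-style tree growing picks the split maximizing this decrease."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    weighted = (n_l / n) * gini(*left) + (n_r / n) * gini(*right)
    return gini(*parent) - weighted

# Hypothetical split of the 322-child training sample (48 caries cases)
parent = (48, 274)
left, right = (10, 240), (38, 34)   # made-up counts for each branch
print(round(gini_decrease(parent, left, right), 4))
```

A split that isolates most cases on one side (here 38 of 48 in the smaller branch) yields a large impurity decrease; a split leaving both children at the parent's 15% prevalence would yield a decrease near zero.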
5-fold CV Results

        RMS error   AUC
Logit   .365        .680
Tree    .363        .553
ANN     .362        .707

Summary
• Data quality and study design are paramount
• Use multiple methods
• Be sure to validate
• Graphical displays aid interpretation
• KDD methods may provide advantages over traditional statistical models in dental data

Prediction is only as good as the data and the model.
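The odds ratios and 95% confidence intervals in the logistic regression table earlier follow directly from the coefficients and standard errors, via OR = exp(beta) and CI = exp(beta ± 1.96·SE). A minimal check, using the Beta and Std Err values from the table:

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and 95% CI from a logistic regression coefficient
    and its standard error, via exponentiation of the Wald interval."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# Coefficients reported for the caries logistic regression
for name, beta, se in [("log10 MS", 0.238, 0.072),
                       ("log10 LB", 0.311, 0.070)]:
    or_, lo, hi = odds_ratio_ci(beta, se)
    print(f"{name}: OR = {or_:.2f}, 95% CI {lo:.2f} – {hi:.2f}")
# log10 MS: OR = 1.27, 95% CI 1.10 – 1.46
# log10 LB: OR = 1.36, 95% CI 1.19 – 1.57
```

This reproduces the tabled values: each additional log10 CFU/ml of mutans streptococci multiplies the odds of caries by about 1.27, and of lactobacilli by about 1.36.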