An Evaluation of Data Mining Methods Applied to Adverse Events for Clinical Trials

Xiao-Song Zhong, UMBC
Paul Schuette, Scott Komo, Andrejus Parfionovas, FDA/CDER/OTS/OB

Disclaimer: The views and opinions in this poster are those of the authors, and do not represent policy or guidance of the US FDA.

Introduction

- Data mining consists of novel ways to find unexpected relationships in large observational data sets and to summarize them. Data mining can include interaction effects.
- There are few papers on the application of data mining methods to adverse event (AE) analyses for clinical trials.
- There are few papers that compare different data mining software packages.

Project Goals

- Determine the strengths and weaknesses of the two software packages under consideration.
- Identify unexpected AEs.
- Identify the AEs associated with higher risk groups.
- Identify potential classification and machine learning algorithms for AE analyses of clinical data.

The Basic Data Mining Process

- Data preparation is critical: more time was spent organizing, cleaning, and generating indicator variables for AEs than on any other step.
- Splitting the data: Data were randomly split into a training data set (30% of total) and a testing data set (70% of total).
- Explore: Searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
- Modify: Creating, selecting, and transforming the variables to focus the model selection process.
- Model: Using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
- Assess: Comparing the models using appropriate metrics to determine which appears to be best.

Data Source

- Submission underwent FDA review and was approved.
- Number of subjects was relatively large.
- Phase 3 clinical trial data came from multiple trials and multiple regions.
- Primary data sources were analysis data in ADaM-like files.
- Analysis variables: Age, Race, Geographic Region, Sex, Treatment Arm, Cancer Population.

Classification of TIBCO Spotfire Miner

[Figure: classification results from TIBCO Spotfire Miner, shown per AE.]

Summary results of different algorithms for AE1

Compute overlaid lift charts from all algorithms.

[Figure: Cumulative lift charts for AE1, one panel for the selected category No(AE1) and one for YES(AE1). x-axis: Percentile; y-axis: Lift value. Curves: Baseline, GeneralDiscriminantA, TreeModel, ExhaustiveCHAIDModel, MARSplinesModel, BoostTreeModel, SVMModel, RandomForestModel, SANNModel.]

Summary results of different algorithms for AE5

Compute overlaid lift charts from all algorithms.

[Figure: Cumulative lift charts for AE5, one panel for the selected category YES(AE5) and one for No(AE5). x-axis: Percentile; y-axis: Lift value. Curves: same nine as for AE1.]

A boosted classification tree appears to be the best model for both AEs.
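As an illustration of what a cumulative lift chart plots, the following is a minimal sketch and not the computation used by either software package, whose internals the poster does not describe. It assumes each subject has a predicted event probability and a binary outcome; the function name `cumulative_lift` and the demo data are hypothetical.

```python
from typing import List, Sequence, Tuple


def cumulative_lift(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float]]:
    """Return (percentile, cumulative lift) pairs for a binary outcome.

    scores: predicted probability of the event per subject (assumed input).
    labels: 1 if the subject had the event, 0 otherwise.
    """
    if len(scores) != len(labels) or len(labels) == 0:
        raise ValueError("scores and labels must be non-empty and equal in length")
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    total_events = sum(labels)
    if total_events == 0:
        raise ValueError("at least one event is needed to define lift")

    # Event rate in the whole sample: the baseline every lift is measured against.
    base_rate = total_events / len(labels)

    # Rank subjects from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]

    points = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * len(ranked) / n_bins)
        if cutoff == 0:
            continue  # bin too small to contain any subject
        rate_in_top = sum(ranked[:cutoff]) / cutoff
        points.append((100.0 * b / n_bins, rate_in_top / base_rate))
    return points


# Synthetic demo data (not from the poster): 10 subjects, 3 events.
demo_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
demo_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for percentile, lift in cumulative_lift(demo_scores, demo_labels, n_bins=5):
    print(f"top {percentile:5.1f}%: lift = {lift:.2f}")
```

By construction the lift at the 100th percentile equals 1, which is the baseline curve. A model whose curve stays well above 1 at low percentiles concentrates the events among the subjects it ranks as highest risk.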
Data Mining Algorithms used

- SVM (Support Vector Machines)
- CRT (Classification and Regression Tree)
- MARS (Multivariate Adaptive Regression Splines)
- Boosted trees
- Exhaustive CHAID
- Best-Subset and Stepwise General Discriminant Analysis ANCOVA
- SANN (Select Automated neural network)
- Random Forests

Results

Misclassification error rate from STATISTICA: The table shows the misclassification error rates for two of the nine AEs considered, using an EDA tool.

Model (AE1)        Training error %    Testing error %
Neural network           3.87               3.34
SVM                      3.87               3.34
Boosted tree             3.87               3.34
Random forest           27.63              30.17
Class&RT                47.35              50.09

Model (AE5)        Training error %    Testing error %
Neural network           0.6                0.7
SVM                      0.6                0.7
Boosted tree             0.6                0.7
Random forest            4.89               5.3
Class&RT                38.87              42.35

Risk estimates summary of boosted classification trees model from STATISTICA:

[Figure: Summary of Boosted Trees, two panels. Response AE1: optimal number of trees 196, maximum tree size 3. Response AE5: optimal number of trees 157, maximum tree size 3. x-axis: Number of Trees; y-axis: Average Multinomial Deviance. Series: Train data, Test data, Optimal number.]

Important variables with Boosted classification tree models in STATISTICA:

[Figure: Importance plots, two panels. Dependent variable AE1, variables shown as AGEY, RACE, GEOREG, SEX, TREAT, CANCER. Dependent variable AE5, variables shown as AGEY, RACE, GEOREG, CANCER, SEX, TREAT. y-axis: Importance.]

Age, race, and geographic region are the most important predictors for AE5, which is consistent with other AEs.

Conclusions

- No one algorithm performed uniformly well using Spotfire Miner, possibly due to the limited number of algorithms used.
- One algorithm (Boosted Trees) performed well for all AEs under consideration. This method was available with STATISTICA, but not with Spotfire Miner.
- There are more options for STATISTICA, but more data cleaning was required.
- Data mining methods can be helpful in identifying covariates related to adverse events.
- Predictive modeling and data mining have the potential to identify groups at high risk for specific AEs. The regulatory impact of predictive modeling is not clear at this time.

CDER Office of Biostatistics
Supported by ORISE
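The random 30%/70% split and the training and testing misclassification error rates reported in the poster can be illustrated with a small sketch. It assumes synthetic outcome labels and uses a majority-class rule as a stand-in classifier; the helper names `split_train_test`, `misclassification_rate`, and `majority_class` are hypothetical and do not correspond to any STATISTICA or Spotfire Miner function.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def split_train_test(
    records: Sequence[str], train_fraction: float = 0.30, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Randomly split records into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def misclassification_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Percentage of subjects whose predicted class differs from the observed one."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and equal in length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)


def majority_class(labels: Sequence[str]) -> str:
    """Most frequent class in the training labels (a baseline, not a real model)."""
    return Counter(labels).most_common(1)[0][0]


# Synthetic outcomes (not from the poster): 1000 subjects, 40 with the AE.
outcomes = ["YES"] * 40 + ["NO"] * 960
train, test = split_train_test(outcomes, train_fraction=0.30, seed=1)

# Fit the baseline on the training data only, then score both sets.
baseline = majority_class(train)
train_error = misclassification_rate([baseline] * len(train), train)
test_error = misclassification_rate([baseline] * len(test), test)
print(f"train n={len(train)}, error={train_error:.2f}%")
print(f"test  n={len(test)}, error={test_error:.2f}%")
```

For a rare event, a rule that always predicts the majority class already has a low error rate, so such a baseline is a useful reference point when comparing error rates across models.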