An Evaluation of Data Mining Methods Applied to Adverse Events for Clinical Trials
Xiao-Song Zhong, UMBC, Paul Schuette, Scott Komo, Andrejus Parfionovas, FDA/CDER/OTS/OB
Disclaimer: The views and opinions in this poster are those of the authors, and do not represent policy or guidance of the US FDA.
Introduction
- Data mining consists of novel ways to find unexpected relationships in large observational data sets and to summarize them. Data mining models can include interaction effects.
- There are few papers on the application of data mining methods to adverse event (AE) analyses for clinical trials.
- There are few papers that compare different data mining software packages.
Project Goals
- Determine the strengths and weaknesses of the two software packages under consideration (STATISTICA and TIBCO Spotfire Miner).
- Identify unexpected AEs.
- Identify the AEs associated with higher-risk groups.
- Identify potential classification and machine learning algorithms for AE analyses of clinical data.
The Basic Data Mining Process
- Data preparation: This step is critical. More time was spent organizing, cleaning, and generating indicator variables for AEs than on any other step.
- Splitting the data: The data were randomly split into a training data set (30% of total) and a testing data set (70% of total).
- Explore: Searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
- Modify: Creating, selecting, and transforming the variables to focus the model selection process.
- Model: Using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
- Assess: Comparing the models using appropriate metrics to determine which appears to be best.
Summary results of different algorithms for AE1
Compute overlaid lift charts from all algorithms.
[Figure: two cumulative lift charts (lift value vs. percentile), one for each selected category of AE1: YES(AE1) and No(AE1). Curves: Baseline, GeneralDiscriminantA, TreeModel, ExhaustiveCHAIDModel, MARSplinesModel, BoostTreeModel, SVMModel, RandomForestModel, SANNModel.]
Summary results of different algorithms for AE5
Compute overlaid lift charts from all algorithms.
[Figure: two cumulative lift charts (lift value vs. percentile), one for each selected category of AE5: YES(AE5) and No(AE5). Curves: Baseline, GeneralDiscriminantA, TreeModel, ExhaustiveCHAIDModel, MARSplinesModel, BoostTreeModel, SVMModel, RandomForestModel, SANNModel.]
A boosted classification tree appears to be the best model for both AEs.
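For readers unfamiliar with lift charts, the following is a minimal Python sketch of how one cumulative lift curve could be computed for a single model and a single selected category. It is not the STATISTICA implementation: the function name, the rounding rule for the percentile cut-off, and the toy data are all assumptions made for illustration. Lift at a given percentile is taken here as the event rate among the top-scored subjects divided by the overall event rate, so the baseline curve sits at 1.0.

```python
from typing import List, Sequence, Tuple


def cumulative_lift(
    y_true: Sequence[int],
    scores: Sequence[float],
    percentiles: Sequence[int] = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
) -> List[Tuple[int, float]]:
    """Return (percentile, cumulative lift) pairs for one model.

    y_true: 1 if the subject is in the selected category (e.g. had the AE), else 0.
    scores: model score for the selected category (higher = more likely).
    """
    n = len(y_true)
    if n == 0 or n != len(scores):
        raise ValueError("y_true and scores must be non-empty and equal in length")
    base_rate = sum(y_true) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when the selected category never occurs")

    # Rank subjects from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranked = [y_true[i] for i in order]

    curve = []
    for p in percentiles:
        # Number of subjects in the top p percent, rounded up (an assumed convention).
        k = max(1, (n * p + 99) // 100)
        hit_rate = sum(ranked[:k]) / k
        curve.append((p, hit_rate / base_rate))
    return curve


# Hypothetical data: 10 subjects, 3 of whom had the event.
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift_curve = dict(cumulative_lift(y, model_scores))
```

Overlaying the curves from several models on the same axes, as in the poster's charts, amounts to calling such a function once per model with that model's scores.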
Classification results from TIBCO Spotfire Miner
[Figure: classification results from TIBCO Spotfire Miner; panels labeled AE1 and AE5.]
Data Source
- The submission underwent FDA review and was approved.
- The number of subjects was relatively large. Phase 3 clinical trial data came from multiple trials in multiple regions.
- Primary data sources were analysis data in ADaM-like files.
- Analysis variables: Age, Race, Geographic Region, Sex, Treatment Arm, Cancer Population.
Data Mining Algorithms used
- SVM (Support Vector Machines)
- CRT (Classification and Regression Tree)
- MARS (Multivariate Adaptive Regression Splines)
- Boosted trees
- Exhaustive CHAID
- Best-Subset and Stepwise General Discriminant Analysis ANCOVA
- SANN (STATISTICA Automated Neural Networks)
- Random Forests
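Before any of these algorithms is fitted, the categorical analysis variables have to be coded as indicator variables and the subjects split at random into training (30%) and testing (70%) sets, as described in the data mining process. The sketch below shows one way this could look in plain Python. The helper names, the seed, and the toy records are hypothetical; only the variable name SEX and the 30%/70% proportions come from the poster.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def add_indicators(records: Sequence[Row], variable: str) -> List[Row]:
    """Return copies of the records with one 0/1 indicator per level of `variable`."""
    levels = sorted({str(r[variable]) for r in records})
    coded = []
    for r in records:
        row = dict(r)
        for level in levels:
            row[f"{variable}_{level}"] = int(str(r[variable]) == level)
        coded.append(row)
    return coded


def random_split(
    rows: Sequence[Row], train_fraction: float = 0.30, seed: int = 2024
) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows and return (training set, testing set)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical subject-level records (not data from the poster).
subjects = [
    {"SUBJID": str(i), "SEX": "F" if i % 2 else "M", "AE1": "YES" if i % 5 == 0 else "NO"}
    for i in range(10)
]
coded = add_indicators(subjects, "SEX")
train, test = random_split(coded)
```

With real trial data, a split stratified by AE status may be preferable when the event is rare; the poster describes only a simple random split.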
Results
Misclassification error rates from STATISTICA:
The table shows misclassification error rates for two of the nine AEs considered, obtained using an EDA tool.

Model (AE1)       Training error %   Testing error %
Neural network     3.87               3.34
SVM                3.87               3.34
Boosted tree       3.87               3.34
Random forest     27.63              30.17
Class&RT          47.35              50.09

Model (AE5)       Training error %   Testing error %
Neural network     0.6                0.7
SVM                0.6                0.7
Boosted tree       0.6                0.7
Random forest      4.89               5.3
Class&RT          38.87              42.35
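The error percentages above are misclassification rates: the share of subjects whose predicted class differs from the observed class. A minimal sketch of that calculation follows, together with the error a trivial rule would make by always predicting the most frequent class. That baseline is an addition for context, since a low error rate is easy to reach when an AE is rare; the poster does not report it, and the function names and toy labels are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def misclassification_rate(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Percentage of subjects whose predicted class differs from the observed class."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return 100.0 * wrong / len(y_true)


def majority_class_error(y_true: Sequence[Hashable]) -> float:
    """Error (in percent) of always predicting the most frequent observed class."""
    if len(y_true) == 0:
        raise ValueError("y_true must be non-empty")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return 100.0 * (len(y_true) - most_common_count) / len(y_true)


# Hypothetical labels: the AE occurs in 1 of 10 subjects and the model never predicts it.
observed = ["NO"] * 9 + ["YES"]
predicted = ["NO"] * 10
model_error = misclassification_rate(observed, predicted)
baseline_error = majority_class_error(observed)
```

Comparing a model's testing error with this baseline shows how much of its accuracy comes from the class imbalance alone.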
No one algorithm performed uniformly well using Spotfire Miner, possibly due to the limited number of algorithms used.
Risk estimates summary of boosted classification trees model from STATISTICA:
[Figure: "Summary of Boosted Trees", two panels plotting average multinomial deviance against number of trees (20 to 200) for train data and test data, with the optimal number of trees marked. Response AE1: optimal number of trees 196, maximum tree size 3. Response AE5: optimal number of trees 157, maximum tree size 3.]
Important variables with Boosted classification tree models in STATISTICA:
[Figure: importance plots (importance on a 0.0 to 1.2 scale) for dependent variables AE1 and AE5. Predictors in plotted order for AE1: AGEY, RACE, GEOREG, SEX, TREAT, CANCER; for AE5: AGEY, RACE, GEOREG, CANCER, SEX, TREAT.]
Age, race, and geographic region are the most important predictors for AE5, which is consistent with other AEs.
Conclusions
- One algorithm (boosted trees) performed well for all AEs under consideration. This method was available with STATISTICA, but not with Spotfire Miner.
- STATISTICA offers more options, but more data cleaning was required.
- Data mining methods can be helpful in identifying covariates related to adverse events.
- Predictive modeling and data mining have the potential to identify groups at high risk for specific AEs. The regulatory impact of predictive modeling is not clear at this time.
References
- Austin PC. A comparison of classification and regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Statistics in Medicine. 2007;26:2937-2957.
- Austin PC, Lee DS. Boosted classification trees result in minor to modest improvement in the accuracy in classifying cardiovascular outcomes compared to conventional classification trees. Am J Cardiovasc Dis. 2011;1(1):1-15.
- Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference. San Francisco, CA: Morgan Kaufmann; 1996:148-156.
- Nisbet R, Elder J, Miner G. Handbook of Statistical Analysis and Data Mining Applications. Academic Press.
CDER Office of Biostatistics
Supported by ORISE