Discovering Patterns in Adverse Drug Reactions
Student: Ernst Joham
Supervisor: Associate Prof Jiuyong Li
Associate Supervisor: Dr. Jan Stanek

Outline
• Background
• Motivation
• Research questions
• Literature review
• Data mining process
• Results
• Conclusion

Background
• What is data mining? Data mining is used to discover unexpected, interesting and valuable information in datasets.
• A high percentage of hospital admissions and prolonged hospitalisations are due to adverse drug reactions (ADRs).
• What can cause ADRs?
  • The dosage given to patients
  • More than one drug taken at the same time
  • Ingredients in drugs that can result in an adverse reaction

Background
• Problems with medical datasets
  • Medical data is more diverse and complex
  • Ethical and legal issues
  • Data quality
    • Missing values
    • Noise
  • Ownership
  • Lack of information

Motivation
• Achieve a successful outcome in discovering patterns in medical datasets
• Find the most suitable algorithms to handle noise and missing values in medical datasets
• Better handle the complexity and diversity of medical datasets

Research Questions
• The aim of the research was to use data mining methods in an attempt to produce relevant results from real-world medical data.
• The following research questions were answered:
  (1) Is it possible to discover patterns in sparse datasets?
  (2) What patterns can be identified through data mining for ADRs?

Literature review (techniques)
• Decision tree, logistic programs, k-nearest neighbour and Bayesian classifier techniques have been applied to medical datasets (Lavrač 1999).
• Lee et al. (2000) state that techniques that easily extract specific knowledge are the key to medical decision making.
• A study on drug discovery showed that neural networks performed better than logistic regression, but decision trees performed better in identifying active compounds (Obenshain 2004).

Literature review (process model)
• Medical data mining applications that are expected to discover new knowledge should follow a five-stage process model (Wang 2000):
  • planning tasks
  • developing data mining hypotheses
  • preparing data
  • selecting data mining tools
  • evaluating data mining results
• Cios & Moore (2002) state that success requires following the DMKD process model, which adds several steps to the CRISP-DM model and has been applied to several medical problem domains.

Literature review (problems with medical datasets)
• Brown & Kros (2003) focused on the impact of missing data and how existing methods can help. They categorise methods for dealing with missing data into:
  • Use complete data only
  • Delete selected cases or variables
  • Data imputation
  • Model-based approaches
• Some researchers have focused on data cleansing tools to help eliminate noise, but these can only achieve a reasonable result (Zhu & Wu 2004).

Literature review
• Attribute noise is more difficult to handle and includes (Zhu & Wu 2004):
  (1) Incorrect attribute values
  (2) Missing or "don't know" attribute values
  (3) Incomplete attributes or "don't care" values

Data Mining Processing
• The project followed the six-step CRISP-DM data mining process
• Understand the main aim of the project
• Understand the dataset

ADRDATE     Agedays  BRAND        DRUG           ID   Prob  ROUTE    Recov  Severity  URNO     ATC
31/01/2007           Lyclear      Permethrin     707  Cert  Topical  Rec    Minor     unknown  P03AC04
9/06/2003   14367    Tegretol CR  Carbamazepine  4    Cert  Oral     Rec              ax6cx8z  N03AF01
11/06/2003  1 4173   Zoloft       Sertraline     5    Unc   Oral                      ax66486  N06AB06

Data Mining Process
Summary of missing values (total 1286 records)

Attribute  Missing values
ADRDATE    0
AGEDAYS    1
ROUTE      570
RECOV      344
ATC        191

Unknown / NR / REC: 188 / 82 / 657

Data Mining Process
• Data in .csv format
• R programming language
• Rattle tool for data mining
• Data preparation
  • Remove duplicates
  • Correct misspelled words
  • Correct meanings of values
  • Find missing ATC (Anatomical Therapeutic Chemical) values
  • Leave missing values for the rest of the dataset

Data Mining Process
• Data transformation
  • Date when the patient was admitted to hospital for ADRs (October-March = 1, April-September = 0)
  • Patient age, categorised into groups with an equal number of records (0-2 years old = 1, 2-5 years old = 2, 5-11 years old = 3, 11-16 years old = 4, above 16 years of age = 5)
  • Route of administration of the medication that caused the ADR, either oral or intravenous (Oral = 1, Intravenous = 0)
  • Recovered from ADRs or not (Recovered = 0, Not recovered = 1)
  • Whether the drugs given to the patient are antibiotics (Antibiotics = 1, Not antibiotics = 0)

Data Mining Processing
[Figure panels: AGE, ADRDATE, ROUTE, RECOV, ATC]

Data Mining Process
• Modelling phase
  • Logistic regression
  • Decision tree
  • Risk pattern algorithm
• Evaluation phase
• Deployment

Results
• Results for the logistic regression technique

Coefficients:
             Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)  -1.901353  0.466304    -4.077   4.55e-05 ***
ADRDATE       0.136312  0.285722     0.477   0.633
AGEDAYS       0.002067  0.115482     0.018   0.986
ROUTE         0.059532  0.290016     0.205   0.837
ANTIBIOTICS  -0.181255  0.300150    -0.604   0.546

Results
• Decision Tree Result

1) root 1035 473 1 (0.4570048 0.5429952)
  2) AGE>=3.5 407 140 0 (0.6560197 0.3439803)
    4) ADRDATE< 0.5 203 61 0 (0.6995074 0.3004926) *
    5) ADRDATE>=0.5 204 79 0 (0.6127451 0.3872549)
      10) AGE>=4.5 100 35 0 (0.6500000 0.3500000)
        20) ROUTE>=0.5 79 27 0 (0.6582278 0.3417722) *
        21) ROUTE< 0.5 21 8 0 (0.6190476 0.3809524)
          42) RECOV=Yes 18 6 0 (0.6666667 0.3333333) *
          43) RECOV=NO 3 1 1 (0.3333333 0.6666667) *
      11) AGE< 4.5 104 44 0 (0.5769231 0.4230769)
        22) ROUTE< 0.5 77 30 0 (0.6103896 0.3896104) *
        23) ROUTE>=0.5 27 13 1 (0.4814815 0.5185185) *
  3) AGE< 3.5 628 206 1 (0.3280255 0.6719745)
    6) ROUTE< 0.5 236 109 1 (0.4618644 0.5381356)
      12) RECOV=NO 24 6 0 (0.7500000 0.2500000)

Results
• Risk patterns for NO

Pattern  Conditions                           Risk ratio
1        ADRDATE 1, AGEDAYS 3, ANTIBIOTICS 0  2.4852
2        AGEDAYS 3, ANTIBIOTICS 0             2.5582
3        ADRDATE 1, AGEDAYS 4, ROUTE 1        2.1904
4        AGEDAYS 4, ROUTE 1, ANTIBIOTICS 0    2.1757

Other values from the output, in slide order: 3 2 3 3; 3.0324 3.1002 2.5663 2.5375; 26 62 25 34 9 46 9 26; 7 16 6 8

• Pattern 1, where Risk Ratio = 2.48:
  • Agedays = between 5 and 11 years old
  • Adrdate = months between October and March
  • Antibiotics = No

Conclusion
• Build a data mining process to answer the problem posed.
• Use algorithms that work for medical applications.
• Noise and missing values do pose a problem, but reasonable results can still be achieved.
• More relevant patterns can be produced for medical experts if maximum information is included in the dataset.

References
Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data Systems, vol. 103, pp. 611-621.
Cios, K 2002, 'Uniqueness of medical data mining', Artificial intelligence in medicine, vol. 26, no. 1-2, pp. 1-24.
CRISP_DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008, <http://www.crisp-dm.org/Partners/index.htm>.
Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial intelligence in medicine, vol. 16, no. 1, pp. 3-23.
Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.
Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, Mining risk patterns in medical data, ACM, Chicago, Illinois, USA.
Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data', Infection Control and Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.
Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug Reactions: Why Health Professionals Need to Take Action, WHO publications, viewed 15 April 2008, <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.
Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IT in Medicine and Education, 2008. ITME 2008. IEEE International Symposium on, Xiamen.
Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.
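
Supplementary sketch: data transformation
The coding rules listed under "Data transformation" can be illustrated with a short script. This is a minimal sketch in Python, not the R/Rattle code used in the project. Several details are assumptions that the slides do not state: dates are read as day/month/year, a year is taken as 365.25 days, each age band includes its lower bound and excludes its upper bound, the raw recovery codes are "Rec" and "NR", and a drug is counted as an antibiotic when its ATC code starts with J01 (antibacterials for systemic use). The function names and the dictionary layout are hypothetical.

```python
from datetime import datetime
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumption: the slides do not say how days map to years


def encode_season(adr_date: str) -> int:
    """October-March = 1, April-September = 0 (date assumed to be d/m/Y)."""
    month = datetime.strptime(adr_date, "%d/%m/%Y").month
    return 1 if month >= 10 or month <= 3 else 0


def encode_age(age_days: Optional[float]) -> Optional[int]:
    """Map age in days to the five age groups; missing values stay missing."""
    if age_days is None:
        return None
    years = age_days / DAYS_PER_YEAR
    # Assumption: lower bound inclusive, upper bound exclusive.
    for upper_bound, code in ((2, 1), (5, 2), (11, 3), (16, 4)):
        if years < upper_bound:
            return code
    return 5


def encode_route(route: Optional[str]) -> Optional[int]:
    """Oral = 1, Intravenous = 0; any other route is left as missing."""
    if route is None:
        return None
    return {"oral": 1, "intravenous": 0}.get(route.strip().lower())


def encode_recov(recov: Optional[str]) -> Optional[int]:
    """Recovered = 0, Not recovered = 1 (raw codes "Rec"/"NR" are assumed)."""
    if recov is None:
        return None
    return {"rec": 0, "nr": 1}.get(recov.strip().lower())


def encode_antibiotic(atc: Optional[str]) -> Optional[int]:
    """Antibiotics = 1, otherwise 0 (assumption: ATC prefix J01 marks antibiotics)."""
    if atc is None:
        return None
    return 1 if atc.strip().upper().startswith("J01") else 0


def encode_record(record: dict) -> dict:
    """Apply all five coding rules to one raw record."""
    return {
        "ADRDATE": encode_season(record["ADRDATE"]),
        "AGE": encode_age(record.get("AGEDAYS")),
        "ROUTE": encode_route(record.get("ROUTE")),
        "RECOV": encode_recov(record.get("RECOV")),
        "ANTIBIOTICS": encode_antibiotic(record.get("ATC")),
    }


# Second sample record from the "Understand the dataset" slide.
sample = {"ADRDATE": "9/06/2003", "AGEDAYS": 14367, "ROUTE": "Oral",
          "RECOV": "Rec", "ATC": "N03AF01"}
encoded = encode_record(sample)
```

Missing or uncoded values are returned as None rather than imputed, in line with the data preparation step that leaves missing values in place for the rest of the dataset.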