An Effective Combination based on Class-wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining

ADMA 2006, Xi’an, China

Dr. Daniel NEAGU, UK
Dr. Gongde GUO, Dept. of Computer Science, Fujian Normal University, China
Ms. Shanshan WANG, Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China

Bradford, UK
• Bradford, West Yorkshire
• National Museum of Film and Television
• School of Informatics, University of Bradford

Overview (1)
• Introduction to ML applications to KDD
• Proposal of Combination Operators
• Model Construction and Classification Algorithms
• Model Library for Predictive Toxicology
  - Collection of datasets
  - Central store for models and results
  - Formal structure to speed access and improve organisation; reduce ‘misplaced’ files
  - Remote Access: secure access to data from remote locations possible in the future

Overview (2)
• Comparative Studies
• Results from UoB Model Library
  - Study of different Machine Learning techniques
  - Variety of Feature Selection techniques
  - Many datasets and endpoints
  - Large variation in accuracy of created models
  - One aim is to automatically build ensembles based on best class-wise models
• Results and Conclusions

Current Context
[Diagram labels: Hardware; SW (Algorithms); Data collection/management]
• Nowadays more scientific data is generated and flows within systems:
  - Manpower/laboratories
  - Techniques and computational power (Moore’s Law)
  - Funds/legislation
• More data is stored and available:
  - Storage technology faster and cheaper (Storage Law)
  - DBMS capable of handling bigger DB
  - Web/online access to distributed data
• Consequences
  - Human expert is overloaded: very little data is checked
  - Knowledge Discovery is NEEDED for data understanding and use

General definitions
• Data is defined as facts regarding things (such as people, objects, events) which can be digitally transmitted or processed.
• Information is generally defined as data that have been processed and presented in a form suitable for human interpretation, with the purpose of revealing meanings (such as patterns or rules).
• Models are defined as representations of patterns.
• Knowledge: the theoretical and practical comprehension of a certain domain that supports making decisions.
• Intelligence: the capability of learning, understanding and finding solutions for problems in a specific domain.

Example:
• 1234567.89 is data.
• "Your bank balance has jumped 80.87% to £1234567.89" is information.
• "Nobody owes me that much money" is knowledge.
• "I'd better talk to the bank before I spend it, because of what has happened to other people" is intelligence.
Source: http://foldoc.doc.ic.ac.uk

Knowledge Discovery in Databases (KDD)
[Figure: the KDD process. Stages shown: Data sources; Data preparation (Select/preprocess, Feature Selection, Transform); Data mining; Models and extracted information; Interpret/Evaluate/Assimilate; Knowledge]

KDD is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. It involves the following steps:
• understanding the application domain and defining the goals
• selecting the target data set
• data cleaning and pre-processing
• data reduction and projection
• choosing the data modelling function and the algorithm
• data mining
• interpretation
• evaluation and utilization of the discovered knowledge

Predictive Data Mining
• The process of data classification or regression, with the goal of obtaining predictive models for a specific target, based on predictive relationships among a large number of input variables.
• Classification identifies characteristics of data and assigns a data item to one of several predefined categorical classes.
• Regression uses the existing numerical data values and maps them to a real-valued prediction (target) variable.
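The difference between the two predictive tasks can be illustrated with a small sketch. It uses a k-nearest-neighbour rule (k-NN is one of the techniques named in these slides) written in plain Python; the function names, the two-descriptor "compounds" and all numeric values are invented for illustration and are not taken from the datasets discussed in this talk.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Classification: majority class among the k nearest neighbours."""
    labels = [target for _, target in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    values = [target for _, target in k_nearest(train, query, k)]
    return sum(values) / len(values)


# Hypothetical compounds described by two descriptors (invented values).
features = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
            (0.90, 0.80), (0.80, 0.90), (0.85, 0.85)]
toxicity_class = ["low", "low", "low", "high", "high", "high"]  # categorical target
toxicity_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]              # numerical target

class_data = list(zip(features, toxicity_class))
reg_data = list(zip(features, toxicity_value))

query = (0.12, 0.18)
predicted_class = knn_classify(class_data, query)  # a class label
predicted_value = knn_regress(reg_data, query)     # a real number
```

The same neighbours are used in both cases; only the way their targets are aggregated changes (a vote for classification, an average for regression).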
Machine Learning Applications in Data Mining
Dynamics (ISI Thomson Web of Knowledge)
[Figure: references to Machine Learning techniques with applications in Predictive Data Mining, per year from 1995 to 2004 (vertical axis 0 to 3000), for ANNs, GAs, ILP, RI, DTs and k-NN]
[Figure: share of references by technique: ANNs 55%, GAs 30%, DTs 10%, k-NN 3%, RI 1%, ILP 1%]

Multi-Classifier Systems
• Different classifiers potentially offer complementary or at least additional information about patterns to be classified
• Various approaches to classifier combination:
  - Majority voting [4]
  - Entropy-based combination [5]
  - Dempster-Shafer theory-based combination [6], [7]
  - Bayesian classifier combination [8]
  - Similarity-based classifier combination [9]
  - Fuzzy inference [10]
  - Gating networks [11]
  - Statistical models [2]

The Proposed Effective Combination Scheme
We propose a hybrid classifier combination scheme which makes use of the class-wise expertise of diverse classifiers (a priori knowledge obtained from the training set) to achieve potentially better performance.

Two operators are proposed:

$$\varphi_j = \arg\max_{i,\,M_i} \{\alpha_i^j \mid i = 1, 2, \ldots, m\}, \qquad \alpha_i^j = \frac{TP_j^i}{TP_j^i + FP_j^i}, \qquad j = 1, 2, \ldots, L$$

$$\psi = \arg\max_{i,\,M_i} \{CA_i \mid i = 1, 2, \ldots, m\}$$

Here M_i (i = 1, ..., m) are the candidate models, L is the number of classes, TP_j^i and FP_j^i are the true and false positives of model M_i for class j on the training set, and CA_i is the classification accuracy of M_i.

Architecture of the Effective Multiple Classifier System
[Diagram, summarised]
• Model construction algorithm: Training data → Data pre-processing → learning algorithms A1, A2, A3, …, Am → Best Model for Class 1, Best Model for Class 2, …, Best Model for Class L, and Best Model for All Classes.
• Classification algorithm: a testing instance x is passed to the Best Model for Class 1; if x is classified as C1, that is the output. If not, x is passed to the Best Model for Class 2 (output if x is classified as C2), and so on up to the Best Model for Class L. Otherwise, x is classified by the Best Model for All Classes.

ML applications for Predictive Toxicology
The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.
To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
• a defined endpoint
• an unambiguous algorithm
• a defined domain of applicability
• appropriate measures of goodness-of-fit, robustness and predictivity
• a mechanistic interpretation, if possible

Datasets (1)
DEMETRA*
1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds
*http://www.demetra-tox.net

Datasets (2)
CSL APC* Datasets
• 5 endpoints
• A single endpoint/descriptor set used for our experiments:
  - Mallard Duck LD50 toxicity value
  - 60 organophosphates
  - 248 descriptors
*http://www.csl.gov.uk

Datasets (3)
TETRATOX*/LJMU** Dataset
• Tetrahymena Pyriformis inhibition of growth IGC50
• Phenols data
• 250 phenolic compounds
• 187 descriptors
*http://www.vet.utk.edu/tetratox/
**http://www.ljmu.ac.uk

Descriptors
• Multiple descriptor types
• Various software packages to calculate 2D and 3D attributes*
*http://www.demetra-tox.net

Model Library
Algorithms chosen for their representability and diversity, and for easy, simple and fast access:
• Instance-based Learning algorithm (IBL)
• Decision Tree learning algorithm (DT)
• Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
• Multi-Layer Perceptrons (MLPs)
• Support Vector Machine (SVM)

Dimensionality
[Figure: dimensions of the model library: Datasets (Dataset One to Dataset Four), Feature Selection and Algorithms; entries comprise a model, a parameter file and a results file]

Organisation
[Figure: directory hierarchy of the model library, by level]
• Source: CSL APC; DEMETRA; TETRATOX/LJMU
• Endpoint/Descriptors: Mallard_Duck; Trout; Water Flea; Oral Quail; Dietary Quail; Bee; PHENOLS
• Feature Selection: CFS; Chi; CS; GR; IG; ReliefF; SVM; KNNMFS
• File Type: Feature Subsets; Models; Parameters; Raw Results
• Files: Model 1; Model 2; Model 3; … Model n

Comparison of performance of combination schemes on seven data sets
MCS:
• Majority Voting-based Combination (MVC)
• Maximal Probability-based Combination (MPC)
• Average Probability-based Combination (APC)
• Classifier Combination based on Dempster Rule of Combination (DRC)
• CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers)

Conclusions
The proposed combination scheme CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers):
• not only makes use of the expertise of the best individual classifiers but also removes their negative influences
• therefore, the results presented previously show a significant improvement in global performance

Acknowledgements
This work is part-funded by:
• EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, http://pythia.inf.brad.ac.uk/
• EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, http://www.demetra-tox.net

Special thanks also to:
• Dr. Q. Chaudhry (CSL York)
• Dr. Mark Cronin (LJMU)

and PhD students:
• Ms. Ladan Malazizi, BSc, PhD student. Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
• Mr. Paul Trundle, BSc, PhD student. Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
• Ms. Areej Shhab, BEng, MPhil. Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
• Mr. M. Craciun (University of Galati), BSc, MSc
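
Appendix: illustrative sketch of the class-wise combination idea
As a supplement to the slides, the following minimal Python sketch shows one way the class-wise combination described in this talk could be realised: for each class, pick the model with the highest TP/(TP+FP) ratio on the training set; separately pick the model with the highest overall classification accuracy (CA); then classify a new instance by consulting the class-specific models in turn and falling back to the overall best model. This is a sketch under stated assumptions, not the authors' implementation: the function names, the order in which classes are consulted, the tie-breaking rule (first model wins) and the toy threshold "models" are all hypothetical.

```python
def class_precision(preds, truth, label):
    """TP / (TP + FP) for one class; 0.0 if the class is never predicted."""
    tp = sum(1 for p, t in zip(preds, truth) if p == label and t == label)
    fp = sum(1 for p, t in zip(preds, truth) if p == label and t != label)
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(preds, truth):
    """Fraction of training instances predicted correctly (CA)."""
    return sum(1 for p, t in zip(preds, truth) if p == t) / len(truth)


def build_combination(models, train_x, train_y, classes):
    """Select the best model index per class and the best model overall."""
    preds = [[model(x) for x in train_x] for model in models]
    indices = range(len(models))
    best_for_class = {}
    for label in classes:
        best_for_class[label] = max(
            indices, key=lambda i: class_precision(preds[i], train_y, label))
    best_overall = max(indices, key=lambda i: accuracy(preds[i], train_y))
    return best_for_class, best_overall


def classify(x, models, classes, best_for_class, best_overall):
    """Ask each class expert in turn; otherwise use the overall best model."""
    for label in classes:
        if models[best_for_class[label]](x) == label:
            return label
    return models[best_overall](x)


# Toy one-dimensional example with three hypothetical threshold classifiers.
def model_a(x):
    return "pos" if x >= 6 else "neg"


def model_b(x):
    return "pos" if x >= 2 else "neg"


def model_c(x):
    return "pos" if x >= 3 else "neg"


models = [model_a, model_b, model_c]
classes = ["neg", "pos"]
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]

best_for_class, best_overall = build_combination(models, train_x, train_y, classes)
combined = [classify(x, models, classes, best_for_class, best_overall)
            for x in train_x]
```

In this toy setting each class expert only "claims" an instance when it predicts its own class, so a model that is unreliable on other classes cannot influence those decisions; everything unclaimed is left to the model with the best overall accuracy.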