A Data-Driven Approach for Improved Effective Classification in Predictive Toxicology
ICCC 2006, Tallinn
Dr. Daniel NEAGU, Dr. Gongde GUO
School of Informatics, University of Bradford
Bradford, West Yorkshire, UK
(Title-slide image captions: Bradford, UK; Bradford, West Yorkshire; National Museum of Film and Television)

Overview
- Short Introduction to Predictive Toxicology Data and Models
- The Current Context on Interspecies Data Extrapolation
- Our Motivation and Approach
- Algorithm for Data-driven Hybrid Classification
- Model development
- Case studies
- Results and Conclusions

Predictive Data Mining
The processes of data classification/regression whose goal is to obtain predictive models for a specific target, based on predictive relationships among a large number of input variables.
- Classification defines characteristics of data and identifies a data item as a member of one of several predefined categorical classes.
- Regression uses the existing numerical data values and maps them to a real-valued prediction (target) variable.

Predictive Toxicology
- Predictive Toxicology is a multi-disciplinary science that requires close collaboration among toxicologists, chemists, biologists, statisticians and AI/ML researchers.
- The goal of toxicity prediction is to describe the relationship between chemical properties and biological and toxicological processes: it relates features of a chemical structure to a property, effect or biological activity associated with the chemical.

Data in Predictive Toxicology

ML applications for Predictive Toxicology
The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.
To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
- a defined endpoint
- an unambiguous algorithm
- a defined domain of applicability
- appropriate measures of goodness-of-fit, robustness and predictivity
- a mechanistic interpretation, if possible

The Context for our Approach
(Diagram labels: In Vivo Data; In Vitro generated data; In Silico (Algorithms))
Data from In Vivo experiments:
- increased laboratory standards
- financial and social costs
- questionable outputs, given the different initial conditions for tests and the differing definitions of the output between experiments
In Vitro Data:
- reduces the costs of in vivo experiments
- dependent on artificial conditions
- focused on particular output measurements, without an integrated biological dependency and reaction
In Silico data:
- depends on the computing and modelling resources
- far less expensive than the previous two
- one might define an inversely proportional relationship between data quality and data quantity

Our Approach
Data availability: different chemical compounds are chosen and tested on different species for different purposes, and some of them are tested on more than one species for various experimental reasons. The available data are:
- Sparse data sets
- Copyrighted
- Not homogeneous (endpoint, laboratory conditions, standards, measurement units)
- Distributed in time and sources
Further supporting experimental data for training classifiers are frequently limited and expensive.
Some endpoints show good correlations (e.g. aquatic toxicity measured for various fish species, daphnia etc.).
Consequently, extrapolation methods can be used in regulatory toxicology to overcome these drawbacks.
The goal is to predict the toxic effects of different chemical compounds on particular species by considering both the toxicity values/classes of chemical compounds that have been tested on these species and those tested on other species with correlated toxicity values/classes.
Multi-Classifier Systems
Different classifiers potentially offer complementary, or at least additional, information about the patterns to be classified.
Various approaches to classifier combination:
- majority voting
- entropy-based combination
- Dempster-Shafer theory-based combination
- Bayesian classifier combination
- similarity-based classifier combination
- fuzzy inference
- gating networks
- statistical models

We propose a Data-driven Multi-Classifier Model for correlated PT Data Sets
Step 1: for each dataset, build a model on all instances with a predefined class label, and then use this model to predict any unclassified instances.
Step 2: for every two datasets, count the number of instances that have a predefined class label in both, and among them the numbers of exact matches, matches with distance = 1 and matches with distance = 2.
Step 3: find candidate pairs of datasets from different endpoints whose toxicity classes are highly correlated, i.e. the rate of matches with distance ≤ 1 is greater than or equal to 90%.
Based on previous investigations, and under the assumption that the classes of the same chemical compounds in two datasets are highly correlated, a hybrid integration scheme is proposed:
Step 4: for each dataset, build a model based on the training set and then use it to classify new instances. If the distance between the predicted class and the class of the same chemical compound in its most correlated dataset (with a different endpoint) is 2, give the class label of the latter to the new instance.

The Architecture of the Data-driven Multiple Classifier System for PT Interspecies Extrapolation
[Figure: two parallel pipelines, Training (Endpoint1) and Training (Endpoint2). Each takes training data (Descriptors, Class), builds a Model, and applies it to Testing data (Descriptors, Class). Decision rule for an instance t, with Ci the predicted class of t and Cj the class of t in the correlated endpoint: if d(Ci, Cj) ≤ δ then C = Ci; otherwise C = f(Ci, Cj).]

Datasets (DEMETRA*)
1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds
*http://www.demetra-tox.net

Descriptors
- Multiple descriptor types
- Various software packages to calculate 2D and 3D attributes*
*http://www.demetra-tox.net

Model Development
Algorithms were chosen for their representativeness and diversity, and for being easy, simple and fast to access:
- Bayes Networks (BN)
- Instance-Based Learning algorithm (IBL)
- Decision Tree learning algorithm (DT)
- Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
- Multi-Layer Perceptrons (MLPs)
- Support Vector Machine (SVM)

Experiments
1. For each dataset, the most relevant descriptors were selected by considering the individual predictive ability of each descriptor along with the degree of redundancy between them: subsets of descriptors that are highly correlated with the class while having low intercorrelation were preferred.
2. A model based on all available training instances with predefined classes was built for each dataset and then used to predict unclassified instances.
3. The toxicity classes of the same chemical compounds were compared for two different endpoints. The difference between toxicity classes is measured by a distance function. For class labels in descending order of toxicity, C = {c1, c2, ..., cm} with Toxicity(c1) ≥ Toxicity(c2) ≥ … ≥ Toxicity(cm):

$$\mathrm{Distance}(c_i, c_j) = \begin{cases} j - i, & \text{if } i \le j \\ i - j, & \text{otherwise} \end{cases}$$

that is, the absolute difference |i − j| of the class indices.
The pairs (Trout, Daphnia), (Bee, Dietary_Quail) and (Dietary_Quail, Oral_Quail) are significantly correlated.

Results
C1 stands for the highly toxic class; C2 stands for the medium toxic class; C3 stands for the non-toxic class; PTN is the percentage of toxic chemical compounds classified as non-toxic chemical compounds.

Conclusions
- Regardless of whether the performance of each original classification method is good or bad, its counterpart that integrates the available correlative information obtained better performance.
- In experiments on five toxicity datasets, the proposed hybrid classification system obtained better performance than each single classifier-based model.
- The hybrid integration systems (IBL-HIS) reduced the percentage of toxic chemicals classified as non-toxic chemicals.

Acknowledgements
This work is part-funded by:
- EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach
- EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, http://www.demetra-tox.net
Special thanks also to:
http://pythia.inf.brad.ac.uk/
- Dr. Q. Chaudhry (CSL York)
- Dr. Mark Cronin (LJMU)
and PhD students:
- Ms. Ladan Malazizi, BSc, PhD student
- Mr. Paul Trundle, BSc, PhD student
- Ms. Areej Shhab, BEng, MPhil
- Mr. M. Craciun (University of Galati), BSc, MSc
Research themes (in the order they appear on the slide):
- Hybrid Intelligent Systems applied to predict Pesticide Toxicity
- Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
- Applications of Machine Learning in Knowledge Discovery and Data Mining
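
Appendix: illustrative sketch of the class distance and the hybrid rule
The class distance, the pairwise match counts of Step 2, the 90% criterion of Step 3 and the override rule of Step 4 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes three ordered classes C1, C2, C3 as in the Results legend, the function names and the toy compound identifiers are hypothetical, and the per-dataset model building (BN, IBL, DT, RIPPER, MLP, SVM) is left out, so the predicted class is simply passed in.

```python
from typing import Dict, Optional

# Classes in descending order of toxicity (assumption: three classes,
# following the C1/C2/C3 legend on the Results slide).
CLASSES = ["C1", "C2", "C3"]


def class_distance(ci: str, cj: str) -> int:
    """Distance between two class labels: absolute difference of their indices."""
    return abs(CLASSES.index(ci) - CLASSES.index(cj))


def match_stats(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> Dict[str, float]:
    """Step 2: compare the compounds that carry a predefined class in both datasets."""
    shared = labels_a.keys() & labels_b.keys()
    counts = {0: 0, 1: 0, 2: 0}
    for compound in shared:
        counts[class_distance(labels_a[compound], labels_b[compound])] += 1
    n = len(shared)
    return {
        "shared": n,
        "exact": counts[0],
        "dist1": counts[1],
        "dist2": counts[2],
        # Share of common compounds whose classes differ by at most 1.
        "rate_le1": (counts[0] + counts[1]) / n if n else 0.0,
    }


def is_correlated(stats: Dict[str, float], threshold: float = 0.90) -> bool:
    """Step 3: a pair qualifies when the distance <= 1 match rate reaches 90%."""
    return stats["shared"] > 0 and stats["rate_le1"] >= threshold


def hybrid_class(predicted: str, correlated: Optional[str]) -> str:
    """Step 4: take the correlated dataset's class when it is 2 classes away."""
    if correlated is not None and class_distance(predicted, correlated) == 2:
        return correlated
    return predicted


# Toy labels (hypothetical compounds, not DEMETRA data).
trout = {"cmp_a": "C1", "cmp_b": "C2", "cmp_c": "C3", "cmp_d": "C1"}
daphnia = {"cmp_a": "C1", "cmp_b": "C3", "cmp_c": "C3", "cmp_e": "C2"}

stats = match_stats(trout, daphnia)
print(stats)
print(is_correlated(stats))
print(hybrid_class("C3", "C1"))  # predicted non-toxic, correlated endpoint says highly toxic
```

Passing None as the correlated class covers compounds that were not tested in the correlated dataset, in which case the model's own prediction is kept. The override only fires at distance 2, which is the rule stated in Step 4; the more general form C = f(Ci, Cj) with threshold δ from the architecture slide is not specified further in the slides and is not modelled here.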