Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models
Xue Z Wang

The Background
• 26 million distinct organic and inorganic chemicals are known; > 80,000 are in commercial production.
• Combinatorial chemistry adds more than 1 million new compounds to the library every year.
• In the UK, > 10,000 chemicals are evaluated for possible production every year.
• The biggest cost factor: TOXICITY values are not known!

What is toxicity?
• "The dose makes the poison" - Paracelsus (1493-1541).
• Toxicity endpoints: EC50, LC50, ...
• Toxicity tests are expensive, time consuming and disliked by many people.

In Silico Toxicity Prediction: SAR & QSAR
• (Quantitative) Structure Activity Relationships: TOPKAT, DEREK, MultiCASE.
• Molecular modelling.
• Toxicity endpoints: Daphnia magna EC50, carcinogenicity, mutagenicity, rat oral LD50, mouse inhalation LC50, skin sensitisation, eye irritancy.
• Descriptors (physicochemical, biological, structural): molecular weight, HOMO (highest occupied molecular orbital), LUMO (lowest unoccupied molecular orbital), heat of formation, log D at pH 2, 7.4 and 10, dipole moment, polarisability, total energy, molecular volume, ... The number of descriptors drives both cost and time.
• SAR & QSAR modelling techniques: e.g. neural networks, PLS, expert systems.

Aims of Research
• An integrated data mining environment (IDME) for in silico toxicity prediction.
• A decision tree induction technique for ecotoxicity modelling.
• In silico techniques for mixture toxicity prediction.

Why a Data Mining System for In Silico Toxicity Prediction?
Limitations of existing systems:
• Unknown confidence level of prediction.
• Extrapolation beyond the training domain.
• Models built from small datasets.
• Fixed descriptors.
• May not cover the endpoint required.
• Users' own data resources, often commercially sensitive, are not fully exploited.

Data Mining: Discover Useful Information and Knowledge from Data
Moving up the chain Data -> Information -> Knowledge -> Decision, value increases while volume decreases. Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• Data: records of numerical data, symbols, images, documents.
• Knowledge: IF .. THEN .. rules, cause-effect relationships, decision trees, patterns (normal / abnormal operation), predictive equations, ... and, more importantly, better understanding.
Tasks: clustering, classification, conceptual clustering, inductive learning, dependency modelling, summarisation, regression, case-based learning.

Example: Dependency Modelling (Link Analysis)
Given binary observations of three variables (a pairwise-dependency sketch is given at the end of this background section):

x1 x2 x3
 1  1  0
 1  0  0
 1  0  1
 0  0  1
 0  1  0
 1  1  0
 1  0  0
 1  1  1
 0  1  1
 0  1  0

dependency modelling learns candidate graph structures linking x1, x2 and x3.

Earlier data mining work of the group:
• Data pre-processing: wavelets for on-line signal feature extraction and dimension reduction; fuzzy approaches for dynamic trend interpretation.
• Clustering: supervised classification (BPNN, fuzzy set covering); unsupervised classification (ART2 adaptive resonance theory, AutoClass, PCA).
• Dependency modelling: Bayesian networks, fuzzy methods, SDG (signed directed graph), decision trees.
• Others: automatic rule extraction from data using fuzzy-NN and fuzzy SDG; visualisation.

Motivation from process operation: the cost of PREVENTABLE abnormal operations is e.g. $20 billion per year in the petrochemical industry. Fault detection and diagnosis is very complex: sensor faults, equipment faults, control loops, interaction of variables, modern control systems. Process operational safety envelopes, from start point to end point of an operation (Yussel's work; Loss Prevention in the Process Industries,
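As a concrete illustration of the dependency modelling example above, here is a minimal Python sketch that ranks variable pairs from the binary table by mutual information. Scoring links by mutual information is one simple, illustrative way to do link analysis; it is an assumption, not necessarily the method behind the slide's graphs.

```python
# Minimal sketch: pairwise dependency detection via mutual information.
# The use of mutual information is an illustrative assumption.
from collections import Counter
from math import log2

rows = [  # the binary observations (x1, x2, x3) from the slide
    (1, 1, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 1, 0),
    (1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1), (0, 1, 0),
]

def mutual_information(i, j):
    """I(Xi; Xj) estimated from the empirical joint distribution."""
    n = len(rows)
    joint = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Rank variable pairs by dependency strength; the strongest pairs become
# candidate links in the dependency graph.
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
for i, j in sorted(pairs, key=lambda p: -mutual_information(*p)):
    print(f"x{i+1} -- x{j+1}: MI = {mutual_information(i, j):.3f}")
```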
2002).

Integrated Data Mining Environment (IDME) for Toxicity
• Data import: Excel, ASCII files, database, XML.
• Descriptor calculation.
• Data preprocessing: scaling, missing values, outlier identification, feature extraction.
• Data mining toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, k-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation.
• Results presentation: graphs, tables, ASCII files.
• Discovery validation: statistical significance; results for training and test sets.
• User interface.

Quantitative Structure Activity Relationship
75 organic compounds with 1094 descriptors and endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al., QSAR 17(2), 1998, pp. 131-138):

Log(1/EC50) = -0.3766 + 0.0444 Vx   (r2 = 0.7078, MSE = 0.2548)

where Vx is McGowan's characteristic volume, r2 is Pearson's correlation coefficient, and q2 denotes the leave-one-out cross-validated correlation coefficient. (A minimal sketch applying this equation is given at the end of this slide group.)

IDME also provides principal component analysis, clustering and multidimensional visualisation.

Feedforward Neural Networks
Input layer (PC1, PC2, PC3, ..., PCm) -> hidden layer -> output layer (Log(1/EC50)); FFNN results are presented graphically.

QSAR Mode for Mixture Toxicity Prediction
Training and testing are organised over mixtures with similar constituents and mixtures with dissimilar constituents.

Why Inductive Data Mining for In Silico Toxicity Prediction?
• Lack of knowledge on which descriptors are important to toxicity endpoints (feature selection).
• Expert systems rely on subjective knowledge obtained from human experts.
• Linear vs nonlinear behaviour.
• Black-box models.

What is inductive learning? It aims at developing a qualitative causal language for grouping data patterns into clusters: decision trees or production rules, which are explicit and transparent.

Comparison of approaches:
• Expert systems: based on human expert knowledge; knowledge is transparent and causal; easy to set up; but the knowledge is subjective, the data are not used, and the output is often qualitative.
• Statistical methods: data driven and quantitative; but black-box, and human knowledge is not used.
• Neural networks: data driven, quantitative and nonlinear; but black-box, and human knowledge is not used.
• Inductive data mining: combines the advantages of expert systems with those of statistical methods and neural networks; qualitative and quantitative, nonlinear; both data and human knowledge are used; can capture dynamics and interactions; knowledge is transparent and causal; but more research is needed, e.g. on continuous-valued output.
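The published linear QSAR above is simple enough to apply directly. A minimal sketch follows; the example Vx value is hypothetical (McGowan characteristic volumes are normally computed from atomic contributions):

```python
# Minimal sketch: applying the published linear QSAR for Vibrio fischeri.
def log_inv_ec50(vx: float) -> float:
    """Predict Log(1/EC50) from McGowan's characteristic volume Vx."""
    return -0.3766 + 0.0444 * vx

vx_example = 82.0  # hypothetical Vx for an illustrative compound
print(f"Vx = {vx_example}: predicted Log(1/EC50) = {log_inv_ec50(vx_example):.3f}")
# With r2 = 0.7078, roughly 71% of the variance in the endpoint is explained
# by Vx alone, which motivates the nonlinear, tree-based models that follow.
```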
Discretization Techniques
C5.0 uses binary discretization based on information entropy (Quinlan 1986 & 1993). Methods tested, each as a pre-discretizer for C5.0:
• C5.0 alone
• LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997): LERS_C5.0
• Probability distribution histogram: Histogram_C5.0
• Equal width interval: EQI_C5.0
• KEX (Knowledge EXplorer, Berka & Bruha 1998): KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0
• CN4 (Berka & Bruha 1998): CN4_C5.0
• Chi2 (Liu & Setiono 1995, Kerber 1992): Chi2_C5.0

Decision Tree Generation Based on Genetic Programming
• Traditional tree generation methods use greedy search and can miss potential models.
• Genetic algorithms, as an optimisation approach, can effectively avoid local minima and evaluate many solutions simultaneously; GAs have been used in decision tree generation to decide the splitting points and attributes while growing a tree.
• Genetic (evolutionary) programming not only evaluates many solutions simultaneously and avoids local minima, but also does not require encoding the parameters into fixed-length vectors (chromosomes): it applies the GA operators directly to tree structures.

Genetic Computation
(1) Generate a population of candidate solutions.
(2) Repeat until the stop criteria are satisfied:
    (i) calculate the fitness function value for each candidate;
    (ii) perform crossover and mutation to generate the next generation.
(3) Take the best solution over all generations as the result.
In genetic algorithms, crossover recombines fixed-length chromosomes; in genetic (evolutionary) programming (EPTree), crossover swaps subtrees.

EPTree Algorithm (DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870, 2004; Buontempo, Wang et al., J Chem Inf Comput Sci 45, 904-912, 2005)
1. Divide the data (molecules x descriptors) into training and test sets.
2. Generate the first population of trees:
   - Randomly choose a row (a compound) and a column (a descriptor), and use the value in that slot, s, to split: the left child takes the data points with the selected attribute value <= s, the right child those with values > s.
   - If a child would not cover enough rows (e.g. 10% of the training rows), another combination is tried.
   - A child node becomes a leaf if it is pure (all rows covered are in the same class) or nearly pure; the other nodes grow children.
   - When all nodes either have two children or are leaf nodes, the tree is fully grown and is added to the first generation.
   - Each leaf node is assigned the class label of the majority class of the points partitioned there.
3. Crossover and mutation:
   - Tournament: randomly select a group of trees, e.g. 16, calculate their fitness values, and take the winner as the first parent; generate the second parent in the same way.
   - Crossover the two parents to generate a child; repeat to generate the other children.
   - Select a percentage of trees for mutation (the operators are listed after the sketch below).
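The EPTree procedure above translates almost directly into code. Below is a minimal, self-contained Python sketch of the idea, not the authors' implementation: random (row, descriptor) splits grow the initial trees, and tournament selection with subtree crossover evolves them. The population size (600), tournament size (16) and the ~10% coverage rule follow the slides; the Node class, the accuracy-based fitness and the subtree-swap details are illustrative assumptions.

```python
import copy
import random

class Node:
    def __init__(self, attr=None, split=None, left=None, right=None, label=None):
        self.attr, self.split = attr, split    # descriptor index, threshold s
        self.left, self.right = left, right    # children (None for a leaf)
        self.label = label                     # majority class at a leaf

def majority(ys):
    return max(set(ys), key=ys.count)

def grow(X, y, min_rows, purity=1.0, tries=50):
    """Grow one tree: random row/column splits; leaf when (near) pure.

    purity=1.0 means strictly pure leaves; a value < 1 allows near-pure."""
    top = majority(y)
    if len(y) < 2 * min_rows or y.count(top) / len(y) >= purity:
        return Node(label=top)
    for _ in range(tries):                     # retry invalid combinations
        r, a = random.randrange(len(X)), random.randrange(len(X[0]))
        s = X[r][a]                            # split value taken from a slot
        left = [i for i in range(len(X)) if X[i][a] <= s]
        right = [i for i in range(len(X)) if X[i][a] > s]
        if len(left) >= min_rows and len(right) >= min_rows:
            return Node(a, s,
                        grow([X[i] for i in left], [y[i] for i in left],
                             min_rows, purity),
                        grow([X[i] for i in right], [y[i] for i in right],
                             min_rows, purity))
    return Node(label=top)                     # no valid split found

def predict(node, x):
    while node.label is None:
        node = node.left if x[node.attr] <= node.split else node.right
    return node.label

def fitness(tree, X, y):
    """Training-set classification accuracy."""
    return sum(predict(tree, xi) == c for xi, c in zip(X, y)) / len(y)

def nodes(t):
    return [t] + (nodes(t.left) + nodes(t.right) if t.label is None else [])

def crossover(a, b):
    """Copy parent a, then overwrite one random node with a subtree of b."""
    child = copy.deepcopy(a)
    target = random.choice(nodes(child))
    donor = copy.deepcopy(random.choice(nodes(b)))
    target.__dict__.update(donor.__dict__)
    return child

def evolve(X, y, pop_size=600, tour=16, gens=40):
    min_rows = max(1, len(X) // 10)            # child must cover ~10% of rows
    pop = [grow(X, y, min_rows) for _ in range(pop_size)]
    best = max(pop, key=lambda t: fitness(t, X, y))
    for _ in range(gens):
        score = {id(t): fitness(t, X, y) for t in pop}
        def parent():                          # tournament winner
            return max(random.sample(pop, tour), key=lambda t: score[id(t)])
        pop = [crossover(parent(), parent()) for _ in range(pop_size)]
        # a chosen percentage of trees would be mutated here (operators below)
        best = max(pop + [best], key=lambda t: fitness(t, X, y))
    return best
```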
Mutation Methods
- Random change of the split point (i.e. choosing a different row's value for the current attribute).
- Choosing a new attribute whilst keeping the same row.
- Choosing a new attribute and a new row.
- Re-growing part of the tree.
- If there is no improvement in accuracy for k generations, the trees generated are mutated.
- ...

Two Data Sets
Data set 1: concentration lethal to 50% of the population, LC50 (modelled as 1/Log(LC50)), for Vibrio fischeri, a bioluminescent bacterium; 75 compounds, 1069 molecular descriptors.
Data set 2: concentration affecting 50% of the population, EC50, of the alga Chlorella vulgaris, measured via the disappearance of fluorescein diacetate; 80 compounds, 1150 descriptors.

Endpoint discretization into four classes (a binning sketch follows the results tables below):

Data set   Minimum   Class 1   Class 2   Class 3   Class 4   Maximum
Bacteria   0.90      <=3.68    <=4.05    <=4.50    >4.50     6.32
Algae      -4.06     <=-1.05   <=-0.31   <=0.81    >0.81     3.10

600 trees were grown in each generation, with 16 trees competing in each tournament to select trees for crossover; 66.7% of trees were mutated for the bacterial dataset and 50% for the algae dataset.

Evolutionary Programming Results: Dataset 1 (bacteria, generation 37)
Training accuracy 91.7% (60 cases); test accuracy 73.3% (15 cases).
The tree splits on: Cl attached to C2 (sp3) <= 1; highest eigenvalue of Burden matrix weighted by atomic mass <= 2.15; distance degree index <= 15.124; summed atomic weights of angular scattering function <= -1.164; R autocorrelation of lag 7 weighted by atomic mass <= 3.713; lowest eigenvalue of Burden matrix weighted by van der Waals volume <= 3.304; self-returning walk count of order 8 <= 4.048. Leaves: Class 1 (12/12), Class 2 (7/8), Class 2 (5/6), Class 3 (8/8), Class 3 (6/7), Class 4 (7/7), Class 4 (5/6), Class 4 (5/6).

Decision Tree Using C5.0 for the Same Data (dataset 1, bacteria)
Training accuracy 88.3% (60 cases); test accuracy 60.0% (15 cases).
The tree splits on: gravitational index <= 7.776; valence connectivity index <= 3.346; Cl attached to C1 (sp2) <= 1; H autocorrelation lag 5 weighted by atomic mass <= 0.007; summed atomic weights of angular scattering function <= -0.082. Leaves: Class 1 (13/14), Class 2 (11/12), Class 3 (5/6), Class 3 (7/7), Class 4 (14/15), Class 4 (3/6).

Evolutionary Programming Results: Dataset 2 (algae, GP tree, generation 9)
Training accuracy 92.2%; test accuracy 81.3%.
The tree splits on: solvation connectivity index <= 2.949; self-returning walk count of order 8 <= 3.798; molecular multiple path count of order 3 <= 92.813; H autocorrelation of lag 2 weighted by Sanderson electronegativities <= 0.401; 2nd component symmetry directional WHIM index weighted by van der Waals volume <= 0.367. Leaves: Class 1 (16/16), Class 2 (14/15), Class 3 (6/8), Class 3 (9/10), Class 4 (6/7), Class 4 (8/8).

Decision Tree Using See5.0 for the Same Data (dataset 2, algae)
Training accuracy 90.6%; test accuracy 75.0%.
The tree splits on: max eigenvalue of Burden matrix weighted by van der Waals volume <= 3.769; Broto-Moreau autocorrelation of topological structure lag 4 weighted by atomic mass <= 9.861; total accessibility index weighted by van der Waals volume <= 0.281. Leaves: Class 1 (16/16), Class 2 (15/16), Class 3 (15/20), Class 4 (12/12).

Summary of Results

Data set 1 – Bacteria data
                    C5.0    GP method
Tree size           6       8
Training accuracy   88.3%   91.7%
Test accuracy       60.0%   73.3%

Data set 2 – Algae data
                    C5.0    GP method
Tree size           4       6
Training accuracy   90.6%   92.2%
Test accuracy       75.0%   81.3%

Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy

Data set 1 – Bacteria data
                    C5.0    GP (generation 31)
Tree size           6       8
Training accuracy   88.3%   88.3%
Test accuracy       60.0%   73.3%

Data set 2 – Algae data
                    C5.0    GP (generation 9)
Tree size           4       6
Training accuracy   90.6%   90.6%
Test accuracy       75.0%   87.5%
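As promised above, a minimal sketch of the four-class endpoint binning defined by the class-range table; the thresholds come from the slide, while the function name is illustrative:

```python
# Minimal sketch: map a continuous endpoint value to class 1..4.
def endpoint_class(value: float, bounds) -> int:
    """Return the class for `value` given the three upper bounds."""
    for cls, upper in enumerate(bounds, start=1):
        if value <= upper:
            return cls
    return len(bounds) + 1                 # above the last bound -> class 4

BACTERIA_BOUNDS = (3.68, 4.05, 4.50)       # dataset 1, range 0.90..6.32
ALGAE_BOUNDS = (-1.05, -0.31, 0.81)        # dataset 2, range -4.06..3.10

print(endpoint_class(4.2, BACTERIA_BOUNDS))  # -> 3
print(endpoint_class(-2.0, ALGAE_BOUNDS))    # -> 1
```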
Application to Wastewater Treatment Plant Data
Plant layout: inflow -> pre-treatment (grit removal, screening, screws) -> primary treatment (primary settler) -> secondary treatment (aeration tanks, secondary settler) -> outflow, plus a sludge line. The data correspond to 527 days' operation and 38 variables covering input, pre-treatment, primary treatment, secondary treatment, the sludge line and output.

Decision tree for prediction of suspended solids (SS) in effluents, training data:
470 observations; training accuracy 99.8%, test accuracy 93.0%; 20 leaf nodes (L = low, N = normal, H = high).
The splits are dominated by SS-P (input SS to the primary settler), with further tests on DQO-D (input COD to the secondary settler), DBO-D (input BOD to the secondary settler), PH-D (input pH to the secondary settler), SSV-P (input volatile SS to the primary settler), RD-DBO-G, RD-DQO-S, ZN-E, DBO-SS and DBO-E.

Using all the data of the 527 days: accuracy 99.25%; 18 leaf nodes (L = low, N = normal, H = high); the splits again involve SS-P, DBO-D, DBO-E, PH-D, PH-P, SED-P, COND-S, RD-SS-P, RD-SS-G and RD-DQO-S.

Final Remarks
• An integrated data mining prototype system for toxicity prediction of chemicals and mixtures has been developed.
• An evaluation of current inductive data mining approaches to toxicity prediction has been conducted.
• A new methodology for inductive data mining, based on a novel use of genetic programming, has been proposed, giving promising results in three case studies.

On-going Work
1) Adaptive discretization of endpoint values through simultaneous mutation of the output.
[Figure: the best training accuracy (0-100%) in each generation (1-35) for the trees grown for the algae data using the SSRD (sum of squared differences in rank) fitness, plotted separately for 2-class, 3-class and 4-class trees. The 2-class trees no longer dominate, and very accurate 3-class trees have been found.]
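A minimal sketch of an SSRD-style measure as named on the slide (sum of squared differences in rank); how predictions are ranked against actual endpoint values here is an illustrative assumption, and ties are not handled:

```python
# Minimal sketch: SSRD = sum of squared differences in rank.
def ranks(values):
    """0-based rank of each value when sorted ascending."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def ssrd(actual, predicted) -> int:
    """Sum of squared differences between actual and predicted ranks."""
    return sum((a - p) ** 2 for a, p in zip(ranks(actual), ranks(predicted)))

# Lower SSRD means the model preserves the ordering of endpoint values
# better; a perfectly preserved ordering gives SSRD = 0.
print(ssrd([0.9, 4.2, 6.3], [1.1, 4.0, 5.9]))  # -> 0 (same ordering)
```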
Future Work
2) Extend the method to model tree and fuzzy model tree generation. For example, over nine principal components:

Rule 1: If antecedent one applies, with degree μ1 = μ1,1 × μ1,2 × ... × μ1,9, then
y1 = 0.1910 PC1 + 0.6271 PC2 + 0.2839 PC3 + 1.2102 PC4 + 0.2594 PC5 + 0.3810 PC6 - 0.3695 PC7 + 0.8396 PC8 + 1.0986 PC9 - 0.5162
Rule 2: If antecedent two applies, with degree μ2 = μ2,1 × μ2,2 × ... × μ2,9, then
y2 = 0.7403 PC1 + 0.5453 PC2 - 0.0662 PC3 - 0.8266 PC4 + 0.1699 PC5 - 0.0245 PC6 + 0.9714 PC7 - 0.3646 PC8 - 0.3977 PC9 - 0.0511
Final output (crisp value): (μ1 × y1 + μ2 × y2) / (μ1 + μ2), where μi = μi,1 × μi,2 × ... × μi,9.
Fuzzy membership functions are used in the rule antecedents. (A minimal sketch of this weighted-average output follows.)
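A minimal sketch of the crisp-output computation for the two rules above (a Takagi-Sugeno-style weighted average); the membership degrees mu1, mu2 and the PC scores are hypothetical inputs that would come from the antecedent membership functions and the PCA projection:

```python
# Minimal sketch: crisp output of the two fuzzy model tree rules above.
W1 = [0.1910, 0.6271, 0.2839, 1.2102, 0.2594, 0.3810, -0.3695, 0.8396, 1.0986]
B1 = -0.5162
W2 = [0.7403, 0.5453, -0.0662, -0.8266, 0.1699, -0.0245, 0.9714, -0.3646, -0.3977]
B2 = -0.0511

def rule_output(weights, bias, pcs):
    """Linear consequent y = w . pcs + bias over the 9 principal components."""
    return sum(w * p for w, p in zip(weights, pcs)) + bias

def crisp_output(pcs, mu1, mu2):
    """Weighted average (mu1*y1 + mu2*y2) / (mu1 + mu2)."""
    y1, y2 = rule_output(W1, B1, pcs), rule_output(W2, B2, pcs)
    return (mu1 * y1 + mu2 * y2) / (mu1 + mu2)

pcs = [0.1] * 9  # hypothetical PC scores for one compound
print(crisp_output(pcs, mu1=0.7, mu2=0.3))
```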
3) Extend the method to mixture toxicity prediction: training and testing over mixtures with similar constituents and mixtures with dissimilar constituents.

Acknowledgements
F V Buontempo, M Mwense, A Young, D Osborn; Crystal Faraday Partnership on Green Technology; AstraZeneca Brixham Environmental Laboratory; NERC Centre of Ecology and Hydrology.

Types of Molecular Descriptor (type: definition; examples)
• Constitutional: physical description of the compound; molecular weight, atom counts.
• Topological: 2D descriptors taken from the molecular graph; Wiener index, Balaban index.
• Walk counts: obtained from molecular graphs; total walk count.
• Burden eigenvalues (BCUT): eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound; weighted by atomic mass, volume, electronegativity or polarizability.
• Galvez topological charge indices: describe charge transfer between pairs of atoms, calculated from the eigenvalues of the adjacency matrix; topological and mean charge index of various orders.
• 2D autocorrelation: sum of the atom weights of the terminal atoms of all the paths of a given length (lag); Moreau, Moran and Geary autocorrelations.
• Charge descriptors: charges estimated by quantum molecular methods; total positive charge, dipole index.
• Aromaticity indices: estimated from the geometrical distance between aromatically bonded atoms; harmonic oscillator model of aromaticity.
• Randic molecular profiles: derived from distance distribution moments of the geometry matrix; molecular profile, shape profile.
• Geometrical descriptors: conformation-dependent, based on molecular geometry; 3D Wiener index, gravitational index.
• Radial distribution function descriptors: obtained from radial basis functions centred at different distances.
• 3D-MoRSE (3D Molecule Representation of Structure based on Electron diffraction): calculated by summing atomic weights viewed by different angular scattering functions.
• GETAWAY (GEometry, Topology, and Atom Weights AssemblY): calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained from centred atomic coordinates; unweighted or weighted by atomic mass, volume, electronegativity or polarizability.
• WHIM (weighted holistic invariant molecular): statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates; unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state.
• Functional groups: counts of various atoms and functional groups; primary carbons, aliphatic ethers.
• Atom-centred fragments: from the 120 atom-centred fragments defined by Ghose-Crippen; Cl-086, Cl attached to C1 (sp3).
• Various others: unsaturation index (number of non-single bonds); Hy (a function of the count of hydrophilic groups); aromaticity ratio (aromatic bonds / total number of bonds in an H-depleted molecule); Ghose-Crippen molecular refractivity; fragment-based polar surface area.