Induction of Decision Trees Using
Genetic Programming for the
Development of SAR Toxicity Models
Xue Z Wang
The Background
• 26 million distinct organic and inorganic chemicals are known
• > 80,000 are in commercial production
• Combinatorial chemistry adds more than 1 million new compounds to the library every year
• In the UK, > 10,000 are evaluated for possible production every year
Biggest cost factor: TOXICITY values are not known!
What is toxicity?
• "The dose makes the poison" - Paracelsus (1493-1541)
• Toxicity endpoints: EC50, LC50, …
Toxicity tests are:
• expensive,
• time consuming, and
• disliked by many people
In Silico Toxicity Prediction:
SAR & QSAR - (Quantitative) Structure Activity Relationships
Existing tools: TOPKAT, DEREK, MultiCASE
[Diagram: molecular modelling feeds SAR & QSAR models that predict toxicity endpoints]
Toxicity endpoints include:
• Daphnia magna EC50
• Carcinogenicity
• Mutagenicity
• Rat oral LD50
• Mouse inhalation LC50
• Skin sensitisation
• Eye irritancy
DESCRIPTORS
Physicochemical, biological, and structural descriptors feed the SAR & QSAR models (e.g. neural networks, PLS, expert systems). Examples:
• Molecular weight
• HOMO
• LUMO
• Heat of formation
• Log D at pH 2, 7.4, 10
• Dipole moment
• Polarisability
• Total energy
• Molecular volume
• ...
HOMO - highest occupied molecular orbital
LUMO - lowest unoccupied molecular orbital
[Plot: cost and time grow with the number of descriptors]
Aims of Research
• An integrated data mining environment (IDME) for in silico toxicity prediction
• A decision tree induction technique for ecotoxicity modelling
• In silico techniques for mixture toxicity prediction
Why a Data Mining System for In Silico Toxicity Prediction?
Limitations of existing systems:
• Unknown confidence level of prediction
• Extrapolation
• Models built from small datasets
• Fixed descriptors
• May not cover the endpoint required
• Users' own data resources, often commercially sensitive, are not fully exploited
Data Mining: Discover Useful Information and Knowledge from Data
[Diagram: the data-to-value pyramid - data (high volume) is refined into information, then knowledge, then decisions (high value)]
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"
Data: records of numerical data, symbols, images, documents
Knowledge:
• Rules: IF .. THEN ..
• Cause-effect relationships
• Decision trees
• Patterns: abnormal vs normal operation
• Predictive equations
• ……
More importantly: better understanding
Data mining tasks:
• Clustering
• Classification
• Conceptual clustering
• Inductive learning
• Dependency modelling
• Summarisation
• Regression
• Case-based learning
e.g. Dependency Modelling or Link Analysis

x1 | x2 | x3
 1 |  1 |  0
 1 |  0 |  0
 1 |  0 |  1
 0 |  0 |  1
 0 |  1 |  0
 1 |  1 |  0
 1 |  0 |  0
 1 |  1 |  1
 0 |  1 |  1
 0 |  1 |  0

[Diagram: dependency graph over x1, x2 and x3 learned from the table]
Data pre-processing
- Wavelet for on-line signal feature extraction and dimension reduction
- Fuzzy approach for dynamic trend interpretation
Clustering
- Supervised classification: BPNN, fuzzy set covering approach
- Unsupervised classification: ART2 (adaptive resonance theory), AutoClass, PCA
Dependency modelling
- Bayesian networks
- Fuzzy-SDG (signed directed graph)
- Decision trees
Others
- Automatic rule extraction from data using fuzzy-NN and fuzzy SDG
- Visualisation
Cost due to PREVENTABLE abnormal operations: e.g. $20 billion per year in the petrochemical industry
Fault detection & diagnosis is very complex: sensor faults, equipment faults, control loops, interaction of variables …
Modern control systems
[Diagram: process operational safety envelopes, showing a trajectory from start point to end point (Yussel's work); Loss Prevention in the Process Industries, 2002]
Integrated Data Mining Environment - Toxicity
• Data import: Excel, ASCII files, database, XML
• Descriptor calculation
• Data pre-processing: scaling, missing values, outlier identification, feature extraction
• Data mining toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, k-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation
• Results presentation: graphs, tables, ASCII files
• Discovery validation: statistical significance, results for training and test sets
• User interface
Quantitative Structure Activity Relationship
75 organic compounds with 1094 descriptors and the endpoint Log(1/EC50) to Vibrio fischeri
(Zhao et al., QSAR 17(2), 1998, pages 131-138)
Log(1/EC50) = -0.3766 + 0.0444 Vx   (r² = 0.7078, MSE = 0.2548)
Vx - McGowan's characteristic volume
r² - Pearson's correlation coefficient
q² - leave-one-out cross-validated correlation coefficient
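As an illustration of how such a one-descriptor model and its quoted statistics can be computed, here is a minimal Python sketch; the function names (fit_qsar, loo_q2) and array inputs are illustrative, not from the original work.

```python
import numpy as np

def fit_qsar(vx, y):
    """Least-squares fit of log(1/EC50) = a + b * Vx, with r^2 and MSE."""
    b, a = np.polyfit(vx, y, 1)                     # slope b, intercept a
    residuals = y - (a + b * vx)
    mse = np.mean(residuals ** 2)
    r2 = np.corrcoef(vx, y)[0, 1] ** 2              # Pearson correlation squared
    return a, b, r2, mse

def loo_q2(vx, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / total sum of squares."""
    idx = np.arange(len(y))
    press = 0.0
    for i in idx:
        mask = idx != i
        b, a = np.polyfit(vx[mask], y[mask], 1)     # refit without compound i
        press += (y[i] - (a + b * vx[i])) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)
```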
Principal Component Analysis
Clustering in IDME
Multidimensional Visualisation
Feedforward neural networks
[Diagram: FFNN with principal components PC1, PC2, PC3, …, PCm as the input layer, one hidden layer, and Log(1/EC50) as the output layer]
[Graph: FFNN prediction results]
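A minimal sketch of this descriptors → PCs → hidden layer → Log(1/EC50) architecture, assuming scikit-learn and stand-in data with the shapes quoted above (75 compounds, 1094 descriptors); the component count, hidden-layer size, and train/test split are illustrative choices, not the settings used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with the shapes quoted on the slide: 75 compounds x 1094 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(75, 1094))
y = rng.normal(size=75)                 # stand-in log(1/EC50) values

# Descriptors -> principal components PC1..PCm -> one hidden layer -> log(1/EC50)
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),               # m = 10 components, an illustrative choice
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0),
)
model.fit(X[:60], y[:60])               # e.g. 60 compounds for training
print(model.score(X[60:], y[60:]))      # r^2 on the remaining 15
```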
QSAR Model for Mixture Toxicity Prediction
[Diagram: training and testing scheme covering mixtures of similar constituents and of dissimilar constituents]
Why Inductive Data Mining for In Silico Toxicity Prediction?
• Lack of knowledge on which descriptors are important to toxicity endpoints (feature selection)
• Expert systems rely on subjective knowledge obtained from human experts
• Linear vs nonlinear behaviour
• Black-box models
What is inductive learning?
It aims at developing a qualitative causal language for grouping data patterns into clusters, expressed as decision trees or production rules - explicit and transparent.
Method | Advantages | Limitations
Expert systems | Human expert knowledge; knowledge transparent and causal | Knowledge subjective; data not used; often qualitative
Statistical methods | Data driven; quantitative | Black-box; human knowledge not used
Neural networks | Data driven; quantitative; nonlinear | Black-box; human knowledge not used
Inductive DM | Easy setup; qualitative & quantitative, nonlinear; data & human knowledge used; knowledge transparent and causal; combines the advantages of ESs, SMs & NNs | More research needed; continuous-valued output; dynamics / interactions
Discretization techniques and the methods tested (a sketch of the entropy-based binary split used by C5.0 follows the table):

Discretization technique | Method tested
Binary discretization by information entropy (Quinlan 1986 & 1993) | C5.0
LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997) | LERS_C5.0
Probability distribution histogram | Histogram_C5.0
Equal width interval | EQI_C5.0
KEX (Knowledge EXplorer, Berka & Bruha 1998) | KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0
CN4 (Berka & Bruha 1998) | CN4_C5.0
Chi2 (Liu & Setiono 1995, Kerber 1992) | Chi2_C5.0
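The entropy-based binary discretization attributed to C5.0 above can be sketched as follows; this is a generic Quinlan-style split search, not C5.0's actual implementation, and the function names are mine.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Entropy-minimising threshold for one continuous attribute."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut point between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint candidate threshold
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best_e:
            best_t, best_e = t, weighted
    return best_t, best_e

# Example: split a toy descriptor against two toxicity classes
print(best_binary_split([0.9, 3.1, 3.9, 4.4, 5.2], ["low", "low", "high", "high", "high"]))
```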
Decision Tree Generation Based on Genetic Programming
• Traditional tree generation methods rely on greedy search and can miss potential models
• Genetic algorithms are an optimisation approach that can effectively avoid local minima and evaluate many solutions simultaneously
• GAs have been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree
• Genetic (evolutionary) programming not only evaluates many solutions simultaneously and avoids local minima, but also does not require encoding parameters into fixed-length vectors (chromosomes): it applies the GA directly to tree structures
Genetic Computation
(1) Generate a population of solutions
(2) Repeat steps (i) and (ii) until the stop criteria are satisfied:
  (i) calculate the fitness function value for each candidate solution
  (ii) perform crossover and mutation to generate the next generation
(3) The best solution over all generations is taken as the result
(A minimal sketch of this loop is given below.)
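A minimal Python skeleton of this loop, using the population size (600) and tournament size (16) given later in the slides; the callables (make_tree, fitness, crossover, mutate) and the generation count are placeholders.

```python
import random

def tournament(scored, k=16):
    """Winner of a random k-tree tournament (16, as on the slides)."""
    return max(random.sample(scored, k), key=lambda pair: pair[0])[1]

def evolve(make_tree, fitness, crossover, mutate,
           pop_size=600, generations=50, mutation_rate=0.5):
    """Generic loop for steps (1)-(3): initialise, evaluate, breed, keep the best."""
    population = [make_tree() for _ in range(pop_size)]        # step (1)
    best = max(population, key=fitness)
    for _ in range(generations):                               # step (2)
        scored = [(fitness(t), t) for t in population]         # step (i)
        gen_best = max(scored, key=lambda pair: pair[0])[1]
        if fitness(gen_best) > fitness(best):
            best = gen_best
        next_gen = []
        while len(next_gen) < pop_size:                        # step (ii)
            child = crossover(tournament(scored), tournament(scored))
            if random.random() < mutation_rate:
                child = mutate(child)
            next_gen.append(child)
        population = next_gen
    return best                                                # step (3)
```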
Crossover
[Diagram: crossover in genetic algorithms (exchange of chromosome segments) versus genetic (evolutionary) programming / EPTree (exchange of subtrees)]
1. Divide data into training and test sets.
2. Generate the 1st population of trees:
- Randomly choose a row (i.e. a compound) and a column (i.e. a descriptor) from the molecules × descriptors data matrix.
- Use the value of that slot, s, to split: the left child takes the data points with the selected attribute value ≤ s, whilst the right child takes those > s.
(DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870 (2004); Buontempo, Wang et al., J Chem Inf Comput Sci 45, 904-912 (2005))
- If a child would not cover enough rows (e.g. 10% of the training rows), another combination is tried.
- A child node becomes a leaf node if it is pure (i.e. all the rows it covers are in the same class) or near pure; the other nodes grow children.
- When all nodes either have two children or are leaf nodes, the tree is fully grown and is added to the first generation.
- A leaf node is assigned the class label of the majority class of the points partitioned there.
(A sketch of this random tree-growing step follows.)
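A sketch of the random tree-growing step under the stated rules (random row/column split, minimum coverage, purity-based leaves, majority-class labels); the Node class, the purity threshold of 0.9, and the retry count are my assumptions.

```python
import random
from collections import Counter

class Node:
    """Split node (attr, split, left, right) or leaf (label)."""
    def __init__(self, attr=None, split=None, left=None, right=None, label=None):
        self.attr, self.split = attr, split
        self.left, self.right, self.label = left, right, label

def grow(rows, labels, min_rows, purity=0.9, tries=50):
    """Randomly grow a tree: each split comes from a random (row, descriptor) pick."""
    majority, n_major = Counter(labels).most_common(1)[0]
    if n_major / len(labels) >= purity:              # pure or near-pure -> leaf
        return Node(label=majority)                  # majority-class label
    for _ in range(tries):
        attr = random.randrange(len(rows[0]))        # random descriptor (column)
        s = random.choice(rows)[attr]                # split value from a random compound (row)
        left = [i for i, r in enumerate(rows) if r[attr] <= s]
        right = [i for i, r in enumerate(rows) if r[attr] > s]
        if len(left) >= min_rows and len(right) >= min_rows:   # children must cover enough rows
            return Node(attr, s,
                        grow([rows[i] for i in left], [labels[i] for i in left],
                             min_rows, purity, tries),
                        grow([rows[i] for i in right], [labels[i] for i in right],
                             min_rows, purity, tries))
    return Node(label=majority)                      # no acceptable split found -> leaf

# min_rows = 10% of the training rows, as on the slide:
# tree = grow(X_train, y_train, min_rows=int(0.10 * len(X_train)))
```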
3. Crossover and mutation:
- Tournament: randomly select a group of trees, e.g. 16
- Calculate their fitness values
- The winner becomes the first parent
- Similarly generate the second parent
- Crossover the parents to generate a child
- Generate the other children in the same way
- Select a percentage of the new generation for mutation
(A sketch of subtree crossover is given below.)
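Subtree crossover on such trees might look as follows; this reuses the Node class from the sketch above, and choosing the crossover point uniformly over all nodes is an assumption.

```python
import copy
import random

def nodes(tree):
    """All nodes of a tree, depth-first (Node as defined in the sketch above)."""
    found = [tree]
    if tree.left is not None:
        found += nodes(tree.left)
    if tree.right is not None:
        found += nodes(tree.right)
    return found

def subtree_crossover(parent1, parent2):
    """Child = copy of parent1 with one random subtree replaced by one from parent2."""
    child = copy.deepcopy(parent1)
    target = random.choice(nodes(child))             # crossover point in the child
    donor = copy.deepcopy(random.choice(nodes(parent2)))
    target.__dict__.update(donor.__dict__)           # graft the donor subtree in place
    return child
```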
Mutation Methods
- Random change of split point (i.e. choosing a different row's value for the current attribute)
- Choosing a new attribute whilst keeping the same row
- Choosing a new attribute and a new row
- Re-growing part of the tree
- If there is no improvement in accuracy for k generations, the trees generated are mutated
- ……
(A sketch of the first moves follows.)
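A sketch of the split-point and attribute mutation moves, reusing Node/nodes from the sketches above. Since the originating row is not stored on a node here, the "same row, new attribute" move collapses into the new-attribute move, and tree re-growing is omitted for brevity.

```python
import random

def mutate(tree, rows):
    """Apply one random mutation move to a random split node."""
    internal = [n for n in nodes(tree) if n.label is None]
    if not internal:
        return tree                                  # a lone leaf cannot be mutated this way
    node = random.choice(internal)
    if random.random() < 0.5:
        node.attr = random.randrange(len(rows[0]))   # choose a new attribute
    node.split = random.choice(rows)[node.attr]      # new split point from a random row
    return tree
```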
Two Data Sets
Data Set 1:
Concentration lethal to 50% of the population, LC50 (modelled as Log(1/LC50)), for Vibrio fischeri, a bioluminescent bacterium
75 compounds, 1069 molecular descriptors
Data Set 2:
Concentration affecting 50% of the population, EC50, of the alga Chlorella vulgaris, measured by the disappearance of fluorescein diacetate
80 compounds, 1150 descriptors
Class boundaries used to discretize the endpoints:

Data set | Minimum | Class 1 range | Class 2 range | Class 3 range | Class 4 range | Maximum
Bacteria | 0.90 | ≤ 3.68 | ≤ 4.05 | ≤ 4.50 | > 4.50 | 6.32
Algae | -4.06 | ≤ -1.05 | ≤ -0.31 | ≤ 0.81 | > 0.81 | 3.10
GP settings:
• 600 trees were grown in each generation
• 16 trees competed in each tournament to select trees for crossover
• 66.7% were mutated for the bacterial dataset, and 50% for the algae dataset
Evolutionary Programming Results: Dataset 1
[Decision tree diagram for the bacteria data, found in generation 37. Root split: "Cl attached to C2 (sp3) ≤ 1"; further splits on the highest eigenvalue of the Burden matrix weighted by atomic mass (≤ 2.15), the distance degree index (≤ 15.124), the summed atomic weights of the angular scattering function (≤ -1.164), the R autocorrelation of lag 7 weighted by atomic mass (≤ 3.713), the lowest eigenvalue of the Burden matrix weighted by van der Waals volume (≤ 3.304), and the self-returning walk count of order 8 (≤ 4.048). Leaves: Class 1 (12/12), Class 2 (7/8, 5/6), Class 3 (8/8, 6/7), Class 4 (7/7, 5/6, 5/6).]
Accuracy: 91.7% for training (60 cases), 73.3% for the test set (15 cases)
Decision Tree Using C5.0 for the Same Data
[Decision tree diagram for the bacteria data. Root split: gravitational index ≤ 7.776; further splits on the valence connectivity index (≤ 3.346), "Cl attached to C1 (sp2) ≤ 1", the H autocorrelation of lag 5 weighted by atomic mass (≤ 0.007), and the summed atomic weights of the angular scattering function (≤ -0.082). Leaves: Class 1 (13/14), Class 2 (11/12), Class 3 (5/6, 7/7), Class 4 (14/15, 3/6).]
Accuracy: 88.3% for training (60 cases), 60.0% for the test set (15 cases)
Evolutionary Programming Results: Dataset 2
[Decision tree diagram for the algae data (GP tree, generation 9). Root split: solvation connectivity index ≤ 2.949; further splits on the self-returning walk count of order 8 (≤ 3.798), the molecular multiple path count of order 3 (≤ 92.813), the H autocorrelation of lag 2 weighted by Sanderson electronegativities (≤ 0.401), and the 2nd component symmetry directional WHIM index weighted by van der Waals volume (≤ 0.367). Leaves: Class 1 (16/16), Class 2 (14/15), Class 3 (6/8, 9/10), Class 4 (6/7, 8/8).]
Accuracy: 92.2% for training, 81.3% for the test set
Decision Tree Using See5.0 for the Same Data
[Decision tree diagram for the algae data (See5). Root split: maximum eigenvalue of the Burden matrix weighted by van der Waals volume ≤ 3.769; further splits on the Broto-Moreau autocorrelation of topological structure, lag 4, weighted by atomic mass (≤ 9.861) and the total accessibility index weighted by van der Waals volume (≤ 0.281). Leaves: Class 1 (16/16), Class 2 (15/16), Class 3 (15/20), Class 4 (12/12).]
Accuracy: 90.6% for training, 75.0% for the test set
Summary of Results

Data set 1 - Bacteria data
                  | C5.0  | GP method
Tree size         | 6     | 8
Training accuracy | 88.3% | 91.7%
Test accuracy     | 60.0% | 73.3%

Data set 2 - Algae data
                  | C5.0  | GP method
Tree size         | 4     | 6
Training accuracy | 90.6% | 92.2%
Test accuracy     | 75.0% | 81.3%
Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy

Data set 1 - Bacteria data
                  | C5.0  | GP (Generation 31)
Tree size         | 6     | 8
Training accuracy | 88.3% | 88.3%
Test accuracy     | 60.0% | 73.3%

Data set 2 - Algae data
                  | C5.0  | GP (Generation 9)
Tree size         | 4     | 6
Training accuracy | 90.6% | 90.6%
Test accuracy     | 75.0% | 87.5%
Application to Wastewater Treatment Plant Data
[Diagram: plant layout - inflow passes pre-treatment (screening, grit removal), primary treatment (primary settler) and secondary treatment (aeration tanks, secondary settler) to the outflow, with a sludge line]
Data corresponding to 527 days' operation, 38 variables
Decision tree for prediction of suspended solids in effluents - training data
[Decision tree diagram built from the training data: 470 observations, 20 leaf nodes; splits on variables including SS-P, DQO-D, DBO-D, PH-D, SSV-P, RD-DBO-G, RD-DQO-S, ZN-E, DBO-E and DBO-SS; leaf classes L = Low, N = Normal, H = High. Training accuracy: 99.8%; test accuracy: 93.0%.]

Variable key:
SS-P: input SS to primary settler
DQO-D: input COD to secondary settler
DBO-D: input BOD to secondary settler
PH-D: input pH to secondary settler
SSV-P: input volatile SS to primary settler

Decision tree using all the data of 527 days
[Decision tree diagram built from all 527 observations, 18 leaf nodes; splits on variables including SS-P, RD-SS-P, RD-SS-G, DBO-D, DBO-E, PH-P, PH-D, SED-P, COND-S and RD-DQO-S; leaf classes L = Low, N = Normal, H = High. Accuracy: 99.25%.]
Final Remarks
• An integrated data mining prototype system for toxicity prediction of chemicals and mixtures has been developed
• An evaluation of current inductive data mining approaches to toxicity prediction has been conducted
• A new methodology for inductive data mining based on a novel use of genetic programming is proposed, giving promising results in three case studies
On-going Work
1) Adaptive discretization of end-point values through simultaneous mutation of the output
[Plot: best training accuracy (0-100%) against generation (1-35) for 2-class, 3-class and 4-class trees grown for the algae data using SSRD (sum of squared differences in rank) as the fitness measure. The 2-class trees no longer dominate, and very accurate 3-class trees have been found.]
(A plain reading of the SSRD measure is sketched below.)
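Read literally, SSRD can be computed as follows; the exact fitness formulation used in this ongoing work is not given on the slide, so this sketch is an assumption (ties between equal values are ignored).

```python
def ssrd(predicted, actual):
    """Sum of squared differences in rank between predicted and actual values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return sum((p - a) ** 2 for p, a in zip(ranks(predicted), ranks(actual)))

# Example: identical orderings give SSRD = 0
print(ssrd([0.2, 1.5, 0.9], [3.1, 9.7, 5.0]))   # 0
```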
Future Work
2) Extend the method to model trees & fuzzy model trees generation
Rule 1: If antecedent one applies, with degree μ1 = μ1,1 × μ1,2 × … × μ1,9,
then y1 = 0.1910 PC1 + 0.6271 PC2 + 0.2839 PC3 + 1.2102 PC4 + 0.2594 PC5 + 0.3810 PC6 - 0.3695 PC7 + 0.8396 PC8 + 1.0986 PC9 - 0.5162
Rule 2: If antecedent two applies, with degree μ2 = μ2,1 × μ2,2 × … × μ2,9,
then y2 = 0.7403 PC1 + 0.5453 PC2 - 0.0662 PC3 - 0.8266 PC4 + 0.1699 PC5 - 0.0245 PC6 + 0.9714 PC7 - 0.3646 PC8 - 0.3977 PC9 - 0.0511
Final output: crisp value (μ1 × y1 + μ2 × y2) / (μ1 + μ2), where μi = μi,1 × μi,2 × … × μi,9
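The crisp output of such a rule set can be computed as in this sketch (a Takagi-Sugeno-style weighted average; the function name and argument layout are mine, not from the slides):

```python
import math

def fuzzy_output(memberships, coefficients, intercepts, pcs):
    """Crisp output of the rule set: (sum_i mu_i * y_i) / (sum_i mu_i)."""
    mus, ys = [], []
    for mu_parts, coefs, b in zip(memberships, coefficients, intercepts):
        mus.append(math.prod(mu_parts))                          # mu_i = mu_i,1 * ... * mu_i,9
        ys.append(b + sum(c * pc for c, pc in zip(coefs, pcs)))  # linear consequent in PC1..PC9
    return sum(m * y for m, y in zip(mus, ys)) / sum(mus)

# Rule 2's consequent coefficients from the slide, for example:
rule2 = [0.7403, 0.5453, -0.0662, -0.8266, 0.1699, -0.0245, 0.9714, -0.3646, -0.3977]
```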
[Figure: fuzzy membership functions used in the rules]
Future Work
3) Extend the method to mixture toxicity prediction
[Diagram: training and testing scheme covering mixtures of similar constituents and of dissimilar constituents]
Acknowledgements
FV Buontempo
M Mwense
A Young
D Osborn
Crystal Faraday Partnership on Green Technology
AstraZeneca Brixham Environmental Laboratory
NERC Centre for Ecology and Hydrology
Types of descriptor, their definitions, and examples:
• Constitutional: physical description of the compound. Examples: molecular weight, atom counts.
• Topological: 2D descriptors taken from the molecular graph. Examples: Wiener index, Balaban index.
• Walk counts: obtained from molecular graphs. Example: total walk count.
• Burden eigenvalues (BCUT): eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound. Examples: weighted by atomic mass, volume, electronegativity or polarizability.
• Galvez topological charge indices: describe charge transfer between pairs of atoms, calculated from the eigenvalues of the adjacency matrix. Examples: topological and mean charge index of various orders.
• 2D autocorrelation: sum of the atom weights of the terminal atoms of all the paths of a given length (lag). Examples: Moreau, Moran, and Geary autocorrelations.
• Charge descriptors: charges estimated by quantum molecular methods. Examples: total positive charge, dipole index.
• Aromaticity indices: estimated from the geometrical distance between aromatically bonded atoms. Example: harmonic oscillator model of aromaticity.
• Randic molecular profiles: derived from distance distribution moments of the geometry matrix. Examples: molecular profile, shape profile.
• Geometrical descriptors: conformation-dependent, based on molecular geometry. Examples: 3D Wiener index, gravitational index.
• Radial distribution function descriptors: obtained from radial basis functions centred at different distances.
• 3D Molecule Representation of Structure based on Electron diffraction (MoRSE): calculated by summing atomic weights viewed by different angular scattering functions.
• GEometry, Topology, and Atom Weights AssemblY (GETAWAY): calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained from centred atomic coordinates. Examples: unweighted or weighted by atomic mass, volume, electronegativity or polarizability.
• Weighted holistic invariant molecular (WHIM): statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates. Examples: unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state.
• Functional groups: counts of various atoms and functional groups. Examples: primary carbons, aliphatic ethers.
• Atom-centred fragments: from 120 atom-centred fragments defined by Ghose-Crippen. Example: Cl-086, Cl attached to C1 (sp3).
• Various others: unsaturation index (number of non-single bonds); Hy (a function of the count of hydrophilic groups); aromaticity ratio (aromatic bonds / total number of bonds in an H-depleted molecule); Ghose-Crippen molecular refractivity; fragment-based polar surface area.