Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/ 1. Objectives • The project aims: – to develop and validate techniques for extracting explicit information from bioinformatic data – to express this information as logical rules and decision trees – to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics 2. Extracting information • Artificial neural networks can be trained to reproduce the non-linear relationships underlying bioinformatic data with good predictive accuracy – but it is often hard to comprehend those relationships from the internal structure of the network – with the result that networks are often regarded as ‘black boxes’. • Decision trees using symbolic rules are easier to interpret – leading to a greater likelihood of understanding the relationships in the data – allowing the behaviour of individual cases to be explained. 3. Extracting Decision Trees • The Trepan procedure (Craven,1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network. 4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests – binary tests on each value for nominal features – binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain. 5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 • m+1-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different (chi-squared) from existing tests. 6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with the sequence: H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. used the multipin technique to synthesize 512 = 29 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions • The aim was to measure binding potencies to NK1 receptors & identify the positions at which stereochemistry affects binding strength. 7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The trees showed high fidelity with the networks on a 10% test set. 8. Results • Binding activity was determined by five positions, viz. – H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-MetNH2 • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins – Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & JØrgensen). 9. A Typical Trepan Tree 10. Test set confusion matrix: tree versus network Network R 1 0.70 30 12 1 1 2 0.68 27 16 0 1 3 0.64 31 9 2 2 4 0.71 29 10 2 3 5 0.71 29 9 2 4 6 0.67 30 9 3 2 True +ve True -ve False +ve False -ve 11. Test set confusion matrix: tree versus observed Network R 1 0.70 29 10 2 3 2 0.68 25 10 2 7 3 0.64 28 7 5 4 4 0.71 29 10 2 3 5 0.71 29 10 2 3 6 0.67 28 7 5 4 True +ve True -ve False +ve False -ve 12. Future Work • Complete the implementation of the Trepan algorithm. – model the distribution of the input data and generate a set of query instances to be classified by the network & used as additional training cases during tree extraction. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. 13. Future Applications • Apply Trepan to ligand-receptor binding problems. – compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5). 14. References • Wang J-X et al. (1993) Study of stereo-requirements of substance P binding to NK1 receptors using analogues with systematic D-amino acid replacements. Biorganic & Medicinal Chemistry Letters, 3, 451-456. • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & JØrgensen). Grantholder Professor Martyn Ford Centre for Molecular Design University of Portsmouth [email protected] Research Fellows Dr Shuang Cang Dr Abul Azad Mar - Sept 2000 Jan 2001 - Collaborators • Dr Antony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. [email protected] • Professor Philip Picton School of Technology and Design, University College Northampton. [email protected] • Dr David Whitley Centre for Molecular Design, University of Portsmouth. [email protected]