Download Biological Data Mining

Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/ People Centre for Molecular Design, University of Portsmouth •Professor Martyn Ford •Dr David Whitley •Dr Shuang Cang (Mar - Sept 2000) •Dr Abul Azad (Jan 2001 - ) •Dr Antony Browne, London Guildhall University. •Professor Philip Picton, University College Northampton. 1. Objectives • The project aims: – to develop and validate techniques for extracting explicit information from bioinformatic data – to express this information as logical rules and decision trees – to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics 2. Methods for Extracting Information • Artificial Neural Networks – good predictive accuracy – hard to decipher – often regarded as ‘black boxes’ • Decision Trees – symbolic rules easier to interpret – more likely to reveal relationships in the data – allow behaviour of individual cases to be explained 3. Extracting Decision Trees • The Trepan procedure (Craven,1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at nodes where there is greatest potential for increasing the fidelity of the tree to the network. 4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests – binary tests on each value for nominal features – binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain. 5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 • m+1-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different (chi-squared) from existing tests. 6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with amino acid sequence H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. (1993) used the multipin technique to synthesize 512 = 29 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions, and measured binding potencies to central NK1 receptors. • The objective was to identify the positions at which stereo-chemistry affects binding strength. 7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins (1999). 8. A Typical Trepan Tree 9. Future Work • Complete the implementation of the Trepan algorithm. – model the distribution of the input data and generate from this a set of query instances that are classified using the network and used as additional training cases during extraction of the tree. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm. • Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5). • Apply Trepan to ligand-receptor binding problems.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Biological Data Mining