Biological Data Mining: A Comparison of Neural Network and Symbolic Techniques
http://www.cmd.port.ac.uk/biomine/

Grantholder
• Professor Martyn Ford, Centre for Molecular Design, University of Portsmouth. [email protected]

Collaborators
• Dr Anthony Browne, School of Computing, Information Systems and Mathematics, London Guildhall University. [email protected]
• Professor Philip Picton, School of Technology and Design, University College Northampton. [email protected]
• Dr David Whitley, Centre for Molecular Design, University of Portsmouth. [email protected]

Objectives
• The project aims:
  – to develop and validate techniques for extracting explicit information from bioinformatic data
  – to express this information as logical rules and decision trees
  – to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

Extracting information
• Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but . . .
  – trained ANNs do not lead to a concise and explicit model
  – specifying the underlying structure is therefore difficult
  – as a result, ANNs are often regarded as 'black boxes'

Data Mining and Neural Networks
• Standard symbolic data mining algorithms exist (such as ID3 or C5), so why use an ANN? It would be advantageous if the rules extracted from an ANN:
  – give a better fit to the data with the same number of rules (i.e. explain the data more accurately);
  – give the same fit to the data with fewer rules (i.e. explain the data more comprehensibly); or
  – give both a better fit to the data and use fewer rules (i.e. explain the data both more accurately and more comprehensibly).

Extracting Decision Trees
• The TREPAN procedure (Craven, 1996)
  – extracts decision trees from ANNs
  – performs better than the symbolic learning algorithms ID3 and C5
  – the current implementation is restricted to a particular network architecture, but
  – the underlying algorithm is independent of network architecture

Trepan
• Builds a decision tree representing the function the ANN has learnt by recursively partitioning the input space.
• Draws query instances by taking into account the distribution of instances in the problem domain.
• For real-valued features, uses kernel density estimates to build a model of the underlying data, which is then used to select instances for presentation to the network.

Trepan
• Builds the decision tree in a best-first manner:
  – as each node is added, the fidelity of the decision tree to the ANN is maximised
  – this is done by examining the significance of the distributions at consecutive levels of the tree (Kolmogorov-Smirnov test for real-valued features, chi-squared for discrete ones)
• Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
• (A minimal sketch of this oracle-based extraction is given after the Aims below.)

Aims
• Implement the TREPAN algorithm in a portable format, independent of network architecture.
• Extend the algorithm to enable the extraction of regression trees.
• Provide a Bayesian formulation for the decision tree extraction algorithm.
• Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).
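The sketch below illustrates the oracle-based idea behind TREPAN, not Craven's full algorithm: a trained network is queried on extra instances drawn from kernel density estimates of the inputs, and a small tree is fitted to the network's predictions. The libraries (scikit-learn, SciPy), the synthetic data, and all parameter values are illustrative assumptions; scikit-learn's greedy tree with a max_leaf_nodes cap only loosely stands in for TREPAN's best-first, m-of-n split construction.

```python
# Simplified, oracle-style tree extraction in the spirit of TREPAN.
# NOT the full algorithm: no m-of-n splits and only an approximate
# analogue of best-first expansion. Data and parameters are illustrative.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative training data: 200 instances, 4 real-valued features.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# 1. Train the "black box" network.
ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(X, y)

# 2. Model the input distribution with a kernel density estimate and draw
#    additional query instances from it (TREPAN queries the oracle where
#    the training data alone are too sparse to choose a split).
kde = gaussian_kde(X.T)                 # expects shape (features, samples)
X_query = kde.resample(1000).T

# 3. Label the combined instances with the network's own predictions.
X_all = np.vstack([X, X_query])
y_oracle = ann.predict(X_all)

# 4. Fit a small decision tree to mimic the network; max_leaf_nodes acts as
#    the user-controlled stopping criterion on tree size.
tree = DecisionTreeClassifier(max_leaf_nodes=8,
                              random_state=0).fit(X_all, y_oracle)

# Fidelity: how often the extracted tree agrees with the network.
X_test = kde.resample(500).T
fidelity = np.mean(tree.predict(X_test) == ann.predict(X_test))
print(f"fidelity to ANN: {fidelity:.2f}")
```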
Aims
• Apply the extracted decision trees
  – to searches of bioinformatic databases
    • protein databases
    • genomic databases
  – to searches of cheminformatic databases
    • chemical libraries
    • natural product databases
  – to investigate ligand/receptor binding
  – to quantify molecular similarity/diversity
  – to identify new leads and optimise properties

Case study: ligand interaction with GPCRs
• 28 GPCRs
• a number of putative interaction sites
• 3 principal properties of amino acids (AAs)
• MLR results for 2 ligands
[Figure: two scatter plots of observed log(1/c) against prediction, one per ligand, with points labelled by amino-acid codes at the putative interaction sites.]
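A hedged sketch of the MLR setup suggested by the case-study slide: each receptor is described by three principal properties of the amino acids at its putative interaction sites, and multiple linear regression predicts binding as log(1/c). The descriptor values, the site triplets, and the responses below are illustrative placeholders, not the project's data; with the real 28 GPCRs there would be more receptors than shown here.

```python
# Hypothetical MLR on amino-acid principal-property descriptors (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

# Three principal properties (z1, z2, z3) per amino acid -- illustrative values.
Z_SCALES = {
    "A": (0.07, -1.73, 0.09), "L": (-4.19, -1.03, -0.98), "N": (3.22, 1.45, 0.84),
    "S": (1.96, -1.63, 0.57), "T": (0.92, -2.09, -1.40), "I": (-4.44, -1.68, -1.03),
    "E": (3.08, 0.39, -0.07), "F": (-4.92, 1.30, 0.45), "K": (2.84, 1.41, -3.14),
}

def descriptors(sequence):
    """Concatenate the 3 principal properties of each residue at the interaction sites."""
    return [v for aa in sequence for v in Z_SCALES[aa]]

# Hypothetical receptors (amino acids at 3 putative sites) and invented log(1/c) responses.
receptors = ["LNL", "SSF", "TLI", "IEF", "ETL", "EKF", "LSI", "NTI"]
log_inv_c = [8.5, 7.8, 7.5, 7.2, 8.0, 6.1, 9.0, 6.5]

X = np.array([descriptors(seq) for seq in receptors])   # 8 receptors x 9 descriptors
y = np.array(log_inv_c)

mlr = LinearRegression().fit(X, y)
print("predicted log(1/c):", np.round(mlr.predict(X), 2))
```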