Building QSAR models using Shannon entropy and a genetic algorithm

Jörg K. Wegner and Andreas Zell
Zentrum für Bioinformatik Tübingen (ZBIT), Universität Tübingen, Sand 1, D-72076 Tübingen
{wegnerj,zell}@informatik.uni-tuebingen.de

We describe a fast and flexible descriptor selection method based on a genetic algorithm variant (GA-SEC). The relevance of each descriptor is measured using Shannon Entropy (SE) and Differential Shannon Entropy (DSE) [1], which have very low memory requirements and therefore allow the processing of huge data sets. A small number of the most important descriptors is then used automatically to build a value prediction model. These descriptors are not linear combinations of other descriptors but transparent, pure descriptors. Using an artificial neural network (ANN) model [2] to predict logS/logP values, we obtained $R^2_{\mathrm{ANN,Test}} = 0.93$ for the Huuskonen test set [3] and $R^2_{\mathrm{ANN,Test}} = 0.92$ for the Wang test set [4].

The SE and DSE values are used to initialise the GA, which speeds up the descriptor selection process dramatically. The value prediction model (e.g. an ANN) is calculated using the selected descriptors. The fitness function rewards a small number of descriptors and penalizes a large number of descriptors. This modified fitness function avoids implicitly overfitted models with poor predictive ability.

The figure shows the mean values over 8 experiments. The GA-SEC algorithms are faster than standard GAs. In addition, they are more stable and in general lead to models with a smaller standard deviation. The GA-SEC algorithms and the free descriptor calculation library JOELib [5] are written entirely in Java.

References
[1] Godden, W., Bajorath, J., J. Chem. Inf. Comput. Sci., 2001, 41, 1060.
[2] Zell, A., Simulation neuronaler Netze, Oldenbourg Verlag, München, 1997.
[3] Huuskonen, J., J. Chem. Inf. Comput. Sci., 2000, 40, 773.
[4] Wang, R., Gao, Y., Lai, L., Perspectives in Drug Discovery and Design, 2000, 19, 47.
[5] JOELib, http://www-ra.informatik.uni-tuebingen.de/software/joelib.
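
The descriptor relevance measures and the size-penalizing fitness described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java implementation. It assumes SE is computed from a histogram of descriptor values as SE = -sum(p_i * log2(p_i)), and that DSE compares two data sets A and B as DSE = SE_AB - (SE_A + SE_B) / 2, which is our reading of [1]. The bin count, the function names and the linear penalty in penalized_fitness are hypothetical choices made for illustration only; the abstract does not specify the actual fitness function.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, n_bins: int) -> List[int]:
    """Count values in n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # A zero-width range means all values are identical: use bin 0.
        idx = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def shannon_entropy(counts: Sequence[int]) -> float:
    """SE = -sum(p_i * log2(p_i)) over the non-empty histogram bins."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def differential_shannon_entropy(a: Sequence[float], b: Sequence[float],
                                 n_bins: int = 10) -> float:
    """DSE = SE_AB - (SE_A + SE_B) / 2, using shared bins for both sets."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    h_a = histogram(a, lo, hi, n_bins)
    h_b = histogram(b, lo, hi, n_bins)
    h_ab = [x + y for x, y in zip(h_a, h_b)]  # histogram of the merged set
    return shannon_entropy(h_ab) - (shannon_entropy(h_a) + shannon_entropy(h_b)) / 2


def penalized_fitness(model_score: float, n_selected: int, n_total: int,
                      penalty: float = 0.1) -> float:
    """Hypothetical fitness: model quality minus a linear cost per descriptor."""
    return model_score - penalty * (n_selected / n_total)


if __name__ == "__main__":
    # Toy descriptor values for two compound sets (made-up numbers).
    set_a = [0.1, 0.2, 0.2, 0.3, 0.4]
    set_b = [0.6, 0.7, 0.8, 0.8, 0.9]
    print("DSE:", differential_shannon_entropy(set_a, set_b, n_bins=5))
    print("fitness:", penalized_fitness(0.9, n_selected=5, n_total=50))
```

Only per-descriptor bin counts are kept in memory, which is consistent with the low memory requirements mentioned in the abstract. A descriptor whose distribution differs strongly between the two sets yields a larger DSE than one that is distributed identically in both.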