Analysis of High-Throughput Screening Data
C371 Fall 2004

Drug Discovery Process
• The key steps of drug discovery are:
  – Research: average 2 to 3 years
  – Pre-clinical testing: average 1 year
  – Clinical trial testing (involving human patients): average 10 years
  – Regulatory approval: average 2 years

Drug Discovery Process: Web Sites
• http://akosgmbh.de/Drug_discovery_process.htm
• http://www.ppdi.com/PPD_U7.htm

INTRODUCTION
• HTS allows hundreds of thousands of compounds to be assayed very quickly
• HTS data are characterized by:
  – High volume
  – High level of noise
  – Diverse nature of the chemical classes involved
  – Possible presence of multiple binding modes

INTRODUCTION
• Select the most potent compounds to progress to the next stage
• Problems:
  – Functional groups that interfere with the assay (e.g., fluoresce)
  – Functional groups that react with biological systems
• Catch these with substructure and “druglikeness” filters

Techniques for Analysis of HTS Data
• Can’t use multiple linear regression or partial least squares as statistical tests
  – Data sets are too large
• Data visualization
• Data reduction
• Data mining (if activity data is known)

HTS Methodology
• Procedure:
  – Measure activity at different concentrations for a subset of compounds
  – Define IC50 (Inhibitory Concentration 50): the concentration of a material estimated to inhibit the biological endpoint of interest (e.g., cell growth, ATP levels) by 50%
  – A solid, pure sample that tests positive has its structure determined (hits-to-leads phase)

DATA VISUALIZATION
• Need to display large data sets simultaneously, with many thousands of molecules and their properties
• Typical software packages:
  – Draw various kinds of graphs
  – Color selected properties
  – Calculate simple statistics
• HTS data sets may be divided into subsets to aid navigation

SpotFire DecisionSite
• DecisionSite Examples: http://www.spotfire.com/

Features of Data Visualization
• Often combined with structure searching to find compounds
  with certain features
• Unsupervised methods: do not use activity data
• Supervised methods: incorporate activity data
• Use of molecular descriptors

Non-Linear Mapping
• Descriptors:
  – Physicochemical properties
  – Fingerprints: a Boolean array in which the meaning of each bit is not predefined
    • A list of patterns is generated for each:
      – Atom, pair of adjacent atoms, and the bonds connecting them
      – Group of atoms joined by longer pathways
  – Substructural fragments
  – Known activity against related targets

Non-Linear Mapping (cont’d)
• Non-linear mapping takes multidimensional data to a lower-dimensional space (2- or 3-dimensional)
• Multidimensional scaling:
  – Generate an initial set of coordinates in the low-dimensional space
  – Modify the coordinates using optimization procedures

DATA MINING METHODS
• Construct models that establish relationships between the structures and the observed activity
• A simple division of structures is desirable:
  – Active vs. inactive
  – High, medium, or low activity classes

Data Mining Methods: Techniques
• Substructural analysis: weight each aspect of the structure (each fragment i) according to a preassigned activity designation

  $$ W_i = \frac{act_i}{act_i + inact_i} $$

  where act_i and inact_i are the numbers of active and inactive molecules that contain fragment i

Data Mining Techniques
• Discriminant Analysis: aims to separate the molecules into constituent classes
  – In the simplest case, linear discriminant analysis works with two variables and two activity classes
    • A straight line separates the data into regions so that the maximum number of molecules is classified correctly

Data Mining Techniques
• Neural Networks: need a training set of data
• Once trained, the program predicts values for new molecules
• Examples: feed-forward network and Kohonen network (self-organizing map)
• Problem: over-training, which gives excellent results on the training data but poor results on unseen data

Data Mining Techniques
• Decision Trees
  – Rules associate specific molecular and/or descriptor values with the activity or property of interest
  – Start with the entire data set and identify the descriptor or variable
    that gives the best split
  – Follow the procedure until no more splits are possible or desirable
  – Some methods consider multiple splits at each node

SUMMARY
• Much interest and research on HTS analysis
• New techniques being applied (e.g., support vector machines)
• Analysis of large diverse data sets needs the most work
• Results need to feed into subsequent analysis
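
Worked example: substructural analysis weights
The weight from the Data Mining Methods: Techniques slide, active count divided by active plus inactive count, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the fragment names and the toy screening results below are hypothetical, and each molecule is assumed to be already reduced to a set of fragment labels.

```python
from collections import defaultdict


def substructure_weights(molecules):
    """Return {fragment: act_i / (act_i + inact_i)}.

    molecules: iterable of (fragments, is_active) pairs, where fragments
    is a collection of fragment labels found in one molecule.
    """
    act = defaultdict(int)    # number of active molecules containing the fragment
    inact = defaultdict(int)  # number of inactive molecules containing the fragment
    for fragments, is_active in molecules:
        for frag in set(fragments):  # count each fragment once per molecule
            if is_active:
                act[frag] += 1
            else:
                inact[frag] += 1
    all_fragments = set(act) | set(inact)
    return {f: act[f] / (act[f] + inact[f]) for f in all_fragments}


# Hypothetical toy screen: (fragments present, tested active?)
screen = [
    ({"nitro", "phenyl"}, True),
    ({"phenyl"}, False),
    ({"nitro"}, True),
    ({"phenyl", "amide"}, False),
]
weights = substructure_weights(screen)
for fragment in sorted(weights):
    print(fragment, round(weights[fragment], 3))
```

A weight near 1 means the fragment occurs mostly in active molecules, and a weight near 0 means it occurs mostly in inactive ones.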
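
Worked example: estimating IC50
The IC50 definition in the HTS Methodology slide can be illustrated with a rough estimate from a handful of measurements. The sketch below assumes that percent inhibition rises with concentration and simply interpolates on a log-concentration scale between the two measurements that bracket 50%. Fitting a sigmoidal dose-response curve is the more usual approach in practice; the function name and the data here are hypothetical.

```python
import math


def estimate_ic50(concentrations, percent_inhibition):
    """Estimate IC50 by linear interpolation on log10(concentration).

    Returns None when no pair of adjacent measurements brackets 50%.
    Assumes inhibition increases with concentration.
    """
    points = sorted(zip(concentrations, percent_inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_hi > y_lo:
            # Fraction of the way from the lower to the upper measurement
            frac = (50.0 - y_lo) / (y_hi - y_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    return None


# Hypothetical dose-response data (concentrations in micromolar)
conc = [0.01, 0.1, 1.0, 10.0]
inhibition = [5.0, 20.0, 80.0, 95.0]
ic50 = estimate_ic50(conc, inhibition)
print(f"Estimated IC50: {ic50:.3f} micromolar")
```

Interpolating on the log scale reflects the fact that dose-response data are usually collected and plotted over concentrations spanning several orders of magnitude.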
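
Worked example: choosing the best split for a decision tree
The Decision Trees slide describes starting with the entire data set and identifying the descriptor that gives the best split. The slides do not say how "best" is measured, so the sketch below assumes Gini impurity, one common choice, and considers only binary threshold splits at a single node. The descriptor names and values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 activity labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of actives
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Find the (descriptor, threshold) with the lowest weighted Gini impurity.

    rows: list of dicts mapping descriptor name to numeric value.
    labels: list of 1 (active) or 0 (inactive), one per row.
    Returns (descriptor, threshold, weighted_impurity).
    """
    best = None
    n = len(rows)
    for name in rows[0]:
        values = sorted({row[name] for row in rows})
        # Candidate thresholds lie midway between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[name] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[name] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Hypothetical descriptors for five screened molecules
rows = [
    {"logP": 1.0, "mol_weight": 250.0},
    {"logP": 1.5, "mol_weight": 420.0},
    {"logP": 3.2, "mol_weight": 300.0},
    {"logP": 3.8, "mol_weight": 410.0},
    {"logP": 4.1, "mol_weight": 280.0},
]
labels = [0, 0, 1, 1, 1]
descriptor, threshold, impurity = best_split(rows, labels)
print(descriptor, threshold, impurity)
```

A full tree would apply the same search recursively to each resulting subset until no more splits are possible or desirable.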