Analysis of High-Throughput
Screening Data
Fall 2006
Drug Discovery Process
• The key steps of drug discovery are:
– research - average 2 to 3 years
– pre-clinical testing - average 1 year
– clinical trial testing (involving human patients) - average 10 years
– regulatory approval - average 2 years
Drug Discovery Process:
Web Sites
• HTS allows hundreds of thousands of
compounds to be assayed very quickly
• HTS data characterized by:
– High volume
– High level of noise
– Diverse nature of the chemical classes
– Possible presence of multiple binding modes
• Select the most potent compounds to
progress to the next stage
• Problems:
– Functional groups that interfere with the assay
(e.g., fluoresce)
– Functional groups that react with biological
– Catch these with substructure and “druglikeness” filters
Techniques for Analysis of HTS
• Can’t use multiple linear regression or
partial least squares as statistical tests
– Data sets are too large
• Data visualization
• Data reduction
• Data mining (if activity data is known)
HTS Methodology
• Procedure:
– Measure activity at different concentrations for
a subset of compounds
– Define IC50 (Inhibitory Concentration 50): the
concentration of a material estimated to inhibit
the biological endpoint of interest (e.g., cell
growth, ATP levels) by 50%
– Solid pure sample that tests positively gets
structure determined (hits-to-leads phase)
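As a rough illustration of the IC50 definition above, the sketch below estimates IC50 from a measured dose-response series by interpolating linearly on a log10 concentration axis. The function name, the interpolation scheme, and the data values are assumptions made for illustration; real analyses often fit a sigmoidal curve instead.

```python
import math

def estimate_ic50(concentrations, inhibitions):
    """Estimate IC50 by linear interpolation on a log10 concentration axis.

    concentrations: increasing, positive concentrations (any consistent unit)
    inhibitions: percent inhibition measured at each concentration
    Returns None if the series never reaches 50% inhibition.
    """
    points = list(zip(concentrations, inhibitions))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo == 50.0:
            return c_lo
        if y_lo < 50.0 <= y_hi:
            # Fraction of the way from y_lo to y_hi at which 50% is reached
            t = (50.0 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None

# Hypothetical dose-response data (micromolar, percent inhibition)
conc = [0.01, 0.1, 1.0, 10.0, 100.0]
inhib = [2.0, 10.0, 40.0, 80.0, 98.0]
ic50 = estimate_ic50(conc, inhib)
print(f"Estimated IC50: {ic50:.2f} uM")
```

Interpolating on the log axis reflects the fact that concentrations in a dilution series are usually spaced by constant factors rather than constant differences.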
• Need to display simultaneously large data
sets with many thousands of molecules
and their properties
• Typical software packages:
– Draw various kinds of graphs
– Color selected properties
– Calculate simple statistics
• HTS data sets may be divided into subsets
to aid navigation
SpotFire DecisionSite
• DecisionSite Examples
Features of Data Visualization
• Often combined with structure searching
to find compounds with certain features
• Unsupervised methods – don’t use activity
• Supervised methods – incorporate activity
• Use of molecular descriptors
Non-Linear Mapping
• Descriptors:
– Physicochemical properties
– Fingerprints: a Boolean array with the
meaning of each bit not predefined
• List of patterns is generated for each
– Atom, pair of adjacent atoms, bonds connecting them
– Each group of atoms joined by longer pathways
– Substructural fragments
– Known activity against related targets
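The hashed-fingerprint idea above can be sketched as follows: each pattern generated for a molecule is hashed to a bit position, so no bit has a predefined meaning and different patterns may share a bit. The pattern strings, the 64-bit length, and the use of CRC32 as the hash are all assumptions for illustration, not the scheme of any particular software package.

```python
import zlib

def hashed_fingerprint(patterns, n_bits=64):
    """Return a Boolean list in which each pattern sets one hashed bit."""
    bits = [False] * n_bits
    for pattern in patterns:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        position = zlib.crc32(pattern.encode("utf-8")) % n_bits
        bits[position] = True
    return bits

# Hypothetical patterns for ethanol: atoms, bonded pairs, and one longer path
ethanol_patterns = ["C", "O", "C-C", "C-O", "C-C-O"]
fp = hashed_fingerprint(ethanol_patterns)
print(sum(fp), "bits set out of", len(fp))
```

Because two patterns can hash to the same position, the number of bits set may be smaller than the number of patterns.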
Non-Linear Mapping (cont’d)
• Non-Linear Mapping takes
multidimensional data to a lower-dimensional space (2- or 3-dimensional)
• Multidimensional scaling
– Generate initial set of coordinates in the low-dimensional space
– Modify the coordinates using optimization
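The two multidimensional scaling steps above can be sketched in plain Python. This is a minimal sketch that assumes a raw stress function (sum of squared differences between original and mapped distances) minimized by fixed-step gradient descent; the descriptor values, step size, and iteration count are hypothetical, and production code would normally use a tested library.

```python
import math
import random

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def stress(target, coords):
    """Sum of squared differences between target and mapped distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds(target, dims=2, steps=1000, rate=0.02, seed=0):
    rng = random.Random(seed)
    n = len(target)
    # Step 1: generate an initial set of low-dimensional coordinates
    coords = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    # Step 2: modify the coordinates by gradient descent on the stress
    for _ in range(steps):
        grads = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-12
                factor = 2.0 * (d - target[i][j]) / d
                for k in range(dims):
                    grads[i][k] += factor * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(dims):
                coords[i][k] -= rate * grads[i][k]
    return coords

# Hypothetical 4-dimensional descriptor vectors for four molecules
descriptors = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
target = pairwise_distances(descriptors)
layout = mds(target)
print("final stress:", stress(target, layout))
```

Calling `mds(target, steps=0)` returns the unoptimized starting coordinates, which makes it easy to compare the stress before and after optimization.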
• Construct models that enable the
establishment of relationships between the
structures and the observed activity
• Simple division of structures is desirable:
– Active vs. inactive
– High, medium, or low activity classes
Data Mining Methods: Techniques
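• Substructural analysis: weight each aspect
of the structure according to a preassigned activity designation
$W_i = \frac{act_i}{act_i + inact_i}$
where act_i and inact_i are the numbers of active and inactive molecules that contain fragment i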
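A small sketch of the substructural analysis weight, assuming the weight of fragment i is the fraction of molecules containing that fragment that are active, act_i / (act_i + inact_i). The fragment labels and activity flags below are made up for illustration.

```python
from collections import Counter

def fragment_weights(molecules):
    """molecules: list of (set_of_fragments, is_active) pairs.

    Returns a dict mapping each fragment to act / (act + inact).
    """
    act = Counter()
    inact = Counter()
    for fragments, is_active in molecules:
        for fragment in fragments:
            if is_active:
                act[fragment] += 1
            else:
                inact[fragment] += 1
    return {f: act[f] / (act[f] + inact[f]) for f in set(act) | set(inact)}

# Hypothetical screening results; fragment labels are illustrative only
screen = [
    ({"amide", "phenyl"}, True),
    ({"amide", "nitro"}, True),
    ({"phenyl", "nitro"}, False),
    ({"nitro"}, False),
]
weights = fragment_weights(screen)
for fragment, weight in sorted(weights.items()):
    print(fragment, round(weight, 3))
```

A fragment seen only in actives gets weight 1, one seen only in inactives gets weight 0, and the weights of a new molecule's fragments can then be combined into a score.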
Data Mining Techniques
• Discriminant Analysis: aims to separate
the molecules into constituent classes
– In its simplest form, linear discriminant analysis works with two
variables and two activity classes
• Straight line separates the data into areas where
the maximum number of correct activities is found
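The two-variable, two-class case above can be sketched as follows. This sketch places the separating line using Fisher's criterion, which is one common choice and an assumption here; the slides do not say how the line is chosen. The descriptor values are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Within-class scatter of 2-D points about mean m, as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fit_lda(actives, inactives):
    """Return (w, threshold) describing the line w . x = threshold."""
    m_a, m_i = mean(actives), mean(inactives)
    a = scatter(actives, m_a)
    b = scatter(inactives, m_i)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy
    dx, dy = m_a[0] - m_i[0], m_a[1] - m_i[1]
    # w = Sw^-1 (m_a - m_i), using the explicit inverse of a 2x2 matrix
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    # Put the boundary through the midpoint between the two class means
    mid = [(m_a[0] + m_i[0]) / 2, (m_a[1] + m_i[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def predict_active(model, point):
    w, threshold = model
    return w[0] * point[0] + w[1] * point[1] > threshold

# Hypothetical two-descriptor data for active and inactive molecules
actives = [(4.0, 3.5), (5.0, 4.0), (4.5, 5.0), (5.5, 4.5)]
inactives = [(1.0, 1.5), (2.0, 1.0), (1.5, 2.5), (2.5, 2.0)]
model = fit_lda(actives, inactives)
print(predict_active(model, (5.0, 5.0)), predict_active(model, (1.0, 1.0)))
```

Points on one side of the line are predicted active and points on the other side inactive.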
Data Mining Techniques
• Neural Networks – need a training set of
• Once trained, the program predicts values
for new molecules
• Examples: feed-forward network and
Kohonen network (self-organizing map)
• Problem: over-training gives excellent
results on the training data, but poor results on
unseen data
Data Mining Techniques
• Decision Trees
– Rules associate specific molecular and/or
descriptor values with the activity or property
of interest
– Start with the entire data set and identify the
descriptor or variable that gives the best split
– Follow the procedure until no more splits are
possible or desirable
– Some consider multiple splits at each node
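The step of identifying the descriptor that gives the best split can be sketched as below. The sketch assumes binary splits of the form "descriptor <= threshold" scored by Gini impurity; the slides do not name a splitting criterion, and the descriptor names and values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of Boolean activity labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """rows: list of descriptor dicts; labels: list of bools (active or not).

    Returns (descriptor, threshold, weighted_impurity) for the best binary
    split, or None if every descriptor has a single distinct value.
    """
    best = None
    n = len(rows)
    for name in rows[0]:
        values = sorted({row[name] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [lab for row, lab in zip(rows, labels) if row[name] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[name] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best

# Hypothetical descriptors for four molecules
rows = [
    {"logp": 1.0, "mol_weight": 300},
    {"logp": 1.5, "mol_weight": 450},
    {"logp": 3.0, "mol_weight": 320},
    {"logp": 3.5, "mol_weight": 480},
]
labels = [False, False, True, True]
split = best_split(rows, labels)
print(split)
```

A full tree would apply `best_split` again to each resulting subset until no further split is possible or desirable.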
• Much interest and research on HTS
• New techniques being applied (e.g.,
support vector machines)
• Analysis of large diverse data sets needs
the most work
• Results need to feed into subsequent