Analysis of High-Throughput
Screening Data
I647
Fall 2006
Drug Discovery Process
• The key steps of drug discovery are:
– research - average 2 to 3 years
– pre-clinical testing - average 1 year
– clinical trial testing (involving human patients) - average 10 years
– regulatory approval - average 2 years
Drug Discovery Process:
Web Sites
• http://akosgmbh.de/Drug_discovery_process.htm
• http://www.ppdi.com/PPD_U7.htm
INTRODUCTION
• HTS allows hundreds of thousands of
compounds to be assayed very quickly
• HTS data characterized by:
– High volume
– High level of noise
– Diverse nature of the chemical classes
involved
– Possible presence of multiple binding modes
INTRODUCTION
• Select the most potent compounds to
progress to the next stage
• Problems:
– Functional groups that interfere with the assay
(e.g., fluoresce)
– Functional groups that react with biological
systems
– Catch these with substructure and “druglikeness” filters
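A druglikeness filter can be sketched with Lipinski's rule of five, one widely used example of such a filter. This is only an illustration: the slides do not say which filter is meant, the compound records and field names below are hypothetical, and the descriptors are assumed to be precomputed. Substructure filters are not covered here because they need a cheminformatics toolkit.

```python
# Minimal druglikeness filter based on Lipinski's rule of five.
# Field names and compound values are hypothetical, for illustration only.

def lipinski_violations(compound):
    """Count how many of the four rule-of-five limits a compound exceeds."""
    checks = [
        compound["mol_weight"] > 500,   # molecular weight above 500 Da
        compound["logp"] > 5,           # octanol-water logP above 5
        compound["h_donors"] > 5,       # more than 5 hydrogen-bond donors
        compound["h_acceptors"] > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_filter(compound, max_violations=1):
    """Keep compounds with at most `max_violations` violations (assumed cutoff)."""
    return lipinski_violations(compound) <= max_violations


compounds = [
    {"id": "cmpd-A", "mol_weight": 320.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    {"id": "cmpd-B", "mol_weight": 712.9, "logp": 6.3, "h_donors": 4, "h_acceptors": 12},
]

kept = [c["id"] for c in compounds if passes_filter(c)]
print(kept)
```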
Techniques for Analysis of HTS
Data
• Multiple linear regression and partial least squares
are generally not suitable
– Data sets are too large
• Data visualization
• Data reduction
• Data mining (if activity data is known)
HTS Methodology
• Procedure:
– Measure activity at different concentrations for
a subset of compounds
– Define IC50 (Inhibitory Concentration 50): the
concentration of a material estimated to inhibit
the biological endpoint of interest (e.g., cell
growth, ATP levels) by 50%
– A solid, pure sample that tests positive has its
structure determined (hits-to-leads phase)
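The IC50 definition above can be illustrated with a small calculation. The sketch below interpolates linearly on log10(concentration) between the two measurements that bracket 50% inhibition. This is a simplification chosen for brevity (in practice a sigmoidal dose-response curve is usually fitted instead), and the concentrations and inhibition values are made up.

```python
import math


def estimate_ic50(concentrations, inhibitions):
    """Estimate IC50 by log-linear interpolation.

    `concentrations` must be in ascending order; `inhibitions` are the
    matching percent-inhibition readings. Returns None when no pair of
    neighbouring points brackets 50% inhibition.
    """
    points = list(zip(concentrations, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            frac = (50.0 - i_lo) / (i_hi - i_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    return None


# Hypothetical dose-response readings (concentration in micromolar).
conc_um = [0.01, 0.1, 1.0, 10.0, 100.0]
inhib_pct = [2.0, 10.0, 40.0, 60.0, 95.0]

ic50 = estimate_ic50(conc_um, inhib_pct)
print(f"Estimated IC50: {ic50:.2f} uM")
```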
DATA VISUALIZATION
• Need to display simultaneously large data
sets with many thousands of molecules
and their properties
• Typical software packages:
– Draw various kinds of graphs
– Color selected properties
– Calculate simple statistics
• HTS data sets may be divided into subsets
to aid navigation
SpotFire DecisionSite
• DecisionSite Examples
http://www.spotfire.com/
Features of Data Visualization
• Often combined with structure searching
to find compounds with certain features
• Unsupervised methods – don’t use activity
data
• Supervised methods – incorporate activity
data
• Use of molecular descriptors
Non-Linear Mapping
• Descriptors:
– Physicochemical properties
– Fingerprints: a Boolean array with the
meaning of each bit not predefined
• List of patterns is generated for each
– Atom, pair of adjacent atoms, bonds connecting them
– Each group of atoms joined by longer pathways
– Substructural fragments
– Known activity against related targets
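A fingerprint whose bits have no predefined meaning can be sketched by hashing each pattern to a bit position. The example assumes the patterns (atoms, bonded pairs, longer paths) have already been enumerated. The pattern strings below are written by hand for illustration, and the 64-bit length and the use of MD5 are arbitrary choices, not something the slides specify.

```python
import hashlib


def hashed_fingerprint(patterns, n_bits=64):
    """Map each pattern string to one bit of a Boolean array via a hash."""
    bits = [False] * n_bits
    for pattern in patterns:
        digest = hashlib.md5(pattern.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = True  # different patterns may collide on the same bit
    return bits


# Hand-written atom and path patterns for a small alcohol (illustrative only).
ethanol_patterns = ["C", "O", "C-C", "C-O", "C-C-O"]

fp = hashed_fingerprint(ethanol_patterns)
print(sum(fp), "of", len(fp), "bits set")
```

Because several patterns can hash to the same position, a set bit cannot be traced back to one specific pattern, which is what "not predefined" means here.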
Non-Linear Mapping (cont’d)
• Non-Linear Mapping takes
multidimensional data to a lower space (2- or 3-dimensional)
• Multidimensional scaling
– Generate initial set of coordinates in the low-dimensional space
– Modify the coordinates using optimization
procedures
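The two steps above (initial coordinates, then optimization) can be sketched as a small metric multidimensional scaling routine. This version starts from random coordinates and uses plain gradient descent on the squared difference between original and mapped distances. The step size, iteration count, and descriptor values are assumptions for illustration, and other stress functions and optimizers are equally possible.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stress(target, coords):
    """Sum of squared errors between mapped and original distances."""
    n = len(coords)
    return sum((distance(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def mds(points, n_dims=2, n_iter=500, step=0.01, seed=0):
    rng = random.Random(seed)
    n = len(points)
    target = [[distance(p, q) for q in points] for p in points]
    # Step 1: generate initial coordinates in the low-dimensional space.
    coords = [[rng.uniform(-1, 1) for _ in range(n_dims)] for _ in range(n)]
    initial = stress(target, coords)
    # Step 2: modify the coordinates by gradient descent on the stress.
    for _ in range(n_iter):
        grads = [[0.0] * n_dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = distance(coords[i], coords[j]) or 1e-12
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(n_dims):
                    grads[i][k] += scale * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(n_dims):
                coords[i][k] -= step * grads[i][k]
    return coords, initial, stress(target, coords)


# Five molecules described by four hypothetical descriptor values each.
data = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [3, 3, 3, 3]]
coords_2d, stress_before, stress_after = mds(data)
print(f"stress: {stress_before:.2f} -> {stress_after:.2f}")
```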
DATA MINING METHODS
• Construct models that enable the
establishment of relationships between the
structures and the observed activity
• Simple division of structures is desirable:
– Active vs. inactive
– High, medium, or low activity classes
Data Mining Methods: Techniques
• Substructural analysis: weight each aspect
of the structure according to a preassigned activity designation
$$W_i = \frac{act_i}{act_i + inact_i}$$
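The weight can be computed directly from counts, reading act_i as the number of active molecules that contain fragment i and inact_i as the number of inactive molecules that contain it. The fragment names and screening results below are invented for illustration.

```python
from collections import Counter


def fragment_weights(molecules):
    """Compute W_i = act_i / (act_i + inact_i) for every fragment.

    `molecules` is a list of (set_of_fragment_names, is_active) pairs.
    """
    act = Counter()
    inact = Counter()
    for fragments, is_active in molecules:
        for fragment in fragments:
            if is_active:
                act[fragment] += 1
            else:
                inact[fragment] += 1
    return {f: act[f] / (act[f] + inact[f]) for f in set(act) | set(inact)}


# Hypothetical screening results: fragments present and active/inactive label.
screen = [
    ({"phenyl", "amide"}, True),
    ({"phenyl", "nitro"}, False),
    ({"amide", "hydroxyl"}, True),
    ({"nitro"}, False),
]

weights = fragment_weights(screen)
for name in sorted(weights):
    print(name, weights[name])
```

A weight near 1 means the fragment occurs mostly in actives; a weight near 0 means it occurs mostly in inactives.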
Data Mining Techniques
• Discriminant Analysis: aims to separate
the molecules into constituent classes
– In the simplest case, linear discriminant analysis works
with two variables and two activity classes
• A straight line divides the data into two regions so that
as many molecules as possible fall on the side that matches their activity
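For the two-variable, two-class case, the separating line can be sketched with Fisher's linear discriminant. The example below is one common formulation, not necessarily the one used in the course. It assumes equal class priors (the line passes through the midpoint of the class means), and the descriptor values are made up.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """Return (sxx, sxy, syy), the within-class scatter about mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(actives, inactives):
    """Fit the line w . x + bias = 0 separating two classes in 2D."""
    m1, m0 = mean(actives), mean(inactives)
    a, b = scatter(actives, m1), scatter(inactives, m0)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    mid = [(m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2]
    bias = -(w[0] * mid[0] + w[1] * mid[1])
    return w, bias


def predict_active(model, point):
    w, bias = model
    return w[0] * point[0] + w[1] * point[1] + bias > 0


# Two hypothetical descriptors per molecule.
actives = [(4.0, 3.5), (5.0, 4.0), (4.5, 5.0), (5.5, 4.5)]
inactives = [(1.0, 1.5), (2.0, 1.0), (1.5, 2.5), (2.5, 2.0)]

model = fit_lda(actives, inactives)
print(predict_active(model, (4.8, 4.1)), predict_active(model, (1.2, 1.8)))
```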
Data Mining Techniques
• Neural Networks – need a training set of
data
• Once trained, the program predicts values
for new molecules
• Examples: feed-forward network and
Kohonen network (self-organizing map)
• Problem: over-training gives excellent
results on the training data, but poor results on
unseen data
Data Mining Techniques
• Decision Trees
– Rules associate specific molecular and/or
descriptor values with the activity or property
of interest
– Start with the entire data set and identify the
descriptor or variable that gives the best split
– Follow the procedure until no more splits are
possible or desirable
– Some methods consider multiple splits at each node
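The "best split" step can be sketched as a search over descriptors and thresholds. The slides do not say how split quality is measured; this example assumes Gini impurity, a common choice, and uses invented descriptor values with binary activity labels (1 = active, 0 = inactive).

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (weighted_impurity, descriptor, threshold) for the best split.

    `rows` is a list of dicts mapping descriptor name to numeric value.
    Candidate thresholds are midpoints between neighbouring sorted values.
    """
    n = len(rows)
    best = None
    for name in rows[0]:
        values = sorted({row[name] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[name] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[name] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, threshold)
    return best


# Hypothetical descriptor table and activity labels.
rows = [
    {"logp": 1.0, "mol_weight": 250},
    {"logp": 1.5, "mol_weight": 410},
    {"logp": 3.2, "mol_weight": 260},
    {"logp": 3.8, "mol_weight": 400},
]
labels = [0, 0, 1, 1]

split = best_split(rows, labels)
print(split)
```

A full tree would apply the same search recursively to each resulting subset until no further split is possible or worthwhile.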
SUMMARY
• Much interest and research on HTS
analysis
• New techniques being applied (e.g.,
support vector machines)
• Analysis of large diverse data sets needs
the most work
• Results need to feed into subsequent
analysis