Exploring Chemical Space with
Computers—Challenges and
Opportunities
Pierre Baldi
UCI
Chemical Informatics



Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Chemical Space
            Stars         Small Mol.
Existing    10^22         10^7
Virtual     0             10^60 (?)
Access      Difficult     “Easy”
Mode        Individual    Combinatorial
Chemical Space
Chemical Informatics





Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties
(classification/regression)
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Methods

Spectrum:

Schrödinger Equation


Molecular Dynamics


Machine Learning (e.g. SS prediction)
Chemical Informatics

Informatics must be able to deal with
variable-size structured data






Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels
Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:


Data (GenBank, Swissprot, PDB)
Similarity (BLAST)
Data

Mutag (Mutagenicity)
    200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge)
    A few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity)
    70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points)
    All 150 non-cyclic alkanes (CnH2n+2) with n<11 and their boiling points ([-164,174])
Benzodiazepines (QSAR)
    79 1,4-benzodiazepines-2-one, affinity towards GABAA
ChemDB
    7M compounds
Similarity

Rapid Searches of Large Databases

Predictive Methods (Kernel Methods)

Why it is not hopeless?
Similarity



Rapid Search of Large Databases

Protein Receptor (Docking)

Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless
Linear Classifiers
Classification

Learning to Classify




Limited number of training examples (molecules, patients, sequences, etc.)
Learning algorithm (how to build the classifier?)
Generalization: should correctly classify test data.
Formalization



X is the input space
Y (e.g. toxic/non toxic, or {-1,1}) is the target class
f: X→Y is the classifier.
Classification
Fundamental point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points.
Non Linear Classification
(Kernel Methods)

We can transform a nonlinear problem into a linear one using a kernel K.
Fundamental property: the linear decision surface depends on
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
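To make the kernel identity above concrete, here is a minimal sketch, not taken from the talk. It assumes a quadratic kernel on 2-D points and an explicit feature map `phi`, both chosen purely for illustration, and checks that the Gram matrix computed with the kernel equals the matrix of ordinary dot products in the embedding space.

```python
import math


def quadratic_kernel(x, y):
    """Illustrative kernel K(x, y) = (x . y)^2 for 2-D inputs (an assumption)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map whose dot product reproduces quadratic_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


# Hypothetical data points, used only to compare the two computations.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, quadratic_kernel)
K_explicit = [[dot(phi(p), phi(q)) for q in points] for p in points]
```

The two matrices agree up to floating-point rounding, which is why a learning algorithm that only needs dot products can work from K alone without ever constructing `phi`.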

Similarity: Data Representations
[Figure: structure drawing of a small molecule]
SMILES: NC(O)C(=O)O
Molecular Representations





1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution
1D SMILES Kernel
[Figure: structure drawings of the two molecules]

CCCCCc1ccc(cc1)CO
Kmer    Count
CCCC    2
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1)    1
c1)C    1
1)CO    1

CCCCCCc1ccc(cc1O)O
Kmer    Count
CCCC    3
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1O    1
c1O)    1
1O)O    1

Kernel value (Count1: CCCCCCc1ccc(cc1O)O, Count2: CCCCCc1ccc(cc1)CO)
Kmer    Count1    Count2    Product
(cc1    1         1         1
1)CO    0         1         0
1O)O    1         0         0
1ccc    1         1         1
CCCC    3         2         6
CCCc    1         1         1
CCc1    1         1         1
Cc1c    1         1         1
c(cc    1         1         1
c1)C    0         1         0
c1O)    1         0         0
c1cc    1         1         1
cc(c    1         1         1
cc1)    0         1         0
cc1O    1         0         0
ccc(    1         1         1
Total: 15
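The k-mer tables for the two SMILES strings can be reproduced with a few lines of code. The sketch below is an illustration rather than the authors' implementation: it assumes k = 4 (as in the tables) and treats every character of the SMILES string as one symbol, so a two-letter atom such as Cl would be split. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(smiles, k=4):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(s1, s2, k=4):
    """Sum over shared k-mers of the product of their counts."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())


mol_a = "CCCCCc1ccc(cc1)CO"
mol_b = "CCCCCCc1ccc(cc1O)O"
print(spectrum_kernel(mol_a, mol_b))  # 15, the total shown on the slide
```

Only k-mers present in both strings contribute, so the sum runs over the intersection of the two key sets.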
2D Molecule Graph Kernel

For chemical compounds




atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
Count labeled paths
(CsNsCdO)
Fingerprints
Similarity Measures
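As a rough illustration of counting labeled paths and comparing the resulting fingerprints, the sketch below enumerates simple paths in a toy labeled graph and scores two molecules with Tanimoto and MinMax similarities (the two measures named in the results table). The toy molecules, the depth limit `max_bonds`, the no-revisit rule, and the absence of hashing into a fixed-length bit vector are all assumptions made for this example.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds=3):
    """Count label strings of all simple paths with at most max_bonds bonds."""
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        adj[i].append((j, label))
        adj[j].append((i, label))
    counts = Counter()

    def walk(node, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in adj[node]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return counts


def tanimoto(c1, c2):
    """Shared paths divided by all paths present in either molecule."""
    a, b = set(c1), set(c2)
    return len(a & b) / len(a | b)


def minmax(c1, c2):
    """Count-aware variant: sum of minima over sum of maxima."""
    keys = set(c1) | set(c2)
    return sum(min(c1[k], c2[k]) for k in keys) / sum(max(c1[k], c2[k]) for k in keys)


# Hypothetical toy molecules: C-N-C=O and C-N-C (s = single, d = double bond).
paths_1 = labeled_paths(["C", "N", "C", "O"], {(0, 1): "s", (1, 2): "s", (2, 3): "d"})
paths_2 = labeled_paths(["C", "N", "C"], {(0, 1): "s", (1, 2): "s"})
print(tanimoto(paths_1, paths_2), minmax(paths_1, paths_2))
```

The first toy molecule contains the path CsNsCdO used as the example on the slide. Each path is counted once per traversal direction here, which is a simplification.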
3D Coordinate Kernel
[Figure: molecule with pairwise atom distances marked: 2.8 Å, 2.0 Å, 1.4 Å, 4.2 Å, 3.4 Å]

Atom Distance Histogram
[Figure: histogram; x-axis: Distance (Angstroms), y-axis: Count]

Distance    Count
0           0
1           5
2           7
3           3
4           1
5           0
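One way to turn atomic coordinates into a histogram like the one above, and then into a kernel value, is sketched here. The bin width, the number of bins, the floor-based binning rule, the toy coordinates, and the use of a plain dot product between histograms are all assumptions; the talk does not specify them.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=1.0, n_bins=6):
    """Histogram of all pairwise atom distances (distances past the last bin are dropped)."""
    hist = [0] * n_bins
    for p, q in combinations(coords, 2):
        b = int(math.dist(p, q) / bin_width)
        if b < n_bins:
            hist[b] += 1
    return hist


def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(a * b for a, b in zip(h1, h2))


# Hypothetical coordinates in angstroms, for illustration only.
mol_a = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0)]
mol_b = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (1.4, 2.0, 0.0)]
h_a, h_b = distance_histogram(mol_a), distance_histogram(mol_b)
print(h_a, h_b, histogram_kernel(h_a, h_b))

# The histogram from the slide, compared with itself.
slide_hist = [0, 5, 7, 3, 1, 0]
print(histogram_kernel(slide_hist, slide_hist))
```

Because the histogram depends only on interatomic distances, it is unchanged by rotating or translating the molecule.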
Example of Results
[Figures: result slides; only the slide titles survive in this transcript]
Summary







Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test
ChemDB







7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable
Web interface
Similarity, in silico reactions
Acknowledgements

Informatics







Liva Ralaivola
J. Chen
S. J. Swamidass
Yimeng Dou
Peter Phung
Jocelyne Bruand
Funding



NIH
NSF
IGB

Pharmacology


Daniele Piomelli
Chemistry



G. Weiss
J. S. Nowick
R. Chamberlin
New Questions

Predict drug-like molecules? toxicity?


How can we search efficiently? Intelligently?



New data structures and algorithms
Optimizing old structures
How can we understand this much data?



New Strategies
Cluster and visualize millions of data points
Define commercially accessible space.
Are there other useful things we can do with this?



Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.
Acknowledgements






Jocelyne Bruand
Peter Phung
Liva Ralaivola
S. Joshua Swamidass
Yimeng Dou
NIH/NSF/IGB
Questions
Docking
Query:
Binding Site of Protein
Scoring Function & Efficient Minimizer
…
Some Targets




P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)
P53
Drug Rescue of P53 Mutants
Docking → ChemDB



~6 million commercially available compounds
Searchable, annotated, downloadable.
Other Databases:



Cambridge Structural Database
ChemBank
PubChem
Chemical Toxicity Prediction
By Kernel Methods
Jonathan Chen
S Joshua Swamidass
The Baldi Lab
Data Flow
[Figure: labeled molecules → Kernel → Gram Matrix → Linear Classifier → Predictions; structure drawings omitted]

Training molecules
ID    Toxic?
1     No
2     No
3     Yes
4     Yes
…     …

Gram Matrix
ID    1     2     3     4     …
1     21    4     5     10    …
2     4     14    5     3     …
3     5     5     15    6     …
4     10    3     6     23    …
…     …     …     …     …     …

Gram Matrix + Toxicity State List → Linear Classifier → Predictions
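The last step of the data flow, going from a Gram matrix and a list of toxicity labels to a classifier, can be sketched with a kernel perceptron. This is a stand-in chosen for brevity, since the slides say only "Linear Classifier"; it is not necessarily the learner used in the talk. The 4 x 4 Gram matrix and the labels (No = -1, Yes = +1) are the values shown on the slide, and the function names are hypothetical.

```python
def train_kernel_perceptron(gram, labels, max_epochs=100):
    """Learn one non-negative weight per training molecule from the Gram matrix."""
    n = len(labels)
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:  # misclassified (or undecided): reinforce i
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(kernel_values, alpha, labels):
    """Classify a molecule from its kernel values against the training set."""
    score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_values))
    return 1 if score > 0 else -1


# Values from the slide: Gram matrix for molecules 1 to 4 and their labels.
gram = [
    [21, 4, 5, 10],
    [4, 14, 5, 3],
    [5, 5, 15, 6],
    [10, 3, 6, 23],
]
labels = [-1, -1, 1, 1]  # No, No, Yes, Yes

alpha = train_kernel_perceptron(gram, labels)
predictions = [predict([gram[j][i] for j in range(4)], alpha, labels) for i in range(4)]
print(alpha, predictions)
```

The classifier never sees the molecules themselves, only their pairwise kernel values. To score a new compound, one would compute its kernel against each training molecule and pass that list to `predict`.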
Example of Results
Kernel/Method                   Mutag   MM      FM      MR      FR
Kashima (2003)                  89.1    61.0    61.0    62.8    66.7
Kashima (2003)                  85.1    64.3    63.4    58.4    66.1
1D SMILES spec.                 84.0    66.1    61.3    57.3    66.1
1D SMILES spec+                 85.6    66.4    63.0    57.6    67.0
2D Tanimoto                     87.8    66.4    64.2    63.7    66.7
2D MinMax                       86.2    64.0    64.5    64.5    66.4
2D Tanimoto, l = 1024, b = 1    87.2    66.1    62.4    65.7    66.9
2D Hybrid l = 1024, b = 1       87.2    65.2    61.9    64.2    65.8
2D Tanimoto, l = 512, b = 1     84.6    66.4    59.9    59.9    66.1
2D Hybrid l = 512, b = 1        86.7    65.2    61.0    60.7    64.7
2D Tanimoto, l = 1024 + MI      84.6    63.1    63.0    61.9    66.7
2D Hybrid l = 1024 + MI         84.6    62.8    63.7    61.9    65.5
2D Tanimoto, l = 512 + MI       85.6    60.1    61.0    61.3    62.4
2D Hybrid l = 512 + MI          86.2    63.7    62.7    62.2    64.4
3D Histogram                    81.9    59.8    61.0    60.8    64.4
Chemical Informatics






Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Datasets
Small Molecules as Undirected Labeled
Graphs of Bonds


atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
Chemical Informatics




Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Bioinformatics analogy:




Catalog (GenBank)
Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.