Exploring Chemical Space with Computers: Challenges and Opportunities
Pierre Baldi, UCI

Chemical Informatics
- Historical perspective: physics, chemistry and biology
- Understanding chemical space
- Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
- Predict physical, chemical, biological properties (classification/regression)
- Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Chemical Space
            Stars        Small Mol.
Existing    10^22        10^7
Virtual     0            10^60 (?)
Access      Difficult    "Easy"
Mode        Individual   Combinatorial

Methods Spectrum
- Schrödinger Equation
- Molecular Dynamics
- Machine Learning (e.g. SS prediction)

Chemical Informatics
Informatics must be able to deal with variable-size structured data:
- Graphical Models
- (Recursive) Neural Networks
- ILP
- GA
- SGs
- Kernels

Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:
- Data (GenBank, Swissprot, PDB)
- Similarity (BLAST)

Data
- Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
- PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
- NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
- Alkanes (Boiling points): all 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points ([164,174])
- Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A
- ChemDB: 7M compounds

Similarity
- Rapid Searches of Large Databases
- Predictive Methods (Kernel Methods)
- Why it is not hopeless?
Similarity
- Rapid Search of Large Databases
  - Protein Receptor (Docking)
  - Small Molecule/Ligand (Similarity)
- Predictive Methods (Kernel Methods)
- Why it is not hopeless

Linear Classifiers

Classification: Learning to Classify
- Limited number of training examples (molecules, patients, sequences, etc.)
- Learning algorithm (how to build the classifier?)
- Generalization: should correctly classify test data.
Formalization:
- X is the input space
- Y (e.g. toxic/non-toxic, or {-1, +1}) is the target class
- f: X → Y is the classifier.

Classification
Fundamental point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points.

Non-Linear Classification (Kernel Methods)
- We can transform a nonlinear problem into a linear one using a kernel K.
- Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
- All we need is the Gram similarity matrix K.
- K defines the local metric of the embedding space.
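To make the point about the Gram matrix concrete, here is a minimal sketch of a kernel perceptron that is trained and evaluated using only kernel values, never explicit feature vectors. It is an illustration under stated assumptions, not the method used in the talk: the quadratic polynomial kernel, the XOR toy data, and all function names (`poly_kernel`, `gram_matrix`, `train_kernel_perceptron`, `predict`) are hypothetical choices made for this example.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^degree (an assumed example kernel)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def gram_matrix(points, kernel):
    """Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def train_kernel_perceptron(K, y, max_epochs=1000):
    """Learn dual weights alpha from the Gram matrix K and labels y in {-1, +1}."""
    n = len(y)
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Decision value for point i depends on the data only through K.
            score = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
            if y[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha


def predict(alpha, y, k_column):
    """Classify a point given its kernel values against the training points."""
    score = sum(a * t * k for a, t, k in zip(alpha, y, k_column))
    return 1 if score > 0 else -1


# XOR labels: not linearly separable in the input space,
# but separable in the feature space induced by the quadratic kernel.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]

K = gram_matrix(X, poly_kernel)
alpha = train_kernel_perceptron(K, y)
predictions = [predict(alpha, y, [K[j][i] for j in range(len(X))])
               for i in range(len(X))]
print(predictions)
```

Swapping `poly_kernel` for a kernel defined on molecules would leave the training and prediction code unchanged, which is the practical meaning of "all we need is the Gram similarity matrix K".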
Similarity: Data Representations
Example molecule (SMILES): NC(O)C(=O)O

Molecular Representations
- 1D: SMILES strings
- 2D: Graph of bonds
- 2.5D: Surfaces
- 3D: Atomic coordinates
- 4D: Temporal evolution

1D SMILES Kernel
Molecule A: CCCCCCc1ccc(cc1O)O
Molecule B: CCCCCc1ccc(cc1)CO

K-mer   Count A   Count B   Product
(cc1    1         1         1
1)CO    0         1         0
1O)O    1         0         0
1ccc    1         1         1
CCCC    3         2         6
CCCc    1         1         1
CCc1    1         1         1
Cc1c    1         1         1
c(cc    1         1         1
c1)C    0         1         0
c1O)    1         0         0
c1cc    1         1         1
cc(c    1         1         1
cc1)    0         1         0
cc1O    1         0         0
ccc(    1         1         1
Total:                      15

2D Molecule Graph Kernel
For chemical compounds:
- atom/node labels: A = {C, N, O, H, …}
- bond/edge labels: B = {s, d, t, ar, …}
- Count labeled paths (e.g. CsNsCdO)
- Fingerprints
- Similarity Measures

3D Coordinate Kernel
Example interatomic distances: 2.8 Å, 2.0 Å, 1.4 Å, 4.2 Å, 3.4 Å

Atom Distance Histogram
Distance (Å)   Count
0              0
1              5
2              7
3              3
4              1
5              0

Example of Results
[Results slides: figures not recoverable]

Summary
- Derived a variety of kernels for small molecules
- State-of-the-art performance on several benchmark datasets
- 2D kernels slightly better than 1D and 3D kernels
- Many possible extensions: 2.5D kernels, isomers, etc.
- Need for larger data sets and new models of cooperation in the chemistry community
- Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
- Chemistry version of the Turing test

ChemDB
- 7M compounds (3.5M unique)
- Commercially available
- PostgreSQL/Oracle
- Annotation (Experimental, Computational)
- Searchable
- Web interface
- Similarity, in silico reactions

Acknowledgements
- Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
- Funding: NIH, NSF, IGB
- Pharmacology: Daniele Piomelli
- Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin

New Questions
- Predict drug-like molecules? toxicity?
- How can we search efficiently? Intelligently?
  - New data structures and algorithms
  - Optimizing old structures
- How can we understand this much data?

New Strategies
- Cluster and visualize millions of data points
- Define commercially accessible space.
- Are there other useful things we can do with this?
  - Discover new polymers, etc.
  - Wonder about the origin of life.
  - Combinatorially combine all known chemicals.

Acknowledgements
Jocelyne Bruand, Peter Phung, Liva Ralaivola, S. Joshua Swamidass, Yimeng Dou
NIH/NSF/IGB

Questions

Docking
- Query: Binding Site of Protein
- Scoring Function & Efficient Minimizer
- …

Some Targets
- P53 (Luecke)
- ACCD5 (Tsai)
- IMPDH, PPAR, etc. (Luecke)
- HIV Integrase (Robinson)

P53
Drug Rescue of P53 Mutants

Docking → ChemDB
- ~6 million commercially available compounds
- Searchable, annotated, downloadable.
- Other Databases: Cambridge Structural Database, ChemBank, PubChem

Chemical Toxicity Prediction By Kernel Methods
Jonathan Chen, S Joshua Swamidass
The Baldi Lab

Data Flow
Molecules (ID, Toxic?) → Kernel → Gram Matrix → Linear Classifier (with Toxicity State List) → Predictions

ID   Toxic?
1    No
2    No
3    Yes
4    Yes
…    …

Gram Matrix
ID   1    2    3    4    …
1    21   4    5    10   …
2    4    14   5    3    …
3    5    5    15   6    …
4    10   3    6    23   …

Results
Example of Results

Kernel/Method                  Mutag  MM    FM    MR    FR
Kashima (2003)                 89.1   61.0  61.0  62.8  66.7
Kashima (2003)                 85.1   64.3  63.4  58.4  66.1
1D SMILES spec.                84.0   66.1  61.3  57.3  66.1
1D SMILES spec+                85.6   66.4  63.0  57.6  67.0
2D Tanimoto                    87.8   66.4  64.2  63.7  66.7
2D MinMax                      86.2   64.0  64.5  64.5  66.4
2D Tanimoto, l = 1024, b = 1   87.2   66.1  62.4  65.7  66.9
2D Hybrid l = 1024, b = 1      87.2   65.2  61.9  64.2  65.8
2D Tanimoto, l = 512, b = 1    84.6   66.4  59.9  59.9  66.1
2D Hybrid l = 512, b = 1       86.7   65.2  61.0  60.7  64.7
2D Tanimoto, l = 1024 + MI     84.6   63.1  63.0  61.9  66.7
2D Hybrid l = 1024 + MI        84.6   62.8  63.7  61.9  65.5
2D Tanimoto, l = 512 + MI      85.6   60.1  61.0  61.3  62.4
2D Hybrid l = 512 + MI         86.2   63.7  62.7  62.2  64.4
3D Histogram                   81.9   59.8  61.0  60.8  64.4

Datasets

Small Molecules as Undirected Labeled Graphs of Bonds
- atom/node labels: A = {C, N, O, H, …}
- bond/edge labels: B = {s, d, t, ar, …}

Chemical Informatics
- Historical perspective: physics, chemistry and biology
- Understanding chemical space
- Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
- Bioinformatics analogy: Catalog (GenBank), Search (BLAST)
- Predict physical, chemical, biological properties
- Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
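The 1D SMILES kernel slide counts the k-mers shared by two SMILES strings and sums the products of their counts. The sketch below reproduces that worked example. It assumes k = 4 (inferred from the four-character k-mers on the slide) and treats SMILES strings as plain character sequences; the function names `kmer_counts` and `spectrum_kernel` are hypothetical and not taken from the talk.

```python
from collections import Counter


def kmer_counts(smiles: str, k: int = 4) -> Counter:
    """Count every overlapping substring of length k in a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 4) -> int:
    """Dot product of the two k-mer count vectors (a spectrum-style kernel)."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for missing keys, so unshared k-mers contribute nothing.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# The two molecules from the slide.
mol_a = "CCCCCCc1ccc(cc1O)O"
mol_b = "CCCCCc1ccc(cc1)CO"

print(kmer_counts(mol_a)["CCCC"], kmer_counts(mol_b)["CCCC"])  # 3 2
print(spectrum_kernel(mol_a, mol_b))  # 15, matching the slide's total
```

Because the kernel is a plain dot product of count vectors, evaluating it on every pair of molecules in a dataset yields a valid Gram matrix of the kind shown in the Data Flow slide.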