How can deep learning help to design better and safer medicines?
KinomeNet: multi-task deep convolutional network
Olexandr Isayev, Ph.D.
University of North Carolina at Chapel Hill
@olexandr
http://olexandrisayev.com

About me
Ph.D. in Chemistry (computational)
Minor in CS
Worked in a federal research lab on HPC & GPU computing to solve chemical problems
Now I am research faculty at the University of North Carolina, Chapel Hill
http://olexandrisayev.com
I am also Director of Drug Discovery at Atlas Regeneration. We use AI & multi-omics to develop regenerative medicine and stem cell differentiation technologies.
http://atlasregeneration.com/

A public-private partnership that supports the discovery of new medicines through open access research
www.thesgc.org

How are drugs discovered?

The Long and Winding Road to Drug Discovery
Data science approaches are useful across the pipeline, but with very different techniques.
Aim for success, but if not: fail early, fail cheap.

Medicines Are Transforming the Treatment of Many Diseases

Robotic synthesis
Robotic biological tests (HTS)

Drowning in Data but Starving for Knowledge
The rapid growth of materials research has led to the accumulation of vast amounts of data:
• For example, 160,000 entries in the Inorganic Crystal Structure Database (ICSD)
• Numerous commercial and open experimental databases: NIST, MatWeb, MatBase, etc.
• Vast computational databases such as AFLOWLIB, Materials Project, and Harvard Clean Energy

Decline in Pharmaceutical R&D Efficiency
The cost of developing a new drug (~$2‐3B) roughly doubles every nine years.
Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191‐200

Why do drugs fail?

Selectivity of Kinase Inhibitors
All kinases bind ATP and therefore contain a conserved binding site
Most compounds inhibit more than one kinase

Why Don’t We Do Better?
A Couple of Observations
• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product, alkaloid; many uses, e.g., antifungal, antihypertensive
>40% of biologically active compounds bind to more than one target
Collins and Workman 2006 Nature Chemical Biology 2 689‐700

Virtual Screening to Identify Potential Hits
Empirical Rules/Filters
Similarity Search
ML or QSAR Models
Consensus QSAR Models
Structure-based
VIRTUAL SCREENING
Candidate molecules: ~10^6 – 10^7 molecules
Potential hits: ~10^2 – 10^3 molecules

Our vision for next-gen cheminformatics platforms
• Scale up Machine Learning Methods with the Data
• Use the full variety of available data (-omics, sensors, etc.)
• Take advantage of the latest algorithmic developments – Deep Learning

Human Kinase Inhibitor Data Collection
Collected all human kinase data from open sources
• ChEMBL
• PKIS
• PubChem
• Private datasets
• Literature, patents, etc.
300,000+ Molecules
489 Targets
>800,000 Experimental data points
Biggest target data: >25000 molecules
Smallest target data: 1

Human Kinase IC50 Data Distribution
“Popular” targets
“Rare” targets

Convolutional Neural Network (ConvNet)
Convolution Function (Filter)
Comes from Image and Signal Processing
The easiest way to understand a convolution is by thinking of it as a sliding window function applied to a matrix.
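The sliding-window picture of a convolution can be sketched in a few lines of plain Python. This is an illustration only, not code from the talk: the function name `convolve2d_valid` and the toy matrix and filter are made up for the example, and it assumes no padding and a stride of 1. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(matrix, kernel):
    """Slide `kernel` over `matrix` (no padding, stride 1).

    At each window position, multiply the overlapping entries
    elementwise and sum them into one output value.
    """
    m_rows, m_cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for i in range(m_rows - k_rows + 1):          # window top edge
        out_row = []
        for j in range(m_cols - k_cols + 1):      # window left edge
            window_sum = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    window_sum += matrix[i + a][j + b] * kernel[a][b]
            out_row.append(window_sum)
        output.append(out_row)
    return output


# Hypothetical 3x3 input and 2x2 filter, chosen only for illustration.
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(convolve2d_valid(matrix, kernel))  # [[-4, -4], [-4, -4]]
```

Each output entry is the top-left value of the window minus its bottom-right value, which is why this filter gives the same answer everywhere on an evenly increasing grid.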
Groundbreaking results of DL are mostly based on networks with convolutional filters
• Image recognition
• Object detection
• Medical image processing

Different Levels of Abstraction
• Hierarchical Learning
• Natural progression from low level to high level structure as seen in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces
• A good lower level representation can be used for many distinct tasks

KinomeNet: Convolutional Neural Network for QSAR
2D matrix of Descriptors
ConvNet
Multitask Learning (253 targets)
ABL1, ACVR1, …, ZAK, ZAP70

Some Statistics & Performance Numbers
RF (Random Forest): Average AUC 0.90
KinomeNet: Average AUC 0.96

Random Forest Models
N compounds, Active, AUC @1uM, TN, FP, TP, FN, Sensitivity, Specificity
MAP4K4 160 10 0.88 149 1 1 9 0.1 0.93
BMX 151 0.78 0 4 151 0 1.0 0.0
0.6 0.94 0.99 1.0 155
DL Model
MAP4K4 160 10 0.91 150 0 6
BMX 151 0.93 4 0 149 6 155 4

KinomeNet: “Deorphanizing” rare targets
2D matrix of Descriptors
ConvNet
Multitask Learning (253 targets)
ABL1, ACVR1, …, ZAK, ZAP70

KinomeNet: “Deorphanizing” rare targets
2D matrix of Descriptors
ConvNet
“Frequent” (253 targets): ACVR1, …
“Rare” targets (67 targets): TYMS
Multitask Learning (320 targets)

Why it Works: Transfer Learning
• Feature‐representation‐transfer
• To learn a “good” feature representation for the target domain.
• The knowledge used to transfer across domains is encoded into the learned feature representation.
• With the new feature representation, the performance of the target task is expected to improve.
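A practical detail behind one network with many target outputs: with 300,000+ molecules, 489 targets, and >800,000 data points, only on the order of half a percent of all molecule-target pairs have a measurement. The slides do not say how KinomeNet handles the missing labels. One common approach, shown here purely as an assumption, is to compute the loss only over measured pairs, so every target with data still shapes the shared representation. The function names and the use of `None` for "not measured" are hypothetical choices for this sketch.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def masked_multitask_loss(logits, labels):
    """Mean binary cross-entropy over measured (molecule, target) pairs.

    logits: one row per molecule, one raw network output per target.
    labels: same shape; 1 = active, 0 = inactive, None = not measured.
    Unmeasured pairs are skipped and contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            if y is None:
                continue
            p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Two molecules, two targets; the second target was never measured.
logits = [[0.0, 5.0], [0.0, -5.0]]
labels = [[1, None], [0, None]]
print(round(masked_multitask_loss(logits, labels), 4))  # 0.6931
```

Because the second column is masked, changing its logits leaves the loss unchanged. In a real network the same masking would apply to the gradients, which is what lets a target with very few measurements share features learned from data-rich targets.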
Recovery of Kinase Similarity by the Network

Atlas Regeneration
Young, dynamic startup company (formed in 2015) in North Carolina
We use AI to develop regenerative medicine
Design molecules to induce iPSC stem cell differentiation
Tissue and muscle regeneration, fibrosis

AI Drug Discovery Platform
250M+ SCREENING MOLECULES
o Integrated public data (PubChem, ChEMBL, etc.)
o Private datasets
o Literature and patents
o In vitro (HTS)
o In vivo (mouse, rats)
o Multi-omics
o Signaling Pathways
o Gene Expression
BIG CHEMICAL DATA
FAST ARTIFICIAL INTELLIGENCE
TOP HITS

TGF beta inhibitor (Fibrosis)
Large scale prediction of bioactivity with Deep Learning
200M+ potential candidates
Selectivity
Off target binding
Toxicity
Metabolic stability
Bioavailability
Solubility
etc.
FAST ARTIFICIAL INTELLIGENCE
7
• Good selectivity
• Three novel scaffolds
• Predicted potency 7 – 25 nM
• Good synthetic accessibility
• Good ADME/Tox properties

Conclusions
• Data availability is the biggest barrier
• Novel architecture for multitask‐QSAR
• Improvement over well converged RF models
• Convenience: 1 vs 320 models
• Training 1 network is faster than training 320 RF models
• Scalability of DL to “Big Data”
• DL benefits from transfer learning
• More tasks and more data – the higher the benefit
• Transferability: KinomeNet ‐> GPCRNet
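The Random Forest versus KinomeNet comparison earlier in the deck reports TN, FP, TP, FN, sensitivity, and specificity per target. As a minimal sketch of how the last two follow from the counts, the helper below (a hypothetical name, not from the talk) applies the standard definitions. The example counts assume the extracted Random Forest row for BMX reads TN=0, FP=4, TP=151, FN=0, which is an assumption about the column alignment.

```python
def sensitivity_specificity(tn, fp, tp, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of actives that were found.
    specificity = TN / (TN + FP): fraction of inactives correctly rejected.
    A ratio with an empty denominator is returned as NaN.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Assumed reading of the Random Forest row for BMX.
print(sensitivity_specificity(tn=0, fp=4, tp=151, fn=0))  # (1.0, 0.0)
```

Under that reading, a model that labels every compound active reaches perfect sensitivity and zero specificity, which is why both columns matter for heavily imbalanced targets.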