 A Critical Assessment of the Impact of Computational Methods on the Productivity in Pharmaceutical Research and Development
Charlotta Schärfe
Student ID: 1058669

A thesis submitted for the degree of Master of Science
MSc. Biotechnology, Bioprocessing and Business Management
School of Life Sciences, University of Warwick, Coventry, UK
September 2011
Word Count: 16,498

Table of Contents

List of Figures
List of Tables
Abbreviations
Acknowledgements
Executive Summary
Chapter 1 The Pharmaceutical Industry Today
1.1 R&D Process
1.2 Development Costs
1.3 Productivity Problem
1.4 Conclusions
Chapter 2 Target Identification and Validation
2.1 Bioinformatics in Target Discovery and Validation
2.1.1 Obtaining the Protein Structure
2.1.2 Target Characterisation
2.1.3 Quality Control
2.2 Knowledge Discovery
2.3 Drug Reprofiling
2.4 Conclusions
Chapter 3 Hit and Lead Identification and Optimisation
3.1 Virtual Screening
3.1.1 Compound Library Design
3.1.2 Pharmacophore
3.1.3 Three-Dimensional Quantitative Structure-Activity Relationship (3D-QSAR)
3.1.4 Docking
3.2 De Novo Design
3.3 Lead Optimisation and Selection
3.3.1 Quantitative Structure-Activity Relationships (QSAR)
3.3.2 Accuracy of Predictive Tools
3.4 Scaffold Hopping
3.5 Conclusions
Chapter 4 Pre-Clinical and Clinical Development
4.1 Trial Forecasting
4.1.1 Virtual Animal and Patient Models
4.2 Trial Design
4.2.1 Biomarkers
4.2.2 Example: Biomarker-Based Patient Selection
4.3 Pharmacogenetics and Pharmacogenomics
4.4 Problems with Available Data
4.5 Conclusions
Chapter 5 Knowledge Discovery
5.1 Text Mining
5.2 Data Mining
5.3 Semantic Technologies
5.4 Integrated Informatics Systems and Workflows-Based Frameworks
5.5 Development of a New Informatics-Centred Model for Drug Discovery and Development
5.6 Conclusions
Chapter 6 Discussion
References
Appendix A Detailed Description of the R&D Workflow
Appendix B Overview: Computer-Assisted Drug Design
Appendix C Novel Development Model (Details)
Appendix D Dr Gordon Baxter
Appendix E Dr William Hamilton
Appendix F Dr Gary Rubin
Appendix G Dr Matthew Segall
Appendix H Other Contacts
Appendix I Dr Bernard Munos
Appendix J Dr David Swinney
Appendix K Dr James L. Stevens
Appendix L Dr Sanat K. Mandal
Appendix M Dr William Bains
Appendix N Scott Lusher

List of Figures

Figure 1.1 Typical drug design process
Figure 1.2 Approved (1999-2008) first-in-class NMEs by therapeutic area
Figure 1.3 Exponential 100x decline in R&D productivity
Figure 1.4 R&D model by Paul et al.
Figure 1.5 Parametric sensitivity analysis
Figure 1.6 Eli Lilly's alternative development model "Chorus"
Figure 2.1 Drug targets
Figure 2.2 Workflow for finding new targets with binding site similarity analysis
Figure 2.3 Existing drugs and the proportion that is available commercially, sorted by therapeutic category
Figure 2.4 Proposed steps for drug design process with known drug and unknown target
Figure 2.5 Ranked list of targets from Entelos
Figure 3.1 In silico drug discovery pipeline
Figure 3.2 Pharmacophore models
Figure 3.3 Principle of docking
Figure 3.4 The three general workflows applied in virtual screening
Figure 3.5 Growth of the Protein Data Bank (PDB)
Figure 3.6 De novo evolution
Figure 3.7 Pharmacokinetic properties of a drug
Figure 3.8 Workflow facilitating early risk assessment
Figure 3.9 An integrative workflow for ADME and toxicity prediction using bioinformatics and systems biology
Figure 3.10 The main cornerstones of the CADDD process performed by Prosarix Ltd.
Figure 3.11 Prosarix in-house discovery
Figure 3.12 StarDrop graphical user interface
Figure 3.13 Optibrium case study
Figure 4.1 R&D costs for all efforts and per approved drugs
Figure 4.2 How the drug design and development process may look when virtual humans exist
Figure 4.3 Entelos' concept of virtual patients
Figure 4.4 Entelos' top-down PhysioLab® model building process
Figure 4.5 Enrichment trial design
Figure 4.6 Market potential of bioinformatics in different medical application areas
Figure 4.7 Value-adding aspects of bioinformatics and predictive biomarkers
Figure 5.1 Number of available publications in the literature database MEDLINE
Figure 5.2 Integrated data mining workflow
Figure 5.3 Concept of intelligent metadata
Figure 5.4 Instem's Centrus™ platform
Figure 5.5 The anatomy of a workflow node
Figure 5.6 A typical KNIME workflow
Figure 5.7 Proposed new informatics and data centred drug discovery and development workflow
Figure 6.1 Medicines in development as of January 2011
Figure B.1 In silico drug discovery pipeline
Figure C.1 Novel informatics-centred research and development model

List of Tables

Table 1.1 Patent expiration in the years 2009 to 2012 for ten top-selling drugs and associated annual sales losses
Table 1.2 Comparison of key assumptions and main outputs of several financial models for the pharmaceutical R&D process
Table 1.3 Key definitions and assumptions of Paul et al.'s productivity model
Table 1.4 Novel development concept at Eli Lilly based on the "quick win, fast fail" paradigm
Table 2.1 Targets of approved drugs and estimates about the number of drug targets in the human body
Table 2.2 Bioinformatics methods in target identification and validation
Table 3.1 Stages of the hit-to-lead process and the general criteria a hit needs to satisfy at each stage
Table 3.2 Experimental hit identification with high-throughput screening (HTS)
Table 3.3 Docking algorithm benchmark results
Table 3.4 De novo design approaches
Table 3.5 Methods for scaffold hopping
Table 3.6 Case study lead design: Prosarix Ltd.
Table 3.7 ProtoDiscovery™ validation data from several projects
Table 3.8 Case study lead optimisation: Optibrium Ltd.
Table 3.9 Estimated return on investment (ROI) for computational modelling
Table 4.1 Stages of preclinical and clinical development
Table 4.2 Case study virtual models: Entelos Inc.
Table 4.3 Example enrichment trial
Table 4.4 Efficiency of enrichment study designs
Table 4.5 Potential advantageous applications of biomarkers, pharmacogenetics, and PGx
Table 4.6 Case study bioinformatics analyses: Fios Genomics Ltd.
Table 4.7 James Stevens (Eli Lilly) on computational models
Table 5.1 Terminology in knowledge discovery
Table 5.2 Case study knowledge discovery: BioWisdom (now part of Instem)
Table 6.1 Summary of productivity impact of computational techniques based on the examples and case studies in the previous chapters

Abbreviations

3D: Three-dimensional
Å: Ångström
ADME: Absorption, distribution, metabolism and excretion
ADMET: Absorption, distribution, metabolism, excretion, and toxicity
ANN: Artificial neural net
BMP: Binding mode prediction
CADD: Computer-assisted drug design
CEO: Chief Executive Officer
CIE: Confidence in Efficacy
CIS: Confidence in Safety
CNS: Central nervous system
cpds: Compounds
CSO: Chief Scientific Officer
CT: Cycle time
Cyp450: Cytochrome P450
DNA: Deoxyribonucleic acid
e.g.: For example
EGFR: Epidermal growth factor receptor
ELN: Electronic laboratory notebooks
FDA: Food and Drug Administration
GPCR: G-protein coupled receptor
hERG: human Ether-à-go-go Related Gene
HIV: Human immunodeficiency virus
HTC: High-throughput chemistry
HTS: High-throughput screening
Ibid.: Ibidem = the same reference as the one cited before
i.e.: Id est = that is
IL-5: Interleukin-5
IND: Investigational new drug
IP: Intellectual property
IT: Information technology
KNIME: Konstanz Information Miner
logD: Distribution coefficient
logP (clogP): Partition coefficient (calculated partition coefficient)
M: mole
m: million
MOA: Mechanism of action
N/A: Not available
NCE: New chemical entity
NLM: United States National Library of Medicine
NLP: Natural Language Processing
NME: New medical entity
NMR: Nuclear magnetic resonance
p(TS): Probability of success
PAMPA: Parallel artificial membrane permeability assay
PB-PK: Physiology-based pharmacokinetics
PGP: P-glycoprotein
PGx: Pharmacogenomics
PhD: Doctor of Philosophy
phRMA: Pharmaceutical Research and Manufacturers of America
PK: Pharmacokinetics
POC: Proof of concept
POP: Proof of principle
QSAR: Quantitative structure-activity relationship
R&D: Research and Development
RNA: Ribonucleic acid
ROI: Return on investment
SAR: Structure-activity relationship
SMILES: Simplified molecular input line entry specification
SNP: Single nucleotide polymorphism
UK: United Kingdom
VLS: Virtual ligand screening
VS: Virtual screening
WIP: Work in progress

Acknowledgements

It is a pleasure to thank everyone who helped me make this dissertation possible. Firstly, I am indebted to my supervisor Dr Sandy Primrose for his guidance, helpful comments and discussions throughout the research process. Additionally, I would like to thank the following persons for sparing some of their valuable time to answer my questions and sharing their knowledge in person: Dr Gordon Baxter and Dr William Hamilton for the lovely and very inspiring meetings in Cambridge, Dr Neil Porter, Dr Sarah Duggan, Dr Mairéad Duke, and Dr Jill Makin for meeting me on campus, as well as Dr Gary Rubin and Dr Matthew Segall for similarly inspiring and insightful Skype conferences.
In this context I would also like to thank Dr Sandy Primrose, Dr William Bains and Dr Crawford Dow for making these interviews possible by establishing the contacts. I would also like to express appreciation to Dr Bernard Munos, Dr David Swinney, Dr James Stevens, Dr Nicolas Fechner, Dr Sanat Mandal, Scott Lusher and Dr William Bains for being kind enough to answer all my questions via email. I owe my deepest gratitude to my parents and brothers, who always supported me and made my participation in this course possible. I am indebted to you, Papa, for the countless hours of proof-reading! Furthermore, I am also grateful to my boyfriend Henry for his support throughout the year and his company in the weeks of writing. Finally, I would like to thank Dr Crawford Dow, Dr Charlotte Moonan, Dr Stephen Hicks, and Adrienne Davies for their continuous encouragement, guidance and assistance throughout the year!

Executive Summary

It has long been known to everyone associated with the pharmaceutical industry ("pharma") that the current drug discovery and development model does not function very well. In order to slow down the escalating research and development (R&D) costs and increasing development times, new research models are needed (Cressey, 2011). To achieve this, novel technologies such as high-throughput screening (HTS) and combinatorial chemistry were developed, and it was anticipated that they would improve productivity quickly. This hope did not turn out to be true (William Bains, Sandy Primrose, personal communication). These novel technologies have, however, significantly increased the amount of data acquired throughout the R&D process and have therefore paved the way for a more knowledge-driven drug discovery (Gassmann et al., 2008). The huge amounts of data acquired by research laboratories in each phase of the drug development process using such novel technologies and external data integration have by now made manual data analysis and knowledge management very complex. This, combined with the ever-improving speed and performance of computers (Koomey, 2010), has led to the widespread and indispensable use of information technology in drug discovery. Although computational methods, with the special fields of bioinformatics and cheminformatics for biological and chemical data storage, data visualisation, data mining, modelling, statistical analysis, and pattern recognition, are omnipresent in pharmaceutical R&D, to the author's knowledge no comprehensive evaluation exists of the impact of computational methods on pharmaceutical R&D productivity. This thesis aims to tackle this question by combining an extensive literature review of currently used computational techniques and tools in the different R&D phases with case studies and interviews about their current and future applicability. The main focus of this work will be on data mining and integration, modelling and predictive tools, since methods for data storage and visualisation are very closely linked to the technology that generates the data, and a comprehensive discussion of this technology goes beyond the scope of this thesis. Finally, an R&D model is developed which is centred on informatics methods in all phases of drug discovery and development and could potentially reduce the time between target identification and marketing approval by 50% while reducing late-stage attrition rates through better knowledge integration.
x Chapter 1 The Pharmaceutical Industry Today Since the 1990s the pharmaceutical industry has to face a big challenge: the number of new medical entities (NMEs) that gain marketing authorisation is stagnating while the costs for bringing a new drug through R&D to the market are sky-­‐rocketing to more than US$ 1.2 billion (Kapetanovic, 2008). This chapter will describe the drug R&D process and then discuss the issues behind the productivity problem as well as some of the possible solutions. 1.1 R&D Process Traditionally, drugs were small molecules that were in some way able to modulate cellular pathways to cure or treat a disease. In recent years, however, larger medical molecules which are produced using biotechnology (“biopharmaceuticals” or “biologics”) have gained more and more interest. Biopharmaceuticals can be proteins, nucleic acids, viruses, living cells or microorganism and offer new ways of treating diseases such as replacement therapy, gene therapy or by using disease specific monoclonal antibodies (Ng, 2008). A company can develop new drugs beginning with different starting-­‐points. Different types are (1) a new class of treatment for a disease with a new molecular mechanism of action like molecules against a new cellular target (“first-­‐in-­‐class”), (2) a better drug for a known target (“best-­‐in-­‐class”), (3) simply another drug for a known target (“follower drugs”), or (4) new disease targets for a known drug (“reprofiling” or “repurposing”) (Chew, 2010, Swinney and Anthony, 2011, Buchan et al., 2011). The cost of development and approval varies drastically depending on the type of development, and so does the financial reward. For a long time it seemed attractive to find molecules with a novel mechanism of action and for new targets because it promised to create innovation and potential market leadership due to lack of competition. However, the process is expensive and risky and paved with hurdles by the regulatory agencies because the targets under investigation are not yet validated. Nowadays strategies such as finding the “best-­‐in-­‐class” molecule or drug repurposing are therefore also becoming attractive because the target and drug molecules are de-­‐risked and, given that the period of exclusive marketing for this class of drugs without competition of similar drugs has decreased from five years in the 1960s to 1 year in the 1990s, the additional investment for being first-­‐in-­‐class does not always pay off (Chew, 2010, Kaitin, 2010). 1 The Pharmaceutical Industry Today The discovery and development process for a new drug is divided in several distinct steps in which expert teams try to find a new compound, optimise its efficacy and then move it on to clinical development and efficacy/safety testing in animals and humans (Figure 1.1). The process tasks and timeframe vary depending on the disease and the intended drug type. A more detailed description of the process can be found in Appendix A. Figure 1.1. Typical drug design process. Target identification is not always part of the process since this is not always known at the beginning of the R&D and compounds can be found with phenotypic screening without knowledge about the molecular mechanism. Adapted from (PricewaterhouseCoopers, 2008) In general, there are two different approaches to drug discovery – empirical discovery and rational design. 
The empirical (“irrational”) approach is based on the trial-­‐and-­‐error screening of a wide variety of natural and synthetic small molecules either in high-­‐ or low-­‐
throughput biological assays to assess their pharmacological potential. This approach does not always require prior knowledge about the drug target(s) and mechanisms of action (MOA) which can be advantageous in many cases where not much knowledge about the disease biology exists, but also expensive because many wet-­‐lab experiments are needed to find potential compounds (Moustakas, 2008). Rational drug design on the other hand requires knowledge about biologically and pharmacologically relevant properties of the intended drug target or known compounds including protein structures and gene sequence to find or design specific binding partners (Ng, 2008). In this case a hypothesis about relationships between cellular processes and the disease stands at the beginning of the design process and is validated afterwards in several experiments (Swinney and Anthony, 2011). Libraries with potential compounds are then tested with in vitro and/or in silico screening processes to find molecules that can interact with the target (Mandal et al., 2009). Before the introduction of rational, and therefore often target-­‐based, approaches most of the discovery was done using phenotypic assays in which first-­‐in-­‐class molecules were found mainly through serendipity or deliberate targeting of a particular phenotype. In the period from 1999 to 2008, this approach accounted for 56% of the first-­‐in-­‐class small molecule drugs (Swinney and Anthony, 2011). Although rational design approaches have 2 The Pharmaceutical Industry Today been in use for the past 50 years (Adam, 2005) and structure-­‐based design for 20 years, only 34 % of the small molecule drugs between 1999 and 2008 were discovered by target-­‐
based screening. However, virtually all biologics, which accounted for 33% of all first-­‐in-­‐
class newly approved medicines in this time frame, were found with target-based approaches. Additionally, when looking at the category of follower drugs it is apparent that target-based screening accounts for half of the new compounds, whereas phenotypic screening was used in only 18% of the cases (Swinney and Anthony, 2011). It therefore seems as if target-based approaches are currently not as successful as the irrational phenotypic approach in finding truly novel small molecule drugs, but they can already be used for de-risked targets (i.e. the development of more advanced follower drugs) and for biologics. Figure 1.2 illustrates how the 75 new molecular entities that gained marketing approval between 1999 and 2008 were discovered and which approach was most successful in which disease category. This study has one disadvantage though: it shows which screening strategy was successful in the end, but it is not clear whether this also represents the distribution of screening efforts or whether, for example, much more was invested in target-based screening, but without success.

Figure 1.2. Approved (1999-2008) first-in-class NMEs by therapeutic area (number of NMEs per therapeutic area, split by target-based screening, phenotypic screening and biologics). Adapted from (Swinney and Anthony, 2011)

1.2 Development Costs

The past section has shown that the process for finding and developing a new drug is long and tedious. Current estimates for the time needed to bring a new drug to the market range from 7 to 16 years (Kapetanovic, 2008, Waller et al., 2007), and the clinical trial phases alone tend to last 7 years on average for a new medical entity after filing of the investigational new drug (IND) application (Kaitin, 2010). The overall capitalised cost of the process is normally assumed to be in the range of US$ 800 million (DiMasi et al., 2003) to US$ 1.8 billion (Paul et al., 2010), depending on the disease target and drug chemistry. Analyses have shown that biopharmaceuticals tend to be slightly cheaper (US$ 559 million versus US$ 672 million) (DiMasi and Grabowski, 2007). It could be hypothesised that the knowledge-driven rational design of biopharmaceuticals may be a reason for the lower out-of-pocket costs.
This is supported by DiMasi's observation that the time and costs for preclinical development of biopharmaceuticals are higher than for small molecule drugs, indicating greater efforts to acquire knowledge. The clinical phase, however, is shorter and cheaper, and approval success rates are higher. More studies with bigger sample sizes and with data for both types of pharmaceuticals from the same time frame are still needed to further validate these hypotheses. Although the current business model in the pharmaceutical industry is based on blockbuster drugs with sales exceeding US$ 1 billion, and molecules are only progressed in development if they are expected to accomplish this status in the market, just one in five drugs reaches this rank (Munos, 2009). Considering the fact that only a quarter of R&D-performing companies have development costs lower than US$ 1 billion per NME (ibid.) and only a third of newly approved drugs generate revenues on par with the average development cost (Grabowski et al., 2002, cited by Kaitin, 2010), it is obvious that current R&D for NMEs is financed with the sales coming from already approved blockbuster drugs. Since the pharmaceutical industry currently faces a phase of patent expiration of these blockbuster products (Table 1.1) (Kaitin, 2010) and a trend towards personalised medicine with much smaller markets (Bates, 2010), this model will not be sustainable in the future. In fact, Munos (2009) estimates that the thirteen largest pharmaceutical companies will face a reduction of up to 10% in sales and up to 30% in net income in the following years.

Table 1.1 Patent expiration in the years 2009 to 2012 for ten top-selling drugs per year and associated annual sales losses (2007 sales, US$ million).

2009: Prevacid 3,962; Topamax 2,453; Lamictal 2,194; Valtrex 1,868; Cellcept 1,677; Keppra 1,407; Flomax 1,399; Imitrex 1,370; Adderall XR 1,031; Novo Seven 1,078 — Total $24,544
2010: Protonix 4,221; Cozaar/Hyzaar 3,350; Aricept 3,311; Levaquin 2,862; Effexor XR 2,657; Taxotere 2,569; Arimidex 1,730; Gemzar 1,592; Coreg 1,174; Suboxone 531 (US sales only) — Total $17,892
2011: Lipitor 13,652; Plavix 8,079; Advair 6,998; Zyprexa 4,661; Actos 4,333; Seroquel 4,219; Avapro 2,685; Xalatan 1,604; Avelox 1,013; Xeloda 959 — Total $48,203
2012: Diovan 5,012; Singulair 4,266; Lexapro 3,044; Viagra 1,764; Avandia 1,754; Symbicort 1,575; Zometa 1,297; Detrol 1,190; Geodon 854; Provigil 852 — Total $21,608

Source: (Kaitin, 2010)

1.3 Productivity Problem

The aforementioned increasing R&D costs, combined with a stagnant number of market launches of NMEs (Munos, 2009), lead to a serious productivity problem for R&D-conducting pharmaceutical companies. Although the exact figures for the time and expenditure needed for a new NME to be launched differ, partly because different medicine categories have different R&D properties and risks, all experts agree that a drastic increase in productivity is needed for the pharmaceutical industry to sustain itself (Kapetanovic, 2008, Munos, 2009, Paul et al., 2010, DiMasi et al., 2010, Cressey, 2011). The decreasing R&D productivity is illustrated in Figure 1.3, in which the number of NMEs and biologics and the combined R&D spending of more than 40 leading pharmaceutical companies (Pharmaceutical Research and Manufacturers of America, phRMA) are compared. These analyses by Munos (2009) and Scannell et al. (2010) have shown that the decreasing trend in productivity started in the 1950s and that from there the inflation-adjusted R&D costs have increased at an average rate of 8% to 9% each year, leading to a productivity decline of 100-fold. This is especially troublesome because the productivity decline started during a period of time that is also called "the golden age of biology" (because of the major breakthroughs achieved in understanding many important principles in biology and patho-physiology) (Kubinyi, 2008).

Figure 1.3. Exponential 100x decline in R&D productivity. Data mainly based on (Munos, 2009), but R&D costs were estimated from the Pharmaceutical Research and Manufacturers of America (PhRMA) Annual Survey 2009, so it under-estimates R&D spending at the industry level, while NMEs are the total number of small molecule and biologic approvals by the FDA. Costs per NME are higher today if one also adds the R&D costs of smaller drug companies and the biotech sector. Source: (Scannell et al., 2010)

The main factors leading to the productivity problem can be divided into two categories: (1) external factors such as generic competition, increasingly risk-averse regulatory authorities who increase the efforts needed for safety and efficacy assessment prior to the market launch, and re-importation (Waller et al., 2007). These factors are hardly controllable by the individual company, although it has been shown that closer collaboration with the regulatory authority during the clinical development stage can improve the chance of market approval (Eichler et al., 2010). (2) Internal factors include poor management structures, poor process management and other factors that could be changed within the company. Pharmaceutical companies have already tried to improve their R&D processes by addressing the internal factors with various investments such as the implementation of high-throughput screening (HTS) technology and combinatorial chemistry. These have, however, so far failed to yield results on par with the investment that was needed (Waller et al., 2007). Several research groups in industry and academia have tried to create financial models for the pharmaceutical R&D process, some of which were already introduced in the previous section. Their results are shown in Table 1.2, and the growth of R&D costs is clearly observable.
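As a rough, illustrative check of the figures quoted above (treating the 1950-2010 window and the 8.5% mid-point as assumptions, since the text only states that the decline began in the 1950s), a few lines of Python show that an 8-9% compound annual increase in cost per NME is exactly the scale of change needed to produce a roughly 100-fold decline:

# Quick consistency check of the growth figures quoted above.
# The 1950-2010 window and the 8.5% mid-point are illustrative assumptions.
years = 2010 - 1950            # roughly the period covered by the cited analyses
fold_change = 100              # "productivity decline of 100-fold"

implied_rate = fold_change ** (1 / years) - 1
print(f"Implied annual growth in cost per NME: {implied_rate:.1%}")      # ~8.0%
print(f"8.5% compounded over {years} years: {1.085 ** years:.0f}-fold")  # ~134-fold

In other words, the 8-9% annual growth rate and the 100-fold decline shown in Figure 1.3 are two expressions of the same underlying trend.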
Table 1.2 Comparison of key assumptions and main outputs of several financial models for the pharmaceutical R&D process (costs in US$ million).

Model outputs
Out-of-pocket cost per NME: DiMasi 2003: $403; Gilbert 2003: n/a; Adams 2006: n/a; DiMasi 2007 (pharma): $672; DiMasi 2007 (biopharma): $559; Paul 2010: $873
Capitalised cost per NME: DiMasi 2003: $802; Gilbert 2003: $1,700; Adams 2006: $868; DiMasi 2007 (pharma): $1,318; DiMasi 2007 (biopharma): $1,241; Paul 2010: $1,778

Key assumptions
p(TS) of clinical development: DiMasi 2003: 21.5%; Gilbert 2003: 11-12%; Adams 2006: 24%; DiMasi 2007 (pharma): 21.5%; DiMasi 2007 (biopharma): 30.2%; Paul 2010: 11.7%
Cost of capital: DiMasi 2003: n/a; Gilbert 2003: 11%; Adams 2006: 11%; DiMasi 2007 (pharma): 11.5%; DiMasi 2007 (biopharma): 11%; Paul 2010: 11%

Source: (Paul et al., 2010), Supplementary information S2

Of these, the model by Steven Paul and his colleagues (Paul et al., 2010) from Eli Lilly is the most recent and will now be described in more detail because it is the basis for several other calculations in this thesis. The model measures R&D productivity in relation to several factors including cost, probability of success, cycle time, and created value. From this a "pharmaceutical value equation" was derived, which is introduced further in Table 1.3 together with the key assumptions of the model. Allowing for the cost of capital, a capitalised cost of $1.8 billion per launch is inferred (Figure 1.4). It can be seen that lead optimisation and clinical development in Phase II and Phase III are the most cost-intensive steps.

Table 1.3 Key definitions and assumptions of Paul et al.'s productivity model

R&D productivity is defined as the relationship between the commercial and medical value created by a new medical entity and the investment that was required to generate this NME. Related to that, R&D efficiency and R&D effectiveness are also introduced. Efficiency is equivalent to the relationship between certain inputs (e.g. investments) and resulting outputs (e.g. milestone achievements) and therefore relates to the cost per launch. Effectiveness, on the other hand, describes the generated value per launch by relating outputs and (medical and commercial) outcomes. Their "pharmaceutical value equation" describes the productivity (P) as dependent on the amount of scientific research and clinical development conducted in parallel (work in progress, WIP), the probability of success p(TS), the generated value V, the cycle time (CT) and the cost (C):

P ∝ (WIP × p(TS) × V) / (CT × C)   (Equation 1)
The equation allows P to be measured for single drug candidates (WIP = 1) or for a whole development portfolio. Productivity data for thirteen large pharmaceutical companies and internal data from Eli Lilly were used to approximate the different factors in the equation for each stage of the discovery and development process (excluding target identification and validation and non-molecule costs such as overheads). A model of the current productivity was then created using baseline parameter settings, and a sensitivity analysis for various parameter settings was performed. Although the model depends on business aspirations, the annual target for launches and other factors, Paul et al. (2010) were able to include industry-wide key observations: not including the time and cost for target discovery and validation, it took 13.5 years for an NME to launch in 2007, and clinical development (Phase I onwards) accounts for 63% of the costs. The probability of technical success is 7% for small-molecule entities and 11% for biologics (8% on average). Therefore at least 9 molecules must enter clinical development each year to achieve 1 NME market launch per year. Since companies normally aim for 2 to 5 launches per year, 18-45 Phase I starts are needed every year. This is not a realistic number, not even for the large pharmaceutical companies, as the authors note, which demonstrates the current malfunction in the R&D process.

Figure 1.4. R&D model by Paul et al. Model yielding the costs for the different stages in the drug discovery and development process and the overall costs for a single new molecular entity, in US$ million. Based on assumptions derived from industry benchmarks and internal data. Lighter shaded boxes: calculated values based on assumed inputs. Darker shaded boxes: inputs. The model does not account for investments in exploratory research (target identification and validation), post-marketing costs or overheads (an additional 20-30%). Source: (Paul et al., 2010)

The model helped identify the stages in drug design and development that account for most of the costs and in which cost savings or an increased probability of success will have the biggest impact on overall productivity. A parametric sensitivity analysis was then performed to explore the possible areas for productivity improvement (Figure 1.5). The analysis shows what was known before: attrition (failure of the NME, 1 – p(TS)) in late clinical development (Phase II and III) is the most important determinant of R&D efficiency, and productivity can be heavily increased by decreasing attrition rates (i.e. improving the probability of success for Phase II and III clinical trials). Unfortunately, attrition rates in these stages are increasing throughout the industry for various reasons, including the higher benefit-to-risk ratios and efficacy required and stricter scrutiny of drug safety¹. An analysis of the 16 companies accounting for 60% of global R&D expenses, for example, has shown that Phase II success rates have fallen from 28% in 2006/2007 to 18% in 2008/2009 (108 failures between 2008 and 2010) (Arrowsmith, 2011a), and the combined Phase III/submission success rate also lies below 50% (83 failures between 2007 and 2010 for the analysed group of pharmaceutical companies) (Arrowsmith, 2011b).
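The cost logic summarised in Equation 1 and Figure 1.4 can be made concrete with a minimal sketch. The per-stage costs and p(TS) values below are rounded, approximate figures in the spirit of the published baseline and should be treated as illustrative assumptions rather than the authors' exact inputs:

# Illustrative cost-per-launch roll-up in the spirit of Paul et al. (2010).
# All numbers are rounded assumptions used for illustration only.
phases = [
    # (stage, out-of-pocket cost per project in US$ million, p(TS) of the stage)
    ("Target-to-hit",       1.0, 0.80),
    ("Hit-to-lead",         2.5, 0.75),
    ("Lead optimisation",  10.0, 0.85),
    ("Preclinical",         5.0, 0.69),
    ("Phase I",            15.0, 0.54),
    ("Phase II",           40.0, 0.34),
    ("Phase III",         150.0, 0.70),
    ("Submission",         40.0, 0.91),
]

cost_per_launch = 0.0
for i, (stage, cost, _) in enumerate(phases):
    # probability that a project entering this stage eventually reaches launch
    p_to_launch = 1.0
    for _, _, p in phases[i:]:
        p_to_launch *= p
    projects_needed = 1.0 / p_to_launch   # WIP entering this stage per launch
    cost_per_launch += projects_needed * cost
    if stage == "Phase I":
        print(f"Phase I starts needed per launch: {projects_needed:.1f}")

print(f"Expected out-of-pocket cost per launch: ${cost_per_launch:.0f} million")
# With these inputs, roughly nine Phase I starts are needed per launch and the
# out-of-pocket total comes out near $870 million; capitalising that spend at an
# 11% cost of capital over the multi-year cycle times is what pushes the figure
# towards the ~$1.8 billion capitalised cost quoted in the text.

This simple roll-up reproduces the headline message of Figure 1.4: most of the money spent per launch pays for the candidates that fail, particularly in Phase II and Phase III.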
Figure 1.5. Parametric sensitivity analysis. This figure shows the impact of deviation from the baseline value for different model parameters. It is apparent that Phase II and III success rates have the biggest impact on the capitalised cost per launch, followed by lead optimisation cost. An increase of p(TS) for Phase II studies from 34% to 50% would, for example, reduce the capitalised cost by $400 million. From this sensitivity analysis it can be concluded that improving productivity is a multi-parameter optimisation project aimed at increasing clinical development success rates and reducing costs. Source: (Paul et al., 2010)

Attrition in Phase II is especially high because in this phase, apart from safety and efficacy testing, strategic decisions are also taken into consideration before moving on to the resource-consuming Phase III clinical trials. Arrowsmith's (2011a) analysis has shown that nearly one third of the strategic Phase II failures were for validated targets, and attrition seemed to be mainly due to inadequate differentiation from drug candidates by competitors in the same class, or from drug candidates with a similar indication in a different mechanistic class that are more advanced in the development pipeline than the failed candidate. Another 40% were for insufficient efficacy (vs. 66% of Phase III failures) and 17% for safety reasons (21% in Phase III). Poor bioavailability has, according to Paul et al., been only a minor cause of overall attrition (less than 10-20%) since the year 2000. It therefore seems obvious that major factors for attrition are lack of efficacy and overlapping R&D activity between different pharmaceutical companies, which could be overcome by better collaboration with competing companies up to a proof-of-concept stage (Arrowsmith, 2011a). This conclusion and these suggestions are backed up by other experts such as Patrick Vallance (GlaxoSmithKline's research chief) and Chas Bountra (Structural Genomics Consortium, University of Oxford, Chief Scientist) (Cressey, 2011). Paul et al., however, mainly blame increasing safety hurdles and the unprecedented nature of some drug targets for most of the failures. In his earlier analysis, Bernard Munos (2009) argues, on the other hand, that the pharmaceutical industry in countries with strict regulatory bodies such as the USA and UK tends to be more innovative and competitive than in others. This makes it hard to blame regulatory authorities like the FDA for the current productivity problem. It is clear, though, that safety requirements are much more demanding than in the past, leading to higher rates of pre- and post-marketing failure (Stevens and Baker, 2009).

¹ These are all characteristics that are, however, to the advantage of patients and payers, whose major concerns are effective and safe treatment. Furthermore, a new drug is only approved for marketing if the developing company can clearly show that the new treatment approach is more effective than the current gold standard, thus justifying a higher price.

There are several possibilities for decreasing attrition rates, and although they have been known for some time, none of them has yet been particularly successful. According to Paul et al. there are two key approaches that are useful to reduce clinical-stage attrition and are best performed in combination: (1) better target selection to use more validated and "druggable" targets, and (2) early proof-of-concept (POC) studies, especially in Phase I. By doing this, they suggest, not only attrition due to unforeseen biological effects attributable to a lack of knowledge about the target, but also attrition due to toxicity and lack of efficacy could be minimised. They state that since the majority of drug candidates are destined to fail, the main goal should be to make them fail quicker and cheaper ("quick win, fast fail") (Paul et al., 2010). The implementation of these suggestions at Eli Lilly and their impact on productivity are shown in Table 1.4.

Table 1.4 Novel development concept at Eli Lilly based on the "quick win, fast fail" paradigm.
The implementation of this concept into an alternative clinical development model at Eli Lilly (“Chorus”, Figure 1.6) has helped to increase the estimated probability for success p(TS) to about 50% and minimize cost for POC by nearly three-­‐quarters ($6 million vs. $22 million with traditional development). In his paper Paul et al. also argue that shifting investment emphasis on earlier stages in the development process (especially Phase I) the invested money can be used more effectively. For example, a shift of 25% of the Phase II attrition to Phase I can save $30 million which could be used to fund two additional Phase I WIPs and therefore increases the chances for a successful market launch. Figure 1.6. Eli Lilly’s alternative development model “Chorus”. CS: candidate selection, FED: first efficacy dose, FHD: first human dose, PD: product decision, POC: proof-­‐of-­‐concept Source: (Paul et al., 2010) 1.4 Conclusions This chapter has given an extensive overview over the process of drug discovery and development and the current problems associated with it. A financial model for the measurement of productivity in the process developed by Paul et al. from Eli Lilly was introduced. In this model it is apparent that the early discovery steps of target to hit and hit-­‐to-­‐lead account for only 9% of the total R&D expenditure (Paul et al., 2010) and can therefore be seen as unproblematic. The main stages at which major costs are inferred and risk of failure is high are lead optimisation and clinical development (ibid.). The main reason for late stage attrition is lack of efficacy in humans and toxicity. In this context the authors 11 The Pharmaceutical Industry Today state “any cost effective approach to predict toxicological liability or undesirable ADME2 properties preclinically [...] will reduce downstream attrition” (Paul, 2010). Given the 95% attrition rate in clinical development, a doubling of the current 5% success rate to 10% would result in the doubling of the annual NME output, as it was pointed out by Scott Lusher (2011, personal communication). Small improvements can therefore have great effects. In addition to that, Munos (2009) states that companies can be doing essentially the same kind of research, but have completely different NME output rates. From this he concludes that not only the methods are of importance, but also the ability to foster innovation. One of the ways in which this could be achieved is by better using the data acquired in the process for translating research results into relevant hypotheses and decision criteria. Since late stage failures and insufficient promotion of knowledge can therefore be seen as the main reasons for the current productivity problem, most parts of this thesis will be focusing on computational processes relevant to knowledge discovery, hypothesis support and potential predictive methods for early ADME/Toxicity prediction. There is so much wrong with the current business model in the pharmaceutical industry that it is very hard to find solutions that are easy to implement and to deliver a rapid cure. For example, only 10% of all R&D spending is for diseases that account for 90% of the global burden (Munos, 2006) and patient need and payer opinion is not always main priority in the development process (Mullard, 2011). Pharma has tried to increase its productivity by engaging into Merger & Acquisition (M&A) behaviour to fill short-­‐term gaps in the development pipeline and increase innovation. 
However, this normally resulted in downsizing and rationalisation of the merged R&D teams with the effect of destroying the research culture in the acquired company (Talaga, 2009) thus destroying the major asset that was acquired – the innovative science performed by the smaller company. The adoption of rational design by using genomics and other “omic”-­‐technologies also seemed attractive with the advent of new sequencing technology, microarrays etc., but is now understood to be much more complex than initially anticipated in order to find robust drugs that are not easily made redundant through emergence of resistance (Schiffer, 2008). The list of potential cures for the problem is nearly as endless and includes outsourcing, process optimisation, engagement into open innovation, Industry-­‐Academia collaborations, 2
ADME = absorption, distribution, metabolism and excretion of a drug 12 The Pharmaceutical Industry Today further investment into new technology, and virtualisation of the R&D process. Only the integration of all these different strategies will however allow to perform innovative science and information technology (IT) may act as the glue on the interfaces. 13 Chapter 2 Target Identification and Validation Every drug – no matter if it is a small molecule or a biopharmaceutical – has one or more biological targets which are mediating the drug’s functionality. In the pharmaceutical R&D process, a “target” is relevant to a specific disease and can be of different kinds including genes, proteins, miRNA, biological pathways, and essential nodes in a regulatory system (Yang et al., 2009)(see Table 2.1). Target identification therefore often stands at the beginning of the rational drug discovery process (Young, 2009) and improper target selection is seen as a major contributing factor to drug failure (Yang et al., 2009). Historically, drugs have been discovered in phenotypic screenings by their ability to induce a certain biological outcome and only then the MOA has been analysed (Sleno and Emili, 2008). This approach has been replaced by the more rational way of first identifying a druggable target and then finding suitable compounds for it. Advantages of this strategy are higher throughput and that the MOA of a hit is already known (ibid.). With the improvement of high-­‐throughput technology in gene sequencing, mass spectrometry, microarrays, and two-­‐dimensional gel electrophoresis “omics” approaches such as genomics, proteomics, and metabolomics have found their way into the drug discovery and development process opening up more ways for target identification and validation (Wishart, 2005) which are highly intensive in data analysis and knowledge mining. Table 2.1 Targets of approved drugs and estimates about the number of drug targets in the human body In order to evaluate what may be possible in the future of target identification a quick look in the past should be taken to see what the targets for currently approved and developed drugs are and why. Paul Ehrlich suggested the concept of receptors as selective binding sites for chemotherapeutic agents in the early 1870s. This was further improved by the insight of John Newport Langley that receptors are functioning as “switches” that can be toggled by specific signals, namely by agonists and antagonists, generating specific signals depending on the input (Drews, 2000). Little has changed since then in the target-­‐types for drugs: In Figure 2.1 it can be seen that G-­‐
protein coupled receptors (GPCRs) and enzymes form the biggest categories of drug targets for approved and newly developed compounds (Overington et al., 2006, Swinney and Anthony, 2011, Rask-Andersen et al., 2011). Other targets, for which the drug design process may be slightly different from the typical small molecule design process, are DNA, RNA and other targets inside cells (Young, 2009, Chapter 7). It is needless to say that some targets are disease- or tissue-specific and that there are many different MOAs (such as competitive inhibition of a free enzyme, uncompetitive inhibition of a protein-ligand complex, allosteric inhibition, agonistic binding, or antagonistic binding) (Young, 2009, Chapter 4).

Figure 2.1 Drug targets. a) Classes of 830 therapeutic targets (DrugBank May 2009): overview of the 435 targets encoded in the human genome as identified from 1,542 FDA-approved drugs in the May 2009 version of DrugBank by Rask-Andersen et al. The basis for this analysis were 985 drugs with identified therapeutic protein targets; an additional 328 drugs did not have known targets and 192 were addressed to non-human targets, mainly bacteria. b) Targets of first-in-class NMEs (approved 1999-2008): targets of the 50 first-in-class small molecule NMEs which gained marketing authorisation by the FDA between 1999 and 2008. GPCR: G-protein coupled receptor. Sources: left: data from Rask-Andersen et al., 2011; right: data from Swinney and Anthony, 2011

There are different estimates of the number of potential targets encoded in the human genome, ranging from 10,000 (1,000 disease genes with five to ten associated proteins) (Drews, 2000) to a much lower 600-1,500 (only 10% of genes are disease-modifying and only 10% are druggable, with no complete overlap of these two subsets) (Hopkins and Groom, 2002, cited by Kubinyi, 2008). According to a recent analysis by Rask-Andersen et al. (2011), only 435 of these genome-encoded targets have been used up to now. Additional targets exist through the formation of protein complexes and outside the human genome (e.g. in bacteria, parasites, and viruses), and again the target space is not yet fully explored (Kubinyi, 2008, Rask-Andersen et al., 2011). Strategies used to find new targets include target-based and diversity-based organic synthesis, mechanism-based strategies, active learning, target-deconvolution strategies (i.e. "identification of the molecular targets that underlie an observed phenotypic response" (Terstappen et al., 2007)), and bioinformatics (Phoebe Chen and Chen, 2008). The motivation behind the search for new targets is that although they are more risky than already validated ones, they are also not yet protected by patents and thus give the developing company freedom to operate and a potential marketing advantage (Young, 2009, Chapter 3).
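The two target-number estimates quoted above can be reproduced with a short back-of-the-envelope calculation; the figure of roughly 30,000 protein-coding genes and the 20-50% overlap fraction are assumptions introduced here purely to illustrate how the published ranges arise:

# Reproducing the published back-of-the-envelope target-number estimates
# quoted in the text (all inputs are illustrative assumptions).

# Drews (2000): ~1,000 disease genes, each with five to ten associated proteins.
drews_low, drews_high = 1_000 * 5, 1_000 * 10
print(f"Drews (2000): {drews_low:,} to {drews_high:,} potential targets")

# Hopkins and Groom (2002): ~10% of genes are disease-modifying and ~10% are
# druggable, with incomplete overlap between the two subsets.
genes = 30_000                          # assumed protein-coding genome size
disease_modifying = int(0.10 * genes)   # ~3,000 genes
druggable = int(0.10 * genes)           # ~3,000 genes
# The published 600-1,500 range corresponds to an overlap of roughly 20-50%
# of either ~3,000-gene subset.
low, high = int(0.20 * druggable), int(0.50 * druggable)
print(f"Hopkins and Groom (2002): {low:,} to {high:,} druggable disease targets")

Either way, the 435 targets that Rask-Andersen et al. (2011) report as actually exploited so far sit well below even the most conservative of these estimates, which is the point made above about the unexplored target space.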
Unfortunately, many drugs in use have a complex MOA, multiple targets and poorly understood pharmacology (Overington et al., 2006), which suggests that a focus on a single drug target may not be sufficient to develop a potent drug against a disease. Furthermore, rational drug design in the past has concentrated on finding single targets for a potential disease treatment, with the result that many drugs developed against a distinct target were rendered obsolete after some time by the acquisition of resistance-mediating mutations in that target (Schiffer, 2008). Finally, existing models (animal and cell-culture) are still very poor for many diseases, and proper target validation in these diseases is therefore not possible (Petsko, 2010). Better models or computational prediction tools are hence urgently needed. 2.1 Bioinformatics in Target Discovery and Validation The identification of potential drug targets and the validation of their relationship with a specific disease, as well as of their potential to mediate a change in the behaviour of diseased cells, is a major field in which bioinformatics is applied (Phoebe Chen and Chen, 2008). In order to perform structure-based drug design, information about the molecular biology of the disease is needed. This includes knowledge about the genes and proteins that are involved in the disease, the structure of the underlying biological network, features that differ between the healthy and diseased state, and the structures of macromolecular targets and complexes (Moustakas, 2008). Initially, methods are used to understand the disease dynamics and disease biology in detail. Once the molecular mechanisms are understood, a target can be chosen that seems to have the highest potential to elicit disease modification. In a validation process that normally continues into clinical development, the enzymatic, signalling and structural functions of the molecule are characterised, and the functional residues are mapped by site-directed mutagenesis, functional assays and bioinformatics in order to find the area where ligand binding is most likely (ibid.). Several bioinformatics techniques can be used in target identification and validation; they are described further in Table 2.2. It is important to note that all these methods are interlinked and complement each other (Phoebe Chen and Chen, 2008). 
Table 2.2. Bioinformatics methods in target identification and validation (adapted from Phoebe Chen and Chen, 2008). 
Target identification phase: 
• Target pattern discovery: determine the relationship between two or more entities such as genes, proteins, or both in combination, in order to infer the function of an unknown gene in a pattern or novel information about the relationship between several genes. Methods: emerging patterns, association rules. 
• Target gene identification: identify genes that are specifically associated with a particular disease or function. Methods: hidden Markov models, RNA interference, exact match and weight matrix scan, MetaPathwayHunter, naive Bayes. 
• Target classification: categorise genes, proteins or other targets into the correct class (e.g. genomes), distinguish genes from each other, and predict the class of previously unknown genes. Methods: profile hidden Markov model-artificial neural network (pHMM-ANN), mutual information. 
• Biomarkers (substances used as indicators of a biological state such as organ function or a specific disease): find key steps or genes in a specific biological process or stage in order to understand the disease better and explore its causes. Methods: web-based tools, data mining. 
Target validation phase: 
• Pathways: explore the processes that control and regulate living organisms, such as genes in pathogenesis, gene mutations etc. Knowledge about the relationships between different pathways can help foresee and understand potential side effects. Methods: computational analysis of microarray data, docking. 
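As an illustration of the machine-learning methods listed in Table 2.2, the sketch below shows how a naive Bayes classifier could be trained to flag candidate disease-associated genes from simple per-gene features (target gene identification). It is a minimal, hypothetical example: the features, the synthetic data and the scikit-learn dependency are assumptions of this illustration, not part of the cited methods.

# Minimal sketch: naive Bayes for target gene identification (cf. Table 2.2).
# Feature choices and data are synthetic and purely illustrative.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical per-gene features: [log2 fold change in diseased tissue,
# interaction-network degree, sequence conservation score]
X_known = np.vstack([
    rng.normal([2.0, 30.0, 0.8], [0.5, 10.0, 0.1], size=(50, 3)),   # known disease genes
    rng.normal([0.1, 10.0, 0.5], [0.5, 5.0, 0.2], size=(200, 3)),   # background genes
])
y_known = np.array([1] * 50 + [0] * 200)   # 1 = disease-associated

model = GaussianNB().fit(X_known, y_known)

# Score uncharacterised candidate genes and rank them for follow-up experiments.
X_candidates = rng.normal([1.0, 20.0, 0.6], [1.0, 10.0, 0.2], size=(10, 3))
scores = model.predict_proba(X_candidates)[:, 1]
for idx in np.argsort(scores)[::-1]:
    print(f"candidate gene {idx}: P(disease-associated) = {scores[idx]:.2f}")

In practice such a score would only prioritise genes for the experimental validation steps described above, not replace them.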
2.1.1 Obtaining the Protein Structure Once a protein is identified as a potential target, the time-intensive process of obtaining the target's protein structure and the structures of potential ligand complexes begins. This can be done either by X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy or by computational modelling (Moustakas, 2008). A homology model can be derived when the protein structure cannot be determined experimentally but the protein sequence and structures of molecules with a high sequence identity to the molecule of interest ("templates") exist. The model is built by assembling fragments from the template at places where they share high sequence identity with the target molecule (Young, 2009, Chapter 9). Although homology modelling is not the most exact strategy for predicting a protein structure and is not well suited to predicting complex molecules, it is sometimes the only way to obtain a three-dimensional (3D) protein model because the experimental methods require large sample quantities, X-ray crystallography depends on crystallisation of the sample, which is often not possible for transmembrane proteins like receptors and ion-
channels, and NMR spectroscopy is hindered by large molecules like proteins (Moustakas, 2008). According to Young (2009, Chapter 3) the best quality, however, is obtained by X-­‐ray crystallography and if NMR spectroscopy or homology modelling are needed they should be performed both to compare the results. As a last resort and only if a structure is urgently needed, but no structures from homologous proteins are known, virtual protein folding can be performed. This process starts from the primary structure of the protein (i.e. amino acid sequence) and is based on the insight that a protein structure normally is at an energy minimum. By computing as many conformations as possible it is hoped to find this minimum (Young, 2009, Chapter 3). This process is limited by several factors including the very high computational cost to calculate all conformers for a given protein, unknown accuracy of the obtained structure, and problems with obtaining structures for chaperone-­‐folded proteins (Young, 2009, Chapter 11). In order to obtain as much structural information as possible about the target and target-­‐
complexes, both experimentally derived and predicted structures of whole proteins and of potential binding sites are generated at this stage (Davis et al., 2008). This mixture of data sources and the variable quality of the obtained structures make further modelling prone to error propagation (Davis et al., 2008). On the other hand, using structure prediction methods means that at least some structure-based design can be performed until an experimentally derived structure becomes available, giving the developing company a head start over its competitors. 
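Since template choice largely determines homology-model quality, a common first step is to rank the available candidate structures by their sequence identity to the target. The following is a minimal sketch in plain Python; the pre-aligned sequences, the PDB identifiers and the 30% cut-off are hypothetical values used only for illustration.

# Minimal sketch: rank candidate homology-modelling templates by sequence
# identity to the target. Alignments, identifiers and threshold are illustrative.
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over two pre-aligned sequences of equal length ('-' = gap)."""
    assert len(aligned_target) == len(aligned_template)
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]              # ignore gap positions
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

target = "MKV-LLAGRTQW"                              # hypothetical aligned target sequence
templates = {"1abc_A": "MKVALLSGRTQW",               # hypothetical aligned PDB chains
             "2xyz_B": "MRV-LIAGKSQF"}

ranked = sorted(templates.items(),
                key=lambda item: percent_identity(target, item[1]), reverse=True)
for pdb_id, seq in ranked:
    pid = percent_identity(target, seq)
    print(f"{pdb_id}: {pid:.1f}% identity", "(usable template)" if pid >= 30 else "(too remote)")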
2.1.2 Target Characterisation Apart from knowing the target structure, much more information is needed to ensure that a robust and druggable molecule is used as a target (Young, 2009, Chapter 4). A computational technique called "docking", in which potential binding interactions are simulated, can be used to predict molecular contacts such as enzyme-substrate-, protein-
protein-­‐, nucleic acid-­‐protein-­‐, and nucleic acid-­‐small molecule-­‐interactions and thus help to complete knowledge about the disease-­‐underlying biological network and predict answers to questions like “Does protein A bind protein B? If so, does the complex AB bind C?” Furthermore, after a specific target molecule is selected, docking can be of importance during target validation by characterising the interaction of a protein and a known ligand or for solvent mapping (Moustakas, 2008). The main aim of docking at this stage is to plan and direct wet-­‐bench experiments. If done correctly this could lead to the reduction of needed experiments thus reducing the time and money needed in this phase. If the docking results are wrong, the resulting wrong hypotheses can misdirect the experimental follow up and thus lead to higher expenditure and time delay. Therefore accuracy of the docking prediction needs to be high and ligand as well as receptor side chain flexibility should be taken into consideration (Moustakas, 2008). Automated crevice detection can be used for protein targets if no crystallographic structure for the target-­‐ligand complex or a homologous complex exists and hence the active site of the protein is unknown. Based on the premise that the largest crevice on a protein’s surface is very likely to be the active site, these tools look for concave regions on the protein surface and categorize them by their size (Young, 2009, Chapter 4). According to Young (2009) these methods are very accurate, but the output needs manual verification. To further characterize the target mechanism it may be of interest to determine the reaction kinetics, the transition structures, the atom coordinates in the transition states, and molecular dynamics. Especially the determination of the exact position of all atoms at every time step in the reaction is impossible to determine in experiments, but can be simulated to a certain extent by quantum mechanical calculations (Young, 2009, Chapter 4). If the protein-­‐ligand complex is formed by an induced fit, i.e. the active site changes its conformation to bind the ligand, molecular dynamics methods may give valuable information as they can give an insight on how protein structure changes upon ligand binding and why substrate binding may fail (ibid.) 19 Target Identification and Validation 2.1.3
Quality Control Regular monitoring, quality control and reproducibility of biological experiments are also of major importance in a commercial laboratory (Sandy Primrose, personal communication). In addition to its other benefits, bioinformatics methods for data visualisation in form of plots, sequence alignments and so on allow to do this with computers in a quick, reproducible and painless way (Phoebe Chen and Chen, 2008). 2.2 Knowledge Discovery With the explosion of available data from published journals, in-­‐house and outsourced research entities and high throughput laboratories, it becomes hard for a pharmaceutical company to integrate the knowledge coming from the different sources and make it available to the scientists who need it (Waller et al., 2007). Computers are essential in finding the hidden knowledge in the haystack of data. Frawely et al. (1992) define the term of “knowledge discovery”, which is often used in this context, as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”. In target discovery mainly approaches of text mining and high throughput data mining (microarray data mining, proteomic data mining of mass spectrometry data points, and chemogenomic data mining) are of importance to identify disease-­‐associated entities and networks as well as potential targets and diagnostic markers (Yang et al., 2009). Since knowledge discovery is a process spanning the whole drug discovery and development process, a separate chapter in the end will cover more about the potential of knowledge discovery tools in productivity improvement. 2.3 Drug Reprofiling An alternative to the lengthy and expensive discovery and development of novel targets and drugs is “drug reprofiling” in which new targets for already approved drugs are searched. This practice has been used historically to find cheap drugs for developing countries, but with the current productivity problems in the whole pharmaceutical sector it may also be a good approach for diseases that have only small markets (Chong and Sullivan, 2007). As the nobel laureate James Black already stated: “The most fruitful basis for the discovery of a new drug is to start with an old drug” (Raju, 2000). By doing this, most steps of the discovery process can be omitted and clinical trials can be minimised as knowledge about the drug molecule, its pharmacokinetics and its safety exists already. 20 Target Identification and Validation Computational methods can help finding these new targets for a given drug by mining knowledge from reported side effect data, identification of similar gene expression profiles of different diseases, and analysis of structural similarity of the known target’s binding site and binding sites of potential other targets (Haupt and Schroeder, 2011). In addition to that, drug re-­‐profiling can be a good method to rescue drug candidates that have failed in clinical trials due to lack of efficacy since, again, much knowledge about the drug properties exists, its safety is established, but a suitable target is missing (Mizushima, 2011). Figure 2.2. Workflow for finding new targets with binding site similarity analysis. (i) the extraction of the known drug binding site from a 3D structure; (ii) the identification of similar binding sites (using the SOIPPA algorithm [72], which employs clique detection and a Cα representation); and (iii) docking to the putative target to assess atomic interactions. Problems: quality of data in PDB, ligand binding modes, etc. 
Source: (Haupt and Schroeder, 2011) 21 Target Identification and Validation So far most reprofiling projects have originated in chance observation and educated guesses3, today’s high-­‐throughput screening and computing capabilities make a more systematic approach to reprofiling desirable. Haupt and Schroeder (2011) review some success stories from computational binding site similarity exploitation. A workflow like the one displayed in Figure 2.2 for example was used to discover enoyl–acyl carrier protein reductase, an essential protein in M. tuberculosis, as a new target for entacapone. Although the original target (catechol-­‐O-­‐methyltransferase) and enoyl–acyl carrier protein reductase do not have similarities in their protein sequence their similarity in tertiary structure makes them both prone to inhibition by entacapone which could be used in treatment for multi-­‐
drug resistant tuberculosis strains (Kinnings et al., 2009, cited by Haupt and Schroeder, 2011). According to Chong and Sullivan (2007), only about 10% of the 10,000 drugs ever used in clinical medicine4 are covered by patents (Figure 2.3), which makes the creation of a compound library containing all these molecules and their major metabolites highly desirable. In their commentary in Nature, the two scientists challenge the scientific community to create a comprehensive clinical drug library and screen it against all neglected diseases by the end of 2011 (ibid.). The lack of more recent news about this challenge raises doubts that the approach was successful. Although there may be problems in creating such a library for wet-lab screening, creating it in silico should be easier. The DrugBank database (www.drugbank.ca) (Wishart et al., 2005, Knox et al., 2011), for example, already contains exhaustive information on 1,437 small-molecule drugs, 134 large-molecule drugs, and 5,174 experimental drugs (DrugBank, 2011). Using the available data about the drug structures, virtual screening could offer some guidance as to which drugs may have potential against which target, after which systematic validation experiments could be performed. Ideally, this could lead to a minimisation of wet-lab screening experiments for target and lead identification, a reduction of the lead optimisation and safety testing steps, and a shortening of clinical trials and marketing approval. 
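One of the computational repurposing routes mentioned above, comparing gene expression profiles of different diseases, can be sketched in a few lines of NumPy: if the differential expression signature of a poorly served disease correlates strongly with that of a disease that already has an approved drug, that drug becomes a repurposing candidate worth validating. The gene set, signatures and correlation threshold below are synthetic placeholders, not results.

# Minimal sketch: rank disease pairs by the similarity of their differential
# expression signatures (log2 fold changes over a shared gene set).
# All values are synthetic and purely illustrative.
import numpy as np

genes = ["TNF", "IL6", "EGFR", "TP53", "VEGFA", "PPARG"]

signatures = {
    "disease_A (no approved drug)": np.array([2.1, 1.8, -0.2, 0.4, 1.5, -1.1]),
    "disease_B (drug X approved)":  np.array([1.9, 2.0, 0.1, 0.2, 1.2, -0.9]),
    "disease_C (drug Y approved)":  np.array([-1.5, 0.3, 2.2, -0.8, -0.4, 1.6]),
}

query = signatures["disease_A (no approved drug)"]
for name, sig in signatures.items():
    if name.startswith("disease_A"):
        continue
    r = np.corrcoef(query, sig)[0, 1]    # Pearson correlation of the two signatures
    flag = "-> consider repurposing its drug" if r > 0.8 else ""
    print(f"{name}: r = {r:.2f} {flag}")

A real analysis would of course use genome-wide signatures and statistical significance testing rather than a fixed cut-off.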
3 For example, the use of sildenafil citrate (Viagra) in erectile dysfunction instead of hypertension. 
4 As approved by the FDA since 1938, listed in the FDA Orange Book 2006 or the 2006 Physician Desk Reference, approved outside the United States, or in at least phase II clinical trials. 
22 Target Identification and Validation Figure 2.3. Existing drugs and the proportion that is available commercially sorted by therapeutic category. Source: (Chong and Sullivan, 2007) Indeed, a group of scientists in Canada proposed a similar approach in which they suggest the screening of a set of targets against a known potent compound whose target is unidentified in a process like the one illustrated in Figure 2.4 (Mandal et al., 2009). Upon personal contact asking whether this approach is successful, a positive reply was received. Apparently the group was able to identify several targets and a paper about it will be published soon (Sanat Mandal, personal communication). Figure 2.4. Proposed steps for drug design process with known drug and unknown target. Source: Mandal et al. 2009 23 Target Identification and Validation According to Chong and Sullivan (2006) the costs for bringing a drug to market can be reduced by 40% in drug repurposing and even more when using computational predictions to guide screening and efficacy experiments rather than brute force high-­‐throughput techniques. According to the productivity model by Paul et al. (2009) (see Chapter 2) using a new target for an established compound can result in out-­‐of-­‐pocket cost savings of more than US$ 363 million in clinical development and a cycle-­‐time reduction of 50%5 while simultaneously improving the probability of success for a lead compound drastically. Additionally, the life time of patent protection and market size for a compound may be extended by this approach thus improving future revenues and the return-­‐on-­‐investment (ROI). Problems can be seen in the lack of patentability of the found compound and target. Since the molecules used in screening are prior art to the invention no or only very narrow patents can be filed or even royalty-­‐payments to the patent holder for the found drug may be required. Nevertheless, for diseases with a small market (orphan diseases, diseases of the developing world, etc.) and for screening with molecules for which the company already holds patents (e.g. failed candidates) this is a good way to save money in development and increase revenue. 2.4 Conclusions Several different approaches and their opportunities and limitations have been reviewed in this chapter. It is apparent that much potential exists for computational techniques, especially bioinformatics, in target identification and validation. A white paper sponsored by the software vendor Ingenuity has shown for example that the use of the computational tool “Ingenuity Pathways Analysis” results in shortening of target identification time, better prioritisation of the identified targets, increased validation, and confidence in the chosen one. It therefore resulted in 10% cost savings during this phase (Zimmerman et al., 2004). Similarly, the company Entelos which is developing platforms for integration of various data for modelling biological mechanisms, feedback loops and connections underlying a disease and its progression (PhysioLab®) can report some success stories in which a PhysioLab platform was useful for target characterisation (Entelos, 2003), 5
Reduced cycle time in discovery through higher probability of success and less testing through computer guided experiments: instead of 5.5 years only 2.75; No safety and only little pharmacokinetic testing needed in clinical studies: need for only large phase 2 trial: time savings 4 years, cost savings: US$ 363 million 24 Target Identification and Validation target prioritisation6 (Figure 2.5) (Entelos, 2004) and target validation7 (Entelos, 2005) with time savings of up to six months (Entelos, 2003). Computational methods in target discovery and validation can have a tremendous effect on overall productivity by reducing the risk of failure later on. Other ways in which they can help improve productivity are by reducing redundancy in research (drug repurposing, knowledge mining), informing laboratory experiments (Entelos PhysioLab® platform, docking) and make progress possible where it would not be otherwise achievable (homology modelling, virtual folding, molecular dynamics). Figure 2.5. Ranked List of Targets from Entelos. Each column is representing a potential target, the blue bar the predicted efficacy and the green diamond the mean. Text is not supposed to be legible. Source: (Entelos, 2004) 6
In over 20,000 simulations the efficacy of 30 potential rheumatoid arthritis targets was evaluated and the targets ranked to its predicted clinical impact for the pharmaceutical company Organon (Merck) and suggestions for in vitro and in vivo experiments were given. 7
Entelos confirmed phosphodiesterase 4 (PDE4) as an effective drug target for mild-­‐to-­‐moderate asthma treatment for Pfizer by identifying relevant pathways, elucidating PDE4’s mechanism of action and simulating clinical trials. Furthermore, specific biomarkers for patient response were identified and guidelines for additional laboratory experiments given. 25 Chapter 3 Hit and Lead Identification and Optimisation Once the company has identified a target, the actual process of finding a potent molecule against it begins: In hit identification efforts, the most active molecules are identified from a collection of compounds. Following that, the hits are validated in secondary screenings and those with the highest potential as future drugs are chosen as leads (“hit-­‐to-­‐lead”) (Table 3.1) (Goodnow, 2006). In the process of lead optimisation the lead structures are modulated in order to optimise their target affinity and secondary properties such as ADME and toxicity (Moustakas, 2008). Costs encountered in this process include chemical synthesis, compound purchase, library curation, biological screening of hundreds of thousands of molecules to identify hits, and further synthesis and screening to optimise the hits to leads (Kapetanovic, 2006). Table 3.1 Stages of the hit-­‐to-­‐lead process and the general criteria a hit needs to satisfy at each stage Assessing hits Identification of high quality hits Validating hits ‘A good lead’ Structure and purity confirmed Activity confirmed with powder sample Resolution and assay of chiral isomers Understanding mode of action Not a frequent hitter Prioritizing feasible chemistry for analogue synthesis Synthesis amenable to HTC Structure of lead-­‐
target complex Minimum toxicity alerts Potency often <10 μM Plausible SAR in 50–100 analogues Potency often <1 μM Minimum Lipinski rules violations Appropriate target selectivity No Lipinski rules violation Encouraging preliminary PK Solubility, permeability, log P calculated Solubility, permeability, log D measured Relative stability in microsomal and hepatocyte assays Low hERG channel binding liability Intellectual property issues assessed log D: 0–3 Aqueous solubility >100 μg/mL Acquiring similar Permeability (Caco-­‐2, commercial or historical MDCK, PAMPA): high analogues Low Cyp450, PGP liabilities Selectivity in enzyme and receptor panel assays HTC: high-­‐throughput chemistry, SAR: Structure-­‐activity-­‐relationship, PGP: P-­‐glycoprotein, Cyp450: cytochrome P450, MDCK: a cell line, PAMPA: parallel artificial membrane permeability assay, Source: (Goodnow, 2006) Computer-­‐assisted drug design (CADD, also known as in silico design) is gaining momentum in drug design and comprises all steps in the process of compound design and optimisation 26 Hit and Lead Identification and Optimisation that are using computers for data storage and organisation, process virtualisation like compound screening, and other predictive and modelling steps in downstream processes such as ADME and toxicity prediction. An overview over the different phases in CADD is given in Figure 3.1 and Appendix B. This process is fuelled by the ever increasing amount of data acquired by single experiments and the fact that drug design is a multidimensional task in which a compound needs to be found with sufficient activity, (oral) bioavailability, hardly any toxicity, patentability, with a sufficiently long half-­‐life in the blood stream, and with low manufacturing costs (Young, 2009, p.4). At the moment, however, CADD is only used as supporting technology to existing practices like HTS (Lusher et al., 2011) (see Table 3.2) although it has significantly contributed to the development of several drugs such as zanamivir amd amprenavir (Clark, 2006, cited by Lusher et al., 2011). Figure 3.1. In silico drug discovery pipeline. Adapted from (Zoete, 2011) Table 3.2 Experimental hit identification with high-­‐throughput screening (HTS) Traditionally the potency of a compound to bind to a target was determined in laboratory experiments such as phenotypic cell-­‐based assays in which the ability of a compound to change the cell’s phenotype is evaluated. This used to be done in targeted experiments for a curated library of compounds originating from a company’s previous drug development projects. Since the invention of high-­‐throughput screening (HTS) in the early 1990s the strategy has changed, however. HTS has taken over in drug discovery projects because much bigger compound libraries of millions of entities can be screened for potential binders in a short time (1-­‐3 month (Macarron et al., 2011)) and with a low cost per assay (US$ 1 per 27 Hit and Lead Identification and Optimisation compound (Finer-­‐Moore et al., 2008)) due to the much lower amounts of synthesis per compound (Moustakas, 2008). In HTS, automated biochemical assays for the biological activity of compounds against a chosen target are performed. This is done by contacting the target with a different compound from a compound library in each well of small multi-­‐well plates and evaluating the IC508 of each target-­‐compound combination. 
Using this approach typically each compound is tested once, in a single well and at a single concentration (Young, 2009, Chapter 18). For HTS to work, the used assay must be constructed in a way that it produces optical output that can be read out automatically and it must be robust against vibrations, shocks and temperature changes in the machine (Moustakas, 2008). The most important step for a HTS experiment is the composition of the compound library for screening to cover the chemical space in an appropriate manner. Computational methods can be helpful either by increasing the structural or chemical diversity of the library (for a broader coverage of the chemical space) or by enriching the library with similar scaffolds to a group of known binding ligands (pharmacophore) (Young, 2009, Chapter 8). When HTS is combined with combinatorial chemistry9 the parallel screening of billions of compounds is possible, but further analysis of the active ingredient remains needed (ibid). The strategy of fragment-­‐based screening in which not whole compounds but only fragments are tested for their binding affinity followed by coupling of several binding fragments is gaining interest because a larger diversity of molecules can be explored (ibid.) and a higher hit rate can be achieved since a fragment does not need to satisfy as many complementary interactions with the active site as a whole compound (Finer-­‐Moore et al., 2008). Other high-­‐throughput experiments include small-­‐molecule microarrays in which the small-­‐
molecule compound library is immobilised on a microarray and incubated with the protein target followed by tagging of the formed target-­‐compound complexes by fluorescent-­‐labelled antibodies (Duffner et al., 2007) and cellular microarrays for cheaper phenotypic screening and early toxicity testing (Fernandes et al., 2009). Typically, a sequence of biochemical and cellular assays is performed in order to obtain hits of good quality (Macarron et al., 2011). The failure of HTS that is currently perceived (Sandy Primrose, William Hamilton, William Bains, personal communication) by the inability to find more drugs faster has to be put in the perspective of the current lag time of 13.5 years (Paul et. al, 2010) between compound identification and drug approval. According to Macarron et al. (2011), 19 of the 58 drugs (33%) which gained marketing approval between 1991 and 2008 for which the initial lead was known originated from HTS campaigns and no difference in attrition rates is observable 8
IC50 is the half maximal inhibitory concentration, i.e. the concentration at which a compound inhibits 50% of the biological or biochemical function of a protein, and measures therefore the effectiveness of a ligand 9
A chemical technique in which thousands of compounds are synthesised in one pot and this combination is tested as a whole. Once a hit is obtained, the combination is analysed further to find the active ingredient. 
between hits from HTS and from other strategies. Considering that the microplate standards that made equipment manufacturing possible were only published in 1999 and that companies needed a certain time to optimise the process, this percentage is not that bad. It is now accepted that HTS alone is not the solution to the productivity problem (Macarron et al., 2011, Lusher, personal communication); a combination of the different hit-finding techniques (Macarron et al., 2011) as well as a better integration of this technique with other stages in the process (Lusher, personal communication) is needed to improve the quality of the leads and reduce attrition rates. 3.1 Virtual Screening High-throughput screening is already very fast and quite cheap, but in order to improve on it further and focus on the most relevant experimental assays, computational techniques are available to simulate parts of the process. Instead of finding a hit in experimental assays, screening experiments can also be simulated in silico, or hits suitable for the active site of the target can be designed directly. The same applies to the hit-to-lead and lead optimisation process, in which normally many lead derivatives are synthesised and screened for their efficacy and potential ADMET liabilities. Virtual screening has been shown to be more efficient than empirical screening, with hit rates10 that are two to three orders of magnitude greater, and it is capable of minimising false-negative and false-positive hits (Kapetanovic, 2006). 
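The efficiency gain that such pre-selection brings is usually expressed as an enrichment factor: the hit rate (compounds binding per compounds tested) in the virtually pre-selected subset divided by the hit rate of the unfiltered screen. The short sketch below only illustrates the arithmetic; the numbers mirror the 0.1% versus 10% example given later in this chapter and are not experimental results.

# Minimal sketch: hit rate and enrichment factor of a virtual-screening
# pre-filter compared with screening the full library. Numbers are illustrative.
def hit_rate(n_binding: int, n_tested: int) -> float:
    """Hit rate := number of compounds binding / number of compounds tested."""
    return n_binding / n_tested

def enrichment_factor(hits_subset, tested_subset, hits_full, tested_full) -> float:
    """How strongly the pre-selected subset is enriched in actives."""
    return hit_rate(hits_subset, tested_subset) / hit_rate(hits_full, tested_full)

full_hits, full_tested = 1_000, 1_000_000   # brute-force HTS: 0.1% hit rate
vs_hits, vs_tested = 1_000, 10_000          # docking/pharmacophore pre-filter: 10% hit rate

print(f"HTS hit rate:         {hit_rate(full_hits, full_tested):.3%}")
print(f"VS-filtered hit rate: {hit_rate(vs_hits, vs_tested):.3%}")
print(f"Enrichment factor:    {enrichment_factor(vs_hits, vs_tested, full_hits, full_tested):.0f}x")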
3.1.1 Compound Library Design The aspect that is most closely linked to experimental hit identification is the use of software and machine learning to design the compound library. The chemical space of molecules of drug-like size (i.e. fewer than 30 non-hydrogen atoms) is estimated to span more than 10^60 molecules11 (Finer-Moore et al., 2008) and is therefore too big to be covered exhaustively in experiments. The largest compound libraries in big pharmaceutical companies cover a maximum of about 10^7 molecules (ibid.) and tend to be heavily biased towards the chemistry that the company has explored historically (Sandy Primrose, William Hamilton, personal communication), thus omitting parts of the chemical space that could contain the next NME. To design a compound library for a screening experiment, computational techniques and artificial intelligence can be useful to evaluate the existing libraries for potential gaps and 
10 Hit rate := (number of compounds binding) / (number of compounds tested) 
11 If only one of each of those molecules was synthesised, this would already exceed the mass of the earth by a factor of 10^10 (Finer-Moore et al., 2008). 
find another library or group of compounds that are capable of filling these gaps. Furthermore, in order to reduce synthesis efforts or to filter commercially available libraries, it may also be of interest to reduce the library size while maintaining the chemical diversity and drug-likeness of its entities. This can be illustrated with an example given by Lloyd Czaplewski from Biota Europe Ltd.: Czaplewski and his colleagues have shown that the use of artificial neural nets (ANNs) can be of major help in enriching a library in scaffolds that are active against pathogens. By training an ANN on 2,500 antibiotic and non-antibiotic drugs, the group was able to create a screening library of 5,000 compounds enriched with antibacterial-like molecules (from a larger library of 2.5 million molecules) and to identify 13 hits, including a new class of type II topoisomerase inhibitors. This approach was presumably much quicker and cheaper than a comparable project by GlaxoSmithKline that did not use an enrichment strategy, in which 450,000 compounds were screened in 70 HTS runs (US$ 1 million each) over a period of 7 years to identify only 18 hits for future antibacterial drugs (Payne et al., 2007). Another way in which computers help to select compounds is by searching for similarity using mathematical descriptions of compounds such as molecular or side-chain fingerprints or pharmacophore descriptions. Furthermore, by using clustering algorithms12, sets of similar compounds can be grouped and only one representative structure from each group used for screening initially. If the screen is successful, the other structures can be evaluated further (Young, 2009, Chapter 8). By visualising the space covered by the chosen compounds, validation is simplified. 
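The fingerprint-and-clustering step just described can be illustrated with a short RDKit sketch that computes Morgan fingerprints, measures Tanimoto similarity and greedily keeps only compounds that are sufficiently dissimilar from those already selected, so that one representative per group of near-neighbours enters the first screening round. RDKit, the example SMILES and the 0.6 similarity cut-off are assumptions of this illustration, not part of the cited work.

# Minimal sketch: select diverse representative compounds for an initial screen
# using Tanimoto similarity on Morgan fingerprints (greedy sphere exclusion).
# Requires RDKit; the SMILES strings and the cut-off are illustrative only.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = [
    "CCOC(=O)c1ccccc1",      # ethyl benzoate
    "CCOC(=O)c1ccccc1O",     # a close analogue
    "c1ccc2[nH]ccc2c1",      # indole scaffold
    "CC(=O)Nc1ccc(O)cc1",    # paracetamol
]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]

representatives = []
for i, fp in enumerate(fps):
    # Keep the compound only if it is not too similar to an already chosen one.
    if all(DataStructs.TanimotoSimilarity(fp, fps[j]) < 0.6 for j in representatives):
        representatives.append(i)

print("Screen first:", [smiles[i] for i in representatives])

If a representative turns out to be active, its skipped near-neighbours can be retrieved and tested in a follow-up round, as described above.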
3.1.2 Pharmacophore A pharmacophore model (Figure 3.2) is an abstract concept that can be derived from known ligands of a specific target. It captures the common molecular interaction capacities of this group of binders towards the target structure, i.e. the three-dimensional arrangement of molecular descriptors (Young, 2009, Chapter 13). It can therefore be seen as the "largest common denominator" of the electronic and steric features required for a molecule to bind to the target and trigger its biological response (Kapetanovic, 2008). 12 
Clustering algorithms can also be useful in the data analysis of HTS results. By clustering sets of similar compounds it is possible to identify outliers which could be false-­‐positives or false-­‐negatives (Young, 2009, Chapter 18). For example all compounds with a similar structure but one show activity. The one that does not show activity therefore could potentially be a false-­‐negative hit. 30 Hit and Lead Identification and Optimisation Descriptors such as hydrogen-­‐acceptors or -­‐donors, hydrophobic groups, aromatic groups, or positive or negative ionisable groups are used to build a pharmacophore (Kapetanovic, 2008; Young, 2009, Chapter 13). The model can then be used in a similar way as substructure searches or similarity metrics to filter large databases for those with a reasonable probability of being active against the target prior to performing wet-­‐lab screening (Young, 2009, Chapter 13). a) b) Figure 3.2. Pharmacophore models. a) Pharmacophore based on known ligands, b) Pharmacophore in the active site of the target. Source: (Accelrys, 2011) Substructure searches are useful to find all compounds that contain an exact pattern of atoms and bonds in a database, whereas similarity searches are capable of finding very similar compounds that contain slightly different patterns of atoms and bonds such as slightly different ring systems which substructure searches wouldn’t find (Young, 2009, Chapter 13). Both groups of algorithms have however the disadvantage of relying on structural similarities in the compound backbone (ibid.). In the case where it is known that several active ligands exist that do not have structural similarity or common substructures these algorithms would fail finding them. A pharmacophore model would not fail because it is built on the common features of all known active ligands and thus is more general. A big advantage of pharmacophores is that they can be built without knowledge of the target’s 3D structure or the exact geometry of the active site by only relying on the common features of active compounds (Young, 2009, Chapter 13). A better quality model can however be obtained when more is known about the active site and this knowledge is used to build the pharmacophore in the cavity thus reducing potential error from small ligand training sets (ibid.). Once a pharmacophore model is established, it can be used to screen compound-­‐databases and rank-­‐order the compounds in them according to how well they fit the model and to exclude molecules which are very unlikely to bind to the target 31 Hit and Lead Identification and Optimisation (ibid.). The smaller size of the data set in the following steps can thus be performed faster and cheaper. Although pharmacophores are commonly used in the CADDD process because of their independence from the 3D target structure, several limitations have to be considered which reduce the applicability of pharmacophores as a sole strategy in hit identification: pharmacophore search can only be performed on databases with 3D structures of the ligands (which can be constructed virtually, but tend to be very large) and only the geometry in the database is checked. The tool is therefore easily affected by incorrect conformations (Young, 2009, Chapter 13). In ligand-­‐based drug design pharmacophores are therefore mainly used for rapid searching of large compound structure database (Young, 2009, Chapter 15). 3.1.3
Three-­‐Dimensional Quantitative Structure-­‐Activity Relationship (3D-­‐
QSAR) Three dimensional QSAR (3D-­‐QSAR) is a method to predict the interaction between a molecule and a protein site quantitatively even when no information of the active site exists. It requires a reasonable amount of activity data to be trained on and therefore cannot be used at the very early stages of discovery. When this data exists, however, this technique can give more accurate predictions of a compound’s biological activity than pharmacophores (Young, 2009, Chapter 15). The 3D-­‐QSAR model is trained on a set containing compounds and their known activities against the target (normally determined by biochemical assays). A grid is built surrounding one of the known active compounds and the steric and electrostatic interactions of an imaginary probe atom are computed for various positions on this grid. This process is iterated for all training molecules and a “partial least squares algorithm” then tries to predict the spatial arrangement of features in the binding site that may lead to interactions with the established active molecules (ibid). Although a 3D-­‐QSAR model can only perform predictions based on the chemical space covered in the training set compounds, it is seen as the most accurate prediction method in ligand-­‐based design once enough activity data is available to set up such a model (ibid.). 32 Hit and Lead Identification and Optimisation 3.1.4
Docking The use of docking in lead identification and optimisation is manifold13. Figure 3.3. Principle of Docking. 3D structures of compounds that could be ligands to the protein in the middle are positioned in the binding pocket as can be seen with one ligand in the centre. Following the docking the binding energy is calculated and the compounds can be ranked according to their target-­‐affinity. Source: http://vds.cm.utexas.edu/ In docking, 3D structures of potential ligands are automatically placed into the presumed active site of a 3D model of the target protein (Kubinyi, 2008) (Figure 3.3) and a score for binding affinity is calculated (Warren et al., 2008). This score approximates the binding free energy of the tested ligand to the receptor (Moustakas, 2008) by calculating some sort of energy (usually not including an entropy term) (Young, 2009, Chapter 12). For prediction of a molecule’s binding mode to the target’s active site, docking can be applied using different ligand conformations and orientations (“virtual crystallography”) (Warren et al., 2008). Apart from binding mode prediction (BMP), docking is mainly applied in virtual ligand screening (VLS or VS) which is the simulated version of HTS. In this context it can be used in different ways which are illustrated in Figure 3.4. By now a plethora of different docking 13
Due to the restrictions for the length of this thesis, just concepts and some major advantages and disadvantages are discussed. The interested reader can find much more information about all theoretical and practical aspects of docking in the books “Computational and Structural Approaches to Drug Discovery” (Stroud and Finer-­‐Moore, 2008) and “Computational Drug Design” (Young, 2009) and a comprehensive list of software packages at the Swiss Institute of Bioinformatics (http://www.click2drug.org/)(Zoete, 2011). 33 Hit and Lead Identification and Optimisation and scoring algorithms and software suites exists, all having their strengths and weaknesses in accuracy and runtime. Figure 3.4. The three general workflows applied in virtual screening. Apart from screening the whole library by applying docking more focused approaches can be chosen. These have the advantage that the final docking results are more precise. Source: (Beuscher Iv and Olson, 2008) In VS, the docking of each compound should not take more than a few minutes of processor time in order to screen huge libraries in an acceptable time. Algorithms for this purpose make simplifications such as maintaining full ligand flexibility while keeping the receptor structure rigid (Moustakas, 2008). This obviously will result in errors in the prediction and in order to achieve higher-­‐quality hits more accurate docking is performed as a last step in a VLS experiment as the computing power needed for this step is even in the times of massive parallelisation of computations on server farms and web grids rate-­‐limiting (Beuscher Iv and Olson, 2008). Several benchmarking studies for the ability of docking to identify the best compounds have been performed by now. The results are reviewed by Warren et al. (2008) and summarised in Table 3.3. Although the study design has been different and direct comparison is therefore not possible, a general trend is detectable: the performance of every docking algorithm varies highly depending on different targets. In addition, scoring 34 Hit and Lead Identification and Optimisation functions were not capable of correctly ranking the compounds according to their binding affinity; neither could they predict the ligand binding-­‐mode correctly. Docking may however be useful in identifying potentially active compounds since the enrichment of active compounds near the top of the docking-­‐score-­‐ordered list was greater than random (Warren et al.,2008). This can lead to the conclusion that manual validation and the use of several docking and scoring algorithms are needed in order to gain high-­‐quality results which contradicts the initial motivation of docking in speeding up the lead identification process. Some savings may however be possible by using docking to enrich screening libraries with compounds that could bind and remove compounds which are predicted to perform badly thus decreasing HTS library sizes and increasing the hit-­‐rate of HTS. Table 3.3 Docking algorithm benchmark results Study Author Study design Result Kellenberger et al. BMP and VS accuracy: 8 docking algorithms, 100 protein complexes • At ≤ 2Å root mean square deviation (RMSD) success rate of top-­‐ranked compounds to reproduce experimentally determined binding mode: 25-­‐55% • The algorithms that were most successful in BMP performed also best in VS Kontoyianni et al. Independent evaluation of BMP and VS: 5 docking algorithms, 69 protein-­‐
ligand complexes for BMP and 6 targets with 8-­‐10 active compounds for VS • Success rate for BMP in 2Å RMSD range: 0-­‐25% (individual docking programs performed differently on different proteins) • Variability in performance of algorithms in VS for different targets • Correlation between VS performance and BMP Cummings et al. Docking algorithms for VS: 5 protein targets, 5-­‐14 active compounds per target • BMP accuracy: 5-­‐32% • Algorithms able to identify known ligand at rates better than random for at least one target each, but also worse than random for at least one target Perola et al. Binding mode reproduction and VS accuracy: 3 docking algorithms, 200 complexes for BMP 3 protein targets with 140-­‐250 compounds per target for VS • Algorithm and scoring performance was system dependent • No correlation between accuracy of BMP and VS enrichment 35 Hit and Lead Identification and Optimisation Study Author Study design Result Wang et al. Binding mode reproduction: 11 scoring functions, 100 complexes plus different poses of these complexes • Correct identification of binding mode: 26-­‐76% Chen et al. BMP and VS accuracy: 4 docking algorithms, 164 publicly available complex structures for BMP VS: 2734 active compounds, 4 protein targets from public domain + 8 targets from AstraZeneca, decoy compound database: 20,000 commercially available drug-­‐like compounds (expanded to 40,000 by including tautomers and stereoisomers) • Binding mode reproduction: 43-­‐91% • VS performance varied depending on algorithm and protein target • No correlation between good BMP and VS enrichment • 3D ligand-­‐based methods had comparable or better performance than docking algorithms Warren et al. BMP, VS and rank ordering compounds by affinity: 10 docking algorithms, 8 targets, 7 cognate sets of 140-­‐200 compounds BMP: 136 structures (90% not in public domain) • Binding mode reproduction: 0-­‐43% • VS performance: highly target dependent • No or low correlation between BMP and VS performance • No algorithm was able to predict affinity or rank-­‐order compounds by affinity RMSD: root mean square deviation. Data from (Warren et al., 2008) Problems with the accuracy of docking can be explained by the lack of high-­‐quality data for training the prediction tools (Warren et al., 2008). This can be illustrated by the low availability of relevant protein structures (see Figure 3.5): although the protein database PDB lists over 70,000 published structures, for some of the most interesting drug targets such as GPCRs hardly any structure or only fragment structures are available. The reason for that phenomenon is that NMR spectroscopy and X-­‐ray crystallography have special requirements such a maximum protein size or the formation of protein-­‐crystals. Often complete proteins are too big for NMR spectroscopy and highly hydrophobic proteins such as receptors and other trans-­‐membrane molecules are impossible to crystallise in sufficient amounts. Additionally, the quality of the 3D structure is highly dependent on the experience of the crystallographer, crystallization conditions and the resolution; a resolution of at least 1.5Å is for example needed to create a model that is to more than 95% a consequence of the observed data (Davis et al., 2008). Since the development of 36 Hit and Lead Identification and Optimisation docking algorithms and scoring functions relies on these inaccurate structures inaccuracies in the predictions are inevitable. 
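The binding-mode benchmarks in Table 3.3 count a docked pose as correct when its root mean square deviation (RMSD) from the crystallographic pose is at most 2 Å. A minimal NumPy sketch of that criterion is given below; it assumes the predicted and experimental poses contain the same atoms in the same order, and the coordinates are invented for illustration.

# Minimal sketch: RMSD between a docked ligand pose and the crystallographic
# pose, as used in the <= 2 A success criterion of docking benchmarks.
# Assumes identical atom ordering in both poses; coordinates are invented.
import numpy as np

def rmsd(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
    """Root mean square deviation between two (n_atoms, 3) coordinate arrays."""
    diff = pose_a - pose_b
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

crystal_pose = np.array([[0.0, 0.0, 0.0],
                         [1.5, 0.0, 0.0],
                         [2.2, 1.2, 0.3]])
docked_pose = np.array([[0.3, -0.2, 0.1],
                        [1.8, 0.1, -0.2],
                        [2.0, 1.5, 0.6]])

value = rmsd(docked_pose, crystal_pose)
print(f"RMSD = {value:.2f} A ->", "acceptable pose" if value <= 2.0 else "mispredicted pose")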
Other limitations of docking exist when the target is not a protein, when the 3D structure of the protein target is not known, and in finding drugs whose mode of action is something other than binding to the active site (Young, 2009, Chapter 12). 
Figure 3.5. Growth of the Protein Data Bank (PDB), illustrated by the number of published structures (yearly and total) as of 13/08/2011. Source of data: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100 
Although docking has all these limitations, it is, according to Young (2009, Chapter 12), "the primary workhorse in structure-based drug design". Conversely, according to Scott Lusher (personal communication), docking works well but is not yet integrated appropriately into the process, and the inability to accurately predict binding modes has a serious negative impact on the credibility of computational chemistry. Furthermore, Warren et al. (2008) find that the "greatest potential for immediate impact on drug discovery lies in the development of docking algorithms and scoring functions that enable computational times consistent with the time required for chemical synthesis and computational predictiveness accurate enough for lead optimisation." The maximum benefit from docking is therefore still to be expected. 
3.2 De Novo Design Apart from the mere enrichment14 of HTS screening libraries and the virtualisation of the screening process, other techniques exist that try to design a ligand molecule de novo. Some of these techniques resemble fragment-based screening. Depending on the de novo approach used, new chemotypes or scaffolds can be created (Kapetanovic, 2008) by joining atoms or small functional groups in the active site of the target (Young, 2009, Chapter 17) (Figure 3.6). The design programs generally follow the outline of first analysing the active site of the target or a known pharmacophore, then building molecules that fit into it, and finally sorting and selecting the best candidates (ibid.). For positioning the small fragments in the binding pocket, docking algorithms are often used, and scoring functions like the ones in VS are applied to evaluate the new molecule created by joining the fragments (Moustakas, 2008). The different approaches to de novo design are further described in Table 3.4. Typically, all of them are used in a project to maximise the diversity of the compounds created (Young, 2009, Chapter 17). Figure 3.6. De novo evolution. Source: (Huang et al., 2010) 14 
Example: HTS of large set of compounds results in 0.1% hits above a certain activity threshold. If docking or pharmacophore study is done first to select which compounds from the library to physically screen and the subsequent screen finds 10% in the smaller library to be active, a 100-­‐fold enrichment in the results has been achieved. 38 Hit and Lead Identification and Optimisation Table 3.4 De novo design approaches Algorithm Description Characteristics Fragment-­‐joining Dock many molecular fragments into Tendency to find solutions with the binding site and then join the best many rings which makes them ones together to form a single more drug-­‐like; molecule Chain-­‐growth Start building the compound from one Gives chemically sensible starting fragment by adding functional compounds, but problems groups or atoms when atoms only link without interacting with the active site Two-­‐pass First build carbon-­‐backbone and Higher runtime optimise the solution in a second round by adding or modifying groups to maximise electrostatic interactions with the binding site Data from (Young, 2009, Chapter 17) De novo design is mainly applied for suggesting classes of compounds that could fit into the active site to the chemist for further exploration and integration in other screening approaches (ibid.). This is of essential importance in situations where the chemical space should be screened rationally and to avoid the typical bias of medicinal chemists towards molecules they have worked with before. De novo compound design has its limitations, though: the possible number of compounds that could be created with the combinatorial algorithms is simply too big to generate and test all of them in a brute-­‐force approach. Therefore simplifications are used, often in the form of heuristics and knowledge is integrated by using artificial intelligence, genetic algorithms or Monte Carlo approaches (Young, 2009, Chapter 13). Furthermore, there is the risk that compounds are created that are not chemically stable, synthesisable or do not satisfy drug-­‐likeness or ADMET criteria. To reduce the number of such compounds modern de novo-­‐tools include a final step for evaluating these criteria, thereby partially removing this problem (ibid.) 39 Hit and Lead Identification and Optimisation 3.3 Lead Optimisation and Selection In the lead optimisation phase normally secondary assays are performed to evaluate the ADME/Toxicity and biological selectivity of the lead compounds to find the optimal lead (Figure 3.7). According to Moustakas (2008), there are many different optimisation strategies available of which the following one is the most typical: In an iterative process derivatives of the lead structure are designed and synthesised. They are then screened against several assays for their primary and secondary properties. The results are used to build a quantitative structure-­‐activity relationship model (QSAR) (see next subsection) which is used to design the next set of derivatives. In this process the primary way in which productivity can be improved is by reducing the number of syntheses needed to find the best drug candidate. Figure 3.7. Pharmacokinetic properties of a drug. A drug molecule has to overcome several barriers in the organism to reach the target. Only if it reaches the target at a certain threshold concentration efficacy may be detectable. 
Additionally, a drug molecule may bind to other proteins which reduces the amount of active drug in the organisms and may lead to side effects of the treatment. “Pharmacokinetics” therefore studies the fate of a drug in the body which includes absorption, distribution, metabolism and excretion (ADME). Normally a certain dose of a drug is given orally and therefore passes the gastro-­‐intestinal system where it is dissolved and taken up through the gut wall. It passes the liver where most part of the dose is metabolised before some drug molecules enter the blood circulation. The fraction of drug reaching this step is called “bioavailable” as it can then be taken up by tissues and organs. The drug is mainly cleared from the organism via liver or kidney metabolism. The time needed for a drug’s concentration to be reduced by 50% in the plasma is called “half life” and is dependent on the clearance rate as well as the volume of distribution (including drug molecules bound to tissues and proteins and unbound molecules).The correct dose for a drug to be effective in an organism therefore depends heavily on its absorption rate, bioavailability and clearance parameters. Since computational models can be used to model the different pharmacokinetic properties of a drug candidate they can also be used to predict suitable dosing regimens which can reduce the amount of animal studies and improve trials in humans. Source: (van de Waterbeemd and Gifford, 2003) For the process of lead derivative design, often the methods of the lead design process are used. As the structural biology of the lead, target, and lead-­‐target complex is normally known at the stage of lead optimisation, it is possible to produce very accurate models and predictions of the binding free energy for the lead structure and its derivatives with the active site of the target. The library of derivatives can then be reduced by the compounds that are predicted to have a lower binding energy than the original lead. Therefore the number of compounds needing synthesis can be already drastically scaled down. In order to 40 Hit and Lead Identification and Optimisation produce docking results with such high accuracy, however, the docking algorithm must consider full ligand and receptor side chain flexibility as well as some receptor backbone flexibility (Moustakas, 2008) which results in a computationally very intensive process with its own cost and time requirements. Structural bioinformatics can be employed in the modelling of ADMET properties. For example, 90% of all drugs in use are metabolised by 7 of 57 known human cytochrome P450 isoforms in the liver (Pitt et al., 2009). Iterative docking and pharmacophores of the known drugs for each of the seven isoforms can be used for the prediction of P450 metabolism and inhibition by a lead compound. Similarly, it is now possible to predict some effects of a compound on plasma proteins and endogenous transporters such as the human Ether-­‐à-­‐go-­‐go Related Gene (hERG) potassium channel. Inhibition of hERG can lead to cardiac arrhythmia and therefore it is desirable to predict unwanted inhibition by the lead molecule in advance to prevent later failure. Unfortunately, the complete 3D structure of the channel is not yet resolved, but site-­‐directed mutagenesis, structural bioinformatics and similarity analysis of known inhibitors have improved knowledge about hERG binding and channel activity (ibid.). Furthermore, it is imaginable that the use of homology modelling etc. 
can be useful in identifying proteins that are similar to the target and therefore potential off-­‐target interactions of the lead compounds. Some people think that ADMET should be tested even earlier in the pipeline (Kreatsoulas et al., 2006, Clark, 2008, Stevens and Baker, 2009). James Stevens and Thomas Baker from Eli Lilly for example suggest a workflow for such testing that expands the use of predictive approaches and animal testing early in discovery whilst chemical diversity of the considered compounds is still high (Figure 3.8). 41 Hit and Lead Identification and Optimisation Figure 3.8. Workflow facilitating early risk assessment. Two learning loops are described, an inner loop through which in silico data are validated with surrogate results (gray box) and an outer loop for validation of the in silico and surrogates predictions in pilot in vivo rodent toxicity studies (yellow box). The focus is from target entry to lead declaration. IND: investigational new drug application: NDA, new drug application Source: (Stevens and Baker, 2009) The first generation of in silico models had two major disadvantages: most models did not include warning if the compound was remote from the molecules used to train the model and some aspects of ADMET such as excretion processes remain untouched from modelling attempts. This has now changed slightly and a second generation of in silico models may include these aspects in the future, but the paucity of publicly available data makes it hard for research groups outside the large pharmaceutical companies to develop new models. This is further complicated by the lack of understanding of the complexity underlying the basic ADME processes (Clark, 2008). Physiology-­‐based pharmacokinetic (PB-­‐PK) modelling is another method for predicting ADME characteristics of a molecule which integrates structural characteristics into the anatomic and physiological context of the human organism. In such models the organism is described by tissue compartments characterised by structure, volume, and composition and perfused by known blood flows. Functions of time, mass balance equations and 42 Hit and Lead Identification and Optimisation differential equations can then be used to describe elimination and other processes (Burton et al., 2006). 3.3.1
3.3.1 Quantitative Structure-Activity Relationships (QSAR) The discipline of QSAR15 can be considered the very first computer-aided approach in drug discovery; it was started in the 1960s by Corwin Hansch and his post-doc Toshio Fujita. Based on the observation that neither very lipophilic nor very polar molecules tend to pass the several hydrophobic and aqueous phases on the way to the target, Hansch formulated a lipophilicity relationship for the transport and a linear model relating binding free energy to lipophilicity terms, electronic parameters, molar refractivity, and steric parameters (Kubinyi, 2008). Nowadays QSAR is seen as a method for finding a simple equation that predicts a property from the compound's molecular structure. The model is built by curve-fitting software that computes the coefficients of the equation; the coefficients are weights for a given set of molecular properties ("descriptors"). These descriptors can be of various kinds, such as molecular weight, the number of double bonds or ionisation potential, and are classified into the groups of constitutional, topological, electrostatic, geometrical, or quantum chemical descriptors (Young, 2009, Chapter 14). Similar to ANNs, a QSAR model is set up by training with a set of compounds of experimentally determined activity to find the descriptors that best explain the activity. Compared to ANNs, QSAR models tend to be much better at extrapolating to properties beyond those of the training set (ibid.). QSAR is best used to compute nonspecific interactions of a compound with its environment, such as normal boiling points, passive intestinal absorption, blood-brain barrier permeability and other pharmacokinetic features. Additionally, QSAR models exist for different types of toxicity such as mutagenicity, carcinogenicity, hepatotoxicity, cardiac toxicity, teratogenicity, bioaccumulation, bioconcentration, acute toxicity and maximum tolerated dose (Young, 2009, Chapter 19). Despite all the potential of this technique, Kubinyi (2008) claims that QSAR and 3D-QSAR are not accepted by medicinal chemists because they require detailed statistical knowledge and much practical experience, and also tend to "over-fit" the training data, with the result of poor predictability in real life. According to Young (2009), however, QSAR models are an important part of in silico design.
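As an illustration of the curve-fitting step described above, the sketch below fits a linear QSAR model to a handful of descriptor values (logP, molecular weight, hydrogen-bond donors). Both the descriptor values and the activity data are invented for the example; a real model would use curated descriptors and a far larger training set.

```python
import numpy as np

# Hypothetical training set: rows = compounds, columns = descriptors
# (logP, molecular weight, H-bond donors); y = measured activity (pIC50)
X = np.array([[1.2, 250.0, 1],
              [2.3, 310.0, 2],
              [3.1, 350.0, 1],
              [0.8, 210.0, 3],
              [2.9, 330.0, 2],
              [1.7, 280.0, 1]])
y = np.array([5.1, 6.0, 6.4, 4.7, 6.2, 5.5])

# Least-squares fit of y = w0 + w1*logP + w2*MW + w3*HBD
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and descriptor weights:", np.round(coef, 3))

# Predict the activity of a new (hypothetical) compound
new = np.array([1.0, 2.0, 300.0, 2])   # leading 1.0 multiplies the intercept
print("predicted pIC50:", round(float(new @ coef), 2))
```

The same pattern, with more descriptors and regularised or non-linear regression methods, underlies most modern QSAR tools.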
15 Not to be confused with 3D-QSAR, which has a similar name but is used in a completely different context. Figure 3.9. An integrative workflow for ADME and toxicity prediction using bioinformatics and systems biology. Source: (Bugrim et al., 2004) An integrative workflow using QSAR models together with methods for metabolic pathway reconstruction, visualisation tools, and experimental data could finally be used to select the most promising leads on the basis of a variety of information sources, thus better validating their selection (Bugrim et al., 2004) (Figure 3.9).
3.3.2 Accuracy of Predictive Tools Unfortunately, ADME prediction tools so far only reach an accuracy of 60-70%, which means that roughly one in three predictions is incorrect. When consensus scoring over all available tools is used, accuracy can be improved and a qualitatively correct result can be obtained in 85% of the tests. Toxicity models perform slightly better than ADME tools since they are always developed for a specific type of toxicity (Young, 2009, Chapter 19). 3.4 Scaffold Hopping Another practice that uses a wide range of computational methods is called "scaffold hopping"; it works by replacing parts of the lead molecule's scaffold with another fragment (William Hamilton, personal communication). The motivation for doing this lies in (1) increasing chemical diversity, (2) improving solubility by replacing a lipophilic scaffold with one of higher polarity, (3) improving stability and toxicity, (4) forming a more rigid molecular backbone to maximise binding affinity, or (5) creating a novel and thus patentable structure (Böhm et al., 2004). Several methods, including pharmacophore searching and fragment replacement, can be used for this purpose (as illustrated in Table 3.5).
Table 3.5 Methods for scaffold hopping
Shape matching. Pros: fast, high success rate for relatively small or rigid compounds. Cons: requires knowledge about the bioactive conformation; the relative importance of functional groups is not specified.
Pharmacophore searching. Pros: a rational approach yielding clear answers, based on a maximum of information. Cons: requires knowledge about the bioactive conformation and alignment.
Fragment replacement. Pros: can be performed on 2D or 3D structures; high success rate. Cons: calculations might yield many or no results depending on the tolerance; results are difficult to rank; the degree of novelty depends on the query.
Similarity searching. Pros: fast and always applicable. Cons: high degree of uncertainty because of the high abstraction from the chemical structure.
Adapted from (Böhm et al., 2004)
The methods described in the previous sections are further illustrated in case studies about two biotechnology companies, Prosarix (Table 3.6) and Optibrium (Table 3.8).
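Before turning to the case studies, the similarity-searching entry in Table 3.5 can be made concrete with a short sketch using the open-source cheminformatics toolkit RDKit: 2D fingerprints abstract away the exact scaffold, so compounds with different cores but similar substructure patterns score highly. The query and library compounds below are arbitrary well-known drugs chosen only as placeholders.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

# Hypothetical query (a stand-in "lead") and a small candidate library
query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")          # aspirin as placeholder lead
library = {
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
    "ibuprofen":   "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    "caffeine":    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

# Morgan (circular) fingerprints encode local atom environments as bits
fp_query = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
for name, smi in library.items():
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_query, fp)
    print(f"{name:12s} Tanimoto similarity = {sim:.2f}")
```

In a scaffold-hopping setting, candidates with a high similarity score but a different core ring system would be the interesting ones to inspect further.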
Table 3.6 Case Study Lead Design: Prosarix Ltd. Prosarix Ltd. is a privately owned, worldwide operating bioinformatics company in Cambridge (UK) providing structure-based and ligand-based drug discovery services using its proprietary platform ProtoDiscovery™. It has more than 20 customers, for which it predominantly performs structure-based hit-identification tasks, but it also has competence in in silico pharmacological re-profiling and biosynthetic feasibility studies. Apart from consultancy services, predominantly for targets where the in-house discovery in the partner company has failed, this proprietary platform is also used for the company's own early-stage discovery projects, one of which has already led to a patent for a novel 5-HT1A agonist. This case study will mainly discuss the potential of Prosarix' novel hit-finding approaches and the successful in-house discovery project, as this illustrates how the knowledgeable use of in silico methods can improve the hit- and lead-identification process. The different cornerstones of the Prosarix Discovery Platform are illustrated in Figure 3.10. Figure 3.10. The main cornerstones of the CADDD process performed by Prosarix Ltd. The core of the discovery process is formed by the ProtoDiscovery™ Suite, which is based on a C++/C# framework. The ProtoScore™ scoring system is an empirical scoring function developed at Prosarix that includes parameters for ligand desolvation, receptor and ligand flexibility, cation-pi interactions, sulphur-aromatic interactions, van der Waals and hydrogen bonds, and aromatic and heterocyclic ring interactions. Benchmarking with the test set provided by Wang et al. (2003) has shown a better performance of this scoring method compared to others, with a Spearman correlation coefficient (rs) of 0.78 for the correlation of binding score and experimentally determined binding affinities. This correlation is significantly higher than the 0.66 of X_Score, the best scoring function in the 2003 analysis. Additionally, retraining the scoring function for new projects allows knowledge relevant to the project to be built in and further improves the scoring capabilities. ProtoScreen™, the virtual screening application, is capable of screening compounds in a few seconds by applying rigid-body docking and shape fitting. The database that is normally scanned contains 15 million commercially available compounds and is extended by conformer enumeration of the compound molecules. In addition, the platform contains a toolkit for designing ligands de novo. The de novo tool ProtoBuild™ has the unique feature that it combines chain growing and fragment joining in one process and uses ProtoScore™ affinity prediction for scoring the compounds found. It allows protein flexibility and infinite run modes. Additionally, it is possible to use a known ligand as a starting point; the program then trims and rebuilds new ligands to occupy a given threshold %volume of the active site. AutoStere™ is a novel approach to scaffold hopping in which not a single selected part in a single ligand is replaced, but any part in multiple ligands, to allow the rational reuse of already de-risked scaffolds. For ligand-based design two different tools have been developed, which use Support Vector Machines (SVM) for screening based on known active compounds (ProtoClassify™) or are based on the hypothesis that molecules with similar electrostatic fields also have similar activities (ProtoShapeES™). Source: William Hamilton, personal communication The discovery toolbox ProtoDiscovery™ contains tools for performing most of the computational methods that were discussed before, including docking, pharmacophore modelling, and de novo design, which makes it applicable in the entire process of hit-to-lead and lead optimisation as well as for drug re-profiling.
As mentioned before, the platform is not only used for discovery services, but also to perform some of the company's own early-stage discovery. One such project considered thousands of compounds in virtual screening against the known and already de-risked16 target 5-HT1A, a GPCR that binds serotonin (NCBI Gene ID: 3350). With this approach it was possible to reduce the number of experimental assays needed for hit identification to 81 and for lead optimisation to 47. As a result, a potent lead compound with a novel mechanism (and thus patentable) and favourable first pre-clinical tests for toxicity and hERG metabolism could be selected in well under two years (Figure 3.11). Figure 3.11. Prosarix in-house discovery. 5-HT1A is a low-risk target as it has been in clinical use for many years and shows an interesting pharmacology. The in-house process of virtual hit identification and lead optimisation, combined with outsourced experimental screening and in vivo testing, required the synthesis of only 128 compounds in total and resulted in the creation of a pre-clinical drug candidate (PEL-576) that shows higher selectivity for the target and better pharmacokinetic properties than a compound by Proximagen that is currently in clinical development for diseases such as epilepsy and Parkinson's induced dyskinesia. The UK patent for the identified lead compound has already been granted and partners for further development are sought. Abbreviations: cpds: compounds, Ki: measure of binding affinity, EC50: half maximal effective concentration, PK: pharmacokinetics, ADHD: attention deficit hyperactivity disorder. Source: William Hamilton, personal communication In general, William Hamilton's experience shows that VS can achieve time and cost savings by reducing the time frame for hit identification by several months (HTS: 6-9 months plus 1 month data procurement; VS+HTS: less than 3 months for VS, 1 month data procurement, plus 1 month screening a much smaller library of only 50 to 200 compounds) and by reducing the number of compounds that require synthesis. According to the company's CEO, William Hamilton, the push-button approaches that are commonly used in computational chemistry are not necessarily successful. Thorough knowledge of the target biology and prior art is needed, as well as detailed expertise in structure-based and ligand-based algorithms, to find hits that have the potential to become potent and patentable leads.
16 The target, 5-HT1A, is already the target of several drugs on the market. This, according to Hamilton, cannot be provided by just using a variety of commercial software suites, which are often used by in-house discovery units in large pharmaceutical companies. He even speculates that the in-house computational chemistry departments may be part of the productivity problem, because they tend to try to reinvent the wheel when encountering problems instead of using the expertise that is available elsewhere. Prosarix' high hit rates and its success in finding innovative and novel leads in a short time somewhat support this hypothesis. From the validation data (Table 3.7) it is apparent that both structure- and ligand-based approaches can be seen as very valuable in enriching databases with compounds showing good activity against the target. Furthermore, de novo design is, according to William Hamilton, under-utilised and has great potential for improving drug discovery when used in a way similar to virtual screening.
Table 3.7 ProtoDiscovery™ validation data from several projects (No. | Technology | Screening Type | Target Class | Library Size | No. cpds | No. hits in vitro | Hit Rate %)
1 | ProtoScreen™ | Structure | Enzyme | 1.2 m | 43 | 24 | 55.6
2 | ProtoScreen™ | Structure | Nuclear Receptor | 1.2 m | 41 | 17 | 1.6
3 | ProtoScreen™ | Structure | Protein:protein | 6.8 m | 146 | 37 | 25.3
4 | ProtoScreen™ | Structure | Enzyme | 2.5 m | 31 | 6 | 19.3
5 | ProtoScreen™ | Structure | Enzyme | 3 m | 140 | 14 | 10
6 | ProtoScreen™ | Structure | Enzyme | 3 m | 30 | 2 | 6.7
7 | ProtoScreen™ | Structure | GPCR agonist | 2.5 m | 51 | 3 | 72
8 | ProtoScreen™ | Structure | GPCR agonist (peptide) | 3 m | 88 | 18 | 20.4
9 | ProtoScreen™ | Structure | GPCR antagonist | 3 m | 11 | 5 | 45.4
10 | ProtoScreen™ | Structure | GPCR peptide | 3 m | 67 | 8 | 11.9
11 | ProtoScreen™ | Structure | GPCR antagonist | 3 m | 28 | 8 | 28.6
12 | ProtoScreen™ | Structure | Protein:protein | 3 m | 56 | 5 | 8.9
13 | ProtoScreen™ | Structure | GPCR agonist | 4.5 m | 61 | 9 | 15
14 | ProtoBuild™ | Structure | Nuclear Receptor | 10 | 5 | 5 | 100
15 | ProtoBuild™ | Structure | Enzyme | 14 | 5 | 3 | 60
16 | ProtoBuild™ | Structure | GPCR agonist (peptide) | 10 | 1 | 1 | 100
17 | ProtoBuild™ | Structure (de novo) | GPCR agonist (aminergic) | 10 | 2 | 2 | 100
18 | ProtoBuild™ | Ligand (de novo) | GPCR antagonist (aminergic) | 11 | 3 | 2 | 66.7
19 | ProtoBuild™ | Structure | GPCR antagonist (peptide) | 10 | 1 | 1 | 100
20 | ProtoShapeES™ | Ligand | Enzyme | 3 m | 71 | 7 | 9.9
21 | ProtoClassify™ | Ligand | GPCR | 2.5 m | 11 | 6 | 54.5
GPCR: G-protein coupled receptor, m: million, Library Size: size of virtual library of purchasable compounds (ProtoScreen™) or no. of compounds built in silico (ProtoBuild™), No. cpds: number of compounds procured or synthesised, No. hits in vitro: number of compounds showing ≥50% inhibition of control in binding assays at 10 µM, Hit Rate: number of hits in vitro/number of compounds. Data: William Hamilton, personal communication
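The hit rates in Table 3.7 can be turned into an enrichment estimate with a few lines of Python. The virtual-screening figures below are taken from project 3 in the table; the baseline hit rate assumed for random screening is purely illustrative, since no such baseline is reported here.

```python
def hit_rate(n_hits: int, n_tested: int) -> float:
    """Fraction of tested compounds showing >=50% inhibition at 10 uM."""
    return n_hits / n_tested

def enrichment_factor(vs_hit_rate: float, random_hit_rate: float) -> float:
    """How many times more hits per compound tested than random screening."""
    return vs_hit_rate / random_hit_rate

# Project 3 from Table 3.7: 146 compounds procured, 37 confirmed in vitro hits
vs_rate = hit_rate(37, 146)

# Hypothetical baseline: assume random HTS of the same library yields ~0.5% hits
baseline = 0.005
print(f"virtual-screening hit rate: {vs_rate:.1%}")
print(f"enrichment over assumed random baseline: "
      f"{enrichment_factor(vs_rate, baseline):.0f}x")
```

Enrichment of this kind, rather than absolute hit counts, is the usual way the value of a virtual screen is quantified.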
Although the company has a track record of successful consultancies and many satisfied customers in the biotechnology industry, it still finds it hard to convince big pharmaceutical companies of the usefulness of its service. A problem that is mainly encountered when talking to the big pharmaceutical corporations can be described as the "not invented here" phenomenon: computational chemists in the pharmaceutical company either do not trust the technology, since it was not invented by them, or are convinced that they can take the idea and implement it in a better way themselves. This, of course, has its disadvantages for the pharmaceutical company, since the knowledge is already available and novel implementations carry the risk of failure and time delays. Interestingly, Scott Lusher sees a very similar problem in the teamwork of medicinal chemists and computational chemists within a discovery unit (personal communication). It is therefore possible to hypothesise that this mindset may be one reason why many computational technologies still do not deliver in the way they should. This case study shows that computational chemistry has some major advantages in the process of hit finding and lead optimisation. By reducing the compound library size manifold and screening only a few molecules in experimental assays, better assays can be used, since they do not need to have a high throughput, and both synthesis costs and process time are reduced. The use of a proprietary system containing several complementary algorithms for virtual hit identification by the trained experts in the company allows the process to be optimised for each project individually and increases hit rates. An especially smart way of adding value to the process is to use known characteristics of the work process of medicinal chemists (e.g. the fact that chemists are biased towards structures they have worked with before, since they know how to synthesise them and have more knowledge about their other molecular features) and to try to improve it systematically, as is currently being done in the AutoStere™ process. Table 3.8 Case Study Lead Optimisation: Optibrium Ltd. The company Optibrium™ Ltd. in Cambridge (UK) is taking a different approach to improving drug discovery than Prosarix by providing software rather than services to a worldwide customer base ranging from top-ten pharmaceutical corporations to small biotechnology companies. The core product StarDrop™ (Figure 3.12) mainly aims at helping medicinal chemists and other scientists in the pharmaceutical industry to guide decisions in the multi-parameter molecule optimisation and compound selection process. Therefore this case study, mainly derived from personal communication with Optibrium's CEO Matthew Segall, will focus on how software for early ADME and toxicity prediction can help improve R&D productivity in the pharmaceutical industry. Figure 3.12 StarDrop Graphical User Interface. StarDrop contains the four key concepts of probability scoring, chemical space projection, "Glowing Molecule™" and data visualisation. In this context it uses techniques such as QSAR and machine learning to create structure-property relationships. Furthermore, StarDrop can be expanded by four different add-ons. The first add-on provides ADME QSAR models which were trained globally on a wide diversity of chemistry with experimental data for solubility, blood-brain barrier permeability, hERG inhibition and others, thus providing good predictions for many compounds. The second plug-in is Auto-Modeller™, which allows users to create predictive models tailored to their specific data, even if they are not computational experts. Several modelling methods, including machine learning, can be chosen and the results validated.
The P450 module was the core technology of Camitro and aims at giving information about how a specific compound would be metabolised by cytochrome P450 enzymes, using quantum mechanical models. The newest plug-in is called "Nova™" and aims at generating new compound ideas from a given set of molecules. This is done by using hit or lead compounds as the basis for the creation of related molecules, which are generated in the way a medicinal chemist would consider modifying the initial compound, using 206 different transformations and an iterative process option. A prioritised list is given as output to suggest new strategies to the chemist. Thereby Nova can be useful in expanding compound searches, in finding high-quality hits that would have been overlooked or not thought of, or in identifying patent-busting opportunities. Many of the more than 35 customers now use all four plug-ins, especially the smaller companies that do not have a big infrastructure, whereas the big pharmaceutical companies may not use all of the modules (they have their own in-house software). According to Matthew Segall, the major strength of StarDrop is that it can be used to guide strategic decisions about compound selection and experimental design without the user being a computational chemist. With the explosion of available data, the most important part is to make the right decisions based on it and not to be confused by the mass or the uncertainty of the data. StarDrop therefore uses probabilistic scoring that allows the user to define a profile of property criteria (including aspects such as logP, logD, hERG inhibition, and flexibility) and to assign weights to the different properties, so that the software is easily adjusted to the important aspects of the specific project. After bringing together all available data and taking its uncertainty (experimental characteristics such as assay sensitivity and specificity) into consideration, the compounds are ranked according to how well they fit the given target profile, to find those that are most likely to succeed. Methods for chemical space projection further allow patterns of good or bad compounds in the chemical space to be identified and the selection of those that are progressed to experimental assays to be optimised. Finally, a part of StarDrop called "Glowing Molecule" allows the user to understand the relationship between compound structure and certain properties and highlights regions that need optimisation. These aspects, plus several extensions for ADME or P450 metabolism predictions and other predictive models, are embedded in a highly interactive data visualisation platform which can easily be used even by non-computational chemists.
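The general idea of probabilistic multi-parameter scoring can be sketched in a few lines: each predicted property is converted into a probability of meeting its criterion, given the uncertainty of the prediction, and the weighted probabilities are combined into a single score. This is only a toy illustration of the concept, not StarDrop's actual algorithm; the property criteria, weights and uncertainties below are invented for the example.

```python
import math

def prob_meets_criterion(pred, threshold, sigma, greater_is_better=True):
    """Probability that the true value meets the criterion, assuming a
    normally distributed prediction error with standard deviation sigma."""
    z = (pred - threshold) / sigma
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return p if greater_is_better else 1.0 - p

# Hypothetical target profile: (predicted value, threshold, uncertainty, direction, weight)
profile = {
    "logP":        (2.8, 3.5, 0.5, False, 1.0),   # want logP below 3.5
    "solubility":  (1.9, 1.0, 0.6, True, 0.8),    # want log(S) above 1.0
    "hERG pIC50":  (4.2, 5.0, 0.7, False, 1.0),   # want weak hERG inhibition
}

score, total_weight = 1.0, 0.0
for name, (pred, thr, sigma, greater, weight) in profile.items():
    p = prob_meets_criterion(pred, thr, sigma, greater)
    score *= p ** weight              # weighted product of probabilities
    total_weight += weight
    print(f"{name:11s} P(criterion met) = {p:.2f}")

print("overall score:", round(score ** (1.0 / total_weight), 2))
```

Because the uncertainties enter the calculation explicitly, a compound is not penalised heavily for a borderline prediction that the model cannot really distinguish from a pass, which is exactly the behaviour described above.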
Usability is indeed a major aspect in creating tools that are supposed to be applied by non-expert users.
The StarDrop developers approached this by working closely with an anthropologist, who looked at the characteristics of the interaction between research team members in a company and their use of computers. From project team meetings it was apparent that chemists and biologists were missing each other's points because they could not relate to the presented data. However, when the chemical structure of the compound was presented next to the activity data, both sides were interested and engaged more effectively with each other. This shows that different members of the research team may have different priorities and that the way data are presented heavily affects whether they are taken up or ignored. When asked about competitors, Matthew Segall listed several companies providing tools for some aspects of the StarDrop suite, but especially singled out in-house groups in pharmaceutical companies as Optibrium's biggest competitors. He, like William Hamilton17, commented that these groups would often rather develop their own platform to do what they need than spend money on licences. He points out that this is not very effective, since the resulting software is normally not as functional as the commercially available solution and incurs high maintenance costs. In recent times, however, Matthew Segall admits, a behavioural change in big pharma is detectable towards reducing in-house efforts and focusing on commercially available products or services – a view that is shared by others such as Prof. Oliver Kohlbacher, a bioinformatics professor at the University of Tübingen (Germany). Matthew Segall can, furthermore, list several scenarios in which StarDrop has been used
17 Interestingly, Gary Rubin (Fios Genomics) and Gordon Baxter (BioWisdom) also mentioned in-house development as the hardest competition.
successfully to improve the process of hit-to-lead and lead optimisation: in one case a client used the software to focus screening resources by using probabilistic scoring to identify compounds that are likely to fail due to unfavourable ADME. In another case, an in silico guided approach was used to identify areas of chemical space containing compounds that are likely to have successful ADME and target-binding properties prior to in vitro screening, which helped to de-risk the process and allowed the client to progress from hit-to-lead to Phase I clinical trials in only 2.5 years. In a last example, a client who wanted to develop an orally bioavailable drug for a target in the central nervous system (CNS), but got stuck with a group of compounds that had high potency and only either good bioavailability or good CNS penetration, asked the company how much could have been saved if StarDrop had been used earlier in the pipeline. An analysis by the StarDrop team showed that, by using the tool, another part of the chemical space would have been explored in which the two properties of bioavailability and CNS penetration were better balanced; successful compounds could thus have been found with 90% less compound synthesis, in vitro screens could have been avoided, and in vivo screens could have been reduced by 70% (Figure 3.13). Figure 3.13. Optibrium case study. A pharmaceutical company tried to develop a central nervous system drug and used a screening library to identify hits. (a) The initial set of compounds selected by the pharmaceutical company had high target activity, but either low bioavailability or low blood-brain barrier penetration, and was therefore not useful for further development. (b) The use of StarDrop could have helped identify an area of chemical space with a more appropriate balance of these properties, which could then be used to target the central nervous system. (c) The workflow suggested by Optibrium includes several steps of negative selection of compounds that would not have the desired properties and results in the selection of 25 compounds which could be evaluated further in vivo. All figures in the exhibit are reproduced with the permission of Optibrium Ltd. ©Optibrium Ltd. These examples show that in silico methods have a high potential to eliminate compounds that have a high probability of failing later in the process, while simultaneously reducing the number of synthesised compounds and of in vitro as well as in vivo tests, thus reducing cost and time in this phase. StarDrop and its developers proactively look for what the customer's aim actually is and try to help them reach conclusions quicker by supplying a decision-support environment in which to make the right decisions. 3.5 Conclusions This chapter has shown that there is a plethora of different algorithms available for rational compound design. It may be argued that time and money savings in the hit-to-lead phase through the use of computers cannot be very significant, since this part is already quick and cheap according to Paul et al. (2009). Yet, according to Matthew Segall, much room for improvement is still left in the hit-to-lead and lead optimisation process (personal communication).
In the case studies, a drastic enhancement of the total process of hit identification, hit-to-lead and lead optimisation was reported, to around two years instead of the 4.5 years discussed by Paul et al. (2009), achieved either by using VS technology or by using a decision-guidance framework containing ADME prediction and digital compound optimisation. Furthermore, Accelrys, the provider of a market-leading modelling software package, has performed an analysis of the return on investment (ROI) of the use of modelling and simulation software tools in pharmaceutical development and estimated the cumulative ROI to be in the order of US$ 3-10 for every dollar invested. Further details are listed in Table 3.9. Additionally, this phase is the most important one in the overall R&D procedure, since it results in the identification of the molecule that is moved on into animal and clinical testing. By finding out as much as possible about the properties of the chosen compound, later failure in in vivo studies may be prevented and even more cost saved. In conclusion, the measures discussed in this chapter by which the increase in productivity due to computer use can be evaluated are compound library enrichment, time and money savings due to better guidance of hypothesis building and experimental design, and the creation of compounds that have better secondary properties and therefore may not fail in clinical trials. In general, current methods are better at sorting out compounds that are very likely to fail than at finding the very best molecule that could be used directly as a lead.
Table 3.9 Estimated return on investment (ROI) for computational modelling
1 Experimental efficiency. Occasional user (2 annual projects): average savings of 5% of the experimental cost per project ($1,500,000), total annual benefit $150,000, cost $56,600, ROI $2.65. Power user (5 annual projects): average savings 7.5%, total annual benefit $562,500, cost $80,000, ROI $7.03.
2 Deeper understanding. Occasional user: 5% of projects generating a save, value of a save $1,500,000, total annual benefit $150,000, cost $56,600, ROI $2.65. Power user: 7.5% of projects generating a save, total annual benefit $562,500, cost $80,000, ROI $7.03.
3 Product development save. Occasional user: 0.20% of projects generating a save, value of a save $6,000,000, total annual benefit $24,000, cost $56,600, ROI $0.42. Power user: 0.75% of projects generating a save, total annual benefit $225,000, cost $80,000, ROI $2.81.
Total (1+3). Occasional user: ROI $3.07. Power user: ROI $9.84.
Occasional user: 10-20 projects simultaneously with 1-2 projects advanced per year; power user: 20-40 projects simultaneously and 5 projects (average) advanced per year. Estimated cost per project per year: US$ 1.5 million. Additional ROI is achieved through earlier marketing of a drug, with 1-day savings of $34.3-$48.4. ROI: return on investment. Data source: (Louie et al., 2007)
Chapter 4 Pre-Clinical and Clinical Development
Pre-clinical and clinical development aims at testing whether the drug candidate is safe and efficacious in animals and humans and at determining the final dosing of the treatment (Table 4.1).
Table 4.1 Stages of preclinical and clinical development (Phase | Objective | Patients/Subjects | Time (years) | Cost (US$ million))
Preclinical | Testing of lead compounds in animals for efficacy, toxicity and pharmacokinetic properties | At least one rodent model and one non-rodent model suitable for the disease; often mice, plus 2-year rat chronic oral toxicity tests | 3 | 90
Phase I | Primary safety testing | 20-100 volunteers (normally healthy) | 1 | 45
Phase II | Testing of efficacy and side effects | 100-500 patients | 2 | 65
Phase III | Testing of efficacy, safety and dosage | 1,000-5,000 patients | 3 | 205
Phase IV | Assessment of post-marketing safety | All patients taking the drug | 1 | N/A
Cost data in 2004 US dollars. Source: (Zimmerman et al., 2004)
According to current industry analyses, attrition in late-stage clinical development is the leading cause of the persisting productivity problem. As can be seen in Figure 4.1, failure in the late-stage clinical phases in particular adds massively to the overall R&D costs. Analysing the reasons for attrition in these phases between 2007 and 2010, Arrowsmith found that in Phase II half of the drugs that failed did so for lack of efficacy, in 30% of the cases development was discontinued for strategic reasons, and a fifth had safety issues (Arrowsmith, 2011a). The picture is similar for Phase III and submission failures in the same period: two thirds failed for lack of efficacy and a fifth for safety reasons (Arrowsmith, 2011b). This pattern can be explained by the complexity of the human organism: "If one attempts to intervene in a pathway by inhibiting an enzyme, for example, a cascade of adjustments ensues that tend to compensate for the changes in the concentration in some of the biomolecules that result from the intervention. In most cases, these adjustments will buffer and negate the intervention, or will produce unexpected side effects. So the result is no efficacy and/or toxicity. Only [in] very rare cases will these adjustments amount to not much, and we'll have a potential drug candidate [...] Network effects that take place outside the biological networks that we know can, and often do, lead to unpredictable outcomes" (Bernard Munos, personal communication). Figure 4.1. R&D costs for all R&D efforts and per approved drug, across the preclinical, Phase I, Phase II, Phase III and approval stages. It can be seen that the clinical costs are dominated by failed drug candidates, especially at the milestones of pre-clinical proof-of-concept prior to clinical trials in humans and in Phase II and Phase III. Source: (Scannell et al., 2010) Especially due to the development of high-throughput screening methods for early ADME assessment and predictive strategies such as those described in the last chapter, failure because of poor pharmacokinetics is nowadays rare (less than 11%, compared with 40% in the 1990s) (Tsaioun et al., 2009). Lack of efficacy can be the result of various factors, such as poor target choice, inability of the drug to penetrate biological barriers such as the blood-brain barrier or cell membranes (leading to low on-target concentrations), or off-target binding (Tsaioun et al., 2009).
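A back-of-the-envelope calculation shows why late-phase attrition dominates the cost per approved drug. The sketch below combines the per-phase costs from Table 4.1 with phase success probabilities; the probabilities are assumed for illustration only and are not taken from this thesis or its sources.

```python
# Out-of-pocket cost per phase (US$ million, Table 4.1) and assumed success rates
phases = [
    ("Preclinical", 90,  0.40),   # success probabilities are hypothetical
    ("Phase I",     45,  0.60),
    ("Phase II",    65,  0.35),
    ("Phase III",  205,  0.60),
]

expected_cost = 0.0   # expected spend per compound entering preclinical testing
p_reach = 1.0         # probability of reaching the current phase
for name, cost, p_success in phases:
    expected_cost += p_reach * cost
    p_reach *= p_success

p_approval = p_reach
print(f"probability of approval: {p_approval:.1%}")
print(f"expected spend per starting compound: ${expected_cost:.0f}m")
print(f"cost per approved drug (failures included): ${expected_cost / p_approval:.0f}m")
```

Under these assumptions, improving the Phase II and Phase III success rates has a far larger effect on the cost per approval than trimming the cost of any single phase, which is the argument behind the predictive approaches discussed below.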
These properties can partially be predicted in advance by using in vitro assays and in vivo/animal disease models, but the disease and metabolism in these models are often unrepresentative of the human organism, making the results relatively unreliable apart from giving general tendencies for safety or efficacy (PricewaterhouseCoopers, 2008). However, there are several ways in which the use of information technology and computing may help reduce such attrition rates, or at least move them forward to the discovery or early development stage. The aspects discussed in this section will include the use of computational methods in trial design, patient selection and predictive models, but will omit administrative aspects such as clinical data storage or recruitment due to space constraints. 4.1 Trial Forecasting Aspects of ADME prediction were already discussed in the context of lead identification and optimisation, since that is where they are applied best. Regulatory agencies furthermore require testing of the bioavailability, tissue distribution, pharmacokinetics, metabolism, and toxicity of the drug candidate in one rodent and one non-rodent species. These tests have the disadvantage that their accuracy is only 50%; they may result in the abandonment of a candidate that would have worked in humans, while allowing the continued development of a compound that is doomed to fail later on because it works in animals but not in humans18 (Tsaioun et al., 2009). Therefore, the "holy grail" in computational toxicology and PK prediction is seen in the creation of a "virtual human" that could be used as a more reliable model of the complex behaviour of a drug prior to testing in humans. Once such a model exists, the time needed for discovery could be shortened massively and clinical testing could be reduced to only 1.5 years, as the analysts at PricewaterhouseCoopers, for example, suggest (Figure 4.2) (PricewaterhouseCoopers, 2008). Figure 4.2. How the drug design and development process might look once virtual humans exist. Grey: in silico work, white: wet-lab processes, black: testing in humans; CIE: confidence in efficacy, CIS: confidence in safety; adapted from (PricewaterhouseCoopers, 2008)
4.1.1 Virtual Animal and Patient Models There are several projects working on virtual human models, including "The Step Consortium", the "Living Human Project", and the "Human Physiome Project", but more than a decade of basic research and model development may be needed for them to be useful in reliably modelling the complex behaviour the human body shows (William Hamilton, Gordon Baxter, PricewaterhouseCoopers, 2008). 18 An example of the latter is thalidomide, which is teratogenic in humans but not in rats or mice (Fratta et al., 1965). In the near future, it may be more realistic that virtual models for animals or specific organs are developed and that predictions of clinical trial outcomes are performed on the basis of available pre-clinical data, thus reducing the need for experiments with living organisms19. Examples of such products already in use are a virtual diabetes mouse model by Entelos (http://www.entelos.com), a virtual tumour platform by Physiomics (http://www.physiomics-plc.com/), and a virtual heart (Noble, 2007). One thing that must not be forgotten in this whole development, however, is that computers are not humans and therefore all these models will have weaknesses, just like animal models (Gary Rubin, personal communication). Furthermore, due to genetic diversity there is no such thing as "the human organism". A computer model is forced to generalise and to leave out considerations of slight differences in receptors, enzymes and proteins and the potential combinations of the different variations that can be found in the human population. A drug developed for a general model of the human organism will therefore not work in every patient. Simulating all possible combinations of receptor/enzyme/protein variations that are part of the drug-activity or disease network would widely exceed existing computing capacities. These models therefore currently rely on specific predominant genotypes and phenotypes (see Table 4.2 for a case study about virtual patient and animal models). Table 4.2 Case Study Virtual Models: Entelos Inc. Entelos Inc., a Californian biotechnology company, is currently a leader in virtual animal and human models. The company started with a model for a virtual non-obese Type 1 diabetes mouse and has extended its product portfolio from there to disease-specific platforms that predict behaviour in virtual human patient populations. The models are based on the existing literature for a certain disease to mimic observed patient types as well as healthy individuals and also disease progression. PhysioLab models are created specifically for a disease by outcome-focused top-down approaches (Shoda et al., 2010) (Figure 4.3). Their main application is in silico hypothesis testing and simulating clinical trial outcomes to guide decisions about further compound development, but they can also be applied in target characterisation and other early-stage discovery steps for compound prioritisation.
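A heavily simplified sketch of the virtual-patient-population idea is given below: sample the parameters that vary between patients, simulate a response for each virtual patient, and look at the distribution of outcomes. The dose-response model and all parameter ranges are invented for illustration and bear no relation to any Entelos platform.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_patients = 10_000

# Hypothetical between-patient variability in clearance and target sensitivity
clearance = rng.lognormal(mean=np.log(10.0), sigma=0.4, size=n_patients)   # L/h
emax      = rng.normal(loc=0.6, scale=0.15, size=n_patients).clip(0, 1)    # max effect
ec50      = rng.lognormal(mean=np.log(2.0), sigma=0.5, size=n_patients)    # mg/L

dose_rate = 50.0                       # mg/h constant infusion (hypothetical regimen)
c_ss = dose_rate / clearance           # steady-state concentration per patient
effect = emax * c_ss / (ec50 + c_ss)   # simple Emax dose-response per virtual patient

responders = (effect > 0.4).mean()     # arbitrary response threshold
print(f"simulated responder rate: {responders:.1%}")
print(f"median effect: {np.median(effect):.2f}")
```

Mechanistic platforms replace the one-line dose-response model with a full disease model, but the principle of propagating parameter variability into a distribution of outcomes is the same.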
19 The current use of animals in the pharmaceutical research process represents, according to Gordon Baxter, a much bigger problem than the productivity issues, since it costs the lives of many mice, rats and other animals with only limited benefit to the project (personal correspondence). Figure 4.3. Entelos' concept of virtual patients. In a PhysioLab platform for a specific disease, observed patient phenotypes can be modelled. As these can be induced by varying underlying features such as lifestyle, genetics or others, patient populations can be created to evaluate the impact of different hypotheses on the phenotype. A virtual clinical patient population consists of a mixture of different phenotypes and different disease-underlying biology. Hence different subgroups respond to treatment regimens in different ways, imitating real trial situations while allowing the testing of hypotheses on underlying disease biology, biomarkers or dosing schemes. Source: (Entelos, 2007) Figure 4.4. Entelos' top-down PhysioLab® model building process. In the top-down approach the simplest processes are modelled first and then increasing levels of detail are added in an iterative process until the platform predictions accurately reflect published experimental and clinical observations. This process is therefore different from most approaches to virtual patient-building, which rely on bottom-up techniques that try to integrate all available data at the fundamental level first, but often fail to reproduce the behaviour at a systems-
level due to lack of data. According to Shoda et al. (2010), the top-­‐down approach has the advantage that it includes the variability in the parameterisation of disease underlying biology while giving precise phenotypic information. However, by relying on protein and expression data to value assignment and relationship characterisation between biological entities, the PhysioLab platform also includes some bottom-­‐up aspects (Shoda et al., 2010). Source: (Entelos, 2007) 59 Pre-­‐Clinical and Clinical Development PhysioLab platforms have been used by several large pharmaceutical companies to optimise their clinical development. Johnson & Johnson, for example, could achieve a 40% reduction in time and a 60% reduction in patients for a Phase I trial by simulating a clinical trial with the protocol and optimising the actual trial according to the simulated outcome (Entelos, 2005). In another case, it was predicted that a target (IL-­‐5) for treatment of late asthmatic response was not going to be effective although animal-­‐results were promising. After the prediction, the customer (Aventis) stopped further development and avoided later failure. Clinical trials performed by competing companies on that target later validated the prediction of Entelos (Entelos, 2000). The PhysioLab results have been validated externally for common diseases such as diabetes and asthma. Nevertheless, one problem with this approach remains: since the models are created specifically for a certain disease and require extensive literature about disease biology and differences to the healthy state, this method may not be applicable for rare or neglected diseases for which such information does not exist. In addition, most of the case studies are now five years old which may leave the reader wondering about the existence of more recent successful applications of the concept. Apart from specialist software, also tools in general use can be helpful in evaluating preclinical data and building useful models from it. The pharmaceutical company Roche, for instance, uses the numerical computing environment Matlab to build pharmacokinetic, pharmacodynamic and organ models to increase the efficacy and safety of their clinical trials. By incorporating data from animal studies with the drug candidate and similar approved drugs as well as published clinical trial data, the company develops methods to predict drug disposition, penetration of target organ, fluid transport through tissue, drug accumulation in organs and suggest an optimal dosing regimen for clinical trials. The use of such models therefore helps guide experiment and trial design and reduce the time needed for (pre-­‐) clinical development thus streamlining the approval process (Matlab, no date a). One pharmaceutical company reports that it could cut its drug discovery time by 80% (discovery and development of seven drug candidates in less than 3 years) when using Matlab for target discovery up to clinical trials to predict pathways, networks, targets, pharmacokinetics and dosing (Matlab, no date b). 4.2 Trial Design 4.2.1
Biomarkers Biomarkers in the context of drug safety and efficacy evaluation can be used to determine whether the drug is actually working on-target in vivo by measuring changes in pathway constituents downstream of the target. Additionally, they can be of use in answering questions about the drug's metabolism and its perturbation due to mutations in enzymes and transporters, as well as perturbation of the mechanism of action due to variations in the target or pathway constituents. Biomarkers therefore have a high potential for detecting early signs of efficacy, toxicity or off-target effects, thus allowing earlier changes in patient selection or trial design (Ginsburg et al., 2006). Moreover, relatively new strategies such as pharmacogenetics and pharmacogenomics may have a significant impact on the way clinical trials are performed. For them to work, as with all other "omics"-
technologies, bioinformatics and information technology are needed to acquire the relevant data and analyse them in a timely manner. Biomarker identification and validation is not an easy process: although the expression profile or concentration of a protein may initially seem relevant and representative of a specific biological process, most candidate biomarkers ultimately fail in the validation process (Watson et al., 2006). Validated and candidate biomarkers could be applied at several steps of the drug R&D process, from target validation (Gul, 2011), the prediction of potential pharmacokinetic properties from in vitro or pre-
clinical in vivo data (ibid.) to toxicology prediction from short-­‐term animal studies (Watson et al., 2006) and in patient selection for clinical trials and post-­‐marketing (Barton, 2008). 4.2.2
Example: Biomarker-Based Patient Selection A major application of biomarkers lies in the design of adaptive clinical trials in which the trial population is enriched for those individuals who test positive for a specific marker identified beforehand from the literature or from pre-clinical/early clinical data (Figure 4.5, illustrated in Table 4.3). Efficiency gains are highest when the prevalence of marker-positive individuals in the population is low and marker-negative individuals have very low response rates to the drug (Table 4.4). Figure 4.5 Enrichment trial design. Only patients who carry a marker that was associated with a higher chance of response in previous preclinical or clinical trials are selected for the trial. Normally the exploration and validation of such biomarkers takes place in late-stage development, but biomarkers can be used to accelerate drug development even in Phase I trials. This approach may, however, not be useful for compounds with multiple targets or for which the mechanism of action is unknown. Source: (Barton, 2008) Table 4.3 Example Enrichment Trial A simple calculation can illustrate how enrichment trials can help salvage drugs that would otherwise not show the efficacy needed to gain marketing approval, or help extend the spectrum of target diseases: to test the use of gefitinib (Iressa, AstraZeneca) in non-small cell lung cancer, two randomised trials were performed with a total of 2,130 patients, and efficacy could not be observed because the drug only works in cancer patients with EGFR mutations. In order to demonstrate sufficient efficacy (a 20% survival benefit for 10% of the patients) in an untargeted study, more than 12,000 patients would have been needed. In an enrichment trial, 138 marker-positive patients would have been enough to give a proof of principle and a picture of the true efficacy of the treatment (Dr Richard Simon, 2007, cited by (Barton, 2008)). With an average cost of US$ 19,300 per patient in Phase II and US$ 26,000 per patient in Phase III (2006 dollars) (Kirk et al., 2008), the reduction in cost when reducing the trial size from 2,000 to 140 patients is tremendous (ca. US$ 36-48 million), not to mention the added profits from earlier marketing approval.
Table 4.4 Efficiency of enrichment study designs (Prevalence | Relative efficacy | Efficiency gain)
25% | 0% | 16x
25% | 50% | 2.5x
50% | 0% | 4x
50% | 50% | 1.8x
75% | 0% | 1.8x
75% | 50% | 1.3x
Prevalence: prevalence of marker-positive patients; Relative efficacy: efficacy that would be measured in the general population. Source: Simon and Maitorman, 2004, cited by (Barton, 2008)
4.3 Pharmacogenetics and Pharmacogenomics Pharmacogenetics, as defined by Ginsburg et al. (2005), describes the impact of genetic variation on the proteins involved in drug metabolism and on transporters20. The broader field of pharmacogenomics (PGx) describes the effects of genetic variation (multiple genes or the whole genome) on pharmacodynamic variables such as the drug targets and pathway components (Ginsburg et al., 2005). Signatures such as proteomic or transcriptional patterns may then allow a prediction of which patients will respond to a treatment and which may suffer from adverse drug reactions (ADR).
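The efficiency gains in Table 4.4 can be reproduced approximately with standard sample-size reasoning: for the same statistical power, the number of randomised patients scales with one over the square of the effect size, and in an untargeted trial the average effect is diluted by the marker-negative patients. The sketch below is a simplified illustration of that reasoning, not the exact calculation of the cited Simon analysis; "relative efficacy" is interpreted here as the treatment effect in marker-negative patients relative to marker-positive patients.

```python
def relative_trial_size(prevalence: float, relative_efficacy: float) -> float:
    """Approximate ratio of untargeted to enriched (marker-positive only) trial size.

    prevalence:        fraction of patients who are marker positive
    relative_efficacy: treatment effect in marker-negative patients as a
                       fraction of the effect in marker-positive patients
    """
    # Average effect in an all-comers trial, relative to the effect in positives
    diluted_effect = prevalence + (1.0 - prevalence) * relative_efficacy
    # Required N scales with 1/effect^2; the enriched trial enrols only positives
    return (1.0 / diluted_effect) ** 2

for prev in (0.25, 0.50, 0.75):
    for rel_eff in (0.0, 0.5):
        gain = relative_trial_size(prev, rel_eff)
        print(f"prevalence {prev:.0%}, relative efficacy {rel_eff:.0%}: "
              f"~{gain:.1f}x more randomised patients without enrichment")
```

Running this reproduces the pattern in Table 4.4 to within rounding, which makes the intuition explicit: the rarer the responsive subgroup and the weaker the effect outside it, the more an enrichment design pays off.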
20 In this context mainly the association between an individual gene and drug response variability is analysed (e.g. the differences in warfarin metabolism depending on cytochrome P450 variation) (Weinshilboum, 2003, cited by Ginsburg et al., 2006). The use of pharmacogenetics can improve the understanding of pharmacokinetic data and adverse reactions, guide dosing regimens, and support the identification of predictive toxicity and response markers (Ginsburg et al., 2006). The applications of pharmacogenomics in drug development are manifold (Table 4.5). The markers used are often derived from single nucleotide polymorphism (SNP) profiling or whole-genome analysis (Barton, 2008). Table 4.5 Potential advantageous applications of biomarkers, pharmacogenetics, and PGx (Application: Impact of PGx) Drug Safety: Identification of genotypes which are likely to suffer from adverse effects, such as patients with the LQT1/LQT2 genotype, who represent 60% of the cases suffering from long QT syndrome, a heart condition which can lead to sudden death and is often aggravated by drugs (Kirk et al., 2008). Furthermore, a drug for which side effects or toxicity are detected in clinical trials could still be "saved" if the adverse behaviour can be linked to a specific genotype, which could then be excluded from the target group (Ginsburg et al., 2005). Drug Efficacy Testing: Enriching trials with patients showing a genetic profile or other biomarkers that are predicted to be related to a good response to the treatment can be helpful in showing efficacy for a drug that works only in a small subset of the population, but not in the rest. Such efficacy would not be detected in a random sample and the drug's development would thus be abandoned, although the drug has potential in a fraction of the market (Barton, 2008). Dosing: Identification of poor, intermediate and ultra-rapid drug metabolisers allows the drug to be dosed according to the rate of its degradation, thus minimising toxicity and side effects and improving efficacy as well as data stratification (Kirk et al., 2008). Clinical Trial Size: Using biomarkers to enrich trial populations with patients having a positive marker profile requires a lower trial size to detect the difference between the placebo and the drug (Kirk et al., 2008). Additionally, adaptive trial design with continuous data analysis and biomarker identification and validation can be used to shorten the time needed for clinical testing (Barton, 2008). Pricing: A drug developed and marketed together with a diagnostic marker can have massively increased efficacy since it is only given to patients in which it is likely to be efficacious. This improved efficacy against the current gold-
standard justifies a price premium that results in increased revenues although the market size for the drug may be reduced (Kirk et al., 2008). Additionally, due to the higher efficacy a higher rate of adoption of the new drug and better compliance can be expected (Ginsburg et al., 2005) Market Expansion The use of biomarkers and genetic profiles allows a better understanding of disease processes and establish links between underlying pathways and causative disease mechanisms which may make it easier to extend the range of indications for a drug (Ginsburg et al. 2005) Differentiation The smaller market and the targeted therapy for specific phenotypes results in differentiated products. This makes the marketplace less attractive for competitors (Ginsburg et al., 2005) Information Source: Kirk et al., 2009, Barton, 2008 and Ginsburg et al., 2005 A Datamonitor report shows that bioinformatics applications find their highest application in molecular medicines including PGx and pharmacogenetics (Figure 3) (Merchant, 2009), and another Datamonitor analysis predicted that the use of pharmacogenomics can potentially reduce the number of new compounds tested in phase II and III trials by 20%, the number of participants in trials by 50% in phase II trials and by 10% in Phase III with an additional reduction of 20% in phase III trial length (Datamonitor, 2002, cited by Ginsburg et al., 2005). Figure 4.6. Market potential of bioinformatics in different medical application areas. Molecular Medicine includes pharmacogenomics, pharmacoproteomics and metabolomics, Preventive Medicines are strategies to prevent an injury or illness before it occurs, Gene Therapy is defined as the introduction of healthy genes in order to treat a genetic disease, and Drug Development encompasses data processing and sequence analysis in the pharmaceutical sector as well as virtual screening, lead optimisation techniques, physicochemical modelling, similarity searches, and bioavailability/bioactivity prediction. Bioinformatics market growth is mainly driven by the growth of genomics-­‐applications in the pharmaceutical sector according to the author’s analysis of bioinformatics in drug development. Source: (Merchant, 2009) 64 Pre-­‐Clinical and Clinical Development Similarly, toxicogenomics can be useful in predicting toxicology early. The biotechnology company Entelos for example reports the prediction of nephrotoxicity with the use of kidney gene expression profiles from rats treated with different known nephrotoxic and non-­‐toxic compounds. From this a gene signature consisting of 35 genes was derived that could predict the future development of renal tubular degeneration before it appeared histologically with a predictivity of 76%. By using the computational approach, a higher success rate was achieved than in animal studies and the time needed for getting a result was reduced by 80% (prediction result obtained in 5 days, animal studies took 30 days) (Fielden et al., 2005). The major obstacles of biomarker and pharmacogenomic applications lie in finding the relevant genes or other biomarkers and validating them in statistically significant sample sizes of hundreds to thousands of patients (Reed, 2011). 4.4 Problems with Available Data The potential in modelling trials and predicting potential efficacy and toxicity biomarkers as well as predicting specific drug properties from virtual models can be very exciting. 
However, at the moment most of the computational techniques suffer from a massive lack of data and knowledge that could be used to develop models that are truly predictive21. This situation is further aggravated by the pharmaceutical industry's reluctance to disclose data about all its failures in the process of drug design and development, because such data may be IP sensitive and could be perceived as detrimental (William Hamilton, personal communication), or about the results of its animal studies (Gordon Baxter, personal communication). Without these data from compounds that showed, for example, toxicity in vitro, the creation of better models may be slowed down massively, and some experiments may be performed redundantly by different research groups. By now, databases for clinical trial data are gaining acceptance, however. Nevertheless, these do not span preclinical and Phase I data, since it is argued that such data are not for the benefit of patients and may tell competitors too much about the current internal development in a pharmaceutical 21
As Bernard Munos puts it: “40% of the human genome, although it was cloned long ago, remains unknown. We don't know what it does, although some of it is clearly important. I don't think one can model a system [...] if its make-­‐up is shrouded in mystery. The models coming out of system biology are based on what we know, but their predictions are frequently defeated by what we don't know” (personal communication, 2011). 65 Pre-­‐Clinical and Clinical Development company22 (Bouchie, 2006). In the context of development of predictive techniques however, another statement is of bigger importance: “Competition in business is understandable, but science doesn't work that way. Failures advance the field” (Merrill Goozner, Center for Science in the Public Interest, Washington, DC) (ibid.). Table 4.6 Case Study Bioinformatics Analyses: fios genomics Ltd. 70% of the toxicity that is observed in clinical trials was predicted in preclinical studies (Olson, 2000, cited by Stevens and Baker, 2009), therefore it is important to integrate as much knowledge gained in preclinical testing into the decision process. Fios Genomics Ltd., an Edinburgh (UK)-­‐based bioinformatics company, is focusing on genomic data analysis which is, according to the Director of Operations Gary Rubin, one way in achieving this. Increasingly high-­‐throughput genomics technologies such as microarrays and next-­‐generation sequencing (NGS) data have an output of up to 4 terabyte in data per run which is by now impossible to process without the help of computational methods and also hard to be analysed by a normal researcher. In order to improve this, either specialist software packages could be used in the developing company to support data analysis or the process could be out-­‐sourced to a specialist company. Since, as Gary Rubin points out correctly, specialist software often requires time for the user to get accustomed to and relationship building is easier with human beings than with a computer, such tasks should be outsourced by pharmaceutical companies and contract research organisations to data analysis companies like Fios Genomics. Service-­‐
based businesses such as Fios Genomics or Almac Diagnostics are capable of adding additional value to the software that is used in data processing by bringing in their own expertise and experience thus making data analysis more efficient and increasing the quality of the gained insights. Especially with biomarkers gaining importance in clinical trials such expertise is required for marker identification and validation (Carden et al., 2010). As shown in Figure 4.7, services provided by Fios Genomics encompass all parts of the drug R&D process, although the current focus lies on clinical development. This trend is corresponding with the strategy change of large pharmaceutical companies towards late stage development23. 22
Normally Phase I trials are performed in healthy volunteers and thus it can be argued that no new disease knowledge is added in this process. A statement that is definitely not true in cancer therapy trials as these are usually performed in patients (Carden et al., 2010). In cancer Phase I trials a biomarker-­‐based patient selection may be able to accelerate the development while increasing patient benefit (ibid.). For such a model to work, more knowledge exchange would be needed. 23
The discovery and early stage development is left to smaller biotechnology companies as they are now seen as more innovative. Once a compound is considerably far down in the pipeline and therefore de-­‐risked, a pharmaceutical company either licenses the IP or acquires the whole biotechnology company. Similar behaviour is detectable for other technology “big pharma” doesn’t have including bioinformatics companies. 66 Pre-­‐Clinical and Clinical Development Figure 4.7 Value-­‐adding aspects of bioinformatics and predictive biomarkers. According to Fios Genomics, the data acquired by today’s high-­‐throughput methods overwhelms normal researchers and data analysis specialists are needed. The company, a spin-­‐out from Edinburgh University, develops their own algorithms and software for genomic data analysis to automatically perform data processing (normalisation, statistical analysis) tasks. Value is added by additional data interpretation and data mining for gene-­‐expression genotyping and further in-­‐depth analysis and trend-­‐detection. Due to the development of own software and the use of a supercomputer-­‐format the workflow is scalable to data input from samples in clinical trial ranges. Source: Gary Rubin, personal communication Gary Rubin sees three major fields in which genomics analysis will gain importance in the future: (1) predictive toxicology at the preclinical stage will improve decision making in sorting out compounds that already show the tendency to be toxic, (2) pharmacogenomics in Phase II clinical development, and (3) data mining to look for trends or find new knowledge from existing data. Phase II trials are, according to Gary Rubin, currently the most important phase in the process and biomarkers (mainly RNA, DNA, genes and proteins) can help to predict how a patient will respond to a drug. Fios Genomics themselves can also provide a case in which they used a microarray-­‐based approach for looking at gene expression profiles from animals treated with different doses of acetaminophen to find early markers of liver toxicity in the blood and predict the experiment outcome within 24 to 48 hours. After finishing the animal trial 28 days later (the typical length for a short pre-­‐clinical toxicity study) histology and assay work gave the same results (for further information please refer to (Harris et al., 2009)), thus validating the approach and proving that animal toxicology studies could be cut down in some cases by more than 90% when using biomarkers and bioinformatics. Additionally, it was shown that even non-­‐invasive samples such as blood can be used to detect toxicity-­‐related expression profiles from remote organs which could have an impact on the way animal testing is currently done. When asked about the chance of replacing animal testing with computational methods, Gary Rubin stated that a 100% displacement will not occur since certain in vivo tests are still a mandatory requirement by regulatory agencies before the start of testing in humans, but that predictive toxicology could be used as a rapid “go-­‐no go” decision to identify early signs of toxicity and stop a drug in development before more money is potentially wasted on its development, thus replacing the Ames test and other experimental screens. 
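A minimal sketch of how an expression-based toxicity signature like the ones described above can be turned into a predictive classifier is shown below: fit a regularised logistic regression on gene-expression profiles labelled toxic or non-toxic and evaluate it by cross-validation. The data here are randomly generated placeholders, not the acetaminophen study data or any other real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder data: 60 animals x 35 signature genes (log expression values)
n_samples, n_genes = 60, 35
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)   # 1 = toxic outcome, 0 = non-toxic
X[y == 1, :5] += 1.0                     # make the first 5 genes weakly informative

model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

In a real application the labels would come from histology at the end of the study and the predictions from early blood samples, which is what allows the experiment outcome to be anticipated days or weeks in advance.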
He furthermore argues that every available method should be used to avoid even very rare but fatal adverse drug reactions in the clinic and that, since the regulators hold similar opinions, predictive toxicology results, pharmacogenomics and biomarkers will play a more important role in the approval process in the future. The use of computational tools may therefore help streamline the approval process and simplify compliance with the regulating agency.

4.5 Conclusions

Because of the severe problems in productivity, today's large pharmaceutical companies shift their efforts towards late-stage development, hoping to avoid the risk of failure and reduce their costs. Nevertheless, the past chapters (and the case study in Table 4.6) have shown that much can be done earlier in the pathway to guide decision processes and remove poor compounds from the pipeline. The motivation for using computational techniques in preclinical development, which is often interlinked with lead optimisation, is very similar: to identify compounds that may be toxic or induce unwanted side effects before moving on to expensive trials in humans. Once a compound is in clinical development, the application of computers shifts to streamlining clinical trials and improving trial design. Although it is still important to detect toxicity, emphasis is placed on identifying and applying biomarkers for patient selection, in order to show the true efficacy of the compound and to exclude patient groups that are very likely to show adverse effects due to known genetic variation in drug metabolism or target structure. Considering that the cost of post-marketing withdrawal is even higher than that of failing at the development stage (several billion US dollars in lawsuits and lost revenues) (Tsaioun et al., 2009), and that most recently withdrawn drugs showed cardiac toxicity (like Vioxx in 2004) or hepatotoxicity (Stevens and Baker, 2009), it is apparent that much more should be done in predictive toxicology early in the development pipeline. This remains challenging, however, as most relevant data are not published by the developing companies and modelling always needs to be put into a decision framework (Table 4.7).

Table 4.7 James Stevens (Eli Lilly) on computational models
James Stevens from the Division of Toxicology at Eli Lilly states that the "key to any computational model is what will be the decision framework and what value will the model add" (James Stevens, personal communication). Pure data acquisition therefore does not make sense, and neither does unrelated model building or predictive effort that may never be used later on. However, when put in the context of already sunk costs and potential savings, such models can give valuable insights and help reduce late-stage attrition. James Stevens also introduces the concept of "sunk cost bias", which holds that the money already spent on a development project has at least as much impact on the decision-making process as the model or experimental results. According to him, predictive tools therefore have a greater benefit in earlier stages of the discovery and development process.
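To make the kind of expression-based predictive toxicology discussed in this section more concrete, the sketch below flags transcripts that differ between treated and control animals using a simple fold-change and t-test filter, in the spirit of the blood-RNA screen described in the Fios Genomics case. The data, group sizes and cut-offs are simulated assumptions, not values from Harris et al. (2009) or from any of the interviewed companies.

```python
# A minimal sketch (not the Fios Genomics pipeline) of an expression-based
# early-toxicity screen: compare blood RNA profiles of treated vs control
# animals and flag transcripts that change early. All data, group sizes and
# cut-offs below are simulated/assumed, not taken from Harris et al. (2009).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

n_genes, n_per_group = 1000, 6
control = rng.normal(loc=8.0, scale=1.0, size=(n_genes, n_per_group))  # log2 signal
treated = rng.normal(loc=8.0, scale=1.0, size=(n_genes, n_per_group))
treated[:25] += 3.0  # pretend 25 transcripts respond to the toxic dose

def candidate_markers(control, treated, p_cutoff=0.01, min_log2fc=1.5):
    """Crude filter: per-gene t-test combined with a log2 fold-change threshold."""
    _, p = stats.ttest_ind(treated, control, axis=1)
    log2fc = treated.mean(axis=1) - control.mean(axis=1)
    return np.where((p < p_cutoff) & (np.abs(log2fc) >= min_log2fc))[0]

hits = candidate_markers(control, treated)
print(f"{len(hits)} candidate early toxicity markers flagged for follow-up")
```

In practice, normalisation, multiple-testing correction and validation against histology (as in the acetaminophen case above) would be essential before such markers could inform a go/no-go decision.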
The way in which computational methods can best help improve productivity is by making sure that the development of candidates which work only in a patient subpopulation is not discontinued due to lack of efficacy in random population samples, and that fewer drugs are moved into late-stage clinical development even though they are doomed to fail. The lesson from Pfizer's US$ 800 million failure of its drug candidate torcetrapib in Phase III trials in 2007 is that early signs of toxicity or adverse reactions should always be taken seriously and the underlying biology should be evaluated further before moving on (Nature, 2011). The best way of decreasing late-stage attrition rates is, however, using better-validated targets and better compounds. Efforts on both of these aspects have to happen much earlier in the pipeline.

Chapter 5 Knowledge Discovery

In previous chapters, it was pointed out several times that the lack of high-quality data remains a major obstacle in the process. Today's high-throughput methods in compound screening, but also in "omics" fields like genomics and proteomics (e.g. NGS, mass spectrometry), allow the rapid acquisition of millions of data points in a short time. All these data are stored in internal, public or commercial databases, and the sizes of public databases such as PDB, MedLine or GenBank are growing rapidly (Figure 5.1). Analysing such data and combining them with literature knowledge and experimental data from other sources is a challenging task. With fewer resources allocated to R&D, however, it becomes more attractive to analyse what is hidden in existing data (Gary Rubin, personal communication). The process of searching for hidden knowledge, patterns and connections among different database entries is called "Knowledge Discovery (in databases)", and its application to drug design will be discussed in this chapter. Relevant terminology is explained in Table 5.1.

Figure 5.1. Number of available publications (in millions) in the literature database MedLine, 1965–2010. This database of biomedical and life science citations from the U.S. National Library of Medicine (NLM) today contains citations from more than 5500 journals worldwide, with more than 18 million citations in total (U.S. National Library of Medicine, 2011). The free portal to this database, PubMed, contains even more citations because it includes sources that have not yet been indexed by MedLine as well as some additional life science journals (U.S. National Library of Medicine, 2010). Furthermore, the curated MedLine database grew at an average rate of nearly 2000 new citations per day in 2010 (U.S. National Library of Medicine, 2011).

Table 5.1 Terminology in Knowledge Discovery
Several subgroups of knowledge discovery now exist, depending on the mining input. Text mining, an approach to mining information from written documents, started with the emergence of Natural Language Processing (NLP) in the 1970s and 80s and has improved considerably since then. The somewhat different field of data mining, which aims at developing patterns and models from data series (Hand et al., 2001), developed over the same period and is influenced and fuelled by improvements in statistics, in computer science (e.g. machine learning techniques such as artificial neural networks (ANNs)), and in database management systems (Ranjan, 2007, Smit and van der Graaf, 2011).
Content mining for improved information retrieval and search started in the 1990s, and all of these techniques are now integrated into semantic applications in order to reflect the context in which the information is found (ibid.). A database can be defined as a "logically integrated collection of files" that "contains the raw data to be processed by the discovery system" (Frawley et al., 1992). In addition, further information about the form of the data normally exists in manuals, in the expert's head and in a data dictionary structure, which contains further definitions and constraints of the data such as field names and allowable data types (ibid.). The data in the database then form the input for the data mining algorithm, a well-defined procedure that produces output in the form of models or patterns in a finite number of steps (if it did not terminate after a finite number of steps, one would speak of a computational method rather than an algorithm) (Hand et al., 2001, Chapter 5). A model is defined by Hand et al. (2001, Chapter 6) as a high-level, global description of a data set that can be either descriptive or inferential. A descriptive model summarises the data concisely, whereas an inferential model allows hypotheses to be inferred about the data source and future data points. A pattern, on the other hand, is defined as "a local feature of the data" that may only be true for some variables or data points and gives some insight into records that have unusual properties or deviate from the general run of the data (ibid.). Methods used in data mining include pattern mining, in which rules of affinities among a collection of data points are identified; classification and prediction, which rely on decision trees and machine learning; and clustering, which groups similar records.

Since the field of pharmaceutical research creates huge amounts of data, automated knowledge extraction and strategic information discovery systems have long been incorporated into the pharmaceutical R&D process as methods to achieve time savings (Smit and van der Graaf, 2011). According to a recent study by Eefke Smit and Maurits van der Graaf (2011), the main applications for content mining in pharma are "information retrieval to find relevant documents in the sea of information and extraction of facts and assertions from within documents" (ibid.). Processes of text, data and content mining can and should, however, be applied at all stages of the discovery and development process to integrate the knowledge acquired from previous projects and earlier phases into the thinking, analysis and hypothesis-building process. Examples of such applications are:
• Target identification for the treatment of a disease
• Identification of controlling factors that regulate the expression of a gene
• Discovery of contradictions in scientific articles and patents as a new source of knowledge
• Exclusion of specific chemotypes or compounds that have been shown to be toxic or not useful in humans
• Analysis of intellectual property and prior art
• Drug re-profiling
• Discovery of the mode of action of a compound based on microarray data (Iorio et al., 2010)
• Inference of networks (Iorio et al., 2010)
• Hypothesis generation and testing

5.1 Text Mining

According to Yang et al. (2009), text mining is a computational approach to finding previously unknown information by automated extraction of information from different written resources.
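As a deliberately simplistic illustration of this idea, the sketch below scans a handful of abstracts for co-occurrences of gene and disease names, which is the crudest possible signal of a potential association. The abstracts and dictionaries are invented for illustration; real systems rely on full natural language processing and named-entity recognition rather than simple string matching.

```python
# A toy sketch of the simplest form of text mining: counting gene-disease
# co-occurrences in abstracts as a first, crude signal of association.
# Abstracts and dictionaries below are invented for illustration only.
from collections import Counter
from itertools import product

abstracts = [
    "BRCA1 mutations are strongly associated with hereditary breast cancer.",
    "Loss of BRCA1 function impairs DNA repair in breast cancer cells.",
    "TNF signalling contributes to inflammation in rheumatoid arthritis.",
]
genes = {"BRCA1", "TNF"}
diseases = {"breast cancer", "rheumatoid arthritis"}

def cooccurrences(abstracts, genes, diseases):
    """Count how often each (gene, disease) pair appears in the same abstract."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        for gene, disease in product(genes, diseases):
            if gene.lower() in lowered and disease in lowered:
                counts[(gene, disease)] += 1
    return counts

for (gene, disease), n in cooccurrences(abstracts, genes, diseases).most_common():
    print(f"{gene} - {disease}: {n} abstract(s)")
```

Even this trivial approach hints at why domain-specific dictionaries and ontologies matter: the quality of the extracted associations is limited by the vocabulary used to recognise the entities.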
The mining process can be based on prior knowledge and patterns to find literature of interest or it can use statistics and machine learning to classify the literature (ibid.). Text mining can be used in the drug design and development process at various occasions and can be of major importance in discovering prior knowledge about the disease biology, known treatments and their side effects. This method can be used to identify disease-­‐
associated genes or proteins, to understand the role of such entities in diseases, and to reconstruct disease-­‐related networks. For example, the program GeneWays is capable of automatically examining literature databases and predicting the physical interactions among several disease-­‐associated genes from data in literature (Krauthammer et al., 2004, cited by Yang et al., 2009). Other approaches such as text mining in patent databases can be more useful at later stages of discovery in evaluating the prior art of a compound, 72 Knowledge Discovery identification of interesting chemical themes, and patent busting (Robson et al., 2011). A limitation of text mining is, however, that it is very domain dependent because used terms, ontologies and lexicons tend to be subject specific and depending on a specific context (Smit and van der Graaf, 2011). 5.2 Data Mining The previously described category of text mining is a part of the field of data mining (Smit and van der Graaf, 2011). Data mining is always of importance when experimental results are obtained and for the integration of knowledge from external databases. In high-­‐
throughput data analysis, supervised classification and unsupervised clustering are predominantly used (Yang et al., 2009). Mining gene expression data, for example, can be used to identify characteristics that are specific to the healthy or diseased state, thus pointing to potential therapeutic targets and biomarkers. This approach needs follow-
up validation though, because due to the complexity of the human organism, gene expression profiles and protein levels do not always correlate (ibid.). In this context mining of high-­‐throughput mass spectrometry may be of further use to analyse proteomic data. Other applications of data mining are (Ranjan, 2007): • Analysis of safety and efficacy profiles from patient databases • Improvement of patient selection through biomarker identification • Identification of factors and pathway interactions that interfere with the drug • Identification of the best hits and potential lead compounds • Clustering of compounds into groups according to their chemical properties thus giving information about the chemical space and similar compounds • Organisation of clinical trial data (correlation between drug and adverse effects, disease progression etc.) In many cases, the best results are obtained when mining approaches for different sources are integrated with each other to give a better impression of the big picture (Figure 5.2). 73 Knowledge Discovery Figure 5.2 Integrated Data Mining Workflow. Microarray data can be combined with text mining, “omics”-­‐data and other sources to find new targets, biomarkers or networks in a systematic way. Source: Yang et al., 2009 5.3 Semantic Technologies Traditional computational mining approaches are limited by the structure of the data and its representation in different sources. A single concept can have different names and representations (e.g. chemical formula, trivial chemical name, marketed drug name, etc.) in different sources, which makes it hard to connect the information coming from different sources or, if different concepts have the same name24, it is impossible for standard computational mining approaches to distinguish between them (PricewaterhouseCoopers, 2008). For this purpose semantic technologies gain importance in the drug discovery process because they enable the researcher to connect disparate data sets and find correlations that would not have been observable otherwise (ibid.) (illustrated in Table 5.2). Table 5.2 Case Study Knowledge Discovery: BioWisdom (now part of Instem) BioWisdom, a company from Cambridge (UK) and now part of the software vendor Instem, is the market leader in delivering healthcare intelligence to the pharmaceutical industry by providing semantic metadata (“data about data”) platforms. Gordon Baxter, former CEO of BioWisdom and Chief Scientific Officer (CSO) at Instem, has emphasised the importance of data integration from various sources as a key criterion in creating intelligence. BioWisdom’s products are based on the MetaWise ontology which is used to annotate, translate and connect assertions and critical concepts expressed in unstructured and structured (text-­‐) records. The intelligent metadata is connected to reference (e.g. accession number, HGNC term, or MedDRA term) and referent (e.g. Disease, Drug, Target, Pathway, Tissue) and therefore harmonises differing terminology and semantics or translates 24
e.g. “Sonic Hedgehog” in the context of the computer game or a mammalian signalling pathway 74 Knowledge Discovery between different contexts (e.g. switching from the gene name to the sequence, drug name and SMILES25 string or structure, etc.). By providing software for data mining in very large datasets (Sofia), for data visualisation and manipulation (OmniViz) or for integration of genomic data coming from a variety of sources (SRS), a better integration of data is possible. According to Gordon Baxter, it is essential to integrate all types of information to obtain the complete picture of a concept. In this context he gives the example of the well known drug Viagra (sildenafil citrate): Apart from the name of the marketed drug, much data is connected to other related labels such as the chemical structure of the compound, the SMILES string, the patent, the chemical name etc. that would not be obtained when searching only for the term “Viagra”. The company has developed algorithms to recognise concepts of the various languages used in biomedical research to overcome the described problems and normalise synonyms, aliases, text variants in written documents, and alternatives to concepts (Figure 5.3). Figure 5.3. Concept of intelligent metadata. Key concepts are listed and ordered and connected to other concepts. The knowledge how different types of concepts relate to each other is important to create assertional metadata. Assertional metadata describes statements in texts in triplets (concept-­‐relationship-­‐concept) useful to create a summary layer over multiple data sources for better knowledge alignment from contemporary and historical sources. Source: (Reed et al., 2010) He explains that a current challenge in the industry is that most information lies in unstructured text files (e.g. laboratory notebooks) and therefore inference of networks by mapping of concepts (e.g. text to structures) from various data sources (text databases, experimental data of various kinds, laboratory notebooks, adverse effect data etc.) remains challenging. Intelligent metadata may, however, be part of the solution as it allows the alignment of systematic and non-­‐systematic text records. If data was structured better according to standards and integrated from all sources, Gordon 25
SMILES (simplified molecular input line entry specification) strings are a way of describing the structure of a molecule in line notation. 75 Knowledge Discovery Baxter suggests, the knowledge going into the drug design process would be of a higher quality. As poorly validated targets are one of the main reasons for later problems with drug efficacy, the use of standards and mining technology can improve the quality of the decisions that are made and allow a drug target or compound fail for the best reasons and as early as possible. The measures in which data mining tools and metadata can help improve productivity in the pharmaceutical R&D process are, according to Gordon Baxter, (1) “making the impossible possible” and, (2) reducing the “junk” going into development. Since the paradigm “junk in – junk out” applies, reducing bad quality data/hypothesis/compounds will reduce costs because once such a poor compound is in clinical trials it costs money although it will finally fail. One thing that used to be impossible, but was made possible by BioWisdom is for example reading all abstracts in MedLine and put the knowledge hidden in there in a useful context. Furthermore, the Safety Intelligence Program (SIP), a comprehensive knowledgebase about adverse effects of chemical compounds, is created with BioWisdom’s Sofia platform and heavily relies on assertional metadata from data provided by public databases and industrial partners such as AstraZeneca about toxicity and adverse effects data. The assertional metadata in the knowledgebase is structured in triplets of the form “drug X has effect Y in entity Z” which wouldn’t be obtainable without data mining techniques and is now used by AstraZeneca as their main tool in predictive toxicology. Gordon Baxter furthermore states that AstraZeneca’s application of highly innovative information technology like SIP in their R&D process is now showing its benefits in their drug candidate pipeline. It will be seen in the future, if this statement remains true for improved productivity or quality of approved drugs. Anyways, it is true that apart from smart researchers and good technology, data is the key to innovation. Currently the knowledge is hidden in the data and is trapped in different formats. Methods that help organise such data will be helpful in making the most of the experiments and thus probably increase the return on investment for experiments and novel technologies. Another aspect in which information technology is of high value lies in the optimisation of the interaction with the regulator. In this context more systematic approaches are in use so far, such as Instem’s Provantis for early drug development. The standards required by regulating agencies (e.g. the FDA) in reporting results make the use of computational systems now highly attractive in streamlining the process not only in clinical development, but also in preclinical stages. A data warehouse like Centrus, a new product that includes both Instem’s highly systematic and BioWisdom’s semantic technology, may therefore be useful in including intelligence from unstructured reports into the structured framework required by management and the regulator (Figure 5.4). 
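To make the notions of concept normalisation and assertional metadata from the BioWisdom case study more tangible, the sketch below maps drug aliases to one canonical concept and stores assertions as (concept, relationship, concept) triples that can be queried regardless of which alias a source document used. The tiny vocabulary and assertions are a minimal, invented illustration and are not taken from the Sofia or SIP products.

```python
# A minimal sketch of synonym normalisation plus assertional metadata stored
# as (concept, relationship, concept) triples. Vocabulary and assertions are
# illustrative only, not drawn from BioWisdom's Sofia/SIP platforms.
SYNONYMS = {
    "viagra": "sildenafil",
    "sildenafil citrate": "sildenafil",
    "uk-92480": "sildenafil",        # development code name
    "paracetamol": "acetaminophen",
}

def normalise(term: str) -> str:
    """Map any known alias to its canonical concept name."""
    term = term.strip().lower()
    return SYNONYMS.get(term, term)

# Assertional metadata: triples of the form (drug, relationship, entity)
triples = [
    (normalise("Viagra"), "inhibits", "PDE5"),
    (normalise("sildenafil citrate"), "has_effect", "hypotension"),
    (normalise("Paracetamol"), "has_effect", "hepatotoxicity"),
]

def effects_of(drug_alias: str):
    """'Drug X has effect Y' lookups work regardless of which alias is used."""
    drug = normalise(drug_alias)
    return [obj for subj, rel, obj in triples if subj == drug and rel == "has_effect"]

print(effects_of("Viagra"))  # ['hypotension'], found via the canonical concept
```

The design choice illustrated here mirrors the point made in the case study: once terminology is harmonised, assertions collected from disparate sources (literature, laboratory notebooks, safety databases) can be aligned and queried as a single layer of intelligence.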
Often the best improvements come from small things that make a task a little easier to perform rather than from "big solutions" (William Bains and Gordon Baxter, personal communication), and data mining and automatic harmonisation of formats may belong to this group by making interaction with, and adherence to, the regulating body easier.

Figure 5.4. Instem's Centrus™ platform. A platform for data accession from various, partially unstructured sources, with technology to transform, align, and view these data in a meaningful way, as well as for easily sharing the knowledge derived from them within the company and with journals and regulatory authorities. Source: http://www.instem-lss.com/centrus.pdf

5.4 Integrated Informatics Systems and Workflow-Based Frameworks

Integrated informatics systems and workflow-based frameworks are becoming increasingly popular as they are often modular and can be personalised according to the company and the project. Furthermore, in the present time of austerity in the pharmaceutical sector, there is new demand within knowledge discovery for simpler and cheaper mining solutions. According to Smit and van der Graaf (2011), the most common ways to reduce costs consist of outsourcing to low-cost countries such as India, improving in-house mining tools to make them easier for untrained staff to use, and switching to standard tools from third-party suppliers. Gary Rubin (Fios Genomics) stated in his interview that outsourcing to low labour-cost countries is only partially attractive, since the analyst's expertise is the main value-adding part of the process. Therefore a bigger trend can be observed towards out-
sourcing to specialist companies (Gary Rubin, Gordon Baxter, personal communication) and towards externally developed software. This trend towards standard software that is not developed in-­‐house could be confirmed in several interviews (Nicolas Fechner, Gordon Baxter). Workflow technology like PipelinePilot and the open source tools Taverna and KNIME have been seen as indirect competition at different stages in the drug discovery and development pipeline (Gordon Baxter, personal communication). By now they have evolved into standard software-­‐tools for flexible data and content mining in the pharmaceutical industry (Nicolas Fechner, personal communication). In workflow-­‐based frameworks each 77 Knowledge Discovery task is represented by a single module (or “node”) (Figure 5.5) and several nodes can be connected by pipes that transfer the metadata from one node to the other (Figure 5.6). An extensive review of the plethora of commercially available and open-­‐source tools is given by Tiwari and Sekhar (2007). Figure 5.5. The anatomy of a workflow node. Some kinds of data or metadata are given as input to the node. Then it is transformed according to given rules, algorithms or parameters. Metadata is then passed onwards for further processing or visualisation. Source: (Tiwari and Sekhar, 2007) Figure 5.6. A typical KNIME workflow. The Konstanz Information Miner (KNIME) was developed as an open-­‐
source workflow suite at the University of Konstanz by the Chair of Bioinformatics and Information Mining headed by Michael Berthold. The tool is designed in a modular way that allows the user to create workflows for recurring tasks. The frontend contains "nodes" that perform specified sub-tasks and can be combined in different ways. Importantly, the structure of the data can change while passing through the workflow and can be visualised and transformed according to the wishes of the user. The open-source platform allows the use of standard nodes, but also extension with commercial nodes for specific tasks as well as the development of the user's own modules to individualise the framework. By separating the development and implementation of algorithms from their application, workflow technologies can be used by non-computational chemists and others working in data analysis without deep knowledge of the underlying mathematical and computational concepts. Source: (KNIME, 2011)

5.5 Development of a New Informatics-Centred Model for Drug Discovery and Development

Using the insight gained from this chapter, personal interviews and the economic analysis of the various applications of computational methods discussed in previous chapters, the author of this thesis developed a model of the drug discovery and development process that is centred on informatics techniques for guiding the decision-making process (Figure 5.7). Most of the involved tools are already in current use, but their full integration is not.

Figure 5.7. Proposed new informatics- and data-centred drug discovery and development workflow. By integrating predictive tools and knowledge from failed and successful projects, compounds that may have problems with efficacy, ADME, or toxicity are abandoned as soon as possible, whereas the chemical diversity of potentially successful compounds is kept as broad as possible throughout the discovery stage. Computational approaches are used to guide hypothesis generation and the design of wet-lab experiments, while the need for high-throughput approaches is reduced, as these tend to have high false-positive and false-negative rates. Ideally, consensus approaches are used which integrate several complementary techniques to validate the output of the computational methods. By integrating phenotypic screening and various types of cell lines, the number of physiologically relevant compounds may be increased while costs remain low, since the screening libraries are focused on compounds that are very likely to work. Biomarkers and genotypes in which the treatment has the highest chance of success are identified throughout the process and validated in early clinical trials. Adaptive trial design (*as reviewed by Barton (2008)) has been estimated to allow improvements in overall compound success rate of 10%, that is, decreasing attrition rates from ca. 95% to 85% and thus trebling the number of approved drugs, while reducing costs through time savings and much smaller patient cohorts. The time estimates are based on the case studies discussed in the previous chapters. (A larger version of the figure can be found in Appendix C)

Several additional aspects that lead to poor productivity and high costs in pharmaceutical R&D should be taken into consideration to streamline the development of new drugs: 1.
Equality of bench scientists and computational scientists with shared credit for the result: Scott Lusher and his colleagues made a strong point for the integration of computational experts in the discovery team (Lusher et al., 2011). Furthermore, all parties (bench chemists, computational chemists, biologists and bioinformaticians) need to accept each other as equal partners in the team with shared credit. According to Lusher, that is the only scenario where they work well together (personal communication). This team-­‐mentality is of special importance if part of the task, e.g. computational modelling, is not performed in-­‐house. The integration of expertise, working mentalities, and creativity will eventually create innovation. 2.
Change of reward system and incentives: Currently, members of the teams at different stages are often rewarded for the sheer number of compounds they push into the next stage without considering their overall drug-­‐like quality or the attrition rate in the next stage. This tends to give the scientists a pressure to push not very promising compounds into the next stage where time and money is wasted on further research. By limiting incentives only to really promising high-­‐quality candidates and rigorous failure management, attrition can be shifted to earlier stages of development saving costs and futile efforts. 3.
Increase trust in computational models: all parties should make an effort to understand the others and the techniques they use. This may be a non trivial task, but Optibrium’s example has shown that sometimes simple things like data visualisation and presentation can make a difference. Furthermore, efforts are needed from all parties: “Chemists must be willing to use them [predictive models] and when they do, the results must be good enough to justify using them again” (Scott Lusher, personal communication). 4.
Virtual Development model/outsourcing of many steps to knowledgeable partners (biotechnology companies, bioinformatics/cheminformatics companies, and academia): The use of informatics in life sciences and the pharmaceutical development process is very complex. You could compare the complexity of a software package that would integrate all relevant techniques to the complexity of flying a modern airplane. It is possible to learn how to use it, but not in ten minutes. Therefore we normally 80 Knowledge Discovery leave it to specialists. Similarly, the case studies of fios genomics and Prosarix have shown that the use of such software is best performed by trained experts that are capable of individualising the toolbox according to the specific project and who can use their knowledge from prior projects to get reliable results quicker. Additionally, external specialist companies have to prove that they are better than in-­‐house teams to stay in business thus they have a bigger motivation for providing good results in a short time and for as low cost as possible. Since in-­‐house teams often fail even if they try to imitate an external solution/software/algorithm while wasting time and money in the process of developing their own platforms, collaborations should be prioritised from the beginning. It may be a major task for the management of the project to allow this to happen and reduce “not invented here”-­‐issues. Certainly, it may be conceivable to build your own airplane in your basement – but again: there are specialists for that task. The rate of integration of novel software tools and the maximum amount of specialist software on the desktop is limited by the scientist’s tolerance for training to use it (Waller et al., 2007) and to keep up with updates. Therefore, software that is developed for non-­‐computational scientists should be designed with a clean and intuitive user interface that hides most of the potentially confusing computational framework and everything that may be too complex to be performed with such software should be sourced out to external specialists or other team members. With such an approach a “highly networked, connected ecosystem of people doing highly innovative work” (Chakma et al., 2009) can be created. 5.
Knowledge integration: Every insight as well as all experimental data should be organised in a data warehousing system that offers easy access to all members of the discovery or development team in order to make decision on the basis of a comprehensive knowledge that has been gained in the project, in previous projects, or by external sources 6.
Open Innovation: Open-source projects such as Linux, Open Office or Android have drastically changed the way in which development and business are done in the IT industry. Similarly, open innovation in the pharmaceutical industry may have the potential to drastically reduce costs and to harness creativity and innovation from a global community (academia, pharma, biotech, CROs, and not-for-profit organisations) (Talaga, 2009). Leaders in the field such as Bernard Munos, Patrick Vallance, the senior vice-president for medicines development and discovery at GSK, and Chas Bountra, Chief Scientist at The Structural Genomics Consortium, are starting to see the potential of this strategy (Cressey, 2011). In order to reduce duplication in research, it may be a viable option to use open-source approaches in the discovery stage and pre-
POC studies. The collaboration of competing pharmaceutical companies, academia, and biotechnology companies up to a point in which a compound’s use in animal models is shown without the now common IP restrictions may give a larger basis of risk-­‐sharing companies the chance of reduced development risk and cost. After POC the potential drug could be licensed out or acquired by one or more members from the rest of the consortium for further development. 7.
Improvement of current computational techniques: Bioinformatics and cheminformatics tools and databases follow two different strategies. While bioinformatics platforms and databases such as GenBank are largely open-­‐source and non-­‐commercial, cheminformatics software and databases such as compound libraries tend to be commercial and closed-­‐source (Wishart, 2005). The integration of these two informatics disciplines is therefore challenging and needs to be improved. Furthermore, improvements in molecular modelling, docking, and predictive approaches for ADME and toxicity characterisation are hindered by lack of data. This data should, if available, be made public and models should be improved in an iterative manner throughout the project. 8.
Fail fast, fail cheaply and fail for the best reasons: While current computational techniques are used for enrichment of experiments with potentially good compounds and negative selection of those that are very unlikely to succeed, informatics can be used even more to inform all steps of the process and allow failure of hits, leads, and drug candidates for the best scientific reasons. The implementation of such a new discovery and development model may be a task that requires drastic restructuring of research teams and changes in the IP-­‐centred research mentality. The use of open-­‐source approaches can be highly valuable as it allows every member of the team to be innovative while simultaneously reducing costs. However, not all scientists currently in research teams are part of the “generation internet” and understand 82 Knowledge Discovery the importance of computational techniques in modern drug development26. It is therefore the task of the management to decide how to balance different research mentalities in the project to achieve maximum productivity and innovation. 5.6 Conclusions The effect of data mining and workflow are not measurable directly as they are pervading the whole process in pharmaceutical discovery and development processes. They are needed to make sense of the data acquired with novel high-­‐throughput technology, to find knowledge hidden in already existing literature and to integrate the data coming from various sources. By doing this, more is learned about the biology of diseases which then makes it easier to find optimal compounds for the right disease, the best target(s) and the right group of patients, and make the right decision for the right reasons. Furthermore, data mining can be of use in drug re-­‐profiling and patent busting which have the advantages discussed in Chapter 3 (e.g. short development cycle and lower costs of development). Ideally, using a translational approach to the process, a better target for the drug development (or even better: several targets for a robust treatment) is chosen, compounds that are likely to fail due to toxicity or other reasons are filtered out early, potentially very good compounds are analysed with priority, and experiments are not uselessly repeated only because no one in the research team knew that someone else already failed with that approach. Therefore the discovery process would be optimised and attrition rates could be reduced in late stage development. Serendipity, often named as the source for innovation, could also be harnessed in a more productive way: taking the example Viagra: initially the drug was intended for use against hypertension and angina pectoris. Trials showed unfortunately, that it has only low efficacy in this context. One of the scientists at Pfizer, the developing pharmaceutical company, was however able to link the observed side effects of sildenafil to information in a PhD thesis and two scientific papers which led to the allocation of further resources to test the drug against erectile dysfunction (de Rond and Thietart, 2007, de Rond et al., 2011) and resulted in the successful development of Viagra. It 26
In fact, to the author’s knowledge even many young scientists show an alarmingly high reluctance to accept the importance of computers in the laboratory environment and for data analysis purposes up to the stage that they boycott them completely (while happily using facebook, computer games and their newest electronic gadgets in their free-­‐time). Obviously, this needs to be changed: the current practice in life science degrees of only including very few, mainly elective bioinformatics modules in the curriculum will not result in the education of open-­‐minded innovative scientists that are required in the pharmaceutical industry! 83 Knowledge Discovery becomes clear that not only chance, but also causal background and strategic choice are part of serendipity (ibid.), which can be improved by systematic knowledge integration into the process. Unfortunately, the reluctance of the big pharmaceutical companies to share data is still impairing such a development. However, some have understood the importance of data sharing and cooperation (Cressey, 2011) thus this may change in the future. 84 Chapter 6 Discussion The past chapters have illustrated that computational methods have the potential of improving pharmaceutical research productivity in various ways. However, rational drug design is not perfect – yet. History has shown that many drugs owe their discovery to serendipity (de Rond et al., 2011) and not all function on a distinct molecular target (Overington et al., 2006). Furthermore, some drugs cannot be found with current target-­‐
based approaches at all, such as omeprazole, which is only active in acid-producing cells, aciclovir, which only works in virus-infected cells, and prodrugs like sulfamidochrysoidine, which need prior metabolic activation (Kubinyi, 2008). Since many drugs rationally designed against a single target have been rendered useless as single-agent treatments by the development of resistance (e.g. HIV-1 protease inhibitors) (Mandal et al., 2009), polypharmacology is gaining importance, and the research requirements (and computational techniques) in this strategy may be different (Durrant et al., 2010). This thesis has covered the use of many computational methods in the drug design and development process and their impact on R&D productivity. The results are summarised in Table 6.1. There are, however, several things that need to be considered when evaluating these results:
• Imprecise models can have a negative effect on a project, as they can misguide experiments and therefore result in time delays and additional costs. The specificity and sensitivity of the methods used therefore need to be considered.
• Due to space constraints, only bioinformatics- and cheminformatics-related approaches have been discussed. Information technology has many more applications in the pharmaceutical industry that were not the subject of this work, including synthesis route prediction, formulation modelling, administrative tasks etc.
• A "survivor bias" is inherent in all published data, as only successful applications are published in journals or in case studies on company websites, and little is published about failed attempts at rational drug design. Furthermore, company case studies contain no information on those cases in which their technology was used without beneficial outcome. Judging from these data may therefore result in an overestimation of the benefits of computational methods.
• Case studies and expert interviews have formed the core of this analysis. Here a bias is apparent, since every company trusts in its product and over-emphasises its benefits.
The only realistic way of benchmarking the impact of informatics on drug discovery productivity would be "to take the same drug/target/development candidate, convene two teams of identically qualified scientists to develop them, give one of them technology X, and see who wins by launching a drug in 15 years' time" (William Bains, personal communication). This approach is not practical, however, and surrogate measurements are used which have their limitations.
Table 6.1 Summary of Productivity Impact of Computational Techniques, based on the examples and case studies in the previous chapters

Phase | Measures | Quantitative estimate | Reported time savings
Target identification and validation | Reduce redundancy and minimise the need for wet-lab experiments; make the impossible possible; better target selection | 10% cost savings | 6 months
Hit-to-lead | Smaller screening libraries; better hit rates | ROI of more than US$ 3; more than 5% improvement in experimental efficiency | 4 months in hit/lead identification
Lead optimisation | Better drug candidate selection; lower number of in vitro and in vivo experiments | 90% fewer compound syntheses | 9 months
Pre-clinical development | Fewer animals used; use of minimally invasive approaches; faster results | N/A | Animal study results in 3 instead of 30 days
Clinical development | Smaller trial sizes; fewer trials; lower late-stage attrition rate; more approved drugs | Trial size reduction by up to 60% | 40% reduction in Phase I trial length (7 months); 20% reduction in Phase III trial length (6 months)
Total | - | - | More than 32 months
N/A: not available. Data in this table are discussed and cited in the corresponding chapters of the thesis.

Overall R&D expenditure is increasing while the number of newly approved drugs remains stagnant. This shows that not all of the novel technologies that were praised as the big solution to the problem have delivered as expected. This is especially true for some HTS approaches, which have led to neglect of the complex character of the human organism and of disease, so that newly developed compounds fail because they do not work in the context of the human body. According to Gordon Baxter's personal experience, the time required before a lead compound is found has definitely increased in the past years, and a lot of experimentation is performed without even sight of a molecule (personal communication). It can be argued that such basic research belongs not in the pharmaceutical industry but in academia, and that better collaborative models and knowledge exchange may improve the time to lead discovery again. There are currently 9000 compounds in development (The Association of the British Pharmaceutical Industry, 2011), of which nearly half are in clinical development (Figure 6.1), and the overall amount of research has increased (Gordon Baxter, personal communication). There is, and will always be, a lag time between a change in the research model and its benefits. It therefore remains to be seen how the strategies applied today will increase productivity. The same applies to computational methods. As the performance of most computational tools relies heavily on data of good quality, which are only now starting to become available in sufficient amounts, the true impact of IT on R&D performance will be seen in the future, although some of the reviewed examples have shown that the use of current computational tools can already reduce the discovery time prior to clinical development to less than three years (instead of more than 5.5 years).

Figure 6.1. Medicines in Development as of January 2011. Source: (The Association of the British Pharmaceutical Industry, 2011)

The use of informatics to structure the process into a more systematic scheme is viewed critically by some (Bernard Munos, personal communication27), as it may reduce the serendipity and creativity that have driven past successes. This may be true for the way in 27
“I don't think innovation can be scripted or mandated, or reduced to a rational set of rules. If it were possible, we could produce Nobel prize winners on demand! Research shows instead that most past biomedical breakthroughs came from engaging in high-­‐risk, unconventional science. Until we have enabled predictive biology, this will probably remain the most productive pathway for discovering breakthrough drugs” 87 Discussion which computational methods have been applied in the past – as tools applied only when nothing else was possible and highly focused on a specific task. With ever increasing risk-­‐
aversion and prudence of the regulating authorities – for the benefit of the patients – an informatics centred framework will however help streamline the process. It is true, that drug design and development does not need computers (Gordon Baxter, personal communication) as this is how it used to work in the past, but IT can be very useful in improving essential tasks in the process making the research ideas of scientists easier to fulfil. By integrating data, different technologies and smart scientists who are willing to perform unconventional science in a knowledge driven research framework which allows the exploration of a diversity of innovative solutions, informatics will help harness innovation and help finding new and better drugs in less time. With ever increasing computing power28, growing databases and more evolving software tools, computational techniques in drug design have not yet reached their peak and will gain reliability and importance in the future. The structure of an informatics-­‐centred framework has been developed in the past Chapter and is based on the model of a virtual organisation that is centred on computational methods for optimal data analysis and knowledge driven decision making: • Get all the data from all the sources relevant for a target/compound/chemotype • Integrate teams of bio-­‐/cheminformaticians and wet-­‐lab scientists to get them to work together (people needed on the interface who can understand and work together with both kinds of (sometimes very peculiar) personalities) • Use expert companies or academia collaborations for different steps, but make data available globally to everyone involved in the project (someone may notice something unusual and innovative). Using computational methods only for focused problem solving in a small area without giving the big picture to the person performing the task will result in work performed while being "mentally blinkered" • Allow innovative research routes and hypothesis and use IT to further evaluate them before implementing them in an experiment • Create an abundance of potential compounds to allow quick failure without getting pipeline problems 28
Moore’s law states that the number of transistors that can be placed inexpensively on a chip doubles every two years thus doubling the computational capacity (Koomey, 2010). 88 Discussion With such a model it may be possible to de-­‐risk some aspects of the process while still allowing highly innovative drugs to be found. As there are still great unmet medical needs for efficient therapies against many diseases, every available tool should be used to improve the productivity in the pharmaceutical R&D process and reduce the time lag between start of research and marketing approval to a minimum. Computational methods are not the sole solution to that problem, but when integrating information technology and systems biology better into the rational process of designing a drug, significant improvements can be expected with benefits for the patients and the developing companies. 89 References References ADAM, M. 2005. Integrating Research and Development: the Emergence of Rational Drug Design in the Pharmaceutical Industry. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 36, 513-­‐537. ACCELRYS. 2011. Pharmacophore Modeling and Analysis [Online]. Accelrys. Available: http://accelrys.com/products/discovery-­‐studio/pharmacophore.html [Accessed 28 August 2011]. ARROWSMITH, J. 2011a. Trial Watch: Phase II Failures: 2008–2010. Nat Rev Drug Discov, 10, 328-­‐329. ARROWSMITH, J. 2011b. Trial Watch: Phase III and Submission Failures: 2007–2010. Nat Rev Drug Discov, 10, 87-­‐87. BARTON, C. L. 2008. Innovative Clinical Trial Design and Management. London: Datamonitor. BATES, S. 2010. Progress Towards Personalized Medicine. Drug Discovery Today, 15, 115-­‐
120. BEUSCHER IV, A. E. & OLSON, A. J. 2008. Iterative Docking Strategies for Virtual Ligand Screening. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐Protein Interactions. Cambridge: The Royal Society of Chemistry. BLEICHER, K. H., BOHM, H.-­‐J., MULLER, K. & ALANINE, A. I. 2003. Hit and Lead Generation: Beyond High-­‐Throughput Screening. Nat Rev Drug Discov, 2, 369-­‐378. BÖHM, H.-­‐J., FLOHR, A. & STAHL, M. 2004. Scaffold Hopping. Drug Discovery Today: Technologies, 1, 217-­‐224. BOUCHIE, A. 2006. Clinical Trial Data: to Disclose or not to Disclose? Nat Biotech, 24, 1058-­‐
1060. BUCHAN, N. S., RAJPAL, D. K., WEBSTER, Y., ALATORRE, C., GUDIVADA, R. C., ZHENG, C., SANSEAU, P. & KOEHLER, J. 2011. The Role of Translational Bioinformatics in Drug Discovery. Drug Discovery Today, 16, 426-­‐434. BUGRIM, A., NIKOLSKAYA, T. & NIKOLSKY, Y. 2004. Early Prediction of Drug Metabolism and Toxicity: Systems Biology Approach and Modeling. Drug Discovery Today, 9, 127-­‐135. BURTON, P. S., POGGESI, I., GERMANI, M. & GOODWIN, J. T. 2006. Computational Models Supporting Lead Optimization in Drug Discovery. In: BORCHARDT, R. T., KERNS, E. H., HAGEMAN, M. J., THAKKER, D. R. & STEVENS, J. L. (eds.). Optimizing the “Drug-­‐Like” Properties of Leads in Drug Discovery. New York: Springer 90 References CARDEN, C. P., SARKER, D., POSTEL-­‐VINAY, S., YAP, T. A., ATTARD, G., BANERJI, U., GARRETT, M. D., THOMAS, G. V., WORKMAN, P., KAYE, S. B. & DE BONO, J. S. 2010. Can Molecular Biomarker-­‐Based Patient Selection in Phase I Trials Accelerate Anticancer Drug Development? Drug Discovery Today, 15, 88-­‐97. CHAKMA, J., CALCAGNO, J. L., BEHBAHANI, A. & MOJTAHEDIAN, S. 2009. Is It Virtuous to Be Virtual? The VC Viewpoint. Nat Biotech, 27, 886-­‐888. CHEW, J. 2010. The Economics of Drug Discovery: 'First in Class' vs. 'Best in Class' [Online]. Seeking Alpha. Available: http://seekingalpha.com/article/221704-­‐the-­‐economics-­‐of-­‐
drug-­‐discovery-­‐first-­‐in-­‐class-­‐vs-­‐best-­‐in-­‐class [Accessed 08 August 2011]. CHONG, C. R. & SULLIVAN, D. J. 2007. New Uses for Old Drugs. Nature, 448, 645-­‐646. CLARK, D. E. 2008. Computational Prediction of Aqueous Solubility, Oral Bioavailability, P450 Activity and hERG Channel Blockade. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐Protein Interactions. Cambridge: The Royal Society of Chemistry. CRESSEY, D. 2011. Traditional Drug-­‐Discovery Model Ripe for Reform. Nature, 471, 17-­‐18. DAVIS, A. M., TEAGUE, S. J. & KLEYWEGT, G. J. 2008. Application and Limitations of X-­‐Ray Crystallographic Data in Structure-­‐Guided Ligand and Drug Design. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐Protein Interactions. Cambridge: The Royal Society of Chemistry. DE ROND, M. & THIETART, R.-­‐A. 2007. Choice, Chance, and Inevitability in Strategy. Strategic Management Journal, 28, 535-­‐551. DE ROND, M., MOORHOUSE, A. & ROGAN, M. 2011. Make Serendipity Work For You [Online]. Boston: Harvard Business Publishing. Available: http://blogs.hbr.org/cs/2011/02/make_serendipity_work.html?cm_sp=blog_flyout-­‐_-­‐
cs-­‐_-­‐make_serendipity_work [Accessed 25 June 2011]. DIMASI, J. A. & GRABOWSKI, H. G. 2007. The Cost of Biopharmaceutical R&D: Is Biotech Different? Managerial and Decision Economics, 28, 469-­‐479. DIMASI, J. A., FELDMAN, L., SECKLER, A. & WILSON, A. 2010. Trends in Risks Associated With New Drug Development: Success Rates for Investigational Drugs. Clin Pharmacol Ther, 87, 272-­‐277. DIMASI, J. A., HANSEN, R. W. & GRABOWSKI, H. G. 2003. The Price of Innovation: New Estimates of Drug Development Costs. Journal of Health Economics, 22, 151-­‐185. DREWS, J. 2000. Drug Discovery: A Historical Perspective. Science, 287, 1960-­‐1964. DRUGBANK. 2011. Drug Bank -­‐ Open Data Drug & Drug Target Database [Online]. Alberta. Available: http://www.drugbank.ca/ [Accessed 14 August 2011]. 91 References DUFFNER, J. L., CLEMONS, P. A. & KOEHLER, A. N. 2007. A Pipeline for Ligand Discovery Using Small-­‐Molecule Microarrays. Current Opinion in Chemical Biology, 11, 74-­‐82. DURRANT, J. D., AMARO, R. E., XIE, L., URBANIAK, M. D., FERGUSON, M. A. J., HAAPALAINEN, A., CHEN, Z., DI GUILMI, A. M., WUNDER, F., BOURNE, P. E. & MCCAMMON, J. A. 2010. A Multidimensional Strategy to Detect Polypharmacological Targets in the Absence of Structural and Sequence Homology. PLoS Comput Biol, 6, e1000648. EICHLER, H.-­‐G., ARONSSON, B., ABADIE, E. & SALMONSON, T. 2010. New Drug Approval Success Rate in Europe in 2009. Nat Rev Drug Discov, 9, 355-­‐356. ENTELOS 2007. SGLT2. Target Clinical Report. Entelos. ENTELOS. 2000. Target Evaluation/ Translational Failure for Aventis (Sanofi-­‐Aventis) [Online]. Foster City: Entelos. Available: http://www.entelos.com/casesMain.php?ID=cs04 [Accessed 23 August 2011]. ENTELOS. 2003. Novel Target Evaluation for Bayer [Online]. Foster City: Entelos, Inc. Available: http://www.entelos.com/casesMain.php?ID=cs09 [Accessed 14 August 2011]. ENTELOS. 2004. Target Evaluation/Prioritization for Organon (Merck) [Online]. Foster City: Entelos. Available: http://www.entelos.com/casesMain.php?ID=cs03 [Accessed 14 August 2011]. ENTELOS. 2005. Significant Savings with Clinical Trial Optimization for Johnson & Johnson Pharmaceutical Research and Development (J&JPRD) [Online]. Foster City: Entelos. Available: http://www.entelos.com/casesMain.php?ID=cs10 [Accessed 23 August 2011]. ENTELOS. 2005. Target Evaluation and Screening for Pfizer [Online]. Foster City: Entelos, Inc. Available: http://www.entelos.com/casesMain.php?ID=cs07 [Accessed 14 August 2011]. FERNANDES, T. G., DIOGO, M. M., CLARK, D. S., DORDICK, J. S. & CABRAL, J. M. S. 2009. High-­‐Throughput Cellular Microarray Platforms: Applications in Drug Discovery, Toxicology and Stem Cell Research. Trends In Biotechnology, 27, 342-­‐349. FIELDEN, M. R., EYNON, B. P., NATSOULIS, G., JARNAGIN, K., BANAS, D. & KOLAJA, K. L. 2005. A Gene Expression Signature that Predicts the Future Onset of Drug-­‐Induced Renal Tubular Toxicity. Toxicologic Pathology, 33, 675-­‐683. FINER-­‐MOORE, J. S., BLANEY, J. & STROUD, R. M. 2008. Facing the Wall in Computationally Based Approaches to Drug Discovery. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐Protein Interactions. Cambridge: The Royal Society of Chemistry. 92 References FRATTA, I. D., SIGG, E. B. & MAIORANA, K. 1965. Teratogenic Effects of Thalidomide in Rabbits, Rats, Hamsters, and Mice. Toxicology and Applied Pharmacology, 7, 268-­‐286. FRAWLEY, W. J., PIATETSKY-­‐SHAPIRO, G. & MATHEUS, C. J. 1992. 
Knowledge Discovery in Databases: An Overview AI Magazine, 13, 57-­‐70. GASSMANN, O., REEPMEYER, G. & ZEDTWITZ, M. 2008. The Science and Technology Challenge: How to Find New Drugs. Leading Pharmaceutical Innovation. Springer Berlin Heidelberg. GINSBURG, G. S., KONSTANCE, R. P., ALLSBROOK, J. S. & SCHULMAN, K. A. 2005. Implications of Pharmacogenomics for Drug Development and Clinical Practice. Archives of Internal Medicine, 165, 2331-­‐2336. GINSBURG, G. S., LEKSTROM-­‐HIMES, J. & TREPICCHIO, W. 2006. Optimizing Biomarker Development for Clinical Studies at the Lead Optimization Stage of Drug Development. In: BORCHARDT, R. T., KERNS, E. H., HAGEMAN, M. J., THAKKER, D. R. & STEVENS, J. L. (eds.). Optimizing the “Drug-­‐Like” Properties of Leads in Drug Discovery. New York: Springer GOODNOW, J. R. A. 2006. Hit and Lead Identification: Integrated Technology-­‐Based Approaches. Drug Discovery Today: Technologies, 3, 367-­‐375. GUL, S. 2011. Reducing Attrition in Drug Discovery: the Role of Biomarkers. European Pharmaceutical Review. Brasted: Russell Publishing Limited. HAND, D. J., MANNILA, H. & SMYTH, P. 2001. Principles of Data Mining, Boston, MIT Press. HARRIS, D., GASKIN, P., HENDERSON, W., TESSIER, Y., TEMPLETON, A., HARTNESS, M., CRAIGON, M., FREEMAN, T., FORSTER, T., RUBIN, G., IVENS, A. & GHAZAL, P. 2009. Quantitative assessment of whole blood RNA profiling as an early temporal marker of acetaminophen hepatotoxicity. SOT Annual Meeting and ToxExpo ™ March 15-­‐19, 2009. Baltimore. HAUPT, V. J. & SCHROEDER, M. 2011. Old Friends in New Guise: Repositioning of Known Drugs with Structural Bioinformatics. Briefings in Bioinformatics. HUANG, H.-­‐J., YU, H. W., CHEN, C.-­‐Y., HSU, C.-­‐H., CHEN, H.-­‐Y., LEE, K.-­‐J., TSAI, F.-­‐J. & CHEN, C. Y.-­‐C. 2010. Current Developments of Computer-­‐Aided Drug Design. Journal of the Taiwan Institute of Chemical Engineers, 41, 623-­‐635. IORIO, F., BOSOTTI, R., SCACHERI, E., BELCASTRO, V., MITHBAOKAR, P., FERRIERO, R., MURINO, L., TAGLIAFERRI, R., BRUNETTI-­‐PIERRI, N., ISACCHI, A. & DI BERNARDO, D. 2010. Discovery of Drug Mode of Action and Drug Repositioning from Transcriptional Responses. Proceedings of the National Academy of Sciences. 93 References KAITIN, K. I. 2010. Deconstructing the Drug Development Process: The New Face of Innovation. Clin Pharmacol Ther, 87, 356-­‐361. KAPETANOVIC, I. M. 2008. Computer-­‐Aided Drug Discovery and Development (CADDD): In Silico-­‐Chemico-­‐Biological Approach. Chemico-­‐Biological Interactions, 171, 165-­‐176. KIRK, R. J., HUNG, J. L., HORNER, S. R. & PEREZ, J. T. 2008. Implications of Pharmacogenomics for Drug Development. Experimental Biology and Medicine, 233, 1484-­‐1497. KNIME. 2011. KNIME Desktop [Online]. Zurich: KNIME. Available: http://www.knime.org/knime-­‐desktop [Accessed 23 August 2011]. KNOX, C., LAW, V., JEWISON, T., LIU, P., LY, S., FROLKIS, A., PON, A., BANCO, K., MAK, C., NEVEU, V., DJOUMBOU, Y., EISNER, R., GUO, A. C. & WISHART, D. S. 2011. DrugBank 3.0: a Comprehensive Resource for ‘Omics’ Research on Drugs. Nucleic Acids Research, 39, D1035-­‐D1041. KOOMEY, J. G. 2010. Outperforming Moore's Law. Spectrum, IEEE, 47, 68-­‐68. KREATSOULAS, C., DURHAM, S. K., CUSTER, L. L. & PEARL, G. M. 2006. Elementary Predictive Toxicology for Advanced Applications. In: BORCHARDT, R. T., KERNS, E. H., HAGEMAN, M. J., THAKKER, D. R. & STEVENS, J. L. (eds.). Optimizing the “Drug-­‐Like” Properties of Leads in Drug Discovery. New York: Springer KUBINYI, H. 2008. The Changing Landscape in Drug Discovery. 
In: STROUD, R. M. & FINER-­‐
MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐
Protein Interactions. Cambridge: The Royal Society of Chemistry. LIPINSKI, C. & HOPKINS, A. 2004. Navigating Chemical Space for Biology and Medicine. Nature, 432, 855-­‐861. LOUIE, A. S., BROWN, M. S. & KIM, A. 2007. Measuring the Return on Modeling and Simulation Tools in Pharmaceutical Development. IDC White Paper. Framingham: Health Industry Insights. LUSHER, S. J., MCGUIRE, R., AZEVEDO, R., BOITEN, J.-­‐W., VAN SCHAIK, R. C. & DE VLIEG, J. 2011. A Molecular Informatics View on Best Practice in Multi-­‐Parameter Compound Optimization. Drug Discovery Today, 16, 555-­‐568. MACARRON, R., BANKS, M. N., BOJANIC, D., BURNS, D. J., CIROVIC, D. A., GARYANTES, T., GREEN, D. V. S., HERTZBERG, R. P., JANZEN, W. P., PASLAY, J. W., SCHOPFER, U. & SITTAMPALAM, G. S. 2011. Impact of High-­‐Throughput Screening in Biomedical Research. Nat Rev Drug Discov, 10, 188-­‐195. MANDAL, S., MOUDGIL, M. N. & MANDAL, S. K. 2009. Rational Drug Design. European Journal of Pharmacology, 625, 90-­‐100. 94 References MATLAB. no date a. Roche Evaluates Drug Safety and Efficacy Using MathWorks Tools [Online]. Natick: Matlab. Available: http://www.mathworks.co.uk/company/user_stories/Roche-­‐Evaluates-­‐Drug-­‐Safety-­‐
and-­‐Efficacy-­‐Using-­‐MathWorks-­‐Tools.html [Accessed 23 August 2011]. MATLAB. no date b. Merrimack Pharmaceuticals Reduces Drug Discovery Time with MATLAB and SimBiology [Online]. Natick: Matlab. Available: http://www.mathworks.co.uk/company/user_stories/Merrimack-­‐Pharmaceuticals-­‐
Reduces-­‐Drug-­‐Discovery-­‐Time-­‐with-­‐MATLAB-­‐and-­‐SimBiology.html [Accessed 27 August 2011]. MERCHANT, M. 2009. The Global Bioinformatics Market. London: Datamonitor. MIZUSHIMA, T. 2011. Drug Discovery and Development Focusing on Existing Medicines: Drug Re-­‐Profiling Strategy. Journal of Biochemistry, 149, 499-­‐505. MOUSTAKAS, D. T. 2008. Application of Docking Methods to Structure-­‐Based Drug Design. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐Protein Interactions. Cambridge: The Royal Society of Chemistry. MULLARD, A. 2011. Richard Bergström. Nat Rev Drug Discov, 10, 408-­‐408. MUNOS, B. 2006. Can Open-­‐Source R&D Reinvigorate Drug Research? Nat Rev Drug Discov, 5, 723-­‐729. MUNOS, B. 2009. Lessons from 60 Years of Pharmaceutical Innovation. Nat Rev Drug Discov, 8, 959-­‐968. NATURE 2011. Learning Lessons from Pfizer's $800 Million Failure. Nat Rev Drug Discov, 10, 163-­‐164. NG, R. 2008. Drug Discovery: Small Molecule Drugs, In: NG, R. 2008. Drugs: From Discovery to Approval. 2nd ed. London: John Wiley & Sons. 53-­‐92 NOBLE, D. 2007. From the Hodgkin–Huxley Axon to the Virtual Heart. The Journal of Physiology, 580, 15-­‐22. OVERINGTON, J. P., AL-­‐LAZIKANI, B. & HOPKINS, A. L. 2006. How Many Drug Targets are There? Nat Rev Drug Discov, 5, 993-­‐996. PAUL, S. M., MYTELKA, D. S., DUNWIDDIE, C. T., PERSINGER, C. C., MUNOS, B. H., LINDBORG, S. R. & SCHACHT, A. L. 2010. How to Improve R&D Productivity: the Pharmaceutical Industry's Grand Challenge. Nat Rev Drug Discov, 9, 203-­‐214. 95 References PAYNE, D. J., GWYNN, M. N., HOLMES, D. J. & POMPLIANO, D. L. 2007. Drugs For Bad Bugs: Confronting the Challenges of Antibacterial Discovery. Nat Rev Drug Discov, 6, 29-­‐40. PETSKO, G. 2010. When Failure Should Be the Option. BMC Biology, 8, 61. PHOEBE CHEN, Y.-­‐P. & CHEN, F. 2008. Identifying Targets For Drug Discovery Using Bioinformatics. Expert Opinion on Therapeutic Targets, 12, 383-­‐389. PITT, W. R., PEREZ HIGUERUELO, A. & GROOM, C. R. 2009. Structural Bioinformatics in Drug Discovery. In: GU, J. & BOURNE, P. E. (eds.) Structural Bioinformatics. 2nd ed. Hoboken: John Wiley & Sons. PRICEWATERHOUSECOOPERS 2008. Pharma 2020: Virtual R&D -­‐ Which path will you take? In: PRICEWATERHOUSECOOPERS (ed.) Pharma 2020. New York City. RAJU, T. N. K. 2000. The Nobel Chronicles. The Lancet, 355, 1022-­‐1022. RANJAN, J. 2007. Applications of Data Mining Techniques in the Pharmaceutical Industry. Journal of Theoretical and Applied Information Technology, 3, 61-­‐67. RASK-­‐ANDERSEN, M., ALMÉN, M. S. & SCHIÖTH, H. B. 2011. Trends in the Exploitation of Novel Drug Targets. Nat Rev Drug Discov, 10, 579-­‐590. REED, J. Z., DAMIAN, D. & BRADLEY, P. 2010. The Power of Metadata. Cambridge: Biowisdom Ltd. REED, S. 2011. Is There an Astronomer in the House? Science, 331, 696-­‐697. ROBSON, B., LI, J., DETTINGER, R., PETERS, A. & BOYER, S. 2011. Drug Discovery Using Very Large Numbers of Patents. General Strategy with Extensive Use of Match and Edit Operations. Journal of Computer-­‐Aided Molecular Design, 25, 427-­‐441. SCANNELL, J., BLANCKLEY, A., REDENIUS, J. & BODELL CLIVE, L. 2010. The Long View: Pharma R&D Productivity -­‐ When the Cures Fail, It Makes Sense to Check the Diagnosis. In: RESEARCH, A. B. (ed.) Bernstein Research. London: AB Bernstein Research. SCHIFFER, C. A. 2008. Combating Drug Resistance -­‐ Identifying Resilient Molecular Targets and Robust Drugs. In: STROUD, R. M. & FINER-­‐MOORE, J. (eds.) 
Computational and Structural Approaches to Drug Discovery: Ligand-Protein Interactions. Cambridge: The Royal Society of Chemistry. SCHMIDT, B. 2006. Proof of Principle Studies. Epilepsy Research, 68, 48-52. SHODA, L., KREUWEL, H., GADKAR, K., ZHENG, Y., WHITING, C., ATKINSON, M., BLUESTONE, J., MATHIS, D., YOUNG, D. & RAMANUJAN, S. 2010. The Type 1 Diabetes Physiolab® Platform: A Validated Physiologically Based Mathematical Model of Pathogenesis in the Non-Obese Diabetic Mouse. Clinical & Experimental Immunology, 161, 250-267. SLENO, L. & EMILI, A. 2008. Proteomic Methods for Drug Target Discovery. Current Opinion in Chemical Biology, 12, 46-54. SMIT, E. & VAN DER GRAAF, M. 2011. Journal Article Mining: A research study into Practices, Policies, Plans…..and Promises. Amsterdam: Publishing Research Consortium. SONG, C. M., LIM, S. J. & TONG, J. C. 2009. Recent Advances in Computer-Aided Drug Design. Briefings in Bioinformatics, 10, 579-591. STEVENS, J. L. & BAKER, T. K. 2009. The Future of Drug Safety Testing: Expanding the View and Narrowing the Focus. Drug Discovery Today, 14, 162-167. STROUD, R. M. & FINER-MOORE, J. (eds.) 2008. Computational and Structural Approaches to Drug Discovery: Ligand-Protein Interactions, Cambridge: The Royal Society of Chemistry. SWINNEY, D. C. & ANTHONY, J. 2011. How Were New Medicines Discovered? Nat Rev Drug Discov, 10, 507-519. TALAGA, P. 2009. Open Innovation: Share or Die. Drug Discovery Today, 14, 1003-1005. TERSTAPPEN, G. C., SCHLUPEN, C., RAGGIASCHI, R. & GAVIRAGHI, G. 2007. Target Deconvolution Strategies in Drug Discovery. Nat Rev Drug Discov, 6, 891-903. THE ASSOCIATION OF THE BRITISH PHARMACEUTICAL INDUSTRY. 2011. The Development of New Medicines [Online]. London: The Association of the British Pharmaceutical Industry (ABPI). Available: http://www.abpi.org.uk/industry-info/knowledge-
hub/randd/Pages/new-­‐medicines.aspx [Accessed 23 August 2011]. TIWARI, A. & SEKHAR, A. K. T. 2007. Workflow Based Framework for Life Science Informatics. Computational Biology and Chemistry, 31, 305-­‐319. TSAIOUN, K., BOTTLAENDER, M., MABONDZO, A. & THE ALZHEIMER'S DRUG DISCOVERY, F. 2009. ADDME -­‐ Avoiding Drug Development Mistakes Early: Central Nervous System Drug Discovery Perspective. BMC Neurology, 9, S1. U.S. NATIONAL LIBRARY OF MEDICINE. 2010. PubMed®: MEDLINE® Retrieval on the World Wide Web Fact Sheet [Online]. Bethesda: U.S. National Library of Medicine. Available: http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 23 August 2011]. U.S. NATIONAL LIBRARY OF MEDICINE. 2011. MEDLINE Fact Sheet [Online]. Bethesda: U.S. National Library of Medicine. Available: http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 23 August 2011]. VAN DE WATERBEEMD, H. & GIFFORD, E. 2003. ADMET in Silico Modelling: Towards Prediction Paradise? Nat Rev Drug Discov, 2, 192-­‐204. 97 Appendix WALLER, C. L., SHAH, A. & NOLTE, M. 2007. Strategies to Support Drug Discovery Through Integration of Systems and Data. Drug Discovery Today, 12, 634-­‐639. WANG, R., LU, Y. & WANG, S. 2003. Comparative Evaluation of 11 Scoring Functions for Molecular Docking. Journal of Medicinal Chemistry, 46, 2287-­‐2303. WARREN, G. L., PEISHOFF, C. E. & HEAD, M. S. 2008. Docking Algorithms and Scoring Functions; State-­‐of-­‐the-­‐Art and Current Limitations. In: STROUD, R. M. & FINER-­‐
MOORE, J. (eds.) Computational and Structural Approaches to Drug Discovery: Ligand-­‐
Protein Interactions. Cambridge: The Royal Society of Chemistry. WATSON, D. E., RYAN, T. P. & STEVENS, J. L. 2006. Case History: Toxicology Biomarker Development Using Toxicogenomics. In: BORCHARDT, R. T., KERNS, E. H., HAGEMAN, M. J., THAKKER, D. R. & STEVENS, J. L. (eds.). Optimizing the “Drug-­‐Like” Properties of Leads in Drug Discovery. New York: Springer WISHART, D. S. 2005. Bioinformatics in Drug Development and Assessment. Drug Metabolism Reviews, 37, 279-­‐310. WISHART, D. S., KNOX, C., GUO, A. C., SHRIVASTAVA, S., HASSANALI, M., STOTHARD, P., CHANG, Z. & WOOLSEY, J. 2005. DrugBank: A Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucleic Acids Research, 34, D668-­‐D672. YANG, Y., ADELSTEIN, S. J. & KASSIS, A. I. 2009. Target Discovery from Data Mining Approaches. Drug Discovery Today, 14, 147-­‐154. YOUNG, D. C. 2009. Computational Drug Design: A guide for computational and medicinal chemists, Hoboken, John Wiley & Sons. ZIMMERMAN, Z., REEVE, B. & GOLDEN, J. B. 2004. The Return on Investment for Ingenuity Pathways Analysis within the Pharmaceutical Value Chain. In: IDC (ed.) White Paper. Framingham. ZOETE, V. 2011. Directory of Computer-­‐Aided Drug Design Tools [Online]. Lausanne: Swiss Institute of Bioinformatics. Available: www.click2drug.org [Accessed 10 August 2011]. Appendix
Appendix A Detailed Description of the R&D Workflow
The first phase is often target identification, in which molecular targets for the disease under investigation are explored. "Targets" can be of different natures, for example a receptor, a (mutated) gene, a protein-protein interaction, or a disease-specific mRNA or miRNA, as long as they are related to the disease. In the case of infectious diseases, drug targets are normally found in the infecting organism (bacteria, viruses, or parasites) rather than in human cells. After several potential targets have been found for the disease under investigation, tests are performed to reduce them to the one or few with the highest potential to induce the required modulation in the diseased cells. This step is called target validation. Once a target is found that is considered "druggable", i.e. amenable to modulation by molecules with drug-
like properties (Lipinski and Hopkins, 2004), efforts follow to find molecules that are capable of binding to the target and inducing the desired cellular changes. This process is known as "lead identification" and aims at finding chemical compounds that, when screened against the target, show some biological efficacy. In this context, target assays are first performed to find "hits". A "hit" is a primary, non-promiscuously binding, active compound that exceeds a given threshold in an adequate target assay. Hits are then filtered by pharmacological and biochemical screens to find those that show efficacy and selectivity in the context of drug development and are patentable. These form the group of "leads" (Bleicher et al., 2003). From this group of leads, normally one is picked for further development, and one or two other, structurally distinct leads are kept as back-up compounds in case the first choice fails later on. The lead structure, or group of structures, is then treated as the structural core for further, more focused medicinal chemistry efforts to identify a clinical drug candidate with higher biological activity (Moustakas, 2008) and drug-like pharmacokinetic properties such as absorption, aqueous solubility, permeability and bioavailability (Lipinski and Hopkins, 2004). Following that, scientists optimise the lead compound in a "lead optimisation" phase to create derivatives of the lead backbone that have favourable delivery and target properties (Moustakas, 2008). Once a molecule is discovered that has all the properties needed for use as a drug against a specific disease, further testing is required to ensure its efficacy in living organisms, and especially in humans. This phase of R&D is called preclinical and clinical development. In preclinical development the compound is tested in animals for efficacy, pharmacokinetic properties and toxicity and then, if successful, moved on to clinical development in humans, in which safety, efficacy and pharmacokinetics as well as dosing are analysed (PricewaterhouseCoopers, 2008). Clinical development is organised into three phases before the company can apply for marketing authorisation for the new drug; this process normally takes up to eight years (Kaitin, 2010). After initial safety testing in humans ("first in man"), a small Phase I study is performed in which pharmacokinetic parameters such as absorption, distribution, metabolism and excretion (ADME) and dosing are assessed and a "proof of principle" (POP) is established (Schmidt, 2006). After about two and a half years of clinical testing (during Phase II), confidence in the mechanism is established, and at the end of Phase II (after ca. 3.5 years) confidence in safety. Only if this is the case is the "proof of concept" (POC) milestone reached and the drug progressed to longer and larger Phase III studies for proof of efficacy and detection of rare adverse reactions before the Marketing Authorisation Application can be filed (PricewaterhouseCoopers, 2008). Today, regulatory agencies also require a post-marketing surveillance process (often termed Phase IV) to further evaluate the risks and benefits of the drug.
Appendix B Overview: Computer-Assisted Drug Design
Computers have found their way into most parts of daily life, and the drug development process is no exception.
With the advent of mass synthesis, high-throughput screening and the automation of wet-lab processes, robotics is nowadays commonly found in research laboratories (Sandy Primrose, personal communication). In addition, laboratory notebooks are now available in electronic form (Electronic Laboratory Notebooks, ELNs), with the hope of making the documentation of experiments easier and more reliable (William Hamilton, personal communication). This technology is, however, not yet widely accepted owing to a lack of security and protection against forgery, which creates potential problems in patent filing and intellectual property (IP) protection (Mairéad Duke, personal communication). Furthermore, computer-assisted drug design (CADD, also known as in silico design) is gaining momentum and comprises all steps in the design and optimisation process that use computers for data storage and organisation, process virtualisation such as compound screening, and predictive and modelling steps in downstream processes such as ADME and toxicity prediction. This development is fuelled by the ever-increasing amount of data acquired by single experiments and by the fact that drug design is a multidimensional task in which a compound must be found that combines sufficient activity, (oral) bioavailability, hardly any toxicity, patentability, a sufficiently long half-
life in the blood stream, and low manufacturing costs (Young, 2009, p. 4). At the moment, however, CADD is used only as a supporting technology for existing practices such as HTS (Lusher et al., 2011), although it has contributed significantly to the development of several drugs, including zanamivir, amprenavir, norfloxacin, losartan, and zolmitriptan (Clark, 2006, cited by Lusher et al., 2011).
Figure B.1. In silico drug discovery pipeline. Adapted from Zoete (2011).
The rest of this section covers the main parts of the CADD workflow as illustrated in Figure B.1. Data sources (i.e. databases) are used to store and organise the information needed in the drug design and development process. These data can originate from within the company through past experiments, come from commercial vendors, or be in the public domain. In the first part of the pipeline, databases with information about small molecules (all kinds of known chemical compounds such as drugs, enzymes, reactants, natural products, etc.) and genetic profiles (DNA sequences, gene expression profiles, mass spectrometry data, electrophoresis data, etc.) are the most important, but so are virtual libraries containing molecular fragments that are combined in multiple ways to create new molecules as screening compounds (Song et al., 2009). For screening purposes, the compound databases are often pre-filtered for molecules with drug-like features using criteria such as Lipinski's "Rule of Five"29 (Kapetanovic, 2008). Other databases become important further down the discovery process, where they are used to develop algorithms for virtual docking and for pharmacokinetics (ADME/toxicity) prediction (Kapetanovic, 2008). 29
An empirical rule stating that, to be orally bioavailable, compounds should have five or fewer hydrogen-bond donors, ten or fewer hydrogen-bond acceptors, a molecular weight of less than 500 Da, and a calculated logP (ClogP) of five or less (Young, 2009, p. 28).
If sufficient experimental data about the target structure are not available, molecular modelling of the binding site or of the entire protein can be used to predict the structure, provided that the primary sequence and data for homologous molecules exist (Young, 2009). This is possible because protein structure is known to be conserved better than sequence, so that even with less than 50% sequence identity a sufficient 3D model can be predicted (Song et al., 2009). Young (2009) discusses the required sequence identity further, illustrating that disagreement still exists about what percentage identity is needed, and warns that even in the case of high sequence identity roughly 1 in 10 homology models has a root mean square deviation of more than 5 Å. In such a case the model cannot be considered accurate enough for further use.
Virtual screening, lead design, and lead optimisation make up the core applications of CADD. In this step, the main goal is to find potential lead structures for a defined biological target, and computers are used either to produce virtual results or to inform the laboratory experiments. Two different approaches are commonly used: virtual screening (finding the best known compound by simulating screening experiments) and de novo lead design (creating a new active molecule from scratch). As can be seen in Figure B.1, a distinction is also made between structure-based and ligand-based approaches (Song et al., 2009). Structure-based methods use information about the structures of the biological target (often a receptor) and of the compound to test whether the compound fits into the binding site of the target. The compounds in the library are then ranked by predicted binding affinity to select the most promising ones (Young, 2009). Ligand-based methods, on the other hand, do not require the target structure but exploit the knowledge that molecules with similar structures often have similar properties. With this knowledge, the "largest common denominator" of the known ligands' structures (the "pharmacophore") is derived to model the steric and electronic features necessary for binding to the target and triggering the desired biological response (Kapetanovic, 2008). Virtual screening often uses a combination of structure- and ligand-based approaches to obtain optimal results (ibid.).
Molecule selection is finally needed to find the optimal molecule for further development and often uses methods such as machine learning and QSAR (quantitative structure-activity relationships) to predict the potential toxicity and pharmacokinetics of the lead (Song et al., 2009). Unfortunately, these methods are not yet very precise, and many leads therefore still fail because of toxicity, off-target effects or poor efficacy (Kapetanovic, 2008).
In today's drug design process the aid of computers is indispensable, and not only because of the huge amounts of data collected in high-throughput experiments: the chemical space of molecules that could potentially be drugs, i.e. organic molecules with a molecular weight of less than 2,000 Da, is today estimated to comprise more than 10^60 molecules (Song et al., 2009). It is therefore far too large to be screened exhaustively even with today's HTS technologies (Lipinski and Hopkins, 2004). Two brief sketches below illustrate how the drug-likeness pre-filtering of footnote 29 and the ligand-based similarity principle can be put into practice.
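To make the drug-likeness pre-filtering mentioned in footnote 29 more concrete, the following minimal sketch shows how a small compound list could be reduced to rule-of-five-compliant molecules. It assumes the open-source RDKit toolkit and a handful of arbitrary SMILES strings; it is an illustration of the general idea rather than the filtering software used by any of the companies or pipelines discussed in this dissertation.

```python
# Illustrative rule-of-five pre-filter (assumes the open-source RDKit toolkit).
# The SMILES strings below are arbitrary example molecules, not a real library.
from rdkit import Chem
from rdkit.Chem import Descriptors

library = [
    "CC(=O)Oc1ccccc1C(=O)O",       # an aspirin-like small molecule
    "CCCCCCCCCCCCCCCCCC(=O)O",     # a long-chain fatty acid with high logP
    "CCN(CC)CCNC(=O)c1ccc(N)cc1",  # a procainamide-like molecule
]

def passes_rule_of_five(mol):
    """True if all four criteria of footnote 29 are met: <=5 H-bond donors,
    <=10 H-bond acceptors, molecular weight < 500 Da, logP <= 5.
    RDKit's Wildman-Crippen MolLogP is used here as a stand-in for ClogP."""
    return (Descriptors.NumHDonors(mol) <= 5
            and Descriptors.NumHAcceptors(mol) <= 10
            and Descriptors.MolWt(mol) < 500
            and Descriptors.MolLogP(mol) <= 5)

for smiles in library:
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:  # skip entries that cannot be parsed
        verdict = "drug-like" if passes_rule_of_five(mol) else "filtered out"
        print(smiles, "->", verdict)
```

In a real pipeline a filter of this kind would typically be applied to millions of vendor or in-house compounds before any docking or assay work is attempted.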
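The ligand-based principle described above, namely that structurally similar molecules often behave similarly, can be sketched just as briefly. The example below ranks a toy library by Tanimoto similarity of Morgan fingerprints to one known active; it is a deliberately simplified stand-in for the pharmacophore and 3D-QSAR methods of Chapter 3, and the reference and library molecules are hypothetical illustrations only.

```python
# Illustrative ligand-based similarity screen (assumes RDKit).
# A known active is compared with a toy library; higher Tanimoto scores flag
# compounds that are more likely to share the active's properties.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan (circular) fingerprint of radius 2, folded to 2048 bits."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

known_active = fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # hypothetical reference ligand

toy_library = {
    "salicylic-acid-like": "OC(=O)c1ccccc1O",
    "benzoic-acid-like": "OC(=O)c1ccccc1",
    "hexane": "CCCCCC",
}

ranked = sorted(
    ((name, DataStructs.TanimotoSimilarity(known_active, fingerprint(smi)))
     for name, smi in toy_library.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: Tanimoto similarity {score:.2f}")
```

Pre-filters and similarity rankings of this kind are what make it feasible to explore even a small corner of the enormous chemical space described above.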
Utilising the increasing computing power may help to find new drugs in this space in a cost and time efficient way. 104 Appendix Appendix C Novel Development Model (Details) Figure C.1. Novel Informatics-­‐Centred Research and Development Model A new development model should include the insights gained in this dissertation as well as those from other analyses to improve the process at several of its bottlenecks, including 105 Appendix knowledge integration, team cooperation, and accuracy of available tools. Among these, I see especially human factors and team-­‐management issues that lead to unnecessary competitiveness among involved teams and companies instead of teamwork and conscientious and responsible use of the limited resources: time and money. 106 Appendix Appendix D Dr Gordon Baxter Company Position Qualification Company profile Type of Correspondence Date Topics covered BioWisdom (part of Instem plc) Harston Mill, Harston, Cambridge CB22 7GG United Kingdom http://www.biowisdom.com/ Chief Scientific Officer (CSO) at Instem PhD in Pharmacology Previous positions include: CEO: BioWisdom Chairman: ERBI Ltd Chief Operating Officer: Pharmagene plc Discovery Project Manager: SmithKline Beecham BioWidsom: Bioinformatics provider who is part of the Instem group since 03/2011. Market leader in delivering healthcare intelligence software to the pharmaceutical industry in the form of software tools for the acquisition, integration, visualisation and high-­‐value analysis of healthcare data (including semantic analysis) Instem: Software provider for late-­‐stage preclinical development Customers include all of the top pharmaceutical companies Personal Interview (Audio-­‐File of the interview on the CD) 10/06/2011 -­‐ Who got the idea for the company and how did it all start? -­‐ How many employees do you have? -­‐ Who are your customers? How do you work with them? -­‐ Who are your main competitors (direct and indirect)? Why? -­‐ Which parts of the drug R&D pipeline could be targeted with your product? -­‐ Is this also the most difficult part of the R&D process? -­‐ Which part in the R&D process could be improved best with computational methods? -­‐ Where are the major challenges in the R&D pipeline? -­‐ How does your technology improve productivity? How would you measure that? Do you have figures about that (Cost/Time savings, ROI etc.)? -­‐ Does Pharma tend to outsource more than before? -­‐ Why is the R&D productivity still decreasing (is it?) although computational methods are around for about 20 years now? -­‐ How do you think does the R&D process look like in 2020? -­‐ Open discussion 107 Appendix Appendix E Dr William Hamilton Company Position Qualification Company profile Type of Correspondence Date Topics covered Prosarix Ltd. Newton Hall, Town Street Newton Cambridge CB22 7ZE United Kingdom http://www.prosarix.com/ Chief Executive Officer (CEO) PhD in Molecular Biology from Imperial College, London Over 50 scientific publications and several patent applications Previous positions include: Chief Operating Officer: Amura Ltd; Chief Operating Officer: Proteom Ltd; Co-­‐founder and Research Director: Axis Genetics plc; Technical Director: Pestax Ltd.; Senior Manager: Agricultural Genetics Company Ltd. Non-­‐executive Director: InSecta Ltd. Consultancies with AstroMed Ltd./Milligen Ltd., BBSRC, EU, WHO, the Dow Chemical Co Privately-­‐owned biotechnology and bioinformatics company. 
Core Product: ProtoDiscovery™ a validated and state-­‐of-­‐the-­‐art computational platform with for the identification and optimisation of small molecules and proteins for single or multiple targets and “reprofiling”. The company provides services for more than 20 industrial companies, performs in-­‐house discovery and novel chiral reagent panels Personal Interview (Audio-­‐File of the interview on the CD) 10/06/2011 -­‐ Who got the idea for the company and how did it all start? -­‐ How many employees do you have? -­‐ Who are your customers? How do you work with them? -­‐ Who are your main competitors (direct and indirect)? Why? -­‐ Which parts of the drug R&D pipeline could be targeted with your product? -­‐ Is this also the most difficult part of the R&D process? -­‐ Which part in the R&D process could be improved best with computational methods? -­‐ Where are the major challenges in the R&D pipeline? -­‐ How does your technology improve productivity? How would you measure that? Do you have figures about that (Cost/Time savings, ROI etc.)? -­‐ Does Pharma tend to outsource more than before? -­‐ Why is the R&D productivity still decreasing (is it?) although computational methods are around for about 20 years now? -­‐ How do you think does the R&D process look like in 2020? -­‐ Open discussion 108 Appendix Appendix F Dr Gary Rubin Company Position Qualification Company profile Type of Correspondence Date Topics covered Fios Genomics Ltd. ETTC King’s Buildings Edinburgh EH9 3JL United Kingdom http://www.fiosgenomics.com Director of Operations PhD from University of Dundee Previous Positions include: Business and Operations Manager at Division of Pathway Medicine, University of Edinburgh Business Development Director -­‐ Europe at CyGenics UK Ltd Business Development Executive at Edinburgh Research & Innovation Investment Manager at Bio*One Capital Bioinformatics service provider. Services focus on in-­‐depth bioinformatics analyses of gene expression, microRNA, and SNP data, as well as genomic bioinformatics. Applications for these services can be found in predictive toxicology, biomarker identification, and pathway discovery (among others) Personal Interview (Skype) (Audio-­‐File of the interview on the CD) 20/06/2011 -­‐ Who got the idea for the company and how did it all start? -­‐ How many employees do you have? -­‐ Who are your customers? How do you work with them? -­‐ Who are your main competitors (direct and indirect)? Why? -­‐ Which parts of the drug R&D pipeline could be targeted with your product? -­‐ Is this also the most difficult part of the R&D process? -­‐ Which part in the R&D process could be improved best with computational methods? -­‐ Where are the major challenges in the R&D pipeline? -­‐ How does your technology improve productivity? How would you measure that? Do you have figures about that (Cost/Time savings, ROI etc.)? -­‐ Does Pharma tend to outsource more than before? -­‐ Why is the R&D productivity still decreasing (is it?) although computational methods are around for about 20 years now? -­‐ How do you think does the R&D process look like in 2020? -­‐ Open discussion 109 Appendix G Dr Matthew Segall Optibrium Ltd. 
7226 Cambridge Research Park Beach Drive Company Cambridge CB25 9TL UK http://www.optibrium.com Position Director and Chief Executive Officer (CEO) PhD in Physics from the University of Cambridge (UK) Senior Director ADMET: BioFocus DPI Head of Admensa Business Unit: Inpharmatica Ltd Qualification Associate Director: Inpharmatica Ltd Associate Director: ArQule Assistant Director, Quantum Simulation: Camitro Privately owned company providing software (licensed) to pharmaceutical and biotechnology companies for decision support Company profile based on ADME predictions and multi-­‐parameter optimisation of compounds Type of Personal Interview (Skype)( Audio-­‐File of the interview on the CD) Correspondence Date 14/06/2011 -­‐ Who got the idea for the company and how did it all start? -­‐ How many employees do you have? -­‐ Who are your customers? How do you work with them? -­‐ Who are your main competitors (direct and indirect)? Why? -­‐ Which parts of the drug R&D pipeline could be targeted with your product? -­‐ Is this also the most difficult part of the R&D process? -­‐ Which part in the R&D process could be improved best with computational methods? Topics covered -­‐ Where are the major challenges in the R&D pipeline? -­‐ How does your technology improve productivity? How would you measure that? Do you have figures about that (Cost/Time savings, ROI etc.)? -­‐ Does Pharma tend to outsource more than before? -­‐ Why is the R&D productivity still decreasing (is it?) although computational methods are around for about 20 years now? -­‐ How do you think does the R&D process look like in 2020? -­‐ Open discussion 110 Appendix Appendix H Other Contacts Name Position Dr Bernard Munos Chief Apostle, breakthrough innovation, InnoThink Center for Research in Biomedical Innovation Previously: multiple positions at Eli Lilly CEO at iRND3, Institute for Rare and Neglected Diseases drug discovery Distinguished Research Fellow at Lilly Research Laboratories Consultant at Jill Makin Consulting Dr David Swinney Dr James L Stevens Dr Jill Makin Dr Lloyd Czapewski Dr Mairéad Duke Dr Neil Porter Dr Nicolas Fechner Type of Communication eMail Further information eMail Appendix J eMail Appendix K Inteview -­‐ (13/06/2011) Founder at Chemical Lecture Biology Ventures, VP R&D at Biota Europe Ltd Interview Director at Épée Services Ltd (13/06/2011) Independent Biotechnology Professional, Lecturer at the University of Warwick Computational Drug Discovery at Eli Lilly and Company Appendix I Personal -­‐ -­‐ -­‐ communication, Lecture Personal -­‐ Communication 111 Appendix Name Position Dr Sanat K. Mandal Teacher and Researcher at Memorial University of Newfoundland and College of the North Atlantic Independent Biotechnology Professional, Lecturer at the University of Warwick Clinical Trials Unit Manager at Warwick Clinical Trials Unit Scientist, Entrepreneur, Lecturer at the University of Cambridge and University of Warwick Head, Applied Bioinformatics Group, University of Tübingen Research Fellow. SBDD, CADD, Drug Design, Drug Discovery, Cheminformatics, Molecular Modeling at Merck Sharp & Dohme Dr Sandy Primrose Dr Sarah Duggan Dr William Bains Prof Oliver Kohlbacher Scott Lusher Type of Communication eMail Further information Personal -­‐ Appendix L communication, Lecture Interview Audio-­‐File of the interview on the CD eMail, Lecture Appendix M Personal -­‐ communication eMail Appendix N 112 Appendix Appendix I Dr Bernard Munos Bernard Munos <[email protected]> 3. 
August 2011 22:26 An: "Scharfe, Charlotta" <[email protected]> Hello Charlotta, You are asking all the right questions. I think predictive biology (i.e., biological modeling) is the future of drug R&D, but is still in its infancy. At the moment, it does not really exist because of the huge knowledge gaps that remain in basic cell biology and pathology. 40% of the human genome, although it was cloned long ago, remains unknown. We don't know what it does, although some of it is clearly important. I don't think one can model a system is 40% if its make-­‐up is shrouded in mystery. The models coming out of system biology are based on what we know, but their predictions are frequently defeated by what we don't know. To put is differently, the cell contents are basically a soup of biomolecules connected by (mostly) reversible chemical reactions. If one attempts to intervene in a pathway by inhibiting an enzyme, for example, a cascade of adjustments ensues that tend to compensate for the changes in the concentration in some of the biomolecules that result from the intervention. In most cases, these adjustments will buffer and negate the intervention, or will produce unexpected side effects. So the result is no efficacy and/or toxicity. Only very rare cases will these adjustments amount to not much, and we'll have a potential drug candidate. This is why drug R&D faces such dismal odds. Network effects that take place outside the biological networks that we know can, and often do, lead to unpredictable outcomes. I think the models can perform much better, but before they do, one must eliminate the knowledge gaps, which is not a trivial challenge. Another challenge is the reductionist mindset of most scientists. This is not the way biology works. Biological networks are the fabric of life, and we need to train scientists to better understand this novel science. More generally, I don't think innovation can be scripted or mandated, or reduced to a rational set of rules. If it were possible, we could produce Nobel prize winners on demand! Research shows instead that most past biomedical breakthroughs came from engaging in high-­‐risk, unconventional science. Until we have enabled predictive biology, this will probably remain the most productive pathway for discovering breakthrough drugs. Hope this help. Good luck with your thesis, Bernard 113 Appendix On Wed, Aug 3, 2011 at 4:42 PM, Scharfe, Charlotta <[email protected]> wrote: Dear Mr. Munos, I am currently studying Biotechnology and Business Management at the University of Warwick (UK) and read your article about lessons from 60 years of pharmaceutical innovation (Nature Reviews Drug Discovery, 2009) with great interest. Since I am currently analysing the impact of computational methods on R&D productivity for my Master’s dissertation, I would be interested to know how much, in your opinion, productivity could be improved by using predictive and modelling tools in the R&D process. Which are the areas that are already using computers in an exhaustive way and in which are the parts (e.g. toxicology prediction) where most improvements could be achieved? Are computational tools a way to overcome the current productivity problem in the pharmaceutical industry and in what extent? 
Additionally, I would be interested to know if there is a barrier for scientists to be overcome before they start using predictive software tools (“being afraid of using the computer”) and if there is a chance to improve the relationship between chemists and computational chemists/external bio/chemoinformatics companies? In which area of computational chemistry do you see the main obstacles at the moment and what is currently working the best? I would appreciate your help to gain some insight into the current state-­‐of-­‐the-­‐art of drug design in the pharmaceutical industry. Thank you for your help! Yours sincerely, Charlotta Schärfe 114 Appendix Appendix J Dr David Swinney [email protected] <[email protected]> 6. August 2011 16:33 An: "Scharfe, Charlotta" <[email protected]> Dar Charlotta Thank you for your e-­‐mail and questions. It is good that you are thinking about this. These questions are difficult to answer. I think using computation tools effectively is important but there are limitations when trying to predict the complex dynamic nature of biological processes. In my mind computation tools have the greatest value to assist in thinking through complex situations to help design and evaluate the biological experiments. I do not see computation tools by themselves as a way to overcome productivity, biology will ultimately tell you what works and does not work. Computational tools may assist by helping to define the boundaries for biological testing. Best regards Dave > Dear Mr. Swinney, > > I am currently studying Biotechnology and Business Management at the > University of Warwick (UK) and read your article about how new medicines > were discovered (Nature Reviews Drug Discovery, 2011) with great interest. > > Since I am currently analysing the impact of computational methods on R&D > productivity for my Master’s dissertation, I would be interested to know > how much, in your opinion, productivity could be improved by using > predictive and modelling tools in the R&D process. Which are the areas > that are already using computers in an exhaustive way and in which are the > parts (e.g. toxicology prediction) where most improvements could be > achieved? Are computational tools a way to overcome the current > productivity problem in the pharmaceutical industry and in what extent? > Alternatively, do you see a problem with the use of rational drug design > and computer-­‐aided discovery and development in finding robust > first-­‐in-­‐class components? > In which area of computational chemistry and bioinformatics do you see the > main obstacles at the moment and what is currently working the best? > I would appreciate your help to gain some insight into the current > state-­‐of-­‐the-­‐art of drug design in the pharmaceutical industry. > 115 Appendix Appendix K Dr James L. Stevens James L Stevens <[email protected]> 4. August 2011 12:29 An: "Scharfe, Charlotta" <[email protected]> Cc: "[email protected]" [email protected] Charlotte -­‐ You have asked a series of important, if not simple questions. Let me try to answer some of them with the following. It is possible to create a computational model for any data set. However,whether or not that model is statistically valid and has the appropriate performance in predicting positive and negative outcome will be an open question. 
In addition, the threshold for false positive or negative calls is not a number that can be set a priori when building the model, because it will depend to a large extent on the tolerance for making a "good" or "bad" decision. The perception of the decision makes will also vary depending on the degree of sunk cost that will contribute to the "sunk cost bias" in the decision making. Take the following two question for example: Case 1: Drug company X wants to know if the safety signal for liver injury seen in the chronic preclinical toxicology studies is relevant to humans. There is little in the literature to suggest a clear mechanism for the toxicity, but there is genetic evidence that patients with a mutation in a particular gene related to (but not identical to) the target of the drug might be implicated in a disease state that produces some of the pathologies. A computation model based on high content gene expression shows that the pathway perturbed by the mutation is related to the therapeutic pathway suggesting that the biology underpinning the efficacy and the pathobiology underpinning the toxicity are related. However, the correlations are not strong and there are conflicting data in the literature. Nonetheless, the company pays a state-­‐of-­‐the-­‐art biotechnology company to construct a computational model of both normal and diseased human liver. The model predicts that the pathology is relevant to human. Yet the company knows that it can continue to develop the drug while monitoring patients for safety using a simple blood test. It is clear that the blood test will detect liver injury before it becomes severe, so patient safety can be protected using good screening in the clinical trial. The company also knows that if this toxicity is observed in people, even if they can detect it in advance and avoid serious injury by stopping the dosing, regulatory agencies will not approve the drug. Question: The computational model predicted the toxicity is relevant for human. Should the company do the clinical trial (decision point) or take the output from the model and avoid the additional cost of the trial while forfeiting the sunk cost? 116 Appendix Case 2: Drug Company X is investigating 5 chemical scaffolds each of which might be useful in identifying a drug candidate that will modulate a new enzyme identified as a disease target. The company has a set of computational models with 80% positive and negative predictivity for general toxicity. The model is not good at predicting which target organ will be hit, but is built to predict general toxicity that may involve any organ. The company has invested little so far in any of the 5 scaffolds, but 2 clearly score much better (i.e. few positive toxicity scores) that the other 3. However, the 2 'good' scaffolds present some chemical challenges and are not as easy for the medicinal chemists to handle in increasing the potency against the target. Question: The computation model predicts that 2 of the scaffolds are better than the other 2. There is little mechanistic information on why toxicity might occur. Toxicity is a major contributor to late phase drug failures and no one want to sink $350M getting to Phase III only to find an unacceptable toxicity regardless of the mechanism. Charlotte, the key to any computational model is what will be the decision framework and what value will the model add. I expect you know what choices you would make in the two cases outlined above. So, the answer is yes and no. 
Yes computational models can add a great deal of value in some decision points but little or no value in others. Thus, I recommend that you incorporate decision context into your evaluation. Perhaps there is a statistician in a decision sciences role in your university that may be helpful in this regard. I hope that helps to some degree. Apologies in advance for spelling and syntax errors as my old fingers and eyes can't keep up with the thoughts. Jim (James L Stevens) ************************************************************ James L. Stevens, Ph.D. Distinguished Research Fellow Lilly Research Laboratories P-­‐317-­‐276-­‐1070 F-­‐317-­‐277-­‐4436 ************************************************************ CONFIDENTIALITY NOTICE: This email from Eli Lilly and Company (including attachments) is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. 117 Appendix "Scharfe, Charlotta" <[email protected]> To "[email protected]" <[email protected]> 08/03/2011 04:46 cc "[email protected]" <[email protected]> Subject: Questions about your paper Dear Mr. Stevens and Mr. Baker, I am currently studying Biotechnology and Business Management at the University of Warwick (UK) and read your article about the future of drug safety testing (Drug Discovery Today, 2008) with great interest. Since I am currently analysing the impact of computational methods on R&D productivity for my Master’s dissertation, I would be interested to know how much, in your opinion, productivity could be improved by using predictive and modelling tools in the R&D process, especially for predicting adverse side effects in early clinical development or even before that. Which are the areas that are already using computers in an exhaustive way and in which are the parts where most improvements could be achieved? Are computational tools a way to overcome the current productivity problem in the pharmaceutical industry for example by reducing late stage attrition rates? Alternatively, do you see a problem with the use of rational drug design and computer-­‐aided discovery and development in finding robust first-­‐in-­‐class components? What are the current state-­‐of-­‐the-­‐art ways to determine toxicology of drug leads and is prediction done early on? Are the results of the computational predictions taken seriously? I would greatly appreciate your help to gain some insight into the current state-­‐of-­‐the-­‐art of toxicology evaluation in drug development. Thank you for your help! Yours sincerely, Charlotta Schärfe 118 Appendix Appendix L Dr Sanat K. Mandal Sanat Mandal <[email protected]> 24. August 2011 05:55 To: "Scharfe, Charlotta" <[email protected]> Dear Scharfe, Thank you for your interest in our article. Yes, applying our approach, we have identified a number of targets. Recently, we have submitted another article for publication describing an alternative approach of rational drug design. Please let me know if you need additional information. Thank you for your interest. 
Regards, Sanat -­‐-­‐-­‐ On Sun, 8/14/11, Scharfe, Charlotta <[email protected]> wrote: From: Scharfe, Charlotta <[email protected]> Subject: Question about your paper "Rational Drug Design" To: "[email protected]" <[email protected]> Cc: "[email protected]" <[email protected]> Received: Sunday, August 14, 2011, 8:17 AM Dear Mr. Mandal, I am a student at the University of Warwick (UK) and I am currently pursuing my Master's dissertation project about computational methods in drug design. With great interest I have read your review about rational drug design (European Journal of Pharmacology) in which you propose a strategy for screening a collection of targets against a known potent drug. In this paper you mention that you are using this strategy in your research for target discovery. Since the paper is already two years old I would like to know if your experience with this strategy has been positive. Have you found any new targets with this approach? Thank you for your help. Best regards, Charlotta Scharfe
119 Appendix Appendix M Dr William Bains Charlotta Schärfe <[email protected]> An: [email protected] 3. Juni 2011 11:10 Hi William, I am a student at the Warwick MSc in BBBM. Maybe Crawford already contacted you about my dissertation because it will be about computational methods (mainly computational biology, chemistry and bioinformatics) and how they impact the productivity of the pharmaceutical R&D process (you already sent me a really interesting report during the Bio-­‐Commercialisation module). Although you won't be back in the country until the 13th, I wanted to ask you if you know an interesting company in Cambridge that might be doing something related to that topic (i.e. a bioinformatics company, pharma collaborator or general pharma)? I am going to Cambridge on the June 10th to talk to a company called Prosarix and thought it might be nice speaking to other people as well since I'm already there then. Thank you for your help. Regards, Charlotta William Bains <[email protected]> Antwort an: [email protected] An: Charlotta Schärfe <[email protected]> 3. Juni 2011 13:56 Dear Charlotta, Well, any drug company will be using computational methods, any biology company will be using bioinformatics, and any software company will tell you that their software enhances productivity, reduces failures and so on. So it is hard to know where to start. I would have a chat with BioFocus, which is in Great Chesterford (fairly near Cambridge), Astex (which specialises in structure-­‐based computational work), Biowisdom (which does all sorts of informatics-­‐type stuff) or Optibrium (which does similar sorts of things to Prosarix). I know Biowisdom and Optibrium, used to know BioFocus, do not know Astex. I will drop Biowisdom and Optibrium a line and ask if they would be willing to meet. Give my best to Prosarix. I have known Bill and Jonathan for ages. best, William Charlotta Schärfe <[email protected]> An: [email protected] 5. Juni 2011 00:05 Dear William, many thanks for your quick reply and instant help! As you pointed out, finding a start 120 Appendix with the project is a little bit hard because everybody thinks they are the best and so on. Do you have an idea on how to analyse the effect on productivity in a more quantitative way? Is this possible at all without benchmarking experiments? Best, Charlotta William Bains <[email protected]> Antwort an: [email protected] An: Charlotta Schärfe <[email protected]> 6. Juni 2011 12:59 This is a problem that the software vendors and the drug companies themselves struggle with. How do you prove that your, minor contribution to a very long and complex process made any difference, especially as historical data is not much use because epople change many aspects of the process all at once, so you cannot attribute an imporvement to just one piece of technology. Even benchmarking experiments do not help. If they are at all realistic, then they have to be on ral drug projects, and the advocates for those projects will say that each is unique. The only realistic benchmark is to take the same drug/target/development candidate, convene two teams of identically qualified scientists to develop them, give one of them technology X, and see who wins by launching a drug in 15 years' time. Obviously, no-­‐one is going to do that. So people use surrogate measures -­‐ so many compounds screened, so many candidates put through ADME and so on. 
It is easier to demonstrate that a specific bit of technology improves these surrogate measures of productivity, but clearly this approach is not successful as the cost of developing a new drug has gone up inexorably over the last 50 years. (see attached for an objective view on this -­‐ I may have sent you this already). The only provable improvement is when technology X replaces technology Y, and is faster and/or cheaper. 'Better' assumes that whay technology Y did in the first place was useful. Thus HTS technology was 'better' than low throughput screens at screening, but it turned out not to be better at discovering drugs, because 'screening' is not that helpful. So I think successful software companies have asked what scientists want to do anyway, and then given them better tools for doing that. EG molecular modelling, searching databases and so on. This is a straigt X for Y replacement -­‐ Y is 'doing it all by hand', X is 'doing it with my software'. Ones that try to re-­‐engineer the whole process have traditionally failed miserably. One of the earliest 'bioinformatics' companies was DNA-­‐Star -­‐ they just did software to draw plasmids. They are still around. The ones with fancier products died. The problem at heart is that pharma is not very good at discovering drugs, and does not know why. 121 Appendix So I would turn the question around. I would ask the companies how they prove that their technology adds value. The only convincing answer, by the way, is 'People will pay a lot of money for it'. I am not sure this helps, but it is all I can think of. It is a tough question! William 122 Appendix Appendix N Scott Lusher SCOTT LUSHER <[email protected]> An: "Scharfe, Charlotta" <[email protected]> Cc: "[email protected]" <[email protected]> 5. August 2011 15:40 Hi Charlotta, You pose some very interesting questions. I will take some time over the next few days to answer them as best as I can. Regards, Scott Op 03/08/11, "Scharfe, Charlotta" <[email protected]> schreef: Dear Mr. Lusher, I am currently studying Biotechnology and Business Management at the University of Warwick (UK) and read your article about molecular informatics in compound optimisation (Drug Discovery Today, 2011) with great interest. Since I am currently analysing the impact of computational methods on R&D productivity for my Master’s dissertation, I would be interested to know how much, in your opinion, productivity could be improved by using predictive and modelling tools in the R&D process. Which are the areas that are already using computers in an exhaustive way and in which are the parts (e.g. toxicology prediction) where most improvements could be achieved? Are computational tools a way to overcome the current productivity problem in the pharmaceutical industry? Additionally, I would be interested to know if there is a barrier for scientists to be overcome before they start using predictive software tools (“being afraid of using the computer”) and if there is a chance to improve the relationship between chemists and computational chemists? In which area of computational chemistry do you see the main obstacles at the moment and what is currently working the best? I would appreciate your help to gain some insight into the current state-­‐of-­‐the-­‐art of drug design in the pharmaceutical industry. Thank you for your help! Yours sincerely, Charlotta Schärfe 5. 
August 2011 22:58 SCOTT LUSHER <[email protected]> An: "Scharfe, Charlotta" [email protected] Lusher’s answers are given in red in the original eMail text written by the author. Dear Mr. Lusher, 123 Appendix I am currently studying Biotechnology and Business Management at the University of Warwick (UK) and read your article about molecular informatics in compound optimisation (Drug Discovery Today, 2011) with great interest. Since I am currently analysing the impact of computational methods on R&D productivity for my Master’s dissertation, I would be interested to know how much, in your opinion, productivity could be improved by using predictive and modelling tools in the R&D process. Are computational tools a way to overcome the current productivity problem in the pharmaceutical industry? Given an estimate of 95% attrition in clinical development, just a 5% reduction in this failure would achieve a doubling of the number of compounds reaching the market. The point of this statement is that a relatively small improvement in the quality of the compounds we take into the clinic will result in a huge impact on the productivity of the industry. How to achieve that improvement? That list is endless… better validated targets and pathways, better design of clinical trials (both areas requiring improved informatics approaches) and much better translational models. The area I can speak with some knowledge about is the design of compounds themselves, so: We set criteria for the drugs that progress from research to development, and even if the compound only scrapes through we spend the time and money progressing it through development where it eventually dies. So why do our compounds only just scrape through? The DDT paper addresses that question, and my belief that we can do better if we make medchem an informatics driven discipline. Which are the areas that are already using computers in an exhaustive way and in which are the parts (e.g. toxicology prediction) where most improvements could be achieved? Im not sure our use of any technology is ever exhaustive as they always develop so quickly. We need to spend more effort in HTS triage, virtual screening and datamining. The reason I pick these is that the tools are good, but not integrated well in the system. Tox prediction and metabolite prediction are areas needing more focus, but at this point the tools are not extremely good. Additionally, I would be interested to know if there is a barrier for scientists to be overcome before they start using predictive software tools (“being afraid of using the computer”) and if there is a chance to improve the relationship between chemists and computational chemists? Trust in the tools is the biggest barrier. “Not invented here” issues are also a problem. As long as bench chemists and computational chemists believe they are working as a team with shared credit they can work well. In which area of computational chemistry do you see the main obstacles at the moment 124 Appendix and what is currently working the best? Our inability to accurately predict binding modes remains our biggest failing and affects our credibility. I would appreciate your help to gain some insight into the current state-­‐of-­‐the-­‐art of drug design in the pharmaceutical industry. Thank you for your help! Yours sincerely, Charlotta Schärfe Op 03/08/11, "Scharfe, Charlotta" <[email protected]> schreef: Dear Mr. 
Lusher, I am currently studying Biotechnology and Business Management at the University of Warwick (UK) and read your article about molecular informatics in compound optimisation (Drug Discovery Today, 2011) with great interest. Since I am currently analysing the impact of computational methods on R&D productivity for my Master’s dissertation, I would be interested to know how much, in your opinion, productivity could be improved by using predictive and modelling tools in the R&D process. Which are the areas that are already using computers in an exhaustive way and in which are the parts (e.g. toxicology prediction) where most improvements could be achieved? Are computational tools a way to overcome the current productivity problem in the pharmaceutical industry? Additionally, I would be interested to know if there is a barrier for scientists to be overcome before they start using predictive software tools (“being afraid of using the computer”) and if there is a chance to improve the relationship between chemists and computational chemists? In which area of computational chemistry do you see the main obstacles at the moment and what is currently working the best? I would appreciate your help to gain some insight into the current state-­‐of-­‐the-­‐art of drug design in the pharmaceutical industry. Thank you for your help! Yours sincerely, Charlotta Schärfe Charlotta Schärfe <[email protected]> An: SCOTT LUSHER <[email protected]> 6. August 2011 12:35 Hi Scott, just noticed your second mail. Thank you very much for the insights the mail gave me! Do you have an idea how trust in the computational tools could be improved or is it that trust will follow as soon as the perfomance is better? 125 Appendix Regards, Charlotta Am 6. August 2011 13:30 schrieb Charlotta Schärfe <[email protected]>: SCOTT LUSHER <[email protected]> An: Charlotta Schärfe <[email protected]> 7. August 2011 11:13 Increasing trust in predictive models requires effort from both sides. Chemists must be willing to use them and when they do, the results must be good enough to justify using them again. 126