Bioinformatics from a drug discovery perspective
EMBRACE Workshop, 22-23 March 2007
Niclas Jareborg, AstraZeneca R&D Södertälje

AstraZeneca Drug Discovery
• Research Areas (RAs)
  – CV/GI (Cardiovascular/Gastrointestinal), RIRA (Respiratory/Inflammation), CNS/Pain, Cancer, Infection
• Discovery Sites
  – UK: Charnwood (RIRA), Alderley Park (Cancer, CV/GI, RIRA)
  – North America: Boston (Cancer, Infection), Wilmington (CNS/Pain), Montreal (CNS/Pain)
  – Sweden: Lund (RIRA), Mölndal (CV/GI), Södertälje (CNS/Pain)
  – India: Bangalore (Infection)
• Bioinformatics
  – All RAs have their own bioinformatics teams
  – Infrastructure at Alderley Park (databases, large Linux clusters)
    – IS organisation

A target is defined as…
• ... a biological target protein on which a chemical entity (e.g. a drug molecule) exerts its action
• A drug target must be associated with a disease

Drug discovery process
[Figure: Target identification -> Target validation -> Hit identification (HTS) -> Hit to lead (Lead identification) -> Lead optimisation -> Candidate drug -> Clinical trials. Other diagram labels: Genes, Protein, Assay, Compound library, Hit, Effort.]

Target Definition
• Alternative splicing
  – Identify pharmacologically relevant target variant(s)
• Sequence variation
  – Function
    – Target
    – Metabolizing enzyme
  – Binding of substance
  – Identify most common variant
    – Might differ in different populations!

Target Definition
• Expression
  – Is the target expressed in a relevant human tissue?
  – Databases
    – Microarrays
    – Immunohistochemistry
    – In situ hybridization
    – Proteomics
  – Literature

Target Definition
• Selectivity
  – How similar are related proteins?
  – Do similar proteins have functions that we do not want to affect?
• Animal models
  – Orthologous genes: same family size?
  – Splice variants: same as in human?
  – Polymorphisms: differences between inbred strains
  – Tissue expression: overlap with human?
  – Available transgenes or knock-outs

Genetics & Bioinformatics: bioinformatics input to the drug discovery process
[Figure: process timeline with milestones MS1-MS5.
Phases: Research -> Development -> Commercialisation.
Stages: Target Identification, Hit Identification, Lead Identification, Lead Optimisation, CD Prenomination, Development for Launch, Registration, Launch, Sales.
Screening labels: Primary screening, Selectivity screening.]
• Support target identification
• Identify polymorphic and splice variants
• Support choice of model organism(s)
• Identify paralogues
• Support biomarker identification
• Flag up population variants in target

In-house generated gene centric information resource
• Gene symbol, synonyms
• DNA and protein sequence
• Splice variants
• Genetic mutations
• Tissue expression
• Similarity to other species
• Functional motifs
• Pathways
• Literature
• Patents

Target identification
Targets from different experimental approaches as well as validation using different technologies
• Sources of target candidates
  – EST sequencing campaigns
  – Genetics/genome information
  – Proteomics
  – Differential biology
  – Literature
  – Microarrays (Affymetrix, glass etc.)
  – In silico
• Validation as potential targets (in silico, lab bench)
• Specificity / selectivity

Target identification
[Figure: funnel from ~30,000 human genes to 1 potential target. Filter questions: What? Link to disease? Where? Novel?]

The human genome offers many potential drug targets

Current Drug Targets: few target classes
Based on 483 drugs in Goodman and Gilman's "The Pharmacological Basis of Therapeutics"
• Receptors (GPCRs): 45%
• Enzymes: 28%
• Hormones & factors: 11%
• Unknown: 7%
• Ion channels: 5%
• Nuclear receptors: 2%
• DNA: 2%
(Samuel Svensson, PhD, AstraZeneca R&D Södertälje)

Number of druggable targets smaller than expected?
• ~30,000 human genes
• Only a subfraction of gene products play a direct role in disease pathophysiology
• Druggable genome: ~2,000-3,000 genes; 500 GPCRs, 50 NHRs, >200 ion channels, >1,000 enzymes (e.g. 450 proteases, 500 kinases, >200 others)
• Pathogen and commensal gut bacteria genes
• <5,000 targets for small molecule drugs
• ~2,000-3,000 druggable targets

Updating the (shrinking?) “Targetome”
• Down to 22K? (see PMID: 15174140)
• Some of the 120 InterPro domains are unpromising; many potentials are still functional orphans; realistically nearer 2,000?
• OMIM still only at 1,900, and only low numbers of “robust” genetic association results

Current trends
• “Blue sky genomics” -> literature
• Finding “unknown” targets -> prioritizing the lists
• Moving from single target focus
• Comparing and ranking of target candidates
  – Integration of relevant but disparate data sources
• Better understanding of the target “neighbourhood”
  – Disease mechanism
  – Biomarkers
  – Toxicology

Sources of Contextual Information
• Structured
  – Internal Chemical Dbs
  – Internal Biological Dbs
  – External, Commercial Dbs: GVK Bio, Ingenuity IPA…
  – External Public Dbs: EMBL, PDB, SNPdb, etc.
• Unstructured
  – Internal Docs: Tox Reports, Clinical Trial Reports
  – External Docs:
    – Patents: USPTO, WIPO, EP, etc.
    – Literature: Medline, Embase
    – Press Releases: competitor, supplier, collaborator, academic (etc.)
    – Government Agencies
    – Conference Proceedings
    – News Feeds
[Slide labels: 80% / 20%; Mature Technology; Emerging Technology]
The current approach to retrieving information from unstructured sources is manual extraction, i.e. finding documents and reading them!

Dissecting the Decision Making Process
Finding -> Extracting -> Integrating -> Creating
• Locating relevant documents and information
• Retrieving them in a usable format
• Reading information
• Locating the facts within documents
• Understanding what it means
• Putting the information into context
• Turning information into knowledge
• Developing new hypotheses
• Input into decision making

Issues with the Manual Approach
Finding -> Extracting -> Integrating -> Creating
• Difficult to capture breadth
• Chance to miss things
• “White space” in failing to find things
• Limited time to read things
• Focus on reviews and summaries
• Based on individual scientists' own knowledge
• Narrow
• Biased
• Hypotheses are “per project”
• Reactive, not proactive

Text mining
• Sources
  – Literature
  – Patents
  – In-house reports
• Information
  – Protein-protein interactions
  – Tissue expression
  – Pharmacological differences
    – Splice variants, polymorphisms
    – Species
  – Toxicology
  – etc.

Emerging Systems: Text Mining
• Extraction of facts from unstructured data sources
• Natural Language Processing, Ontologies
• Linguamatics I2E
• Knowledgebase generation

Biomedical Entity-Relationship Data
[Figure: network of biomedical entities joined by labelled relationships.
Relationship classes: Gene:Gene, Gene:Disease, Gene:Metabolite and Gene:Chemical/Drug semantic relationships, plus co-published information.
Entities shown: BCL2, PARP, TNF, CASP3, CASP8, CASP9, MTPN, Thalidomide, ADP-ribose, Hyperplasia, Neoplasia.
Edge labels: Increases, Synthesizes, Activates, Inactivates, Inhibits, Binds, Associated with, Inc Expression, Co-published.]

Pilot Systems: Pathway Analysis
• Ingenuity IPA, www.ingenuity.com

BER System in Action
[Figure: data flow through the ER System (Gene/Metabolite Knowledgebase).
Data types: Gene Expression, Proteomic, Metabonomic, Genetic.
Significant biological entity list: Gene list, Protein list, Metabolite list.
Results: biological environment of the list; canonical pathways associated with the list; diseases and biological processes associated with the list.
Other labels: Evidence Trail, Literature, Hypothesis Generation.]
Question: What is the underlying biology, pathology, physiology etc. associated with this list of entities? What is it telling me?

Structuring the Knowledge
Delivers facts as networks of information: Knowledge Bases
[Figure: GI Tox Knowledge Map.
Nodes: Species (Human, Rat, Dog, etc.); Clinical Observations (Diarrhoea, Vomiting, Loose Stools, Bloating, Nausea, etc.); Compound; Genes; Pathology (GI toxicity, GI pathology); Cellular Processes.
Edge labels: Observed in, Affects, Linked with, Is a, Involved in.]

[Figure: knowledge base architecture.
Layer labels: Extraction, Data source integration, Representation, Visualisation.
Sources: Literature, Patent, CI (focused NLP extraction); Genes, Expression, Targets, Chem (automated ETL engines); Ontologies.
Stores: Disease/Target KB, DataMarts (ETL: business rules, scoring).
Access: CVGI TSR Interface, CIRA TSR Interface, Disease KB Interface, Complex Data Query, Direct Project Queries.]

Workflow technology
• Enables scientists to use, modify and implement solutions that specialist groups help them put in place; removes (in principle) the need for extensive IS projects for new data types.
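
As a rough illustration of the "facts as networks of information" idea from the Structuring the Knowledge slide, the following minimal Python sketch stores facts as (subject, relation, object) triples and walks them to recover an evidence trail between two entities. It is not the system shown in the deck: the entity names (compound_X, gene_A, gene_B, cell_process_P), the specific facts, and the function names are all hypothetical and chosen only for illustration. Only the relation labels echo those visible on the slide.

```python
from collections import defaultdict

# Hypothetical facts, invented for illustration only (not real biology,
# and not data from the presentation).
TRIPLES = [
    ("compound_X", "affects", "gene_A"),
    ("gene_A", "involved_in", "cell_process_P"),
    ("cell_process_P", "linked_with", "GI toxicity"),
    ("compound_X", "observed_in", "rat"),
    ("gene_B", "linked_with", "GI toxicity"),
]


def build_graph(triples):
    """Index triples as subject -> list of (relation, object)."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def evidence_trails(graph, start, goal, path=None, seen=None):
    """Return every acyclic chain of facts from start to goal (depth-first)."""
    path = path or []
    seen = (seen or set()) | {start}  # new set per call, so branches stay independent
    if start == goal:
        return [path]
    trails = []
    for rel, obj in graph.get(start, []):
        if obj not in seen:
            step = (start, rel, obj)
            trails.extend(evidence_trails(graph, obj, goal, path + [step], seen))
    return trails


graph = build_graph(TRIPLES)
trails = evidence_trails(graph, "compound_X", "GI toxicity")
for trail in trails:
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in trail))
```

Keeping each step as a full triple means the returned path doubles as the evidence trail: every hop names the fact that justifies it, which is the property the slides emphasise. A production knowledge base would also attach the source document to each fact.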
The Knowledge Technology Ziggurat
[Figure: stacked layers, each of which "builds on" the one below.
Decision making process stages: Find, Extract, Integrate, Create.
Layer labels: Content Licensing & Access; Unstructured Information; Document Retrieval and Storage; Fact Extraction (Text Mining); Developing semantic relationships; Information Structuring; Knowledge Structuring; KNOWLEDGE BASES; Modelling; Systems biology.
Marker: Current focus.]

“Bio” and “Chemo” Informatics Joins to Aid Target Selection
• Links to endogenous ligands & modulators
• Sequences
• Patented inhibitors
• Literature inhibitors and PDB ligands
• Expression data, gene structure, SNPs & splicing
• Families of known targets
• Structures
• HTS, focused screens & project SAR data
• Sequence alignment, structure homology modelling
• Docking & virtual screening
• Cross-species (orthology) comparisons
• Fingerprint structure search
• Sequences, gene names, disease literature links
• Competitor compounds
• Functional genomics: mouse, fish, yeast
• Library and fragment data
• Linking non-homologs with analogous mechanisms and binding pockets
• AZ protein and ligand structures
• Chemistry

What do we need to do?
[Figure: hypothesis generation using informatics/modelling, linking Clinical Practice, Chemistry and Biology.
Labels: Proteins; Term Association via Text Mining; Testicular Degeneration; Ligand-Protein Association via Experimental & Virtual Methods; Candidate Compound.]

A multidimensional jigsaw puzzle
• Target - Biological mechanisms - Disease
• Target/Off-target - Biological mechanisms - Toxicology
• Polymorphisms
• Splice variants
• Interaction partners
• Tissues
• Compounds
• Animal models
• etc. etc. etc.…

Current needs
• Pathways / Systems biology
• Mining of unstructured data
• Connect biology and chemistry informatics domains
• System / data integration
• Ontologies!
• Workflow technology

AZ - EBI
• AZ member of the Industry programme
• Training and Education
• Network meetings
• Research, Standards