Bioinformatics from a drug discovery perspective
EMBRACE Workshop, 22-23 March 2007
Niclas Jareborg
AstraZeneca R&D Södertälje
AstraZeneca Drug Discovery
• Research Areas
  – CV/GI (Cardiovasc/Gastrointest), RIRA (Resp/Infl), CNS/Pain, Cancer, Infection
• Discovery Sites
  – UK: Charnwood (RIRA), Alderley Park (Cancer, CV/GI, RIRA)
  – North America: Boston (Cancer, Infection), Wilmington (CNS/Pain), Montreal (CNS/Pain)
  – Sweden: Lund (RIRA), Mölndal (CV/GI), Södertälje (CNS/Pain)
  – India: Bangalore (Infection)
• Bioinformatics
  – All RAs have their own bioinformatics teams
  – Infrastructure at Alderley Park (db:s, large Linux clusters)
    – IS organisation
A target is defined as…
• ... a biological target protein on which a chemical entity (e.g. a drug molecule) exerts its action
• A drug target must be associated with a disease
Drug discovery process
[Figure: pipeline from Genes through Target identification, Target validation, Hit identification (HTS), Hit to lead (Lead identification) and Lead optimisation to Candidate drug and Clinical trials. Other labels: Protein, Compound library, Assay, Hit, Effort.]
Target Definition
• Alternative Splicing
  • Identify pharmacologically relevant target variant(s)
• Sequence variation
  • Function
    – Target
    – Metabolizing enzyme
  • Binding of substance
• Identify most common variant
  – Might differ in different populations!
Target Definition
• Expression
  • Is the target expressed in a relevant human tissue?
  • Databases
    – Microarrays
    – Immunohistochemistry
    – In situ hybridization
    – Proteomics
  • Literature
Target Definition
• Selectivity
• How similar are related proteins?
• Do similar proteins have functions that we do not want to affect?
• Animal models
  • Orthologous genes
    – Same family size?
  • Splice variants
    – Same as in human?
  • Polymorphisms
    – Differences between inbred strains
  • Tissue expression
    – Overlap human?
  • Available transgenes or knock-outs
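The selectivity question on this slide ("How similar are related proteins?") is often approached by comparing sequences. The following is a minimal sketch, assuming the two protein sequences have already been aligned to equal length with "-" as the gap character. The function name `percent_identity` and the toy sequences are hypothetical and do not come from the talk.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned fragments (not real proteins).
target = "MKT-LLVAGS"
paralogue = "MKTALIVA-S"
print(f"{percent_identity(target, paralogue):.1f}% identity")
```

Other definitions of identity exist (for example, dividing by the full alignment length), so the denominator used here is one choice among several.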
Bioinformatics input to the drug discovery process
[Figure: Genetics & Bioinformatics input along the drug discovery timeline, with milestones MS1 to MS5.]
• Phases: Research, Development, Commercialisation
• Stages: Target Identification, Hit Identification, Lead Identification, Lead Optimisation, CD Prenomination, Development for Launch, Registration, Launch, Sales
• Bioinformatics input
  – Support target identification
  – Primary screening
  – Identify polymorphic and splice variants
  – Support choice of model organism(s)
  – Selectivity screening
  – Identify paralogues
  – Support biomarker identification
  – Flag up population variants in target
In-house generated gene-centric information resource
• Gene symbol
• Synonyms
• DNA and protein sequence
• Splice variants
• Genetic mutations
• Tissue expression
• Similarity to other species
• Functional motifs
• Pathways
• Literature
• Patents
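As an illustration of what a gene-centric record could look like in code, here is a minimal sketch. The field names mirror the items listed on the slide, but the class name `GeneRecord`, the field types and the example values are assumptions; the talk does not describe how the in-house resource is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """One gene-centric entry; field names follow the slide, types are assumed."""
    symbol: str
    synonyms: list[str] = field(default_factory=list)
    dna_sequence: str = ""
    protein_sequence: str = ""
    splice_variants: list[str] = field(default_factory=list)
    genetic_mutations: list[str] = field(default_factory=list)
    tissue_expression: dict[str, float] = field(default_factory=dict)  # tissue -> level
    orthologues: dict[str, str] = field(default_factory=dict)  # species -> gene id
    functional_motifs: list[str] = field(default_factory=list)
    pathways: list[str] = field(default_factory=list)
    literature: list[str] = field(default_factory=list)
    patents: list[str] = field(default_factory=list)

    def matches(self, name: str) -> bool:
        """True if `name` is the symbol or a synonym (case-insensitive)."""
        wanted = name.upper()
        return wanted == self.symbol.upper() or wanted in (s.upper() for s in self.synonyms)


# Hypothetical entry; "GENE1" is a placeholder, not a real gene symbol.
record = GeneRecord(symbol="GENE1", synonyms=["G1", "ExampleGene"])
record.tissue_expression["brain"] = 7.2
print(record.matches("examplegene"))
```

Keeping symbol and synonyms on the same record means a lookup by any known name resolves to one entry, which is the point of a gene-centric view.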
Target identification
Targets from different experimental approaches as well as validation using different technologies
[Figure: several sources feed a pool of Target Candidates, which then go through validation.]
• Sources of target candidates
  – ESTs, sequencing campaigns
  – Genetics/genome information
  – Proteomics
  – Differential biology
  – Literature
  – Micro arrays (Affymetrix, glass etc.)
  – In silico
• Validation (in silico, lab bench)
  – Validation as potential targets
  – Specificity / selectivity
Target identification
[Figure: funnel from ~30000 human genes down to 1 potential target, filtered by the questions: What? Link to disease? Where? Novel?]
The human genome offers many potential drug targets
Current Drug Targets - few target classes
Based on 483 drugs in Goodman and Gilman's "The Pharmacological Basis of Therapeutics"
• Receptors (GPCRs): 45%
• Enzymes: 28%
• Hormones & factors: 11%
• Unknown: 7%
• Ion channels: 5%
• Nuclear receptors: 2%
• DNA: 2%
Samuel Svensson, PhD, AstraZeneca R&D Södertälje
Number of druggable targets smaller than expected?
• ~30000 human genes
• Only a subfraction of gene products play a direct role in disease pathophysiology
• Druggable genome ~2,000-3,000 genes: 500 GPCRs, 50 NHRs, >200 ion channels, >1,000 enzymes (e.g. 450 proteases, 500 kinases, >200 others)
• Pathogens & commensal gut bacteria genes
• < 5,000 targets for small molecule drugs
• ~2,000-3,000 druggable targets
Updating the (shrinking?) “Targetome”
• Down to 22K? (see PMID: 15174140)
• Some of the 120 InterPro domains are unpromising; many potentials are still functional orphans; realistically nearer 2000?
• OMIM still only at 1900, and only low numbers of “robust” genetic association results
Current trends
• “Blue sky genomics” -> literature
• Finding “unknown” targets -> prioritizing the lists
• Moving from single target focus
  • Comparing and ranking of target candidates
    – Integration of relevant but disparate data sources
  • Better understanding of the target “neighbourhood”
    – Disease mechanism
    – Biomarkers
    – Toxicology
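Comparing and ranking target candidates across disparate data sources can be sketched as a weighted sum of per-source evidence scores. This is only one simple way to do it, not a description of the method used at AstraZeneca; the evidence categories, weights, target names and scores below are all hypothetical.

```python
def rank_targets(candidates, weights):
    """Return (name, score) pairs sorted by weighted evidence score, highest first.

    Evidence that is missing for a candidate counts as 0, so sparsely
    annotated candidates can still be ranked.
    """
    scored = []
    for name, evidence in candidates.items():
        score = sum(weights[k] * evidence.get(k, 0.0) for k in weights)
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical evidence scores in [0, 1], one per data source.
weights = {"disease_link": 0.5, "tissue_expression": 0.3, "druggability": 0.2}
candidates = {
    "TARGET_A": {"disease_link": 0.9, "tissue_expression": 0.4, "druggability": 0.8},
    "TARGET_B": {"disease_link": 0.6, "tissue_expression": 0.9},
    "TARGET_C": {"druggability": 1.0},
}
for name, score in rank_targets(candidates, weights):
    print(f"{name}: {score:.2f}")
```

Treating missing evidence as 0 penalises poorly studied targets, so a real prioritisation would probably want to distinguish "no data" from "negative data".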
Sources of Contextual Information
• Structured (Mature Technology)
  • Internal Chemical Dbs
  • Internal Biological Dbs
  • External, Commercial Dbs
    – GVK Bio, Ingenuity IPA…
  • External Public Dbs
    – EMBL, PDB, SNPdb, etc
• Unstructured (Emerging Technology)
  • Internal Docs:
    – Tox Reports, Clinical Trial Reports.
  • External Docs:
    – Patents; USPTO, WIPO, EP, etc
    – Literature; Medline, Embase
    – Press Releases: competitor, supplier, collaborator, academic (etc)
    – Government Agencies
    – Conference Proceedings
    – News Feeds
• Proportions shown on the slide: 80% and 20%
Current approach to retrieving information from unstructured sources is through manual extraction, i.e. finding documents and reading them!
Dissecting the Decision Making Process
Finding -> Extracting -> Integrating -> Creating
• Locating relevant documents and information
• Retrieving them in a useable format
• Reading information
• Locating the facts within documents
• Understanding what it means
• Putting the information into context
• Turning information into knowledge
• Developing new hypotheses
• Input into decision making
Issues with the Manual Approach
Finding -> Extracting -> Integrating -> Creating
• Difficult to capture breadth
• Chance to miss things
• “White space” in failing to find things
• Limited time to read things
• Focus on reviews and summaries
• Based on individual scientists' own knowledge
• Narrow
• Biased
• Hypotheses are “per project”
• Reactive not proactive
Text mining
• Sources
  • Literature
  • Patents
  • In-house reports
• Information
  • Protein-protein interactions
  • Tissue expression
  • Pharmacological differences
    – Splice variants, Polymorphisms
    – Species
  • Toxicology
  • etc
Emerging Systems: Text Mining
• Extraction of facts from unstructured data sources
• Natural Language Processing, Ontologies
• Linguamatics I2E
• Knowledgebase generation
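To make "extraction of facts from unstructured data sources" concrete, here is a minimal sketch of the simplest form of text mining: dictionary lookup plus sentence-level co-occurrence counting. It is not how Linguamatics I2E works and does not use its API; it stands in for the much richer NLP and ontology-based extraction the slide refers to. The dictionaries and sentences are hypothetical and assert no biological facts.

```python
import re
from collections import Counter
from itertools import product

# Tiny hypothetical dictionaries; a real system would use curated ontologies.
GENES = {"TNF", "BCL2", "CASP3"}
DISEASES = {"neoplasia", "hyperplasia"}


def cooccurrences(text):
    """Count gene/disease pairs that appear in the same sentence."""
    counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        genes = {t for t in tokens if t in GENES}
        diseases = {t.lower() for t in tokens if t.lower() in DISEASES}
        for pair in product(sorted(genes), sorted(diseases)):
            counts[pair] += 1
    return counts


# Made-up sentences for illustration only.
text = (
    "Sample A mentions BCL2 and neoplasia together. "
    "Sample B mentions TNF alone. "
    "Sample C mentions BCL2, CASP3 and neoplasia."
)
for (gene, disease), n in sorted(cooccurrences(text).items()):
    print(gene, disease, n)
```

Co-occurrence only says that two terms were mentioned together; labelling the relationship (activates, inhibits, binds) is where natural language processing comes in.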
Biomedical Entity-Relationship Data
[Figure: network of biomedical entities connected by labelled relationships.]
• Data types: Co-Published Information; Gene:Metabolite Semantic Relationships; Gene:Chemical/Drug Semantic Relationships; Gene:Gene Semantic Relationships; Gene:Disease Semantic Relationships
• Entities shown: Hyperplasia, Neoplasia, ADP-ribose, Thalidomide, BCL2, PARP, TNF, CASP9, MTPN, CASP3, CASP8
• Relationship labels: Increases, Synthesizes, Activates, Inactivates, Inhibits, Binds, Associated with, Inc Expression, Co-published
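Entity-relationship data of this kind is commonly held as (subject, relation, object) triples. The sketch below is a minimal in-memory version; the class name `TripleStore` and the entity names are hypothetical placeholders. Only the relation labels (Binds, Co-published, Activates) are taken from the slide, and which entities each relation connects in the original figure is not recoverable from this transcript, so no real edges are shown.

```python
from collections import defaultdict


class TripleStore:
    """Minimal in-memory store of (subject, relation, object) facts."""

    def __init__(self):
        self._by_subject = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, subject, relation, obj):
        """Record one fact and index it from both ends."""
        self._by_subject[subject].append((relation, obj))
        self._by_object[obj].append((relation, subject))

    def outgoing(self, subject):
        """Facts where `subject` is the source entity, in insertion order."""
        return list(self._by_subject.get(subject, []))

    def neighbours(self, entity):
        """All entities directly connected to `entity`, in either direction."""
        linked = {o for _, o in self._by_subject.get(entity, [])}
        linked |= {s for _, s in self._by_object.get(entity, [])}
        return linked


# Placeholder entities; the edges are invented purely to exercise the store.
store = TripleStore()
store.add("ENTITY_A", "Binds", "ENTITY_B")
store.add("ENTITY_A", "Co-published", "ENTITY_C")
store.add("ENTITY_D", "Activates", "ENTITY_A")
print(sorted(store.neighbours("ENTITY_A")))
```

Indexing each fact from both ends keeps neighbourhood queries cheap, which matters when the question is "what is known around this gene?".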
Pilot Systems:
Pathway Analysis: Ingenuity IPA
www.ingenuity.com
BER System in Action
Question: What is the underlying biology, pathology, physiology etc associated with this list of entities? What is it telling me?
[Figure: experimental data feed a list of significant entities, which is queried against the knowledgebase and the literature.]
• Input data: Gene Expression, Proteomic, Metabonomic, Genetic
• Significant Biological Entity List:
  – Gene List
  – Protein List
  – Metabolite List
• ER System (Gene/Metabolite Knowledgebase), Literature
• Output
  – Evidence Trail
  – Biological environment of the list
  – Canonical pathways associated with the list
  – Diseases, Biological processes associated with the list
  – Hypothesis Generation
Structuring the Knowledge
Delivers facts as networks of information: Knowledge Bases
GI Tox Knowledge Map
[Figure: network linking the node types below.]
• Node types
  – Species: Human, Rat, Dog, etc.
  – Clinical Observations: Diarrhoea, Vomiting, Loose Stools, Bloating, Nausea, etc.
  – Compound
  – Genes
  – Pathology: GI toxicity, GI pathology
  – Cellular Processes
• Relationship labels: Observed in, Affects, Linked with, Is a, Involved in
[Figure: knowledge base architecture diagram. Labels: CVGI TSR Interface; CIRA TSR Interface; Disease KB Interface; Direct Project Queries; Complex Data Query; DataMart (three instances); ETL (Biz rules, scoring); Disease/Target KB; Ontologies; Automated ETL engines; Genes; Expression; Targets; Chem; Focused NLP Extraction; Literature; Patent; CI; Extraction; Representation; Visualisation.]
Data source integration
Workflow technology
• Enables scientists to use, modify and implement solutions that specialist groups help them put in place; removes (in principle) the need to make extensive IS projects for new data types.
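The idea behind workflow technology, small reusable steps that scientists can rearrange without a new IS project, can be sketched with plain function composition. This is a toy illustration, not a representation of any particular workflow product; the step names and records are hypothetical.

```python
from functools import reduce


def make_workflow(*steps):
    """Compose steps left to right; each step takes and returns a list of records."""
    def run(records):
        return reduce(lambda data, step: step(data), steps, records)
    return run


# Hypothetical steps that could be swapped in or out.
def keep_expressed(records):
    """Drop records with no measured expression."""
    return [r for r in records if r["expression"] > 0]


def add_score(records):
    """Attach a simple made-up score to each record."""
    return [{**r, "score": r["expression"] * r["novelty"]} for r in records]


def top_first(records):
    """Sort records by score, highest first."""
    return sorted(records, key=lambda r: r["score"], reverse=True)


workflow = make_workflow(keep_expressed, add_score, top_first)
data = [
    {"gene": "G1", "expression": 2.0, "novelty": 0.5},
    {"gene": "G2", "expression": 0.0, "novelty": 0.9},
    {"gene": "G3", "expression": 1.0, "novelty": 3.0},
]
print([r["gene"] for r in workflow(data)])
```

Because every step has the same input and output shape, adding a new data type means writing one new step rather than changing the whole pipeline.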
The Knowledge Technology Ziggurat
[Figure: stepped pyramid in which each level builds on the one below, annotated with the stages of the Decision Making Process: Find, Extract, Integrate, Create. Level labels: Content Licensing & Access; Unstructured Information; Document Retrieval and Storage; Fact Extraction (Text Mining); Information Structuring; Developing semantic relationships; Knowledge Structuring; KNOWLEDGE BASES; Modelling; Systems biology. One region is marked "Current focus".]
“Bio” and “Chemo” Informatics Joins to Aid Target Selection
• Links to endogenous ligands & modulators
• Sequences
• Patented inhibitors
• Literature inhibitors and PDB ligands
• Expression data, gene structure, SNPs & splicing
• Families of known targets
• Structures
• HTS, focussed screens & project SAR data
• Sequence -> alignment -> structure -> hom. modelling
• Docking & virtual screening
• Cross-species (orthology) comparisons
• Fingerprint structure search
• Sequences -> gene names -> disease -> literature links
• Competitor compounds
• Functional genomics: mouse -> fish -> yeast
• Library and fragment data
• Linking non-homologs with analogous mechanisms and binding pockets
• AZ protein and ligand structures
Chemistry
What do we need to do?
[Figure: Hypothesis Generation Using Informatics/Modelling, linking Clinical Practice, Chemistry and Biology. Labels: Proteins; Term Association via Text Mining; Testicular Degeneration; Ligand-Protein Association via Experimental & Virtual Methods; Candidate Compound.]
A multidimensional jigsaw puzzle
• Target - Biological mechanisms - Disease
• Target/Off-target - Biological mechanisms - Toxicology
• Polymorphisms
• Splice variants
• Interaction partners
• Tissues
• Compounds
• Animal models
• etc etc etc…
Current needs
• Pathways / Systems biology
• Mining of unstructured data
• Connect biology and chemistry informatics domains
• System / data integration
  • Ontologies!
• Workflow technology
AZ - EBI
• AZ member of the Industry programme
  • Training and Education
  • Network meetings
  • Research, Standards