Quantitative Structure-Transcription-Activity
Relationships (QSTAR)
Günter Klambauer
Institute of Bioinformatics
Johannes Kepler University, Linz, Austria
The QSTAR project
● Sub-projects for specific drug targets
  ● PDE10: phosphodiesterase inhibitors
  ● Macrocycles: EGFR inhibitors
  ● MTP: microsomal triglyceride transfer protein
  ● ROS1
  ● mGLU-R2 PAM: schizophrenia
  ● FGFR
● ~1600 microarrays, ~750 compounds profiled
● Analysis with machine learning methods
PDE10 inhibitors project
Macrocycles / EGFR inhibitors project
● Identified a fingerprint feature in the MCX/EGFR data set that explains inactivity
[Figure: chemical structures of compounds 62513 and 89668]
EGFR: FABIA for identification of compound-induced transcriptional modules
● Transcriptional module containing genes related to the MAPK/ERK pathway and cell cycle
● Transcriptional module containing mitochondrial genes → potential adverse side effect
MTP: microsomal triglyceride transfer protein
FABIA for identification of compound-induced transcriptional modules
● Module with genes encoding proteins of the SREBP pathway
[Figure panels: HepG2, LNCaP]
ROS1: Selection of scaffold with low promiscuity
mGLU-R2-PAM
Detection of transcriptional side-effect
● CHAC1 stress effect in the mGLU-R2-PAM project and in the Connectivity Map (Cmap)
FGFR: Biclustering for identification of compound-induced transcriptional modules
● Transcriptional module with genes encoding MAPK/ERK-inhibiting proteins
● One transcriptionally inactive compound
Methods
● Detection of differential expression, sparse signals
  ● FARMS, Laplace-FARMS (microarrays)
  ● DEXUS (RNA-Seq)
● Biclustering
  ● FABIA
● Molecule kernels, similarity measures for molecules
Gene expression data matrix
● Rows: genes, transcripts, probes
● Columns: samples
● Entry: expression value (intensity, read count)
● Symbol: X
FARMS & MetaFARMS: The model
● The gene expression data x of the genes in the module is explained by $x = l\,z + e$
● l measures how much a gene contributes to the module
● z measures how much the gene module is expressed in the sample
● e is independent noise
● Difference to FARMS: the covariance matrix is not corrected; negative loading values are allowed in metaFARMS
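The single-factor model on this slide can be illustrated with a small sketch. It fits x = l·z + e by plain maximum-likelihood EM, assuming z ~ N(0, 1) and Gaussian noise with one variance per gene. This is not the FARMS or metaFARMS implementation: their priors and covariance handling are left out, and the name `fit_one_factor` is made up for this example.

```python
def fit_one_factor(X, n_iter=100):
    """EM for x = l*z + e with a single factor z ~ N(0, 1).

    X is a list of gene rows, each a list of values over samples.
    Plain maximum likelihood; the priors used by FARMS are not included.
    """
    p, n = len(X), len(X[0])
    # Centre each gene so the factor explains variation around the gene mean.
    rows = []
    for row in X:
        mu = sum(row) / n
        rows.append([v - mu for v in row])
    var = [sum(v * v for v in row) / n for row in rows]
    l = [0.5 * s ** 0.5 for s in var]      # initial loadings
    psi = [max(s, 1e-6) for s in var]      # initial noise variances
    z = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior variance v and posterior means z[j] of the factor.
        w = [l[i] / psi[i] for i in range(p)]
        v = 1.0 / (1.0 + sum(l[i] * w[i] for i in range(p)))
        z = [v * sum(w[i] * rows[i][j] for i in range(p)) for j in range(n)]
        ezz = sum(zj * zj + v for zj in z)
        # M-step: update loadings, then per-gene noise variances.
        xz = [sum(rows[i][j] * z[j] for j in range(n)) for i in range(p)]
        l = [xz[i] / ezz for i in range(p)]
        psi = [max(var[i] - l[i] * xz[i] / n, 1e-6) for i in range(p)]
    return l, psi, z
```

The returned `z` holds one value per sample, which is the kind of per-sample module summary the slide describes; the overall sign of `l` and `z` is not identified by the model.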
Laplace FARMS for detection of sparse signals in gene expression data
● Extension of the FARMS algorithm with a Laplacian prior
● Variational approach and exact computation
● Detection of rare side-effects of compounds
DEXUS: Detection of differential expression in RNA-Seq data
● Statistical model for RNA-Seq read counts
● Detection of differential expression
● Negative binomial distribution for read counts
● Previous methods: supervised setting
  ● Case-control studies
  ● Multiple groups
  ● Replicates
● DEXUS: unsupervised setting
  ● All study designs
  ● Unknown groups
  ● No replicates
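The DEXUS Model
● Read counts x of a transcript follow a mixture of negative binomials: $p(x) = \sum_i a_i \, \mathrm{NB}(x;\, m_i, r_i)$
● $m_i$ is the mean of condition i
● $r_i$ controls the variance of condition i (small r → high variance)
● The probability of observing condition i is $a_i$ (mixture weights)
● MAP parameters are estimated using EM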
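A minimal sketch of the mixture density described on this slide, assuming the usual mean/size parametrization of the negative binomial (variance m + m²/r, which is why a small r gives a high variance). The function names are illustrative and not taken from the DEXUS package.

```python
import math

def nb_logpmf(x, m, r):
    """Log-probability of count x under a negative binomial with mean m and size r."""
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + m)) + x * math.log(m / (r + m)))

def nb_variance(m, r):
    """Variance of the negative binomial: small r means high variance."""
    return m + m * m / r

def mixture_pmf(x, a, m, r):
    """p(x) = sum_i a[i] * NB(x; m[i], r[i]) over the conditions i."""
    return sum(ai * math.exp(nb_logpmf(x, mi, ri))
               for ai, mi, ri in zip(a, m, r))

# Two conditions with means 20 and 100, observed with probability 0.7 and 0.3.
p_30 = mixture_pmf(30, a=[0.7, 0.3], m=[20.0, 100.0], r=[5.0, 5.0])
```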
Distribution of Read Counts Per Transcript
Mixture Model of Read Counts
Unknown Conditions
● Differential expression (conditions known): M-step
  ● more than one condition is present
  ● conditions are different with respect to mean read counts
● For unknown conditions: E-step
  ● must detect conditions first
  ● multiple conditions can only be detected if transcript is DE
→ Simultaneous detection of DE and conditions!
Prior distribution
● Prior on the size parameter: Exponential distribution
● Prior on the condition probabilities: Dirichlet distribution
MAP and ML Estimators for the Size Parameter
r drawn from a Gaussian (1,0.1); m=20; n=5; h=0.8.
Update rules derived from the EM algorithm
● E-step
● M-step
  ● m update
  ● r update
  ● a update
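The E-step/M-step structure can be sketched in a simplified form. The version below is an assumption-laden illustration, not the DEXUS update rules: the size parameter r is held fixed and shared by all conditions, and no exponential or Dirichlet prior is applied, so it yields maximum-likelihood rather than MAP estimates. With r fixed, the M-step for m reduces to a responsibility-weighted mean of the counts.

```python
import math

def nb_logpmf(x, m, r):
    """Log-probability of count x under a negative binomial with mean m and size r."""
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + m)) + x * math.log(m / (r + m)))

def em_nb_mixture(counts, m_init, r=5.0, n_iter=50):
    """Simplified EM: estimates weights a and means m; r stays fixed, no priors."""
    k = len(m_init)
    a = [1.0 / k] * k
    m = list(m_init)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each condition for each sample.
        resp = []
        for x in counts:
            logs = [math.log(a[i]) + nb_logpmf(x, m[i], r) for i in range(k)]
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: a update (mixture weights) and m update (weighted means).
        for i in range(k):
            ni = sum(row[i] for row in resp)
            a[i] = max(ni / len(counts), 1e-12)
            weighted = sum(row[i] * x for row, x in zip(resp, counts))
            m[i] = max(weighted / max(ni, 1e-12), 1e-6)
    return a, m, resp

# Read counts of one transcript over 12 samples; three samples are clearly higher.
counts = [15, 22, 18, 25, 20, 19, 210, 190, 205, 23, 17, 21]
a, m, resp = em_nb_mixture(counts, m_init=[10.0, 100.0])
```

In this toy run the responsibilities in `resp` play the role of the detected conditions, and the gap between the two estimated means is what would indicate differential expression.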
Determining Differential Expression in DEXUS
[Figure panels: strong signal, weak signal, weak signal]
Calling Differential Expression
● I/NI (informative/non-informative) call
  ● Evidence for multiple components (low FDR due to Dirichlet prior)
  ● Evidence for different means
Real world data: Pickrell et al.
● Genes on the Y chromosome
● Genes with eQTLs (known from other studies) with high MAF
● High ranks: genes on the X chromosome
● High ranks: genes with CNVs as eQTLs
Real world data: maize plant leaves
● Different locations of the maize leaf:
  ● -1 cm from base
  ● base
  ● +4 cm
  ● tip
  ● bundle sheath
  ● mesophyll
● Illumina Genome Analyzer II
● Mapped to ZmB73v2 with GSNAP, read counts per gene
FABIA: Factor analysis for bicluster acquisition
● Model: $X = L\,Z + U$
● L is the sparse loading matrix (Laplace distribution, mean zero, variance one)
● Z is the sparse factor matrix (Laplace distribution, mean zero, variance one)
● U is additive noise
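To make the generative model concrete, here is a toy simulation of X = L Z + U. It is only a sketch under stated assumptions: sparsity is imposed with exact zeros outside hand-picked gene and sample sets (FABIA itself obtains sparsity from the Laplace priors), the noise is Gaussian, and `simulate_fabia` is a hypothetical helper, not part of the FABIA package.

```python
import math
import random

def laplace(rng):
    """Zero-mean, unit-variance Laplace sample via the inverse CDF."""
    u = rng.random() - 0.5
    b = 1.0 / math.sqrt(2.0)               # scale b gives variance 2*b^2 = 1
    tail = max(1.0 - 2.0 * abs(u), 1e-300)  # guard against log(0)
    return -b * math.copysign(1.0, u) * math.log(tail)

def simulate_fabia(n_genes, n_samples, gene_sets, sample_sets, noise_sd=0.1, seed=0):
    """Return X, L, Z with one bicluster per (gene_set, sample_set) pair."""
    rng = random.Random(seed)
    k = len(gene_sets)
    L = [[0.0] * k for _ in range(n_genes)]    # genes x biclusters
    Z = [[0.0] * n_samples for _ in range(k)]  # biclusters x samples
    for c in range(k):
        for g in gene_sets[c]:
            L[g][c] = laplace(rng)
        for s in sample_sets[c]:
            Z[c][s] = laplace(rng)
    # X = L Z + U: sum of outer products plus additive noise.
    X = [[sum(L[g][c] * Z[c][s] for c in range(k)) + rng.gauss(0.0, noise_sd)
          for s in range(n_samples)]
         for g in range(n_genes)]
    return X, L, Z

# One bicluster: genes 0-2 are co-regulated in samples 0-1 only.
X, L, Z = simulate_fabia(6, 5, gene_sets=[[0, 1, 2]], sample_sets=[[0, 1]])
```

Each bicluster is the outer product of one column of L and one row of Z, so a gene belongs to it only if its loading is non-zero and a sample only if its factor value is non-zero.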
Biclustering of bioassay and fingerprint data
Goals
● FABIA using a sparse algebra for efficient biclustering of sparse data
● Reducing the dimensionality of the chemical and the bioassay data
● Finding “building blocks” in the chemical data
● Finding assays correlated on a subset of compounds
Sparse FABIA for biclustering of bioassay and fingerprint data: results
Rchemcpp: Molecular similarity by kernels
● Molecule kernels
  ● Measuring the similarity of molecules (compounds)
  ● Measure is based on the number of common substructures of two compounds
  ● Result is a similarity matrix or (positive semi-definite) kernel matrix
● Rchemcpp
  ● Implementation of various types of molecule kernels
  ● R package for easy handling of the data
Molecule kernels for structural analoging
● R implementation of different types of molecule kernels and visualization
  ● Walk-based kernels
  ● Tanimoto kernels
  ● MinMax kernels
  ● Pharmacophore kernels
● Similarity measures for
  ● Clustering
  ● Machine learning methods (e.g. SVMs)
● Dimensionality reduction of the chemical space
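Two of the kernel types listed above can be sketched on precomputed substructure features. This is a toy illustration in Python, not the Rchemcpp API: Rchemcpp derives its features from the molecular graphs, whereas here the feature names (such as "C-C") and the helper names are invented for the example.

```python
def tanimoto(a, b):
    """Tanimoto kernel on sets of substructure features: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def minmax(a, b):
    """MinMax kernel on feature-count dicts: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 1.0

def kernel_matrix(mols, kernel):
    """Pairwise similarity matrix for a list of molecules."""
    return [[kernel(x, y) for y in mols] for x in mols]

# Hypothetical feature sets for three compounds.
mols = [{"C-C", "C=O", "C-N"}, {"C-C", "C=O", "C-O"}, {"C-Cl"}]
K = kernel_matrix(mols, tanimoto)
print(K[0][1])  # 0.5: two shared features out of four distinct ones
```

The resulting matrix can be passed to any method that accepts a precomputed similarity or kernel matrix, for example clustering or an SVM.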
Rchemcpp: Molecular similarity by kernels
● Input is a set of molecules in SDF format
● Output is a numeric similarity matrix (Fig.)
● Reducing the dimensionality of the chemical space (clustering)
● Prediction of properties/activity
● Does not scale to databases like ChEMBL
Rchemcpp
● Molecule kernels available in R via Rchemcpp
  ● Easy handling of similarities of compounds
  ● Alternative approach to fingerprints
● Prediction of properties/activity
● Reduction of dimensionality by clustering of compounds
● Identification of “exemplars” for each cluster
● Dataset-wise approach
Molecule kernels for structural analoging
● Bioconductor package Rchemcpp
● Web service for finding structural analogs in ChEMBL
● Winning method of the NIEHS-NCATS-UNC Toxicogenetics Challenge
Summary
● Application of machine learning techniques in drug discovery
● Gene expression signatures of compounds
● Analysis of gene expression data
  ● Insights into target-related effects
  ● Insights into side-effects (off-target effects)
● All methods thoroughly compared against competing methods in separate publications
References
● Klambauer, G., Verbist, B., Vervoort, L., Talloen, W., QSTAR Consortium, Shkedy, Z., Thas, O., Bender, A., Göhlmann, H. W. H., & Hochreiter, S. (2015). Using transcriptomics to guide lead optimization in drug discovery projects. Drug Discovery Today, 20(5).
● Klambauer, G., Wischenbart, M., Mahr, M., Unterthiner, T., Mayr, A., & Hochreiter, S. (2015). Rchemcpp: a web service for structural analoging in ChEMBL, Drugbank and the Connectivity Map. Bioinformatics, 31(20), 3392-3394.
● Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., ... & Bijnens, L. (2010). FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12), 1520-1527.
● Hochreiter, S., Clevert, D. A., & Obermayer, K. (2006). A new summarization method for Affymetrix probe level data. Bioinformatics, 22(8), 943-949.
THANK YOU!
P-SVM for prediction of transcriptional response
● Prediction of transcriptional response by chemical features
● Obtaining information about structure-activity relationships
● Construction and verification of gene modules:
  ● Prediction of gene modules by gene expression features
  ● Prediction of primary assay by gene expression features
metaFARMS for summarizing gene modules
● Modification of the FARMS algorithm for summarization of gene modules
● Visualization of gene modules (“gene plots”)
● Expression value of a compound for a gene module
● Ranking compounds with respect to a gene module
● Reducing noise in the measurements
Logistic regression for target prediction
● Development of a new method
  ● Logistic regression
  ● Fast pipeline using fingerprint features and bioassay data
● Revealing the biomolecular target of a new compound
● Identifying the mechanism of action
● Repurposing known drugs for new indications
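As an illustration of target prediction from fingerprint features, the sketch below trains an L2-regularized logistic regression by batch gradient descent on binary fingerprints. It is a generic textbook version under assumed settings (learning rate, iteration count, penalty), not the pipeline developed in the project, and the toy fingerprints and labels are made up.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def predict_proba(x, w, b):
    """Probability that a compound with fingerprint x hits the target."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train_logreg(X, y, lr=0.5, n_iter=500, l2=0.01):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * (gw[j] / n + l2 * wj) for j, wj in enumerate(w)]
        b -= lr * gb / n
    return w, b

# Toy data: compounds are active on the target exactly when bit 0 is set.
X = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1],
     [0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

The learned weights show which fingerprint bits drive the prediction, which is one reason a linear model is attractive when the goal is to reason about the mechanism of action.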
Logistic regression for predicting compound targets
Method                AUC
Naive Bayes           0.661
Parzen-Rosenblatt     0.630
Logistic Regression   0.599
Scoring               0.663
SVM                   0.671
Deep Networks         ?
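For readers unfamiliar with the metric in this comparison, AUC can be computed as the fraction of (active, inactive) pairs that a method ranks correctly, with ties counted as one half. The sketch below uses made-up scores and labels and does not reproduce the evaluation behind the numbers above.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted scores for four compound-target pairs.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of actives above inactives.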