Probabilistic Models in Bioinformatics
Sepp Hochreiter
Institute of Bioinformatics
Johannes Kepler University, Linz, Austria
Outline
FARMS: Microarray Analysis
FABIA: Biclustering
cn.MOPS: Copy Number Estimation in NGS Data
DEXUS: Differential Expression in NGS Data
IFAS, Linz, 03.04.2014
Sepp Hochreiter
Statistics / Bioinformatics
Hardy’s 1908 paper:
"To the Editor of Science: I am reluctant to intrude in a discussion concerning matters of which I have no expert knowledge, and I should have expected the very simple point which I wish to make to have been familiar to biologists. However, some remarks of Mr. Udny Yule, to which Mr. R. C. Punnett has called my attention, suggest that it may still be worth making..."
Hardy–Weinberg principle: both allele and genotype frequencies in a
population remain constant from generation to generation
Example: mice
Genotype to phenotype:
AA, Aa → white coat
aa → gray coat
Allele frequency:
A’s: 80%
a’s: 20%
Random mating → genotype frequencies (aA = Aa):
64% AA homozygous (0.8 * 0.8 = 0.64)
32% Aa heterozygous (2 * 0.8 * 0.2 = 0.32)
4% aa homozygous (0.2 * 0.2 = 0.04)
Phenotype frequency:
96% white coats
4% gray coats
Will gray-coated mice disappear? No!
Hardy–Weinberg principle: both allele and genotype frequencies in a population remain constant from generation to generation
Allele frequency in the next generation:
A’s: 80% (0.64 + 0.5 * 0.32)
a’s: 20% (0.04 + 0.5 * 0.32)
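The arithmetic of the mice example can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes random mating with no selection, mutation, or migration.

```python
def genotype_frequencies(p_A):
    """Genotype frequencies after random mating, given the frequency of allele A."""
    q_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2.0 * p_A * q_a, "aa": q_a * q_a}


def allele_frequencies(genotypes):
    """Allele frequencies carried by a genotype distribution."""
    return {
        "A": genotypes["AA"] + 0.5 * genotypes["Aa"],
        "a": genotypes["aa"] + 0.5 * genotypes["Aa"],
    }


p_A = 0.8  # allele frequency from the mice example
for generation in range(3):
    genotypes = genotype_frequencies(p_A)
    alleles = allele_frequencies(genotypes)
    print(generation, genotypes, alleles)
    p_A = alleles["A"]  # allele frequency passed on to the next generation
```

Each pass through the loop reproduces the same genotype and allele frequencies (up to floating-point rounding), which is why the gray coat does not disappear.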
Microarray Analysis
Beerse,
Belgium
Berlin,
Germany
Group:
Dept. Psychiatry and
Psychotherapy
Prof. Andreas Heinz
Group:
Dept. Nephrology and
Internal Intensive Care
Prof. Petra Reinke
Gene Network Science
“GNS is collaborating with researchers at Johannes Kepler University,
Linz, by way of genomics data processing algorithms developed by the
researchers and licensed by GNS.”
Affymetrix fluidics station (wash / stain)
Affymetrix scanner
mRNA reference sequence (5' to 3'), covered by a probeset of several probes
Perfect match (PM) probe:
…TGTGATGGTGGGAATGGGTCAGAAGGACTCCTATGTGGGTGACGAGGCC…
             |||||||||||||||||||||||||
             TTACCCAGTCTTCCTGAGGATACAC
Fluorescence intensity image
FARMS (Factor Analysis for Robust Microarray Summarization)
[Figure: a single factor $z$ generates the observations $x_1, \dots, x_9$ through the loading matrix entries $\lambda_1, \dots, \lambda_9$; each observation has additive noise $\epsilon_1, \dots, \epsilon_9$]
$z$ = variation in mRNA concentration
$\lambda_i$ = sensitivity of log-PM$_i$
$\epsilon_i$ = measurement noise for log-PM$_i$
$x_i$ = observed log-PM$_i$
Model assumption: measurement $x$ is Gaussian
• Naef et al. 2002: difference of replicates is Gaussian → assumption follows if noise is symmetric
• Freudenberg 2004 showed it for log2-PM
• On real-world data we did a Shapiro-Wilk test for normality (continuous curve is log2-transformed)
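Generative model: $x = \lambda z + \epsilon$
• $z$: factor, $z \sim N(0, 1)$
• $\epsilon$: noise, $\epsilon \sim N(0, \Psi)$, where $\Psi$ is diagonal
• $\lambda$: loading vector
• $x$: data, $x \sim N(0, \lambda \lambda^T + \Psi)$
• $\Psi$, $\lambda$ → EM algorithm
Covariance of $x$: $\lambda \lambda^T + \Psi$. Because $\Psi$ is diagonal, correlations between probes can only be explained by the hidden factor.
FARMS: maximum a posteriori (MAP) estimation, where the posterior given the data is proportional to likelihood times prior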
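The covariance structure of the generative model can be checked by simulation. The following is a minimal sketch in Python with NumPy; the loading values, noise variances, and number of arrays are arbitrary assumptions chosen for illustration and are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arrays = 20000                       # assumed number of simulated arrays
lam = np.array([1.0, 0.8, 0.6, 0.9, 0.5, 0.7, 1.1, 0.4, 0.3])  # assumed loadings
psi_diag = np.full(lam.shape, 0.2)     # assumed diagonal of Psi

# Sample z ~ N(0, 1) and eps ~ N(0, Psi), then form x = lam * z + eps
z = rng.standard_normal(n_arrays)
eps = rng.standard_normal((n_arrays, lam.size)) * np.sqrt(psi_diag)
x = np.outer(z, lam) + eps

# Compare the empirical covariance with lam lam^T + Psi
model_cov = np.outer(lam, lam) + np.diag(psi_diag)
sample_cov = np.cov(x, rowvar=False)
max_abs_dev = np.max(np.abs(sample_cov - model_cov))
print("largest deviation from model covariance:", max_abs_dev)
```

All off-diagonal covariance between probes comes from the shared factor $z$, since the noise terms are sampled independently.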
Prior knowledge is used in the model via a prior on the loading matrix $\lambda$:
• loadings are non-negative: more mRNA gives more intensity (positive correlation)
• high values of $\lambda_i$ are seldom (high variance)
• most genes have constant signal
Rectified Gaussian as prior
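To illustrate what a rectified Gaussian looks like, the sketch below draws samples as max(0, y) with y Gaussian. The mean and standard deviation are hypothetical values chosen for illustration; the hyperparameters actually used by FARMS are not given in this transcript.

```python
import numpy as np


def sample_rectified_gaussian(mean, std, size, rng):
    """Draw y ~ N(mean, std^2) and rectify it: negative draws are set to zero."""
    return np.maximum(rng.normal(mean, std, size=size), 0.0)


rng = np.random.default_rng(0)
# mean=0.0 and std=0.5 are assumptions for this example only
loadings = sample_rectified_gaussian(mean=0.0, std=0.5, size=10000, rng=rng)
fraction_zero = np.mean(loadings == 0.0)
print("fraction of loadings exactly zero:", fraction_zero)
```

The samples are never negative, a sizeable share sits exactly at zero, and large values are rare, which matches the three pieces of prior knowledge listed above.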
E-step
M-step
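The update formulas of this slide did not survive in the transcript. As a rough illustration of the E-step / M-step alternation, the sketch below implements the textbook maximum-likelihood EM for a one-factor model with diagonal noise. It is an assumption-laden stand-in: it ignores the rectified Gaussian prior, so it is not the FARMS MAP update, and the function name and simulated data are hypothetical.

```python
import numpy as np


def fit_one_factor_em(x, n_iter=500):
    """Maximum-likelihood EM for x = lam * z + eps with diagonal noise."""
    x = x - x.mean(axis=0)              # the model assumes zero-mean data
    n = x.shape[0]
    s_diag = (x ** 2).mean(axis=0)      # diagonal of the sample covariance
    lam = 0.5 * np.sqrt(s_diag)         # arbitrary positive starting point
    psi = 0.75 * s_diag
    for _ in range(n_iter):
        # E-step: Gaussian posterior of z for every sample
        w = lam / psi                   # Psi^{-1} lam (Psi is diagonal)
        var_z = 1.0 / (1.0 + lam @ w)
        mean_z = var_z * (x @ w)
        second_moment = mean_z ** 2 + var_z
        # M-step: update the loading vector and the noise variances
        xz = x.T @ mean_z / n
        lam = xz / second_moment.mean()
        psi = np.maximum(s_diag - lam * xz, 1e-8)
    return lam, psi


# Simulated data with assumed (hypothetical) parameters
rng = np.random.default_rng(1)
true_lam = np.array([1.0, 0.8, 0.6, 0.9, 0.5])
true_psi = np.array([0.2, 0.3, 0.1, 0.25, 0.15])
z = rng.standard_normal(20000)
x = np.outer(z, true_lam) + rng.standard_normal((20000, 5)) * np.sqrt(true_psi)

lam_hat, psi_hat = fit_one_factor_em(x)
print("estimated loadings:", lam_hat)
print("estimated noise variances:", psi_hat)
```

The posterior mean of $z$ computed in the E-step is the per-array summary value in this kind of model; the sign of the loading vector is only identified up to a flip, which the positive starting point resolves here.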
Efficient through algebraic reformulations
Matrix inversion lemma (note that $\Psi$ is diagonal)
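The resulting formula is not preserved in the transcript. Assuming the lemma is applied to the model covariance $\lambda \lambda^T + \Psi$, the standard rank-one form (Sherman-Morrison) gives $(\lambda \lambda^T + \Psi)^{-1} = \Psi^{-1} - \Psi^{-1} \lambda \lambda^T \Psi^{-1} / (1 + \lambda^T \Psi^{-1} \lambda)$, which only needs element-wise division because $\Psi$ is diagonal. The sketch below checks this numerically; the function name and parameter values are hypothetical.

```python
import numpy as np


def inverse_model_covariance(lam, psi_diag):
    """Inverse of lam lam^T + diag(psi_diag) without a general matrix inversion."""
    w = lam / psi_diag                  # Psi^{-1} lam, element-wise
    return np.diag(1.0 / psi_diag) - np.outer(w, w) / (1.0 + lam @ w)


# Assumed example values
lam = np.array([1.0, 0.8, 0.6, 0.9, 0.5, 0.7, 1.1, 0.4, 0.3])
psi_diag = np.array([0.2, 0.3, 0.1, 0.25, 0.15, 0.2, 0.35, 0.1, 0.4])

fast = inverse_model_covariance(lam, psi_diag)
direct = np.linalg.inv(np.outer(lam, lam) + np.diag(psi_diag))
print("identity holds:", np.allclose(fast, direct))
```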
International competition: “Affycomp” (http://affycomp.biostat.jhsph.edu/)
Participants from Berkeley, Affymetrix, EBI, Roche, etc.
AUC (most relevant criterion)
Johnson & Johnson tested it on 30 internal data sets
→ Now their default microarray normalization method
[Figure: results on Affycomp data set A and Affycomp data set C for FARMS, RMA, and MAS 5.0]
Informative/Non-Informative Call
I/NI Call
Real-life datasets to assess I/NI calls

Accession number   Chip          Total   I/NI calls   A/P calls
E-MEXP-101         hgu133a       22283   1726         12898
E-MEXP-120         hgu133a       22283   5027         13850
E-MEXP-121         hgu133a       22283   5105         16574
E-MEXP-714         hgu133a       22283   1242         13711
E-MEXP-72          hgu133a       22283   4385         13801
Spike-in U133      hgu133a       22300   113          12869
E-MEXP-882         hgu133plus2   54675   16022        41355
E-TABM-127         hgu133plus2   54675   4962         41022
E-TABM-34          hgu133plus2   54675   12810        35162
E-TABM-84          hgu133plus2   54675   6781         38258
GSE3744            hgu133plus2   54675   10673        42625
E-MEXP-834         Mouse430_2    45101   8067         26382
E-MEXP-835         Mouse430_2    45101   5247         26891
E-MEXP-839         Mouse430_2    45101   8107         28485
E-MEXP-842         Mouse430_2    45101   1756         27945
E-TABM-102         Mouse430_2    45101   8858         29934
E-MEXP-856         Mouse430A_2   22690   5014         16569
GSE2867            Mouse430A_2   22690   3027         16412
GSE2882            Mouse430A_2   22690   4080         15035
GSE3858            Mouse430A_2   22690   2801         14379
GSE4065            Mouse430A_2   22690   984          12181
E-MEXP-553         Rat230_2      31099   3255         19261
E-MEXP-920         Rat230_2      31099   954          22725
E-MEXP-948         Rat230_2      31099   4080         19378
GSE5606            Rat230_2      31099   2723         20626
GSE6119            Rat230_2      31099   7449         22030
GSE1491            ATH1-121501   22810   3138         17855
GSE3326            ATH1-121501   22810   8186         17827
GSE3350            ATH1-121501   22810   5716         16646
GSE3416            ATH1-121501   22810   4635         15159
GSE431             ATH1-121501   22810   3593         15653

• Accession number: GEO/ArrayExpress
• Chip: chip type
• Total: number of probe sets (total)
• A/P calls: number of probe sets after A/P call
• I/NI calls: number of probe sets after I/NI call
[Figure: A/P call compared with I/NI call]
Effect of gene filtering on tests for differential expression
t-test after filtering using both A/P and I/NI calls
The proportion of significant probe sets ($\alpha = 0.05$) is given for the two filtering techniques before and after multiple testing correction with an FDR of 10% (Benjamini and Hochberg, 1995)
A/P call: 740 DE genes
I/NI call: 36 DE genes
True DE: 35 of them
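The multiple testing correction mentioned above can be sketched as follows. This is a plain implementation of the Benjamini-Hochberg step-up procedure; the function name and the example p-values are made up for illustration and are not data from the slides.

```python
import numpy as np


def benjamini_hochberg(p_values, fdr=0.10):
    """Return a boolean mask of hypotheses rejected at the given FDR level."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    thresholds = fdr * np.arange(1, m + 1) / m   # k/m * FDR for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest rank meeting its threshold
        rejected[order[: k + 1]] = True          # reject that rank and all smaller p-values
    return rejected


# Hypothetical p-values, e.g. from t-tests on the probe sets left after filtering
p_values = [0.212, 0.001, 0.065, 0.039, 0.008, 0.216, 0.042, 0.074, 0.041, 0.205]
rejected = benjamini_hochberg(p_values, fdr=0.10)
print("number of rejected hypotheses:", int(rejected.sum()))
```

Because the thresholds scale with the number of tests m, a filter that removes uninformative probe sets before testing makes the correction less severe for the remaining ones.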
An appropriate filter (enriches low p-values, controls the type I error rate) must:
1) depend on the test statistic for alternative hypotheses, to enrich the remaining hypotheses with low p-values
2) not introduce dependencies between hypotheses
3) be independent of the subsequent test statistic for null hypotheses, in order to control the type I error rate
Item 1) assures an increase of the study's power
Items 2) and 3) ensure control of the type I error rate
Talloen, Hochreiter, et al., PNAS, 2010
t-test statistic: two-sample t-test, Gaussian assumption; applies to every scale- and location-invariant test
permutation invariant: non-parametric test; any test independent of the order of its arguments
Results for CNA detection on breast cancer. “FP” and “FDR” are the number of falsely detected segments and the false detection rate on the normal cell lines, respectively. The I/NI call filter reduces the FDR 18- to 22-fold.
Area under the precision-recall curves for detecting CNVs that were previously confirmed multiple times.
The I/NI call filter clearly outperforms variance-based filters.
Precision-recall curves (PRCs) for detecting CNVs that were previously confirmed multiple times.
Panels (A) and (B) give the PRC for the whole genome for 3 loci and for 5 loci, respectively.
The I/NI call filter has a considerably lower FDR than variance-based filtering.
A perfect filtering method would call all true CNVs (red circles at 1) and would not call any others (dark-blue background at 0).
The I/NI call filter separates called true positives (true CNVs) from true negatives better than the other filters.
Rare events: few arrays have signal → few factors differ → Laplace distributed
Laplace FARMS
• likelihood cannot be analytically computed
• variational approach (from physics, “calculus of variations”)
Laplace prior on the factor instead of Gauss; the variational parameter is the variance of the local Gaussian approximation of the factor
E-step
M-step
Laplace distribution
vs. Gaussian
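The difference between the two densities can be made concrete with a short numerical comparison. The sketch assumes both distributions are scaled to zero mean and unit variance, which is a choice made here for comparability and not something stated on the slide.

```python
import numpy as np


def gaussian_pdf(z):
    """Density of N(0, 1)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)


def laplace_pdf(z):
    """Density of a zero-mean Laplace distribution with unit variance."""
    b = 1.0 / np.sqrt(2.0)   # scale b such that the variance 2 * b**2 equals 1
    return np.exp(-np.abs(z) / b) / (2.0 * b)


for z in (0.0, 1.0, 4.0):
    print(f"z={z}: Gauss={gaussian_pdf(z):.5f}, Laplace={laplace_pdf(z):.5f}")
```

Under this scaling the Laplace density is higher at zero and in the tails and lower in between, which fits a factor that is close to zero on most arrays and large on a few.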
Biclustering
Beerse,
Belgium
University Hasselt, Statistics
Univ. Leuven, Statistics
Cleveland, Ohio, United States
Group: Experimental Haematology and Hematopoiesis
Taussig Cancer Center
Prof. Dr. J. Maciejewski
Biclustering: Cluster rows and columns of a matrix simultaneously
Bicluster: A set of row indices and a set of column indices
Bicluster criterion: Row vectors are similar to each other on the subset of column indices, and columns are similar to each other on the subset of row indices
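The definition can be illustrated with a toy matrix in which a bicluster is planted by hand. The matrix size, the index sets, the pattern, and the spread measure below are all assumptions made for this example; this is not the FABIA model.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.standard_normal((8, 10))    # background noise

# A bicluster is a set of row indices together with a set of column indices
rows = [1, 4, 6]
cols = [0, 3, 7, 9]

# Plant a shared pattern so the selected rows look alike on the selected columns
pattern = np.array([3.0, -2.0, 4.0, 1.0])
matrix[np.ix_(rows, cols)] = pattern + 0.05 * rng.standard_normal((3, 4))


def row_spread(mat, row_idx, col_idx):
    """Mean standard deviation across the selected rows, per selected column."""
    sub = mat[np.ix_(row_idx, col_idx)]
    return sub.std(axis=0).mean()


inside = row_spread(matrix, rows, cols)          # rows of the bicluster
outside = row_spread(matrix, [0, 2, 3], cols)    # other rows, same columns
print("spread inside:", inside, "spread outside:", outside)
```

The rows of the bicluster are similar only on the chosen column subset; on the remaining columns they are as unrelated as any other rows, which is what distinguishes biclustering from clustering whole rows.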