Introduction
The SHIP covariance estimator
Applications
Concluding remarks
Over-optimism
SHrinkage covariance estimation Incorporating
Prior biological knowledge with applications
to high-dimensional data
V. Guillemot, M. Jelizarow, A. Tenenhaus, A.-L. Boulesteix
Ludwig-Maximilians-Universität München
ISI 2011
Dublin, August 25th 2011
Boulesteix
High-dimensional prediction
1/19
High-dimensional omics data

$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
x_{21} & \cdots & x_{2p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}
$$
- Random vector (X1, ..., Xp)ᵀ with covariance matrix Σ
- Example: gene expression data. X1, ..., Xp = expression levels of genes, with p on the order of 10,000 and n on the order of 100
- The unbiased empirical covariance estimator Σ̂ is ill-conditioned if n ≪ p: p(p + 1)/2 parameters but only n observations.
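The ill-conditioning can be seen numerically. The following is a minimal sketch, assuming standard normal data and hypothetical sizes n = 20 and p = 100 (chosen small so that it runs quickly; they are not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # hypothetical sizes with n << p
X = rng.standard_normal((n, p))      # rows = observations, columns = variables

S = np.cov(X, rowvar=False)          # unbiased empirical covariance, shape (p, p)
n_params = p * (p + 1) // 2          # free parameters of a symmetric p x p matrix
rank = np.linalg.matrix_rank(S)      # centring removes one degree of freedom

print(f"{n_params} parameters estimated from {n} observations")
print(f"rank of S: {rank} (at most n - 1 = {n - 1}), so S is singular")
```

Because the rank of the estimate cannot exceed n − 1, it has no inverse whenever n ≤ p, which is a problem for any method that needs Σ⁻¹.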
Covariance estimation in multivariate methods
Many multivariate statistical methods require the estimation of the covariance matrix Σ or its inverse Σ⁻¹:
- global test with GlobalANCOVA for gene-set analysis (Hummel et al., Bioinformatics 2008)
- multiblock analysis with RGCCA (Tenenhaus & Tenenhaus, Psychometrika 2011)
- linear discriminant analysis
Shrinkage covariance estimation
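Schäfer & Strimmer (2005), Ledoit and Wolf (2003):
$$\Sigma^* = \lambda T + (1 - \lambda)\hat{\Sigma}$$
λ is an analytically determined shrinkage parameter (the weight given to the target).
T is a structured covariance target, e.g.
Target D:
$$
t_{ij} = \begin{cases}
s_{ii} & \text{if } i = j \\
0 & \text{if } i \neq j
\end{cases}
$$
Target F:
$$
t_{ij} = \begin{cases}
s_{ii} & \text{if } i = j \\
\bar{r}\sqrt{s_{ii} s_{jj}} & \text{if } i \neq j
\end{cases}
$$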
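As an illustration, targets D and F can be built directly from the empirical covariance matrix. This is a sketch under stated assumptions: λ is treated as the weight on the target (as in the SHIP formula Σ*SHIP = λT + (1 − λ)Σ̂), r̄ is taken to be the mean of the off-diagonal sample correlations, and the value λ = 0.3 is a hypothetical placeholder rather than the analytically determined one.

```python
import numpy as np

def target_d(S):
    """Target D: keep the variances, set all covariances to zero."""
    return np.diag(np.diag(S))

def target_f(S):
    """Target F: keep the variances, use one common correlation r_bar."""
    p = S.shape[0]
    sd = np.sqrt(np.diag(S))
    corr = S / np.outer(sd, sd)
    off = ~np.eye(p, dtype=bool)
    r_bar = corr[off].mean()             # assumption: mean off-diagonal correlation
    T = r_bar * np.outer(sd, sd)
    T[np.diag_indices(p)] = np.diag(S)
    return T

def shrink(S, T, lam):
    """Convex combination of target T and empirical covariance S."""
    return lam * T + (1.0 - lam) * S

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))        # hypothetical data with n < p
S = np.cov(X, rowvar=False)
T_D, T_F = target_d(S), target_f(S)
shrunk_d = shrink(S, T_D, lam=0.3)       # lam = 0.3 is an arbitrary illustration

# Unlike S, the estimate shrunk towards target D is positive definite.
print(np.linalg.eigvalsh(shrunk_d).min() > 0)
```

With target D the shrunk matrix is always invertible when λ > 0 and all variances are positive, since it adds a positive definite diagonal matrix to a positive semi-definite one.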
A popular idea: incorporate prior biological knowledge on the structure of the variables X1, ..., Xp into statistical learning methods.
Our contribution: implement this idea in the framework of shrinkage covariance estimation through the choice of an adequate target.
SHIP (implemented in R package SHIP)
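$$\Sigma^*_{\mathrm{SHIP}} = \lambda T + (1 - \lambda)\hat{\Sigma}$$
with Target G:
$$
t_{ij} = \begin{cases}
s_{ii} & \text{if } i = j \\
\bar{r}\sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\
0 & \text{otherwise}
\end{cases}
$$
i ∼ j means, e.g., that Xi and Xj are in the same pathway, same cluster, etc.
SHIP: SHrinkage covariance estimation Incorporating Prior biological knowledge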
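Target G can be sketched in a few lines. This is not the code of the R package SHIP; it is an illustration in which the relation i ∼ j is encoded by hypothetical group labels (one label per variable), r̄ is assumed to be the mean sample correlation over the pairs with i ∼ j, and λ is passed in by hand.

```python
import numpy as np

def target_g(S, groups):
    """Target G: common correlation r_bar for pairs in the same group, 0 elsewhere."""
    p = S.shape[0]
    sd = np.sqrt(np.diag(S))
    off = ~np.eye(p, dtype=bool)
    linked = (groups[:, None] == groups[None, :]) & off   # i ~ j and i != j
    corr = S / np.outer(sd, sd)
    # Assumption: r_bar averages the sample correlations of the linked pairs.
    r_bar = corr[linked].mean() if linked.any() else 0.0
    T = np.where(linked, r_bar * np.outer(sd, sd), 0.0)
    T[np.diag_indices(p)] = np.diag(S)
    return T

def ship(S, T, lam):
    """SHIP estimator: lam * T + (1 - lam) * S."""
    return lam * T + (1.0 - lam) * S

rng = np.random.default_rng(0)
groups = np.array([0, 0, 1, 1, 2])       # hypothetical pathway membership
X = rng.standard_normal((8, 5))
S = np.cov(X, rowvar=False)
T = target_g(S, groups)
Sigma_ship = ship(S, T, lam=0.3)         # lam = 0.3 is an arbitrary illustration
```

Variables 0 and 3 belong to different groups, so `T[0, 3]` is exactly zero, while `T[0, 1]` carries the common correlation scaled by the two standard deviations.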
Choice of λ
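The shrinkage parameter λ is chosen analytically to minimize the MSE:
$$
\hat{\lambda} = \frac{\sum_{i \neq j} \widehat{\mathrm{Var}}(s_{ij}) - \sum_{i \sim j} \bar{r} f_{ij}}{\sum_{i \neq j} \left(s_{ij} - I(i \sim j)\,\bar{r}\sqrt{s_{ii} s_{jj}}\right)^2}
$$
in the special case of target G, where:
$$
\widehat{\mathrm{Var}}(s_{ij}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})^2
$$
$$
\widehat{\mathrm{Cov}}(s_{ij}, s_{lm}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})(w_{klm} - \bar{w}_{lm})
$$
$$
f_{ij} = \frac{1}{2} \left\{ \sqrt{\frac{s_{jj}}{s_{ii}}}\, \widehat{\mathrm{Cov}}(s_{ii}, s_{ij}) + \sqrt{\frac{s_{ii}}{s_{jj}}}\, \widehat{\mathrm{Cov}}(s_{jj}, s_{ij}) \right\}
$$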
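The formula for λ̂ can be transcribed into array operations. The sketch below makes several assumptions that the slide does not state: w_kij = (x_ki − x̄_i)(x_kj − x̄_j) with w̄_ij its mean over k (the usual definition in the shrinkage literature), r̄ is the mean sample correlation over pairs with i ∼ j, and the result is clipped to [0, 1]. The function name `ship_lambda_g` is hypothetical.

```python
import numpy as np

def ship_lambda_g(X, linked):
    """Estimate the shrinkage intensity for target G.

    X      : (n, p) data matrix
    linked : (p, p) boolean matrix, True where i ~ j
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # w_kij for all k, i, j; needs O(n p^2) memory, fine for small p only
    W = Xc[:, :, None] * Xc[:, None, :]
    W_bar = W.mean(axis=0)
    Wc = W - W_bar
    S = n / (n - 1) * W_bar                  # unbiased empirical covariance
    c = n / (n - 1) ** 3

    var_s = c * (Wc ** 2).sum(axis=0)        # Var^(s_ij)
    Wd = np.diagonal(Wc, axis1=1, axis2=2)   # centred w_kii, shape (n, p)
    cov_ii_ij = c * np.einsum("ki,kij->ij", Wd, Wc)   # Cov^(s_ii, s_ij)
    cov_jj_ij = c * np.einsum("kj,kij->ij", Wd, Wc)   # Cov^(s_jj, s_ij)

    sd = np.sqrt(np.diag(S))
    ratio = sd[None, :] / sd[:, None]        # sqrt(s_jj / s_ii) at position (i, j)
    f = 0.5 * (ratio * cov_ii_ij + cov_jj_ij / ratio)

    off = ~np.eye(p, dtype=bool)
    linked = linked & off
    corr = S / np.outer(sd, sd)
    r_bar = corr[linked].mean() if linked.any() else 0.0   # assumption

    target_off = np.where(linked, r_bar * np.outer(sd, sd), 0.0)
    numerator = var_s[off].sum() - r_bar * f[linked].sum()
    denominator = ((S - target_off)[off] ** 2).sum()
    # Clipping to [0, 1] is a common safeguard, not part of the slide's formula.
    return float(np.clip(numerator / denominator, 0.0, 1.0))

rng = np.random.default_rng(0)
groups = np.array([0, 0, 0, 1, 1, 2])        # hypothetical group labels
linked = groups[:, None] == groups[None, :]
X = rng.standard_normal((30, 6))
lam = ship_lambda_g(X, linked)
```

When no pair is linked, the second sum in the numerator vanishes and the expression reduces to the intensity for the diagonal target D.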
Simulation design
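Multivariate normal distribution with block-diagonal covariance matrix
$$
\Sigma = \begin{pmatrix}
A_1 & 0 & 0 \\
0 & A_2 & 0 \\
0 & 0 & A_K
\end{pmatrix}
$$
with $A_k = (1 - a_k) I_{p_k} + a_k J_{p_k}$, where $a_k$ is a scalar in ]0, 1[.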
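A design of this kind can be generated as follows. The sketch assumes that J denotes the all-ones matrix, that the mean vector is zero, and uses hypothetical block sizes and values of a_k; none of these settings are taken from the slides.

```python
import numpy as np

def block_covariance(sizes, a):
    """Block-diagonal covariance with blocks A_k = (1 - a_k) I + a_k J."""
    p = sum(sizes)
    Sigma = np.zeros((p, p))
    start = 0
    for p_k, a_k in zip(sizes, a):
        block = (1.0 - a_k) * np.eye(p_k) + a_k * np.ones((p_k, p_k))
        Sigma[start:start + p_k, start:start + p_k] = block
        start += p_k
    return Sigma

rng = np.random.default_rng(1)
sizes = [3, 2, 4]                    # hypothetical block sizes p_k
a = [0.8, 0.5, 0.3]                  # hypothetical within-block correlations a_k
Sigma = block_covariance(sizes, a)

# Draw n = 50 observations from N(0, Sigma).
X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=50)
```

Each block has unit variances and a constant correlation a_k, so the true relation i ∼ j is simply "same block", which is what target G is designed to exploit.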
SHIP-based linear discriminant analysis
- Supervised classification method to predict class membership (e.g. disease vs. healthy) based on normality assumptions.
- Idea: plug the estimator Σ̂*SHIP into the discriminant function of linear discriminant analysis.
[Figure: test error rate for targets D, G and G (p); two panels: low correlations, high correlations.]
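The plug-in idea can be sketched as follows: estimate the pooled within-class covariance, shrink it towards target G, and use the result in the linear discriminant scores. Everything specific here is an assumption made for illustration: the simulated data (8 hypothetical pathways of 5 variables each), the fixed λ = 0.5 instead of the analytic value, and r̄ computed as the mean within-group sample correlation.

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance matrix."""
    classes = np.unique(y)
    S = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[y == c] - X[y == c].mean(axis=0)
        S += Xc.T @ Xc
    return S / (X.shape[0] - len(classes))

def target_g(S, groups):
    """Target G with r_bar = mean within-group sample correlation (assumption)."""
    sd = np.sqrt(np.diag(S))
    linked = (groups[:, None] == groups[None, :]) & ~np.eye(len(groups), dtype=bool)
    r_bar = (S / np.outer(sd, sd))[linked].mean()
    T = np.where(linked, r_bar * np.outer(sd, sd), 0.0)
    T[np.diag_indices_from(T)] = np.diag(S)
    return T

def lda_predict(X_train, y_train, X_test, cov):
    """Linear discriminant scores x' C^-1 mu - 0.5 mu' C^-1 mu + log(prior)."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        mu = X_train[y_train == c].mean(axis=0)
        w = np.linalg.solve(cov, mu)
        scores.append(X_test @ w - 0.5 * mu @ w + np.log(np.mean(y_train == c)))
    return classes[np.argmax(np.column_stack(scores), axis=1)]

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(8), 5)              # hypothetical pathways, p = 40

def simulate(n):
    y = np.repeat([0, 1], n // 2)
    factor = rng.standard_normal((n, 8))[:, groups]          # shared within group
    X = np.sqrt(0.5) * factor + np.sqrt(0.5) * rng.standard_normal((n, 40))
    X[y == 1, :10] += 2.0                        # mean shift in two pathways
    return X, y

X_train, y_train = simulate(60)
X_test, y_test = simulate(200)

S = pooled_covariance(X_train, y_train)
cov = 0.5 * target_g(S, groups) + 0.5 * S        # lam = 0.5, arbitrary illustration
pred = lda_predict(X_train, y_train, X_test, cov)
test_error = np.mean(pred != y_test)
```

Replacing `target_g(S, groups)` with the diagonal target `np.diag(np.diag(S))` gives the corresponding target D classifier for comparison.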
Regularized Generalized Canonical Correlation Analysis
(Tenenhaus & Tenenhaus, Psychometrika 2011)
- generalization of Canonical Correlation Analysis (CCA) for multiblock analysis
- needs an estimate of the inverse covariance in each block
- evaluation criterion: MSE of covariance of latent components
- target H: variant of target G with non-zero correlation for all pairs of variables within a block even if i ≁ j:
$$
t_{ij} = \begin{cases}
s_{ii} & \text{if } i = j \\
\bar{r}_C \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\
\bar{r}_{NC} \sqrt{s_{ii} s_{jj}} & \text{otherwise}
\end{cases}
$$
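Target H differs from target G only in the value used for pairs that are not connected. A minimal sketch, assuming that r̄_C and r̄_NC are the mean sample correlations over connected and non-connected pairs respectively (the slide does not define them), and that `linked` is a hypothetical boolean matrix encoding i ∼ j within one block:

```python
import numpy as np

def target_h(S, linked):
    """Target H: r_c for connected pairs, r_nc for all other off-diagonal pairs."""
    p = S.shape[0]
    sd = np.sqrt(np.diag(S))
    corr = S / np.outer(sd, sd)
    off = ~np.eye(p, dtype=bool)
    conn = linked & off
    nonconn = ~linked & off
    r_c = corr[conn].mean() if conn.any() else 0.0         # assumption
    r_nc = corr[nonconn].mean() if nonconn.any() else 0.0  # assumption
    T = np.where(conn, r_c, r_nc) * np.outer(sd, sd)
    T[np.diag_indices(p)] = np.diag(S)
    return T, r_c, r_nc

rng = np.random.default_rng(0)
groups = np.array([0, 0, 1, 1, 1])       # hypothetical connections in one block
linked = groups[:, None] == groups[None, :]
X = rng.standard_normal((12, 5))
S = np.cov(X, rowvar=False)
T, r_c, r_nc = target_h(S, linked)
```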
Regularized Generalized Canonical Correlation Analysis


$$
\Sigma_\eta = \begin{pmatrix}
1 & \rho_{1,2} & \rho_{1,3} \\
\rho_{2,1} & 1 & \rho_{2,3} \\
\rho_{3,1} & \rho_{3,2} & 1
\end{pmatrix}
$$
[Figure: MSE for targets D, H and H (p); four panels: n = 50 and α = 2, n = 50 and α = 0.1, n = 200 and α = 2, n = 200 and α = 0.1.]
GlobalANCOVA (Hummel et al, Bioinformatics 2008)
- Global test of the equality of the mean vector in two groups (e.g. disease vs. healthy patients)
- The asymptotic testing procedure uses an estimate of the covariance matrix.
- Investigated scenario: each gene is represented by several variables (probesets), yielding small, highly correlated groups of variables.
Under H0 (equal mean vector in the two groups), the p-values are not uniformly distributed when the standard (diagonal) target D is used!
Conclusions on SHIP (pro)
- "Proof of concept" successful in simulations
- Simple generalization of an existing approach (shrinkage estimation)
- Common framework for various applications
- Possible extensions: relaxing the cluster structure, making the target well-conditioned, etc.
Conclusions on SHIP (cons)
- Limitation in practice: such prior information on the structure of the variables is often
  - incomplete
  - partially unreliable
  - not directly connected to the notion of correlation
  → approaches not related to correlation may be more successful in some cases.
- Do not forget substantive context...
- Do not be over-optimistic when assessing a new method...
Optimization mechanisms in the n ≪ p setting
- In simulations we find that the improvement of LDA through SHIP is very moderate.
- With real data we observe no improvement, probably because the considered prior information is not related to correlation.
- However, by "fishing for significance" we can make the classification results look fine.
- That is because error estimation is very variable in these settings and thus prone to optimization.
Jelizarow et al, 2010. Over-optimism in bioinformatics: an illustration.
Bioinformatics 26:1990–1998.
Optimization mechanisms in the n ≪ p setting
- Optimization of the data sets: Try the new method on different data sets... and report only the best results...
- Optimization of the competing methods: Omit the best state-of-the-art competing methods in the comparison study.
- Optimization of the settings: Try the new method in combination with different variable selection or preprocessing steps... and report only the best results...
- Optimization of the methods' characteristics: Consider several variants of the new method... and report only the best results...
Jelizarow et al, 2010. Over-optimism in bioinformatics: an illustration.
Bioinformatics 26:1990–1998.
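The effect of reporting only the best result can be illustrated with a small simulation. This sketch is not the analysis of the cited paper; it assumes a situation with no signal at all (true error rate 0.5), a hypothetical test set of 40 observations, and 20 method variants whose error estimates are treated as independent, which is a simplification since variants evaluated on the same data are usually correlated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_test = 40          # hypothetical number of test observations
n_variants = 20      # hypothetical number of variants / settings tried
n_repeats = 2000     # Monte Carlo repetitions
true_error = 0.5     # labels carry no signal

# Estimated error rate of each variant in each repetition.
errors = rng.binomial(n_test, true_error, size=(n_repeats, n_variants)) / n_test

honest = errors[:, 0].mean()        # one pre-specified variant
best = errors.min(axis=1).mean()    # report only the best of all variants

print(f"pre-specified variant: {honest:.3f}")
print(f"best of {n_variants} variants:   {best:.3f}")
```

The pre-specified variant averages close to the true error of 0.5, whereas the minimum over the variants is clearly lower although no variant is better than guessing. The smaller the test set, the more variable each estimate and the larger this gap.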
On the difficulty to evaluate new integrating methods
The performance of a method like SHIP-LDA depends on:
1. the performance of LDA
2. the performance of SHIP
3. the adequacy between the principle of SHIP and the biological
information (does a connection in KEGG indicate higher
correlation?)
4. the reliability of the biological information
1 and 2 can be addressed in simulations, but not 3 and 4.
3 and 4 can be addressed in real data examples, but:
- we have problems in the n ≪ p setting (high variability of estimated errors),
- for unsupervised methods there is no natural performance criterion.
Thank you for your attention!