Introduction
The SHIP covariance estimator
Applications
Concluding remarks
Over-optimism
SHrinkage covariance estimation Incorporating
Prior biological knowledge with applications
to high-dimensional data
V. Guillemot, M. Jelizarow, A. Tenenhaus, A.-L. Boulesteix
Ludwig-Maximilians-Universität München
ISI 2011
Dublin, August 25th 2011
Boulesteix
High-dimensional prediction
High-dimensional omics data
X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}
- Random vector (X1, ..., Xp)ᵀ with covariance matrix Σ
- Example: gene expression data, where X1, ..., Xp are the expression levels of p genes, with p ≈ 10,000 and n ≈ 100
- The unbiased empirical covariance estimator Σ̂ is ill-conditioned if n ≪ p: p(p + 1)/2 parameters but only n observations.
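The last point can be checked numerically. The following is a minimal NumPy sketch, not taken from the talk: the sizes n = 20 and p = 100 and the standard normal data are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # hypothetical sizes with n << p
X = rng.standard_normal((n, p))      # rows = observations, columns = variables

S = np.cov(X, rowvar=False)          # unbiased empirical covariance, shape (p, p)
rank = np.linalg.matrix_rank(S)      # at most n - 1 because the data are centered
n_params = p * (p + 1) // 2          # free parameters of a symmetric p x p matrix

print(f"rank(S) = {rank} < p = {p}; {n_params} parameters from {n} observations")
```

Because the rank of S cannot exceed n - 1, S has no inverse in this setting, which is why methods that need Σ⁻¹ require a regularized estimator.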
Covariance estimation in multivariate methods
Many multivariate statistical methods require the estimation of the covariance matrix Σ or its inverse Σ⁻¹:
- global test with GlobalANCOVA for gene-set analysis (Hummel et al., Bioinformatics 2008)
- multiblock analysis with RGCCA (Tenenhaus & Tenenhaus, Psychometrika 2011)
- linear discriminant analysis
Shrinkage covariance estimation
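Schäfer & Strimmer (2005), Ledoit & Wolf (2003):
Σ* = λT + (1 − λ)Σ̂
λ is an analytically determined shrinkage parameter, the weight given to the target.
T is a structured covariance target, e.g.
Target D (diagonal):
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
Target F (constant correlation):
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r} \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \end{cases}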
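As a rough illustration of the two targets, here is a NumPy sketch under stated assumptions. It takes λ as the weight on the target T (the convention of the SHIP formula and of the analytic λ̂), and it assumes r̄ is the mean of the off-diagonal sample correlations, which the slide does not define. The function names and the fixed λ = 0.5 are hypothetical.

```python
import numpy as np

def target_D(S):
    """Diagonal target: keep the variances, set all covariances to zero."""
    return np.diag(np.diag(S))

def target_F(S):
    """Constant-correlation target: off-diagonal entries r_bar * sqrt(s_ii * s_jj)."""
    sd = np.sqrt(np.diag(S))
    R = S / np.outer(sd, sd)                  # sample correlation matrix
    off = ~np.eye(S.shape[0], dtype=bool)     # mask of the pairs i != j
    r_bar = R[off].mean()                     # assumption: mean off-diagonal correlation
    T = r_bar * np.outer(sd, sd)
    np.fill_diagonal(T, np.diag(S))           # diagonal keeps the sample variances
    return T

def shrink(S, T, lam):
    """Convex combination with weight lam on the target T."""
    return lam * T + (1.0 - lam) * S

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))             # hypothetical data, n = 10 < p = 30
S = np.cov(X, rowvar=False)                   # singular because n < p
S_star = shrink(S, target_D(S), lam=0.5)      # lam fixed by hand here, not estimated
```

With a positive definite target such as D and any λ > 0, the shrunken matrix is positive definite even though S itself is singular.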
A popular idea: incorporate prior biological knowledge on the structure of the variables X1, ..., Xp into statistical learning methods.
Our contribution: implement this idea in the framework of shrinkage covariance estimation through the choice of an adequate target.
SHIP (implemented in R package SHIP)
Σ*_SHIP = λT + (1 − λ)Σ̂
with target G:
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r} \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\ 0 & \text{otherwise} \end{cases}
i ∼ j means, e.g., that Xi and Xj are in the same pathway, same cluster, etc.
SHIP: SHrinkage covariance estimation Incorporating Prior biological knowledge
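Target G could be built along the following lines. This is a NumPy sketch, not the code of the R package SHIP. It assumes that each variable carries a single group label (so i ∼ j means equal labels) and that r̄ is the mean sample correlation over the pairs i ∼ j. The function name and the labels are hypothetical.

```python
import numpy as np

def target_G(S, groups):
    """Target G: constant correlation within groups, zero covariance across groups."""
    groups = np.asarray(groups)
    sd = np.sqrt(np.diag(S))
    R = S / np.outer(sd, sd)                            # sample correlations
    off = ~np.eye(S.shape[0], dtype=bool)
    conn = (groups[:, None] == groups[None, :]) & off   # pairs with i != j and i ~ j
    r_bar = R[conn].mean()                              # assumption: mean over i ~ j
    T = np.where(conn, r_bar * np.outer(sd, sd), 0.0)
    np.fill_diagonal(T, np.diag(S))                     # diagonal keeps the variances
    return T, r_bar

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 5))                        # hypothetical data
S = np.cov(X, rowvar=False)
groups = [0, 0, 1, 1, 1]                                # hypothetical pathway labels
T_G, r_bar = target_G(S, groups)
```

Genes that belong to several pathways would need a pairwise adjacency matrix instead of one label per variable; the `conn` mask is the only part that would change.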
Choice of λ
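The shrinkage parameter λ is chosen analytically to minimize the MSE:
\hat{\lambda} = \frac{\sum_{i \neq j} \widehat{\mathrm{Var}}(s_{ij}) - \sum_{i \sim j} \bar{r} f_{ij}}{\sum_{i \neq j} \left( s_{ij} - I(i \sim j)\, \bar{r} \sqrt{s_{ii} s_{jj}} \right)^2}
in the special case of target G, where:
\widehat{\mathrm{Var}}(s_{ij}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})^2
\widehat{\mathrm{Cov}}(s_{ij}, s_{lm}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})(w_{klm} - \bar{w}_{lm})
f_{ij} = \frac{1}{2} \left\{ \sqrt{\frac{s_{jj}}{s_{ii}}}\, \widehat{\mathrm{Cov}}(s_{ii}, s_{ij}) + \sqrt{\frac{s_{ii}}{s_{jj}}}\, \widehat{\mathrm{Cov}}(s_{jj}, s_{ij}) \right\}.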
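The formula for λ̂ can be turned into code roughly as follows. The sketch makes several assumptions that the slide does not spell out. Following Schäfer & Strimmer (2005), it takes w_kij = (x_ki − x̄_i)(x_kj − x̄_j), with w̄_ij its mean over k. It takes r̄ as the mean sample correlation over the pairs i ∼ j, and reads the sum over i ∼ j as running over ordered pairs with i ≠ j. Clipping the result to [0, 1] is an added safeguard, and the function name is hypothetical.

```python
import numpy as np

def ship_lambda(X, groups):
    """Analytic shrinkage intensity for target G (sketch, see assumptions above)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (n - 1)                              # entries s_ij
    sd = np.sqrt(np.diag(S))
    groups = np.asarray(groups)
    off = ~np.eye(p, dtype=bool)                         # pairs i != j
    conn = (groups[:, None] == groups[None, :]) & off    # pairs i != j with i ~ j

    # W[k, i, j] = w_kij; needs O(n p^2) memory, so only suitable for small p.
    W = Xc[:, :, None] * Xc[:, None, :]
    Wc = W - W.mean(axis=0)                              # w_kij - w_bar_ij
    c = n / (n - 1) ** 3
    var_s = c * (Wc ** 2).sum(axis=0)                    # Var-hat(s_ij)
    Wdiag = Wc[:, np.arange(p), np.arange(p)]            # w_kii - w_bar_ii, shape (n, p)
    cov_ii_ij = c * np.einsum("ki,kij->ij", Wdiag, Wc)   # Cov-hat(s_ii, s_ij)
    cov_jj_ij = c * np.einsum("kj,kij->ij", Wdiag, Wc)   # Cov-hat(s_jj, s_ij)
    ratio = sd[None, :] / sd[:, None]                    # sqrt(s_jj / s_ii) at [i, j]
    f = 0.5 * (ratio * cov_ii_ij + cov_jj_ij / ratio)

    r_bar = (S / np.outer(sd, sd))[conn].mean()
    T_off = np.where(conn, r_bar * np.outer(sd, sd), 0.0)  # off-diagonal part of G
    num = var_s[off].sum() - r_bar * f[conn].sum()
    den = ((S - T_off)[off] ** 2).sum()
    return float(np.clip(num / den, 0.0, 1.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))                         # hypothetical data
lam = ship_lambda(X, groups=[0, 0, 0, 1, 1, 1])          # hypothetical group labels
```

Setting every label to a distinct value would leave no pair with i ∼ j, in which case r̄ is undefined and the sketch does not apply; target D should be handled separately.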
Simulation design
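Multivariate normal distribution with block-diagonal covariance matrix
\Sigma = \begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_K \end{pmatrix}
with A_k = (1 − a_k) I_{p_k} + a_k J_{p_k}, where a_k is a scalar in ]0, 1[, I_{p_k} is the p_k × p_k identity matrix and J_{p_k} is the p_k × p_k matrix of ones.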
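A design of this kind can be simulated as in the NumPy sketch below. The block sizes, the values of a_k, the zero mean vector and the sample size are hypothetical and are not the settings used in the talk.

```python
import numpy as np

def block_cov(sizes, a):
    """Block-diagonal matrix with blocks A_k = (1 - a_k) I + a_k J."""
    p = sum(sizes)
    Sigma = np.zeros((p, p))
    start = 0
    for p_k, a_k in zip(sizes, a):
        A_k = (1.0 - a_k) * np.eye(p_k) + a_k * np.ones((p_k, p_k))
        Sigma[start:start + p_k, start:start + p_k] = A_k
        start += p_k
    return Sigma

rng = np.random.default_rng(1)
Sigma = block_cov(sizes=[3, 2, 4], a=[0.8, 0.5, 0.2])   # hypothetical K = 3 blocks
mean = np.zeros(Sigma.shape[0])                          # assumption: zero mean vector
X = rng.multivariate_normal(mean, Sigma, size=50)        # 50 simulated observations
```

Each block has unit variances and a common within-block correlation a_k, so variables in the same block play the role of variables with i ∼ j.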
SHIP-based linear discriminant analysis
- Supervised classification method to predict class membership (e.g. disease vs. healthy) based on normality assumptions.
- Idea: plug the estimator Σ*_SHIP into the discriminant function of linear discriminant analysis.
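The plug-in idea can be sketched with the usual linear discriminant score δ_c(x) = xᵀΣ⁻¹μ_c − ½ μ_cᵀΣ⁻¹μ_c + log π_c. The NumPy code below is an illustration under assumptions, not the authors' implementation. As a stand-in for Σ*_SHIP it uses shrinkage towards the diagonal target with a hand-picked λ = 0.5; any other regularized estimator can be passed in its place. All function names and the simulated data are hypothetical.

```python
import numpy as np

def shrink_to_diagonal(Xc, lam=0.5):
    """Stand-in estimator: shrink towards target D with a fixed lam (not estimated)."""
    S = Xc.T @ Xc / (Xc.shape[0] - 1)        # divides by n - 1 for simplicity
    return lam * np.diag(np.diag(S)) + (1.0 - lam) * S

def lda_fit(X, y, cov_estimator=shrink_to_diagonal):
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    Xc = X - means[np.searchsorted(classes, y)]   # subtract each sample's class mean
    Sigma = cov_estimator(Xc)                     # regularized covariance estimate
    coef = np.linalg.solve(Sigma, means.T).T      # row c holds Sigma^{-1} mu_c
    intercept = -0.5 * np.sum(means * coef, axis=1) + np.log(priors)
    return classes, coef, intercept

def lda_predict(model, X):
    classes, coef, intercept = model
    scores = X @ coef.T + intercept               # one discriminant score per class
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
n_per_class, p = 20, 50                           # hypothetical sizes, n < p
X = np.vstack([rng.standard_normal((n_per_class, p)),
               rng.standard_normal((n_per_class, p)) + 1.0])
y = np.repeat([0, 1], n_per_class)
model = lda_fit(X, y)
train_accuracy = np.mean(lda_predict(model, X) == y)
```

The training accuracy computed here is only a smoke test. A fair assessment needs independent test data or resampling, and error estimates in the n ≪ p setting are highly variable.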
[Figure: test error rate of LDA with targets D, G and G (p), in two panels: low correlations and high correlations.]
Regularized Generalized Canonical Correlation Analysis
(Tenenhaus & Tenenhaus, Psychometrika 2011)
- generalization of Canonical Correlation Analysis (CCA) for multiblock analysis
- needs an estimate of the inverse covariance in each block
- evaluation criterion: MSE of the covariance of the latent components
- target H: variant of target G with non-zero correlation for all pairs of variables within a block, even if i ≁ j:
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r}_{C} \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\ \bar{r}_{NC} \sqrt{s_{ii} s_{jj}} & \text{otherwise} \end{cases}
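Target H for a single block might be coded as follows. The sketch assumes that r̄_C and r̄_NC are the mean sample correlations over connected (i ∼ j) and non-connected pairs respectively, which the slide does not state. The function name and the labels are hypothetical.

```python
import numpy as np

def target_H(S, groups):
    """Target H: one common correlation for pairs i ~ j, another for all other pairs."""
    groups = np.asarray(groups)
    sd = np.sqrt(np.diag(S))
    R = S / np.outer(sd, sd)                            # sample correlations
    off = ~np.eye(S.shape[0], dtype=bool)
    conn = (groups[:, None] == groups[None, :]) & off   # connected pairs, i != j
    r_c = R[conn].mean()                                # assumption: mean over i ~ j
    r_nc = R[off & ~conn].mean()                        # assumption: mean over the rest
    T = np.where(conn, r_c, r_nc) * np.outer(sd, sd)
    np.fill_diagonal(T, np.diag(S))                     # diagonal keeps the variances
    return T, r_c, r_nc

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 5))                        # hypothetical block of 5 variables
S = np.cov(X, rowvar=False)
T_H, r_c, r_nc = target_H(S, groups=[0, 0, 1, 1, 1])    # hypothetical labels
```

Compared with target G, the only difference is that pairs without a connection receive r̄_NC √(s_ii s_jj) instead of zero.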
Regularized Generalized Canonical Correlation Analysis
\Sigma_\eta = \begin{pmatrix} 1 & \rho_{1,2} & \rho_{1,3} \\ \rho_{2,1} & 1 & \rho_{2,3} \\ \rho_{3,1} & \rho_{3,2} & 1 \end{pmatrix}
[Figure: MSE for targets D, H and H (p), in four panels: n = 50 and α = 2; n = 50 and α = 0.1; n = 200 and α = 2; n = 200 and α = 0.1.]
GlobalANCOVA (Hummel et al, Bioinformatics 2008)
- Global test of the equality of the mean vector in two groups (e.g. disease vs. healthy patients)
- The asymptotic testing procedure uses an estimate of the covariance matrix.
- Investigated scenario: each gene is represented by several variables (probesets), yielding small highly correlated groups of variables.
Under H0 (equal mean vector in the two groups), the p-values are not uniformly distributed using the standard (diagonal) target D!
Conclusions on SHIP (pro)
- “Proof of concept” successful in simulations
- Simple generalization of an existing approach (shrinkage estimation)
- Common framework for various applications
- Possible extensions: relaxing the cluster structure, making the target well-conditioned, etc.
Conclusions on SHIP (cons)
- Limitation in practice: such prior information on the structure of the variables is often
  - incomplete
  - partially unreliable
  - not directly connected to the notion of correlation
  → approaches not related to correlation may be more successful in some cases.
- Do not forget substantive context...
- Do not be over-optimistic when assessing a new method...
Optimization mechanisms in the n ≪ p setting
- In simulations we find that the improvement of LDA through SHIP is very moderate.
- With real data we observe no improvement, probably because the considered prior information is not related to correlation.
- However, by “fishing for significance” we can make the classification results look fine.
- That is because error estimation is highly variable in these settings and thus prone to optimization.
Jelizarow et al, 2010. Over-optimism in bioinformatics: an illustration.
Bioinformatics 26:1990–1998.
Optimization mechanisms in the n ≪ p setting
- Optimization of the data sets: try the new method on different data sets... and report only the best results...
- Optimization of the competing methods: omit the best state-of-the-art competing methods from the comparison study.
- Optimization of the settings: try the new method in combination with different variable selection or preprocessing steps... and report only the best results...
- Optimization of the method’s characteristics: consider several variants of the new method... and report only the best results...
Jelizarow et al, 2010. Over-optimism in bioinformatics: an illustration.
Bioinformatics 26:1990–1998.
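The effect of reporting only the best result can be shown with a toy simulation. This is a hypothetical NumPy sketch, not the analysis of Jelizarow et al. It assumes that every variant has the same true error rate of 0.5, and that the variants' error estimates are independent binomial proportions on 50 test samples. Independence is a simplification, since variants evaluated on the same data are correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_test, n_variants, n_repeats = 50, 10, 2000
true_error = 0.5    # hypothetical: no variant is better than chance

# Estimated error of each variant = fraction of misclassified test samples.
errors = rng.binomial(n_test, true_error, size=(n_repeats, n_variants)) / n_test

honest = errors[:, 0].mean()             # one variant fixed in advance
optimised = errors.min(axis=1).mean()    # best variant picked after seeing results

print(f"pre-specified variant: {honest:.3f}, best of {n_variants}: {optimised:.3f}")
```

The pre-specified variant averages close to the true error, whereas the minimum over variants is biased downwards even though no variant has any real advantage.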
On the difficulty of evaluating new integrative methods
The performance of a method like SHIP-LDA depends on:
1. the performance of LDA
2. the performance of SHIP
3. the match between the principle of SHIP and the biological information (does a connection in KEGG indicate higher correlation?)
4. the reliability of the biological information
Points 1 and 2 can be addressed in simulations, but not 3 and 4.
Points 3 and 4 can be addressed in real data examples, but:
- we have problems in the n ≪ p setting (high variability of estimated errors),
- for unsupervised methods there is no natural performance criterion.
Thank you for your attention!