Introduction | The SHIP covariance estimator | Applications | Concluding remarks | Over-optimism

SHrinkage covariance estimation Incorporating Prior biological knowledge with applications to high-dimensional data

V. Guillemot, M. Jelizarow, A. Tenenhaus, A.-L. Boulesteix
Ludwig-Maximilians-Universität München
ISI 2011 Dublin, August 25th 2011

Boulesteix High-dimensional prediction 1/19

High-dimensional omics data

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

- Random vector $(X_1, \dots, X_p)^\top$ with covariance matrix $\Sigma$
- Example: gene expression data, where $X_1, \dots, X_p$ are the expression levels of genes, $p \sim 10{,}000$, $n \sim 100$
- The unbiased empirical covariance estimator $\hat{\Sigma}$ is ill-conditioned if $n \ll p$: $p(p+1)/2$ parameters but only $n$ observations.

Covariance estimation in multivariate methods

Many multivariate statistical methods require the estimation of the covariance matrix $\Sigma$ or its inverse $\Sigma^{-1}$:
- global test with GlobalANCOVA for gene-set analysis (Hummel et al., Bioinformatics 2008)
- multiblock analysis with RGCCA (Tenenhaus & Tenenhaus, Psychometrika 2011)
- linear discriminant analysis

Shrinkage covariance estimation

Schäfer & Strimmer (2005), Ledoit and Wolf (2003):

$$\Sigma^* = \lambda T + (1 - \lambda)\hat{\Sigma}$$

$\lambda$ is an analytically determined parameter. $T$ is a structured covariance target, e.g.
Target D:

$$t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Target F:

$$t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r}\sqrt{s_{ii}s_{jj}} & \text{if } i \neq j \end{cases}$$

A popular idea: incorporate prior biological knowledge on the structure of the variables $X_1, \dots, X_p$ into statistical learning methods.

Our contribution: implement this idea in the framework of shrinkage covariance estimation through the choice of an adequate target.

SHIP (implemented in R package SHIP)

$$\Sigma^*_{\mathrm{SHIP}} = \lambda T + (1 - \lambda)\hat{\Sigma}$$

with Target G:

$$t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r}\sqrt{s_{ii}s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\ 0 & \text{otherwise} \end{cases}$$

$i \sim j$ means, e.g., that $X_i$ and $X_j$ are in the same pathway, same cluster, etc.

SHIP: SHrinkage covariance estimation Incorporating Prior biological knowledge

Choice of λ

The shrinkage parameter $\lambda$ is chosen analytically to minimize the MSE:

$$\hat{\lambda} = \frac{\sum_{i \neq j} \widehat{\mathrm{Var}}(s_{ij}) - \sum_{i \sim j} \bar{r} f_{ij}}{\sum_{i \neq j} \left(s_{ij} - I(i \sim j)\,\bar{r}\sqrt{s_{ii}s_{jj}}\right)^2}$$

in the special case of target G, where:

$$\widehat{\mathrm{Var}}(s_{ij}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})^2$$

$$\widehat{\mathrm{Cov}}(s_{ij}, s_{lm}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})(w_{klm} - \bar{w}_{lm})$$

$$f_{ij} = \frac{1}{2}\left\{ \sqrt{\frac{s_{jj}}{s_{ii}}}\,\widehat{\mathrm{Cov}}(s_{ii}, s_{ij}) + \sqrt{\frac{s_{ii}}{s_{jj}}}\,\widehat{\mathrm{Cov}}(s_{jj}, s_{ij}) \right\}$$

Simulation design

Multivariate normal distribution with block-diagonal covariance matrix

$$\begin{pmatrix} A_1 & 0 & 0 \\ 0 & A_2 & 0 \\ 0 & 0 & A_K \end{pmatrix}$$

with $A_k = (1 - a_k) I_{p_k} + a_k J_{p_k}$, where $a_k$ is a scalar in $]0, 1[$.

SHIP-based linear discriminant analysis

- Supervised classification method to predict class membership (e.g. disease vs. healthy) based on normality assumptions.
- Idea: plug the estimator $\hat{\Sigma}^*_{\mathrm{SHIP}}$ into the discriminant function of linear discriminant analysis.

[Figure: test error rate (axis from 0.0 to 0.5) for targets D, G and G (p); two panels: "High correlations" and "Low correlations"]

Regularized Generalized Canonical Correlation Analysis (Tenenhaus & Tenenhaus, Psychometrika 2011)

- generalization of Canonical Correlation Analysis (CCA) for multiblock analysis
- needs an estimate of the inverse covariance in each block
- evaluation criterion: MSE of the covariance of latent components
- target H: variant of target G with non-zero correlation for all pairs of variables within a block even if $i \not\sim j$:

$$t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r}_C\sqrt{s_{ii}s_{jj}} & \text{if } i \neq j \text{ and } i \sim j \\ \bar{r}_{NC}\sqrt{s_{ii}s_{jj}} & \text{otherwise} \end{cases}$$

Regularized Generalized Canonical Correlation Analysis

$$\Sigma_\eta = \begin{pmatrix} 1 & \rho_{1,2} & \rho_{1,3} \\ \rho_{2,1} & 1 & \rho_{2,3} \\ \rho_{3,1} & \rho_{3,2} & 1 \end{pmatrix}$$

[Figure: MSE for targets D, H and H (p); four panels: "n = 50 and α = 0.1", "n = 50 and α = 2", "n = 200 and α = 0.1", "n = 200 and α = 2"]

GlobalANCOVA (Hummel et al., Bioinformatics 2008)

- Global test of the equality of the mean vector in two groups (e.g. disease vs. healthy patients)
- The asymptotic testing procedure uses an estimate of the covariance matrix.
- Investigated scenario: each gene is represented by several variables (probesets), yielding small, highly correlated groups of variables.
→ Under $H_0$ (equal mean vectors in the two groups), the p-values are not uniformly distributed when the standard (diagonal) target D is used!

Conclusions on SHIP (pro)

- "Proof of concept" successful in simulations
- Simple generalization of an existing approach (shrinkage estimation)
- Common framework for various applications
- Possible extensions: relaxing the cluster structure, making the target well-conditioned, etc.

Conclusions on SHIP (cons)

- Limitation in practice: such prior information on the structure of the variables is often
  - incomplete
  - partially unreliable
  - not directly connected to the notion of correlation
  → approaches not related to correlation may be more successful in some cases.
- Do not forget the substantive context...
- Do not be over-optimistic when assessing a new method...

Optimization mechanisms in the $n \ll p$ setting

- In simulations we find that the improvement of LDA through SHIP is very moderate.
- With real data we observe no improvement, probably because the considered prior information is not related to correlation.
- However, by "fishing for significance" we can make the classification results look fine.
- That is because error estimation is very variable in these settings and thus prone to optimization.

Jelizarow et al., 2010. Over-optimism in bioinformatics: an illustration. Bioinformatics 26:1990–1998.

Optimization mechanisms in the $n \ll p$ setting

- Optimization of the data sets: try the new method on different data sets... and report only the best results...
- Optimization of the competing methods: omit the best state-of-the-art competing methods in the comparison study.
- Optimization of the settings: try the new method in combination with different variable selection or preprocessing steps... and report only the best results...
- Optimization of the method's characteristics: consider several variants of the new method... and report only the best results...

Jelizarow et al., 2010. Over-optimism in bioinformatics: an illustration. Bioinformatics 26:1990–1998.

On the difficulty of evaluating new integrating methods

The performance of a method like SHIP-LDA depends on:
1. the performance of LDA
2. the performance of SHIP
3. the adequacy between the principle of SHIP and the biological information (does a connection in KEGG indicate higher correlation?)
4. the reliability of the biological information

1 and 2 can be addressed in simulations, but not 3 and 4.
3 and 4 can be addressed in real data examples, but:
- we have problems in the $n \ll p$ setting (high variability of estimated errors),
- for unsupervised methods there is no natural performance criterion.

Thank you for your attention!
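
Supplementary illustration: the shrinkage towards target G described in the SHIP slides can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the R package SHIP and not its API. The function names (`sample_cov`, `target_g`, `shrink`) and the `groups` labels are hypothetical. $\bar{r}$ is assumed to be the mean sample correlation over pairs with $i \sim j$, which the slides do not specify. $\lambda$ is fixed by hand instead of being computed with the analytic formula for $\hat{\lambda}$. The toy data only loosely mimic the block-structured simulation design.

```python
import math
import random


def sample_cov(data):
    """Unbiased sample covariance matrix; rows of `data` are observations."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]


def target_g(s, groups):
    """Target G: s_ii on the diagonal, r_bar*sqrt(s_ii*s_jj) if i ~ j, else 0.

    `groups[i]` is a hypothetical group label (pathway, cluster, ...);
    i ~ j is taken to mean groups[i] == groups[j].
    Assumption: r_bar is the mean sample correlation over pairs with i ~ j.
    """
    p = len(s)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)
             if groups[i] == groups[j]]
    r_bar = 0.0
    if pairs:
        r_bar = sum(s[i][j] / math.sqrt(s[i][i] * s[j][j]) for i, j in pairs) / len(pairs)
    t = [[0.0] * p for _ in range(p)]
    for i in range(p):
        t[i][i] = s[i][i]
    for i, j in pairs:
        t[i][j] = t[j][i] = r_bar * math.sqrt(s[i][i] * s[j][j])
    return t


def shrink(s, t, lam):
    """Sigma* = lam * T + (1 - lam) * S, with lam supplied by the caller."""
    p = len(s)
    return [[lam * t[i][j] + (1 - lam) * s[i][j] for j in range(p)] for i in range(p)]


# Toy data: two groups of two variables; variables in a group share a latent factor.
random.seed(0)
groups = [0, 0, 1, 1]
data = []
for _ in range(20):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    data.append([z0 + random.gauss(0, 1), z0 + random.gauss(0, 1),
                 z1 + random.gauss(0, 1), z1 + random.gauss(0, 1)])

s = sample_cov(data)
t = target_g(s, groups)
sigma_star = shrink(s, t, lam=0.5)  # lam chosen by hand, not the analytic estimate

for row in sigma_star:
    print([round(v, 3) for v in row])
```

In this sketch the variances are left untouched by the shrinkage (the diagonal of the target equals the diagonal of the sample covariance), between-group covariances are pulled towards zero, and within-group covariances are pulled towards a common correlation level.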