Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Improving gene signatures by the identification of differentially expressed modules in molecular networks : a local-score approach. Marine Jeanmougin JOBIM 2012, Rennes – July 4th, 2012 1 Outline 1 Introduction Microarray experiments Identification of molecular signatures Motivations 2 DIsease Associated Module Selection (DiAMS) Global approach Local-score statistic for module ranking Evaluation process 3 Results and application Quantitative results Application to Estrogen Receptor status in breast cancer 2 Microarray experiments Objectives of microarray experiments differential analysis Signature of genes Expression level of thousands of transcripts Biological purpose I Signature: genes involved in a phenotype of interest I Medical applications: diagnosis, prognosis, treatment efficacy 3 Identification of molecular signatures Differential analysis Model (c) Xig :expression level of the ith sample for gene g under condition c such as: (c) (c) E(Xig ) = µg Under the assumption of homoscedasticity between conditions: V(Xig ) = (σg )2 (c) Hypothesis testing strategy For two conditions, the null hypothesis to test comes down to ( (1) (2) H0,g : µg = µg (1) (2) H1,g : µg 6= µg B Classical approach: t-statistic Issues for gene-specific variance estimation 4 Identification of molecular signatures Differential analysis Model (c) Xig :expression level of the ith sample for gene g under condition c such as: (c) (c) E(Xig ) = µg Under the assumption of homoscedasticity between conditions: V(Xig ) = (σg )2 (c) Hypothesis testing strategy For two conditions, the null hypothesis to test comes down to ( (1) (2) H0,g : µg = µg (1) (2) H1,g : µg 6= µg B Classical approach: t-statistic Issues for gene-specific variance estimation 4 Identification of molecular signatures Differential analysis Model (c) Xig :expression level of the ith sample for gene g under condition c such as: (c) (c) E(Xig ) = µg Under the assumption of homoscedasticity between conditions: V(Xig ) = (σg )2 (c) Hypothesis testing strategy For two conditions, the null hypothesis to test comes down to ( (1) (2) H0,g : µg = µg (1) (2) H1,g : µg 6= µg B Classical approach: t-statistic Issues for gene-specific variance estimation 4 Identification of molecular signatures Differential analysis Limma: a shrinkage approach (Smyth, 2004) Jeanmougin et al. 2010, PLoS ONE Empirical Bayes variance estimate Sglimma = d0 S02 + dg Sg2 , d0 + dg I S02 : prior variance from the scale-inverse-chi-square distribution fixed with an empirical Bayes approach I Sg2 : usual unbiased estimator of the variance (σg )2 I d0 , dg : residual degrees of freedom for S02 and for the linear model for gene g Test statistic: (1) tglimma = (2) x̄·g − x̄·g q Sglimma n1 + 1 1 n2 . 5 Motivations Limitations of classical approaches I Low reproducibility Ein-Dor et al. 2005, Outcome signature genes in breast cancer: is there a unique set? Bioinformatics I Difficulty to achieve a clear biological interpretation Improving gene signatures I Genes causing the same phenotype are likely to interact together Gandhi, T.K. et al. 2006, Nature Genetics I Identification of genes that are functionally related (i.e. modules) Functional relationship network Expression data 6 Outline 1 Introduction Microarray experiments Identification of molecular signatures Motivations 2 DIsease Associated Module Selection (DiAMS) Global approach Local-score statistic for module ranking Evaluation process 3 Results and application Quantitative results Application to Estrogen Receptor status in breast cancer 7 Global approach Goal Select functional modules presenting unexpected accumulation of high-scoring genes Input parameters I PPI network (strong manifestation of functional relations) I Gene scores from limma statistic DiAMS: a 3-step process 1 Preprocessing 2 Local-score approach for module ranking 3 Selection of significant modules 8 Global approach Step 1 - Preprocessing High-dimensional network I Impossibility of exploring the huge space of possible gene subnetworks Hierarchical clustering I Captures much information about network topology I Enables to go easily through the structure I Screen the entire network without constraints on module sizes ”Walktrap” approach • Random walks strategy • Distance (similarity measure of vertices) • Ward’s criterion Pons and Latapy 2006 JGAA 9 Global approach Step 2 - Local-score approach for module ranking N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N3 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N5 N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N5 N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N5 N3 N2 N4 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 2 - Local-score approach for module ranking N5 N3 N2 N4 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking 1 Score each module Nk (by summing gene scores) 2 Identify the highest scoring module (local-score statistic) 3 Remove it 4 Repeat setps 1) to 3) until all disjoint modules have been enumerated 10 Global approach Step 3 - Selection of significant modules Goal Assess the global significance of each module Monte-Carlo approach 1 – Permutation of sample labels 2 – Distribution under H0 3 – p-value computation Selection of modules at 5% FDR level. 11 Outline 1 Introduction Microarray experiments Identification of molecular signatures Motivations 2 DIsease Associated Module Selection (DiAMS) Global approach Local-score statistic for module ranking Evaluation process 3 Results and application Quantitative results Application to Estrogen Receptor status in breast cancer 12 Module scoring Distribution of scores in function of p-values 10 Individual gene scoring 6 8 The gene score is given by: 4 2 scores νg = − log(pg ) − δ, 0 I pg : gene p-value from limma, -2 I δ, a constant such as E (νg ) ≤ 0. 0.0 0.2 0.4 0.6 0.8 1.0 pvalues Local-score statistic Definition: value of the highest-scoring module. Given H, a hierarchical community structure, the local-score statistic is defined as: X L = max νg , H⊆H such as H is a subtree of H. g∈H 13 Module scoring Distribution of scores in function of p-values 10 Individual gene scoring 6 8 The gene score is given by: 4 2 scores νg = − log(pg ) − δ, 0 I pg : gene p-value from limma, δ -2 I δ, a constant such as E (νg ) ≤ 0. 0.0 0.2 0.4 0.6 0.8 1.0 pvalues Local-score statistic Definition: value of the highest-scoring module. Given H, a hierarchical community structure, the local-score statistic is defined as: X L = max νg , H⊆H such as H is a subtree of H. g∈H 13 Outline 1 Introduction Microarray experiments Identification of molecular signatures Motivations 2 DIsease Associated Module Selection (DiAMS) Global approach Local-score statistic for module ranking Evaluation process 3 Results and application Quantitative results Application to Estrogen Receptor status in breast cancer 14 Evaluation process Power and false-postive rate study Tree structure 15 Evaluation process Power and false-postive rate study 1. Simulation of significant nodes Tree structure 15 Evaluation process Power and false-postive rate study 1. Simulation of significant nodes Tree structure 2. Simulation of the gene expression matrix H0 Gene expression matrix H1 15 Evaluation process Power and false-postive rate study 1. Simulation of significant nodes Tree structure 2. Simulation of the gene expression matrix H0 Gene expression matrix H1 3. Power and False-Positive (FP) rate evaluation Signature 15 Evaluation process Reproducibility Tree structure H0 H1 Gene expression matrix Signature 16 Evaluation process Reproducibility Tree structure H0 H0 Sub-sampling H1 Gene expression matrix H1 Subsampled expression matrix Signature 16 Evaluation process Reproducibility Tree structure H0 H0 Sub-sampling H1 H1 Subsampled expression matrix Gene expression matrix Reproducibility ? Signature Signature 16 Outline 1 Introduction Microarray experiments Identification of molecular signatures Motivations 2 DIsease Associated Module Selection (DiAMS) Global approach Local-score statistic for module ranking Evaluation process 3 Results and application Quantitative results Application to Estrogen Receptor status in breast cancer 17 Quantitative results False−Positive rate study 0.07 0.06 False−Positive rate 0.05 0.04 Selection method DiAMS Limma 0.03 0.02 0.01 0 5 10 20 30 40 50 Sample size (n1 = n2) False-positive rate study - Estimated false-positive rate over the 1,000 simulations. Plain black line: the 5% level. The dashed black lines: 95% confidence intervals. 18 Quantitative results Power study 1.0 ● ● ● ● ● ● 0.8 0.6 Selection method Power ● ● ● ● DiAMS Limma 0.4 ● 0.2 ● ● 0.0 ● ● 0.5 ● 1.0 1.5 2.0 2.5 3.0 Difference of means (∆) Power study - The mean of power values over the 1,000 simulations are calculated at a 0.05 FDR level. 19 Quantitative results Reproducibility study 100% 80% Reproducibility 60% Selection method DiAMS Limma 40% 20% 0% 5 10 20 30 40 50 Sample size (n1 = n2) 20 Application Breast cancer in a few words I An heterogeneous disease (5 subtypes) I Presence (ER+)/absence (ER-) of Estrogen Receptors: an essential parameter of tumor characterization. Data Affymetrix U133-Plus2.0 arrays: I 537 patients (446 ER+ vs. 91 ER− ) I 54,675 probes Topological data PPI network from HPRD and String: I 13,611 proteins I ∼ 600, 000 interactions 21 Application Results I 27 221 initial modules I 14 significant modules (FDR 1%) I 159 genes Interpretation Module Size Molecular / cellular function 1 38 2 1 (GATA3) 3 35 Breast cancer regulation by Stathmin1* (*oncoprotein which takes part in the preventive progression of ER + tumors) 4 1 (AGR3) Involved in ER-responsive breast tumors (Fletcher et al. 2002) 5 7 Amino-acid metabolism Strong association with ER status (Voduc et al. 2008) PI3K/AKT signaling (cell death and cellular growth) Aryl Hydrocarbon Receptor signaling (*AHR represses ER) 22 Discussion Summary I DiAMS: local-score approach for the selection of disease associated modules of genes I Proved quantitative results on: • power gains, • reproducibility improvements, in comparison to the classical approach. I Limitation: coverage and quality of PPI databases Perspectives I Investigate the predictive performance of DiAMS I Assess the reproducibility on real datasets. 23 Acknowledgements Statistic & Genome Laboratory Christophe Ambroise Michèle, Carène, Claudine, Catherine, Camille, Etienne, Pierre, Gilles, Cecile, Maurice, Marie-Luce, Anne-Sophie, Cyril, Justin, Van-Hanh, Yolande, Sarah, Marius, Bernard et Julien. Pharnext Mickaël Guedj Serguei Nabirotchkin Ilya Chumakov Jan, Caroline, Fabrice, Mickaël, Matthieu, Jonas, Sory. 24 25 26