Download a local-score approa

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Improving gene signatures by the identification
of differentially expressed modules in molecular
networks : a local-score approach.
Marine Jeanmougin
JOBIM 2012, Rennes – July 4th, 2012
1
Outline
1
Introduction
Microarray experiments
Identification of molecular signatures
Motivations
2
DIsease Associated Module Selection (DiAMS)
Global approach
Local-score statistic for module ranking
Evaluation process
3
Results and application
Quantitative results
Application to Estrogen Receptor status in breast cancer
2
Microarray experiments
Objectives of microarray experiments
differential analysis
Signature of genes
Expression level of thousands of transcripts
Biological purpose
I Signature: genes involved in a phenotype of interest
I Medical applications: diagnosis, prognosis, treatment efficacy
3
Identification of molecular signatures
Differential analysis
Model
(c)
Xig :expression level of the ith sample for gene g under condition c such as:
(c)
(c)
E(Xig ) = µg
Under the assumption of homoscedasticity between conditions:
V(Xig ) = (σg )2
(c)
Hypothesis testing strategy
For two conditions, the null hypothesis to test comes down to
(
(1)
(2)
H0,g : µg = µg
(1)
(2)
H1,g : µg 6= µg
B Classical approach: t-statistic
Issues for gene-specific variance estimation
4
Identification of molecular signatures
Differential analysis
Model
(c)
Xig :expression level of the ith sample for gene g under condition c such as:
(c)
(c)
E(Xig ) = µg
Under the assumption of homoscedasticity between conditions:
V(Xig ) = (σg )2
(c)
Hypothesis testing strategy
For two conditions, the null hypothesis to test comes down to
(
(1)
(2)
H0,g : µg = µg
(1)
(2)
H1,g : µg 6= µg
B Classical approach: t-statistic
Issues for gene-specific variance estimation
4
Identification of molecular signatures
Differential analysis
Model
(c)
Xig :expression level of the ith sample for gene g under condition c such as:
(c)
(c)
E(Xig ) = µg
Under the assumption of homoscedasticity between conditions:
V(Xig ) = (σg )2
(c)
Hypothesis testing strategy
For two conditions, the null hypothesis to test comes down to
(
(1)
(2)
H0,g : µg = µg
(1)
(2)
H1,g : µg 6= µg
B Classical approach: t-statistic
Issues for gene-specific variance estimation
4
Identification of molecular signatures
Differential analysis
Limma: a shrinkage approach (Smyth, 2004)
Jeanmougin et al. 2010, PLoS ONE
Empirical Bayes variance estimate
Sglimma =
d0 S02 + dg Sg2
,
d0 + dg
I S02 : prior variance from the scale-inverse-chi-square distribution
fixed with an empirical Bayes approach
I Sg2 : usual unbiased estimator of the variance (σg )2
I d0 , dg : residual degrees of freedom for S02 and for the linear model for
gene g
Test statistic:
(1)
tglimma =
(2)
x̄·g − x̄·g
q
Sglimma n1 +
1
1
n2
.
5
Motivations
Limitations of classical approaches
I Low reproducibility
Ein-Dor et al. 2005, Outcome signature genes in breast cancer: is there a
unique set? Bioinformatics
I Difficulty to achieve a clear biological interpretation
Improving gene signatures
I Genes causing the same phenotype are likely to interact together
Gandhi, T.K. et al. 2006, Nature Genetics
I Identification of genes that are functionally related (i.e. modules)
Functional relationship
network
Expression data
6
Outline
1
Introduction
Microarray experiments
Identification of molecular signatures
Motivations
2
DIsease Associated Module Selection (DiAMS)
Global approach
Local-score statistic for module ranking
Evaluation process
3
Results and application
Quantitative results
Application to Estrogen Receptor status in breast cancer
7
Global approach
Goal
Select functional modules presenting unexpected
accumulation of high-scoring genes
Input parameters
I PPI network (strong manifestation of functional relations)
I Gene scores from limma statistic
DiAMS: a 3-step process
1
Preprocessing
2
Local-score approach for module ranking
3
Selection of significant modules
8
Global approach
Step 1 - Preprocessing
High-dimensional network
I Impossibility of exploring the huge space of possible gene subnetworks
Hierarchical clustering
I Captures much information about network topology
I Enables to go easily through the structure
I Screen the entire network without constraints on module sizes
”Walktrap” approach
• Random walks strategy
• Distance (similarity measure of vertices)
• Ward’s criterion
Pons and Latapy 2006 JGAA
9
Global approach
Step 2 - Local-score approach for module ranking
N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N3 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N5 N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N5 N3 N4 N2 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N5 N3 N2 N4 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 2 - Local-score approach for module ranking
N5 N3 N2 N4 N1 g1 g2 g3 g4 g5 g6 Iterative module ranking
1
Score each module Nk (by summing gene scores)
2
Identify the highest scoring module (local-score statistic)
3
Remove it
4
Repeat setps 1) to 3) until all disjoint modules have been enumerated
10
Global approach
Step 3 - Selection of significant modules
Goal
Assess the global significance of each module
Monte-Carlo approach
1 – Permutation of sample labels
2 – Distribution under H0
3 – p-value computation
Selection of modules at 5% FDR level.
11
Outline
1
Introduction
Microarray experiments
Identification of molecular signatures
Motivations
2
DIsease Associated Module Selection (DiAMS)
Global approach
Local-score statistic for module ranking
Evaluation process
3
Results and application
Quantitative results
Application to Estrogen Receptor status in breast cancer
12
Module scoring
Distribution of scores in function of p-values
10
Individual gene scoring
6
8
The gene score is given by:
4
2
scores
νg = − log(pg ) − δ,
0
I pg : gene p-value from limma,
-2
I δ, a constant such as E (νg ) ≤ 0.
0.0
0.2
0.4
0.6
0.8
1.0
pvalues
Local-score statistic
Definition: value of the highest-scoring module.
Given H, a hierarchical community structure, the local-score statistic is defined
as:


X
L = max 
νg  ,
H⊆H
such as H is a subtree of H.
g∈H
13
Module scoring
Distribution of scores in function of p-values
10
Individual gene scoring
6
8
The gene score is given by:
4
2
scores
νg = − log(pg ) − δ,
0
I pg : gene p-value from limma,
δ
-2
I δ, a constant such as E (νg ) ≤ 0.
0.0
0.2
0.4
0.6
0.8
1.0
pvalues
Local-score statistic
Definition: value of the highest-scoring module.
Given H, a hierarchical community structure, the local-score statistic is defined
as:


X
L = max 
νg  ,
H⊆H
such as H is a subtree of H.
g∈H
13
Outline
1
Introduction
Microarray experiments
Identification of molecular signatures
Motivations
2
DIsease Associated Module Selection (DiAMS)
Global approach
Local-score statistic for module ranking
Evaluation process
3
Results and application
Quantitative results
Application to Estrogen Receptor status in breast cancer
14
Evaluation process
Power and false-postive rate study
Tree structure
15
Evaluation process
Power and false-postive rate study
1. Simulation of significant nodes
Tree structure
15
Evaluation process
Power and false-postive rate study
1. Simulation of significant nodes
Tree structure
2. Simulation of the gene expression matrix
H0
Gene expression
matrix
H1
15
Evaluation process
Power and false-postive rate study
1. Simulation of significant nodes
Tree structure
2. Simulation of the gene expression matrix
H0
Gene expression
matrix
H1
3. Power and False-Positive (FP) rate evaluation
Signature
15
Evaluation process
Reproducibility
Tree structure
H0
H1
Gene expression
matrix
Signature
16
Evaluation process
Reproducibility
Tree structure
H0
H0
Sub-sampling
H1
Gene expression
matrix
H1
Subsampled
expression matrix
Signature
16
Evaluation process
Reproducibility
Tree structure
H0
H0
Sub-sampling
H1
H1
Subsampled
expression matrix
Gene expression
matrix
Reproducibility ?
Signature
Signature
16
Outline
1
Introduction
Microarray experiments
Identification of molecular signatures
Motivations
2
DIsease Associated Module Selection (DiAMS)
Global approach
Local-score statistic for module ranking
Evaluation process
3
Results and application
Quantitative results
Application to Estrogen Receptor status in breast cancer
17
Quantitative results
False−Positive rate study
0.07
0.06
False−Positive rate
0.05
0.04
Selection method
DiAMS
Limma
0.03
0.02
0.01
0
5
10
20
30
40
50
Sample size (n1 = n2)
False-positive rate study - Estimated false-positive rate over the 1,000 simulations.
Plain black line: the 5% level. The dashed black lines: 95% confidence intervals.
18
Quantitative results
Power study
1.0
●
●
●
●
●
●
0.8
0.6
Selection method
Power
●
●
●
●
DiAMS
Limma
0.4
●
0.2
●
●
0.0
●
●
0.5
●
1.0
1.5
2.0
2.5
3.0
Difference of means (∆)
Power study - The mean of power values over the 1,000 simulations are
calculated at a 0.05 FDR level.
19
Quantitative results
Reproducibility study
100%
80%
Reproducibility
60%
Selection method
DiAMS
Limma
40%
20%
0%
5
10
20
30
40
50
Sample size (n1 = n2)
20
Application
Breast cancer in a few words
I An heterogeneous disease (5 subtypes)
I Presence (ER+)/absence (ER-) of Estrogen Receptors: an essential
parameter of tumor characterization.
Data
Affymetrix U133-Plus2.0 arrays:
I 537 patients (446 ER+ vs. 91 ER− )
I 54,675 probes
Topological data
PPI network from HPRD and String:
I 13,611 proteins
I ∼ 600, 000 interactions
21
Application
Results
I 27 221 initial modules
I 14 significant modules (FDR 1%)
I 159 genes
Interpretation
Module
Size
Molecular / cellular function
1
38
2
1
(GATA3)
3
35
Breast cancer regulation by Stathmin1*
(*oncoprotein which takes part in the preventive progression of ER +
tumors)
4
1
(AGR3)
Involved in ER-responsive breast tumors (Fletcher et al. 2002)
5
7
Amino-acid metabolism
Strong association with ER status (Voduc et al. 2008)
PI3K/AKT signaling (cell death and cellular growth)
Aryl Hydrocarbon Receptor signaling (*AHR represses ER)
22
Discussion
Summary
I DiAMS: local-score approach for the selection of disease associated
modules of genes
I Proved quantitative results on:
• power gains,
• reproducibility improvements,
in comparison to the classical approach.
I Limitation: coverage and quality of PPI databases
Perspectives
I Investigate the predictive performance of DiAMS
I Assess the reproducibility on real datasets.
23
Acknowledgements
Statistic & Genome Laboratory
Christophe Ambroise
Michèle, Carène, Claudine, Catherine, Camille, Etienne,
Pierre, Gilles, Cecile, Maurice, Marie-Luce, Anne-Sophie,
Cyril, Justin, Van-Hanh, Yolande, Sarah, Marius, Bernard et
Julien.
Pharnext
Mickaël Guedj
Serguei Nabirotchkin
Ilya Chumakov
Jan, Caroline, Fabrice, Mickaël, Matthieu, Jonas, Sory.
24
25
26