Analysis of gene expression data
(Nominal explanatory variables)

Shyamal D. Peddada
Biostatistics Branch
National Institute of Environmental Health Sciences (NIH)
Research Triangle Park, NC
Outline of the talk
• Two types of explanatory variables (“experimental conditions”)
• Some scientific questions of interest
• A brief discussion of false discovery rate (FDR) analysis
• Some existing statistical methods for analyzing microarray data
Types of explanatory variables (“experimental conditions”)
• Nominal variables:
  – No intrinsic order among the levels of the explanatory variable(s).
  – No loss of information if we permute the labels of the conditions.
  – E.g., comparison of gene expression of samples from “normal” tissue with those from “tumor” tissue.
Types of explanatory variables (“experimental conditions”)
• Ordinal/interval variables:
  – Levels of the explanatory variables are ordered.
  – E.g., comparison of gene expression of samples from different stages of lesion severity, such as “normal”, “hyperplasia”, “adenoma”, and “carcinoma” (categorically ordered).
  – E.g., time-course/dose-response experiments (numerically ordered).

Focus of this talk: Nominal explanatory variables
Types of microarray data
• Independent samples
  – E.g., comparison of gene expression of independent samples drawn from normal patients versus independent samples from tumor patients.
• Dependent samples
  – E.g., comparison of gene expression of samples drawn from normal tissues and tumor tissues from the same patient.
Possible questions of interest
• Identify significantly “up/down” regulated genes for a given “condition” relative to another “condition” (adjusted for other covariates).
• Identify genes that discriminate between various “conditions” and predict the “class/condition” of a future observation.
• Cluster genes according to patterns of expression over “conditions”.
• Other questions?
Challenges
• Small sample size but a large number of genes.
• Multiple testing: since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This inflates the overall Type I error rate.
• Complex dependence structure between genes, and possibly among samples.
  – Difficult to model and/or account for the underlying dependence structures among genes.
Multiple Testing: Type I Errors and False Discovery Rates
The Decision Table

            | Not rejected H0 | Rejected H0 | Total
True H0     | U               | V           | m0
True Ha     | T               | S           | m1
Total       | W               | R           | m

Only the column totals W (number not rejected) and R (number rejected) are observable.
Strong and weak control of Type I error rates
• Strong control: control the Type I error rate under any combination of true $H_0$ and $H_a$.
• Weak control: control the Type I error rate only when all null hypotheses are true.

Since we do not know a priori which hypotheses are true, we will focus on strong control of the Type I error rate.
Consequences of multiple testing
• Suppose we test each hypothesis at the 5% level of significance.
  – If n = 10 independent tests are performed, the probability of declaring at least 1 of the 10 tests significant is $1 - 0.95^{10} = 0.401$.
  – If 50,000 independent tests are performed, as in Affymetrix microarray data, then you should expect 2,500 false positives!
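A quick back-of-the-envelope check of these numbers in Python (my own illustrative snippet, assuming independent tests and all null hypotheses true):

```python
# Family-wise error probability and expected false positives when all
# nulls are true and the n tests are independent, each at level 0.05.
alpha = 0.05
for n in (10, 50_000):
    p_any = 1 - (1 - alpha) ** n          # P(at least one false positive)
    print(f"n={n}: P(>=1 false positive)={p_any:.3f}, "
          f"expected false positives={n * alpha:.0f}")
```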
Types of errors in the context of multiple testing
• Per-Family Error “Rate” (PFER): $E(V)$
  – Expected number of false rejections of $H_0$.
• Per-Comparison Error Rate (PCER): $E(V)/m$
  – Expected proportion of false rejections of $H_0$ among all m hypotheses.
• Family-Wise Error Rate (FWER): $P(V > 0)$
  – Probability of at least one false rejection of $H_0$ among all m hypotheses.
Types of errors in the context of multiple testing
• False Discovery Rate (FDR):
  – Expected proportion of Type I errors among all rejected hypotheses.
• Benjamini-Hochberg (BH): set $V/R = 0$ if $R = 0$, so that
  $\mathrm{FDR} = E\!\left(\tfrac{V}{R}\,1_{\{R>0\}}\right) = E\!\left(\tfrac{V}{R} \,\middle|\, R>0\right) P(R>0)$
• Storey: only interested in the case $R > 0$ (positive FDR):
  $\mathrm{pFDR} = E\!\left(\tfrac{V}{R} \,\middle|\, R>0\right)$
Some useful inequalities
Since $V \le R \le m$, therefore
  $\frac{V}{m} \le \frac{V}{R}\,1_{\{R>0\}}$   (1)
Again, since $V \le R$ and $R = 0 \Rightarrow V = 0$, therefore
  $V\,1_{\{R>0\}} \le R\,1_{\{V>0\}}$,
thus
  $\frac{V}{R}\,1_{\{R>0\}} \le 1_{\{V>0\}}$.   (2)
Also,
  $1_{\{V>0\}} \le V$.   (3)
Some useful inequalities
Combining (1), (2), and (3), we have:
  $\frac{V}{m} \le \frac{V}{R}\,1_{\{R>0\}} \le 1_{\{V>0\}} \le V$   (4)
Taking expectations in (4), we have:
  $E\!\left(\frac{V}{m}\right) \le E\!\left(\frac{V}{R}\,1_{\{R>0\}}\right) \le E\!\left(1_{\{V>0\}}\right) \le E(V)$   (5)
Some useful inequalities
Thus we have:
  $\mathrm{PCER} \le \mathrm{FDR} \le \mathrm{FWER} \le \mathrm{PFER}$   (6)
Trivially,
  $\mathrm{FDR} \le \mathrm{pFDR}$   (7)
Conclusion
• It is conservative to control FWER rather than FDR!
• It is conservative to control pFDR rather than FDR!
Some useful inequalities
Question: Is $\mathrm{pFDR} \ge \mathrm{FWER}$?
Some useful inequalities
Example: Suppose $m_0 = m$.
Note: $m_0 = m \Rightarrow m_1 = 0 \Rightarrow S = 0 \Rightarrow V = R$.
Hence
  $\mathrm{FDR} = E\!\left(\tfrac{V}{R}\,1_{\{R>0\}}\right) = E\!\left(1_{\{V>0\}}\right) = P(V > 0) = \mathrm{FWER}$
Some useful inequalities
But
  $\mathrm{pFDR} = E\!\left(\tfrac{V}{R} \,\middle|\, R>0\right) = E(1 \mid R>0) = 1.$
Hence, if $m_0 = m$, then
  $1 = \mathrm{pFDR} \ge \mathrm{FDR} = \mathrm{FWER}$
Some useful inequalities
However, in most applications such as microarrays, one expects $m_1 > 0$.
In general, there is no proof of the statement $\mathrm{pFDR} \ge \mathrm{FWER}$.
Some popular Type I error controlling procedures
• Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ denote the ordered p-values for the m tests that are being performed.
• Let $\alpha_{(1)} \le \alpha_{(2)} \le \cdots \le \alpha_{(m)}$ denote the ordered levels of significance used for testing the m null hypotheses $H_{0(1)}, H_{0(2)}, \ldots, H_{0(m)}$, respectively.
Some popular controlling procedures
• Step-down procedure:
  Step 1: If $P_{(1)} \le \alpha_{(1)}$, reject $H_{0(1)}$ and go to Step 2; else stop.
  Step 2: If $P_{(2)} \le \alpha_{(2)}$, reject $H_{0(2)}$ and go to Step 3; else stop.
  Step 3: If $P_{(3)} \le \alpha_{(3)}$, reject $H_{0(3)}$ and go to Step 4; else stop.
  ... and so on.
Some popular controlling procedures
• Step-up procedure:
  Step 1: If $P_{(m)} \le \alpha_{(m)}$, reject $H_{0(i)}$, $i = 1, 2, \ldots, m$, and stop; else go to Step 2.
  Step 2: If $P_{(m-1)} \le \alpha_{(m-1)}$, reject $H_{0(i)}$, $i = 1, 2, \ldots, m-1$, and stop; else go to Step 3.
  Step 3: If $P_{(m-2)} \le \alpha_{(m-2)}$, reject $H_{0(i)}$, $i = 1, 2, \ldots, m-2$, and stop; else go to Step 4.
  ... and so on!
Some popular controlling procedures
• Single-step procedure:
  A stepwise procedure with the same critical constant for all m hypotheses:
  $\alpha_{(1)} = \alpha_{(2)} = \cdots = \alpha_{(m)}$
Some typical stepwise procedures: FWER controlling procedures
• Bonferroni: a single-step procedure with $\alpha_i = \alpha/m$.
• Sidak: a single-step procedure with $\alpha_i = 1 - (1-\alpha)^{1/m}$.
• Holm: a step-down procedure with $\alpha_i = \alpha/(m - i + 1)$.
• Hochberg: a step-up procedure with $\alpha_i = \alpha/(m - i + 1)$.
• minP method: a resampling-based single-step procedure with $\alpha_i = c_\alpha$, where $c_\alpha$ is the $\alpha$ quantile of the distribution of the minimum p-value.
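As an illustrative sketch (not part of the talk), the Holm and Hochberg recipes can be coded directly from the definitions above; the function names and NumPy usage are my own:

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Step-down: walk the ordered p-values from smallest to largest,
    comparing P(i) to alpha/(m - i + 1); stop at the first failure."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order, start=1):      # i = 1, ..., m
        if pvals[idx] <= alpha / (m - i + 1):
            reject[idx] = True
        else:
            break
    return reject

def hochberg_reject(pvals, alpha=0.05):
    """Step-up: walk the ordered p-values from largest to smallest;
    the first P(i) <= alpha/(m - i + 1) triggers rejection of
    H0(1), ..., H0(i)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for i in range(m, 0, -1):                     # i = m, ..., 1
        if pvals[order[i - 1]] <= alpha / (m - i + 1):
            reject[order[:i]] = True              # reject the i smallest
            break
    return reject
```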
Comments on the methods
• Bonferroni: very general, but can be too conservative for a large number of hypotheses.
• Sidak: more powerful than Bonferroni, but applicable only when the test statistics are independent or have certain types of positive dependence.
Comments on the methods
• Holm: more powerful than Bonferroni, and applicable for any type of dependence structure between test statistics.
• Hochberg: more powerful than Holm’s procedure, but the test statistics should either be independent or have the MTP2 property.
Comments on the methods
• Multivariate Total Positivity of Order 2 (MTP2):
  $f(\mathbf{x})$ is said to be MTP2 if, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^p$,
  $f(\mathbf{x} \vee \mathbf{y})\, f(\mathbf{x} \wedge \mathbf{y}) \ge f(\mathbf{x})\, f(\mathbf{y})$
Some typical stepwise procedures: FDR controlling procedure
• Benjamini-Hochberg: a step-up procedure with $\alpha_{(i)} = i\alpha/m$.
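A minimal sketch of the BH step-up rule, under the same conventions as above (again my own illustrative code):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg: step-up with alpha_(i) = i * alpha / m."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for i in range(m, 0, -1):                 # largest p-value first
        if pvals[order[i - 1]] <= i * alpha / m:
            reject[order[:i]] = True          # reject the i smallest
            break
    return reject
```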
An Illustration
• Lobenhofer et al. (2002) data:
  – Breast cancer cells exposed to estradiol for 1, 12, 24, or 36 hours.
  – Number of genes on the cDNA 2-spot array: 1,900.
  – Number of samples per time point: 8.
  – Compare 1 hour with 12, 24, and 36 hours using a two-sided bootstrap t-test.
Some Popular Methods of Analysis

1. Fold-change in gene expression
• For gene “g”, compute the fold change between two conditions (e.g., treatment and control):
  $f_g = \frac{\bar{X}_{\mathrm{trt}}}{\bar{X}_{\mathrm{cont}}}$
1. Fold-change in gene expression
• $R_1$, $R_2$: pre-defined constants.
• $f_g \ge R_1$: gene “g” is “up-regulated”.
• $f_g \le R_2$: gene “g” is “down-regulated”.
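A minimal sketch of this rule, assuming genes-by-samples expression matrices; the array names and the illustrative 2-fold cutoffs are my own:

```python
import numpy as np

def fold_change_calls(x_trt, x_cont, r1=2.0, r2=0.5):
    """Fold change f_g = mean(trt) / mean(cont), one value per gene
    (rows = genes, columns = samples); r1, r2 are pre-defined cutoffs."""
    fg = x_trt.mean(axis=1) / x_cont.mean(axis=1)
    return fg, fg >= r1, fg <= r2             # fold change, up, down
```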
1. Fold-change in gene expression
• Strengths:
  – Simple to implement.
  – Biologists find it very easy to interpret.
  – It is widely used.
• Drawbacks:
  – Ignores variability in mean gene expression.
  – Genes with subtle changes in expression can be overlooked, i.e., potentially high false-negative rates.
  – Conversely, high false-positive rates are also possible.
2. t-test type procedures

2.1 Permutation t-test
• For each gene “g”, compute the standard two-sample t-statistic:
  $t_g = \frac{\bar{X}_{g,\mathrm{trt}} - \bar{X}_{g,\mathrm{cont}}}{S_g \sqrt{\frac{1}{n_{\mathrm{trt}}} + \frac{1}{n_{\mathrm{cont}}}}}$
  where $\bar{X}_{g,\mathrm{trt}}$, $\bar{X}_{g,\mathrm{cont}}$ are the sample means and $S_g$ is the pooled sample standard deviation.
2.1 Permutation t-test
• The statistical significance of a gene is determined by computing the null distribution of $t_g$ using either a permutation or a bootstrap procedure.
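A sketch of the permutation version for a single gene, assuming a two-sided test (the helper names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_t(x, y):
    """Two-sample t-statistic with pooled standard deviation."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))

def perm_pvalue(x, y, n_perm=10_000):
    """Two-sided p-value from the permutation null distribution of t_g."""
    t_obs = pooled_t(x, y)
    both = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(both)
        hits += abs(pooled_t(p[:len(x)], p[len(x):])) >= abs(t_obs)
    return (hits + 1) / (n_perm + 1)          # add-one correction
```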
2.1 Permutation t-test
• Strengths:
  – Simple to implement.
  – Biologists find it very easy to interpret.
  – It is widely used.
• Drawback:
  – For some genes, the pooled sample standard deviation can be very small, which may result in inflated Type I error rates and inflated false discovery rates.
2.2 SAM procedure (Significance Analysis of Microarrays)
(Tusher et al., PNAS 2001)
• For each gene “g”, modify the standard two-sample t-statistic as:
  $d_g = \frac{\bar{X}_{g,\mathrm{trt}} - \bar{X}_{g,\mathrm{cont}}}{s_0 + S_g \sqrt{\frac{1}{n_{\mathrm{trt}}} + \frac{1}{n_{\mathrm{cont}}}}}$
• The “fudge” factor $s_0$ is obtained such that the coefficient of variation of the above test statistic is minimized.
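A vectorized sketch of the statistic; the $s_0$ selection shown is a deliberately simplified stand-in (a grid over percentiles of $S_g$), not the full recipe of Tusher et al.:

```python
import numpy as np

def sam_d(x_trt, x_cont, s0):
    """SAM-style statistic per gene (rows = genes, columns = samples)."""
    n1, n2 = x_trt.shape[1], x_cont.shape[1]
    diff = x_trt.mean(axis=1) - x_cont.mean(axis=1)
    sp2 = ((n1 - 1) * x_trt.var(axis=1, ddof=1)
           + (n2 - 1) * x_cont.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    sg = np.sqrt(sp2)
    return diff / (s0 + sg * np.sqrt(1 / n1 + 1 / n2))

def choose_s0(x_trt, x_cont):
    """Simplified: pick the percentile of S_g whose s0 value gives the
    smallest coefficient of variation of d_g across genes."""
    n1, n2 = x_trt.shape[1], x_cont.shape[1]
    sp2 = ((n1 - 1) * x_trt.var(axis=1, ddof=1)
           + (n2 - 1) * x_cont.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    candidates = np.quantile(np.sqrt(sp2), np.linspace(0, 1, 21))
    def cv(s0):
        d = sam_d(x_trt, x_cont, s0)
        return d.std() / max(np.abs(d).mean(), 1e-12)
    return min(candidates, key=cv)
```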
3. F-test and its variations for more than 2 nominal conditions
• The usual F-test can be used; p-values can be obtained by a suitable permutation procedure.
• Regularized F-test: a generalization of the Baldi and Long methodology to multiple groups.
  – It controls the false discovery rate better, with power comparable to the F-test.
• Cui and Churchill (2003) is a good review paper.
4. Linear fixed effects models
• Effects:
  – Array (A): sample
  – Dye (D)
  – Variety (V): test groups
  – Genes (G)
  – Expression (Y)
4. Linear fixed effects models (Kerr, Martin, and Churchill, 2000)
• Linear fixed effects model:
  $\log(Y_{ijkg}) = \mu + A_i + D_j + G_g + (AD)_{ij} + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}$,
  $\epsilon_{ijkg} \overset{iid}{\sim} N(0, \sigma^2)$.
• $H_0: (VG)_{kg} = 0$ for all $k = 1, 2, \ldots, v$.
4. Linear fixed effects models
• All effects are assumed to be fixed effects.
• Main drawback: all genes are assumed to have the same variance!
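For a small data set, a fit of this form can be sketched with statsmodels; the data frame `df` and its column names (log_y, array, dye, variety, gene) are hypothetical, and with thousands of genes a single model this large quickly becomes impractical:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fixed-effects ANOVA model in the spirit of Kerr, Martin, and Churchill.
model = smf.ols(
    "log_y ~ C(array) + C(dye) + C(gene) + C(array):C(dye)"
    " + C(array):C(gene) + C(dye):C(gene) + C(variety):C(gene)",
    data=df,
).fit()

# F-test of H0: (VG)_kg = 0 via the variety-by-gene interaction row.
print(sm.stats.anova_lm(model, typ=2).loc["C(variety):C(gene)"])
```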
5. Linear mixed effects models (Wolfinger et al. 2001)
• Stage 1 (global normalization model):
  $\log(Y_{gij}) = \mu + T_i + A_j + (TA)_{ij} + \gamma_{gij}$
• Stage 2 (gene-specific model):
  $\hat{\gamma}_{gij} = G_g + (GT)_{gi} + (GA)_{gj} + \epsilon_{gij}$
5. Linear mixed effects models
• Assumptions:
  $A_j \overset{iid}{\sim} N(0, \sigma_A^2), \quad (TA)_{ij} \overset{iid}{\sim} N(0, \sigma_{TA}^2),$
  $(GA)_{gj} \overset{iid}{\sim} N(0, \sigma_{GA_g}^2), \quad \epsilon_{gij} \overset{iid}{\sim} N(0, \sigma_g^2)$
5. Linear mixed effects models (Wolfinger et al. 2001)
• Perform inference on the interaction term $(GT)_{gi}$.
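A per-gene sketch of the Stage 2 fit using statsmodels' mixed models; `dfg` (one gene's Stage 1 residuals, with hypothetical columns resid, treatment, array) is my own naming, and the random-effects structure shown is simpler than the full Wolfinger et al. specification:

```python
import statsmodels.formula.api as smf

# Gene-specific model: fixed treatment effect (the (GT)_gi term of
# interest) plus a random array effect playing the role of (GA)_gj.
fit = smf.mixedlm("resid ~ C(treatment)", data=dfg,
                  groups=dfg["array"]).fit()   # REML by default
print(fit.summary())
```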
A popular graphical representation: The Volcano Plot
• A scatter plot of $-\log_{10}(p\text{-value})$ versus $\log_2(\text{fold change})$.
• Genes with a large fold change will lie outside a pair of vertical “threshold” lines. Further, genes that are highly significant with a large fold change will lie in either the upper right-hand or the upper left-hand corner.
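A minimal matplotlib sketch, assuming per-gene arrays `log2fc` and `pvals` computed by one of the tests above (the 2-fold and 5% guide lines are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(log2fc, -np.log10(pvals), s=5)
plt.axvline(-1, linestyle="--")               # 2-fold down
plt.axvline(1, linestyle="--")                # 2-fold up
plt.axhline(-np.log10(0.05), linestyle="--")  # nominal 5% significance
plt.xlabel("log2(fold change)")
plt.ylabel("-log10(p-value)")
plt.show()
```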
A useful review article
• Cui, X. and Churchill, G. (2003), Genome Biology.

Software:
• R package: statistics for microarray analysis.
  http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
• SAM: Significance Analysis of Microarrays.
  http://www-stat.stanford.edu/%7Etibs/SAM
Supervised classification algorithms

Discriminant analysis based methods
A. Linear and quadratic discriminant analysis based methods:
• Strength:
  – Well studied in the classical statistics literature.
• Limitations:
  – Based on normality.
  – Imposes constraints on the covariance matrices; one also needs to be concerned about singularity issues.
  – No convenient strategy has been proposed in the literature to select the “best” discriminating subset of genes.
Discriminant analysis based methods
B. Nonparametric classification using a Genetic Algorithm and K-nearest neighbors.
  – Li et al. (Bioinformatics, 2001)
• Strengths:
  – Entirely nonparametric.
  – Takes into account the underlying dependence structure among genes.
  – Does not require the estimation of a covariance matrix.
• Weakness:
  – Computationally very intensive.
GA/KNN methodology – a very brief description
• Compute the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes.
• Assign each sample to a treatment group (i.e., condition) based on its K nearest neighbors.
• Compute a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function (see the sketch below).
• The objective function is optimized using a Genetic Algorithm.
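A sketch of the leave-one-out KNN fitness score described above (my own minimal implementation, not the authors' code):

```python
import numpy as np
from collections import Counter

def knn_fitness(X, labels, subset, k=3):
    """Number of samples whose class is correctly predicted by their
    k nearest neighbors (Euclidean distance on the selected genes).
    X: samples-by-genes matrix; subset: indices of, say, 50 genes."""
    Xs = X[:, subset]
    d = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # leave each sample out
    correct = 0
    for i in range(len(Xs)):
        nn = np.argsort(d[i])[:k]
        vote = Counter(labels[j] for j in nn).most_common(1)[0][0]
        correct += vote == labels[i]
    return correct
```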
[Figure: K-nearest neighbors classification (k = 3); axis: expression levels of gene 1]
[Figure: Subcategories within a class; axis: expression levels of gene 1]
Advantages of the KNN approach
• Simple; performs as well as or better than more complex methods.
• Free from assumptions such as normality of the distribution of expression levels.
• Multivariate: takes account of dependence in expression levels.
• Accommodates or even identifies distinct subtypes within a class.
Expression data: many genes and few samples
• There may be many subsets of genes that can statistically discriminate between the treated and the untreated.
• There are too many possible subsets to examine exhaustively: with 3,000 genes, there are about $10^{72}$ ways to make subsets of size 30.
The genetic algorithm
• A computer algorithm (John Holland) that works by mimicking Darwin's natural selection.
• Has been applied to many optimization problems, ranging from engine design to protein folding and sequence alignment.
• Effective in searching high-dimensional spaces.
GA works by mimicking evolution
• Randomly select sets (“chromosomes”) of 30 genes from all the genes on the chip.
• Evaluate the “fitness” of each “chromosome”: how well can it separate the treated from the untreated?
• Pass “chromosomes” randomly to the next generation, with preference for the fittest (a toy sketch follows).
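A toy version of that loop, reusing `knn_fitness` from the earlier sketch; the selection and mutation schemes here are deliberately simple stand-ins for the real GA operators:

```python
import numpy as np

rng = np.random.default_rng(1)

def ga_select(X, labels, n_genes, size=30, pop=100, gens=50):
    """Evolve gene subsets ('chromosomes') toward high KNN fitness."""
    popn = [rng.choice(n_genes, size, replace=False) for _ in range(pop)]
    for _ in range(gens):
        scores = np.array([knn_fitness(X, labels, s) for s in popn],
                          dtype=float) + 1e-9    # avoid all-zero weights
        parents = rng.choice(pop, size=pop, p=scores / scores.sum())
        nxt = []
        for idx in parents:                      # fitness-proportional pass-on
            child = popn[idx].copy()
            child[rng.integers(size)] = rng.choice(
                np.setdiff1d(np.arange(n_genes), child))  # mutate one gene
            nxt.append(child)
        popn = nxt
    return max(popn, key=lambda s: knn_fitness(X, labels, s))
```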
Summary
• Pay attention to the multiple testing problem.
  – Prefer FDR over FWER for large data sets such as gene expression microarrays.
• Linear mixed effects models may be used for comparing expression data between groups.
• For classification problems, one may want to consider the GA/KNN approach.