Download Multiple Testing in Microarray Data Analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Multiple Testing in Microarray
Data Analysis
Mi-Ok Kim
Outline
1. Hypothesis Testing
2. Issue in Multiple Testing in Microarray Analysis
1) Type I Error
2) Power
3) P-values
3. Permutation
1. Hypothesis Testing
• H0 : Null hypotheis
vs.
• T : test statistics
C : critical value
H1 : Alternative Hypothesis
• If |T|>C, H0 is rejected. Otherwise H0 is retained
• Ex ) H0 : 1 = 2 vs. H1 : 1  2 T = (x1- x2) / pooled se
If |T| > z(1- /2), H0 is rejected at the significance level 
• C
1. Hypothesis Testing
Truth
H0
H1
Hypothesis Result
Retained
Rejected
Type I error
Type II error
• Type I error rate = false positives ( : significance level )
• Type II error rate = false negatives
• Power : 1–Type II error rate
• P-values : p=inf{ | H0 is rejected at the significance level  }
2. Issues in Multiple Comparison
• Q : Given n treatments, which two treatments are
significantly different ? (simultaneous testing)
cf) Is treatment A different from treatment B ?
• Ex ) m treatment means : 1,…,n
Hj : i = j where ij
Tj = (xi- xj) / pooled SE
• Type I error when testing each at 0.05 significance level
one by one : 1 – (0.95)n
• Inflated Type I error, ex)  =1 – (0.95)10 = 0.401263
• Remedies : Bonferroni Method
Type I error rate =  / # of comparison
3. Issues in Multiple Testing in Microarray
Analysis
• the identification of differentially expressed genes.
ex) a study of differentially expressed genes expression in
tumor biopsy specimens from leukemia patients ( ALL /
AML ) that includes 6,817 genes and 30 samples
• rows : genes ( m )
• columns : samples ( n )
• Hj : jth gene is not differentially expressed
• Simultaneously testing m null hypotheses Hj , j=1, …, m,
to determine which hypotheses to reject while controlling a
suitably defined Type I error and maximizing power
3-1) Type I Error Rates
Truth
•
•
•
•
Hypothesis Result
#retained
#rejected
H0
U
V
H1
T
S
Total
m-R
R
Total
m0
m1
m
Per-comparison error rate ( PCER ) = E(V) / m
Per-family error rate ( PFER ) = E(V)
Family-wise error rate = pr ( V ≥ 1 )
False discovery rate ( FDR ) = E(Q), Q V/R , if R > 0
0,
if R = 0
3-1) Type I Error Rates
Under the complete null hypothesis, each Hj has Type I error
rate j.
•
•
•
•
PCER = E(V) / m = (1+...+m)/m
PFER = E(V) = 1+...+m
FWER= pr ( V ≥ 1 ) = 1 - Pr (Hj , j=1, …, m, not rejected )
FDR = E(V / R) = FWER
PCER = (1+...+m)/m ≤ max (1+...+m)
≤ PWER = FDR ≤ PFER=
1+...+m
3-1) Type I Error Rate
Assume Hj , j=1, …, m, with their test statistics Tj , j=1,…, m,
which has a MN with mean =(1,…,m)and identity
covariance vector
Let Rj = I ( Hj is rejected) and rj is observed value of Rj
Let j = Pr ( Hj rejected under Hj ).
PFER = j=1m j
PCER = j=1m j / m
FWER = 1- j=1m (1- j)
FDR = r1=01…r1=01(j=1m0rj / j=1mrj) jrj (1- j) 1-rj
3-2) Strong vs. Weak Control
• Expectations and probabilities are conditional on which
hypotheses are true.
• Strong control : control of which Type I error rate under
any combination of true and false hypotheses, ie. any value
of m0
• Weak control : control of the Type I error rate only when all
the null hypotheses are true, ie. Under the complete null
hypothesis ∩j=1m Hj
• In the microarray setting, where it is very unlikely that no
genes are differentially expressed, it seems particularly
important to have a strong control of the Type I error rate.
3-3) Power
• Within the class of multiple testing procedures that control
a given Type I error rate at an acceptable level , maximize
power, that is, minimize a suitably defined Type II error
rate.
• Any-pair power : Pr ( S ≥ 1 ) = the probability of rejecting
at least one false null hypothesis
• Per-pair power : average power = E(S) / m1
• All-pair power : Pr ( S = m1 ) = the probability of rejecting
all false null hypothesis
3-4) Multiple Testing Procedures based on Pvalues that control the family-wise error rate
• For a single hypothesis H1,
p1=inf{  | H1 is rejected at the significance level  }
If p1 < , H1 is rejected. Otherwise H1 is retained
• Adjusted p-values for multiple testing (p*)
pj*=inf{  | H1 is rejected at FWER= }
If pj* < , Hj is rejected. Otherwise Hj is retained
• Single-Step, Step-Down and Step-Up procedure
3-4-1) Single-Step Procedure
• For a strong control of FWER,
single-step Bonferroni adjusted p-values : pj*= min( mpj,1)
single-Step Sidak adjsted pvalues : pj*= 1- (1-pj)m
• For a weak control of FWER,
single-step minP adjusted p-values
pj*= min 1≤k≤m (Pk ≤ pj | complete null)m
single-step maxP adjusted p-values
pj*= max 1≤k≤m (|Tk| ≤ Cj | complete null)m
• Under subset pivotal property, weak control = strong
control
3-4-2) Step-Down Procedure
• Order the observed unadjusted p-values such that pr1 ≤ pr2
≤ … ≤ prm
• Accordingly, order Hr1 ≤ Hr2 ≤ … ≤ Hrm
• Holm’s procedure
j* = min { j | prj >  / (m-j+1) }, reject Hrj for j=1, .., j*-1
• Adjusted step-down Holm’s p-values
prj *= max{ min( (m-k+1) prk , 1) }
prj *= max{ 1-(1-prk)(m-k+1) }
prj *= max{ Pr( min rk<l<rm Pl ≤ prk | complete null) }
prj *= max{ Pr( max rk<l<rm |Tl| ≤ Crk | complete null) }
3-4-3) Step-Up Procedure
• Order the observed unadjusted p-values such that pr1 ≤ pr2
≤ … ≤ prm
• Accordingly, order Hr1 ≤ Hr2 ≤ … ≤ Hrm
• j* = max { j | prj ≤  / (m-j+1) }, reject Hrj for j=1, .., j*
• Adjusted step-down Holm’s p-values
prj *= min{ min( (m-k+1) prk , 1) }
3-5) Resampling Method
• Rows – genes , Columns – samples
• Bootstrap or permutation based method
• Estimate the joint distribution of the test statistics under the
complete null hypothesis by permuting the columns of the
gene expression data matrix (permuting columns)
• For the bth permutation, b=1, …, B,
compute test statistics t1,b, …, tm,b
prj *= j=1B I (| tj,b | ≥ Cj ) / B
ex ) Colub (1999)
3-5) Resampling Method
• Efron et al. (2000) and Tusher et al. (2001)
• Compute a test statistics tj for each gene j and define order
statistics t(j) such that t(1) ≥ t(2) ≥ .. ≥ t(m)
• For each b permutation, b=1, ..,B, compute the test statistics
and define the order statistics t(1),b ≥ t(2),b ≥ .. ≥ t(m),b
• From the permutations, estimate the expected value (under
the complete null) of the order statistics by t*(j)=  t(j),b /B
• Form a Q-Q plot of the observed t(j) vs. the expected t*(j)
• Efron et al. – for a fixed threshold , genes with |t(j)-t*(j)| ≥

• Tusher et al. - for a fixed threshold , let j*=max{j: t(j)-t*(j)
≥ , t*(j) > 0}
Related documents