Multiple Testing in Microarray Data Analysis
Mi-Ok Kim

Outline
1. Hypothesis Testing
2. Issues in Multiple Testing in Microarray Analysis
   1) Type I Error
   2) Power
   3) P-values
3. Permutation

1. Hypothesis Testing
• $H_0$ : null hypothesis vs. $H_1$ : alternative hypothesis
• $T$ : test statistic, $C$ : critical value
• If $|T| > C$, $H_0$ is rejected. Otherwise $H_0$ is retained.
• Ex) $H_0 : \mu_1 = \mu_2$ vs. $H_1 : \mu_1 \neq \mu_2$
  $T = (\bar{x}_1 - \bar{x}_2) / \text{pooled SE}$
  If $|T| > z_{1-\alpha/2}$, $H_0$ is rejected at the significance level $\alpha$, so here $C = z_{1-\alpha/2}$.

1. Hypothesis Testing
                 Hypothesis result
Truth      Retained             Rejected
$H_0$      correct decision     Type I error
$H_1$      Type II error        correct decision
• Type I error rate = probability of a false positive ( $\alpha$ : significance level )
• Type II error rate = probability of a false negative
• Power : 1 – Type II error rate
• P-value : $p = \inf\{\alpha \mid H_0 \text{ is rejected at the significance level } \alpha\}$

2. Issues in Multiple Comparison
• Q : Given n treatments, which pairs of treatments are significantly different? (simultaneous testing)
  cf) Is treatment A different from treatment B? (a single test)
• Ex) n treatment means : $\mu_1, \dots, \mu_n$
  $H_{ij} : \mu_i = \mu_j$, where $i \neq j$
  $T_{ij} = (\bar{x}_i - \bar{x}_j) / \text{pooled SE}$
• Probability of at least one Type I error when n independent comparisons are each tested at the 0.05 significance level one by one : $1 - (0.95)^n$
• Inflated Type I error, ex) $1 - (0.95)^{10} = 0.401263$
• Remedy : Bonferroni method, which tests each comparison at level $\alpha$ / (number of comparisons)

3. Issues in Multiple Testing in Microarray Analysis
• Goal : the identification of differentially expressed genes.
Ex) a study of differential gene expression in tumor biopsy specimens from leukemia patients ( ALL / AML ) that includes 6,817 genes and 30 samples
• Rows : genes ( m )
• Columns : samples ( n )
• $H_j$ : the jth gene is not differentially expressed
• Simultaneously test the m null hypotheses $H_j$, $j = 1, \dots, m$, to determine which hypotheses to reject while controlling a suitably defined Type I error rate and maximizing power

3-1) Type I Error Rates
                 Hypothesis result
Truth      # retained    # rejected    Total
$H_0$      $U$           $V$           $m_0$
$H_1$      $T$           $S$           $m_1$
Total      $m - R$       $R$           $m$
• Per-comparison error rate ( PCER ) = $E(V) / m$
• Per-family error rate ( PFER ) = $E(V)$
• Family-wise error rate ( FWER ) = $\Pr(V \geq 1)$
• False discovery rate ( FDR ) = $E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$

3-1) Type I Error Rates
Under the complete null hypothesis, each $H_j$ has Type I error rate $\alpha_j$.
• PCER = $E(V) / m = (\alpha_1 + \dots + \alpha_m) / m$
• PFER = $E(V) = \alpha_1 + \dots + \alpha_m$
• FWER = $\Pr(V \geq 1) = 1 - \Pr(H_j,\ j = 1, \dots, m, \text{ not rejected})$
• FDR = $E(V/R)$ = FWER, because $V = R$ under the complete null, so $Q = 1$ whenever $V \geq 1$
PCER = $(\alpha_1 + \dots + \alpha_m)/m \leq \max(\alpha_1, \dots, \alpha_m) \leq$ FWER = FDR $\leq$ PFER = $\alpha_1 + \dots + \alpha_m$

3-1) Type I Error Rates
Assume the hypotheses $H_j$, $j = 1, \dots, m$, have test statistics $T_j$, $j = 1, \dots, m$, that follow a multivariate normal distribution with mean $\mu = (\mu_1, \dots, \mu_m)$ and identity covariance matrix, so the tests are independent.
Let $R_j = I(H_j \text{ is rejected})$ and let $r_j$ be the observed value of $R_j$.
Let $\alpha_j = \Pr(H_j \text{ is rejected})$, which is the Type I error rate of $H_j$ when $H_j$ is true.
Index the true null hypotheses as $j = 1, \dots, m_0$ (under the complete null, $m_0 = m$). Then
PFER = $\sum_{j=1}^{m_0} \alpha_j$
PCER = $\sum_{j=1}^{m_0} \alpha_j / m$
FWER = $1 - \prod_{j=1}^{m_0} (1 - \alpha_j)$
FDR = $\sum_{r_1=0}^{1} \cdots \sum_{r_m=0}^{1} \left( \sum_{j=1}^{m_0} r_j \Big/ \sum_{j=1}^{m} r_j \right) \prod_{j=1}^{m} \alpha_j^{r_j} (1 - \alpha_j)^{1 - r_j}$, with the convention $0/0 = 0$

3-2) Strong vs. Weak Control
• Expectations and probabilities are conditional on which hypotheses are true.
• Strong control : control of the Type I error rate under any combination of true and false hypotheses, i.e. for any value of $m_0$
• Weak control : control of the Type I error rate only when all the null hypotheses are true, i.e. under the complete null hypothesis $\bigcap_{j=1}^{m} H_j$
• In the microarray setting, where it is very unlikely that no genes are differentially expressed, it seems particularly important to have strong control of the Type I error rate.
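The error-rate formulas in 3-1 for independent tests can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `error_rates` and the example rejection probabilities are assumptions made for illustration, and the FDR is computed by brute-force enumeration of all rejection patterns, so it is only practical for small m.

```python
from itertools import product
from math import prod


def error_rates(alpha, m0):
    """Type I error rates for m independent tests.

    alpha[j] is Pr(H_j is rejected); the first m0 hypotheses are taken
    to be the true nulls (an indexing convention assumed here).
    """
    m = len(alpha)
    pfer = sum(alpha[:m0])                       # E(V)
    pcer = pfer / m                              # E(V) / m
    fwer = 1.0 - prod(1.0 - a for a in alpha[:m0])  # Pr(V >= 1)

    # FDR = E(V / R), summing over every rejection pattern (r_1, ..., r_m).
    fdr = 0.0
    for r in product((0, 1), repeat=m):
        n_rejected = sum(r)
        if n_rejected == 0:
            continue  # Q = 0 when R = 0
        n_false_pos = sum(r[:m0])
        pattern_prob = prod(a if rj else 1.0 - a for a, rj in zip(alpha, r))
        fdr += (n_false_pos / n_rejected) * pattern_prob

    return {"PCER": pcer, "PFER": pfer, "FWER": fwer, "FDR": fdr}


# Complete null: 10 independent tests, each at level 0.05.
complete_null = error_rates([0.05] * 10, m0=10)

# Two true nulls at level 0.05 plus one false null rejected with probability 0.8.
mixed = error_rates([0.05, 0.05, 0.8], m0=2)

for name, rates in (("complete null", complete_null), ("mixed", mixed)):
    print(name, {k: round(v, 6) for k, v in rates.items()})
```

Under the complete null the FWER equals $1 - (0.95)^{10}$ and coincides with the FDR, while in the mixed case the FDR falls below the FWER, which is the ordering stated in 3-1.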
3-3) Power
• Within the class of multiple testing procedures that control a given Type I error rate at an acceptable level $\alpha$, maximize power, that is, minimize a suitably defined Type II error rate.
• Any-pair power : $\Pr(S \geq 1)$ = the probability of rejecting at least one false null hypothesis
• Per-pair power : average power = $E(S) / m_1$
• All-pair power : $\Pr(S = m_1)$ = the probability of rejecting all false null hypotheses

3-4) Multiple Testing Procedures Based on P-values That Control the Family-wise Error Rate
• For a single hypothesis $H_1$, $p_1 = \inf\{\alpha \mid H_1 \text{ is rejected at the significance level } \alpha\}$
  If $p_1 < \alpha$, $H_1$ is rejected. Otherwise $H_1$ is retained.
• Adjusted p-values for multiple testing ( $p^*$ ) : $p_j^* = \inf\{\alpha \mid H_j \text{ is rejected at FWER} = \alpha\}$
  If $p_j^* < \alpha$, $H_j$ is rejected. Otherwise $H_j$ is retained.
• Single-step, step-down and step-up procedures

3-4-1) Single-Step Procedure
• For strong control of the FWER,
  single-step Bonferroni adjusted p-values : $p_j^* = \min(m p_j, 1)$
  single-step Sidak adjusted p-values : $p_j^* = 1 - (1 - p_j)^m$
• For weak control of the FWER,
  single-step minP adjusted p-values : $p_j^* = \Pr\left(\min_{1 \leq k \leq m} P_k \leq p_j \mid \text{complete null}\right)$
  single-step maxT adjusted p-values : $p_j^* = \Pr\left(\max_{1 \leq k \leq m} |T_k| \geq C_j \mid \text{complete null}\right)$, where $C_j = |t_j|$ is the observed value of the jth test statistic
• Under the subset pivotality property, weak control = strong control

3-4-2) Step-Down Procedure
• Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$
• Accordingly, order the hypotheses $H_{r_1}, H_{r_2}, \dots, H_{r_m}$
• Holm's procedure : $j^* = \min\{ j \mid p_{r_j} > \alpha / (m - j + 1) \}$, reject $H_{r_j}$ for $j = 1, \dots, j^* - 1$
• Adjusted step-down p-values
  Holm : $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \min( (m - k + 1)\, p_{r_k}, 1) \}$
  Sidak : $p_{r_j}^* = \max_{k = 1, \dots, j} \{ 1 - (1 - p_{r_k})^{m - k + 1} \}$
  minP : $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \min_{l \in \{r_k, \dots, r_m\}} P_l \leq p_{r_k} \mid \text{complete null} ) \}$
  maxT : $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \max_{l \in \{r_k, \dots, r_m\}} |T_l| \geq C_{r_k} \mid \text{complete null} ) \}$

3-4-3) Step-Up Procedure
• Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$
• Accordingly, order the hypotheses $H_{r_1}, H_{r_2}, \dots, H_{r_m}$
• $j^* = \max\{ j \mid p_{r_j} \leq \alpha / (m - j + 1) \}$, reject $H_{r_j}$ for $j = 1, \dots, j^*$ (Hochberg's procedure)
• Adjusted step-up p-values : $p_{r_j}^* = \min_{k = j, \dots, m} \{ \min( (m - k + 1)\, p_{r_k}, 1) \}$

3-5) Resampling Method
• Rows : genes, columns : samples
• Bootstrap or permutation based method
• Estimate the joint distribution of the test statistics under the complete null hypothesis by permuting the columns of the gene expression data matrix
• For the bth permutation, $b = 1, \dots, B$, compute the test statistics $t_{1,b}, \dots, t_{m,b}$
  $p_j^* = \sum_{b=1}^{B} I( |t_{j,b}| \geq C_j ) / B$, where $C_j$ is the observed value of $|t_j|$
  ex) Golub et al. (1999)

3-5) Resampling Method
• Efron et al. (2000) and Tusher et al. (2001)
• Compute a test statistic $t_j$ for each gene j and define the order statistics $t_{(j)}$ such that $t_{(1)} \geq t_{(2)} \geq \dots \geq t_{(m)}$
• For each permutation b, $b = 1, \dots, B$, compute the test statistics and define the order statistics $t_{(1),b} \geq t_{(2),b} \geq \dots \geq t_{(m),b}$
• From the permutations, estimate the expected value (under the complete null) of the order statistics by $t^*_{(j)} = \sum_{b=1}^{B} t_{(j),b} / B$
• Form a Q-Q plot of the observed $t_{(j)}$ vs. the expected $t^*_{(j)}$
• Efron et al. : for a fixed threshold $\Delta$, genes with $|t_{(j)} - t^*_{(j)}| \geq \Delta$ are declared differentially expressed
• Tusher et al. : for a fixed threshold $\Delta$, let $j^* = \max\{ j : t_{(j)} - t^*_{(j)} \geq \Delta,\ t^*_{(j)} > 0 \}$
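The column-permutation scheme in 3-5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of any cited paper: the helper names `t_stat` and `permutation_pvalues`, the pooled-variance two-sample t-statistic, the number of permutations, and the synthetic toy data are all choices made here. It returns both the per-gene permutation p-value from 3-5 and the single-step maxT adjusted p-value from 3-4-1, with the complete-null distribution estimated from random permutations of the sample labels.

```python
import random
from statistics import mean, variance


def t_stat(row, labels):
    """Two-sample t-statistic with pooled variance for one gene (one row)."""
    g0 = [x for x, lab in zip(row, labels) if lab == 0]
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    n0, n1 = len(g0), len(g1)
    pooled_var = ((n0 - 1) * variance(g0) + (n1 - 1) * variance(g1)) / (n0 + n1 - 2)
    return (mean(g1) - mean(g0)) / (pooled_var * (1 / n0 + 1 / n1)) ** 0.5


def permutation_pvalues(data, labels, B=500, seed=0):
    """Return (per-gene permutation p-values, single-step maxT adjusted p-values)."""
    rng = random.Random(seed)
    m = len(data)
    observed = [abs(t_stat(row, labels)) for row in data]  # C_j = |t_j|
    raw_counts = [0] * m
    maxt_counts = [0] * m
    perm = list(labels)
    for _ in range(B):
        rng.shuffle(perm)  # shuffling labels is equivalent to permuting columns
        t_b = [abs(t_stat(row, perm)) for row in data]
        max_b = max(t_b)
        for j in range(m):
            raw_counts[j] += t_b[j] >= observed[j]    # I(|t_{j,b}| >= C_j)
            maxt_counts[j] += max_b >= observed[j]    # I(max_l |t_{l,b}| >= C_j)
    raw = [c / B for c in raw_counts]
    adjusted = [c / B for c in maxt_counts]
    return raw, adjusted


# Synthetic toy matrix: 5 genes (rows) x 10 samples (columns), two groups of 5.
# Gene 0 is shifted in the second group; the other genes are pure noise.
data_rng = random.Random(1)
labels = [0] * 5 + [1] * 5
data = [[data_rng.gauss(0.0, 1.0) for _ in labels] for _ in range(5)]
data[0] = [x + 3.0 * lab for x, lab in zip(data[0], labels)]

raw, adjusted = permutation_pvalues(data, labels)
for j, (p, p_adj) in enumerate(zip(raw, adjusted)):
    print(f"gene {j}: permutation p = {p:.3f}, maxT adjusted p = {p_adj:.3f}")
```

Because the maximum over genes is at least as large as any single gene's statistic in every permutation, the maxT adjusted p-value is never smaller than the per-gene permutation p-value. With only 10 samples the number of distinct label assignments is small, so in practice one might enumerate all of them instead of sampling.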