BIOINF 2118 #17 - exact tests

Fisher exact test for 2x2 tables

A very commonly used test for independence.

The setting: a 2x2 table, testing H0: the two factors (row and column) are independent.

For tables with small counts, the chi-square test and the likelihood ratio test may perform poorly: the actual Type I error can be far from the stated level.

The exact test of R.A. Fisher uses as the sample space
{all tables whose row and column totals equal the observed totals}.

                   Y=1    Y=2    row totals
  X=1              M11    M12    M1+
  X=2              M21    M22    M2+
  column totals    M+1    M+2    n

The test is "conditional on the margins". Once M11 is fixed, the rest of the table is fixed, so the conditional probability of each table equals the conditional probability of M11. The sample space corresponds to

$$M_{11} \in \{\max(0,\ M_{1+} - M_{+2}),\ \ldots,\ \min(M_{1+},\ M_{+1})\}.$$

(Notice that $M_{1+} - M_{+2} = M_{+1} - M_{2+} = M_{+1} + M_{1+} - n$.)

If the rows and columns X and Y are independent, then the distribution of M11 is hypergeometric. See also the end of this document, which shows how to get the hypergeometric from independent Poissons by conditioning on the margins.

The notation is variable.

  Above     This doc (and R)
  M11       x
  M1+       m
  M2+       n
  M+1       k
  M+2       N-k
  n         N

Note that n is the grand total in the first table but the second row total in the R notation. Some books use yet another set of symbols (y, A, B, n, N-n, N) for the same table.

Fisher's exact test: Use X as the test statistic (ordering by "surprise").

  one-sided P-value = $\Pr(X \le x_{\mathrm{obs}})$ (lower tail) or $\Pr(X \ge x_{\mathrm{obs}})$ (upper tail)
  two-sided P-value = $\sum \Pr(X = x)$, where the sum is over all $x$ with $\Pr(X = x) \le \Pr(X = x_{\mathrm{obs}})$

Example 1: Given N balls in an urn, n black and m white, k balls are chosen randomly, and independently of color (so says H0!). Let X = # of chosen balls which are white. Then

$$\Pr(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}, \qquad (*)$$

over the range $\max(0,\ k - n) \le x \le \min(m,\ k)$.

In R:
    dhyper(x, m, n, k, log = FALSE)
    phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
    qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
    rhyper(nn, m, n, k)

  x = the number of white balls drawn = the observation.
  m = the number of white balls in the urn.
  n = the number of black balls in the urn.
  k = the number of balls drawn from the urn.
So m + n = N, the sample size.

Example 2: Given N mice, m male and n female, k mice are mutants, and "sex does not affect the risk of mutation" (H0!). Let X = # of mutants which are male. Then Pr(X = x | n, m, k) = (*).

"Case-control studies" work like this too. We know n + m = sample size and k = # cases in advance, but m = # people with the risk factor is not known in advance. X = # cases with the risk factor.

Example 3: Given N people, m of them have the polymorphism for gene G1, k of the N have the polymorphism for gene G2, and "the polymorphisms are independent" (H0!). Let X = # of people with both polymorphisms. Then Pr(X = x | n, m, k) = (*).

Data example for Example 3:

            G2 var   G2 wt   total
  G1 var      10       12      22
  G1 wt       14       64      78
  total       24       76     100

In the notation of this document: x = 10, m = 22, n = 78, k = 24, N - k = 76, N = 100.
"var" = minor allele; "wt" = wild type, or major allele.

Under independence, E(X | n, m, k) = km/N = 24 x 22 / 100 = 5.28, or roughly 100 x (1/5) x (1/5) = 4.

H0: independence.
HA: The G1 and G2 variants are more likely together. (One-sided alternative.)

P-value = Pr(X ≥ 10 | n, m, k) =
    1 - phyper(9, 22, 78, 24)
    [1] 0.01065388

======================

The ODDS RATIO is an important measure of association in 2 by 2 tables. The odds ratio is:

Population:
$$\mathrm{OR} = \frac{\mathrm{odds}(\text{row 1} \mid \text{col 1})}{\mathrm{odds}(\text{row 1} \mid \text{col 2})} = \frac{\mathrm{odds}(\text{col 1} \mid \text{row 1})}{\mathrm{odds}(\text{col 1} \mid \text{row 2})}$$

Sample estimate:
$$\widehat{\mathrm{OR}} = \frac{M_{11}/M_{21}}{M_{12}/M_{22}} = \frac{M_{11}/M_{12}}{M_{21}/M_{22}}$$

A good rough estimate of the variance of $\log \widehat{\mathrm{OR}}$ is

$$\widehat{\mathrm{var}}\left(\log \widehat{\mathrm{OR}}\right) \approx \frac{1}{M_{11}} + \frac{1}{M_{12}} + \frac{1}{M_{21}} + \frac{1}{M_{22}}$$

(courtesy of the Delta Method, see EXTRA TOPIC-delta method and variance stabilization.docx).

Example: In the table above,

$$\widehat{\mathrm{OR}} = (10/14)/(12/64) = 3.809524$$
$$\widehat{\mathrm{var}}\left(\log \widehat{\mathrm{OR}}\right) \approx 10^{-1} + 14^{-1} + 12^{-1} + 64^{-1} = 0.2703869$$

So a rough confidence interval for log OR is log(3.809524) ± 1.96 x sqrt(0.2703869), which is the interval (0.3183289, 2.35667969), and a rough confidence interval for OR itself is (exp(0.3183289), exp(2.35667969)) = (1.374828, 10.555843). Note that the standard error is the square root of the variance estimate, not the variance itself.
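The calculations for the data example can be reproduced in a few lines of R. This is only a sketch: the object names (`tab`, `p_upper`, `p_two`, `or_hat`, and so on) are made up for illustration, and the small relative tolerance in the two-sided sum is an assumption added to guard against floating-point ties.

```r
# 2x2 table from the G1/G2 data example (rows: G1 var/wt, columns: G2 var/wt)
tab <- matrix(c(10, 12,
                14, 64),
              nrow = 2, byrow = TRUE,
              dimnames = list(G1 = c("var", "wt"), G2 = c("var", "wt")))

x <- tab[1, 1]       # observed M11
m <- sum(tab[1, ])   # row 1 total (22)
n <- sum(tab[2, ])   # row 2 total (78)
k <- sum(tab[, 1])   # column 1 total (24)

# Hypergeometric probabilities from formula (*), compared with dhyper()
support <- max(0, k - n):min(m, k)
pmf_formula <- choose(m, support) * choose(n, k - support) / choose(m + n, k)
max_abs_diff <- max(abs(pmf_formula - dhyper(support, m, n, k)))

# One-sided (upper-tail) P-value: Pr(X >= x_obs)
p_upper <- phyper(x - 1, m, n, k, lower.tail = FALSE)

# Two-sided P-value: sum over all x no more probable than the observed x
p_obs <- dhyper(x, m, n, k)
p_two <- sum(pmf_formula[pmf_formula <= p_obs * (1 + 1e-7)])

# Sample odds ratio and rough (delta-method) interval
or_hat     <- (tab[1, 1] / tab[2, 1]) / (tab[1, 2] / tab[2, 2])
var_log_or <- sum(1 / tab)                       # 1/M11 + 1/M12 + 1/M21 + 1/M22
ci_log_or  <- log(or_hat) + c(-1, 1) * 1.96 * sqrt(var_log_or)
ci_or      <- exp(ci_log_or)

print(c(p_upper = p_upper, p_two = p_two))
print(c(or_hat = or_hat, lower = ci_or[1], upper = ci_or[2]))
```

Here `phyper(x - 1, ..., lower.tail = FALSE)` is the same quantity as `1 - phyper(9, 22, 78, 24)` in the text.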
An "exact" method uses the noncentral hypergeometric distribution, parametrized by ψ = log(OR). From fisher.test(), we get (1.197927, 11.801156).

Lots of great stuff about 2x2 tables:
  Agresti and Min, Simple improved confidence intervals for comparing matched proportions, Stat in Med 2005;
  Agresti and Caffo, Teacher's Corner, Successes and Two Failures, American Statistician 2000;
  Richard Darlington, Some New 2 x 2 Tests.

APPENDIX --- JUST FOR FUN/CURIOSITY:

From four independent Poissons to the hypergeometric, by conditioning on the margins.

The table layout (cell x, row totals m and n, column totals k and N-k):

  x     .      m
  .     .      n
  k    N-k     N

Suppose that

  x ~ Pois(λab),
  k - x ~ Pois(λ(1-a)b),
  m - x ~ Pois(λa(1-b)),
  N - m - k + x ~ Pois(λ(1-a)(1-b)),

and they're independent. This corresponds to (*) with n = N - m.

The joint distribution of the four Poissons (which are independent) is

$$\Pr(x,\ k-x,\ m-x,\ N-m-k+x) = \frac{e^{-\lambda ab}(\lambda ab)^{x}}{x!} \times \frac{e^{-\lambda(1-a)b}(\lambda(1-a)b)^{k-x}}{(k-x)!} \times \frac{e^{-\lambda a(1-b)}(\lambda a(1-b))^{m-x}}{(m-x)!} \times \frac{e^{-\lambda(1-a)(1-b)}(\lambda(1-a)(1-b))^{N-m-k+x}}{(N-m-k+x)!}$$

This is the numerator. Conditioning on the margins means dividing by the two Poissons for the two rows, then by the binomial for the first column conditioning on the sum of the two rows (N). The denominator is

$$\Pr(m)\,\Pr(N-m)\,\Pr(k \mid N) = \left(\frac{e^{-\lambda a}(\lambda a)^{m}}{m!}\right) \left(\frac{e^{-\lambda(1-a)}(\lambda(1-a))^{N-m}}{(N-m)!}\right) \left(\binom{N}{k} b^{k}(1-b)^{N-k}\right)$$

In the ratio, all exponentials and powers cancel (both numerator and denominator contain $e^{-\lambda}\lambda^{N} a^{m}(1-a)^{N-m} b^{k}(1-b)^{N-k}$), leaving only

$$\frac{\dfrac{1}{x!\,(k-x)!\,(m-x)!\,(N-m-k+x)!}}{\dbinom{N}{k}\dfrac{1}{m!\,(N-m)!}} = \frac{\dfrac{m!}{x!\,(m-x)!} \cdot \dfrac{(N-m)!}{(k-x)!\,(N-k-m+x)!}}{\dbinom{N}{k}} = \frac{\dbinom{m}{x}\dbinom{N-m}{k-x}}{\dbinom{N}{k}}$$

This is the probability function for the hypergeometric: dhyper(x, m, N-m, k) = dhyper(x, m, n, k).
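The cancellation in the appendix can also be checked numerically. The sketch below uses hypothetical values for λ, a and b (any λ > 0 and a, b strictly between 0 and 1 should give the same result, since they cancel), with margins borrowed from the data example. The variable names are illustrative only.

```r
# Hypothetical Poisson parameters; they should drop out of the conditional probability
lambda <- 100
a <- 0.22
b <- 0.24

# Margins taken from the data example: row 1 total, column 1 total, grand total
m <- 22
k <- 24
N <- 100

# All values of x compatible with the margins
x <- max(0, k - (N - m)):min(m, k)

# Numerator: joint probability of the four independent Poisson cell counts
num <- dpois(x,             lambda * a * b) *
       dpois(k - x,         lambda * (1 - a) * b) *
       dpois(m - x,         lambda * a * (1 - b)) *
       dpois(N - m - k + x, lambda * (1 - a) * (1 - b))

# Denominator: Pr(m) * Pr(N - m) * Pr(k | N)
den <- dpois(m, lambda * a) *
       dpois(N - m, lambda * (1 - a)) *
       dbinom(k, N, b)

# Conditional probability of each table given the margins
cond <- num / den

# Should agree with the hypergeometric probabilities up to rounding error
print(all.equal(cond, dhyper(x, m, N - m, k)))
```

Changing `lambda`, `a` or `b` and rerunning is a quick way to see that the conditional distribution does not depend on them.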