BIOINF 2118
#17 - exact tests
Page 1 of 4
Fisher exact test for 2x2 tables
A very commonly used test for independence.
The setting: 2x2 table, testing H0: the two factors (row & column) are independent.
For tables with small counts, the chi-square test and the likelihood ratio test may perform poorly.
(The actual Type I error rate may be far from the stated level.)
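To make the small-count problem concrete, here is a minimal sketch in R. The table is hypothetical (it is not from these notes) and was chosen so that every expected count is 2; the chi-square warning is silenced on purpose.

```r
# Hypothetical 2x2 table with small counts (not from the notes).
tab <- matrix(c(3, 1, 1, 3), nrow = 2)

# Expected counts under independence; all are well below 5 here.
expected <- suppressWarnings(chisq.test(tab)$expected)

# Chi-square P-value without continuity correction. R warns that the
# approximation may be incorrect, so the warning is silenced on purpose.
p_chisq <- suppressWarnings(chisq.test(tab, correct = FALSE)$p.value)

# Exact two-sided P-value, conditional on the margins.
p_exact <- fisher.test(tab)$p.value

print(expected)
print(c(chi_square = p_chisq, exact = p_exact))
```

For this table the two P-values differ noticeably, which is the kind of disagreement the exact test is meant to avoid.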
The exact test of R.A. Fisher uses as the sample space:
{all tables: the row and column totals = the observed totals}
                 Y=1    Y=2    row totals
X=1              M11    M12    M1+
X=2              M21    M22    M2+
column totals    M+1    M+2    n
The test will be “conditional on the margins”.
Then once M11 is fixed, all the rest of the table is fixed. So the conditional probability of each
table = the conditional probability of M11.
The sample space corresponds to
M11 ∈ {max(0, M1+ - M+2), ..., min(M1+, M+1)}
(Notice that M1+ - M+2 = M+1 - M2+.)
If the rows and columns X and Y are independent, then the distribution of M11 is
hypergeometric.
See also the end of this document, which shows how to get the hypergeometric from
independent Poissons by conditioning on the margins.
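A minimal sketch of this sample space in R. The margins are hypothetical, and the variable names (row_totals, col_totals, support, tables) are introduced here only for illustration. It lists every table compatible with the margins and attaches the hypergeometric probability of M11 to each.

```r
# Hypothetical margins, chosen only for illustration.
row_totals <- c(5, 7)   # M1+, M2+
col_totals <- c(4, 8)   # M+1, M+2

# Range of M11 allowed by the margins.
lo <- max(0, row_totals[1] - col_totals[2])
hi <- min(row_totals[1], col_totals[1])
support <- lo:hi

# Conditional probability of each table = hypergeometric probability of M11.
prob <- dhyper(support, m = row_totals[1], n = row_totals[2], k = col_totals[1])

# Each value of M11 determines the whole table (filled column by column:
# M11, M21, M12, M22).
tables <- lapply(support, function(m11) {
  matrix(c(m11, col_totals[1] - m11,
           row_totals[1] - m11, row_totals[2] - col_totals[1] + m11),
         nrow = 2)
})

print(data.frame(M11 = support, prob = round(prob, 4)))
print(sum(prob))  # the probabilities over the sample space add up to 1
```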
The notation is variable.

Above:               This doc (and R):      Some books:
M11  M21  |          x    .    | k          y    .    | A
M12  M22  |          .    .    | N-k        .    .    | B
          | n        m    n    | N          n    N-n  | N
Fisher’s exact test: Use X as the test statistic (ordering by “surprise”)
one-sided P-value = Pr(X ≤ xobs) (lower tail) or Pr(X ≥ xobs) (upper tail)
two-sided P-value = Σ Pr(X = x),
where the sum is over all x with Pr(X = x) ≤ Pr(X = xobs).
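The three P-values can be sketched in R as follows. The counts are hypothetical; m, n and k follow the argument names of R's dhyper (m and n are the two totals on one margin, k is the total on the other margin that contains the observed cell). The small tolerance for ties is an assumption added to guard against floating-point error, and fisher.test() is used only as a cross-check.

```r
# Hypothetical margins and observed count, for illustration only.
m <- 8; n <- 12; k <- 6; x_obs <- 5

support <- max(0, k - n):min(m, k)
p <- dhyper(support, m, n, k)
p_obs <- dhyper(x_obs, m, n, k)

p_lower <- sum(p[support <= x_obs])   # Pr(X <= x_obs)
p_upper <- sum(p[support >= x_obs])   # Pr(X >= x_obs)

# Two-sided: add up every outcome no more probable than the observed one.
# The small relative tolerance guards against floating-point ties.
p_two <- sum(p[p <= p_obs * (1 + 1e-7)])

# The same table passed to fisher.test() for comparison.
tab <- matrix(c(x_obs, k - x_obs, m - x_obs, n - k + x_obs), nrow = 2)
print(c(lower = p_lower, upper = p_upper, two_sided = p_two,
        fisher = fisher.test(tab)$p.value))
```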
Example 1: Given N balls in an urn, n black and m white,
k balls are chosen randomly, and independent of color (so says H0!).
Let X=#of chosen balls which are white. Then
$$\Pr(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}, \qquad (*)$$
over the range max(0, k - n) ≤ x ≤ min(m, k).
In R:
dhyper(x, m, n, k, log = FALSE)
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
rhyper(nn, m, n, k)
x = the number of white balls drawn = the observation.
m = the number of white balls in the urn.
n = the number of black balls in the urn.
k = the number of balls drawn from the urn.
So m + n = N, the total number of balls in the urn (the sample size of the 2x2 table).
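As a quick check that dhyper implements (*), the sketch below evaluates the binomial-coefficient formula directly with choose() and compares the two. The urn sizes are hypothetical.

```r
# Hypothetical urn: m white balls, n black balls, k balls drawn.
m <- 5; n <- 7; k <- 4

# All values of x with positive probability.
x <- max(0, k - n):min(m, k)

# Formula (*) written out with binomial coefficients.
by_formula <- choose(m, x) * choose(n, k - x) / choose(m + n, k)

# dhyper should agree with the formula.
print(all.equal(by_formula, dhyper(x, m, n, k)))

# phyper (lower tail) is the running sum of dhyper.
print(all.equal(cumsum(by_formula), phyper(x, m, n, k)))
```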
Example 2: Given N mice, m male and n female,
k mice are mutants, and “sex does not affect the risk of mutation” (H0!).
Let X = # of mutants which are male. Then Pr(X = x | n, m, k) = (*).
“Case-control studies” work like this too.
We know n + m = sample size and k = # cases in advance, but
m = # people with risk factor
is not known in advance. X = # cases with risk factor.
Example 3: Given N people, m of them have the polymorphism for gene G1,
k of the N have the polymorphism for gene G2,
and “the polymorphisms are independent” (H0!).
Let X = # of people with both polymorphisms. Then Pr(X = x | n, m, k) = (*).
Data example for Example 3:
             G2 var      G2 wt         total
G1 var       10 (x)      12            22 (m)
G1 wt        14          64            78 (n)
total        24 (k)      76 (N-k)      100 (N)
“var” = minor allele, “wt” = wild type, or major allele
Under independence, E(X | n, m, k) = km/N = 24 x 22 / 100 = 5.28 (roughly 100 x (1/5) x (1/5) = 4, rounding both margin proportions to 1/5).
H0: independence
HA: The G1 and G2 variants are more likely together. (One-sided alternative.)
P-value = Pr(X ≥ 10 | n, m, k) = 1 - phyper(9, 22, 78, 24)
[1] 0.01065388
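The same one-sided P-value can be reached in three ways, sketched below: the upper tail of phyper, a direct sum of dhyper over the upper tail, and fisher.test() with a one-sided alternative. The matrix layout and dimnames are assumptions made here for readability. All three should agree with the value reported above.

```r
# Data example: rows are G1 (var, wt), columns are G2 (var, wt).
tab <- matrix(c(10, 14, 12, 64), nrow = 2,
              dimnames = list(G1 = c("var", "wt"), G2 = c("var", "wt")))

# Upper tail Pr(X >= 10), without the "1 -" subtraction.
p_tail <- phyper(9, 22, 78, 24, lower.tail = FALSE)

# The same tail as an explicit sum over x = 10, ..., min(m, k) = 22.
p_sum <- sum(dhyper(10:22, 22, 78, 24))

# Fisher's exact test with the one-sided alternative "odds ratio > 1".
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_tail, dhyper_sum = p_sum, fisher = p_fisher))
```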
======================
The ODDS RATIO is an important measure of association in 2 by 2 tables.
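The odds ratio is:
Population:
$$\mathrm{OR} = \frac{\mathrm{odds}(\text{row 1} \mid \text{col 1})}{\mathrm{odds}(\text{row 1} \mid \text{col 2})} = \frac{\mathrm{odds}(\text{col 1} \mid \text{row 1})}{\mathrm{odds}(\text{col 1} \mid \text{row 2})}$$
Sample estimate:
$$\widehat{\mathrm{OR}} = \frac{M_{11}/M_{21}}{M_{12}/M_{22}} = \frac{M_{11}/M_{12}}{M_{21}/M_{22}}$$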
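A good rough estimate of the variance of log OR is
$$\widehat{\mathrm{var}}(\log \widehat{\mathrm{OR}}) \approx \frac{1}{M_{11}} + \frac{1}{M_{12}} + \frac{1}{M_{21}} + \frac{1}{M_{22}}$$
(courtesy of the Delta Method, see EXTRA TOPIC-delta method and variance stabilization.docx).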
Example: In the table above,
OR = (10 / 14) / (12 / 64) = 3.809524
var(log OR) ≈ 1/10 + 1/14 + 1/12 + 1/64 = 0.2703869
So a rough confidence interval for log OR is log(3.809524) ± 1.96 * sqrt(0.2703869), which is the interval
(0.3183289, 2.35667969), and a rough confidence interval for OR itself is (exp(0.3183289), exp(2.35667969)) =
(1.374828, 10.555843).
An “exact” method uses the noncentral hypergeometric distribution, parametrized by ψ = log(OR).
From fisher.test(), we get (1.197927, 11.801156).
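The rough interval can be reproduced with a few lines of R, sketched below with variable names chosen here for illustration. Note that the standard error is the square root of the variance estimate. The last line asks fisher.test() for its conditional interval, for comparison with the values quoted above.

```r
# Data example: columns are G2 var, G2 wt; rows are G1 var, G1 wt.
tab <- matrix(c(10, 14, 12, 64), nrow = 2)

# Sample odds ratio (M11 / M21) / (M12 / M22).
or_hat <- (tab[1, 1] / tab[2, 1]) / (tab[1, 2] / tab[2, 2])

# Delta-method variance of log OR: the sum of reciprocal cell counts.
var_log_or <- sum(1 / tab)
se_log_or <- sqrt(var_log_or)

# Rough 95% interval on the log scale, then back-transformed.
ci_log_or <- log(or_hat) + c(-1, 1) * 1.96 * se_log_or
ci_or <- exp(ci_log_or)

print(c(OR = or_hat, var_log_OR = var_log_or))
print(ci_or)

# Conditional ("exact") interval from fisher.test(), for comparison.
print(fisher.test(tab)$conf.int)
```

The point estimate printed by fisher.test() is the conditional maximum likelihood estimate, so it need not equal the sample odds ratio computed here.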
Lots of great stuff about 2x2 tables:
Agresti and Min, Simple improved confidence intervals for comparing matched proportions, Stat in Med 2005;
Agresti and Caffo, Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures (Teacher's Corner), American Statistician 2000;
Richard Darlington, Some New 2 x 2 Tests.
APPENDIX --- JUST FOR FUN/CURIOSITY:
From four independent Poissons to the hypergeometric,
by conditioning on the margins.
x        k - x            | k
m - x    N - m - k + x    | N - k
m        n                | N
Suppose that
x ~ Pois(λαβ), k - x ~ Pois(λ(1-α)β),
m - x ~ Pois(λα(1-β)), N - m - k + x ~ Pois(λ(1-α)(1-β)),
and they’re independent. This corresponds to (*) with n = N - m.
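The joint distribution of the four Poissons (which are independent) is
$$\Pr(x,\; k-x,\; m-x,\; N-m-k+x) = \frac{e^{-\lambda\alpha\beta}(\lambda\alpha\beta)^{x}}{x!} \times \frac{e^{-\lambda(1-\alpha)\beta}(\lambda(1-\alpha)\beta)^{k-x}}{(k-x)!} \times \frac{e^{-\lambda\alpha(1-\beta)}(\lambda\alpha(1-\beta))^{m-x}}{(m-x)!} \times \frac{e^{-\lambda(1-\alpha)(1-\beta)}(\lambda(1-\alpha)(1-\beta))^{N-m-k+x}}{(N-m-k+x)!}$$
This is the numerator.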
Conditioning on the margins means dividing by the two Poissons for the margin totals m and N - m, then by the binomial for the other margin total k, conditional on the grand total N. The denominator is
$$\underbrace{\frac{e^{-\lambda\alpha}(\lambda\alpha)^{m}}{m!}}_{\Pr(m)} \times \underbrace{\frac{e^{-\lambda(1-\alpha)}(\lambda(1-\alpha))^{N-m}}{(N-m)!}}_{\Pr(N-m)} \times \underbrace{\binom{N}{k}\beta^{k}(1-\beta)^{N-k}}_{\Pr(k \mid N)}$$
In the ratio, all exponentials and powers cancel, leaving only
$$\frac{\dfrac{1}{x!\,(k-x)!\,(m-x)!\,(N-m-k+x)!}}{\dfrac{1}{m!\,(N-m)!}\dbinom{N}{k}} = \frac{\dfrac{m!}{x!\,(m-x)!} \cdot \dfrac{(N-m)!}{(k-x)!\,(N-k-m+x)!}}{\dbinom{N}{k}} = \frac{\dbinom{m}{x}\dbinom{N-m}{k-x}}{\dbinom{N}{k}}$$
This is the probability function for the hypergeometric:
dhyper(x, m, N-m, k) = dhyper(x, m, n, k).
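The conditioning argument can be checked numerically. The sketch below uses arbitrary, hypothetical values for the rate and the two proportions (named lambda, alpha and beta here), builds the numerator from four Poisson probabilities, divides by the two Poisson margins and a binomial for k given N, and compares the result with dhyper.

```r
# Arbitrary, hypothetical parameter values and margins.
lambda <- 20; alpha <- 0.3; beta <- 0.4
N <- 20; m <- 8; k <- 6

# Values of x compatible with the margins.
x <- max(0, k - (N - m)):min(m, k)

# Numerator: joint probability of the four independent Poisson cells.
numerator <- dpois(x, lambda * alpha * beta) *
  dpois(k - x, lambda * (1 - alpha) * beta) *
  dpois(m - x, lambda * alpha * (1 - beta)) *
  dpois(N - m - k + x, lambda * (1 - alpha) * (1 - beta))

# Denominator: Pr(m) * Pr(N - m) * Pr(k | N).
denominator <- dpois(m, lambda * alpha) *
  dpois(N - m, lambda * (1 - alpha)) *
  dbinom(k, N, beta)

# The ratio should not depend on lambda, alpha or beta.
print(all.equal(numerator / denominator, dhyper(x, m, N - m, k)))
```

Changing lambda, alpha or beta should leave the ratio unchanged, which is what conditioning on the margins is meant to achieve.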