CS 395T: Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin
Unit 8: Contingency Tables, Experimental Protocols, and All That
Unit 8: Contingency Tables, Experimental Protocols, and All That (Summary)

• We discuss contingency tables at some length
  – counts in a table
  – the null hypothesis is that the column label doesn't affect the row probability
    • or vice versa (the formula is symmetrical)
• The Pearson chi-square test applies when counts are all large
  – it is non-obvious (but true) that the Pearson statistic is chi-square distributed
  – all row and column marginals count as constraints (less one duplication)
• When counts are small we will see that there are subtle issues
  – under the null hypothesis, it is not obvious what the distribution of tables is
  – there is dependence on the exact protocol
• Three different protocols can give seemingly identical contingency tables
  – retrospective or case-control study (column marginals fixed)
  – prospective or longitudinal study (row marginals fixed)
  – cross-sectional or snapshot study (no marginals fixed)
• The three protocols have different table distributions in the null hypothesis
  – depending on unknown (nuisance) probabilities p, q, or both
  – digressions on
    • the hypergeometric distribution
    • the multinomial distribution
• Probability of a table with all marginals fixed
  – a purely combinatorial calculation giving the hypergeometric probability
  – is the basis of the Fisher Exact Test
    • even though it doesn't exactly apply to any of the actual protocols!
• Fisher Exact Test using the Wald statistic
  – analytically infeasible for tables larger than a few by a few, or with too many counts
• Why ordinal or cardinal tables should be "compressed"
  – that is, test the mean (or another moment), not all columns separately
• The permutation test is a fast and exact way of sampling the null hypothesis for contingency tables
  – breaks the association
  – requires expanding the contingency table to a dataset
• How to use the permutation test for the significance of contingency tables
  – requires expanding the contingency table to a dataset
  – combines nicely with compression of ordinal data [example: maternal drinking]
    • e.g., test difference of means, or of higher moments
    • if you test multiple moments, think about multiple hypothesis correction
      – but don't necessarily do it
  – or use the permutation test directly on (nominal) contingency tables
    • in which case you have to pick a statistic
      – Wald or Pearson are fine
• Contrast the permutation test to bootstrap resampling
  – one breaks the causal connection in the data, the other doesn't
  – one lives in the null-hypothesis world, the other in the (estimated) true population
  – therefore, one addresses false positives (p-values), the other false negatives
• On (nominal) contingency tables, the permutation test samples exactly the null population of Fisher's Exact Test
  – preserves all marginals
  – gives an easy Monte Carlo way to compute Fisher's Exact Test on contingency tables of any size
• Even if we can easily compute it, Fisher's Exact Test remains problematic
  – discrete values, and the data always lands on a tie
  – the 2-tailed test is "fragile" to irrelevant number-theoretic coincidences
  – and it still doesn't properly model an actual protocol!
• As good Bayesians, we marginalize over the unknown column and row probabilities
  – marginalizing can be achieved by sampling
  – we use the Dirichlet, the conjugate distribution to the multinomial
  – draw probabilities from Dirichlet, then draw a table from multinomial
  – discreteness effects are now much less troublesome
    • still some, since the tables are integer valued
    • and in any case we are matching the actual experimental protocol
• All tests are not equally powerful
  – compressing the columns in ordinal data can give a more powerful test
    • maternal drinking table example
    • not marginalizing over as big a space
Contingency Tables (a.k.a. cross-tabulation)

Ask: Is a gene more likely to be single-exon if it is AT-rich?

rowcon = [(g.ne == 1) (g.ne > 1)];
colcon = [(g.piso < 0.4) (g.piso > 0.6)];
crosstab(rowcon * (1:2)', colcon * (1:2)')
ans =
          48        2386         689
         220       13369        3982

Matlab's crosstab includes an annoying first column counting the "other"
genes (those satisfying neither column condition), so use my improved
function (below) instead:

table = contingencytable(rowcon,colcon)
table =
        2386         689
       13369        3982

Rows: single-exon (ne == 1) vs. multi-exon (ne > 1); columns: AT-rich
(piso < 0.4) vs. CG-rich (piso > 0.6). (There are fewer genes AT-rich
than CG-rich.)

sum(table,1)
ans =
       15755        4671          column marginals

ptable = table ./ repmat(sum(table,1),[2 1])
ptable =
    0.1514    0.1475
    0.8486    0.8525

So can we claim that these are statistically identical?
Or is the effect here also "significant but small"?

My contingency table function:

function table = contingencytable(rowcons, colcons)
nrow = size(rowcons,2);
ncol = size(colcons,2);
table = squeeze(sum( repmat(rowcons,[1 1 ncol]) .* ...
    permute(repmat(colcons,[1 1 nrow]),[1 3 2]), 1));
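As an aside (my observation, not from the slides): with 0/1 condition
matrices like rowcon and colcon above, the same cross-tabulation is just
a matrix product.

% Equivalent cross-tabulation by matrix multiplication; double() because
% Matlab will not multiply two logical matrices directly.
table2 = double(rowcon)' * double(colcon);
isequal(table2, table)     % should be true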
Chi-square (or Pearson) statistic for contingency tables

Notation: $N_{ij}$ is the count in row i and column j; a dot means summed
over, so $N_{i\cdot}$ and $N_{\cdot j}$ are the row and column marginals
and $N$ is the grand total.

Null hypothesis: row and column labels are independent, so the expected
value of $N_{ij}$ is

    \hat N_{ij} = N \, (N_{i\cdot}/N)(N_{\cdot j}/N) = N_{i\cdot} N_{\cdot j} / N

and the statistic is

    \chi^2 = \sum_{ij} \frac{(N_{ij} - \hat N_{ij})^2}{\hat N_{ij}}

• Are the conditions for a valid chi-square distribution satisfied? Yes,
  because the number of counts in all bins is large.
• If they were small, we couldn't use the fix-the-moments trick, because
  of the small number of bins (no CLT). This occurs often in biomedical
  data.
• So what then?

table =
        2386         689
       13369        3982

nhtable = sum(table,2)*sum(table,1)/sum(sum(table))
nhtable =
   1.0e+004 *
    0.2372    0.0703
    1.3383    0.3968

chis = sum(sum((table-nhtable).^2./nhtable))
chis =
    0.4369

p = chi2cdf(chis,1)          d.f. = 4 - 2 - 2 + 1 = 1
p =
    0.4914

Wow, you can't get less significant than this! There is no evidence of an
association between single-exon and AT- vs. CG-rich.
When counts are small, some subtle issues show up. Let's look closely.

The setup is a table of counts $N_{ij}$, whose columns are labeled by
"conditions" (e.g., healthy vs. sick) and whose rows are labeled by
"factors" (e.g., vaccinated vs. unvaccinated), along with its marginals
(totals; a dot means summed over): rows $N_{i\cdot}$, columns
$N_{\cdot j}$, and the grand total $N$.

The null hypothesis is: "Conditions and factors are unrelated."

To do a p-value test we must:
1. Invent a statistic that measures deviation from the null hypothesis.
2. Compute that statistic for our data.
3. Find the distribution of that statistic over the (unseen) population.

That last part is the hard part! What is the "population" of contingency
tables? We'll now see that it depends (albeit slightly) on the
experimental protocol, not just on the counts!
Protocol 1: Retrospective analysis or "case/control study"

The cases (column C1) already have the disease; we retrospectively look
at their factors. The column marginals are thus fixed by the design.

In the null hypothesis, both columns share the same row probabilities
q and (1-q). But we don't know q. It's a "nuisance parameter".
Digression on the hypergeometric distribution

The Texas Legislature has m Republicans and n Democrats. A committee of
size r is chosen randomly [not realistic!]. What is the probability
distribution of i, the number of Democrats on the committee?

    p(i) = \mathrm{hyper}(i;\; m+n,\; n,\; r) = \frac{\binom{n}{i}\,\binom{m}{r-i}}{\binom{m+n}{r}}

The first factor in the numerator is the number of ways to choose i
Democrats; the second is the number of ways to choose the r-i
Republicans; the denominator is the total number of ways to choose the
committee.
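Not from the slides: a quick Matlab check of this formula with made-up
committee sizes, verifying that the probabilities sum to 1. (The
Statistics Toolbox function hygepdf computes the same pmf.)

% Hypothetical sizes: m Republicans, n Democrats, committee of r.
m = 19; n = 12; r = 7;
i = max(0,r-m):min(n,r);     % feasible numbers of Democrats
p = arrayfun(@(k) nchoosek(n,k)*nchoosek(m,r-k), i) / nchoosek(m+n,r);
sum(p)                       % should be 1 (up to roundoff)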
Protocol 2: Prospective experiment or "longitudinal study"

Identify samples with the factors, then watch to see who gets the
disease. The row marginals are thus fixed by the design.

In the null hypothesis, both rows share the same column probabilities
p and (1-p). But we don't know p. It's now the nuisance parameter.
Protocol 3: Cross-sectional or snapshot study (no fixed marginals)

E.g., test all Austin residents for both disease and factors. The cell
counts are then drawn from a multinomial distribution.

Asymptotic methods (e.g., chi-square) are often equivalent to estimating
p and q, and thus the nuisance factors, from the data itself.
Digression on the multinomial distribution

On each i.i.d. try, exactly one of K outcomes occurs, with probabilities

    p_1, p_2, \ldots, p_K, \qquad \sum_{i=1}^{K} p_i = 1

For N tries, the probability of seeing exactly the outcome
$n_1, n_2, \ldots, n_K$ (with $\sum_{i=1}^{K} n_i = N$) is

    P(n_1, \ldots, n_K \mid N, p_1, \ldots, p_K) = \frac{N!}{n_1! \cdots n_K!}\, p_1^{n_1} p_2^{n_2} \cdots p_K^{n_K}

The product of powers is the probability of one specific outcome; the
multinomial coefficient counts the number of equivalent arrangements:
the N! arrangements are partitioned into the observed $n_i$'s. For
example, with N = 26:

    abcde fgh ijklmnop q rs tuvwxyz
    (12345)(123)(12345678)(1)(12)(1234567)
    n_1 = 5, n_2 = 3, n_3 = 8, n_4 = 1, n_5 = 2, n_6 = 7
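Not from the slides: a quick numerical sketch of this formula in Matlab,
using gammaln to evaluate the log of the multinomial coefficient without
overflow. The counts and probabilities here are made up.

% Hypothetical counts and probabilities (illustration only)
n = [5 3 8 1 2 7];
p = [0.2 0.1 0.3 0.05 0.05 0.3];
% log of N!/(n_1!...n_K!) plus sum of n_i log p_i
logP = gammaln(sum(n)+1) - sum(gammaln(n+1)) + sum(n.*log(p));
exp(logP)     % the multinomial probability P(n_1,...,n_K | N, p_1,...,p_K)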
So, in all three cases we got a product of "nuisance" probabilities
(depending on unknown p or q or both) and a "sufficient statistic"
conditioned on all the marginals.

Fisher's Exact Test just throws away the nuisance factors and uses the
sufficient statistic. This can also be seen to be the (purely
combinatorial) probability of the table with all marginals fixed:

    P(n_{00} = k) = \frac{\binom{N_{0\cdot}}{k}\,\binom{N_{1\cdot}}{N_{\cdot 0}-k}}{\binom{N}{N_{\cdot 0}}}

The numerator is the number of partitions with $n_{00} = k$; the
denominator is the numerator summed over k (Vandermonde's identity,
below). With all marginals fixed, $n_{00}$ determines the whole table:
the table is fully determined by k alone.

Vandermonde's identity:

    \binom{m+n}{r} = \sum_i \binom{n}{i}\,\binom{m}{r-i}

Proof: How many ways can you choose a subcommittee of size r from a
committee with n Democrats and m Republicans? And how many Democrats are
on the subcommittee?
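Not from the slides: a sketch evaluating this table probability in Matlab
via the equivalent factorial form
$N_{0\cdot}!\,N_{1\cdot}!\,N_{\cdot 0}!\,N_{\cdot 1}! / (N!\,\prod_{ij} n_{ij}!)$,
using gammaln for overflow safety. The example table is the one analyzed
later in these notes.

tab = [8 3; 16 26];
r = sum(tab,2); c = sum(tab,1); N = sum(tab(:));
% log of the hypergeometric probability of exactly this table
logP = sum(gammaln([r(:); c(:)]+1)) - gammaln(N+1) - sum(gammaln(tab(:)+1));
exp(logP)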
That was all about the distribution of tables in the null hypothesis.
Now we complete the rest of the tail-test paradigm.

The most popular choice of statistic for 2x2 tables is the "Wald
statistic" (sorry for the slight change in notation!), here written as it
is implemented in the wald function below:

    t = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p\,(1-\hat p)\,\left(1/n_{\cdot 1} + 1/n_{\cdot 2}\right)}},
    \qquad \hat p_j = \frac{n_{1j}}{n_{\cdot j}}, \quad
    \hat p = \frac{n_{11}+n_{12}}{n_{\cdot 1}+n_{\cdot 2}}

This is constructed so that it asymptotically becomes a true t-value.

You could instead use the Pearson (chi-square) statistic, but not the
assumption that it is chi-square distributed.
So, the Fisher Exact Test looks like this. Is this table a significant
result?

• Compute the statistic for the data.
• Loop over all possible contingency tables with the same marginals
  – for 2x2 there is just one free parameter
• Compute the statistic for each table in the loop.
• Accumulate the weight (by hypergeometric probability) of statistic
  <, =, > the data statistic.
• Output the p-value (or, because of discreteness effects, the range), as
  sketched below.
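A minimal sketch of that loop for the 2x2 case (my code, not from the
slides). It uses the wald function defined later in these notes, gammaln
for the hypergeometric weights, and the table analyzed later.

% Enumerate all 2x2 tables with the same marginals as tab; the one free
% parameter is k = n11. Weight each table by its hypergeometric probability.
tab = [8 3; 16 26];
r = sum(tab,2); c = sum(tab,1); N = sum(tab(:));
logw = @(t) sum(gammaln([r(:); c(:)]+1)) - gammaln(N+1) - sum(gammaln(t(:)+1));
plo = 0; peq = 0; phi = 0;                  % weight of statistic <, =, > data
for k = max(0, c(1)-r(2)) : min(r(1), c(1))
    t = [k, r(1)-k; c(1)-k, r(2)-c(1)+k];   % table determined by k alone
    s = wald(t) - wald(tab);                % wald() is defined later
    w = exp(logw(t));
    if s < 0, plo = plo + w; elseif s > 0, phi = phi + w; else peq = peq + w; end
end
[phi, phi + peq]                            % right-tail p-value range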
Although the Fisher Exact Test generalizes to larger contingency tables,
the computational effort increases exponentially.

An implementation for the 2x2 case can be downloaded from Matlab Central
(user-contributed):

Fisherextest(8,3,16,26)
Fisher's exact test P-values for cells:
a = 8, b = 3, c = 16, d = 26
----------------------------------------------
  Left tail       Right tail      2-tail
  (negative)      (positive)      (both)
----------------------------------------------
  0.0430013347    0.9922567550    0.0497619245
----------------------------------------------

Best stick to the two-tailed answer, since you'd probably change your
mind about which tail if you saw something significant on the other!

Editorial: Apart from the words "Fisher" (true) and "exact" (questionable)
in its name, this test doesn't have much to recommend it. We will soon
learn an efficient way to compute it, but we will also learn more about
why it remains problematic. I just don't understand why it is so widely
used, since virtually never are all marginals held fixed (none of
Protocols 1, 2, 3 above)! This is the frequentist's worship of sufficient
statistics at its worst!
To learn more, let's play with a different data set: the maternal
drinking contingency table (entered in Matlab a few slides below), which
tabulates infant abnormalities against maternal alcohol consumption.

Different "standard methods" applied to this data get p-values ranging
from 0.005 to 0.190 (Agresti 1992).

The Fisher Exact Test is not a viable option here, both because of the
computational workload and because we only derived the 2x2 case!

So what should we really do?
A sensible computational approach to the maternal drinking contingency
table is based on the permutation test:

• Define a test statistic that actually reflects your hypothesis
  – here, use the ordinal nature of the columns
  – "more drinks lead to more abnormalities"
  – the obvious statistic is "difference of mean number of drinks in the
    two rows"
  – if a threshold effect is plausible, might also try "difference of
    mean of squares"
    • but must think about multiple hypothesis correction
• Expand the contingency table back to the data set that it came from
  – every count its own data point
• Permutation-test the expanded data set to determine the significance of
  the test statistic
  – different from bootstrap resampling, as we'll see
  – histogram
  – p-value
Example: Permutation test on a 2x2 contingency table

Start with a table whose rows are labeled f, g and whose columns are
labeled A, B:

        A   B
    f   1   3
    g   2   1

Expand the table back to data, one (row label, column label) pair per
count, then randomly permute the second column:

    f A         f B
    f B         f A
    f B         f A
    f B   -->   f B
    g A         g B
    g A         g A
    g B         g B

(Note: the expanded dataset will always have just two columns, for any
size contingency table.)

Construct the new table from the permuted data:

        A   B
    f   2   2
    g   1   2

Compute and save the statistic, then repeat. (Notice that all marginals
are preserved – we will come back to this point.)
Let's apply these lessons to the maternal drinking table.
Input the table and display the means and their differences:

table = [17066 14464 788 126 37; 48 38 5 1 1]
table =
       17066       14464         788         126          37
          48          38           5           1           1

sum(table(:))
ans =
       32574

drinks = [0 0.5 1.5 4. 6.];     a reasonable quantification of the ordinal
drinksq = drinks.^2;            categories: exactness isn't important,
                                since we get to define the statistic
norm = sum(table,2);
mudrinks = (table * drinks')./norm
mudrinks =
     0.2814
    0.39247

mudrinksq = (table * drinksq')./norm
mudrinksq =
    0.26899
    0.78226

diff = [-1 1] * mudrinks
diff =
    0.11108

diffsq = [-1 1] * mudrinksq
diffsq =
    0.51327

These are our chosen "statistics". The question is: are either of them
statistically significant? We'll use the permutation test to find out.
Expand the table back to a dataset of length 32574:

[row col] = ndgrid(1:2,1:5)     This tells each cell its row and column number.
row =
     1     1     1     1     1
     2     2     2     2     2
col =
     1     2     3     4     5
     1     2     3     4     5

d = [];
for k=1:numel(table); d = [d; repmat([row(k),col(k)],table(k),1)]; end;

(Darn it, I couldn't think of a way to do this in Matlab without an
explicit loop, thus spoiling my no-loop record.)*

size(d)
ans =
       32574           2

Yes, it has the dimensions we expect.

accumarray(d,1,[2,5])
ans =
       17066       14464         788         126          37
          48          38           5           1           1

And we can reconstruct the original table.

mean(drinks(d(d(:,1)==2,2)))
ans =
    0.39247

And we get the right mean, so it looks like we are good to go…

*Peter Perkins (MathWorks) suggests the wonderfully obscure
    d = [rldecode(table,row) rldecode(table,col)];
where rldecode is one of Peter Acklam's Matlab Tips and Tricks.
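For reference, here is a minimal run-length decode in the same spirit (my
sketch, not Acklam's actual code; it assumes every count is positive, as
they are here):

function out = rldecode(len, v)
% Repeat each element v(k) len(k) times, without an explicit loop.
len = len(:); v = v(:);
ends = cumsum(len);              % end position of each run
idx = zeros(ends(end),1);
idx([1; ends(1:end-1)+1]) = 1;   % mark the start of each run
out = v(cumsum(idx));            % cumsum turns the marks into run indices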
Compute the statistic for the data and for 1000 permutations:

The idea is to sample from the null hypothesis (no association) while
keeping the distributions of each single variable unchanged. Do this by
permuting a label that is irrelevant in the null hypothesis.

diffmean = @(d) mean(drinks(d(d(:,1)==2,2))) - mean(drinks(d(d(:,1)==1,2)));
diffmean(d)
ans =
    0.11108

diffmean([d(randperm(size(d,1)),1) d(:,2)])     Try one permutation just to see it work.
ans =
    0.014027

perms = arrayfun(@(x) diffmean([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(perms(perms>diffmean(d)))/numel(perms)
pval =
    0.015

hist(perms,(-.15:.01:.3))

So, the association is (fairly) significant. This is the chance of seeing
as large a difference of means as we see, under the null hypothesis: a
false positive rate.
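A small aside (my suggestion, not from the slides): a common refinement
of the Monte Carlo p-value counts the observed statistic itself as one of
the null samples, which treats ties conservatively and avoids reporting
an impossible p = 0:

% Include the observed value among the null samples (a standard Monte
% Carlo p-value correction; uses perms and diffmean(d) from above).
pval = (1 + sum(perms >= diffmean(d))) / (1 + numel(perms))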
Same analysis for the squared-drinks statistic:

diffmeansq = @(d) mean(drinksq(d(d(:,1)==2,2))) - mean(drinksq(d(d(:,1)==1,2)));
diffmeansq(d)
ans =
    0.51327

permsq = arrayfun(@(x) diffmeansq([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(permsq(permsq>diffmeansq(d)))/numel(permsq)
pval =
    0.011

hist(permsq,(-.3:.05:1))
• Should we apply a multiple hypothesis correction to both pvals
  (multiply by 2)? Probably not.
  – mean and mean-of-squares are highly correlated, and
  – the previous result was significant
  – we're not just shopping uniform p-values
• But, if your data can stand it, Bonferroni is the gold standard.
• Can you think of a principled way to do multiple hypothesis correction
  with highly correlated tests? (One possibility is sketched below.)
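One answer to that question (my sketch, not from the slides) is the
max-statistic permutation idea: standardize each statistic by its own
permutation distribution, then compare the observed maximum to the
permutation distribution of the maximum. Because each permutation yields
both statistics at once, their correlation is handled automatically
(the Westfall-Young approach).

% Max-statistic family-wise correction, reusing the expanded dataset d
% and the diffmean, diffmeansq statistics defined above.
stats = @(dd) [diffmean(dd), diffmeansq(dd)];
nperm = 1000;
nullstats = zeros(nperm,2);
for k = 1:nperm
    dp = [d(randperm(size(d,1)),1) d(:,2)];   % permute the row labels
    nullstats(k,:) = stats(dp);
end
mu = mean(nullstats); sd = std(nullstats);
zobs  = (stats(d) - mu) ./ sd;                              % standardized observed statistics
znull = (nullstats - repmat(mu,nperm,1)) ./ repmat(sd,nperm,1);
pval  = mean(max(znull,[],2) >= max(zobs))                  % corrected p-value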
This was not bootstrap resampling! The latter tells us how much variation
in the signal one might see in repeated identical experiments. The
permutation test breaks the causal connection; the bootstrap doesn't. The
bootstrap can be useful in understanding why another experiment didn't
see the effect (a false negative).

diffmean(d(randsample(size(d,1),size(d,1),true),:))     Try one resample just to see it work.
ans =
    0.20703

resamp = arrayfun(@(x) diffmean(d(randsample(size(d,1),size(d,1),true),:)), [1:1000]);
bias = mean(resamp) - diffmean(d)                       If this is large, we should worry.
bias =
    0.0014876

resamp = resamp - bias;
pval = numel(resamp(resamp<0))/numel(resamp)
pval =
    0.078

(Recall diffmean(d) = 0.11108.)

hist(resamp,[-.1:.01:.5])

Here the pval is a false negative rate: how often would a repetition of
the experiment show an effect with a negative difference of the means?

Moral: Bootstrap resampling and sampling from the null hypothesis (e.g.,
by permutation) are different things!
Let's go back to the small-count 2x2 contingency table, for which we have
(so far) only the Fisher Exact Test.

What if we do permutation resampling here? Aha! Since the permutation
preserves all marginals, it is a Monte Carlo approximation of the Fisher
Exact Test. And it works for any size table! And it is easy to compute!
But is it really any good?

Code up the Wald statistic:

function t = wald(tab)
m = tab(1,1);               % top-left count
n = tab(1,2);               % top-right count
mm = m + tab(2,1);          % first column total
nn = n + tab(2,2);          % second column total
p1 = m/mm;                  % fraction of column 1 in row 1
p2 = n/nn;                  % fraction of column 2 in row 1
p = (m+n)/(mm+nn);          % pooled fraction
t = (p1-p2)/sqrt(p*(1-p)*(1/mm+1/nn));

table = [8 3; 16 26]
table =
     8     3
    16    26

tdata = wald(table)
tdata =
    2.0542

The data show about a 2 standard deviation effect, except that they're
not really standard deviations because of the small counts!
Expand the table and generate permutations as before:

[row col] = ndgrid(1:2,1:2);
d = [];
for k=1:numel(table); d = cat(1,d,repmat([row(k),col(k)],table(k),1)); end;
size(d)
ans =
    53     2

accumarray(d,1,[2,2])            Check that we recover the original table.
ans =
     8     3
    16    26

gen = @(x) wald(accumarray(cat(2,d(randperm(size(d,1)),1),d(:,2)),1,[2,2]));
gen(1)                           Try one permutation just to see it work.
ans =
   -0.6676

perms = arrayfun(gen,1:100000);  It's fast, so we can easily do lots of permutations.
hist(perms,(-4:.1:5))
cdfplot(perms)
We get discrete values because only a few discrete tables are possible.
(In the histogram, the data value lands on a spike of ties; the negative
of the data value does not.)

pvalgt = numel(perms(perms > wald(table)))/numel(perms)
pvalge = numel(perms(perms >= wald(table)))/numel(perms)
pvaltt = (numel(perms(perms > wald(table))) + numel(perms(perms < -wald(table))))/numel(perms)
pvaltte = (numel(perms(perms >= wald(table))) + numel(perms(perms <= -wald(table))))/numel(perms)
pvalwrongtail = numel(perms(perms > -wald(table)))/numel(perms)

pvalgt = 0.00812
pvalge = 0.04382
pvaltt = 0.01498
pvaltte = 0.05068
pvalwrongtail = 0.99314

We reproduce the values from Fisherextest.m, and can now see what
arbitrary choices its coder (or Fisher) made.
Fundamental flaws in the "exact" approach:

• Because your data value always lands on a tie, it's either
  over-conservative or under-conservative
  – some people split the difference
• Because the negative of your data value (almost) never lands on a tie,
  the two-tailed test is fragile
  – might be virtually the same as one-tailed, as in our example
  – or might be hugely (more than a factor of 2) different
• In fact, the whole darn thing is fragile to irrelevant
  "number-theoretic coincidences" about the values of the marginals
• And we've already seen what the fundamental problem is
  – real protocols don't fix both sets of marginals
  – Fisher's elegant elimination of the nuisance parameters p and/or q is
    a trap
• We must face up to estimating a distribution for the nuisance
  parameters (p's and/or q's) and marginalizing over them
  – this makes us Bayesians in a non-Bayesian (p-value) world
  – but it's OK! ("I.J. Good's Bayes-Frequentist compromise")
  – it's even "modern"
[Figure omitted; credit: Mike West (Duke).]
We again need "conjugate distributions": we want to estimate parameters
from observed counts. Recall that Beta is the conjugate distribution to
Binomial:

    P(n \mid N, q) = \binom{N}{n}\, q^n (1-q)^{N-n}

    P(q \mid N, n) \propto q^n (1-q)^{N-n}\, P(q)        (Bayes, with prior)

A "conjugate prior" is one that preserves the functional form of the
distribution:

    P(q) \propto q^{\alpha} (1-q)^{\beta}

(\alpha = \beta = 0 is a perfectly good choice: a flat prior on q.)

So the conjugate distribution is

    P(q \mid N, n) = \frac{q^{n+\alpha} (1-q)^{N-n+\beta}}{\int_0^1 q^{n+\alpha} (1-q)^{N-n+\beta}\, dq}
                   = \frac{\Gamma(N+\alpha+\beta+2)}{\Gamma(n+\alpha+1)\,\Gamma(N-n+\beta+1)}\, q^{n+\alpha} (1-q)^{N-n+\beta}
                   \;\sim\; \mathrm{Beta}(n+\alpha+1,\; N-n+\beta+1)

This Beta distribution has

    \mathrm{mean} = \frac{n+\alpha+1}{N+\alpha+\beta+2}, \qquad
    \mathrm{var} = \frac{(n+\alpha+1)(N-n+\beta+1)}{(N+\alpha+\beta+2)^2 (N+\alpha+\beta+3)}

Matlab, Mathematica, and NR3 all have methods for generating random Beta
deviates.
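A quick numerical check (mine, not from the slides), using betarnd from
the Statistics Toolbox with a flat prior (alpha = beta = 0) and made-up
counts:

% Sample the posterior Beta(n+1, N-n+1) and compare the sample mean to
% the analytic mean (n+1)/(N+2). The counts here are hypothetical.
N = 24; n = 8;
qs = betarnd(n+1, N-n+1, 100000, 1);
[mean(qs), (n+1)/(N+2)]      % should agree to a few decimal places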
Relevant to contingency tables other than 2x2, Dirichlet is the conjugate
distribution to Multinomial.

Multinomial distribution (you can derive it by "repeated binomial" or by
combinatorics, as we did earlier):

    P(n_1, n_2, \ldots \mid N, q_1, q_2, \ldots) = \frac{N!}{n_1!\, n_2! \cdots}\, q_1^{n_1} q_2^{n_2} \cdots
    \qquad \left(\sum n_i = N,\ \sum q_i = 1\right)

Conjugate distribution, using conjugate priors:

    P(q_1, q_2, \ldots \mid N, n_1, n_2, \ldots) \propto q_1^{n_1+\alpha_1} q_2^{n_2+\alpha_2} \cdots

The normalization turns out to give the Dirichlet distribution:

    P(q_1, q_2, \ldots \mid N, n_1, n_2, \ldots) = \frac{\Gamma(N + \alpha_1 + 1 + \alpha_2 + 1 + \cdots)}{\Gamma(n_1+\alpha_1+1)\,\Gamma(n_2+\alpha_2+1) \cdots}\; q_1^{n_1+\alpha_1} q_2^{n_2+\alpha_2} \cdots

Rather amazingly, there is a simple way to generate a (non-independent)
set of q deviates from an independent set of Gamma deviates:

    y_i \sim \mathrm{Gamma}(n_i + \alpha_i + 1), \quad
    p(y_i) = \frac{y_i^{n_i+\alpha_i}\, e^{-y_i}}{\Gamma(n_i+\alpha_i+1)}, \qquad
    q_i = y_i \Big/ \sum_i y_i

(In fact, in the case K = 2, this is how Beta deviates [previous slide]
are usually generated.)
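A minimal sketch of that Gamma trick in Matlab (mine, not from the
slides), with flat priors (all alpha_i = 0) and hypothetical counts; the
same trick appears inside the tabnullsamp function below.

% Draw one Dirichlet deviate from independent Gamma(n_i+1,1) deviates.
n = [11 42];                 % hypothetical counts (flat prior, alpha_i = 0)
y = gamrnd(n+1, 1);          % independent Gamma deviates
q = y ./ sum(y)              % a Dirichlet draw; the q_i sum to 1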
In case it's not obvious that sampling over q is the same as
marginalizing over it, here's a formal proof:

We generate a deviate pair (f, q) by choosing a q, and then,
independently, an f given q, so

    p(f, q) = p(q)\, p(f \mid q)

We then ignore q, so our f is drawn from the distribution

    p(f) = \int p(f, q)\, dq = \int p(q)\, p(f \mid q)\, dq

which is the same as the desired marginalization

    p(f) = \int p(f \mid q)\, p(q)\, dq

(This is so close to self-evident that I'm not sure the proof adds
anything!)
So let's reanalyze, assuming that the condition (column) marginals were
fixed by the protocol, and Bayes-sample the row probabilities. (You'll
never believe my sampling function unless I go through an example!)

>> table = [8 3; 16 26]
table =
     8     3
    16    26

>> marfix = sum(table,1)'            column marginals (transposed)
marfix =
    24
    29

>> marvar = sum(table,2)'            row marginals (transposed)
marvar =
    11    42

>> gammas = gamrnd(marvar+1,1)       ~ Gamma(ni+1)
gammas =
   12.1000   44.5735

>> q = gammas ./ sum(gammas)         generated (random) row probabilities qi
q =
    0.2135    0.7865

>> qmat = repmat(q,[size(table,2),1])
qmat =
    0.2135    0.7865
    0.2135    0.7865

Finally, we generate multinomial deviates for each column, using the
generated row probabilities:

>> tabout = mnrnd(marfix,qmat)'
tabout =
     4     8
    20    21

The reason everything is done in the transpose is the way that Matlab's
mnrnd function expects its arguments to be shaped. Sorry about that!
Encapsulate the sampling process into a function. Then generate a bunch
of samples and look at their Wald statistics.

function tabout = tabnullsamp(tabin)
marfix = sum(tabin,1)';              % fixed column marginals
marvar = sum(tabin,2)';              % row marginals, to be Bayes-sampled
q = gamrnd(marvar+1,1);              % Gamma deviates -> Dirichlet row probabilities
qmat = repmat(q./sum(q),[size(tabin,2),1]);
tabout = mnrnd(marfix,qmat)';        % a multinomial draw for each column

wald(table)
ans =
    2.0542

tabnullsamp(table), tabnullsamp(table)
ans =
     6     7
    18    22
ans =
     7     9
    17    20

wald(tabnullsamp(table))
ans =
    1.0392

samps = arrayfun(@(x) wald(tabnullsamp(table)), 1:30000);
hist(samps,-4:.05:4)
cdfplot(samps)
There are still discreteness effects (after all, these are integer
tables), but they are less troubling:

pval = numel(samps(samps>=wald(table)))/numel(samps)
pvaltt = (numel(samps(samps>=wald(table)))+numel(samps(samps<= -wald(table))))/numel(samps)

pval =
    0.022967
pvaltt =
    0.040067

One-tail vs. two-tail is now much more reasonable.

This is probably the most honest answer that we can get for the
significance of this particular contingency table.
Let's reanalyze the maternal drinking data using the same methodology,
but (just for variety) with the Pearson statistic:

function chis = pearson(table)
nhtable = sum(table,2)*sum(table,1)/sum(sum(table));
chis = sum(sum((table-nhtable).^2./nhtable));

table = [17066 14464 788 126 37; 48 38 5 1 1]'
table =
       17066          48
       14464          38
         788           5
         126           1
          37           1

We transpose to make the unfixed marginals be the rows, as before (this
is a guess as to what the protocol actually was!); the columns are the
"conditions".

pearson(table)
ans =
    12.082

samps = arrayfun(@(x) pearson(tabnullsamp(table)), 1:10000);
hist(samps,0:0.25:90)
Giving the results:

pval = numel(samps(samps>pearson(table)))/numel(samps)
pval =
    0.0408

Why is this less significant than the previous analysis, which gave
p = 0.015 or 0.011 (mean or square-mean)? Because here we didn't use the
fact that the factors were ordinal and quantitatively related. By virtue
of using that information, and compressing the rows, the previous method
was more powerful.
Actually, it seems likely that this data was a cross-sectional study with
no fixed marginals. If so, a better sampling of the null hypothesis would
be:

function tabout = tabnullsamp2(tabin)
marcol = sum(tabin,1);               % column marginals
marrow = sum(tabin,2);               % row marginals
ntot = sum(marcol);
q = gamrnd(marcol+1,1);              % Dirichlet-sample column probabilities
q = q./sum(q);
p = gamrnd(marrow+1,1);              % Dirichlet-sample row probabilities
p = p./sum(p);
pq = p * q;                          % outer product: null-hypothesis cell probabilities
tabout = reshape(mnrnd(ntot,pq(:)'),size(tabin));

samps = arrayfun(@(x) pearson(tabnullsamp2(table)), 1:10000);
hist(samps,0:0.25:90)
pval = numel(samps(samps>pearson(table)))/numel(samps)
pval =
    0.0445

Different from the previous answer: protocol matters!

This would probably be the "honest" answer if the table were nominal, not
ordinal. But, as said, since it is ordinal, the previous analysis using
the difference of means is more powerful.

You now know all you need to know about contingency tables – and much
more than almost everyone who uses them!