Computational Statistics with Application to Bioinformatics
Prof. William H. Press, Spring Term 2008
The University of Texas at Austin, CS 395T

Unit 8: Contingency Tables, Experimental Protocols, and All That

Unit 8 Summary

• We discuss contingency tables at some length
  – counts in a table
  – the null hypothesis is that the column label doesn't affect the row probability
    • or vice versa (the formula is symmetrical)
• The Pearson chi-square test applies when counts are all large
  – it is non-obvious (but true) that the Pearson statistic is chi-square distributed
  – all row and column marginals count as constraints (less one duplication)
• When counts are small, we will see that there are subtle issues
  – under the null hypothesis, it is not obvious what the distribution of tables is
  – there is dependence on the exact protocol
• Three different protocols can give seemingly identical contingency tables
  – retrospective or case-control study (column marginals fixed)
  – prospective or longitudinal study (row marginals fixed)
  – cross-sectional or snapshot study (no marginals fixed)
• The three protocols have different table distributions under the null hypothesis
  – depending on unknown (nuisance) probabilities p, q, or both
  – with digressions on the hypergeometric and multinomial distributions
• The probability of a table with all marginals fixed
  – is a purely combinatorial calculation giving a hypergeometric probability
  – is the basis of the Fisher Exact Test
    • even though it doesn't exactly apply to any of the actual protocols!
• The Fisher Exact Test using the Wald statistic
  – is analytically infeasible for tables larger than a few by a few, or with too many counts
• Why ordinal or cardinal tables should be "compressed"
  – that is, test the mean (or another moment), not all columns separately
• The permutation test is a fast and exact way of sampling the null hypothesis for contingency tables
  – it breaks the association
  – it requires expanding the contingency table to a dataset
• How to use the permutation test for the significance of contingency tables
  – it combines nicely with compression of ordinal data [example: maternal drinking]
    • e.g., test the difference of means, or of higher moments
    • if you test multiple moments, think about multiple hypothesis correction (but don't necessarily do it)
  – or use the permutation test directly on (nominal) contingency tables
    • in which case you have to pick a statistic; Wald or Pearson are fine
• Contrast the permutation test to bootstrap resampling
  – one breaks the causal connection in the data, the other doesn't
  – one lives in the null-hypothesis world, the other in the (estimated) true population
  – therefore, one addresses false positives (p-values), the other false negatives
• On (nominal) contingency tables, the permutation test samples exactly the null population of Fisher's Exact Test
  – it preserves all marginals
  – it gives an easy Monte Carlo way to compute Fisher's Exact Test on contingency tables of any size
• Even if we can easily compute it, Fisher's Exact Test remains problematic
  – discrete values, and the data always lands on a tie
  – the two-tailed test is "fragile" to irrelevant number-theoretic coincidences
  – and it still doesn't properly model an actual protocol!
• As good Bayesians, we marginalize over the unknown column and row probabilities
  – marginalizing can be achieved by sampling
  – we use Dirichlet, the conjugate distribution to multinomial
  – draw probabilities from Dirichlet, then draw a table from multinomial
  – discreteness effects are now much less troublesome
    • still some, since the tables are integer-valued
    • and in any case we are matching the actual experimental protocol
• All tests are not equally powerful
  – compressing the columns of ordinal data can give a more powerful test
    • maternal drinking table example
    • we are not marginalizing over as big a space

Contingency Tables (a.k.a. cross-tabulation)

Ask: Is a gene more likely to be single-exon if it is AT-rich?

rowcon = [(g.ne == 1) (g.ne > 1)];
colcon = [(g.piso < 0.4) (g.piso > 0.6)];
crosstab(rowcon * (1:2)', colcon * (1:2)')
ans =
      48    2386     689
     220   13369    3982

The first column holds the annoying counts of "other" genes (those satisfying neither column condition), so here is my improved function (code below):

table = contingencytable(rowcon,colcon)
table =
    2386     689
   13369    3982

sum(table, 1)     % column marginals
ans =
   15755    4671

(fewer genes AT-rich than CG-rich)

ptable = table ./ repmat(sum(table,1),[2 1])
ptable =
    0.1514    0.1475
    0.8486    0.8525

So can we claim that these column probabilities are statistically identical? Or is the effect here also "significant but small"? My contingency table function:

function table = contingencytable(rowcons, colcons)
nrow = size(rowcons,2);
ncol = size(colcons,2);
table = squeeze(sum( repmat(rowcons,[1 1 ncol]) .* ...
    permute(repmat(colcons,[1 1 nrow]),[1 3 2]), 1 ));

Chi-square (or Pearson) statistic for contingency tables

Notation: the counts are $N_{ij}$, with marginals $N_{i\cdot} = \sum_j N_{ij}$, $N_{\cdot j} = \sum_i N_{ij}$, and total $N = \sum_{ij} N_{ij}$. Under the null hypothesis, the expected value of $N_{ij}$ is $E_{ij} = N_{i\cdot} N_{\cdot j} / N$, and the statistic is

$$\chi^2 = \sum_{ij} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$$

• Are the conditions for a valid chi-square distribution satisfied? Yes, because the number of counts in all bins is large.
• If the counts were small, we couldn't use the fix-the-moments trick, because with a small number of bins there is no CLT. This occurs often in biomedical data.
• So what then? (We'll come back to this.)

table =
    2386     689
   13369    3982
nhtable = sum(table,2)*sum(table,1)/sum(sum(table))
nhtable =
  1.0e+004 *
    0.2372    0.0703
    1.3383    0.3968
chis = sum(sum((table-nhtable).^2./nhtable))
chis = 0.4369
p = chi2cdf(chis,1)     % d.f. = 4 - 2 - 2 + 1 = 1
p = 0.4914

Wow, you can't get less significant than this! No evidence of an association between single-exon and AT- vs. CG-rich.

When counts are small, some subtle issues show up. Let's look closely. The setup is a table of counts $N_{ij}$ whose rows are "factors" (e.g., vaccinated vs. unvaccinated) and whose columns are "conditions" (e.g., healthy vs. sick), with marginals written with a dot for a summed-over index ($N_{i\cdot}$, $N_{\cdot j}$). The null hypothesis is: "Conditions and factors are unrelated." To do a p-value test we must:

1. Invent a statistic that measures deviation from the null hypothesis.
2. Compute that statistic for our data.
3. Find the distribution of that statistic over the (unseen) population. That's the hard part!

What is the "population" of contingency tables? We'll now see that it depends (albeit slightly) on the experimental protocol, not just on the counts!

Protocol 1: Retrospective analysis or "case/control study"

The individuals in column $C_1$ already have the disease; we retrospectively look at their factors. The column marginals are fixed by the design. In the null hypothesis, both columns share the row probabilities q and (1 - q). But we don't know q. It's a "nuisance parameter".
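The algebra on this slide did not survive extraction; here is a standard reconstruction under stated assumptions (Protocol 1: column marginals $n_{\cdot 0}$, $n_{\cdot 1}$ fixed, null-hypothesis row-0 probability $q$ shared by both columns):

$$P(\text{table} \mid q)
= \binom{n_{\cdot 0}}{n_{00}} q^{n_{00}} (1-q)^{n_{10}}
  \times \binom{n_{\cdot 1}}{n_{01}} q^{n_{01}} (1-q)^{n_{11}}
= q^{n_{0\cdot}} (1-q)^{n_{1\cdot}} \binom{n_{\cdot 0}}{n_{00}} \binom{n_{\cdot 1}}{n_{01}}$$

The first factor depends only on the nuisance parameter $q$ and the row marginals; the second is the combinatorial factor that, conditioned on all the marginals, becomes the hypergeometric probability of the next digression. This is the "product of nuisance probabilities times a sufficient statistic" pattern that the discussion below refers to.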
Digression on the hypergeometric distribution

The Texas Legislature has m Republicans and n Democrats. A committee of size r is chosen randomly [not realistic!]. What is the probability distribution of i, the number of Democrats on the committee? It is the number of ways to choose i Democrats, times the number of ways to choose r - i Republicans, divided by the total number of ways to choose the committee:

$$p(i) = \mathrm{hyper}(i;\, m+n,\, n,\, r) = \frac{\binom{n}{i}\binom{m}{r-i}}{\binom{m+n}{r}}$$

Protocol 2: Prospective experiment or "longitudinal study"

Identify samples with and without the factors, then watch to see who gets the disease. Now the row marginals are fixed. In the null hypothesis, both rows share the column probabilities p and (1 - p). But we don't know p. It's now the nuisance parameter.

Protocol 3: Cross-sectional or snapshot study (no fixed marginals)

E.g., test all Austin residents for both disease and factors. Only the total count is fixed, and the whole table is a single draw from a multinomial distribution (digression below). Asymptotic methods (e.g., chi-square) are often equivalent to estimating p and q, and thus the nuisance factors, from the data itself.

Digression on the multinomial distribution

On each i.i.d. try, exactly one of K outcomes occurs, with probabilities $p_1, p_2, \ldots, p_K$, $\sum_{i=1}^{K} p_i = 1$. For N tries, the probability of seeing exactly the outcome counts $n_1, n_2, \ldots, n_K$, $\sum_{i=1}^{K} n_i = N$, is

$$P(n_1, \ldots, n_K \mid N, p_1, \ldots, p_K) = \frac{N!}{n_1! \cdots n_K!}\, p_1^{n_1} p_2^{n_2} \cdots p_K^{n_K}$$

where $p_1^{n_1} \cdots p_K^{n_K}$ is the probability of one specific outcome and the multinomial coefficient counts the equivalent arrangements. Example with N = 26: partitioning the alphabet as (abcde)(fgh)(ijklmnop)(q)(rs)(tuvwxyz) has $n_1 = 5$, $n_2 = 3$, $n_3 = 8$, $n_4 = 1$, $n_5 = 2$, $n_6 = 7$; the N! arrangements of the letters are partitioned into the observed $n_i$'s.

So, in all three protocols we got a product of "nuisance" probabilities (depending on unknown p or q or both) and a combinatorial "sufficient statistic" factor conditioned on all the marginals. Fisher's Exact Test just throws away the nuisance factors and uses the sufficient statistic, which can also be seen as the purely combinatorial probability of the table with all marginals fixed:

$$P(n_{00} = k \mid \text{all marginals}) = \frac{\binom{n_{\cdot 0}}{k}\binom{n_{\cdot 1}}{n_{0\cdot}-k}}{\binom{N}{n_{0\cdot}}}$$

The numerator counts the partitions with $n_{00} = k$; the denominator is the numerator summed over k. With all marginals fixed, the table is fully determined by k alone. That the denominator sum collapses to a single binomial coefficient is Vandermonde's identity:

$$\binom{m+n}{r} = \sum_i \binom{n}{i}\binom{m}{r-i}$$

Proof: How many ways can you choose a subcommittee of size r from a committee with n Democrats and m Republicans? Count them according to how many Democrats end up on the subcommittee.
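A quick numerical check of the hypergeometric formula and Vandermonde's identity above (a minimal sketch; the committee sizes are made up, and hygepdf is the Statistics Toolbox's hypergeometric pmf):

m = 30; n = 20; r = 10;             % hypothetical: 30 Republicans, 20 Democrats, committee of 10
i = 0:min(n,r);                     % possible numbers of Democrats on the committee
pman = arrayfun(@(k) nchoosek(n,k)*nchoosek(m,r-k), i) / nchoosek(m+n,r);
ptbx = hygepdf(i, m+n, n, r);       % the same distribution, built in
max(abs(pman - ptbx))               % should be ~0, up to round-off
sum(pman)                           % should be 1: that's Vandermonde's identity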
That was all about the distribution of tables in the null hypothesis. Now we complete the rest of the tail-test paradigm. The most popular choice of statistic for 2x2 tables is the "Wald statistic" (sorry for the slight change in notation!):

$$t = \frac{\hat p_0 - \hat p_1}{\sqrt{\hat p\,(1-\hat p)\left(\frac{1}{n_{\cdot 0}} + \frac{1}{n_{\cdot 1}}\right)}},
\qquad \hat p_0 = \frac{n_{00}}{n_{\cdot 0}},\quad \hat p_1 = \frac{n_{01}}{n_{\cdot 1}},\quad \hat p = \frac{n_{0\cdot}}{N}$$

This is constructed so that it asymptotically becomes a true t-value. You could instead use the Pearson (chi-square) statistic, but not the assumption that it is chi-square distributed.

So, the Fisher Exact Test looks like this (is our table [8 3; 16 26] a significant result?):

• Compute the statistic for the data.
• Loop over all possible contingency tables with the same marginals
  – for a 2x2 table there is just one free parameter.
• Compute the statistic for each table in the loop.
• Accumulate the weight (by hypergeometric probability) of statistic <, =, > the data statistic.
• Output the p-value (or, because of discreteness effects, the range). A sketch of this loop follows.
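A minimal sketch of the loop just described, for the 2x2 case. The table and the Wald statistic come from the slides; the helper names (ptable, waldstat, kmin, kmax) and the use of gammaln to avoid factorial overflow are this sketch's own choices:

tab = [8 3; 16 26];
r = sum(tab,2); c = sum(tab,1); N = sum(tab(:));
% hypergeometric probability of a 2x2 table with all marginals fixed
ptable = @(t) exp( gammaln(r(1)+1) + gammaln(r(2)+1) + gammaln(c(1)+1) ...
    + gammaln(c(2)+1) - gammaln(N+1) - sum(gammaln(t(:)+1)) );
% the Wald statistic (same formula as the wald() function defined later)
waldstat = @(t) (t(1,1)/c(1) - t(1,2)/c(2)) / ...
    sqrt( (r(1)/N)*(1-r(1)/N)*(1/c(1) + 1/c(2)) );
tdata = waldstat(tab);
plo = 0; peq = 0; phi = 0;
kmin = max(0, c(1)-r(2));           % feasible range of k = n00
kmax = min(r(1), c(1));
for k = kmin:kmax
    t = [k, r(1)-k; c(1)-k, r(2)-c(1)+k];   % k determines the whole table
    w = ptable(t);
    s = waldstat(t);
    if s < tdata, plo = plo + w;            % accumulate weight below,
    elseif s > tdata, phi = phi + w;        % above,
    else, peq = peq + w; end                % and at the data statistic (ties)
end
[plo, peq, phi]     % sums to 1; the right-tail p-value range is [phi, phi+peq]

For this table the tail weights should match, up to Monte Carlo error, the permutation-test values computed later in this unit.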
Although the Fisher Exact Test generalizes to larger contingency tables, the computational effort increases exponentially. A user-contributed implementation can be downloaded from Matlab Central:

Fisherextest(8,3,16,26)

Fisher's exact test P-values for cells: a = 8, b = 3, c = 16, d = 26
----------------------------------------------
  Left tail      Right tail     2-tail
  (negative)     (positive)     (both)
----------------------------------------------
  0.0430013347   0.9922567550   0.0497619245
----------------------------------------------

Best to stick to the two-tailed answer, since you'd probably change your mind about which tail to use if you saw something significant on the other!

Editorial: Apart from the words "Fisher" (true) and "exact" (questionable) in its name, this test doesn't have much to recommend it. We will soon learn an efficient way to compute it, but we will also learn more about why it remains problematic. I just don't understand why it is so widely used, since virtually never are all marginals actually held fixed (that is none of Protocols 1, 2, or 3 above)! This is the frequentist's worship of sufficient statistics at its worst!

To learn more, let's play with a different data set: a table of infant congenital malformations (rows: absent, present) against maternal alcohol consumption (columns: five ordinal drinking levels; the table itself is input below). Different "standard methods" applied to this data get p-values ranging from 0.005 to 0.190 (Agresti 1992). The Fisher Exact Test is not a viable option, both because of the computational workload and because we only derived the 2x2 case! So what should we really do?

A sensible computational approach to the maternal drinking contingency table is based on the permutation test:

• Define a test statistic that actually reflects your hypothesis
  – here, use the ordinal nature of the columns: "more drinks lead to more abnormalities"
  – the obvious statistic is "difference of the mean number of drinks in the two rows"
  – if a threshold effect is plausible, might also try "difference of the mean of squares"
    • but then we must think about multiple hypothesis correction
• Expand the contingency table back to the data set that it came from
  – every count becomes its own data point
• Permutation-test the expanded data set to determine the significance of the test statistic
  – different from bootstrap resampling, as we'll see
  – yields a histogram of the null distribution, and a p-value

Example: permutation test on a 2x2 contingency table.

Original table:
      A   B
  f   1   3
  g   2   1

Expand the table back to data, one (factor, condition) pair per count. Note that the expanded data will always have just two columns, for any size contingency table:
  f A
  f B
  f B
  f B
  g A
  g A
  g B

Randomly permute the second column:
  f B
  f A
  f A
  f B
  g B
  g A
  g B

Construct the new table:
      A   B
  f   2   2
  g   1   2

Compute and save the statistic, then repeat. (Notice that all the marginals are preserved; we will come back to this point.)

Let's apply these lessons to the maternal drinking table. Input the table and display the means and their differences:

table = [17066 14464 788 126 37; 48 38 5 1 1]
table =
   17066   14464     788     126      37
      48      38       5       1       1
sum(table(:))
ans = 32574

drinks = [0 0.5 1.5 4. 6.];   % a reasonable quantification of the ordinal categories;
drinksq = drinks.^2;          % exactness isn't important, since we get to define the statistic
norm = sum(table,2);
mudrinks = (table * drinks')./norm
mudrinks =
    0.2814
    0.39247
mudrinksq = (table * drinksq')./norm
mudrinksq =
    0.26899
    0.78226
diff = [-1 1] * mudrinks
diff = 0.11108
diffsq = [-1 1] * mudrinksq
diffsq = 0.51327

These differences are our chosen "statistics". The question is: Is either of them statistically significant? We'll use the permutation test to find out. First, expand the table back to a dataset of length 32574:

[row col] = ndgrid(1:2,1:5)
row =
     1     1     1     1     1
     2     2     2     2     2
col =
     1     2     3     4     5
     1     2     3     4     5

This tells each cell its row and column number. (Darn it, I couldn't think of a way to do the expansion in Matlab without an explicit loop, thus spoiling my no-loop record.*)

d = [];
for k=1:numel(table); d = [d; repmat([row(k),col(k)],table(k),1)]; end;
size(d)
ans = 32574 2        % yes, it has the dimensions we expect

accumarray(d,1,[2,5])
ans =
   17066   14464     788     126      37
      48      38       5       1       1
% and we can reconstruct the original table

mean(drinks(d(d(:,1)==2,2)))
ans = 0.39247
% and we get the right mean, so it looks like we are good to go...

*Peter Perkins (MathWorks) suggests the wonderfully obscure
d = [rldecode(table,row) rldecode(table,col)];
where rldecode is one of Peter Acklam's Matlab Tips and Tricks.

Now compute the statistic for the data and for 1000 permutations. The idea is to sample from the null hypothesis (no association) while keeping the distributions of each single variable unchanged. Do this by permuting a label that is irrelevant in the null hypothesis:

diffmean = @(d) mean(drinks(d(d(:,1)==2,2))) - mean(drinks(d(d(:,1)==1,2)));
diffmean(d)
ans = 0.11108
diffmean([d(randperm(size(d,1)),1) d(:,2)])   % try one permutation just to see it work
ans = 0.014027
perms = arrayfun(@(x) diffmean([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(perms(perms>diffmean(d)))/numel(perms)
pval = 0.015
hist(perms,(-.15:.01:.3))

So the association is (fairly) significant. The p-value is the chance of seeing as large a difference of means as we actually see, under the null hypothesis: a false positive rate.

Same analysis for the squared-drinks statistic:

diffmeansq = @(d) mean(drinksq(d(d(:,1)==2,2))) - mean(drinksq(d(d(:,1)==1,2)));
diffmeansq(d)
ans = 0.51327
permsq = arrayfun(@(x) diffmeansq([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(permsq(permsq>diffmeansq(d)))/numel(permsq)
pval = 0.011
hist(permsq,(-.3:.05:1))

• Should we apply a multiple hypothesis correction to both pvals (multiply by 2)? Probably not:
  – the mean and the mean-of-squares are highly correlated, and
  – the previous result was already significant
  – we're not just shopping among uniform p-values
• But, if your data can stand it, Bonferroni is the gold standard.
• Can you think of a principled way to do multiple hypothesis correction with highly correlated tests? (One possible answer is sketched below.)
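One principled answer to that last question, not from the slides: a permutation "max-statistic" correction (Westfall-Young style), which applies the same permutations to both statistics and thus automatically accounts for their correlation. This sketch reuses d, diffmean, diffmeansq, perms, and permsq from the code above:

% standardize each statistic by its own permutation distribution, then
% compare the data's larger z-score against the permutation maxima
mu1 = mean(perms);  s1 = std(perms);      % null mean/sd, difference of means
mu2 = mean(permsq); s2 = std(permsq);     % null mean/sd, difference of mean-squares
nperm = 1000;
zmax = zeros(1,nperm);
for k = 1:nperm
    dp = [d(randperm(size(d,1)),1) d(:,2)];   % one shared permutation for both tests
    zmax(k) = max( (diffmean(dp)-mu1)/s1, (diffmeansq(dp)-mu2)/s2 );
end
zdata = max( (diffmean(d)-mu1)/s1, (diffmeansq(d)-mu2)/s2 );
pval_corrected = mean(zmax >= zdata)      % jointly corrected over both tests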
This was not bootstrap resampling! The bootstrap tells us how much variation in the signal one might see in repeated identical experiments: the permutation test breaks the causal connection, while the bootstrap doesn't. The bootstrap can be useful in understanding why another experiment didn't see the effect (a false negative):

diffmean(d(randsample(size(d,1),size(d,1),true),:))   % try one resample just to see it work
ans = 0.20703
resamp = arrayfun(@(x) diffmean(d(randsample(size(d,1),size(d,1),true),:)), [1:1000]);
bias = mean(resamp) - diffmean(d)   % if this is large, we should worry
bias = 0.0014876
resamp = resamp - bias;
pval = numel(resamp(resamp<0))/numel(resamp)
pval = 0.078
hist(resamp,[-.1:.01:.5])   % recall diffmean(d) = 0.11108

Here the pval is a false negative rate: it estimates how often a repetition of the experiment would show an effect with a negative difference of the means. Moral: bootstrap resampling and sampling from the null hypothesis (e.g., by permutation) are different things!

Let's go back to the small-count 2x2 contingency table, for which we have (so far) only the Fisher Exact Test. What if we do permutation resampling here? Aha! Since the permutation preserves all the marginals, it is a Monte Carlo approximation of the Fisher Exact Test. And it works for any size table! And it is easy to compute! But is it really any good? Code up the Wald statistic:

function t = wald(tab)
m = tab(1,1); n = tab(1,2);
mm = m + tab(2,1); nn = n + tab(2,2);
p1 = m/mm; p2 = n/nn;
p = (m+n)/(mm+nn);
t = (p1-p2)/sqrt(p*(1-p)*(1/mm+1/nn));

table = [8 3; 16 26]
table =
     8     3
    16    26
tdata = wald(table)
tdata = 2.0542

The data show about a two standard deviation effect, except that these aren't really standard deviations, because of the small counts! Expand the table and generate permutations as before:

[row col] = ndgrid(1:2,1:2);
d = [];
for k=1:numel(table); d = cat(1,d,repmat([row(k),col(k)],table(k),1)); end;
size(d)
ans = 53 2
accumarray(d,1,[2,2])
ans =
     8     3
    16    26
% check that we recover the original table

gen = @(x) wald(accumarray(cat(2,d(randperm(size(d,1)),1),d(:,2)),1,[2,2]));
gen(1)   % try one permutation just to see it work
ans = -0.6676
perms = arrayfun(gen,1:100000);   % it's fast, so we can easily do lots of permutations
hist(perms,(-4:.1:5))
cdfplot(perms)

We get discrete values of the statistic because only a few discrete tables are possible. (In the histogram, the data value lands on a tie; the negative of the data value does not.)

pvalgt = numel(perms(perms>wald(table)))/numel(perms)
pvalge = numel(perms(perms>=wald(table)))/numel(perms)
pvaltt = (numel(perms(perms > wald(table)))+numel(perms(perms < -wald(table))))/numel(perms)
pvaltte = (numel(perms(perms >= wald(table)))+numel(perms(perms <= -wald(table))))/numel(perms)
pvalwrongtail = numel(perms(perms > -wald(table)))/numel(perms)

pvalgt = 0.00812
pvalge = 0.04382
pvaltt = 0.01498
pvaltte = 0.05068
pvalwrongtail = 0.99314

We reproduce the values from Fisherextest.m, and can now see what arbitrary choices its coder (or Fisher) made.
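The next slide notes that "some people split the difference" on ties; that convention (the so-called mid-p value) counts the tied weight at half value. A one-line sketch using the perms vector above (float equality is safe here, because tied tables produce bit-identical statistics):

pvalmid = (numel(perms(perms > wald(table))) + 0.5*numel(perms(perms == wald(table)))) / numel(perms)
% lands between pvalgt = 0.00812 and pvalge = 0.04382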
Fundamental flaws in the "exact" approach:

• Because your data value always lands on a tie, the test is either over-conservative or under-conservative
  – some people split the difference
• Because the negative of your data value (almost) never lands on a tie, the two-tailed test is fragile
  – it might be virtually the same as the one-tailed test, as in our example
  – or it might be hugely different (much more than a factor of 2)
• In fact, the whole darn thing is fragile to irrelevant "number-theoretic coincidences" in the values of the marginals
• And we've already seen what the fundamental problem is
  – real protocols don't fix both sets of marginals
  – Fisher's elegant elimination of the nuisance parameters p and/or q is a trap
• We must face up to estimating a distribution for the nuisance parameters (p's and/or q's) and marginalizing over them
  – this makes us Bayesians in a non-Bayesian (p-value) world
  – but it's OK! ("I. J. Good's Bayes-Frequentist compromise")
  – it's even "modern"

[Figure slide omitted; credit: Mike West (Duke).]

We again need "conjugate distributions", since we want to estimate parameters from observed counts. Recall that Beta is the conjugate distribution to Binomial:

$$P(n \mid N, q) = \binom{N}{n} q^n (1-q)^{N-n}$$

Bayes, with a prior:

$$P(q \mid N, n) \propto q^n (1-q)^{N-n}\, P(q)$$

A "conjugate prior" is one that preserves the functional form of the distribution,

$$P(q) \propto q^{\alpha} (1-q)^{\beta}$$

($\alpha = \beta = 0$ is a perfectly good choice: a flat prior on q). So the posterior is

$$P(q \mid N, n) = \frac{q^{n+\alpha}(1-q)^{N-n+\beta}}{\int_0^1 q^{n+\alpha}(1-q)^{N-n+\beta}\, dq}
= \frac{\Gamma(N+\alpha+\beta+2)}{\Gamma(n+\alpha+1)\,\Gamma(N-n+\beta+1)}\, q^{n+\alpha}(1-q)^{N-n+\beta}
\;\sim\; \mathrm{Beta}(n+\alpha+1,\; N-n+\beta+1)$$

This Beta distribution has

$$\text{mean} = \frac{n+\alpha+1}{N+\alpha+\beta+2},
\qquad \text{var} = \frac{(n+\alpha+1)(N-n+\beta+1)}{(N+\alpha+\beta+2)^2 (N+\alpha+\beta+3)}$$

Matlab, Mathematica, and NR3 all have methods for generating random Beta deviates.

Relevant to contingency tables other than 2x2, Dirichlet is the conjugate distribution to Multinomial. The multinomial distribution (you can derive it by "repeated binomial", or by combinatorics as we did earlier) is

$$P(n_1, n_2, \ldots \mid N, q_1, q_2, \ldots) = \frac{N!}{n_1!\, n_2! \cdots}\, q_1^{n_1} q_2^{n_2} \cdots,
\qquad \sum_i n_i = N,\quad \sum_i q_i = 1$$

The conjugate distribution, using conjugate priors, is

$$P(q_1, q_2, \ldots \mid N, n_1, n_2, \ldots) \propto q_1^{n_1+\alpha_1} q_2^{n_2+\alpha_2} \cdots$$

The normalization turns out to be the Dirichlet distribution:

$$P(q_1, q_2, \ldots \mid N, n_1, n_2, \ldots)
= \frac{\Gamma\!\big(N + (\alpha_1+1) + (\alpha_2+1) + \cdots\big)}{\Gamma(n_1+\alpha_1+1)\,\Gamma(n_2+\alpha_2+1) \cdots}\, q_1^{n_1+\alpha_1} q_2^{n_2+\alpha_2} \cdots$$

Rather amazingly, there is a simple way to generate a (non-independent) set of q deviates from an independent set of Gamma deviates:

$$q_i = \frac{y_i}{\sum_j y_j}, \qquad y_i \sim \mathrm{Gamma}(n_i+\alpha_i+1),
\quad p(y_i) = \frac{y_i^{\,n_i+\alpha_i}\, e^{-y_i}}{\Gamma(n_i+\alpha_i+1)}$$

(In fact, in the case K = 2, this is how the Beta deviates above are usually generated.)
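A minimal sketch of the Gamma trick, on made-up counts, using the Statistics Toolbox's gamrnd. The K = 2 check at the end reproduces the Beta posterior mean from above (flat priors, all $\alpha_i = 0$):

n = [8 3 16 26];                 % hypothetical counts, K = 4
y = gamrnd(n + 1, 1);            % independent Gamma(n_i + 1) deviates
q = y ./ sum(y)                  % one Dirichlet draw: q_i >= 0, sum(q) == 1
% K = 2 sanity check: y1/(y1+y2) should be Beta(n1+1, n2+1) distributed
nsamp = 100000;
y1 = gamrnd(n(1)+1, 1, nsamp, 1);
y2 = gamrnd(n(2)+1, 1, nsamp, 1);
qs = y1 ./ (y1 + y2);
[mean(qs), (n(1)+1)/(n(1)+n(2)+2)]   % sample mean vs. Beta mean; should agree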
In case it's not obvious that sampling over q is the same as marginalizing over it, here's a formal proof. We generate a deviate pair (f, q) by choosing a q and then, independently, an f given q, so

$$p(f, q) = p(q)\, p(f \mid q)$$

We then ignore q, so our f is drawn from the distribution

$$p(f) = \int p(f, q)\, dq = \int p(q)\, p(f \mid q)\, dq$$

which is the same as the desired marginalization

$$p(f) = \int p(f \mid q)\, p(q)\, dq$$

(This is so close to self-evident that I'm not sure the proof adds anything!)

So let's reanalyze, assuming that the condition (column) marginals were fixed by the protocol, and Bayes-sample the row probabilities. (You'll never believe my sampling function unless I go through an example!)

>> table = [8 3; 16 26]
table =
     8     3
    16    26
>> marfix = sum(table,1)'          % column marginals (transposed)
marfix =
    24
    29
>> marvar = sum(table,2)'          % row marginals (transposed)
marvar =
    11    42
>> gammas = gamrnd(marvar+1,1)     % ~ Gamma(n_i + 1)
gammas =
   12.1000   44.5735
>> q = gammas ./ sum(gammas)       % generated (random) row probabilities q_i
q =
    0.2135    0.7865
>> qmat = repmat(q,[size(table,2),1])   % finally, we generate multinomial deviates for
qmat =                                  % each column, using the generated row probabilities
    0.2135    0.7865
    0.2135    0.7865
>> tabout = mnrnd(marfix,qmat)'
tabout =
     4     8
    20    21

(Everything is done in the transpose because of the way that Matlab's mnrnd function expects its arguments to be shaped. Sorry about that!) Encapsulate the sampling process into a function; then generate a bunch of samples and look at their Wald statistics:

function tabout = tabnullsamp(tabin)
marfix = sum(tabin,1)';
marvar = sum(tabin,2)';
q = gamrnd(marvar+1,1);
qmat = repmat(q./sum(q),[size(tabin,2),1]);
tabout = mnrnd(marfix,qmat)';

wald(table)
ans = 2.0542
tabnullsamp(table), tabnullsamp(table)
ans =
     6     7
    18    22
ans =
     7     9
    17    20
wald(tabnullsamp(table))
ans = 1.0392
samps = arrayfun(@(x) wald(tabnullsamp(table)), 1:30000);
hist(samps,-4:.05:4)
cdfplot(samps)

There are still discreteness effects (after all, these are integer tables), but they are less troubling:

pval = numel(samps(samps>=wald(table)))/numel(samps)
pvaltt = (numel(samps(samps>=wald(table)))+numel(samps(samps<= -wald(table))))/numel(samps)
pval = 0.022967
pvaltt = 0.040067

The one-tail vs. two-tail values are now much more reasonable. This is probably the most honest answer that we can get for the significance of this particular contingency table.
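A quick sanity check (a sketch using the tabnullsamp function above): the sampled null tables keep the protocol's fixed column marginals, while the row marginals fluctuate with the Dirichlet-drawn q:

table = [8 3; 16 26];
for k = 1:3
    t = tabnullsamp(table);
    disp([sum(t,1); sum(t,2)'])   % first row always [24 29]; second row varies
end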
Let's reanalyze the maternal drinking data using the same methodology, but (just for variety) with the Pearson statistic:

function chis = pearson(table)
nhtable = sum(table,2)*sum(table,1)/sum(sum(table));
chis = sum(sum((table-nhtable).^2./nhtable));

table = [17066 14464 788 126 37; 48 38 5 1 1]'
table =
   17066      48
   14464      38
     788       5
     126       1
      37       1

We transpose so that the unfixed marginals are the rows, as before (this is a guess as to what the protocol actually was!); the columns are the "conditions".

pearson(table)
ans = 12.082
samps = arrayfun(@(x) pearson(tabnullsamp(table)), 1:10000);
hist(samps,0:0.25:90)

Giving the results:

pval = numel(samps(samps>pearson(table)))/numel(samps)
pval = 0.0408

Why is this less significant than the previous analysis, which gave p = 0.015 or 0.011 (mean or square-mean)? Because here we didn't use the fact that the factors were ordinal and quantitatively related. By virtue of using that information, and compressing the rows, the previous method was more powerful.

Actually, it seems likely that this data came from a cross-sectional study with no fixed marginals. If so, a better sampling of the null hypothesis would be:

function tabout = tabnullsamp2(tabin)
marcol = sum(tabin,1);
marrow = sum(tabin,2);
ntot = sum(marcol);
q = gamrnd(marcol+1,1); q = q./sum(q);
p = gamrnd(marrow+1,1); p = p./sum(p);
pq = p * q;
tabout = reshape(mnrnd(ntot,pq(:)'),size(tabin));

samps = arrayfun(@(x) pearson(tabnullsamp2(table)), 1:10000);
hist(samps,0:0.25:90)
pval = numel(samps(samps>pearson(table)))/numel(samps)
pval = 0.0445   % different from the previous answer: protocol matters!

This would probably be the "honest" answer if the table were nominal, not ordinal. But, as said, since it is ordinal, the previous analysis using the difference of means is more powerful. You now know all you need to know about contingency tables – and much more than almost everyone who uses them!