Contingency Tables and Log-Linear Models
Hal Whitehead
BIOL4062/5062

• Categorical data
• Contingency tables
• Goodness of fit – G-tests
• Multiway tables – log-linear models

Goodness of Fit With Categorical Data
• Categorical variables have discrete values (colours, haplotypes, sexes, morphs, ...)
• No ordering (usually)

Contingency Tables
• Data: number of individuals in each cell (each particular combination of values)

One-Way Table

Colour of eye   Count
Blue            35
Yellow          47
Green           12
Red             37
White           56

Two-Way Table

Colour of eye   Male   Female
Blue            12     23
Yellow          36     11
Green           3      9
Red             31     6
White           50     6

Goodness of fit with categorical data
• f(i): number observed in cell i
• g(i): number expected in cell i according to the model
• a: number of cells

Goodness of fit of data to model: G (likelihood-ratio) test
$G = 2\log(L) = 2\sum_{i=1}^{a} f(i)\,\log\frac{f(i)}{g(i)}$ (natural logarithm)
• If the model is true, G is distributed as χ² with a-1 degrees of freedom
• G is approximated by the statistic of the "chi-squared test":
$G \approx X^2 = \sum_{i=1}^{a} \frac{(f(i)-g(i))^2}{g(i)}$

Example: Goodness of fit
Bottlenose whale populations from mark-recapture

Years seen          1      2      3      4      5     >6     χ²(5)
No. of whales       81     35     17     10     6     11
Expected, Model A   64.8   45.0   25.0   14.2   7.0   3.9    G = 23.3 (P = 0.00)
Expected, Model B   75.7   42.5   19.0   9.1    4.7   9.0    G = 2.8 (P = 0.73)

Example: Goodness of fit, two-way contingency table
Mortality of mice given bacteria (numbers expected under independence in parentheses)

        Antiserum    No antiserum   Total
Dead    13 (19.5)    25 (18.5)      38
Alive   44 (37.5)    29 (35.5)      73
Total   57           54             111

• Expected number = row total × column total / grand total, e.g. 54 × 73 / 111 = 35.5
• Null hypothesis: mortality is independent of antiserum
• Alternative hypothesis: mortality rate differs with antiserum
• 1 degree of freedom: with the marginal totals known, once any one cell count is given, all the others are fixed
$G = 2\sum_{i} f(i)\,\log\frac{f(i)}{g(i)} = 6.88$
• χ²(1): P = 0.009

Two-way contingency table
• Test independence of rows and columns in an r × c contingency table using the G-test
– if independent, G is distributed as χ² with (r-1)(c-1) d.f.

Layout: haplotypes (rows) by areas (columns), one count per cell

            Area
Haplotype   L1   L2   L3   L4
A           .    .    .    .
B           .    .    .    .
C           .    .    .    .
D           .    .    .    .
E           .    .    .    .
F           .    .    .    .
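The G-test of independence for the mouse antiserum table can be sketched in a few lines. This is only an illustration, not the course's own code: the function names `g_test_independence` and `chi2_sf_1df` are made up for this example. It assumes the likelihood-ratio statistic is G = 2 Σ f(i) ln(f(i)/g(i)), with each expected count g(i) taken as row total × column total / grand total. The P-value helper relies on the identity P(χ²(1) > x) = erfc(√(x/2)), which holds only for 1 degree of freedom.

```python
import math


def g_test_independence(table):
    """G-test of independence for an r x c table of counts (a list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns are independent
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # an empty cell contributes nothing to the sum
                total += observed * math.log(observed / expected)
    g = 2.0 * total
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df


def chi2_sf_1df(x):
    """Upper-tail probability of chi-squared with exactly 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


# Rows: dead, alive. Columns: antiserum, no antiserum.
mice = [[13, 25],
        [44, 29]]

g, df = g_test_independence(mice)
p = chi2_sf_1df(g)
print(f"G = {g:.2f}, d.f. = {df}, P = {p:.3f}")  # G = 6.88, d.f. = 1, P = 0.009
```

The same function works for any r × c table, but tables with more than 1 degree of freedom need a general chi-squared tail probability, which the Python standard library does not provide.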
Problems with G-tests of contingency tables with categorical data
• Non-independence of data
• Small cell numbers (the G-test is asymptotic). Rule of thumb: expected cell numbers > 5
– Williams correction
– Yates correction
– Lump data
– Use exact test
• Model wrong:
– In an m × n two-way contingency table, if both sets of marginal totals are fixed, the G-test is inappropriate: use an exact test

e.g. Students' beer preferences
• X: 20 M, 20 F choose one each from 40 Blue, 40 Keith's: G-test OK
• Y: 20 M, 20 F choose one each from 20 Blue, 20 Keith's: G-test not OK (use exact test)

          Blue   Keith's   Total
Male      xBM    xKM       20
Female    xBF    xKF       20
Total X   ?      ?         40
Total Y   20     20        40

Multiway Tables
Categorical variables divided into:
a) Factors: data on the group to which the subject belongs, or the set of experimental conditions (c.f. independent continuous variables in regression)
b) Responses: what was observed (c.f. dependent continuous variables)

General types of multiway tables
• Multiresponse, no-factor
• Multiresponse, one-factor
• One-response, multifactor
• Multiresponse, multifactor

In the layouts below, R marks a response column and F a factor column.

Multiresponse, no-factor (c.f. Principal Components)

Locus 1   Locus 2   Locus 3   Locus 4
A         B         C         D
a         b         c         d
R         R         R         R

Multiresponse, one-factor (c.f. Canonical Variate Analysis)

Locus 1   Locus 2   Locus 3   Locus 4   Area
A         B         C         D         P1
a         b         c         d         P2
                                        P3
                                        P4
R         R         R         R         F

One-response, multifactor (c.f. Multiple Regression)
Columns: Mortality (R), Ate peas (F), Smoked (F), Exercised (F)
Values: 1 1 1 2 0 0 0 1 0

Multiresponse, multifactor (c.f. Canonical Correlation)

Whistles   Grunts   Clicks   Habitat    Social
Y          Y        Y        Forest     Y
N          N        N        Savannah   N
R          R        R        F          F

Log-linear Models
Expected no. of females (F) eating plants but not bats:
$ƒ(F,p+,b-) = O \cdot S(F) \cdot P(+) \cdot B(-) \cdot SP(F,+) \cdots SPB(F,+,-)$
• O is the overall geometric mean number per cell
• S(F) is an additional sex effect
• SP is an interaction between sex and plants
Taking logs turns the product into a sum:
$\log ƒ(s,p,b) = \mu + \alpha(s) + \beta(p) + \gamma(b) + \delta(s,p) + \dots + \varepsilon(s,p,b)$
This is a log-linear model

• Fit by maximum likelihood: find μ, α, β, γ, δ, ε, ... (given the totals) that maximize the likelihood of the data, which is the same as minimizing the log likelihood ratio
$\log(L) = \sum_{s}\sum_{p}\sum_{b} f(s,p,b)\,\log\frac{f(s,p,b)}{g(s,p,b)}$
where f(s,p,b) is the number observed and g(s,p,b) the number expected under the model
• Test importance of various terms using likelihood-ratio G tests
• Compare models using AIC

• In log-linear models:
– Almost always include first-order effects
– Almost always include (k-1)th-order effects for variables included in kth-order effects:
  include A and B if AB is included
  include AB, AC and BC if ABC is included

Drosophila mortality (R) by sex (F) and pupation site (F)

                    Pupation site
                    AM   IM   OM   OW
Female   Healthy    23   55   8    7
         Poisoned   1    6    3    4
Male     Healthy    15   34   5    3
         Poisoned   5    17   3    5

• Test for 3-way effect:
– Does mortality depend on the interaction between sex and pupation site?
  G = 1.37, 3 [=(4-1)(2-1)(2-1)] d.f., P=0.7137
• Test for 2-way effects:
– Does pupation site depend on sex?
  G = 1.50, 3 [=(4-1)(2-1)] d.f., P=0.6814
– Does mortality depend on sex?
  G = 12.61, 1 [=(2-1)(2-1)] d.f., P=0.0004
– Does mortality depend on pupation site?
  G = 8.96, 3 [=(4-1)(2-1)] d.f., P=0.0298

Drosophila mortality by sex and pupation site
• Complete independence: AIC=30.44
• Site*Sex: AIC=36.30
• Site*Mortality: AIC=27.48
• Sex*Mortality: AIC=19.83
• Site*Sex + Site*Mortality: AIC=23.34
• Site*Sex + Sex*Mortality: AIC=25.68
• Site*Mortality + Sex*Mortality: AIC=16.87
• All 2-way interactions: AIC=21.37

Drosophila mortality (R) by sex (F) and pupation site (F)
• Conclusion (from the lowest-AIC model, Site*Mortality + Sex*Mortality): mortality depends on:
• Sex, % poisoned:
– F 13%
– M 34%
• Pupation site, % poisoned:
– AM 14%
– IM 21%
– OM 32%
– OW 47%

Number of parameters (K) in calculation of AIC for log-linear models
• 1-way table (n cells)
– null model (all cells same): K = 0
– full model (all cells different): K = n-1
• 2-way table (m × n cells)
– null model (all cells same): K = 0
– both one-way effects: K = (m-1)+(n-1) = m+n-2
– full model (all cells different): K = (m-1)(n-1)+(m-1)+(n-1) = mn-1
• 3-way table (l × m × n cells)
– null model (all cells same): K = 0
– all one-way effects: K = (l-1)+(m-1)+(n-1) = l+m+n-3
– all one-way effects and one two-way effect: K = l+m+n-3+(m-1)(n-1) = l+mn-2
– all one-way and two-way effects: K = l+m+n-3+(m-1)(n-1)+(m-1)(l-1)+(n-1)(l-1) = lm+ln+mn-l-m-n
– full model (all cells different): K = (l-1)(m-1)(n-1)+lm+ln+mn-l-m-n = lmn-1
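
The parameter counts above follow one rule: each effect in the model contributes the product of (number of levels - 1) over the variables it involves. The sketch below applies that rule to a 4 × 2 × 2 table, the dimensions of the Drosophila example, and compares the results with the closed-form expressions. The helper names `count_parameters` and `terms_up_to` and the variable labels are illustrative assumptions, not part of the original material.

```python
from itertools import combinations
from math import prod


def count_parameters(levels, terms):
    """K = sum over included effects of the product of (levels - 1)."""
    return sum(prod(levels[v] - 1 for v in term) for term in terms)


def terms_up_to(variables, order):
    """All effects of the given order or lower (a hierarchical model)."""
    return [t for k in range(1, order + 1) for t in combinations(variables, k)]


# 4 pupation sites x 2 sexes x 2 outcomes (labels are illustrative)
levels = {"site": 4, "sex": 2, "mortality": 2}
l, m, n = levels["site"], levels["sex"], levels["mortality"]
variables = list(levels)

k_null = count_parameters(levels, [])
k_main = count_parameters(levels, terms_up_to(variables, 1))
k_one_two_way = count_parameters(
    levels, terms_up_to(variables, 1) + [("sex", "mortality")]
)
k_two_way = count_parameters(levels, terms_up_to(variables, 2))
k_full = count_parameters(levels, terms_up_to(variables, 3))

# Compare with the closed-form expressions for a 3-way table
assert k_main == l + m + n - 3
assert k_one_two_way == l + m * n - 2
assert k_two_way == l * m + l * n + m * n - l - m - n
assert k_full == l * m * n - 1

print(k_null, k_main, k_one_two_way, k_two_way, k_full)  # 0 5 6 12 15
```

Listing the effects explicitly, rather than hard-coding each formula, makes it easy to count K for intermediate models such as Site*Mortality + Sex*Mortality.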