Contingency Tables and Log-Linear Models
Hal Whitehead
BIOL4062/5062
• Categorical data
• Contingency tables
• Goodness of fit
– G-tests
• Multiway tables
– log-linear models
Goodness of Fit
With Categorical Data
• Categorical variables: have discrete values
(colours, haplotypes, sexes, morphs, ...)
• No ordering (usually)
Contingency Tables
• Data: number of individuals in cell (with
particular combination of values)
One-Way Table
Colour of eye    Number
Blue             35
Yellow           47
Green            12
Red              37
White            56
Two-Way Table
Colour of eye    Male    Female
Blue             12      23
Yellow           36      11
Green            3       9
Red              31      6
White            50      6
Goodness of fit with categorical data
f(i)    number observed in cell i
g(i)    number expected in cell i according to model
a       number of cells
Goodness of fit of data to model
G, likelihood-ratio, test:
G = 2 · Log(L) = 2 · Σ f(i) · Log( f(i) / g(i) ), summed over i=1:a
G ≈ X² = Σ (f(i) - g(i))² / g(i), summed over i=1:a (“Chi-squared test”)
If model is true:
G and X² are distributed approximately as χ² with a-1 degrees of freedom
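Why G and X² give similar values can be seen from a short expansion. This is only a sketch, assuming that observed and expected totals agree (Σ f(i) = Σ g(i)) and that the deviations d(i) = f(i) - g(i) are small relative to g(i). The symbol d(i) is introduced here for the sketch and is not part of the slides.

```latex
\begin{align*}
G &= 2\sum_{i=1}^{a} f(i)\,\ln\frac{f(i)}{g(i)}
   = 2\sum_{i=1}^{a} \bigl(g(i)+d(i)\bigr)\,\ln\!\Bigl(1+\frac{d(i)}{g(i)}\Bigr) \\
  &\approx 2\sum_{i=1}^{a} \bigl(g(i)+d(i)\bigr)
      \Bigl(\frac{d(i)}{g(i)}-\frac{d(i)^2}{2\,g(i)^2}\Bigr) \\
  &\approx 2\sum_{i=1}^{a} d(i) + \sum_{i=1}^{a}\frac{d(i)^2}{g(i)}
   = \sum_{i=1}^{a}\frac{\bigl(f(i)-g(i)\bigr)^2}{g(i)} = X^2
\end{align*}
```

The first sum vanishes because the deviations add to zero, and terms of order d(i)³ are dropped. The approximation is therefore poorest in cells where the expected number is small.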
Example: Goodness of fit
Bottlenose whale populations from mark-recapture

Yrs seen      1      2      3      4      5      >6     χ2(5)
No. whales    81     35     17     10     6      11
Expected:
Model A       64.8   45.0   25.0   14.2   7.0    3.9    G = 23.3 (P = 0.00)
Model B       75.7   42.5   19.0   9.1    4.7    9.0    G = 2.8 (P = 0.73)
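The two goodness-of-fit statistics can be recomputed from the whale counts. This is a minimal sketch, assuming G is twice the sum of f·ln(f/g) and using the expected numbers as printed (rounded to one decimal), so the results need not match the reported values exactly. The function names are illustrative.

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio statistic G = 2 * sum f * ln(f / g).

    Cells with f = 0 contribute nothing to the sum.
    """
    return 2.0 * sum(f * math.log(f / g)
                     for f, g in zip(observed, expected) if f > 0)

def pearson_x2(observed, expected):
    """Pearson statistic X^2 = sum (f - g)^2 / g."""
    return sum((f - g) ** 2 / g for f, g in zip(observed, expected))

# Number of whales seen in 1, 2, 3, 4, 5 and >6 years (from the slide)
observed = [81, 35, 17, 10, 6, 11]
models = {
    "A": [64.8, 45.0, 25.0, 14.2, 7.0, 3.9],
    "B": [75.7, 42.5, 19.0, 9.1, 4.7, 9.0],
}

results = {}
for name, expected in models.items():
    results[name] = (g_statistic(observed, expected),
                     pearson_x2(observed, expected))
    print(f"Model {name}: G = {results[name][0]:.2f}, "
          f"X2 = {results[name][1]:.2f}")
```

Both statistics are far smaller for Model B than for Model A. G and X² can differ noticeably when a cell has a small expected number, as in the last cell of Model A.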
Example: Goodness of fit,
Two-way contingency table
Mortality of mice given bacteria (expected numbers under the null hypothesis in brackets)

          Antiserum    No antiserum   Total
Dead      13 (19.5)    25 (18.5)      38
Alive     44 (37.5)    29 (35.5)      73
Total     57           54             111

Expected number, e.g. alive with no antiserum: 54x73/111=35.5
Null hypothesis: Mortality independent of antiserum
Alternative hypothesis: Mortality rate different with antiserum
1 degree of freedom: with the marginal totals known, once any one cell is given, all others are fixed
G = 2 · Σ f(i) · Log( f(i) / g(i) ) = 6.88
χ2(1): p=0.009
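The mouse example can be checked step by step: expected numbers from the marginal totals, then G, then the χ² tail probability. This is a minimal sketch, assuming G is twice the sum of f·ln(f/g). The function names are illustrative, and the `erfc` shortcut for the p-value holds for 1 degree of freedom only.

```python
import math

def expected_independent(table):
    """Expected counts if rows and columns are independent:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def g_independence(table):
    """G statistic and degrees of freedom for an r x c table."""
    expected = expected_independent(table)
    g = 2.0 * sum(f * math.log(f / e)
                  for f_row, e_row in zip(table, expected)
                  for f, e in zip(f_row, e_row) if f > 0)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df, expected

# Rows: Dead, Alive; columns: Antiserum, No antiserum
mice = [[13, 25], [44, 29]]
g, df, expected = g_independence(mice)

# Upper tail of chi-square with 1 d.f.: P(chi2 > g) = erfc(sqrt(g / 2))
p = math.erfc(math.sqrt(g / 2.0))
print(f"G = {g:.2f}, d.f. = {df}, p = {p:.4f}")
```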
Two-way contingency table
• Test independence of rows and columns in
r x c contingency table using G-test
– if independent, G is distributed as χ2 with (r-1)x(c-1) d.f.

            Haplotypes
Area     A    B    C    D    E    F
L1       .    .    .    .    .    .
L2       .    .    .    .    .    .
L3       .    .    .    .    .    .
L4       .    .    .    .    .    .
Problems with G-tests of contingency
tables with categorical data
• Non-independence of data
• Small cell-numbers (G-test is asymptotic):
Rule of thumb: expected cell numbers >5
– Williams correction
– Yates correction
– Lump data
– Use exact test
• Model wrong:
– In mxn 2-way contingency table, if both sets of
marginal totals are fixed, then G test is inappropriate: use exact test
e.g. Students’ beer preferences
X: 20M, 20F choose one each from 40 Blue, 40 Keith's: G-test OK
Y: 20M, 20F choose one each from 20 Blue, 20 Keith's: G-test not OK (use exact test)

           Blue    Keith's   Total
Male       xBM     xKM       20
Female     xBF     xKF       20
Total X    ?       ?         40
Total Y    20      20        40
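For a 2 x 2 table with both sets of marginal totals fixed (scenario Y), one common exact test is Fisher's exact test, which sums hypergeometric probabilities over all tables with the same margins. The sketch below is one way to do it. The beer counts are hypothetical, since the slide gives none, and the two-sided rule used here (sum all tables no more probable than the observed one) is one of several conventions.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; the p-value sums the probabilities of every table that
    is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return (math.comb(row1, x) * math.comb(row2, col1 - x)
                / math.comb(n, col1))

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance so that ties with the observed table are included
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# HYPOTHETICAL counts for scenario Y (all margins equal to 20):
# males: 14 Blue, 6 Keith's; females: 6 Blue, 14 Keith's
p_value = fisher_exact_two_sided(14, 6, 6, 14)
print(f"Fisher exact two-sided p = {p_value:.4f}")
```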
Multiway Tables
Categorical variables divided into:
a) Factors: data on group to which subject
belongs, or set of experimental conditions
c.f. independent continuous variables in
regression
b) Responses: what was observed
c.f. dependent continuous variables
General types of multiway tables
• Multiresponse, no-factor
• Multiresponse, one-factor
• One-response, multifactor
• Multiresponse, multifactor
Multiresponse, no-factor
(c.f. Principal Components)

Locus 1   Locus 2   Locus 3   Locus 4
A         B         C         D
a         b         c         d
R         R         R         R
Multiresponse, one-factor
(c.f. Canonical Variate Analysis)

Locus 1   Locus 2   Locus 3   Locus 4   Area
A         B         C         D         P1
a         b         c         d         P2
                                        P3
                                        P4
R         R         R         R         F
One-response, multifactor
(c.f. Multiple Regression)

Mortality   Ate peas   Smoked   Exercised
1           1          1        2
0           0          0        1
0
R           F          F        F
Multiresponse, multifactor
(c.f. Canonical Correlation)

Whistles   Grunts   Clicks   Habitat    Social
Y          Y        Y        Forest     Y
N          N        N        Savannah   N
R          R        R        F          F
Log-linear Models
Expected no. of females (F) eating plants (p+) but not bats (b-):
ƒ(F,p+,b-) = O·S(F)·P(+)·B(-)·SP(F,+)·...·SPB(F,+,-)
O is the overall geometric mean number per cell
S(F) is an additional sex effect
SP is an interaction between sex and plants
Log(ƒ(s,p,b)) = μ+α(s)+β(p)+γ(b)+δ(s,p)+ ... +ε(s,p,b)
This is a log-linear model
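The step from the multiplicative form to the log-linear form is just taking logarithms of both sides. A sketch, assuming each Greek-letter term is the log of the corresponding multiplicative effect (a correspondence the slide implies but does not spell out):

```latex
\begin{align*}
\log f(s,p,b)
  &= \log O + \log S(s) + \log P(p) + \log B(b)
     + \log SP(s,p) + \dots + \log SPB(s,p,b) \\
  &= \mu + \alpha(s) + \beta(p) + \gamma(b)
     + \delta(s,p) + \dots + \varepsilon(s,p,b)
\end{align*}
```

Under this reading, μ = log O, α(s) = log S(s), δ(s,p) = log SP(s,p), and so on, so products of effects become sums of terms.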
Log-linear Models
• Log(ƒ(s,p,b)) = μ+α(s)+β(p)+γ(b)+δ(s,p)+ ... +ε(s,p,b)
• Calculate likelihood by finding μ, α, β, γ, δ, ε, ...
given totals, to maximize the likelihood of the data, i.e. to minimize:
Log(L) = Σ Σ Σ f(s,p,b)·Log( f(s,p,b) / g(s,p,b) ), summed over s, p, b
• Test importance of various terms using likelihood-ratio G tests
• Compare models using AIC
Log-linear Models
• In log-linear models:
• Almost always include first-order effects
• Almost always include (k-1)th-order effects
for variables included in kth-order effects:
– include A and B if AB is included
– include AB, AC and BC if ABC is included
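The inclusion rule above can be sketched as a small helper that expands the highest-order terms of a model into every lower-order term they imply. This is an illustration only: the function name is made up, and each variable is assumed to be written as a single letter.

```python
from itertools import combinations

def hierarchical_terms(highest_terms):
    """Expand terms such as "ABC" into all implied lower-order terms.

    Assumes every variable is a single letter, so "AB" means the
    interaction between variables A and B.
    """
    terms = set()
    for term in highest_terms:
        variables = sorted(set(term))
        for order in range(1, len(variables) + 1):
            for combo in combinations(variables, order):
                terms.add("".join(combo))
    # Sort by order of effect, then alphabetically
    return sorted(terms, key=lambda t: (len(t), t))

print(hierarchical_terms(["AB"]))
print(hierarchical_terms(["ABC"]))
print(hierarchical_terms(["AB", "BC"]))
```

Including ABC therefore brings in A, B, C, AB, AC and BC, while a model specified by AB and BC leaves out AC.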
Drosophila mortality (R)
by sex (F) and pupation site (F)

Pupation    Female                Male
site        Healthy   Poisoned    Healthy   Poisoned
AM          23        1           15        5
IM          55        6           34        17
OM          8         3           5         3
OW          7         4           3         5
Drosophila mortality (R)
by sex (F) and pupation site (F)
• Test for 3-way effect:
– Does mortality depend on the interaction
between sex and pupation site?
• G = 1.37, 3 [=(4-1)(2-1)(2-1)] d.f., P=0.7137
• Test for 2-way effects:
– Does pupation site depend on sex?
• G = 1.50, 3 [=(4-1)(2-1)] d.f., P=0.6814
– Does mortality depend on sex?
• G = 12.61, 1 [=(2-1)(2-1)] d.f., P=0.0004
– Does mortality depend on pupation site?
• G = 8.96, 3 [=(4-1)(2-1)] d.f., P=0.0298
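The two mortality tests can be recomputed from the Drosophila counts. This is a sketch under two assumptions: that each is an ordinary G-test of independence on the two-way table obtained by summing over the third variable, and that the cell layout is the one given in the code comment. The other tests listed are not reproduced here. The χ² tail probability uses a closed-form series for integer degrees of freedom, so no statistics library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of chi-square for integer d.f."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def g_independence(table):
    """G statistic and d.f. for independence in an r x c table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 2.0 * sum(f * math.log(f * n / (rows[i] * cols[j]))
                  for i, row in enumerate(table)
                  for j, f in enumerate(row) if f > 0)
    return g, (len(rows) - 1) * (len(cols) - 1)

# Assumed layout per site:
# (female healthy, female poisoned, male healthy, male poisoned)
data = {"AM": (23, 1, 15, 5), "IM": (55, 6, 34, 17),
        "OM": (8, 3, 5, 3), "OW": (7, 4, 3, 5)}

# Sum over sites: rows Female, Male; columns Healthy, Poisoned
by_sex = [[sum(v[0] for v in data.values()), sum(v[1] for v in data.values())],
          [sum(v[2] for v in data.values()), sum(v[3] for v in data.values())]]
# Sum over sexes: one row per site; columns Healthy, Poisoned
by_site = [[v[0] + v[2], v[1] + v[3]] for v in data.values()]

g_sex, df_sex = g_independence(by_sex)
g_site, df_site = g_independence(by_site)
p_sex, p_site = chi2_sf(g_sex, df_sex), chi2_sf(g_site, df_site)
print(f"Mortality x sex:  G = {g_sex:.2f}, d.f. = {df_sex}, P = {p_sex:.4f}")
print(f"Mortality x site: G = {g_site:.2f}, d.f. = {df_site}, P = {p_site:.4f}")
```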
Drosophila mortality
by sex and pupation site
• Complete independence               AIC=30.44
• Site*Sex                            AIC=36.30
• Site*Mortality                      AIC=27.48
• Sex*Mortality                       AIC=19.83
• Site*Sex + Site*Mortality           AIC=23.34
• Site*Sex + Sex*Mortality            AIC=25.68
• Site*Mortality + Sex*Mortality      AIC=16.87
• All 2-way interactions              AIC=21.37
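A model comparison of this kind can be sketched by fitting each log-linear model with iterative proportional fitting (a standard fitting method, not necessarily the one used for the slide) and computing AIC as G + 2K. Absolute AIC values depend on how constants and K are counted, so they may differ from those listed by a constant; only differences between models are meaningful. The cell layout in the code comment is an assumption.

```python
import math
from itertools import combinations

# Assumed layout per site:
# (female healthy, female poisoned, male healthy, male poisoned)
DATA = {"AM": (23, 1, 15, 5), "IM": (55, 6, 34, 17),
        "OM": (8, 3, 5, 3), "OW": (7, 4, 3, 5)}
SIZES = (4, 2, 2)  # axes: 0 = site, 1 = sex, 2 = mortality

observed = {}
for i, counts in enumerate(DATA.values()):
    for j, count in enumerate(counts):
        observed[(i, j // 2, j % 2)] = float(count)

def margin(table, axes):
    """Sum a table over every axis not listed in `axes`."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals

def fit(margins, cycles=100):
    """Iterative proportional fitting: rescale the fitted table so that
    each listed margin matches the observed margin, in turn."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(cycles):
        for axes in margins:
            target, current = margin(observed, axes), margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= target[key] / current[key]
    return fitted

def n_parameters(margins):
    """K: each fitted margin also implies all its lower-order terms."""
    terms = set()
    for axes in margins:
        for order in range(1, len(axes) + 1):
            terms.update(combinations(axes, order))
    return sum(math.prod(SIZES[a] - 1 for a in term) for term in terms)

MODELS = {
    "Complete independence": [(0,), (1,), (2,)],
    "Site*Sex": [(0, 1), (2,)],
    "Site*Mortality": [(0, 2), (1,)],
    "Sex*Mortality": [(1, 2), (0,)],
    "Site*Sex + Site*Mortality": [(0, 1), (0, 2)],
    "Site*Sex + Sex*Mortality": [(0, 1), (1, 2)],
    "Site*Mortality + Sex*Mortality": [(0, 2), (1, 2)],
    "All 2-way interactions": [(0, 1), (0, 2), (1, 2)],
}

g_stat, aic = {}, {}
for name, margins in MODELS.items():
    fitted = fit(margins)
    g_stat[name] = 2.0 * sum(f * math.log(f / fitted[cell])
                             for cell, f in observed.items() if f > 0)
    aic[name] = g_stat[name] + 2 * n_parameters(margins)
    print(f"{name:32s} G = {g_stat[name]:6.2f}  AIC = {aic[name]:6.2f}")

best = min(aic, key=aic.get)
print("Lowest AIC:", best)
```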
Drosophila mortality (R)
by sex (F) and pupation site (F)
• Conclusion: Mortality depends on:
• Sex (% poisoned)
– F 13%
– M 34%
• Pupation site (% poisoned)
– AM 14%
– IM 21%
– OM 32%
– OW 47%
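The percentages poisoned follow directly from the cell counts. A short sketch, assuming the cell layout given in the code comment:

```python
# Assumed layout per site:
# (female healthy, female poisoned, male healthy, male poisoned)
data = {"AM": (23, 1, 15, 5), "IM": (55, 6, 34, 17),
        "OM": (8, 3, 5, 3), "OW": (7, 4, 3, 5)}

def percent(part, whole):
    return 100.0 * part / whole

female_healthy = sum(v[0] for v in data.values())
female_poisoned = sum(v[1] for v in data.values())
male_healthy = sum(v[2] for v in data.values())
male_poisoned = sum(v[3] for v in data.values())

pct_by_sex = {
    "F": percent(female_poisoned, female_healthy + female_poisoned),
    "M": percent(male_poisoned, male_healthy + male_poisoned),
}
pct_by_site = {site: percent(v[1] + v[3], sum(v)) for site, v in data.items()}

for label, pct in {**pct_by_sex, **pct_by_site}.items():
    print(f"{label}: {pct:.0f}% poisoned")
```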
Number of parameters (K) in calculation
of AIC for log-linear models
• 1-way table (n cells)
– null model (all cells same): K=0
– full model (all cells different): K=n-1
• 2-way table (mxn cells)
– null model (all cells same): K=0
– both one-way effects: K=(m-1)+(n-1)=m+n-2
– full model (all cells different):
K=(m-1)(n-1)+(m-1)+(n-1)=mn-1
Number of parameters (K) in calculation
of AIC for log-linear models
• 3-way table (lxmxn cells)
– null model (all cells same): K=0
– all one-way effects: K=(l-1)+(m-1)+(n-1)=l+m+n-3
– all one-way effects and one two-way effect:
K=l+m+n-3+(m-1)(n-1)= l+mn-2
– all one-way and two-way effects:
K=l+m+n-3+(m-1)(n-1)+(m-1)(l-1) +(n-1)(l-1)
=lm+ln+mn-l-m-n
– full model (all cells different): K=(l-1)(m-1)(n-1)+
lm+ln+mn-l-m-n=lmn-1
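The counting rule behind these formulas is that each included term contributes the product of (number of levels - 1) over its variables. A small sketch that checks the 3-way formulas numerically; the table dimensions are hypothetical and the function name is illustrative.

```python
import math

def count_parameters(terms, sizes):
    """K = sum over included terms of the product of (levels - 1).

    `terms` lists every included effect as a string of single-letter
    variable names, e.g. "MN" for the M-by-N interaction.
    """
    return sum(math.prod(sizes[v] - 1 for v in term) for term in terms)

# HYPOTHETICAL table dimensions l x m x n
l, m, n = 3, 4, 5
sizes = {"L": l, "M": m, "N": n}

one_way = ["L", "M", "N"]
two_way = ["LM", "LN", "MN"]

k_null = count_parameters([], sizes)
k_one_way = count_parameters(one_way, sizes)
k_one_plus_mn = count_parameters(one_way + ["MN"], sizes)
k_all_two_way = count_parameters(one_way + two_way, sizes)
k_full = count_parameters(one_way + two_way + ["LMN"], sizes)

print(k_null, k_one_way, k_one_plus_mn, k_all_two_way, k_full)
```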