Download Slide 1

Generalized Linear Models Logistic Regression Log-Linear Models © G. Quinn & M. Keough, 2004 Generalized linear model • Fit model using maximum likelihood • Three components – Random component • response variable & its probability distribution • Exponential distribution (normal, gamma, binomial, etc) – Systematic component • predictor variable(s) – Continuous or categorical – Combinations, polynomial functions – Link function Link function • Links expected value of Y to predictors Three common link functions • Identity link – g() =  – Models mean or expected value of Y – Standard linear models • Log link – g() = log() – Used for count data, which cannot be negative • Logit link – g() = log(/(1-)) – Used for binary data and logistic regression Logistic regression • Modelling response variables that are discrete – Often binary • Present/Absent • Alive/Dead • Response/No Response • Predictors categorical or continuous – Equivalent to simple linear regression, ANOVA, multiple regression, ANCOVA Simple logistic regression • Single, binary, response variable • Single, continuous predictor • Model (x) – P(Y) =1 for a given X • Fit logistic regression model – Sigmoidal – Greatest change in values in mid-range of X – Response variable has binomial distribution • OLS not appropriate; ML required Example: Lizards on islands • Polis et al. (1998), examination of factors controlling spider populations on islands in the Gulf of California. • Hypothesis: Presence of predator lizards (Uta) a key influence. • What aspects of an island influence the presence of Uta? – P/A ratio Uta presence/absence 1.0 0.8 0.6 0.4 0.2 0.0 0 10 20 30 40 50 60 70 P/A ratio The logistic model •0 and 1 are parameters to be estimated •Slope and intercept A simpler model • Calculate odds – P(event)/P(1-event) – P(yi = 1)/P(yi = 0) Log odds Link function (logit), g(x) g(x) = 0 + 1x1 Linear model g(x) = 0 + 1(P/A ratio) Interpretation • 1 is the rate of change in the log(odds) for a unit change in X • More often expressed as the Odds Ratio – Change in odds for unit change in X –e • 0 is the intercept – not often of biological interest Null hypotheses Most often 1 = 0 • Wald test – ML equivalent of t test – Parameter estimate/standard error – b/sb – Normal for large sample sizes – Test using z distribution Tests (continued) • Compare fit of full and reduced model g(x) = 0 + 1x1 Full model g(x) = 0 Reduced model Difference in fit reflects effect of 1 Assess fit using Likelihood Ratio statistic () Possible 0 values 0 Log(L) for best parameter estimate Log(L) ML estimator Log(L) for best parameter estimate Log(L) 1 values 0 values Tests (continued)  is the ratio of the likelihood of the reduced model to that of the full model If  is near 1, 1 contributes little If  is <1, 1 has an effect G2 = -2 ln() Log-likelihood 2 or G statistic G2 = - 2 (log-likelihood reduced – log-likelihood full) G2 follows 2 with 1 df Test statistics • Test with either Wald or G2 • Unlike regression, Wald ≠ G2 • Use G2 for small sample sizes • G2 also called deviance when a specific model is compared to a saturated model (which fits data perfectly – Change in deviance used to compare different models – Equivalent to SSResidual Worked example Maximum Likelihood estimation, so procedure is iterative: estimate parameters, calculate log-likelihood. Refine parameter estimates, recalculate log-likelihood. Continue until convergence: SYSTAT output: Category choices 0 (REFERENCE) 1 (RESPONSE) Total : 19 9 10 L-L at iteration 1 is -13.170 L-L at iteration 2 is -8.837 L-L at iteration 3 is -7.529 L-L at iteration 4 is -7.138 L-L at iteration 5 is -7.111 L-L at iteration 6 is -7.110 L-L at iteration 7 is -7.110 Log Likelihood: -7.110 Output (cont) Parameter 1 CONSTANT 2 PARATIO Parameter 2 PARATIO Estimate 3.606 -0.220 S.E. 1.695 0.101 t-ratio 2.127 -2.184 95.0 % bounds Upper Lower 0.978 0.659 Odds Ratio 0.803 =e-.22 Log Likelihood of constants only model = LL(0) = 2*[LL(N)-LL(0)] = -13.143 12.066 with 1 df Chi-sq p-value = 0.001 2(-13.143 – (-7.110)) p-value 0.033 0.029 0.8 0.6 0.4 0.2 0.0 0 10 20 30 40 50 60 70 P/A ratio Predicted probability of occurrence Uta presence/absence 1.0 1.0 0.8 0.6 0.4 0.2 0.0 0 10 20 30 40 50 60 70 P/A ratio A special case: toxicity testing • Logistic regression used to estimate relationship between concentration of substance and response variable • Equation used to solve for concentration that produces a given level of response – LC50 – EC50 Worked example: toxicity testing • Effect of copper on larvae of a marine invertebrate, Bugula dentata • Methods – Larvae exposed to copper at range of concentrations • Range of [Cu] 0 – 400 g/L – Recorded as swimming or not after 6 hours – Recorded as live or dead after 24 h • Parameter estimates: – Intercept • 3.07  0.64 100 – Slope • LC50 – 50% swimming – odds = 1, log(odds) = 0 – Solve for y = 0 • Log[Cu] = 2.16 • [Cu] = 145 g/L 80 % Swimming • -1.42  0.31 • (t = -4.56, P<0.001) 60 40 20 0 0 1 2 Log [Cu] 3 100 80 % Swimming 0 60 40 20 0 0 1 2 Log [Cu] 3 Extension to multiple regression • Analogous to least squares multiple regression • Generates partial regression coefficients • Test overall regression by comparing fit of – Full model – Reduced model (constant or 0 only) • Wald tests as equivalent to t tests • Use likelihood ratio statistics (deviance) • Assumptions Logistic ANCOVA • Analogous to ANCOVA • Test for heterogeneity of slopes – Fit models with and without interaction present – Compare fit of two models • Run reduced model with covariate and categorical variable • Test effects of each Worked example Marshall et al. (2003) Ecology • Effects of larval size on juvenile survivorship in a bryozoan, Bugula neritina – Larval size measured, then juveniles transplanted to field and survival (and growth) monitored – Experiment repeated several times • Response variable: colony survival • Predictor variables: – Larval Size – Experimental Run Logistic ANCOVA (cont) • Fit model g(x) = 0 + 1x1 + I + i1x1 • 1 = overall effect of size • i = effect of Run i • i1 = effect of size in Run i – LL = -33.397, df = 7 • Fit model g(x) = 0 + 1x1 + I – LL = -36.22, df = 4 • Effect of interaction term – G2 = -2 (-36.22 – (-33.397)) = 5.65, df = 3, P = 0.130 – Conclude slopes not significantly heterogeneous 1.0 Effect of Larval Size 0.8 P = 0.001 Effect of Run P = 0.229 0.6 P(survival) 0.4 0.2 0.0 5000 16000 Larval size (m2) Important assumptions • Correct probability distribution for response variable • Collinearity – Inflates standard errors of parameter estimates – Interpretations unreliable – Few diagnostics available • Correlation matrices for predictor variables • Examine tolerance by running as OLS linear regression • Residuals – Not useful for individual observations – Aggregate approaches • Deciles of risk • Influence Contingency tables and loglinear models Introduction • Each observation classified into 2 groups – Phenotypes for trait controlled by single-locus with dominance – Behavioural choice between two alternatives • Is the distribution between these groups consistent with a particular hypothesis? – Crosses between known genotypes – No behavioural preference • Data expected to follow binomial distribution Binomial test • Behavioural experiment – n1, n2 are numbers making choices 1 & 2 • Null hypothesis: no preference – p = q = 0.5 • Test – Calculate probability of observing ≥ n1 by chance – Binomial expansion • (p + q)n P( y  n)  CNn p n q N n Example 5 animals choose A, 1 chooses B. Binomial Expansion 0 1 2 3 4 5 6 0.016 0.094 0.234 0.313 0.234 0.094 0.016 One-tailed test: P(5 or more) = P(5) + P(6) = 0.094 + 0.016 = 0.110 Two-tailed test: P(≥ 5 or ≤ 1) = P(5) + P(6) + P(1) +P(0) = 0.094 + 0.016 + 0.094 + 0.016 = 0.220 Two groups • Binomial test appropriate for small sample sizes – Provides exact probabilities • Alternative procedures for larger samples – Goodness-of-fit tests • 2 • Log-likelihood ratio tests 2 Goodness of fit test (oi  ei )  ei i 1 K • • • • Data in K groups oi is observed number in group i ei is expected number in group I Assess using 2 with k-1 df 2 More groups • Outcome no longer binomial, but multinomial: (p1 + p2 + … pi + … pk)n • Computationally difficult Observations Factor B Factor A 1 2 1 n11 n12 n1+ 2 n21 n22 n2+ n+1 n+2 n Factor B Factor A 1 2 1 11 12 1+ 2 21 22 2+ +1 +2 If A & B independent:  ij   i     j i+ = ni+/n Remember: P(A  B) = P(A)P(B) Calculate expected frequencies & test goodness-of-fit Expected cell frequencies f ij  n ij Test: I J X   2 i 1 j 1 (nij  fij ) 2 fij Assess against 2 with df = (I-1)(J-1) Worked example: two-way tables Regeneration and seed dispersal mechanisms of plants French & Westoby (1996) cross-classified plant species following fire by two variables: – – whether they regenerated by seed only or vegetatively whether they were ant or vertebrate dispersed. H0: dispersal mechanism is independent of mode of regeneration. Seed Vegetative Ant 25 36 Vertebrate 6 21 Observations Factor B Factor A Seed Veg Ant 25 36 61 Vert 6 21 27 31 57 88 Expected values 25 36 61 21.5 39.5 0.69 6 21 27 9.5 17.5 0.31 31 57 88 0.35 0.65 (25-21.5)2 / 21.5 = 0.57 2 = 2.89, df = 1, P = 0.089 Odds Ratio approach • Used for 2 x 2 tables • Calculate odds for each level of one factor – e.g., for seed only plants, odds of being ant dispersed, repeat for vegetative plants – i/(1- i) • Calculate Odds Ratio ( ) – Log ( ) • Calculate se: ASE (log  )  • Assess using Log ( ) /se 1 1 1 1    n11 n12 n21 n22 Example: odds ratio test 25 36 61 0.81 0.63 6 21 27 0.19 0.37 31 57 88 4.17 1.71  = 0.89 se = 0.53 95% CI = 0.86 to 6.89 4.17 / 1.71 = 2.43 Odds Small Sample Sizes • Aim for expected values <5 in no more than 20% of cells – Pool categories to raise expected numbers • Yate’s correction – Adjustment for continuity – Not widely recommended now • Fisher’s exact test – For 2 x 2 tables • Other exact methods – Randomisation tests Interpreting patterns: Residuals nij – fij Raw Residual Calculate for each cell Sample size dependent Standardized residual (nij  f ij ) f ij Freeman-Tukey deviate Compare to 2 0.05,1 ab nij  nij  1  4 fij  1 , where   df overall Example: Dead trees on floodplains • Surveys of dead coolibah trees • Transects with 3 positions: Top (dunes), middle, and bottom (lakeshore) • 2 = 13.66, df = 2, P <0.0005 • Reject H0 – Incidence of dead trees depends on floodplain position Dead Coolibah trees With Without Bottom 15 13 Middle 4 8 Top 0 17 Example: Dead trees on floodplains • Standardized residuals • More dead than expected near bottom, fewer than expected on dunes Dead Coolibah trees With Without Bottom 1.855 -1.312 Middle 0.000 0.000 Top -2.380 1.683 An alternative: log-linear models • GLM • Expected cell frequencies modelled using – Log link function – Poisson error term • Maximum likelihood estimation • Fit assessed using log-likelihood log f ij  constant + Xi  Yj  XY ij position coolibahposition log f ij  constant  coolibah     i j ij fij is the expected frequency in cell ij, constant is the mean of the logs of all the expected frequencies i X is the effect of category i of variable X j Y is the effect of category j of variable Y ijXY is the effect of any interaction between X and Y. The interaction measures deviations from independence of the two variables. log f ij  constant + Xi  Yj  XY ij Saturated model Fits data perfectly log f ij  constant + Xi  Yj Reduced model Independent action of factors X and Y Difference in fit of the two models indicates the importance of the interaction between X and Y (H0: XY = 0) For coolibah example Log-likelihood position coolibahposition log f ij  constant  coolibah     i j ij log f ij  constant  coolibah  position i j G2 -10.429, df = 3 -19.735, df = 2 = -2(LLmodel – LLsaturated) = -2(-19.735 – (-10.429)) = 18.61, df = 1, P < 0.001 Reject H0 More complex designs 3-way tables • 3 main effects • 3 two-factor interactions • 1 three-factor interaction • Estimation of parameters not simple – Iterative procedures Full model: XZ YZ XYZ log f ijk  constant  Xi  Yj  Zk  XY       ij ik jk ijk Models are hierarchical: Higher order term “forces” all simpler terms in Omission of two-way term forces omission of 3-way Representative models Loglinear model df X+Y+Z IJK-I-J-K+2 X + Y + Z + XY (K-1)(IJ-1) X + Y + Z + XZ (J-1)(IK-1) X + Y + Z + YZ (I-1)(JK-1) X + Y + Z + XZ + YZ K(I-1)(J-1) X + Y + Z + XY + YZ J(I-1)(K-1) X + Y + Z + XY + XZ I(J-1)(K-1) X + Y + Z + XY + XZ + YZ Saturated model: X + Y + Z + XY + XZ + YZ + XYZ (I-1)(J-1)(K-1) 0 Comparison of models • Choosing best model – Lowest value of G2 – Akaike Information Criterion (AIC) • Adjusts for number of parameters in model • G2 – 2 dftest of model • Tests of hypothesis – Contrast fit of two models differing in the presence of the term in question Worked example Wildebeeste carcasses (Sinclair & Arcese 1995) • Carcasses classified according to – Sex – Cause of death • (predation or not) – Health (state of bone marrow) Worked example Wildebeeste carcasses (Sinclair & Arcese 1995) Marrow type Cause of death Sex SWF OG TG Total Predation Female 26 32 8 66 Predation Male 14 43 10 67 Non-pred. Female 6 26 16 48 Non-pred. Male 7 12 26 45 53 113 60 226 Totals Fit of models Model G2 df P AIC 1 death + sex + marrow 42.76 7 <0.001 28.76 2 death x sex 42.68 6 <0.001 30.68 3 death x marrow 13.24 5 0.021 3.34 4 sex x marrow 37.98 5 <0.001 27.98 5 death x sex + death x marrow 13.16 4 0.011 5.16 6 death x sex + sex x marrow 37.89 4 <0.001 29.89 7 death x marrow + sex x marrow 8.46 3 0.037 2.46 8 death x sex + death x marrow + sex x marrow 7.19 2 0.027 3.19 9 Saturated (full) model 0 0 Tests of hypotheses 1 death + sex + marrow 42.76 7 2 death x sex 42.68 6 3 death x marrow 13.24 5 4 sex x marrow 37.98 5 5 death x sex + death x marrow 13.16 4 6 death x sex + sex x marrow 37.89 4 7 death x marrow + sex x marrow 8.46 3 8 death x sex + death x marrow + sex x marrow 7.19 2 9 Saturated (full) model 0 0

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Slide 1