Statistics 2: generalized linear models

General linear model: Y ~ a + b1*x1 + … + bn*xn + ε

There are many cases when general linear models are no longer valid:
• Loss of a) normality of errors, b) homogeneity of variance, c) linearity

Sometimes these can be fixed by transformations (e.g. log or sqrt of Y) or by including in the model a) interactions or b) polynomials (e.g. I(x^2))

Generalized linear models
• These broaden the concept of linear models
  – Non-normal error distributions
    • Poisson responses (counts)
    • Binomial responses (number of successes in n trials)
    • Also categorical data (multinomial response)
    • Exponential or Weibull responses (waiting time)
  – Transformation of the response to the linear scale with a link function

New model elements
Generalized linear model: f(Y) ~ a + b1*x1 + … + bn*xn + ε
• f: link function
• ε: error term; the variance is no longer homogeneous but depends on the mean

Link function
• The expected response μ = E(Y) is transformed to the linear scale: f(μ) = a + b1*x1 + … + bn*xn
• To obtain a prediction, apply the inverse link: μ = f^-1(a + b1*x1 + … + bn*xn)

Common choices for link and variance function
Distribution   Link                     Variance
Normal         Identity: μ              1 (constant)
Binomial       Logit: log(μ/(1-μ))      μ(1-μ)
Poisson        Logarithm: log(μ)        μ

Generalized linear models in R
glm(formula, family, data)
For example:
family=binomial(link="logit")
family=poisson(link="log")

Binomial data: logistic regression
• Binomial data gives the number of successes in n independent trials
  – Tossing a coin ten times
  – Exposing n patients to a treatment and seeing how many survive
• The expected value to be predicted (p) is the probability of success
  – The logit link function transforms this from (0, 1) to (-Inf, Inf)
• Two options for coding the response:
  – A vector of values 0 or 1: failure or success
  – A matrix with two columns: the first gives the number of successes, the second the number of failures

DEMO 1

Count data: Poisson regression
• Poisson distributed data is count data
  – Number of fish caught during a fishing trip
  – Number of species observed in a given area
• The link function is log and the model predicts the expected count
• Variance increases with the expected value
• Two special features:
  – Offset variables: 'observation time'
  – Over/under dispersion: variance in the data can be less or more than that assumed by the variance function -> can be handled with a dispersion parameter
    • This can also occur with binomial data

Measures of fit in glm
• Proportion of deviance explained
  – Problem: even a perfect model does not explain all deviance
• Pseudo R2
  – Cox and Snell's R2: 1-exp(-2/n*(logLik(model)-logLik(NULLModel)))
  – Nagelkerke's R2 rescales this to a maximum of 1: divide the value above by 1-exp(2/n*logLik(NULLModel))
  – NULL model is a model with intercept only; n = number of observations
  – DOES NOT work for quasi-distributions, because no likelihood is defined for them

DEMO 2 & exercises
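
The two response types and the fit measures described above can be tried out with a minimal R sketch. It is not the course's DEMO code: the data are simulated, and every variable name (dose, effort, temp, catch) and every coefficient is an assumption made for illustration. It fits a logistic regression with a two-column response, then a Poisson regression with an offset, and computes the proportion of deviance explained, a dispersion estimate and the likelihood-based pseudo R2.

```r
# Illustrative sketch only: simulated data, hypothetical variable names
set.seed(1)

## Logistic regression with a two-column (successes, failures) response
trials    <- 10
dose      <- rep(1:5, each = 4)                    # hypothetical predictor
successes <- rbinom(length(dose), size = trials, prob = plogis(-2 + 0.7 * dose))
failures  <- trials - successes
logit_model <- glm(cbind(successes, failures) ~ dose,
                   family = binomial(link = "logit"))

# Prediction on the probability scale = inverse logit of the linear predictor
p_hat    <- predict(logit_model, newdata = data.frame(dose = 3), type = "response")
p_manual <- plogis(sum(coef(logit_model) * c(1, 3)))

## Poisson regression with an offset for 'observation time'
n      <- 200
effort <- runif(n, 1, 8)                           # hours fished per trip
temp   <- rnorm(n, 15, 3)                          # hypothetical predictor
catch  <- rpois(n, lambda = effort * exp(-1 + 0.08 * temp))

model     <- glm(catch ~ temp + offset(log(effort)), family = poisson(link = "log"))
nullModel <- glm(catch ~ 1 + offset(log(effort)),    family = poisson(link = "log"))

# Proportion of deviance explained, relative to the intercept-only model
dev_expl <- 1 - deviance(model) / deviance(nullModel)

# Pearson-based dispersion estimate; values far from 1 hint at over/under dispersion
dispersion <- sum(residuals(model, type = "pearson")^2) / df.residual(model)

# Likelihood-based pseudo R2
ll1 <- as.numeric(logLik(model))
ll0 <- as.numeric(logLik(nullModel))
r2_cs         <- 1 - exp(-2 / n * (ll1 - ll0))     # Cox and Snell form
r2_nagelkerke <- r2_cs / (1 - exp(2 / n * ll0))    # rescaled so the maximum is 1

print(c(dev_expl = dev_expl, dispersion = dispersion,
        r2_cs = r2_cs, r2_nagelkerke = r2_nagelkerke))
```

If the dispersion estimate is clearly above 1, refitting with family = quasipoisson(link = "log") is one common option. logLik() then returns NA, which is why the pseudo R2 cannot be computed for quasi-distributions.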
