Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Generalised linear models Gil McVean, Department of Statistics Thursday 5th November 2009 Questions to ask… • How do you add covariates to a model? • What is a linear model? • What is a generalised linear model? • How do you estimate parameters and test hypotheses with GLMs? • How do you assess model fit with GLMs? 2 What is a covariate? • A covariate is a quantity that may influence the outcome of interest – Genotype at a SNP – Age of mice when measurement was taken – Batch of chips from which gene expression was measured • Previously, we have looked at using likelihood to estimate parameters of underlying distributions • We want to generalise this idea to ask how covariates might influence the underlying parameters • Much statistical modelling is concerned with considering linear effects of covariates on underlying parameters 3 What is a linear model? • In a linear model, the expectation of the response variable is defined as a linear combination of explanatory variables Yi 0 1 X i 2 Zi 3 X i Zi ... Response variable • Intercept Linear relationships with explanatory variables Gaussian error Explanatory variables can include any function of the original data Yi 0 1 X 2e 2 i • Interaction term Zi 3 X i ... Zi But the link between E(Y) and X (or some function of X) is ALWAYS linear and the error is ALWAYS Gaussian 4 What is a GLM? • There are many settings where the error is non-Gaussian and/or the link between E(Y) and X is not necessarily linear – Discrete data (e.g. counts in multinomial or Poisson experiments) – Categorical data (e.g. Disease status) – Highly-skewed data (e.g. Income, ratios) • Generalised linear models keep the notion of linearity, but enable the use of nonGaussian error models E (Yi ) i g 1 0 1 X i 2 Z i 3 X i Z i ... • g is called the link function – In linear models, the link function is the identity • The response variable can be drawn from any distribution of interest (the distribution function) – In linear models this is Gaussian 5 Poisson regression • In Poisson regression the expected value of the response variable is given by the exponent of the linear term E (Yi ) i exp 0 1 X i 2 Zi 3 X i Zi ... • The link function is the log • Note that several distribution functions are possible (normal, Poisson, binomial counts), though in practice Poisson regression is typically used to model count data (particularly when counts are low) 6 Example: Caesarean sections in public and private hospitals 7 Boxplots of rates of C sections 8 Fitting a model without covariates > analysis<-glm(d$Caes ~ d$Births, family = "poisson") > summary(analysis) Call: glm(formula = d$Caes ~ d$Births, family = "poisson") Deviance Residuals: Min 1Q Median -2.81481 -0.73305 -0.08718 3Q 0.74444 Max 2.19103 Coefficients: Estimate Std. Error z value Pr(>|z|) Implies an average of (Intercept) 2.132e+00 1.018e-01 20.949 < 2e-16 *** 12.7 per 1000 births d$Births 4.406e-04 5.395e-05 8.165 3.21e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 99.990 Residual deviance: 36.415 AIC: 127.18 on 19 on 18 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 4 9 Fitting a model with covariates glm(formula = d$Caes ~ d$Births + d$Hospital, family = "poisson") Deviance Residuals: Min 1Q Median -2.3270 -0.6121 -0.0899 3Q 0.5398 Coefficients: Estimate Std. Error z (Intercept) 1.351e+00 2.501e-01 d$Births 3.261e-04 6.032e-05 d$Hospital 1.045e+00 2.729e-01 --Signif. codes: 0 ‘***’ 0.001 ‘**’ • Max 1.6626 value 5.402 5.406 3.830 Pr(>|z|) 6.58e-08 *** 6.45e-08 *** 0.000128 *** 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Unexpectedly, this indicates that public hospitals actually have a higher rate of Caesarean sections than private ones Implies an average of 15.2 per 1000 births in public hospitals and 5.4 per 1000 births in private ones 10 Checking model fit • Look a distribution of residuals and how well observed values are predicted 11 What’s going on? • Initial summary suggested opposite result to GLM analysis. Why? Normal Caesarean Private 412 16 Public 21121 285 Relative risk Private (compared to public) = 2.8 • Relationship between no. Births and no. C sections does not appear to be linear • Accounting for this removes most of the apparent differences between hospital types • There is also one quite influential outlier 12 Simpson’s paradox • Be careful about adding together observations – this can be misleading • E.g. Berkeley sex-bias case Applicants % admitted Men 8442 44% Women 4321 35% Major Men Appears that women have lower success Women Applicants % admitted Applicants % admitted A 825 62% 108 82% B 560 63% 25 68% C 325 37% 593 34% D 417 33% 375 35% E 191 28% 393 24% F 272 6% 341 7% But actually women are typically more successful at the Departmental level, just apply to more competitive subjects 13 14 Finding MLEs in GLM • In linear modelling we can use the beautiful compactness of linear algebra to find MLEs and estimates of the variance for parameters • Consider an n by k+1 data matrix, X, where n is the number of observations and k is the number of explanatory variables, and a response vector Y – the first column is ‘1’ for the intercept term • The MLEs for the coefficients () can be estimated using ˆβ XT X1 XT Y • βˆ ~ N (β, ( XT X) 1 2 ) In GLMs, there is usually no such compact analytical expression for the MLEs – Use numerical methods to maximise the likelihood 15 Testing hypotheses in GLMs • For the parameters we are interested in we typically want to ask how much evidence there is that these are different from zero • For this we need to construct confidence intervals for the parameter estimates • We could estimate the confidence interval by finding all parameter values with log-likelihood no greater than 1.94 units worse than the MLE • Alternatively, we might appeal to the CLT and use bootstrap techniques to estimate the variance of parameter estimates • However, we can also appeal to theoretical considerations of likelihood (again based on the CLT) that show that parameter estimates are asymptotically normal with variance described by the Fisher information matrix • Informally, the information matrix describes the sharpness of the likelihood curve around the MLE and the extent to which parameter estimates are correlated 16 Logistic regression • When only two types of outcome are possible (e.g. disease/not-disease) we can model counts by the binomial • If we want to perform inference about the factors that influence the probability of ‘success’ it is usual to use the logistic model exp 0 1 X i 2 Z i ... E (Yi ) 1 exp 0 1 X i 2 Z i ... • The link function here is the logit g ( ) log 1 17 Example: testing for genotype association • In a cohort study, we observe the number of individuals in a population that get a particular disease • We want to ask whether a particular genotype is associated with increased risk • The simplest test is one in which we consider a single coefficient for the genotypic value Genotype AA Aa AA Genotypic value 0 1 2 Frequency in population p2 2p1p 1p2 Probability of disease p0 2 0 = -4 1 = 2 1 pi p1 p2 1 1 e [ 0 1Gi ] 0 18 A note on the model • Note that each copy of the risk allele contribute in an additive way to the exponent • This does not mean that each allele ‘adds’ a fixed amount to the probability of disease • Rather, each allele contributes a fixed amount to the log-odds • This has the effect of maintaining Hardy-Weinberg equilibrium within both the cases and controls P(disease ) 0 1Gi log odds log P( Not disease ) 19 Concepts in disease genetics • Relative risk describes the risk to a person in an exposed group compared to the unexposed group RR • P(disease | exposed ) P(disease | not exposed ) The odds ratio compares the odds of disease occurring in one group relative to another P(disease | exposed ) P ( not disease | exposed ) OR P(disease | not exposed ) P ( not disease | not exposed ) • If the absolute risk of disease is low the two will be very similar 20 Cont. • Suppose in a given study we observe the following counts Genotype 0 1 2 Counts with disease 26 39 21 1298 567 49 Counts without disease • We can fit a GLM using the logit link function and binomial probabilities • We have genotype data stored in the vector gt and disease status in the vector status • Using R, this is specified by the command – glm(formula = status ~ gt, family = binomial) 21 Interpreting results Measure of contribution of individual observations to overall goodness of fit (for MLE model) MLE for coefficients Call: glm(formula = status ~ gt, family = binomial) Standard error for estimates Deviance Residuals: Min 1Q Median -0.8554 -0.4806 -0.2583 Estimate/std. error 3Q -0.2583 Max 2.6141 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.6667 0.2652 -17.598 <2e-16 *** gt 1.2833 0.1407 9.123 <2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 P-value from normal distribution (Dispersion parameter for binomial family taken to be 1) Null deviance: 967.36 Residual deviance: 886.28 AIC: 890.28 on 1999 on 1998 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 Measure of goodness of fit of null (compared to saturated model) Measure of goodness of fit of fitted model Penalised likelihood used in model choice Number of iterations used to find MLE 22 Adding in extra covariates • Adding in additional explanatory variables in GLM is essentially the same as in linear model analysis • Likewise, we can look at interactions • In the disease study we might want to consider age as a potentially important covariate glm(formula = status ~ gt + age, family = binomial) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -7.00510 0.79209 -8.844 <2e-16 *** gt 1.46729 0.17257 8.503 <2e-16 *** age 0.04157 0.01903 2.185 0.0289 * 23 Adding model complexity • In the disease status analysis we might want to generalise the fitted model to one in which each genotype is allowed its own risk pi 1 1 e [ 0 1I Gi 1 1I Gi 2 ] glm(formula = status ~ g1 + g2 + age, family = binomial) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.46870 0.73155 -7.476 7.69e-14 *** g1TRUE 1.25694 0.26278 4.783 1.72e-06 *** g2TRUE 3.00816 0.33871 8.881 < 2e-16 *** age 0.04224 0.01915 2.206 0.0274 * --Null deviance: 684.44 on 1999 degrees of freedom Residual deviance: 609.09 on 1996 degrees of freedom AIC: 617.09 24 Model choice • It is worth remembering that we cannot simply identify the ‘significant’ parameters and put them in our chosen model • Significance for a parameter tests the marginal null hypothesis that the coefficient associated with that parameter is zero • If two explanatory variables are highly correlated then marginally neither may be significant, yet a linear model contains only one would be highly significant • There are several ways to choose appropriate models from data. These typically involve adding in parameters one at a time and adding some penalty to avoid over-fitting. 25