Linear Modelling III
Richard Mott
Wellcome Trust Centre for Human Genetics

Synopsis
• Logistic Regression
• Mixed Models

Logistic Regression and Statistical Genetics
• Applicable to association studies
• Data:
  – binary outcomes (e.g. disease status)
  – dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the outcome
• Rigorous tests of statistical significance
• Flexible modelling language
• Generalisation of the chi-squared test

Data
• Data for individuals labelled i = 1…n:
  – binary response yi
  – unphased genotypes gij for markers j = 1…m

Coding Unphased Genotypes
• Several possibilities:
  – AA, AG, GG (original genotypes)
  – 11, 12, 22
  – 1, 2, 3
  – 0, 1, 2 (number of G alleles)
• Missing data: NA (the default in R)

Using R
• Load the genetic logistic regression tools:
  > source('logistic.R')
• Read the data table from file:
  > t <- read.table('geno.dat', header=TRUE)
• Column names: names(t)
  – t$y: response (0, 1)
  – t$m1, t$m2, …: genotypes for each marker

Contingency Tables in R
• ftable(t$y, t$m31) prints the contingency table:

  > ftable(t$y, t$m31)
       11  12  22
  0   515 387  75
  1    28  11   2

Chi-Squared Test in R

  > chisq.test(t$y, t$m31)

          Pearson's Chi-squared test

  data:  t$y and t$m31
  X-squared = 3.8424, df = 2, p-value = 0.1464

  Warning message:
  Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31)

The Logistic Model
• Models the effect of a SNP k on the phenotype, versus a null model of no effect
• Prob(Yi = 1) = exp(ηi)/(1 + exp(ηi))
• ηi = Σj xij βj (the linear predictor)
• xij: design matrix (genotypes etc.)
• βj: model parameters (to be estimated)
• The model is investigated by
  – estimating the βj by maximum likelihood
  – testing whether the estimates differ from 0

The Logistic Function
• Prob(Y = 1) = exp(η)/(1 + exp(η))
• [Figure: sigmoid curve of Prob(Y = 1) against η]

Types of genetic effect at a single locus

              AA  AG  GG
  Recessive    0   0   1
  Dominant     0   1   1
  Additive     0   1   2
  Genotype     0   a   b

Additive Genotype Model
• Code genotypes as AA: x = 0, AG: x = 1, GG: x = 2
• Linear predictor: η = β0 + xβ1
• P(Y=1|x) = exp(β0 + xβ1)/(1 + exp(β0 + xβ1))
• PAA = P(Y=1|x=0) = exp(β0)/(1 + exp(β0))
• PAG = P(Y=1|x=1) = exp(β0 + β1)/(1 + exp(β0 + β1))
• PGG = P(Y=1|x=2) = exp(β0 + 2β1)/(1 + exp(β0 + 2β1))

Additive Model: β0 = -2, β1 = 2
• PAA = 0.12, PAG = 0.50, PGG = 0.88
• [Figure: logistic curve of Prob(Y = 1) against η, with the three genotype probabilities marked]

Additive Model: β0 = 0, β1 = 2
• PAA = 0.50, PAG = 0.88, PGG = 0.98

Recessive Model
• Code genotypes as AA: x = 0, AG: x = 0, GG: x = 1
• Linear predictor: η = β0 + xβ1
• P(Y=1|x) = exp(β0 + xβ1)/(1 + exp(β0 + xβ1))
• PAA = PAG = P(Y=1|x=0) = exp(β0)/(1 + exp(β0))
• PGG = P(Y=1|x=1) = exp(β0 + β1)/(1 + exp(β0 + β1))

Recessive Model: β0 = 0, β1 = 2
• PAA = PAG = 0.50, PGG = 0.88

Genotype Model
• Each genotype has an independent probability
• Code genotypes as (for example)
  – AA: x = 0, z = 0
  – AG: x = 1, z = 0
  – GG: x = 0, z = 1
• Linear predictor: η = β0 + xβ1 + zβ2 (two genotype parameters)
• P(Y=1|x,z) = exp(β0 + xβ1 + zβ2)/(1 + exp(β0 + xβ1 + zβ2))
• PAA = P(Y=1|x=0,z=0) = exp(β0)/(1 + exp(β0))
• PAG = P(Y=1|x=1,z=0) = exp(β0 + β1)/(1 + exp(β0 + β1))
• PGG = P(Y=1|x=0,z=1) = exp(β0 + β2)/(1 + exp(β0 + β2))

Genotype Model: β0 = 0, β1 = 2, β2 = -1
• PAA = 0.50, PAG = 0.88, PGG = 0.27

Models in R (response y, genotype g)

  model       AA  AG  GG   formula            DF
  Recessive    0   0   1   y ~ recessive(g)    1
  Dominant     0   1   1   y ~ dominant(g)     1
  Additive     0   1   2   y ~ additive(g)     1
  Genotype     0   a   b   y ~ genotype(g)     2

Data Transformation
• g <- t$m1
• Use these functions to treat a genotype vector in a certain way:
  – a <- additive(g)
  – r <- recessive(g)
  – d <- dominant(g)
  – g <- genotype(g)

Fitting the Model
• afit <- glm( t$y ~ additive(g), family='binomial')
• rfit <- glm( t$y ~ recessive(g), family='binomial')
• dfit <- glm( t$y ~ dominant(g), family='binomial')
• gfit <- glm( t$y ~ genotype(g), family='binomial')
• Equivalent models:
  – genotype = dominant + recessive
  – genotype = additive + recessive
  – genotype = additive + dominant
  – genotype ~ the standard chi-squared test of genotype association

Parameter Estimates

  > summary(glm( t$y ~ genotype(t$m31), family='binomial'))
  Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
  β0 (Intercept)         -2.9120     0.1941 -15.006   <2e-16 ***
  β1 genotype(t$m31)12   -0.6486     0.3621  -1.791   0.0733 .
  β2 genotype(t$m31)22   -0.7124     0.7423  -0.960   0.3372
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Deviance Chi-Squared Test

  > anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq")
  Analysis of Deviance Table

  Model: binomial, link: logit
  Response: t$y
  Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev
  NULL                              1017     343.71
  genotype(t$m31)   2     3.96      1015     339.76

Model Comparison
• Compare the general genotype model with the additive, dominant or recessive models:

  > afit <- glm(t$y ~ additive(t$m20))
  > gfit <- glm(t$y ~ genotype(t$m20))
  > anova(afit, gfit)
  Analysis of Deviance Table

  Model 1: t$y ~ additive(t$m20)
  Model 2: t$y ~ genotype(t$m20)
    Resid. Df Resid. Dev Df Deviance
  1      1016     38.301
  2      1015     38.124  1    0.177

Scanning all Markers

  > logscan(t, model='additive')
          Deviance     DF  Pval          LogPval
  m1   8.604197e+00     1  3.353893e-03   2.474450800
  m2   7.037336e+00     1  7.982767e-03   2.097846522
  m3   6.603882e-01     1  4.164229e-01   0.380465360
  m4   3.812860e+00     1  5.086054e-02   1.293619014
  m5   7.194936e+00     1  7.310960e-03   2.136025588
  m6   2.449127e+00     1  1.175903e-01   0.929628598
  m7   2.185613e+00     1  1.393056e-01   0.856031566
  m8   1.227191e+00     1  2.679539e-01   0.571939852
  m9   2.532562e+01     1  4.842353e-07   6.314943565
  m10  5.729634e+01     1  3.748518e-14  13.426140380
  m11  3.107441e+01     1  2.483233e-08   7.604982503
  …

Multilocus Models
• Can test the effects of fitting two or more markers simultaneously
• Several multilocus models are possible
• The interaction model assumes that each combination of genotypes has a different effect
  – e.g. t$y ~ t$m10 * t$m15

Multi-Locus Models in R

  > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial')
  > anova(f, test="Chisq")
  Analysis of Deviance Table

  Model: binomial, link: logit
  Response: t$y
  Terms added sequentially (first to last)

                                   Df Deviance Resid. Df Resid. Dev
  NULL                                              1017     343.71
  genotype(t$m13)                   2   108.68      1015     235.03
  genotype(t$m26)                   2     1.14      1013     233.89
  genotype(t$m13):genotype(t$m26)   3     6.03      1010     227.86

  > pchisq(6.03, 3, lower.tail=F)   # p-value for the interaction term
  [1] 0.1101597

Adding the effects of Sex and other Covariates
• Read sex and other covariate data (e.g. age) from a file into variables, say a$sex, a$age
• Fit models of the form:
  – fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
  – fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')
• Compare the models using anova(fit1, fit2)
  – this tests whether the effect of marker m10 is significant after taking sex and age into account

Random Effects and Mixed Models
• So far our models have had only fixed effects:
  – each parameter can take any value independently of the others
  – each parameter estimate uses up a degree of freedom
  – models with large numbers of parameters have problems:
    • saturation (overfitting)
    • poor predictive properties
    • numerical instability; can be difficult to fit
• In some cases we can instead treat parameters as being sampled from a distribution (random effects)
  – we then only estimate the parameters needed to specify that distribution
  – this can give a more robust model

Example of a Mixed Model
• Testing genetic association across a large number of families:
  – yi = trait value of the i-th individual
  – bi = genotype of the individual at the SNP of interest
  – fi = family identifier (factor)
  – ei = error, with variance σ²
  – y = μ + b + f + e
• If we treat the family effects f as fixed then we must estimate a large number of parameters
• Better to think of these effects as having a distribution N(0, τ²), where the variance τ² must be estimated
  – individuals from the same family have the same f and are correlated: cov = τ²
  – individuals from different families are uncorrelated
  – the genotype parameters b are still treated as fixed effects
  – this is a mixed model

Mixed Models
• y = μ + b + f + e
• Fixed effects model:
  – E(y) = μ + b + f
  – Var(y) = Iσ² (I is the identity matrix)
• Mixed model:
  – E(y) = μ + b
  – Var(y) = Iσ² + Fτ², where F is the matrix with Fij = 1 if i and j are in the same family (0 otherwise)
  – need to estimate both σ² and τ²

Generalised Least Squares
• y = Xβ + e
• Var(y) = V (symmetric covariance matrix)
  – V = Iσ² (uncorrelated errors)
  – V = Iσ² + Fτ² (single grouping random effect)
  – V = Iσ² + Gτ² (G = genotype identity-by-state)
• GLS solution, if V is known:

  β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y

• unbiased, efficient (minimum variance)
• consistent (tends to the true value in large samples)
• asymptotically normally distributed

Multivariate Normal Distribution
• Joint distribution of a vector of correlated observations; another way of thinking about the data
• Estimation of the parameters of a mixed model is a special case of likelihood analysis of an MVN
• y ~ MVN(μ, V)
  – μ is the vector of expected values
  – V is the variance-covariance matrix
  – if V = Iσ² the elements of y are uncorrelated, equivalent to n independent Normal variables
• Probability density / likelihood:

  L(y | μ, V) = exp(-½ (y - μ)ᵀ V⁻¹ (y - μ)) / ((2π)^(n/2) |V|^(1/2))

Henderson's Mixed Model Equations
• General linear mixed model: y = Xb + Zu + e
  – b fixed effects, u random effects, X and Z design matrices
  – u ~ MVN(0, G), with G = Iτ²
  – e ~ MVN(0, R), with R = Iσ²
  – y ~ MVN(Xb, V), with V = ZGZᵀ + R
• The mixed model equations:

  [ XᵀR⁻¹X   XᵀR⁻¹Z       ] [ b̂ ]   [ XᵀR⁻¹y ]
  [ ZᵀR⁻¹X   ZᵀR⁻¹Z + G⁻¹ ] [ û ] = [ ZᵀR⁻¹y ]

Mixed Models: Remarks
• Fixed effects models are special cases of mixed models
• Mixed models are sometimes more powerful (as they have fewer parameters)
  – but check that the assumption that the effects are sampled from a Normal distribution is reasonable
  – parameter estimates and statistical significance can differ between fixed and mixed models
• Shrinkage: random-effect estimates from a mixed model often have smaller magnitudes than the corresponding fixed-effect estimates

Mixed Models in R
• Several R packages are available, e.g.
  – lme4: general-purpose mixed models package
  – emma: genetic association in structured populations
• Formulas in lme4, e.g. a test for association between genotype and trait:
  – H0: E(y) = μ vs H1: E(y) = μ + b, with Var(y) = Iσ² + Fτ²
  – h0 <- lmer(y ~ 1 + (1|family), data=data)
  – h1 <- lmer(y ~ genotype + (1|family), data=data)
  – anova(h0, h1)

Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2) etc.
  – random intercept models
• y ~ fixed.effects + (1|Group1) + (x|Group1)
  – random slope model

Example: HDL Data

  > library(lme4)
  > b=read.delim("Biochemistry.txt")
  > cc=complete.cases(b$Biochem.HDL)
  > f0=lm(Biochem.HDL ~ Family, data=b, subset=cc)
  > f1=lmer(Biochem.HDL ~ (1|Family), data=b, subset=cc)
  > anova(f0)
  Analysis of Variance Table

  Response: Biochem.HDL
              Df Sum Sq Mean Sq F value    Pr(>F)
  Family     175 147.11 0.84064  5.2098 < 2.2e-16 ***
  Residuals 1681 271.24 0.16136
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  > f1
  Linear mixed model fit by REML
  Formula: Biochem.HDL ~ (1 | Family)
     Data: b
   Subset: cc
   AIC  BIC logLik deviance REMLdev
  2161 2178  -1078     2149    2155
  Random effects:
   Groups   Name        Variance Std.Dev.
   Family   (Intercept) 0.066633 0.25813
   Residual             0.161487 0.40185
  Number of obs: 1857, groups: Family, 176

  Fixed effects:
              Estimate Std. Error t value
  (Intercept)  1.60615    0.02263   70.97

  > plot(resid(f0), resid(f1))
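The additive-model probabilities quoted in the slides (e.g. β0 = -2, β1 = 2 giving PAA = 0.12, PAG = 0.50, PGG = 0.88) are easy to verify numerically. A minimal Python sketch (Python rather than the lecture's R, purely for illustration):

```python
import math

def logistic(eta):
    """Inverse logit: Prob(Y = 1) for linear predictor eta."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def additive_probs(b0, b1):
    """P(Y = 1) for genotypes AA (x=0), AG (x=1), GG (x=2) under eta = b0 + x*b1."""
    return [logistic(b0 + x * b1) for x in (0, 1, 2)]

print([round(p, 2) for p in additive_probs(-2, 2)])  # [0.12, 0.5, 0.88]
print([round(p, 2) for p in additive_probs(0, 2)])   # [0.5, 0.88, 0.98]
```

The same helper reproduces the recessive and genotype-model slides by changing how the genotype enters the linear predictor.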
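The chisq.test output for marker m31 can be reproduced by hand from the contingency table. The counts below are taken from the ftable output, with the row/column orientation inferred from the fitted logistic coefficients (so treat that arrangement as an assumption). For a 2 df statistic the chi-squared p-value has the simple closed form exp(-X²/2):

```python
import math

# Counts for t$y (rows: 0, 1) by t$m31 genotype (columns: 11, 12, 22),
# as inferred from the lecture's ftable output.
obs = [[515, 387, 75],
       [28, 11, 2]]

row = [sum(r) for r in obs]            # margin totals over genotypes
col = [sum(c) for c in zip(*obs)]      # margin totals over case status
n = sum(row)

# Pearson chi-squared: sum of (O - E)^2 / E over all cells,
# with E = row total * column total / grand total.
x2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(3))

# With 2 degrees of freedom the chi-squared survival function is exp(-x/2).
p = math.exp(-x2 / 2)
print(round(x2, 4), round(p, 4))  # 3.8424 0.1464, matching chisq.test
```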
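For the saturated genotype model the maximum-likelihood estimates have a closed form: the intercept is the log odds of disease in the baseline genotype, and each remaining coefficient is a log odds ratio relative to that baseline. Using the same (inferred) m31 counts, this reproduces the summary() coefficients:

```python
import math

# y=1 and y=0 counts per genotype of marker m31, as inferred from the ftable.
cases    = {'11': 28, '12': 11, '22': 2}
controls = {'11': 515, '12': 387, '22': 75}

def log_odds(g):
    """Observed log odds of y=1 within genotype group g."""
    return math.log(cases[g] / controls[g])

b0 = log_odds('11')        # intercept: baseline genotype 11
b1 = log_odds('12') - b0   # log odds ratio, genotype 12 vs 11
b2 = log_odds('22') - b0   # log odds ratio, genotype 22 vs 11
print(round(b0, 4), round(b1, 4), round(b2, 4))  # -2.912 -0.6486 -0.7124
```

This only works because the genotype model is saturated; the additive, dominant and recessive fits constrain the genotype probabilities and need iterative maximisation, which is what glm() does.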
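The GLS formula β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y is easiest to see in the diagonal-V case (weighted least squares), where each observation is simply weighted by the inverse of its variance. A pure-Python sketch for a single predictor with no intercept, on hypothetical toy data:

```python
def gls_slope(x, y, v):
    """Single-predictor, no-intercept GLS slope with diagonal V = diag(v):
    beta = (x' V^-1 x)^-1 x' V^-1 y = sum(x*y/v) / sum(x*x/v)."""
    xtv_y = sum(xi * yi / vi for xi, yi, vi in zip(x, y, v))
    xtv_x = sum(xi * xi / vi for xi, vi in zip(x, v))
    return xtv_y / xtv_x

x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.0]
# Equal variances reduce GLS to ordinary least squares:
print(gls_slope(x, y, [1.0, 1.0, 1.0]))
# Inflating the variance of the third observation down-weights it:
print(gls_slope(x, y, [1.0, 1.0, 10.0]))
```

With a correlated V (e.g. Iσ² + Fτ²) the same formula applies but needs full matrix inversion, which is where mixed-model software comes in.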
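A useful sanity check on the MVN likelihood: when V = Iσ² (or any diagonal V) it must factor into a product of n independent univariate Normal densities, as the slides state. A small numerical confirmation, with arbitrary illustrative values:

```python
import math

def mvn_loglik_diag(y, mu, var):
    """Log-likelihood of MVN(mu, V) with diagonal V = diag(var):
    -(n/2) log(2 pi) - (1/2) log|V| - (1/2) (y-mu)' V^-1 (y-mu)."""
    n = len(y)
    logdet = sum(math.log(v) for v in var)
    quad = sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mu, var))
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * logdet - 0.5 * quad

def normal_logpdf(y, m, v):
    """Univariate Normal(m, v) log-density."""
    return -0.5 * math.log(2 * math.pi * v) - 0.5 * (y - m) ** 2 / v

y  = [1.2, -0.3, 0.7]
mu = [1.0, 0.0, 0.5]
v  = [0.5, 1.0, 2.0]
# The diagonal-V MVN log-likelihood equals the sum of univariate log-densities:
print(mvn_loglik_diag(y, mu, v))
print(sum(normal_logpdf(yi, mi, vi) for yi, mi, vi in zip(y, mu, v)))
```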
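The mixed-model covariance Var(y) = Iσ² + Fτ² can be built directly from family labels, and the variance components reported by the lme4 HDL fit (τ² = 0.066633 for Family, σ² = 0.161487 residual) imply an intraclass correlation, i.e. the fraction of trait variance attributable to family, of about 0.29. A sketch with hypothetical family labels:

```python
def family_cov(families, sigma2, tau2):
    """Var(y) = I*sigma2 + F*tau2, with F[i][j] = 1 when i and j share a family."""
    n = len(families)
    return [[(sigma2 if i == j else 0.0) +
             (tau2 if families[i] == families[j] else 0.0)
             for j in range(n)] for i in range(n)]

# Hypothetical grouping: first two individuals from one family, third from another.
V = family_cov(['fam1', 'fam1', 'fam2'],
               sigma2=0.161487, tau2=0.066633)
# Same family: cov = tau2; different families: 0; diagonal: sigma2 + tau2.

# Intraclass correlation from the HDL fit's variance components:
tau2, sigma2 = 0.066633, 0.161487
icc = tau2 / (tau2 + sigma2)
print(round(icc, 2))  # 0.29
```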