Linear Modelling III
Richard Mott
Wellcome Trust Centre for Human Genetics
Synopsis
• Logistic Regression
• Mixed Models
Logistic Regression and Statistical Genetics
• Applicable to Association Studies
• Data:
– Binary outcomes (eg disease status)
– Dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the outcome
• Rigorous tests of statistical significance
• Flexible modelling language
• Generalisation of the Chi-Squared Test
Data
• Data for individual labelled i=1…n:
– Binary Response yi
– Unphased Genotypes gij for markers j=1..m
Coding Unphased Genotypes
• Several possibilities:
– AA, AG, GG original genotypes
– 11, 12, 22
– 1, 2, 3
– 0, 1, 2 # of G alleles
• Missing Data
– NA default in R
Using R
• Load genetic logistic regression tools
– > source('logistic.R')
• Read data table from file
– > t <- read.table('geno.dat', header=TRUE)
• Column names
– names(t)
– t$y: response (0,1)
– t$m1, t$m2, …: genotypes for each marker
Contingency Tables in R
• ftable(t$y, t$m31) prints the contingency table
> ftable(t$y, t$m31)
     11  12  22
0   515 387  75
1    28  11   2
>
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>
The Logistic Model
• Effect of SNP k on phenotype
• Null model (no effect of the SNP)
The Logistic Model
• Prob(Yi=1) = exp(hi)/(1+exp(hi))
• hi = Σj xij bj – Linear Predictor
• xij – Design Matrix (genotypes etc)
• bj – Model Parameters (to be estimated)
• Model is investigated by
– estimating the bj's by maximum likelihood
– testing if the estimates are different from 0
The Logistic Function
Prob(Yi=1) = exp(hi)/(1+exp(hi))
[Plot: the logistic curve, Prob(Y=1) against h]
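The curve above can be reproduced in a few lines of base R (a minimal sketch; plogis() is base R's logistic function):

h <- seq(-6, 6, by = 0.1)
p <- exp(h) / (1 + exp(h))      # equivalent to plogis(h)
plot(h, p, type = "l", xlab = "h", ylab = "Prob(Y=1)")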
Types of genetic effect at a single locus
            AA  AG  GG
Recessive    0   0   1
Dominant     0   1   1
Additive     0   1   2
Genotype     0   a   b
Additive Genotype Model
• Code genotypes as
– AA: x=0
– AG: x=1
– GG: x=2
• Linear Predictor
– h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2 b1 = 2
PAA = 0.12 PAG = 0.50 PGG = 0.88
[Plot: Prob(Y=1) against h]
Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98
[Plot: Prob(Y=1) against h]
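These probabilities can be checked directly in R, a quick sketch using the additive coding above (plogis(h) = exp(h)/(1+exp(h))):

x <- c(AA = 0, AG = 1, GG = 2)   # additive genotype coding
plogis(-2 + 2 * x)               # b0 = -2, b1 = 2: 0.12, 0.50, 0.88
plogis( 0 + 2 * x)               # b0 =  0, b1 = 2: 0.50, 0.88, 0.98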
Recessive Model
• Code genotypes as
– AA: x=0
– AG: x=0
– GG: x=1
• Linear Predictor
– h = b0 + xb1
• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88
[Plot: Prob(Y=1) against h]
Genotype Model
• Each genotype has an independent probability
• Code genotypes as (for example)
– AA: x=0, z=0
– AG: x=1, z=0
– GG: x=0, z=1
• Linear Predictor
– h = b0 + xb1 + zb2 (two genotype parameters, b1 and b2)
• P(Y=1|x,z) = exp(b0 + xb1 + zb2)/(1+exp(b0 + xb1 + zb2))
• PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27
[Plot: Prob(Y=1) against h]
Models in R
response y, genotype g

            AA  AG  GG   model             DF
Recessive    0   0   1   y ~ recessive(g)   1
Dominant     0   1   1   y ~ dominant(g)    1
Additive     0   1   2   y ~ additive(g)    1
Genotype     0   a   b   y ~ genotype(g)    2
Data Transformation
• g <- t$m1
• Use these functions to treat a genotype vector in a certain way:
– a <- additive(g)
– r <- recessive(g)
– d <- dominant(g)
– g <- genotype(g)
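logistic.R itself is not reproduced in these slides, so the following is only a guess at how such helpers might be written, assuming genotypes are coded as the strings '11', '12', '22' used earlier (NA genotypes propagate automatically):

additive  <- function(g) ifelse(g == "11", 0, ifelse(g == "12", 1, 2))  # count of allele 2
dominant  <- function(g) ifelse(g == "11", 0, 1)                        # carries allele 2
recessive <- function(g) ifelse(g == "22", 1, 0)                        # homozygous for allele 2
genotype  <- function(g) factor(g)                                      # unconstrained 2-df factor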
Fitting the Model
• afit <- glm( t$y ~ additive(g), family='binomial')
• rfit <- glm( t$y ~ recessive(g), family='binomial')
• dfit <- glm( t$y ~ dominant(g), family='binomial')
• gfit <- glm( t$y ~ genotype(g), family='binomial')
• Equivalent models:
– genotype = dominant + recessive
– genotype = additive + recessive
– genotype = additive + dominant
– genotype ~ standard chi-squared test of genotype
association
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)        -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12  -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22  -0.7124     0.7423  -0.960   0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
Analysis of Deviance
Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'),
test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                Df Deviance Resid. Df Resid. Dev
NULL                             1017     343.71
genotype(t$m31)  2     3.96      1015     339.76
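The p-value implied by this table can be computed from the change in deviance, using the same pchisq() step shown later for the multi-locus model:

pchisq(3.96, 2, lower.tail=FALSE)   # ≈ 0.14, close to the Pearson chi-squared p-value (0.146) above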
Model Comparison
• Compare general model with additive,
dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>
Scanning all Markers
> logscan(t, model='additive')
         Deviance DF         Pval      LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503
…
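logscan() is another helper from logistic.R and is not shown; a rough equivalent (a sketch only, here taking the model as a function rather than a string) simply loops glm() over the marker columns:

logscan_sketch <- function(dat, model = additive) {
  markers <- grep("^m[0-9]+$", names(dat), value = TRUE)    # marker columns m1, m2, ...
  res <- t(sapply(markers, function(mk) {
    an  <- anova(glm(dat$y ~ model(dat[[mk]]), family = "binomial"), test = "Chisq")
    dev <- an[2, "Deviance"]; df <- an[2, "Df"]
    p   <- pchisq(dev, df, lower.tail = FALSE)
    c(Deviance = dev, DF = df, Pval = p, LogPval = -log10(p))
  }))
  as.data.frame(res)
}
# usage: logscan_sketch(t)   # assuming the additive() helper from logistic.R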
Multilocus Models
• Can test the effects of fitting two or more
markers simultaneously
• Several multilocus models are possible
• Interaction Model assumes that each
combination of genotypes has a different
effect
• eg t$y ~ t$m10 * t$m15
Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial')
> anova(f)
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                                 Df Deviance Resid. Df Resid. Dev
NULL                                              1017     343.71
genotype(t$m13)                   2   108.68      1015     235.03
genotype(t$m26)                   2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)   3     6.03      1010     227.86
> pchisq(6.03, 3, lower.tail=F)   # calculate p-value for the interaction term
[1] 0.1101597
Adding the effects of Sex and other
Covariates
• Read in sex and other covariate data, eg. age
from a file into variables, say a$sex, a$age
• Fit models of the form
– fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
– fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')
Adding the effects of Sex and other
Covariates
• Compare models using anova – test if the effect of
the marker m10 is significant after taking into
account sex and age
• anova(fit1,fit2)
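Since these are binomial fits, the comparison is again a change-in-deviance chi-squared test, e.g. (assuming fit1 and fit2 as defined above):

anova(fit2, fit1, test = "Chisq")   # is m10 significant after allowing for sex and age?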
Random Effects
Mixed Models
• So far our models have had fixed effects
– each parameter can take any value independently of the
others
– each parameter estimate uses up a degree of freedom
– models with large numbers of parameters have problems
• saturation – overfitting
• poor predictive properties
• numerically unstable and difficult to fit
• in some cases we can treat parameters as being sampled from a distribution – random effects
• only estimate the parameters needed to specify that distribution
• can result in a more robust model
Example of a Mixed Model
• Testing genetic association across a large number of families
– yi = trait value of i'th individual
– bi = genotype of individual i at the SNP of interest
– fi = family identifier (factor)
– ei = error, variance s2
– y = m + b + f + e
– If we treat family effects f as fixed then we must estimate a large number of parameters
– Better to think of these effects as having a distribution N(0,t2) for some variance t2 which must be estimated
• individuals from the same family have the same f and are correlated
– cov = t2
• individuals from different families are uncorrelated
– genotype parameters b still treated as fixed effects
• mixed model
Mixed Models
• y=m+b+f+e
• Fixed effects model
– E(y) = m + b + f
– Var(y) = Is2
• I is identity matrix
• Mixed model
– E(y) = m + b
– Var(y) = Is2 + Ft2
• F is a matrix, Fij = 1 if i, j in same family
• Need to estimate both s2 and t2
Generalised Least Squares
• y = Xb + e
• Var(y) = V (symmetric covariance matrix)
• V = Is2 (uncorrelated errors)
• V = Is2 + Ft2 (single grouping random effect)
• V = Is2 + Gt2 (G = genotype identity by state)
• GLS solution, if V is known:
  bhat = (X^T V^-1 X)^-1 X^T V^-1 y
• unbiased
• efficient (minimum variance)
• consistent (tends to true value in large samples)
• asymptotically normally distributed
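A small R sketch of this estimator for a known V (illustrative only; in practice the variance components inside V must themselves be estimated, e.g. by REML):

gls_beta <- function(X, y, V) {
  Vi <- solve(V)                                 # V^-1
  solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)    # (X^T V^-1 X)^-1 X^T V^-1 y
}

# toy usage: two families of two, V = I s2 + F t2 with s2 = 1, t2 = 0.5
Fam <- kronecker(diag(2), matrix(1, 2, 2))       # Fij = 1 if i, j in the same family
V   <- diag(4) + Fam * 0.5
X   <- cbind(1, c(0, 1, 1, 2))                   # intercept + additive genotype (made-up values)
y   <- c(0.2, 1.1, 0.9, 2.0)                     # made-up trait values
gls_beta(X, y, V)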
Multivariate Normal Distribution
• Joint distribution of a vector of correlated observations
• Another way of thinking about the data
• Estimation of parameters in a mixed model is a special case of likelihood analysis of an MVN
• y ~ MVN(m, V)
– m is vector of expected values
– V is variance-covariance matrix
– If V = Is2 then elements of y are uncorrelated and equivalent to n
independent Normal variables
– probability density/Likelihood is
  L(y | m, V) = (2π)^(-n/2) |V|^(-1/2) exp( -(1/2) (y - m)' V^-1 (y - m) )
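A direct transcription of this density on the log scale (a sketch; the mvtnorm package's dmvnorm(..., log=TRUE) computes the same quantity):

mvn_loglik <- function(y, m, V) {
  r <- y - m
  as.numeric(-0.5 * (length(y) * log(2 * pi) +
                     determinant(V, logarithm = TRUE)$modulus +
                     t(r) %*% solve(V, r)))
}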
Henderson’s Mixed Model Equations
• General linear mixed model
• b are fixed effects
• u are random effects
• X, Z are design matrices

y = Xb + Zu + e
u ~ MVN(0, G)    (G = I t2)
e ~ MVN(0, R)    (R = I s2)
y ~ MVN(Xb, V),  V = Z G Z^T + R

• The mixed model equations for the estimates bhat and predictions uhat:

[ X^T R^-1 X    X^T R^-1 Z        ] [ bhat ]   [ X^T R^-1 y ]
[ Z^T R^-1 X    Z^T R^-1 Z + G^-1 ] [ uhat ] = [ Z^T R^-1 y ]
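A toy numerical sketch of these equations, taking the variance components as known (software such as lmer() estimates them by REML); all data values here are made up:

X  <- cbind(1, c(0, 1, 1, 2))               # fixed effects: intercept + genotype
Z  <- kronecker(diag(2), matrix(1, 2, 1))   # random effects: 2 families of 2 individuals
y  <- c(0.2, 1.1, 0.9, 2.0)
s2 <- 1; t2 <- 0.5
R  <- diag(4) * s2                          # residual covariance
G  <- diag(2) * t2                          # family-effect covariance
Ri <- solve(R)
lhs <- rbind(cbind(t(X) %*% Ri %*% X, t(X) %*% Ri %*% Z),
             cbind(t(Z) %*% Ri %*% X, t(Z) %*% Ri %*% Z + solve(G)))
rhs <- c(t(X) %*% Ri %*% y, t(Z) %*% Ri %*% y)
solve(lhs, rhs)                             # first 2 entries = bhat, last 2 = uhat (BLUPs)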
Mixed Models
• Fixed effects models are special cases of mixed models
• Mixed models are sometimes more powerful (as they have fewer parameters)
– But check that the assumption that the effects are sampled from a Normal distribution is reasonable
– Differences between estimated parameters and statistical
significance in fixed vs mixed models
• Shrinkage: often random effect estimates from a mixed
model have smaller magnitudes than from a fixed effects
model.
Mixed Models in R
• Several R packages available, e.g.
– lme4: general purpose mixed models package
– emma: for genetic association in structured populations
• Formulas in lme4, eg test for association between genotype and trait
• H0: E(y) = m    vs    H1: E(y) = m + b
• Var(y) = Is2 + Ft2
– h0 = lmer(y ~ 1 + (1|family), data=data)
– h1 = lmer(y ~ genotype + (1|family), data=data)
– anova(h0, h1)
Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2) etc
– random intercept models
• y ~ fixed.effects + (1 + x|Group1)
– random slope (and intercept) model
Example: HDL Data
> library(lme4)
> b=read.delim("Biochemistry.txt")
> cc=complete.cases(b$Biochem.HDL)
> f0=lm(Biochem.HDL ~ Family, data=b, subset=cc)
> f1=lmer(Biochem.HDL ~ (1|Family), data=b, subset=cc)
> anova(f0)
Analysis of Variance Table
Response: Biochem.HDL
             Df Sum Sq Mean Sq F value    Pr(>F)
Family      175 147.11 0.84064  5.2098 < 2.2e-16 ***
Residuals  1681 271.24 0.16136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
Data: b
Subset: cc
   AIC  BIC logLik deviance REMLdev
  2161 2178  -1078     2149    2155
Random effects:
 Groups   Name        Variance Std.Dev.
 Family   (Intercept) 0.066633 0.25813
 Residual             0.161487 0.40185
Number of obs: 1857, groups: Family, 176
Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.60615    0.02263   70.97
> plot(resid(f0),resid(f1))
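To see the shrinkage mentioned earlier on this example, one option (a sketch reusing f0 and f1; the fixed-effect coefficients are relative to a baseline family while the BLUPs are centred on zero, so only their spreads are compared):

fixed <- coef(f0)[-1]                  # per-family effects from the fixed-effects fit
blup  <- unlist(ranef(f1)$Family)      # predicted family effects (BLUPs) from the mixed model
c(sd(fixed, na.rm = TRUE), sd(blup))   # the BLUPs typically show the smaller spread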