Download Logistic Regression using R - Wellcome Trust Centre for

Association Analysis,
Logistic Regression,
R and S-PLUS
Richard Mott
http://bioinformatics.well.ox.ac.uk/lectures/
Logistic Regression in Statistical
Genetics
• Applicable to Association Studies
• Data:
– Binary outcomes (eg disease status)
– Dependent on genotypes [+ sex, environment]
• Aim is to identify which factors influence the
outcome
• Rigorous tests of statistical significance
• Flexible modelling language
• Generalisation of Chi-Squared Test
What is R?
• Statistical analysis package
• Free
• Similar to commercial package S-PLUS
• Runs on Unix, Windows, Mac
• www.r-project.org
• Many packages for statistical genetics, microarray analysis available in R
• Easily Programmable
Modelling in R
• Data for individual labelled i=1…n:
– Response yi
– Genotypes gij for markers j=1..m
Coding Unphased Genotypes
• Several possibilities:
– AA, AG, GG original genotypes
– 11, 12, 22
– 1, 2, 3
– 0, 1, 2 # of G alleles
• Missing Data
– NA default in R
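As an illustration of the last coding, a minimal sketch that converts two-letter genotype strings into counts of the G allele. The vector `geno` and the helper `count_g` are hypothetical names used only for this example; they are not part of logistic.R.

```r
# Hypothetical example genotypes; NA marks a missing call.
geno <- c("AA", "AG", "GG", "GA", NA, "AG")

# Count the number of G alleles in each two-letter genotype string.
count_g <- function(g) {
  n <- nchar(g) - nchar(gsub("G", "", g, fixed = TRUE))
  n[is.na(g)] <- NA          # keep missing genotypes missing
  as.integer(n)
}

x <- count_g(geno)
print(x)
```

Missing genotypes stay NA, so R's modelling functions can drop or flag them in the usual way.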
Using R
• Load genetic logistic regression tools
• > source('logistic.R')
• Read data table from file
– > t <- read.table('geno.dat', header=TRUE)
• Column names
– names(t)
– t$y response (0,1)
– t$m1, t$m2, …. Genotypes for each marker
Contingency Tables in R
• ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31)
    11  12  22
0  515 387  75
1   28  11   2
>
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>
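The test can be reproduced without the original data file by entering the six counts from the contingency table directly. This is a sketch; the object names `tab` and `res` are chosen here for illustration.

```r
# Counts copied from the contingency table (rows: y = 0, 1; columns: genotype).
tab <- matrix(c(515, 387, 75,
                 28,  11,  2),
              nrow = 2, byrow = TRUE,
              dimnames = list(y = c("0", "1"), m31 = c("11", "12", "22")))

# The warning is suppressed here; it arises because one expected count is below 5.
res <- suppressWarnings(chisq.test(tab))
print(res$expected)    # expected counts under independence
print(res$statistic)   # Pearson X-squared
print(res$p.value)
```

The small expected count in the y = 1, genotype 22 cell is what triggers the "approximation may be incorrect" warning in the transcript.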
The Logistic Model
• Prob(Yi=0) = exp(hi)/(1+exp(hi))
• hi = Σj xij bj - Linear Predictor
• xij – Design Matrix (genotypes etc)
• bj – Model Parameters (to be estimated)
• Model is investigated by
– estimating the bj’s by maximum likelihood
– testing if the estimates are different from 0
The Logistic Function
Prob(Yi=0) = exp(hi)/(1+exp(hi))
[Figure: the logistic function, Prob(Y=0) plotted against h]
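The logistic function is easy to write directly in R, and base R also provides it as `plogis`. A short sketch (the name `logistic` is chosen here for illustration):

```r
# The logistic function as written on the slide.
logistic <- function(h) exp(h) / (1 + exp(h))

h <- seq(-6, 6, by = 0.5)
p <- logistic(h)

# Base R's plogis() computes the same quantity.
print(all.equal(p, plogis(h)))

# Uncomment to draw the S-shaped curve:
# plot(h, p, type = "l", xlab = "h", ylab = "probability")
```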
Types of genetic effect at a single locus
            AA  AG  GG
Recessive    0   0   1
Dominant     0   1   1
Additive     0   1   2
Genotype     0   a   b
Additive Genotype Model
• Code genotypes as
– AA x=0,
– AG x=1,
– GG x=2
• Linear Predictor
  h = b0 + xb1
• P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2 b1 = 2
PAA = 0.12 PAG = 0.50 PGG = 0.88
[Figure: Prob(Y=0) plotted against h for the additive model with b0 = -2, b1 = 2]
Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98
[Figure: Prob(Y=0) plotted against h for the additive model with b0 = 0, b1 = 2]
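The probabilities quoted for the two additive examples can be checked with `plogis`. A small sketch; the variable names are illustrative only.

```r
# Additive coding: number of G alleles per genotype.
x <- c(AA = 0, AG = 1, GG = 2)

# Example 1: b0 = -2, b1 = 2
p1 <- round(plogis(-2 + 2 * x), 2)
print(p1)

# Example 2: b0 = 0, b1 = 2
p2 <- round(plogis(0 + 2 * x), 2)
print(p2)
```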
Recessive Model
• Code genotypes as
– AA x=0,
– AG x=0,
– GG x=1
• Linear Predictor
  h = b0 + xb1
• P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88
[Figure: Prob(Y=0) plotted against h for the recessive model with b0 = 0, b1 = 2]
Genotype Model
• Each genotype has an independent probability
• Code genotypes as (for example)
– AA x=0, y=0
– AG x=1, y=0
– GG x=0, y=1
• Linear Predictor
  h = b0 + xb1 + yb2 (two parameters)
• P(Y=0|x,y) = exp(b0 + xb1 + yb2)/(1+exp(b0 + xb1 + yb2))
• PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0))
• PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1))
• PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27
[Figure: Prob(Y=0) plotted against h for the genotype model with b0 = 0, b1 = 2, b2 = -1]
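The recessive and genotype examples can be checked the same way by writing the codings as small design matrices. This is a sketch; `X_rec`, `X_gen` and the other names are chosen for this example.

```r
# Recessive coding: intercept column plus x (1 only for GG); b0 = 0, b1 = 2
X_rec <- rbind(AA = c(1, 0), AG = c(1, 0), GG = c(1, 1))
p_rec <- round(plogis(drop(X_rec %*% c(0, 2))), 2)
print(p_rec)

# Genotype coding: intercept plus two indicator columns; b0 = 0, b1 = 2, b2 = -1
X_gen <- rbind(AA = c(1, 0, 0), AG = c(1, 1, 0), GG = c(1, 0, 1))
p_gen <- round(plogis(drop(X_gen %*% c(0, 2, -1))), 2)
print(p_gen)
```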
Models in R
response y, genotype g
            AA  AG  GG  model             DF
Recessive    0   0   1  y ~ recessive(g)   1
Dominant     0   1   1  y ~ dominant(g)    1
Additive     0   1   2  y ~ additive(g)    1
Genotype     0   a   b  y ~ genotype(g)    2
Data Transformation
• g <- t$m1
• use these functions to treat a genotype vector in a certain way:
– a <- additive(g)
– r <- recessive(g)
– d <- dominant(g)
– g <- genotype(g)
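The source of logistic.R is not shown in these slides, so the following is only a guess at what such helper functions might look like. It assumes genotypes are coded 11, 12, 22 with allele 2 as the allele of interest; the real functions may behave differently.

```r
# Hypothetical re-implementations; the real logistic.R may differ.
# Assumes genotype codes 11, 12, 22, with "2" the allele of interest.
additive  <- function(g) unname(c("11" = 0, "12" = 1, "22" = 2)[as.character(g)])
recessive <- function(g) as.numeric(additive(g) == 2)   # 1 only for 22
dominant  <- function(g) as.numeric(additive(g) >= 1)   # 1 for 12 and 22
genotype  <- function(g) factor(as.character(g), levels = c("11", "12", "22"))

g <- c(11, 12, 22, NA, 12)    # made-up genotype vector with one missing value
print(additive(g))
print(recessive(g))
print(dominant(g))
print(genotype(g))
```

Returning a factor from `genotype` is what makes `glm` estimate one coefficient per non-reference genotype, which matches coefficient labels such as genotype(t$m31)12 in fitted output.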
Fitting the Model
• afit <- glm( t$y ~ additive(g), family='binomial')
• rfit <- glm( t$y ~ recessive(g), family='binomial')
• dfit <- glm( t$y ~ dominant(g), family='binomial')
• gfit <- glm( t$y ~ genotype(g), family='binomial')
• Equivalent models:
– genotype = dominant + recessive
– genotype = additive + recessive
– genotype = additive + dominant
– genotype ~ standard chi-squared test of genotype
association
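The equivalences listed above can be checked numerically: each combination spans the same three genotype means, so the fitted deviances should agree. The sketch below uses simulated data and plain R expressions in place of the logistic.R helpers; all names and parameter values are assumptions made for the example.

```r
set.seed(1)
n <- 500
x <- sample(0:2, n, replace = TRUE)        # copies of the allele of interest
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))    # simulated binary outcome

gfit  <- glm(y ~ factor(x),              family = "binomial")  # genotype
arfit <- glm(y ~ x + I(x == 2),          family = "binomial")  # additive + recessive
adfit <- glm(y ~ x + I(x >= 1),          family = "binomial")  # additive + dominant
drfit <- glm(y ~ I(x >= 1) + I(x == 2),  family = "binomial")  # dominant + recessive

devs <- c(genotype = deviance(gfit), add_rec = deviance(arfit),
          add_dom = deviance(adfit), dom_rec = deviance(drfit))
print(devs)
```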
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)          -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12    -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22    -0.7124     0.7423  -0.960   0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
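These estimates can be recovered from the m31 contingency table alone. The sketch below fits the same model to the grouped counts (the data frame `counts` is built here for illustration) and compares the coefficients with log odds computed by hand. Note that for a 0/1 response R's glm models the log odds of y = 1.

```r
# Counts of y = 1 and y = 0 per genotype, taken from the m31 contingency table.
counts <- data.frame(g     = factor(c("11", "12", "22")),
                     y1    = c(28, 11, 2),
                     y0    = c(515, 387, 75))

# Grouped binomial fit: gives the same coefficients as the individual-level fit.
fit <- glm(cbind(y1, y0) ~ g, family = "binomial", data = counts)
est <- coef(fit)
print(round(est, 4))

# The same numbers as differences of log odds:
by_hand <- c(log(28 / 515),
             log(11 / 387) - log(28 / 515),
             log(2 / 75)   - log(28 / 515))
print(round(by_hand, 4))
```

The intercept is the log odds for the reference genotype 11, and b1, b2 are log odds ratios for genotypes 12 and 22 relative to 11.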
Analysis of Deviance
Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                Df Deviance Resid. Df Resid. Dev
NULL                             1017     343.71
genotype(t$m31)  2     3.96      1015     339.76
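The deviance table can be reproduced by expanding the m31 counts into one row per individual, and a p-value obtained by referring the drop in deviance to a chi-squared distribution with 2 degrees of freedom. This is a sketch with illustrative variable names.

```r
# Rebuild individual-level data from the six cell counts of the m31 table.
y <- rep(c(0, 1, 0, 1, 0, 1), times = c(515, 28, 387, 11, 75, 2))
g <- factor(rep(c("11", "12", "22"), times = c(543, 398, 77)))

fit <- glm(y ~ g, family = "binomial")
print(anova(fit, test = "Chisq"))

# Drop in deviance due to genotype, and its chi-squared p-value on 2 df.
dev  <- fit$null.deviance - deviance(fit)
pval <- pchisq(dev, df = 2, lower.tail = FALSE)
print(c(deviance = dev, p = pval))
```

The result is close to, but not identical with, the Pearson chi-squared test on the same table, since the two statistics agree only asymptotically.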
Model Comparison
• Compare general model with additive,
dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
  Resid. Df Resid. Dev Df Deviance
1      1016     38.301
2      1015     38.124  1    0.177
>
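A sketch of the same comparison on simulated data. It passes family = "binomial" explicitly, since glm defaults to a gaussian family when none is given, and asks anova for the chi-squared p-value. The data and names are made up for the example.

```r
set.seed(2)
n <- 1000
x <- sample(0:2, n, replace = TRUE, prob = c(0.5, 0.4, 0.1))
y <- rbinom(n, 1, plogis(-2 + 0.6 * x))    # outcome simulated under an additive model

afit <- glm(y ~ x,         family = "binomial")   # additive: 1 parameter for the marker
gfit <- glm(y ~ factor(x), family = "binomial")   # genotype: 2 parameters

# The additive model is nested in the genotype model; the test has 1 df.
cmp <- anova(afit, gfit, test = "Chisq")
print(cmp)
```

A small deviance difference suggests the simpler additive model fits about as well as the general genotype model.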
Scanning all Markers
> logscan(t,model='additive')
    Deviance     DF Pval         LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503
…
…
…
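The implementation of logscan is not shown, so the following is a hypothetical stand-in that fits an additive model to each marker in turn and reports the same four columns. LogPval is taken to be -log10(Pval), which matches the values in the output above. Data and function names are assumptions.

```r
set.seed(3)
n <- 400
dat <- data.frame(m1 = sample(0:2, n, replace = TRUE),
                  m2 = sample(0:2, n, replace = TRUE),
                  m3 = sample(0:2, n, replace = TRUE))
dat$y <- rbinom(n, 1, plogis(-1 + 0.8 * dat$m2))   # simulated: only m2 has an effect

# Hypothetical scan: one additive logistic fit per marker.
scan_additive <- function(dat, markers) {
  out <- t(sapply(markers, function(m) {
    fit <- glm(dat$y ~ dat[[m]], family = "binomial")
    dev <- fit$null.deviance - deviance(fit)        # drop in deviance due to the marker
    p   <- pchisq(dev, df = 1, lower.tail = FALSE)
    c(Deviance = dev, DF = 1, Pval = p, LogPval = -log10(p))
  }))
  as.data.frame(out)
}

res <- scan_additive(dat, c("m1", "m2", "m3"))
print(res)
```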
Multilocus Models
• Can test the effects of fitting two or more
markers simultaneously
• Several multilocus models are possible
• Interaction Model assumes that each
combination of genotypes has a different
effect
• eg t$y ~ t$m10 * t$m15
Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial')
> anova(f)
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                                Df Deviance Resid. Df Resid. Dev
NULL                                             1017     343.71
genotype(t$m13)                  2   108.68      1015     235.03
genotype(t$m26)                  2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)  3     6.03      1010     227.86
> pchisq(6.03,2,lower.tail=F)   # calculate p-value
[1] 0.04904584
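The degrees of freedom passed to pchisq should match the Df column of the row being tested. The interaction row in the table is listed with 3 degrees of freedom (its residual Df falls from 1013 to 1010), whereas the call above uses 2. The sketch below evaluates both so they can be compared.

```r
# Upper-tail chi-squared probabilities for the interaction deviance of 6.03.
p_df2 <- pchisq(6.03, df = 2, lower.tail = FALSE)   # as in the call above
p_df3 <- pchisq(6.03, df = 3, lower.tail = FALSE)   # df taken from the table row
print(c(df2 = p_df2, df3 = p_df3))
```

With 3 degrees of freedom the p-value is larger, so the evidence for interaction is weaker than the 2-df calculation suggests.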
Adding the effects of Sex and other
Covariates
• Read in sex and other covariate data (e.g. age) from a file into
variables, say a$sex, a$age
• Fit models of the form
• fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
• fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')
• Compare models using anova – test if the effect
of the marker m10 is significant after taking into
account sex and age
• anova(fit1,fit2)
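A sketch of this comparison on simulated data, with the smaller model listed first and a chi-squared test requested so that anova prints a p-value. The covariates, effect sizes and names are invented for the example.

```r
set.seed(4)
n   <- 600
x   <- sample(0:2, n, replace = TRUE)                  # additive marker coding
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
age <- round(runif(n, 20, 70))
y   <- rbinom(n, 1, plogis(-3 + 0.5 * x + 0.03 * age)) # simulated outcome

fit1 <- glm(y ~ x + sex + age, family = "binomial")    # marker + covariates
fit2 <- glm(y ~ sex + age,     family = "binomial")    # covariates only

# Tests the marker effect after adjusting for sex and age (1 df).
cmp <- anova(fit2, fit1, test = "Chisq")
print(cmp)
```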
Multiple Testing
• Take care interpreting significance levels when
performing multiple tests
• Linkage disequilibrium can reduce the effective number
of independent tests
• Permutation is a safe procedure to determine
significance
• Repeat j=1..N times:
– Permute disease status y between individuals
– Fit all markers
– Record maximum deviance maxdev[j] over all markers
• Permutation p-value for a marker is the proportion of
times the permuted maximum deviance across all
markers exceeds the observed deviance for the marker
– logscan(t,permute=1000) slow!
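The permutation procedure can be sketched in a few lines. This is not the logscan implementation, which is not shown; it is a minimal version on simulated data with additive fits, a small number of permutations, and "exceeds" implemented as greater than or equal. All names are assumptions.

```r
set.seed(5)
n <- 300
geno <- data.frame(m1 = sample(0:2, n, replace = TRUE),
                   m2 = sample(0:2, n, replace = TRUE),
                   m3 = sample(0:2, n, replace = TRUE))
y <- rbinom(n, 1, plogis(-1 + 0.7 * geno$m2))   # simulated disease status

# Drop in deviance for one marker under an additive logistic model.
marker_deviance <- function(y, x) {
  fit <- glm(y ~ x, family = "binomial")
  fit$null.deviance - deviance(fit)
}
all_deviances <- function(y, geno) sapply(geno, function(x) marker_deviance(y, x))

observed <- all_deviances(y, geno)

# Permute y between individuals, refit all markers, record the maximum deviance.
N <- 200
maxdev <- replicate(N, max(all_deviances(sample(y), geno)))

# Permutation p-value per marker: share of permuted maxima at least as large.
perm_p <- sapply(observed, function(d) mean(maxdev >= d))
print(perm_p)
```

Because every permutation refits all markers, the cost grows with both the number of markers and N, which is why the full scan is slow.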
Haplotype Association
• Haplotype Association
– Different from multiple genotype models
– Phase taken into account
– Haplotype association can be modelled in a similar logistic
framework
• Treat haplotypes as extended alleles
• Fit additive, recessive, dominant & genotype models as
before
– Eg haplotypes are h = AAGCAT, ATGCTT, etc
– y ~ additive(h)
– y ~ dominant(h) etc
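One way to read "haplotypes as extended alleles" is to count, for each individual, the copies of a chosen haplotype carried on the two phased chromosomes, and then apply the same codings as for a single marker. The sketch below assumes phased data held in two vectors, `hap1` and `hap2`; these names and the example haplotypes are illustrative and do not describe how logistic.R handles haplotypes.

```r
# Hypothetical phased data: two haplotypes per individual.
hap1 <- c("AAGCAT", "ATGCTT", "AAGCAT", "ATGCTT")
hap2 <- c("AAGCAT", "AAGCAT", "ATGCTT", "ATGCTT")

target <- "ATGCTT"   # haplotype treated as the allele of interest

copies      <- (hap1 == target) + (hap2 == target)   # additive coding: 0, 1 or 2
dominant_h  <- as.numeric(copies >= 1)               # at least one copy
recessive_h <- as.numeric(copies == 2)               # two copies
print(rbind(copies, dominant_h, recessive_h))
```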