Modifying the Schwarz Bayesian Information
Criterion to locate multiple interacting
Quantitative Trait Loci
1. M.Bogdan, J.K.Ghosh and R.W.Doerge,
Genetics 2004 167: 989-999.
2. M.Bogdan and R.W.Doerge “Mapping multiple interacting
QTL by multidimensional genome searches”
Xia - genotype of i-th individual at locus a
Xia = 1/2 - individual is heterozygous at locus a
Xia = -1/2 - individual is homozygous at locus a
For dab = 10 cM: ρ(Xia, Xib) = 0.81
Data for QTL mapping
Y1,...,Yn - vector of trait values for n backcross individuals
X=[Xij], 1 ≤ i ≤ n, 1 ≤ j ≤ m - genotypes of m markers
Standard methods of QTL mapping
One QTL model
$$Y_i = \mu + \beta Q_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \tag{1}$$
$Q_i \in \{-1/2, 1/2\}$ - QTL genotype
1. Search over markers - fit model (1) at each
marker and choose markers for which the likelihood
exceeds a preestablished threshold value as candidate
QTL locations.
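The marker search above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `single_marker_scan`, the simulated data, and the use of independent markers are assumptions made for the example. For a Gaussian regression, twice the log-likelihood ratio of model (1) against the null model equals n log(RSS0/RSS1), which is what the function returns for every marker.

```python
import numpy as np

def single_marker_scan(y, X):
    """Return 2 * log-likelihood ratio of model (1) vs. the null model at each marker."""
    n = len(y)
    yc = y - y.mean()
    rss0 = np.sum(yc ** 2)                  # null model: intercept only
    Xc = X - X.mean(axis=0)                 # centred marker codes
    sxy = Xc.T @ yc
    sxx = np.sum(Xc ** 2, axis=0)
    rss1 = rss0 - sxy ** 2 / sxx            # RSS after regressing on one marker
    return n * np.log(rss0 / rss1)

# Illustrative data (assumption): 200 individuals, 50 independent markers,
# one QTL located exactly at marker 10.
rng = np.random.default_rng(0)
n, m = 200, 50
X = rng.choice([-0.5, 0.5], size=(n, m))
y = 1.0 * X[:, 10] + rng.normal(size=n)

lr = single_marker_scan(y, X)
best = int(np.argmax(lr))
print("marker with the largest statistic:", best)
```

Markers whose statistic exceeds the chosen threshold would be reported as candidate QTL locations; the threshold itself is not specified here.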
Interval mapping
Lander and Botstein (1989)
• Consider a fixed position between markers
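$I_i$ - state of flanking markers
$$I_i \in \left\{ \left(-\tfrac12, -\tfrac12\right), \left(-\tfrac12, \tfrac12\right), \left(\tfrac12, -\tfrac12\right), \left(\tfrac12, \tfrac12\right) \right\}$$
$p_i = P(Q_i = \tfrac12 \mid I_i)$ - easy to compute
$$Y_i = \mu + \beta Q_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
$$f(Y_i \mid I_i) = p_i\, N\!\left(\mu + \tfrac12 \beta, \sigma^2\right) + (1 - p_i)\, N\!\left(\mu - \tfrac12 \beta, \sigma^2\right)$$
$$L(Y \mid I) = \prod_{i=1}^{n} f(Y_i \mid I_i)$$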
1. Estimate μ, β, and σ by EM algorithm and compute
the corresponding likelihood.
2. Repeat this procedure for a new possible QTL
location.
3. Plot the resulting likelihoods as the function of
assumed QTL position.
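Step 1 of the procedure above can be sketched as follows. This is a hedged illustration rather than the original implementation: the function name `em_interval`, the starting values, and the simulated probabilities p_i are assumptions. The probabilities p_i = P(Q_i = 1/2 | I_i) are taken as known inputs, so EM only updates μ, β and σ in the two-component mixture.

```python
import numpy as np

def _components(y, p, mu, beta, s2):
    """Unnormalised mixture components for Q_i = 1/2 and Q_i = -1/2."""
    hi = p * np.exp(-(y - mu - beta / 2) ** 2 / (2 * s2))
    lo = (1 - p) * np.exp(-(y - mu + beta / 2) ** 2 / (2 * s2))
    return hi, lo

def em_interval(y, p, n_iter=500, tol=1e-9):
    """EM for f(y_i) = p_i N(mu + beta/2, s2) + (1 - p_i) N(mu - beta/2, s2).

    Returns (mu, beta, sigma, loglik) at the final parameter values.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = len(y)
    mu, beta, s2 = y.mean(), 0.0, y.var()   # starting values (assumption)
    prev = -np.inf
    for _ in range(n_iter):
        hi, lo = _components(y, p, mu, beta, s2)
        loglik = np.sum(np.log(hi + lo)) - 0.5 * n * np.log(2 * np.pi * s2)
        if loglik - prev < tol:
            break
        prev = loglik
        w = hi / (hi + lo)                  # E-step: P(Q_i = 1/2 | Y_i, I_i)
        a = np.sum(w * y) / np.sum(w)       # M-step: mean of the Q = 1/2 group
        b = np.sum((1 - w) * y) / np.sum(1 - w)
        s2 = np.sum(w * (y - a) ** 2 + (1 - w) * (y - b) ** 2) / n
        mu, beta = (a + b) / 2, a - b
    hi, lo = _components(y, p, mu, beta, s2)
    loglik = np.sum(np.log(hi + lo)) - 0.5 * n * np.log(2 * np.pi * s2)
    return mu, beta, np.sqrt(s2), loglik

# Illustrative data (assumption): p_i takes a few arbitrary values.
rng = np.random.default_rng(1)
n = 500
p = rng.choice([0.05, 0.5, 0.95], size=n)
Q = np.where(rng.random(n) < p, 0.5, -0.5)
y = 1.0 + 2.0 * Q + rng.normal(scale=0.5, size=n)

mu_hat, beta_hat, sigma_hat, loglik = em_interval(y, p)
print(mu_hat, beta_hat, sigma_hat, loglik)
```

Repeating the call with the p_i recomputed for each assumed QTL position, and plotting the returned log-likelihood against position, gives the profile described in steps 2 and 3.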
• Problems with interval mapping
a) Not able to distinguish closely linked QTL
b) Not able to detect epistatic QTL (involved only in
interactions)
• Solution
Estimate the location of several QTL at once using
multiple regression model (Kao et al. 1999)
$$Y_i = \mu + \sum_{j=1}^{p} \beta_j Q_{ij} + \sum_{1 \le j < l \le m}^{r} \gamma_{jl} Q_{ij} Q_{il} + \varepsilon_i$$
Problem : estimation of the number
of additive and interaction terms
$$Y_i = \mu + \sum_{j=1}^{p} \beta_j X_{i h_j} + \sum_{j=1}^{r} \gamma_j X_{i k_j} X_{i u_j} + \varepsilon_i$$
Xij - genotype of j-th marker
average number of markers - (200,400)
Bayesian Information Criterion
• Choose the model which maximizes
log L -1/2 k log n
L – likelihood of the data for a given model
k – number of parameters in the model
n – sample size
Broman (1997) and Broman and Speed (2002) –
BIC overestimates QTL number
How to modify BIC ?
Mi – i-th linear model (specifies which markers
are included in regression)
θ = (μ, β1,..., βp, γ1,..., γr, σ) – vector of parameters
for Mi
fi(θ) – density of the prior distribution for θ
π(i) – prior probability of Mi
L(Y|θ) – likelihood of the data given the vector
of parameters θ
mi(Y) – likelihood of the data given the model Mi
$$m_i(Y) = \int L(Y \mid \theta)\, f_i(\theta)\, d\theta$$
$$P(M_i \mid Y) \propto \pi(i)\, m_i(Y)$$
BIC neglects π(i) and uses asymptotic
approximation
$$\log m_i(Y) \approx \log L(Y \mid \hat\theta) - \tfrac12 (p + r + 2) \log n$$
neglecting π(i) = assigning the same prior probability
to all models = assigning high prior probability to the
event that there are many regressors
Example : 200 markers
200 models with one additive term
$\binom{200}{2} = 19\,900$ models with one interaction
or with two additive terms
$\binom{200}{100} = 9.05 \times 10^{58}$ models with 100 additive terms
Idea: supplement BIC with a more realistic prior
distribution π
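$$\tilde S(i) = \log \pi(i) + \log L(Y \mid \hat\theta) - \tfrac12 \big(p(i) + r(i)\big) \log n$$
$$\log L(Y \mid \hat\theta) = -\tfrac{n}{2} \log RSS + C(n)$$
RSS - residual sum of squares from regression
$$S(i) = n \log RSS + \big(p(i) + r(i)\big) \log n - 2 \log \pi(i)$$
Since $S(i) = -2 \tilde S(i) + 2 C(n)$, maximizing $\tilde S(i)$ is the same as minimizing $S(i)$.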
Choice of π (George and McCulloch, 1993)
M – number of markers
$N = \frac{M(M-1)}{2}$ - number of potential interactions
α - the probability that i-th additive term
appears in the model
ν - the probability that j-th interaction term
appears in the model
M- model with p additive terms and r interactions
$\pi(M) = \alpha^{p}\, \nu^{r}\, (1-\alpha)^{M-p}\, (1-\nu)^{N-r}$
Prior distribution on the number of additive terms, p –
Binomial (M,α)
Prior distribution on the number of interactions, r –
Binomial (N,ν)
We choose
$$\alpha = \frac{1}{l} \quad \text{and} \quad \nu = \frac{1}{u}$$
$$\log \pi(M) = C(M, N, l, u) - p \log(l - 1) - r \log(u - 1)$$
$$S(i) = n \log RSS + (p + r) \log n + 2p \log(l - 1) + 2r \log(u - 1)$$
$$E(p) = \frac{M}{l}, \qquad E(r) = \frac{N}{u}$$
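The expression for log π(M) follows from factoring the prior. A short derivation is given below; the explicit form of the constant C is worked out here and is not stated in the slides.

```latex
\begin{aligned}
\pi &= \alpha^{p}\,\nu^{r}\,(1-\alpha)^{M-p}\,(1-\nu)^{N-r}
     = (1-\alpha)^{M}\,(1-\nu)^{N}
       \left(\frac{\alpha}{1-\alpha}\right)^{p}
       \left(\frac{\nu}{1-\nu}\right)^{r}, \\[4pt]
\frac{\alpha}{1-\alpha} &= \frac{1/l}{1-1/l} = \frac{1}{l-1},
\qquad
\frac{\nu}{1-\nu} = \frac{1/u}{1-1/u} = \frac{1}{u-1}, \\[4pt]
\log \pi &= \underbrace{M \log\!\left(1-\tfrac{1}{l}\right)
          + N \log\!\left(1-\tfrac{1}{u}\right)}_{C(M,N,l,u)}
          \;-\; p \log(l-1) \;-\; r \log(u-1).
\end{aligned}
```

Because C does not depend on p or r, it is identical for every model and can be dropped from the criterion, leaving the extra penalty 2p log(l-1) + 2r log(u-1).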
Choice of l and u should depend on the prior
knowledge on the number of QTL.
Our choice: for sample size n = 200, the probability of wrongly
detecting a QTL (when there are none) is ≈ 0.05.
We keep E(p) and E(r) equal to 2.2.
The choice is supported by a theoretical bound on the type I
error based on the Bonferroni inequality.
$$S(i) = n \log RSS + (p + r) \log n + 2p \log(M/2.2) + 2r \log(N/2.2)$$
Additional penalty similar to Risk Inflation Criterion
of Foster and George (2k log t , where t is the total
number of available regressors) and to the
modification of BIC proposed by Siegmund (2004).
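A minimal sketch of how the criterion S(i) with E(p) = E(r) = 2.2 could be evaluated for a given set of additive and interaction terms. The function name `mbic`, its arguments, and the simulated data are assumptions for illustration; RSS comes from an ordinary least-squares fit, and smaller values of S indicate a preferred model.

```python
import numpy as np

def mbic(y, X, additive, interactions, expected=2.2):
    """S = n log RSS + (p + r) log n + 2p log(M/expected) + 2r log(N/expected)."""
    n, M = X.shape
    N = M * (M - 1) // 2                       # number of potential interactions
    cols = [np.ones(n)]                        # intercept
    cols += [X[:, j] for j in additive]
    cols += [X[:, j] * X[:, k] for j, k in interactions]
    D = np.column_stack(cols)
    coef = np.linalg.lstsq(D, y, rcond=None)[0]
    resid = y - D @ coef
    rss = float(resid @ resid)
    p, r = len(additive), len(interactions)
    return (n * np.log(rss) + (p + r) * np.log(n)
            + 2 * p * np.log(M / expected) + 2 * r * np.log(N / expected))

# Illustrative data (assumption): one additive QTL at marker 3 and one
# interaction between markers 7 and 20, with independent markers.
rng = np.random.default_rng(2)
n, M = 200, 50
X = rng.choice([-0.5, 0.5], size=(n, M))
y = 1.2 * X[:, 3] + 3.0 * X[:, 7] * X[:, 20] + rng.normal(size=n)

s_null = mbic(y, X, [], [])
s_add = mbic(y, X, [3], [])
s_full = mbic(y, X, [3], [(7, 20)])
print(s_null, s_add, s_full)
```

In practice the criterion would be combined with a search strategy over candidate models (for example forward selection); the search itself is outside the scope of this sketch.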
Search over 12 chromosomes
markers spaced every 10 cM
n    h2     p  corr  extr  r  corr  extr
200  0      0  0.95  0.03  0  -     0.02
500  0      0  0.99  0.01  0  -     0
200  0.2    1  1     0.03  0  0     0.02
200  0.195  0  -     0.01  1  0.95  0.04
n    h2    p   corr  extr  r  corr  extr
200  0.55  0   -     0.02  3  2.88  0.08
200  0.5   7   5.06  0.26  0  -     0.09
500  0.5   7   6.99  0.14  0  -     0.03
200  0.43  12  2.39  0.31  0  -     0.03
500  0.43  12  9.68  0.47  0  -     0.02
200  0.71  12  9.53  0.75  0  -     0.02
200  0.53  2   1.95  0.04  5  2.11  0.11
500  0.53  2   2     0.03  5  3.47  0.08
• The criterion adjusts well to the number of
available markers
• For n = 200 the criterion detects almost all
additive QTL with individual h2 =0.13 and
interactions with h2 =0.2.
• For n = 500 the criterion detects almost all
additive QTL with individual h2 =0.06 and
interactions with h2 =0.12.
Bound for the type I error
$S_1$ - the maximum of the criterion over
all one dimensional models
$S_0 = \log L_0(Y \mid \hat\mu, \hat\sigma)$ - the value of the criterion
for the null model
D - the number of terms chosen by our criterion
$$P(D > 0) \approx P(S_1 > S_0)$$
$S_{M_i}$ - the value of the criterion for a
given one dimensional model
$S_{M_i} > S_0$ if
$$2 \log \frac{L(Y \mid \hat\theta_{M_i})}{L_0(Y \mid \hat\theta_0)} > \log n + 2\,\big(\log(l-1) \text{ or } \log(u-1)\big)$$
$$P(S_{M_i} > S_0) \approx 2\, P\!\left(Z > \sqrt{\log n + 2\,\big(\log(l-1) \text{ or } \log(u-1)\big)}\right)$$
where $Z \sim N(0, 1)$
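By the Bonferroni inequality and the bound
$$P(Z > x) \le \frac{1}{\sqrt{2\pi}\, x} \exp\!\left(-\frac{x^2}{2}\right)$$
$$P(S_1 > S_0) \le \frac{2M}{(l-1)\, C_1(l, n)} + \frac{2N}{(u-1)\, C_2(u, n)}$$
For $l = \frac{M}{2.2}$, $u = \frac{N}{2.2}$:
$$P(S_1 > S_0) \lesssim \frac{4.4}{\sqrt{2\pi n}} \left( \frac{1}{\sqrt{\log n + 2 \log(l-1)}} + \frac{1}{\sqrt{\log n + 2 \log(u-1)}} \right)$$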
For n=200 and typical values of M this yields values
in the range between 0.057 and 0.08.
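The final expression of the bound is easy to evaluate numerically. The sketch below assumes l = M/2.2, u = N/2.2 and N = M(M-1)/2 as in the slides; the function name `type1_bound` and the chosen values of M are illustrative.

```python
import math

def type1_bound(n, M, expected=2.2):
    """Evaluate 2*expected / sqrt(2*pi*n) * (1/sqrt(a) + 1/sqrt(b)),
    with a = log n + 2 log(l - 1) and b = log n + 2 log(u - 1)."""
    N = M * (M - 1) / 2                 # number of potential interactions
    l, u = M / expected, N / expected
    a = math.log(n) + 2 * math.log(l - 1)
    b = math.log(n) + 2 * math.log(u - 1)
    return 2 * expected / math.sqrt(2 * math.pi * n) * (1 / math.sqrt(a) + 1 / math.sqrt(b))

# A few marker counts for n = 200 (the values of M are assumptions).
for M in (10, 132, 200):
    print(M, round(type1_bound(200, M), 3))
```

The bound decreases slowly as M grows, since the penalty increases with the number of available markers.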