Methods S1. Simulation of genotype-phenotype data
Simulation of genotypic data
We simulate a latent vector of size $m$, $Z = (Z_1, \ldots, Z_m)$, from a multivariate
normal distribution with zero mean and an exchangeable correlation structure,
i.e. all off-diagonal elements of the correlation matrix share the same positive value
$0 < \rho < 1$. We then simulate the RAFs of the $m$ SNPs, denoted $p = (p_1, \ldots, p_m)$, from a
uniform distribution $U(0.01, 0.99)$. The $i$th entry of a genotype vector $G = (G_1, \ldots, G_m)$ is
obtained by discretizing the $i$th element of the simulated latent vector. The discretization
uses thresholds that are consistent with the assumption of Hardy-Weinberg Equilibrium in the population:
0

Gi  1
2

if  ( Z i )  (1  pi ) 2
if (1  pi ) 2   ( Z i )  1  pi2
if  ( Z i )  1  pi2 ,
where  is the cumulative distribution function of the standard normal distribution.
While in real data the relationship between genotypes might not follow a multivariate
normal model, this method i) captures the essential feature of co-occurring alleles and
ii) can simulate all possible haplotypes in a block, unlike the small number of
haplotypes available in the small publicly available data sets.
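The genotype simulation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `simulate_genotypes`, its arguments and the seed are hypothetical. It also swaps in a one-factor construction for the latent vector (a shared normal term plus independent noise), which yields an exchangeable correlation for any common correlation value between 0 and 1 without building the correlation matrix explicitly.

```python
import random
from statistics import NormalDist


def simulate_genotypes(n_subjects, m, rho, seed=0):
    """Return (genotypes, p).

    genotypes: list of n_subjects lists, each with m entries in {0, 1, 2}.
    p: list of m allele frequencies drawn from U(0.01, 0.99).
    rho: common pairwise correlation of the latent vector (0 <= rho <= 1).
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal CDF
    # Allele frequencies are drawn once per SNP and shared by all subjects.
    p = [rng.uniform(0.01, 0.99) for _ in range(m)]
    genotypes = []
    for _ in range(n_subjects):
        # One-factor construction (an assumption of this sketch):
        # Z_i = sqrt(rho) * W + sqrt(1 - rho) * E_i has unit variance and
        # pairwise correlation rho between any two entries.
        shared = rng.gauss(0.0, 1.0)
        row = []
        for p_i in p:
            z_i = rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            u = phi(z_i)
            # Hardy-Weinberg thresholds: P(G=0) = (1-p)^2, P(G=2) = p^2.
            if u <= (1.0 - p_i) ** 2:
                row.append(0)
            elif u <= 1.0 - p_i ** 2:
                row.append(1)
            else:
                row.append(2)
        genotypes.append(row)
    return genotypes, p
```

Because the transformed latent value is marginally uniform on (0, 1), the three intervals have probabilities $(1 - p_i)^2$, $2 p_i (1 - p_i)$ and $p_i^2$, so the average of a simulated genotype column divided by two should be close to the corresponding allele frequency.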
Simulation of phenotypic data under single causal variant scenario
The phenotype of a subject is determined based on the penetrance(s) for the
genotype(s) at the causal variant(s), as derived from the mode of inheritance and
frequency.
For a model containing a single causal variant associated with disease,
the phenotype, $Y$, of a subject is determined as
$$
Y =
\begin{cases}
1 & \text{with probability } \Pr(D \mid G_d) \\
0 & \text{with probability } 1 - \Pr(D \mid G_d),
\end{cases}
$$
where $G_d$ is the genotype of the causal variant and $\Pr(D \mid G_d)$ is the penetrance function.
$\Pr(D \mid G_d)$ is defined as
$$
\Pr(D \mid G_d) =
\begin{cases}
f_0 & \text{if } G_d = 0 \\
R_1 f_0 & \text{if } G_d = 1 \\
R_2 f_0 & \text{if } G_d = 2,
\end{cases}
$$
where $f_0$ is the baseline penetrance and $R_1$ and $R_2$ are the relative risks of
heterozygote and risk allele homozygote compared with non-risk allele homozygote,
respectively. $f_0$ is not an independent quantity and can be calculated as a function of
other parameters as:
$$
f_0 = \frac{K}{p_d^2 R_2 + 2 p_d (1 - p_d) R_1 + (1 - p_d)^2},
$$
with $K = \Pr(D)$ being the prevalence of the disease and $p_d$ being the causal allele
frequency. Pairs of $R_1$ and $R_2$ used in the single causal variant scenario under Experiments I
and II are available in Tables S2 and 1, respectively.
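As an illustration of the single causal variant model, the sketch below computes the baseline penetrance from the prevalence, the causal allele frequency and the two relative risks, and then draws a binary phenotype. The function names and the numerical values in the usage note are hypothetical and are not taken from the tables of the paper. The baseline formula follows from requiring that the penetrances, weighted by the Hardy-Weinberg genotype frequencies, average to the prevalence.

```python
import random


def baseline_penetrance(K, p_d, R1, R2):
    """Baseline penetrance f0 implied by prevalence K under Hardy-Weinberg."""
    q = 1.0 - p_d
    return K / (p_d ** 2 * R2 + 2.0 * p_d * q * R1 + q ** 2)


def penetrance(g_d, f0, R1, R2):
    """Pr(D | G_d) for a causal genotype g_d in {0, 1, 2}."""
    relative_risk = (1.0, R1, R2)[g_d]
    return relative_risk * f0


def simulate_phenotype(g_d, f0, R1, R2, rng):
    """Draw Y = 1 with probability Pr(D | G_d), otherwise Y = 0."""
    return 1 if rng.random() < penetrance(g_d, f0, R1, R2) else 0
```

For example, with the illustrative values `K = 0.05`, `p_d = 0.2`, `R1 = 1.5` and `R2 = 2.25`, calling `baseline_penetrance(0.05, 0.2, 1.5, 2.25)` gives the baseline, and `simulate_phenotype(1, f0, 1.5, 2.25, random.Random(0))` draws one phenotype for a heterozygote. The sketch does not check that the resulting penetrances stay below 1; that depends on the chosen parameters.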
Simulation of phenotypic data under $k$ causal variant scenario ($k > 1$)
For the $k$ causal variant scenario ($k > 1$), $Y$ is determined as
$$
Y =
\begin{cases}
1 & \text{with probability } \Pr(D \mid G_{d1}, \ldots, G_{dk}) \\
0 & \text{with probability } 1 - \Pr(D \mid G_{d1}, \ldots, G_{dk}),
\end{cases}
$$
where $\Pr(D \mid G_{d1}, \ldots, G_{dk})$ is the penetrance and $G_{d1}, \ldots, G_{dk}$ are the genotypes of the $k$
causal variants. For each mode of inheritance (additive, dominant and recessive),
$\Pr(D \mid G_{d1}, \ldots, G_{dk})$ is defined as
$$
\Pr(D \mid G_{d1}, \ldots, G_{dk}) =
\begin{cases}
f_0 + \sum_{i=1}^{k} \beta \, G_{di} & \text{for additive mode of inheritance} \\
f_0 + \sum_{i=1}^{k} \beta \, I(G_{di} \ge 1) & \text{for dominant mode of inheritance} \\
f_0 + \sum_{i=1}^{k} \beta \, I(G_{di} = 2) & \text{for recessive mode of inheritance,}
\end{cases}
$$
where $\beta > 0$ is the effect size of any causal allele, i.e. effect sizes of the alleles at the $k$
causal variants are assumed to be identical. Effect sizes ($\beta$) used in Experiments I and II
are given in Tables S3 and 2, respectively. $f_0$ is the baseline penetrance and, for
simplicity, in our simulations we assumed that $f_0 = K / k$.
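The multi-variant penetrance can be sketched in the same way. The code below assumes that the common effect size (called `beta` here, a name chosen for this sketch) is added to the baseline once per risk allele in the additive mode, once per causal variant carrying at least one risk allele in the dominant mode, and once per causal variant homozygous for the risk allele in the recessive mode. The function names and the `mode` strings are hypothetical.

```python
import random


def penetrance_multi(causal_genotypes, f0, beta, mode):
    """Pr(D | G_d1, ..., G_dk) for a list of k genotypes in {0, 1, 2}."""
    if mode == "additive":
        # Each risk allele contributes beta.
        score = sum(causal_genotypes)
    elif mode == "dominant":
        # Each variant with at least one risk allele contributes beta.
        score = sum(1 for g in causal_genotypes if g >= 1)
    elif mode == "recessive":
        # Each variant homozygous for the risk allele contributes beta.
        score = sum(1 for g in causal_genotypes if g == 2)
    else:
        raise ValueError("mode must be 'additive', 'dominant' or 'recessive'")
    return f0 + beta * score


def simulate_phenotype_multi(causal_genotypes, K, beta, mode, rng):
    """Draw Y in {0, 1} using the simplifying choice f0 = K / k."""
    f0 = K / len(causal_genotypes)
    pen = penetrance_multi(causal_genotypes, f0, beta, mode)
    return 1 if rng.random() < pen else 0
```

A call such as `simulate_phenotype_multi([0, 1, 2], 0.05, 0.02, "additive", random.Random(0))` draws one phenotype for a subject with three causal variants; the prevalence and effect size in this call are illustrative values only.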