Quantitative Trait Loci, QTL
An introduction to quantitative genetics
and common methods for mapping of
loci underlying continuous traits:
Why study quantitative traits?
• Many (most) human traits/disorders are complex
in the sense that they are governed by several
genetic loci as well as being influenced by
environmental agents;
• Many of these traits are intrinsically continuously
varying and need specialized statistical
models/methods for the localization and
estimation of genetic contributions;
• In addition, in several cases there are potential
benefits from studying continuously varying
quantities as opposed to a binary
affected/unaffected response:
For example:
• in a study of risk factors, the underlying
quantitative phenotypes that predispose to disease
may be more etiologically homogeneous than the
disease phenotype itself;
• some qualitative phenotypes occur once a
threshold for susceptibility has been exceeded,
e.g. type 2 diabetes, obesity, etc.;
• in such a case the binary phenotype
(affected/unaffected) is not as informative as the
actual phenotypic measurements;
A pedigree representation
Variance and variability
• methods for linkage analysis of QTL in humans rely
on a partitioning of the total variability of trait values;
• in statistical theory, the variance is the expected
squared deviation around the mean value, $\mu_Y = E(Y)$:
$$V(Y) = E[(Y - \mu_Y)^2];$$
• it can be estimated from data as:
$$s^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2;$$
• the square root of the variance is called the standard
deviation;
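As a small illustration of the estimator above, the following Python sketch computes the variance with the divisor n used on the slide. The function name `sample_variance` and the data are made up for the example; many texts use the divisor n - 1 instead.

```python
from math import sqrt

def sample_variance(values):
    """Mean squared deviation around the sample mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / n

# Hypothetical trait measurements, for illustration only.
trait_values = [1.0, 2.0, 3.0, 4.0]
s2 = sample_variance(trait_values)
print(s2, sqrt(s2))  # variance and standard deviation
```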
A simple model for the phenotype
Y=X+e
where
• Y is the phenotypic value, i.e. the trait value;
• X is the genotypic value, i.e. the mean or
expected phenotypic value given the genotype;
• e is the environmental deviation with mean 0.
• We assume that the total phenotypic variance is
the sum of the genotypic variance and the
environmental variance, V (Y ) = V (X ) + V (e),
i.e. the environmental contribution is assumed
independent of the genotype of the individual;
Distribution of Y : a single biallelic locus
A single biallelic locus: genetic effects
Genotype          A1A1    A1A2        A2A2
Genotypic value   0       (1 + k)a    2a
• a is the homozygous effect,
• k is the dominance coefficient,
• k = 0 means complete additivity,
• k = 1 means complete dominance (of A2),
• k > 1 if A2 is overdominant.
Example: The pygmy gene, pg
• From data we have the following mean
values of weight:
$X_{++} = 14$ g, $X_{+pg} = 12$ g, $X_{pg\,pg} = 6$ g;
• $2a = 14 - 6 = 8$ implies $a = 4$;
• $(1 + k)a = 12 - 6 = 6$ implies $k = 0.5$.
The data suggest that the pygmy gene is recessive,
although not completely.
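The arithmetic of the pygmy example can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `homozygous_effect_and_dominance` is hypothetical, and it assumes the parameterisation in which the three genotypic means are baseline, baseline + (1 + k)a and baseline + 2a (here the baseline is the pg/pg mean of 6 g).

```python
def homozygous_effect_and_dominance(x11, x12, x22):
    """Return (a, k) from the genotypic means of A1A1, A1A2 and A2A2.

    Assumes X12 - X11 = (1 + k) * a and X22 - X11 = 2 * a.
    """
    a = (x22 - x11) / 2.0
    k = (x12 - x11) / a - 1.0
    return a, k

# Mean weights (g) from the example, ordered pg/pg, +/pg, +/+.
a, k = homozygous_effect_and_dominance(6.0, 12.0, 14.0)
print(a, k)
```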
Decomposition of the genotypic value, X
• $X_{ij}$ is the mean of Y for $A_iA_j$-individuals;
• when k = 0 the two alleles of a biallelic locus
behave in a completely additive fashion: X is a
linear function of the number of A2-alleles;
• we can then think of each allele contributing a
purely additive effect to X;
• this can be generalized to k ≠ 0 by
decomposition of X into additive contributions of
alleles together with deviations resulting from
dominance;
• the generalization is accomplished using least-squares
regression of X on the gene content;
Least-squares linear regression
• $X = \hat{X} + \delta$, i.e. fitted value + residual deviation;
• minimize the sum of squared residuals;
• $V(X) = V(\hat{X}) + V(\delta)$, the variance decomposition.
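Model 1
$$X_{ij} = \hat{X}_{ij} + \delta_{ij} = \mu + \alpha_i + \alpha_j + \delta_{ij}$$
• $\mu$ is the population mean phenotype,
• $\alpha_i$ is the additive effect of allele $A_i$,
• $\delta_{ij}$ is the residual deviation due to dominance;
• $\hat{X}_{ij} = \mu + \alpha_1 N_1 + \alpha_2 N_2$, with $N_k$ the number
of $A_k$-alleles in the genotype:
$$\hat{X}_{ij} = \begin{cases} \mu + 2\alpha_1 & \text{for } A_1A_1, \\ \mu + \alpha_1 + \alpha_2 & \text{for } A_1A_2, \\ \mu + 2\alpha_2 & \text{for } A_2A_2. \end{cases}$$
With $\alpha_1 p_1 + \alpha_2 p_2 = 0$ and $\alpha = \alpha_2 - \alpha_1$ it follows that
$$\alpha_1 = -p_2\alpha, \qquad \alpha_2 = p_1\alpha.$$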
Interpretations
• in the linear regression $X = \hat{X} + \delta$,
$\hat{X}$ is the heritable component of the genotype and
$\delta$ is the non-heritable part;
• the sum of an individual's additive allelic effects, $\alpha_i + \alpha_j$, is
called the breeding value and is denoted $\Lambda_{ij}$;
• under random mating $\alpha_i$ can be interpreted as the average
excess of allele $A_i$;
• this is defined as the difference between the expected
phenotypic value when one allele (e.g. the paternally
transmitted) is fixed at $A_i$ and the population average, $\mu$;
Linear Regression
• $p_k$ = proportion of $A_k$-alleles in the population;
• the expected additive effect of a randomly drawn
allele is 0, i.e.
$$\alpha_1 p_1 + \alpha_2 p_2 = 0;$$
• which implies the corresponding population
variance
$$\alpha_1^2 p_1 + \alpha_2^2 p_2;$$
• since for a biallelic locus $N_1 = 2 - N_2$,
$$X_{ij} = \tilde{\mu} + \alpha N_2 + \delta_{ij},$$
where
$$\tilde{\mu} = \mu + 2\alpha_1, \qquad \alpha = \alpha_2 - \alpha_1.$$
Graphically
Linear Regression Model: solving for $\alpha$
• $X_{ij} = \tilde{\mu} + \alpha N_2 + \delta_{ij}$, with

X           N2    prob.
0           0     p1^2
a(1 + k)    1     2 p1 p2
2a          2     p2^2

• $\alpha = \dfrac{\mathrm{cov}(X, N_2)}{\mathrm{var}(N_2)}$
$$E(X) = a(1+k) \cdot 2p_1p_2 + 2a p_2^2 = 2a p_2 (1 + p_1 k)$$
$$V(X) = a^2(1+k)^2 \cdot 2p_1p_2 + 4a^2 p_2^2 - 4a^2 p_2^2 (1 + p_1 k)^2$$
$$E(N_2) = 2p_2, \qquad \mathrm{Var}(N_2) = 2p_1p_2$$
$$E(XN_2) = a(1+k) \cdot 1 \cdot 2p_1p_2 + 2a \cdot 2 \cdot p_2^2 = 2a p_2 \, (2p_2 + p_1(1+k))$$
$$\mathrm{Cov}(X, N_2) = 2a p_2 \, [2p_2 + p_1(1+k) - 2p_2(1 + p_1 k)]$$
$$= 2a p_1 p_2 \, [1 + k - 2p_2 k]$$
$$= 2a p_1 p_2 \, [1 + k(p_1 - p_2)]$$
$$\Rightarrow \alpha = a\,[1 + k(p_1 - p_2)]$$
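Average excesses
$$\alpha_i^* = E(X \mid \text{one allele is } A_i) - \mu_X$$
$$\alpha_1^* = X_{12}\, P(\text{other allele is } A_2 \mid A_1) + X_{11}\, P(\text{other allele is } A_1 \mid A_1) - \mu_X$$
Under random mating this becomes
$$\alpha_1^* = X_{12}\, p_2 + X_{11}\, p_1 - \mu_X = (\alpha_1 + \alpha_2)\, p_2 + (2\alpha_1)\, p_1 = \alpha_1,$$
using $X_{ij} - \mu_X = \alpha_i + \alpha_j + \delta_{ij}$, the constraint $\alpha_1 p_1 + \alpha_2 p_2 = 0$, and the fact that the
least-squares dominance deviations average to zero for a fixed allele, $\delta_{11} p_1 + \delta_{12} p_2 = 0$.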
Interpretations under random mating
• $\alpha = a\,[1 + k(p_1 - p_2)]$;
$\alpha_1 = -p_2\,\alpha$;
$\alpha_2 = p_1\,\alpha$.
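These identities can be checked numerically. The sketch below is an illustration only: the function names and the parameter values (a = 4, k = 0.5, p2 = 0.3) are assumptions chosen for the example. It regresses X on N2 using Hardy-Weinberg genotype frequencies and compares the slope with a[1 + k(p1 - p2)], then compares the average excesses of A1 and A2 with -p2 α and p1 α.

```python
def genotype_table(a, k, p2):
    """(N2, genotypic value, Hardy-Weinberg frequency) for A1A1, A1A2, A2A2."""
    p1 = 1.0 - p2
    return [
        (0, 0.0, p1 * p1),
        (1, a * (1.0 + k), 2.0 * p1 * p2),
        (2, 2.0 * a, p2 * p2),
    ]

def regression_slope(a, k, p2):
    """alpha = cov(X, N2) / var(N2), computed directly from the table."""
    table = genotype_table(a, k, p2)
    mean_x = sum(f * x for _, x, f in table)
    mean_n = sum(f * n for n, _, f in table)
    cov_xn = sum(f * (x - mean_x) * (n - mean_n) for n, x, f in table)
    var_n = sum(f * (n - mean_n) ** 2 for n, _, f in table)
    return cov_xn / var_n

def average_excesses(a, k, p2):
    """Average excesses of A1 and A2 under random mating."""
    p1 = 1.0 - p2
    table = genotype_table(a, k, p2)
    x11, x12, x22 = (x for _, x, _ in table)
    mu = sum(f * x for _, x, f in table)
    return p1 * x11 + p2 * x12 - mu, p1 * x12 + p2 * x22 - mu

a, k, p2 = 4.0, 0.5, 0.3            # illustrative values only
p1 = 1.0 - p2
alpha = a * (1.0 + k * (p1 - p2))   # closed form from the slides
print(regression_slope(a, k, p2), alpha)
print(average_excesses(a, k, p2), (-p2 * alpha, p1 * alpha))
```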
Population parameters for k≠0
• $\alpha$ is called the average effect of allelic substitution:
substitute $A_1 \to A_2$ for a randomly chosen
$A_1$-allele;
• then the expected change in X is
$(X_{12} - X_{11})\, p_1 + (X_{22} - X_{12})\, p_2$;
• which equals $\alpha$ (by a simple calculation).
$\alpha$: Average effect of allelic substitution
[Figure: diagram of the allelic substitution $A_1 \to A_2$]
$$p_1 (X_{12} - X_{11}) + p_2 (X_{22} - X_{12}) = p_1 \cdot a(1+k) + p_2 \cdot a(1-k) = a\,[1 + k(p_1 - p_2)]$$
α is a function of p2 and k :
Partitioning the genetic variance
• the variance, V(X), of the genotypic values in
a population is called the genetic variance:
$$V(X) = V(\hat{X} + \delta) = V(\hat{X}) + V(\delta) = V_A + V_D$$
• $V_A = 2p_1p_2\alpha^2 = 2(p_1\alpha_1^2 + p_2\alpha_2^2)$ is the additive
genetic variance, i.e. the variance associated with
additive allelic effects;
• $V_D = (2p_1p_2\,ak)^2$ is the dominance genetic
variance, i.e. the variance due to dominance deviations;
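$V_A$
With $p_1\alpha_1 + p_2\alpha_2 = 0$,
$$V_A = 2(p_1\alpha_1^2 + p_2\alpha_2^2).$$
From the linear regression, $V_A = V(\alpha N_2)$:
$$V_A = (2p_1p_2 + 4p_2^2)\,\alpha^2 - (2p_1p_2 + 2p_2^2)^2\,\alpha^2$$
$$= 2p_1p_2\,\alpha^2$$
$$= 2p_1p_2\,a^2\,[1 + k(p_1 - p_2)]^2$$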
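As a numerical check of the partition V(X) = VA + VD, the following sketch computes the total genetic variance directly from the three genotypic values and their Hardy-Weinberg frequencies, and compares it with the sum of the two closed-form components. Function names and parameter values are assumptions made for the illustration.

```python
def additive_and_dominance_variance(a, k, p2):
    """Closed-form V_A and V_D for a biallelic locus under random mating."""
    p1 = 1.0 - p2
    alpha = a * (1.0 + k * (p1 - p2))
    v_a = 2.0 * p1 * p2 * alpha ** 2
    v_d = (2.0 * p1 * p2 * a * k) ** 2
    return v_a, v_d

def genetic_variance(a, k, p2):
    """V(X) computed directly from genotypic values 0, (1 + k)a, 2a."""
    p1 = 1.0 - p2
    values = [0.0, a * (1.0 + k), 2.0 * a]
    freqs = [p1 * p1, 2.0 * p1 * p2, p2 * p2]
    mean = sum(f * x for f, x in zip(freqs, values))
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, values))

v_a, v_d = additive_and_dominance_variance(4.0, 0.5, 0.3)  # illustrative values
print(v_a, v_d, v_a + v_d, genetic_variance(4.0, 0.5, 0.3))
```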
V(X), $V_A$ and $V_D$ are functions of $p_2$ and k:
$V_A$ [dashed] $= 2p_1p_2\,[a(1 + k(p_1 - p_2))]^2$;
$V_D$ [dotted] $= (2p_1p_2\,ak)^2$;
Example: The Booroola gene, (Lynch and Walsh, 1998)
In summary
• The homozygous effect a, and the dominance
coefficient k are intrinsic properties of allelic
products.
• The additive effect αi, and the average excess
αi* are properties of alleles in a particular
population.
• The breeding value is a property of a particular
individual in reference to a particular population.
It is the sum of the additive effects of an
individual's alleles.
• The additive genetic variance, VA, is a property
of a particular population. It is the variance of the
breeding values of individuals in the population.
Multilocus traits
• Do the separate locus effects combine in an
additive way, or is there non-linear
interaction between different loci (epistasis)?
• Do the genes at different loci segregate
independently?
• Does gene expression vary with the
environmental context (gene-by-environment
interaction)?
• Are specific genotypes associated with particular
environments (covariation of genotypic values
and environmental effects)?
Example: epistasis
Average length of vegetative internodes in the lateral branch
(in mm) of teosinte. Table from Lynch and Walsh (1998).
Two independently segregating loci
• Extending the least-squares decomposition of X:
$$X = \mu + \Lambda_1 + \delta_1 + \Lambda_2 + \delta_2 + \varepsilon$$
• $\Lambda_k$ is the breeding value of the k'th locus,
$\delta_k$ is the dominance deviation of the k'th locus,
$\varepsilon$ is a residual term due to epistasis;
• if the loci are independently segregating,
$$V(X) = V(\Lambda_1) + V(\Lambda_2) + V(\delta_1) + V(\delta_2) + V(\varepsilon)$$
$$= V_{A,1} + V_{A,2} + V_{D,1} + V_{D,2} + V(\varepsilon)$$
$$= V_A + V_D + V(\varepsilon)$$
Neglecting V (ε)
• the epistatic variance components contributing
to V (ε) are often small compared to VA and VD;
• in linkage analysis it is thus often assumed that
V (ε) = 0;
• note however: the relative magnitude of the
variance components provides only limited insight
into the physiological mode of gene action;
• epistatic interactions can greatly inflate the
additive and/or dominance components of
variance;
Resemblance between relatives
A model for the trait values of two relatives:
$Y_k = X_k + e_k$, k = 1, 2,
where for the k'th relative
• $Y_k$ is the phenotypic value,
• $X_k$ is the genotypic value,
• $e_k$ is the mean-zero environmental deviation.
• the $e_k$'s are assumed to be mutually independent
and also independent of the $X_k$'s. Hence, the covariance
of the trait values of two relatives is given by the
genetic covariance, $C(X_1, X_2)$, i.e.
$C(Y_1, Y_2) = C(X_1, X_2)$
A (preliminary) formula for $C(X_1, X_2)$
For a single-locus trait
$C(X_1, X_2) = c_1 V_A + c_2 V_D$
• $c_1$ and $c_2$ are constants determined by the type
of relationship between the two relatives;
• the same formula applies for multilocus traits if no
epistatic variance components are included in
the model, i.e. V (ε) = 0;
• in this latter case $V_A$ and $V_D$ are given by summation of
the corresponding locus-specific contributions.
Joint distribution of sibling trait values
Single biallelic, dominant (k =1 ) model. Correlation 0.46.
Measures of relatedness
• N = the number of alleles shared IBD by
two relatives at a given locus;
• the kinship coefficient, θ , is given by
2 θ = E(N) / 2;
i.e. twice the kinship coefficient equals the expected
proportion of alleles shared IBD at the locus.
• The coefficient of fraternity, Δ, is defined
as
Δ = P(N = 2).
Some examples
Here $z_j = P(N = j)$, j = 0, 1, 2.
• Siblings
$(z_0, z_1, z_2) = (1/4, 1/2, 1/4)$, implying E(N) = 1.
Thus θ = 1/4 and Δ = 1/4.
• Parent-offspring
$(z_0, z_1, z_2) = (0, 1, 0)$, implying E(N) = 1.
Thus θ = 1/4 and Δ = 0.
• Grandparent-grandchild
$(z_0, z_1, z_2) = (1/2, 1/2, 0)$, implying E(N) = 1/2.
Thus θ = 1/8 and Δ = 0.
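The three examples follow mechanically from the IBD probabilities. A minimal sketch, using exact fractions so that the output matches the values on the slide (the function name `relatedness` is hypothetical):

```python
from fractions import Fraction as F

def relatedness(z0, z1, z2):
    """Return (E(N), kinship theta, fraternity Delta) from IBD probabilities."""
    expected_n = z1 + 2 * z2
    theta = expected_n / 4      # because 2 * theta = E(N) / 2
    delta = z2                  # Delta = P(N = 2)
    return expected_n, theta, delta

ibd_probabilities = {
    "siblings": (F(1, 4), F(1, 2), F(1, 4)),
    "parent-offspring": (F(0), F(1), F(0)),
    "grandparent-grandchild": (F(1, 2), F(1, 2), F(0)),
}
for name, z in ibd_probabilities.items():
    print(name, relatedness(*z))
```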
Covariance formula for a single locus
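Under the assumed model
$$X_1 = \mu + \alpha_i^{(1)} + \alpha_j^{(1)} + \delta_{ij}^{(1)}$$
$$X_2 = \mu + \alpha_i^{(2)} + \alpha_j^{(2)} + \delta_{ij}^{(2)}$$
$$\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(\alpha_i^{(1)} + \alpha_j^{(1)},\ \alpha_i^{(2)} + \alpha_j^{(2)}) + \mathrm{Cov}(\delta_{ij}^{(1)},\ \delta_{ij}^{(2)})$$
$$C(Y_1, Y_2) = C(X_1, X_2) = 2\theta V_A + \Delta V_D = \frac{E(N)}{2}\, V_A + P(N = 2)\, V_D$$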
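To see what the single-locus formula gives for the common relative types, the sketch below evaluates (E(N)/2) VA + P(N = 2) VD from the IBD probabilities (z0, z1, z2). It also averages the covariance conditional on N (0, VA/2 and VA + VD for N = 0, 1, 2) over the same probabilities, which should give the same number. The function names and the variance components VA = 2, VD = 1 are hypothetical values chosen for illustration.

```python
def covariance_given_ibd(n_shared, v_a, v_d):
    """Conditional covariance: 0, V_A/2 or V_A + V_D for N = 0, 1, 2."""
    return (n_shared / 2.0) * v_a + (v_d if n_shared == 2 else 0.0)

def genetic_covariance(z, v_a, v_d):
    """(E(N)/2) * V_A + P(N = 2) * V_D for IBD probabilities z = (z0, z1, z2)."""
    expected_n = z[1] + 2.0 * z[2]
    return (expected_n / 2.0) * v_a + z[2] * v_d

v_a, v_d = 2.0, 1.0  # hypothetical variance components
relatives = [
    ("siblings", (0.25, 0.5, 0.25)),
    ("parent-offspring", (0.0, 1.0, 0.0)),
    ("grandparent-grandchild", (0.5, 0.5, 0.0)),
]
for name, z in relatives:
    averaged = sum(p * covariance_given_ibd(n, v_a, v_d) for n, p in enumerate(z))
    print(name, genetic_covariance(z, v_a, v_d), averaged)
```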
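A single locus; perfect marker data
$$C(Y_1, Y_2 \mid N) = \frac{N}{2}\, V_A + I\{N = 2\}\, V_D$$
with
$$I\{N = 2\} = \begin{cases} 1 & \text{if } N = 2, \\ 0 & \text{if } N = 0 \text{ or } N = 1, \end{cases}$$
i.e.
$$C(Y_1, Y_2 \mid N) = \begin{cases} 0 & \text{if } N = 0, \\ V_A / 2 & \text{if } N = 1, \\ V_A + V_D & \text{if } N = 2. \end{cases}$$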
Covariance formula for multiple loci
n independently segregating loci assuming no
epistatic interaction, i.e. putting V (ε) = 0
$$C(Y_1, Y_2) = C(X_1, X_2) = 2\theta V_A + \Delta V_D$$
$$= 2\theta \sum_l V_{A,l} + \Delta \sum_l V_{D,l}$$
$$= \sum_l \left[ \frac{E(N_l)}{2}\, V_{A,l} + P(N_l = 2)\, V_{D,l} \right];$$
$N_l$ is the number of alleles shared IBD at locus l;
$V_{A,l}$, $V_{D,l}$ are the locus-specific additive and dominance variance
contributions, respectively.
Covariance... continued
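Define for every pair of relatives
$$\pi^{(x)} = E[N_x \mid MD_x] / 2$$
and
$$\pi_2^{(x)} = P(N_x = 2 \mid MD_x);$$
For two related individuals we then have
$$C(Y_1, Y_2 \mid MD_x) = \sum_l \left( \frac{E[N_l \mid MD_x]}{2}\, V_{A,l} + P(N_l = 2 \mid MD_x)\, V_{D,l} \right)$$
$$= \pi^{(x)} V_{A,x} + \pi_2^{(x)} V_{D,x} + 2\theta V_{A,R} + \Delta V_{D,R},$$
where the subscript R denotes the combined contribution of the loci other than x.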
Haseman-Elston method
• Uses pairs of relatives of the same type: most
often sib pairs;
• for each relative pair calculate the squared
phenotypic difference: $Z = (Y_1 - Y_2)^2$;
• given $MD_x$, regress the Z's on the expected
proportion of alleles IBD, $\pi^{(x)} = E[N_x \mid MD_x]/2$, at
the test locus;
• a slope coefficient β < 0, if statistically significant,
is considered as evidence for linkage;
HE: an example
[Figure: HE regression example. Horizontal axis: proportion of marker alleles identical by descent. The solid line is the fitted regression line; the dotted line indicates the true underlying relationship.]
HE: motivation
$$E[(Y_1 - Y_2)^2] = V[Y_1 - Y_2]$$
$$= V(Y_1) + V(Y_2) - 2C(Y_1, Y_2)$$
$$= 2V(Y) - 2C(Y_1, Y_2)$$
Assume strictly additive gene action at each locus,
i.e. $V_D = 0$. Then, for a putative QTL at x,
$$E[(Y_1 - Y_2)^2 \mid MD_x] = 2V(Y) - 2C(Y_1, Y_2 \mid MD_x)$$
$$= 2V(Y) - 2\,[\pi^{(x)} V_{A,x} + 2\theta V_{A,R}]$$
NOTE: This is a linear function in $\pi^{(x)}$!
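The linear relationship can be checked on simulated data. The following is a minimal sketch under strong simplifying assumptions that are not part of the slides: a single additive biallelic QTL (k = 0) with no other genetic effects, random mating, IBD status observed without error (so the regressor is N/2), and normally distributed environmental deviations. All function names and parameter values are made up for the illustration, and no significance test is performed. With a = 1 and p2 = 0.5 the additive variance is 0.5, so the expected slope is -2 VA = -1 and the expected intercept is 2 V(Y) = 3.

```python
import random

def simulate_sib_pairs(n_pairs, a=1.0, p2=0.5, env_sd=1.0, seed=1):
    """Simulate sib pairs at one additive QTL; return (pi values, squared differences)."""
    rng = random.Random(seed)
    pi_values, sq_diffs = [], []
    for _ in range(n_pairs):
        # Parental genotypes: 1 codes an A2 allele, 0 codes an A1 allele.
        father = [int(rng.random() < p2) for _ in range(2)]
        mother = [int(rng.random() < p2) for _ in range(2)]
        # Each sib receives one paternal and one maternal allele (index 0 or 1).
        draws = [(rng.randrange(2), rng.randrange(2)) for _ in range(2)]
        # Alleles are IBD when both sibs received the same parental copy.
        n_ibd = sum(draws[0][i] == draws[1][i] for i in range(2))
        traits = []
        for pat, mat in draws:
            n2 = father[pat] + mother[mat]       # number of A2 alleles
            traits.append(a * n2 + rng.gauss(0.0, env_sd))
        pi_values.append(n_ibd / 2.0)
        sq_diffs.append((traits[0] - traits[1]) ** 2)
    return pi_values, sq_diffs

def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

pi_values, sq_diffs = simulate_sib_pairs(20000)
intercept, slope = least_squares_line(pi_values, sq_diffs)
print(intercept, slope)  # expected to be near 3 and -1 under the stated assumptions
```

With realistic sample sizes (a few hundred pairs) the estimated slope is much noisier, which is one way to see why power is a concern for this method.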
HE: linkage test
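$$E[(Y_1 - Y_2)^2 \mid MD_x] = \alpha + \beta\, \pi^{(x)}$$
where
$$\alpha = 2\,[V(Y) - 2\theta V_{A,R}], \qquad \beta = -2 V_{A,x}.$$
(Here α is the regression intercept, not the average effect of allelic substitution.)
The linkage test is
$H_0: \beta = 0$ (i.e. $V_{A,x} = 0$)
vs
$H_1: \beta < 0$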
HE: examples with simulated data
Simulated data from n = 200 sib pairs;
top to bottom: $h^2$ = 0.50, 0.33, 0.25.
Heritability and power
• for a given locus we may define the locus-specific
heritability as the proportion of the total variance
'explained' by that particular site, e.g. (in the narrow
sense),
$$h^2 = \frac{V_A}{V(Y)}$$
• the locus-specific heritability is the single most
important parameter for the power of QTL linkage
methods;
• heritabilities below about 10% require, in general,
unrealistically large sample sizes.
HE: two-point analysis
$$E[(Y_1 - Y_2)^2 \mid \text{marker genotypes}] = \tilde{\alpha} + \tilde{\beta}\, \pi^{(m)}$$
where $\pi^{(m)}$ is the expected proportion of marker
alleles shared IBD.
• $\tilde{\beta}$ depends on the type of relatives considered;
• for sib pairs $\tilde{\beta} = -2(1 - 2\theta)^2 V_{A,l}$;
• the recombination fraction (θ) and the effect size ($V_{A,l}$)
are confounded and cannot be separately
estimated;
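The confounding can be made concrete with the sib-pair slope formula: a weak QTL at the marker and a stronger QTL some distance away can give exactly the same regression slope. The sketch below uses a hypothetical function name and made-up values of θ and the additive variance.

```python
def two_point_slope(theta, v_a):
    """Sib-pair HE slope at a marker: -2 * (1 - 2*theta)^2 * V_A."""
    return -2.0 * (1.0 - 2.0 * theta) ** 2 * v_a

# Hypothetical scenarios: tight linkage with a small effect versus
# looser linkage with a four times larger effect.
tight_small = two_point_slope(theta=0.0, v_a=0.25)
loose_large = two_point_slope(theta=0.25, v_a=1.0)
print(tight_small, loose_large)

# At theta = 0.5 (no linkage) the slope vanishes whatever the effect size.
print(two_point_slope(theta=0.5, v_a=1.0))
```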
HE: in summary
Simple, transparent and comparatively robust but:
• poor statistical power in many settings;
• different types of relatives cannot be mixed;
• parents and their offspring cannot be used in HE;
• assumptions of the statistical model not generally
satisfied;
Remedy:
• use one of several suggested extensions of HE;
• alternatively, use VCA instead.
VCA
[Diagram: a QTL, polygenes and an independent environment each contribute to the trait value]
Mathematically:
$$Y_i = \mu + \beta^T a_i + g_i + q_i + e_i$$
where $\mu$ is the population mean, $a_i$ are the "environmental"
predictor variables (with coefficients β), $q_i$ is the major trait locus, $g_i$ is the
polygenic effect, and $e_i$ is the residual error.
VCA: an additive model
$$Y = \mu + \sum_{i=1}^{p} \beta_i z_i + \sum_{l=1}^{n} X_l + e$$
$$E(Y) = \mu + \sum_{i=1}^{p} \beta_i z_i;$$
$$V(Y) = V_A + V_D + V(e) = V_{A,x} + V_{D,x} + V_{A,R} + V_{D,R} + V(e)$$
$$C(Y_1, Y_2 \mid MD_x) = \pi^{(x)} V_{A,x} + \pi_2^{(x)} V_{D,x} + 2\theta V_{A,R} + \Delta V_{D,R}$$
VCA: major assumption
The joint distribution of the phenotypic values in a
pedigree is assumed to be multivariate normal with
the given mean values, variances and covariances;
• the multivariate normal distribution is completely
specified by the mean values, variances and
covariances;
• the likelihood, L, of the data can be calculated and
we can estimate the variance components
$V_{A,x}$, $V_{D,x}$, $V_{A,R}$, $V_{D,R}$;
VCA: linkage test
The linkage test of
$H_0: V_{A,x} = V_{D,x} = 0$
uses the LOD score statistic
$$\mathrm{LOD}_x = \log_{10} \frac{L(\text{full model})}{L(V_{A,x} = V_{D,x} = 0)}$$
When the position of the test locus, x, is varied over
a chromosomal region the result can be
summarized in a LOD score curve.
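The two ingredients of this statistic can be sketched for the simplest case of relative pairs. The code below is an illustration under assumptions not stated on the slides: both relatives share the same mean and variance, so each pair is bivariate normal, and log-likelihoods are natural logarithms. The function names are hypothetical. The numerical maximisation over the variance components, which is needed to obtain the two likelihood values in practice, is not shown.

```python
import math

def pair_log_likelihood(y1, y2, mean, variance, covariance):
    """Natural-log bivariate normal density for one relative pair."""
    d1, d2 = y1 - mean, y2 - mean
    det = variance ** 2 - covariance ** 2
    quad = (variance * d1 * d1 - 2.0 * covariance * d1 * d2
            + variance * d2 * d2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def lod_score(loglik_full, loglik_null):
    """LOD = log10 of the likelihood ratio, from natural-log likelihoods."""
    return (loglik_full - loglik_null) / math.log(10.0)

# Usage with made-up maximised log-likelihoods (placeholders, not real results):
print(lod_score(-120.3, -125.0))
```

In a full analysis the covariance passed to `pair_log_likelihood` would depend on the estimated IBD sharing at x for each pair, and the log-likelihoods would be summed over pairs (or computed jointly over whole pedigrees) before being maximised.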
VCA vs HE: LOD score profiles
From Pratt et al.; Am. J. Hum. Genet. 66:1153-1157, (2000)
Linkage methods for QTL
• Fully parametric linkage approach is difficult;
• Model-free tests comprise the alternative choice;
• We will discuss
Haseman-Elston Regression (HE);
Variance Components Analysis (VCA);
Both can be viewed as two-step procedures:
1. use polymorphic molecular markers to extract
information on inheritance patterns;
2. evaluate evidence for a trait-influencing locus
at specified locations;
Similarities and differences
• HE and VCA are based on estimated IBD sharing given marker data;
• both methods require specification of a statistical
model!
('model-free' means 'does not require
specification of genetic model')
• similarity in IBD-sharing is used to evaluate trait
similarity using either
linear regression (HE) or
variance components analysis (VCA);