Gene mapping by
association
3/4/04
Biomath/HG 207B/Biostat 237
Linkage has its limits
To determine that a trait is closer to
marker 1 than marker 2, we need to
see recombination between marker 2
and the trait locus.
As distance between the markers
decreases the number of informative
meioses needed to see recombination
increases.
At some point linkage analysis
becomes impractical because too
many families are needed.
Association Studies
• Association is a statistical term that
describes the co-occurrence of
alleles or phenotypes.
• An allele A is associated with
disease D, if people with D have a
different frequency of A than people
without D.
Possible causes for allelic
association
• best: allele increases
disease susceptibility
– candidate gene
studies
• good: some subjects
share common ancestor
– linkage disequilibrium
studies
[Figure: loci M and K, alleles A1 and D]
Under linkage equilibrium
P(A,D) = P(A)*P(D)
Violation of the equality is termed
linkage disequilibrium
Linkage Disequilibrium
Suppose one of the population founders carries an
allelic variant that increases risk of a disease. The
disease gene is very close to a marker, so the
recombination fraction θ is very small.
[Figure: founder chromosomes; the single D allele sits on a haplotype carrying marker allele a]
Ancestral haplotypes are dA, da, and Da.
Note that D is associated with a. P(a|D) is close to one.
Over many generations (n), there is
occasionally recombination between
the two genes. So that the population
looks like:
[Figure: present-day chromosomes; most D alleles still sit with a, but recombination has placed some D alleles with A]
The degree of association between D and a, P(a|D), has decreased,
but still P(a|D) > P(a), so P(aD) > P(a)P(D).
The Degree of Association Between Two Genes
Depends on the Distance Between them and
the Age of the Population
1. Let $\delta_{aD} = P(aD) - P(a)P(D)$, and similarly for
other alleles. After n generations,
$\delta_{aD}(n) = \delta_{aD}(0)\,(1-\theta)^n$
2. At linkage equilibrium
P(a/a|D/d)=P(a/a|d/d)=P(a/a|D/D)=P(a/a)
P(A/a|D/d)=P(A/a|d/d)=P(A/a|D/D)=P(A/a)
P(A/A|D/d)=P(A/A|d/d)=P(A/A|D/D)=P(A/A)
Violation of these equalities is evidence of linkage
disequilibrium.
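The geometric decay of disequilibrium with the recombination fraction and the number of generations can be sketched numerically. The function names and the starting value 0.1 below are illustrative assumptions, not values from the lecture.

```python
import math

def ld_after_generations(delta0, theta, generations):
    """Disequilibrium left after `generations`, given initial value delta0
    and recombination fraction theta: delta0 * (1 - theta)**generations."""
    return delta0 * (1.0 - theta) ** generations

def generations_to_fraction(theta, fraction):
    """Generations needed for disequilibrium to fall to `fraction` of its
    initial value (solves (1 - theta)**n = fraction for n)."""
    return math.log(fraction) / math.log(1.0 - theta)

if __name__ == "__main__":
    # Hypothetical starting disequilibrium of 0.1 at two map distances.
    for theta in (0.01, 0.001):
        remaining = ld_after_generations(0.1, theta, 100)
        half_life = generations_to_fraction(theta, 0.5)
        print(f"theta={theta}: after 100 generations {remaining:.4f}, "
              f"half-life {half_life:.0f} generations")
```

Tighter linkage (smaller θ) preserves the association for many more generations, which is why association mapping can resolve distances that linkage analysis cannot.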
Allelic association studies
test whether alleles are
associated with the trait
• 2 types of association tests
– population-based
association test
• cases and controls are unrelated
• cross-classify by genotype
• use χ² test or logistic regression
– family-based association
tests
• cases and controls are related:
parents, sibs etc
• often based on allele
transmission rates
• prime example TDT
Mapping Genes using a Case Control Design
1. Example: Non-insulin-dependent diabetes in Pima
Indians is associated with the human immunoglobulin
gene, Gm (Knowler et al., 1988).

Genotype      Cases          Controls       Total
1/1 or 1/2    23 (.0169)     270 (.0760)    293
2/2           1343 (.983)    3284 (.924)    4627
Total         1366           3554           4920

χ² = 61.6, p < 0.00005.
2. What can go wrong? Association could be due to
ethnic differences among cases and controls, i.e. population stratification.
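The χ² value quoted for the Gm table can be checked with a few lines of code. This is a minimal sketch of the Pearson χ² statistic for a 2x2 table without continuity correction; the function name is an assumption, and the counts are the ones in the table above.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
    [[a, b], [c, d]], with its asymptotic p-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df the upper tail of the chi-square is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

if __name__ == "__main__":
    # Rows: genotype 1/1 or 1/2, genotype 2/2. Columns: cases, controls.
    stat, p_value = chi_square_2x2(23, 270, 1343, 3284)
    print(f"chi-square = {stat:.1f}, p = {p_value:.2e}")
```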
A Dramatic Example of when Association is due to
Population Stratification
1. The Gm genotype differs by degree of Caucasian
heritage

Genotype      >50%          <50%           Total
1/1 or 1/2    184 (.441)    109 (.0242)    293
2/2           233           4394           4627
Total         417           4503           4920

χ² = 1185.5, p < 0.00005
2. Diabetes prevalence differs by Caucasian heritage

Diabetes    >50%          <50%    Total
Yes         20 (.0146)    1346    1366
No          397 (.112)    3157    3554
Total       417           4503    4920

3. After controlling for age and degree of Caucasian heritage,
diabetes and Gm are not significantly associated
(Knowler et al., 1988).
How concerned should we be about
population stratification invalidating
case/control results?
1. The allele frequencies and disease prevalence rarely differ
as dramatically by race as in the example.
2. Good epidemiological methods can reduce the problem.
Collect information on racial/ethnic background
3. Sometimes there is no alternative to a case/control design.
Family controls may not be available.
On the other hand,
1. Better safe than sorry - Family based control designs
2. Family based designs require more genotyping but
not more phenotyping than case/control
The Transmission Disequilibrium Test eliminates
concern over false positives due to population
stratification
Spielman et al., 1993
Terwilliger and Ott, 1992
A simple illustration of the TDT:
Collect parent-child trios, here with parental genotypes A/a and A/A.
If the child is chosen without regard to disease status,
then the child’s genotype is equally likely to be A/a or A/A.
However, if the child is chosen because they are affected
and the marker allele a is associated with the disease allele D
[Figure: parents with haplotypes dA/Da and dA/dA; affected child with Da/dA]
then the child is more likely to have the A/a genotype
at the marker than the A/A genotype.
Testing for Transmission Distortion (Disequilibrium)
A biallelic locus
1. Select individuals with the disease, genotype these
individuals as well as their parents.
2. Determine how many heterozygous parents transmit A
and how many transmit a.
3. Under the null hypothesis, the probability that a parent
with the A/a genotype transmitted an A is ½.
4. Also under the null hypothesis, the maternal and
paternal transmissions are independent.
5. In the case where there are only two alleles at the
marker the test is equivalent to a McNemar test.
Transmitted / Not transmitted    A      a
A                                C11    C12
a                                C21    C22

Test statistic $T = (C_{21} - C_{12})^2 / (C_{21} + C_{12})$
For large samples and under the null hypothesis, T has
a chi-square distribution (df=1)
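The biallelic test statistic uses only the two off-diagonal counts. The sketch below computes T, its asymptotic chi-square p-value, and a two-sided exact Binomial(n, 1/2) p-value. The function name and the counts 30 and 50 are hypothetical, chosen only for illustration.

```python
import math

def tdt_biallelic(c12, c21):
    """McNemar-type TDT from the two off-diagonal counts of the 2x2
    transmission table. Returns (T, asymptotic p, exact binomial p)."""
    n = c12 + c21
    if n == 0:
        raise ValueError("no transmissions from heterozygous parents")
    stat = (c21 - c12) ** 2 / n
    # Chi-square with 1 df: upper tail is erfc(sqrt(T / 2)).
    p_asymptotic = math.erfc(math.sqrt(stat / 2.0))
    # Exact two-sided p-value: double the smaller Binomial(n, 1/2) tail.
    smaller = min(c12, c21)
    tail = sum(math.comb(n, i) for i in range(smaller + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    return stat, p_asymptotic, p_exact

if __name__ == "__main__":
    # Hypothetical counts: 80 heterozygous parents, split 30 versus 50.
    stat, p_asym, p_exact = tdt_biallelic(30, 50)
    print(f"T = {stat:.2f}, asymptotic p = {p_asym:.4f}, exact p = {p_exact:.4f}")
```

Transmissions from homozygous parents fall on the diagonal of the table and never enter the calculation.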
What are we testing with the TDT?
A strongly positive result suggests that the marker tested is
a trait susceptibility locus or that the marker is closely linked
to a trait susceptibility locus.
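$H_0: P(A/a \to a) = P(A/a \to A) = \tfrac{1}{2}$
$H_{Alt}: P(A/a \to a) \neq \tfrac{1}{2}$
For a single affected child per family, the null and alternative
hypotheses are equivalent to:
$H_0: \theta = \tfrac{1}{2}$ or $\delta = 0$
$H_{Alt}: \theta < \tfrac{1}{2}$ and $\delta \neq 0$
where θ is the recombination fraction and δ is the disequilibrium coefficient.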
When more than one affected child per family is used,
the TDT confounds linkage and association. Thus
little is gained by running the TDT on a data set consisting of
several very large pedigrees if linkage of the trait and marker
has already been established. With many small unrelated
pedigrees information on association can still be gained.
The TDT has been extended to
multiple alleles per locus
Ho = transmission to affected child is not dependent on allele type
Ha = transmission to affected child depends on allele type
                           Allele transmitted
not-transmitted    1         2         ...    k-1        k         total
1                  ---       C1,2      ...    C1,k-1     C1,k      n1
2                  C2,1      ---       ...    C2,k-1     C2,k      n2
...
k-1                Ck-1,1    Ck-1,2    ...    ---        Ck-1,k    nk-1
k                  Ck,1      Ck,2      ...    Ck,k-1     ---       nk
total              t1        t2        ...    tk-1       tk

ti represents the column sum omitting the diagonal term,
ni the row sum also omitting the diagonal.
Test statistics include
$$T_{mh} = \frac{k-1}{k} \sum_{i=1}^{k} \frac{(t_i - n_i)^2}{t_i + n_i}$$
(Mendel’s TDT1 is proportional to this statistic.)
$$TDT_2 = \max_{1 \le i \le k} \frac{(t_i - n_i)^2}{t_i + n_i}$$
Under some conditions, Tmh is asymptotically
distributed as chi-square with k-1 degrees of freedom
Numerical example: data from
a locus with 5 alleles. 120 transmissions from
heterozygous parents to affected children.
                    Allele transmitted
not-trans.    1     2     3     4     5     n
1             --    6     4     4     5     19
2             6     --    5     7     4     22
3             8     7     --    7     5     27
4             8     5     5     --    6     24
5             7     8     7     6     --    28
t             29    26    21    24    20    120
Tmh = ?
TDT2 = ?
Is there evidence of transmission distortion?
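One way to work the numerical example is to compute the marginal totals and both statistics directly from the table. The sketch below assumes the layout of the example (rows index the allele not transmitted, columns the allele transmitted); the function names are illustrative.

```python
# Rows: allele not transmitted; columns: allele transmitted (alleles 1-5).
# Diagonal entries are unused and stored as 0.
COUNTS = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]

def marginal_totals(counts):
    """Return (t, n): column sums and row sums, both omitting the diagonal."""
    k = len(counts)
    t = [sum(counts[r][c] for r in range(k) if r != c) for c in range(k)]
    n = [sum(counts[r][c] for c in range(k) if c != r) for r in range(k)]
    return t, n

def tdt_statistics(counts):
    """Return (T_mh, TDT2) for a k-allele transmission table."""
    t, n = marginal_totals(counts)
    k = len(counts)
    terms = [(ti - ni) ** 2 / (ti + ni) for ti, ni in zip(t, n) if ti + ni > 0]
    return (k - 1) / k * sum(terms), max(terms)

if __name__ == "__main__":
    t, n = marginal_totals(COUNTS)
    t_mh, tdt2 = tdt_statistics(COUNTS)
    print("t =", t, "n =", n)
    print(f"T_mh = {t_mh:.3f} (compare with chi-square, df = {len(COUNTS) - 1})")
    print(f"TDT2 = {tdt2:.3f}")
```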
MENDEL determines significance using
permutation procedures
Why?
If the sample size is small or alleles are rare, the
distribution of the TDT statistic is poorly approximated
by a chi-square distribution.
How?
(1) For each iteration (usually 10,000 or more)
(a) Calculate a new TDT table.
Hold the parental genotypes fixed. For each
child, designate with equal probability that
the child gets one of the parental
alleles.
(b) Calculate the TDT statistic and determine if
larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations
in which the TDT statistic is larger than the observed
divided by the total number of iterations.
What is the reason for the standard error?
Permutation p-values are estimated using Monte Carlo
simulation with a finite number of iterations.
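The procedure above can be sketched for trio data as follows. This is an illustration under stated assumptions, not MENDEL's implementation: each heterozygous parent is stored as a hypothetical (transmitted, not transmitted) pair of allele indices, the T_mh statistic is used, and ties with the observed value are counted as exceedances.

```python
import random

def t_mh(pairs, k):
    """T_mh statistic from (transmitted, not_transmitted) allele index pairs."""
    t = [0] * k
    n = [0] * k
    for transmitted, not_transmitted in pairs:
        t[transmitted] += 1
        n[not_transmitted] += 1
    terms = [(ti - ni) ** 2 / (ti + ni) for ti, ni in zip(t, n) if ti + ni > 0]
    return (k - 1) / k * sum(terms)

def permutation_p_value(pairs, k, iterations=10000, seed=0):
    """Monte Carlo p-value and its standard error.

    Parental genotypes stay fixed; in every iteration each heterozygous
    parent transmits either of its two alleles with probability 1/2.
    """
    rng = random.Random(seed)
    observed = t_mh(pairs, k)
    hits = 0
    for _ in range(iterations):
        simulated = [(a, b) if rng.random() < 0.5 else (b, a) for a, b in pairs]
        if t_mh(simulated, k) >= observed:
            hits += 1
    p = hits / iterations
    # Binomial standard error of the estimated p-value.
    standard_error = (p * (1.0 - p) / iterations) ** 0.5
    return p, standard_error

if __name__ == "__main__":
    # Hypothetical data: 14 of 20 heterozygous parents transmit allele 0.
    data = [(0, 1)] * 14 + [(1, 0)] * 6
    p, se = permutation_p_value(data, k=2)
    print(f"permutation p = {p:.4f} (s.e. {se:.4f})")
```

The standard error shrinks with the square root of the number of iterations, so reporting it shows how far the estimated p-value can be trusted.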
TDT Summary
• ignores transmissions from homozygous
parents
• with two alleles it has an approximate chisquare(1) distribution (McNemar test)
– but exact p-values can be computed from the
Binomial(p=.5) distribution in the bi-allelic case
• If there is one affected per nuclear family this
tests the null: no linkage or no association
– If test is significant, there is linkage and association
• If there are multiple affecteds, the TDT will
confound linkage and association owing to the
dependencies of the trios.
– users should not expect new insight when the data
consists of one or two large disease pedigrees already
showing linkage
– with many small unrelated pedigrees, the chance of
confusing linkage with association becomes less of
an issue, and the TDT can help in identifying
associated marker alleles.
Limitations of the original TDT
(1) Nuclear Families
(2) Qualitative traits
(3) Codominant markers
Many methods for extending the TDT have
been developed.
We will discuss one in detail, the gamete competition
model.
One way to extend the TDT:
Lange (1988), Jin et al. (1994), and Sham and Curtis (1995)
considered a model (Bradley and Terry, 1952) that was originally
used to rank teams and predict the outcome of team sports.
How does the model work?
Look at a specific example:
Suppose we are interested in predicting the outcome
of a playoff game where the Diamondbacks play the Dodgers.
Or suppose we want to know the probability that the Dodgers
will be the National League West winners this year, given
the regular season results from last year.
Suppose results are:
                         Winner
Loser      D’backs  Dodgers  Giants  Rockies  Padres
D’backs    ---      6        4       4        5
Dodgers    6        ---      7       5        4
Giants     8        5        ---     5        6
Rockies    8        7        7       ---      5
Padres     7        8        6       7        ---
Let D’backs/Dodgers → Dodgers denote the event
that the D’backs and Dodgers play and the Dodgers win.
In general for each team i, we assign a win parameter
ti so that the probability that i beats j is:
$$P(i/j \to i) = \frac{t_i}{t_i + t_j}$$
Bradley-Terry Model of Competing Sports Teams
Note that multiplying every ti by the same a>0 does not change
these probabilities, so one ti can be fixed at 1. We fix tdodgers = 1.
Note that if ti > tj for all j ≠ i then i is the best team.
Let yij denote the number of times that i plays j and
i wins. For example, the D’backs beat the Giants 8 times and the
Giants beat the D’backs 4 times (yij = 8 and yji = 4). The win
parameters can be determined using the following recurrence
relationship
$$t_i^{m+1} = \frac{\sum_{j \neq i} y_{ij}}{\sum_{j \neq i} \dfrac{y_{ij} + y_{ji}}{t_i^{m} + t_j^{m}}}$$
where the loglikelihood is
$$\ln(L) = \sum_{i} \sum_{j \neq i} y_{ij} \left[ \ln t_i - \ln(t_i + t_j) \right]$$
RESULTS
                         Winner
Loser      D’backs  Dodgers  Giants  Rockies  Padres
D’backs    ---      6        4       4        5
Dodgers    6        ---      7       5        4
Giants     8        5        ---     5        6
Rockies    8        7        7       ---      5
Padres     7        8        6       7        ---
Ho = all teams are equally likely to win (ti = 1 for all i), so that
$$P(i/j \to i) = \frac{1}{2}$$
LRT = 3.63 on 4 degrees of freedom. With a p-value of 0.46 we fail
to reject the null hypothesis.
We get more information from this analysis:
we get the relative rankings.
td’backs = 1.23, tdodgers = 1.00,
tgiants = 0.87, trockies = 0.71, tpadres = 0.67
With these rankings we can make predictions about the
outcomes of games:
$$P(\text{d'backs/dodgers} \to \text{d'backs}) = \frac{1.23}{2.23} = .55$$
$$P(\text{giants/d'backs} \to \text{d'backs}) = \frac{1.23}{2.10} = .59$$
Note that these probabilities are different from the predictions
if we just used the individual match up records. The estimate is
not 8/12 = .67 for the D’backs beating the Giants.
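The recurrence for the win parameters can be sketched in a few lines. The code below is an illustration, not the software used in the lecture: it iterates the update on the win/loss table, rescales after every pass so that a caller-chosen reference team has parameter 1, and forms the likelihood ratio statistic against the null of equal teams. The names `WINS`, `fit_bradley_terry`, and `reference` are assumptions, and the call below uses the Dodgers as the reference team.

```python
import math

TEAMS = ["D'backs", "Dodgers", "Giants", "Rockies", "Padres"]
# WINS[i][j] = number of times team i beat team j
# (the transpose of the loser-by-winner table).
WINS = [
    [0, 6, 8, 8, 7],
    [6, 0, 5, 7, 8],
    [4, 7, 0, 7, 6],
    [4, 5, 5, 0, 7],
    [5, 4, 6, 5, 0],
]

def log_likelihood(wins, t):
    """ln L = sum over i != j of y_ij * [ln t_i - ln(t_i + t_j)]."""
    k = len(t)
    return sum(wins[i][j] * (math.log(t[i]) - math.log(t[i] + t[j]))
               for i in range(k) for j in range(k) if i != j)

def fit_bradley_terry(wins, reference=0, max_iter=10000, tol=1e-12):
    """Iterate the recurrence until the win parameters stop changing."""
    k = len(wins)
    t = [1.0] * k
    for _ in range(max_iter):
        new_t = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denominator = sum((wins[i][j] + wins[j][i]) / (t[i] + t[j])
                              for j in range(k) if j != i)
            new_t.append(total_wins / denominator)
        # Only ratios matter, so rescale to keep the reference team at 1.
        new_t = [value / new_t[reference] for value in new_t]
        converged = max(abs(a - b) for a, b in zip(new_t, t)) < tol
        t = new_t
        if converged:
            break
    return t

t_hat = fit_bradley_terry(WINS, reference=TEAMS.index("Dodgers"))
lrt = 2.0 * (log_likelihood(WINS, t_hat) - log_likelihood(WINS, [1.0] * len(TEAMS)))

if __name__ == "__main__":
    for team, value in zip(TEAMS, t_hat):
        print(f"t[{team}] = {value:.2f}")
    print(f"LRT = {lrt:.2f} on {len(TEAMS) - 1} degrees of freedom")
```

Because every pair of teams met the same number of times in this table, the fitted parameters rank the teams in the same order as their total wins.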
How is this sports analysis analogous to TDT?
Think of :
(1) Each possible allele at locus = a team
(2) A heterozygous parent = a match up
(3) Allele received by child from a
heterozygous parent = the winner of the game
(4) The transmission parameters = the win parameters
(5) The win/loss record is determined by
the transmissions from heterozygous parents.
                    transmitted
not trans.    1     2     3     4     5
1             --    6     4     4     5
2             6     --    7     5     4
3             8     5     --    5     6
4             8     7     7     --    5
5             7     8     6     7     --
When we ignore disease status, the Bradley-Terry model
provides a form of segregation analysis.
When we consider the transmission to affected members
only (like this example) we have a form of TDT analysis.
The gamete competition likelihood for a pedigree
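The general form of the gamete competition likelihood for a
pedigree with n individuals is
$$L = \sum_{G_1} \cdots \sum_{G_n} \prod_{i} \mathrm{Pen}(X_i \mid G_i) \prod_{j} \mathrm{Prior}(G_j) \prod_{\{k,l,m\}} \mathrm{Tran}(G_m \mid G_k, G_l)$$
Here person i has marker phenotype Xi and underlying
marker genotype Gi. The product over j runs over founders and the
product over {k,l,m} over parent-offspring trios (parents k and l, child m).
For founders, Prior(Gj) is the prior probability of genotype Gj.
For offspring, the transmission probability factors as
Tran(Gm | Gk, Gl) = Tran(Gmk | Gk) * Tran(Gml | Gl)
Tran(Gmk | Gk) = tmk/(tmk + tnk) and
Tran(Gml | Gl) = tml/(tml + tnl),
where, for a heterozygous parent k, mk is the allele transmitted to
child m and nk is the allele not transmitted.
The penetrance, Pen(Xi | Gi), is always 1 or 0, depending on
whether Xi and Gi are consistent or inconsistent.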
Assessing significance
We use a likelihood ratio test statistic
LRT = 2*( ln(LHa)-ln(LHo) )
Where LHa and LHo are the maximum likelihoods
under the alternative and null hypotheses.
Significance?
Approximate p-values can be calculated by assuming
the distribution is chi-square or by gene dropping.
Gamete Competition contrasted with the TDT
(1) Gamete Competition works on extended pedigrees
No need to break up large families into nuclear families.
(2) If there are only trios, the gamete competition and
the TDT are equivalent. Their null hypothesis
is no linkage or no association. The alternative
hypothesis is linkage and association.
(3) When considering more than one affected per
family, the TDT and gamete competition confound
association with linkage.
(4) Exact p-values can be determined with the TDT.
Gamete competition p-values are asymptotic.
(5) The gamete competition model can be used
when there is missing marker information.
Allele frequencies can be fixed at population estimates
or estimated along with the t’s.
(6) When there is missing data, the gamete competition
is not immune to the effects of population stratification
or rare alleles.
Example: Families affected with Non-insulin-dependent
Diabetes and linkage to a marker within the sulfonylurea
receptor-1 (SUR) gene
27 Mexican-American extended pedigrees with 74
affected offspring (all genotyped) at SUR
The likelihood ratio test statistic is 9.133 with 9 degrees
of freedom. P-value = 0.043

allele    freq    ti      se of ti
1         .054    .288    .215
2         .210    1.00    fixed
3         .190    .810    .447
4         .048    1.40    .985
5         .047    .697    .681
6         .108    .383    .204
7         .140    .556    .288
8         .091    .567    .322
9         .071    .499    .509
10        .042    .082    .104
Can this model be extended to quantitative
traits?
Yes, by recognizing that the Bradley-Terry model is
equivalent to a matched case-control design. The
transmitted allele is the case, the untransmitted allele is
the control.
$$t_i = e^{\omega_i x_p}$$
where xp denotes child p’s standardized trait value,
ωi is the parameter for allele i, and the probability of an i/j heterozygous
parent transmitting i is
$$P(i/j \to i) = \frac{e^{(\omega_i - \omega_j) x_p}}{e^{(\omega_i - \omega_j) x_p} + 1}$$
Note that one ω is set to zero.
This is equivalent to conditional logistic regression.
Quantitative Trait Example: ACE
High ACE concentration is associated with a deletion
within an intron of the ACE gene.
404 people in 69 families (Sinsheimer et al., 2000).
$$P(\text{insertion/deletion} \to \text{deletion}) = \frac{e^{\omega_{deletion} x_k}}{e^{\omega_{deletion} x_k} + 1}$$
$$P(\text{insertion/deletion} \to \text{insertion}) = \frac{1}{e^{\omega_{deletion} x_k} + 1}$$
$$P(\text{insertion/deletion} \to \text{deletion}) + P(\text{insertion/deletion} \to \text{insertion}) = 1.0$$

allele       mle     s.e. of mle
insertion    0.00    fixed
deletion     1.31    0.17

Ho: ω_deletion = 0   Ha: ω_deletion ≠ 0
LRT = 82.76   Asymptotic p-value < 1 x 10^-19
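To see what the fitted deletion parameter implies, the transmission probability can be evaluated at a few trait values. This is a minimal sketch: the function name is an assumption, ω is the symbol used here for the allele parameter, and only the estimate 1.31 comes from the example.

```python
import math

def transmission_probability(omega_i, omega_j, x):
    """P(i/j -> i) for a child with standardized trait value x.

    Equals exp((omega_i - omega_j) * x) / (exp((omega_i - omega_j) * x) + 1),
    written in the numerically equivalent logistic form.
    """
    z = (omega_i - omega_j) * x
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    omega_deletion, omega_insertion = 1.31, 0.0  # insertion is the fixed allele
    for x in (-1.0, 0.0, 1.0, 2.0):
        p = transmission_probability(omega_deletion, omega_insertion, x)
        print(f"trait value {x:+.1f}: P(deletion transmitted) = {p:.3f}")
```

A child with an average trait value (x = 0) receives either allele with probability 1/2, and the distortion grows as the trait value moves away from the mean.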
Another Example: Analyzing tightly linked SNPs:
SNPs (single nucleotide polymorphisms) tend to be more
stable and more abundant than microsatellite markers.
They are predominately biallelic, so we would like to use several
tightly linked markers simultaneously to increase the overall
information content.
Recall that we use the allele transmissions from heterozygous
parents.
Assuming HWE, the maximum possible proportion of heterozygous
parents for a biallelic system is 0.50. For an n-allele system, it
is H = (n-1)/n. More alleles, more information.
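The bound follows from the Hardy-Weinberg heterozygosity, one minus the sum of squared allele frequencies, which is largest when all alleles are equally frequent. A short sketch (function names are illustrative):

```python
def expected_heterozygosity(freqs):
    """Expected proportion of heterozygotes under Hardy-Weinberg equilibrium."""
    return 1.0 - sum(p * p for p in freqs)

def max_heterozygosity(n_alleles):
    """Heterozygosity when all n alleles are equally frequent: (n - 1) / n."""
    return expected_heterozygosity([1.0 / n_alleles] * n_alleles)

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n} alleles: maximum heterozygosity = {max_heterozygosity(n):.3f}")
    # A single SNP with a rare allele is far less informative than the bound.
    print(f"SNP with frequencies 0.9/0.1: {expected_heterozygosity([0.9, 0.1]):.3f}")
```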
The phase of these multilocus SNPs may
not be known:
Example: suppose there are three SNPs.
An individual with multilocus genotype 1/2, 1/2, 1/2
could have one of the following haplotypes:
(1) 111 and 222, (2) 122 and 211 (3) 121 and 212 or (4)
112 and 221.
The gamete competition allows the use of non-codominant markers, so we
don’t need to determine which of these haplotype combinations is
present in a particular individual.
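The phase ambiguity can be made concrete by enumerating every haplotype pair consistent with an unphased multilocus genotype. The function name and the input format (one allele pair per SNP) are assumptions made for this sketch.

```python
from itertools import product

def haplotype_pairs(genotypes):
    """List the distinct unordered haplotype pairs consistent with an
    unphased genotype, given as one (allele, allele) tuple per SNP."""
    pairs = set()
    # At each SNP either allele may sit on the first haplotype.
    for phase in product(*[((a, b), (b, a)) for a, b in genotypes]):
        first = "".join(alleles[0] for alleles in phase)
        second = "".join(alleles[1] for alleles in phase)
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

if __name__ == "__main__":
    # Heterozygous 1/2 at all three SNPs, as in the example.
    for first, second in haplotype_pairs([("1", "2")] * 3):
        print(first, "and", second)
```

With h heterozygous SNPs there are 2^(h-1) possible pairs, so the ambiguity grows quickly as more SNPs are combined.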
For two linked loci associated with a quantitative trait,
the transmission probability is expressed as:
$$P(ij/kl \to ij) = \frac{(1-\theta)\, e^{\omega_{ij} x_p}}{(1-\theta)\left[ e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p} \right] + \theta \left[ e^{\omega_{il} x_p} + e^{\omega_{kj} x_p} \right]}$$
If we are using tightly linked SNPs, then θ is effectively zero
and the transmission probability reduces to:
$$P(ij/kl \to ij) = \frac{e^{\omega_{ij} x_p}}{e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}}$$
An Example
Again we use sex-adjusted ACE levels as a quantitative
trait.
The three SNPs are labeled by their position and the
nucleotides present at the position. A-240T, T1237C, and
G2350A. Because the ACE gene spans only 26kb, the
recombination fractions between these SNPs are
effectively zero.
The pedigree data consist of 83 white British families
ranging in size from 4 to 18 members. ACE levels were
determined on 405 family members. Genotypes were
collected on 555 family members.
In MENDEL, the most important difference
from the previous example will be observed in the locus file.
We need to allow for phase ambiguities (lack of certainty in
haplotypes).
L469
AUTOSOME 627   <- # haplotypes (6), # phenotypes (27)
ATA   0.40190
ATG   0.00780
ACA   0.06740
ACG   0.18310
T*A   0.01340   !T*A corresponds to haplotypes TTA and TCA
T*G   0.32640   !T*G corresponds to haplotypes TTG and TCG
We are no longer assuming co-dominant markers so we must
specify the phenotype (of the marker) / genotype relationship.
These phenotypes correspond to the marker phenotypes used
in the pedigree file.
RESULTS
                       Ho                  Ha
Haplotype    pijk     s.e.     pijk     s.e.     ωijk     s.e.
ATA          .4052    .0257    .4019    .0256    .0000    fixed
ATG          .0079    .0045    .0078    .0024    .2440    .9893
ACA          .0676    .0136    .0674    .0136    .2137    .4076
ACG          .1839    .0199    .1831    .0198    1.169    .2352
TTA+TCA      .0133    .0059    .0134    .0059    .2765    .5848
TTG+TCG      .3321    .0240    .3264    .0242    1.528    .2189

Log-likelihood under Ho = -704.34
Log-likelihood under Ha = -663.73
LRT = 81.22
df = 5
p-value = 4.67 x 10^-16
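The likelihood ratio statistic and its asymptotic p-value can be reproduced from the two log-likelihoods. The sketch below uses a closed-form chi-square tail probability for integer degrees of freedom so that it needs only the standard library; the function names are illustrative, and small differences from the reported p-value reflect rounding of the log-likelihoods.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return (LRT, asymptotic p-value) for nested models."""
    lrt = 2.0 * (loglik_alt - loglik_null)
    return lrt, chi2_sf(lrt, df)

if __name__ == "__main__":
    # Six haplotype classes with one parameter fixed at zero gives df = 5.
    lrt, p_value = likelihood_ratio_test(-704.34, -663.73, df=5)
    print(f"LRT = {lrt:.2f}, df = 5, p = {p_value:.2e}")
```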
Many other extensions / alternatives
to the TDT have been developed.
These include:
TDT using sibling controls
Sib-TDT (Spielman and Ewens, 1998)
DAT (Boehnke and Langefeld, 1998)
SDT (Horvath and Laird)
TDT for quantitative traits
Allison (1997), Rabinowitz (1997),
Abecasis (2000)
Joint modeling of linkage and association
that allow estimation of recombination
Hästbacka (1992)
Kaplan, Hill and Weir (1995)
Terwilliger (1995)