Fundamental Concepts in Gene Mapping (BIO227)
Linkage analysis: The basic concepts
Review of Mapping Strategies
What is the biological basis of Linkage
Analysis?
What parameter do we test in linkage
analysis? What is the null hypothesis?
If we reject the null, what do we conclude?
If the recombination parameter θ between 2
loci is 0.10, are these 2 loci linked?
What is the distance between them in
centimorgans? What is it in base pairs?
Would you expect these two loci to be in LD?
Learning Objectives for Today
The basic concepts and principles of linkage analysis:
Making Genetic Maps
Disease Mapping
Direct Counting Method (for Mendelian Disorders)
Affected Sib Pairs for NonParametric Linkage
Testing for the presence of linkage and LOD scores
Very Big Overview
Get phenotype data from families
Cover genome with markers, say 440 or
4,000,000
Test every marker AND every location in
between markers
Convert test statistic to a LOD score
If LOD exceeds 3 (more or less), declare
linkage in the region
Role of Linkage Analysis in Gene
Mapping
General Usage: determine ‘genetic distance’ between 2 or
more loci; genetic distance determined by θ
MAKE GENETIC MAPS
--First Human Genetic Map was completed in 1987 with
440 markers organized into 23 linkage groups. Locations
between markers determined via linkage
LOCATE DISEASE GENES
--Locate distance between hypothetical DSL and a
known marker(s) on the map.
--Genes for around 2,000-3,000 Mendelian disorders were
found using linkage; linkage was not very successful for complex
disorders.
--Finding new Mendelian Disorders using sequence
analysis of families
Constructing Genetic Maps
θ = P(recombination between 2 loci)
The recombination fraction increases with the physical
distance between loci.
=>
Recombination fraction can be used to measure relative
distances on the chromosome.
Morgan is the unit of measure
For small θ: θ ≈ distance in Morgans
θ × 100 ≈ distance in cM
How do we estimate P(recombination) from data?
Look at transmissions from parents to offspring, count
recombinations.
Probability of haplotype transmission for two biallelic loci
(aA locus and bB locus)
Assume phase known
Possible parental      Possible haplotypes transmitted to offspring
diplotypes             ab      aB      Ab      AB
aa bb
aa bB
aA bb
aA bB
aA Bb
P(transmission) depends on θ ONLY with double het parents
Probability of haplotype transmission for two biallelic
loci (aA locus and bB locus)
All possible parental   Possible haplotypes transmitted to offspring
diplotypes              ab          aB          Ab          AB
ab|ab                   1           0           0           0
ab|aB                   ½           ½           0           0
aB|aB
ab|Ab
ab|AB                   (1-θ)/2     θ/2         θ/2         (1-θ)/2
aB|Ab
P(offspring is recombinant | double het parent) = θ.
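The rows of the transmission table can be generated mechanically. The following is a minimal sketch, assuming a parent's two phased haplotypes are written as two-character strings (first character = allele at the aA locus, second = allele at the bB locus); the function name `transmission_probs` is made up for illustration.

```python
from itertools import product

def transmission_probs(hap1, hap2, theta):
    """P(transmitted haplotype) for a parent with phased haplotypes hap1|hap2."""
    haps = (hap1, hap2)
    probs = {}
    # The gamete starts on either parental haplotype with probability 1/2,
    # then switches to the other haplotype between the loci with probability theta.
    for start, switch in product((0, 1), (False, True)):
        second = 1 - start if switch else start
        gamete = haps[start][0] + haps[second][1]
        p = 0.5 * (theta if switch else 1 - theta)
        probs[gamete] = probs.get(gamete, 0.0) + p
    return probs

if __name__ == "__main__":
    theta = 0.1
    for diplotype in ("ab|ab", "ab|aB", "aB|aB", "ab|Ab", "ab|AB", "aB|Ab"):
        h1, h2 = diplotype.split("|")
        print(diplotype, transmission_probs(h1, h2, theta))
```

Only the two double-heterozygote rows (ab|AB and aB|Ab) change when `theta` changes, which is the point the slide makes.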
A Simple example, Linkage between ABO locus and
AK1 Locus
How do we estimate θ
(and the genetic distance)?
METHOD 1: Direct Counting Method:
Assume data at two markers on parents
and offspring.
Identify haplotype transmissions from each
double heterozygote parent to each of their
offspring for the two loci.
Count recombinant haplotypes in the
offspring for the two loci
Use resulting data for estimation, testing.
[Figure: CEPH families. Pedigree with each member's genotypes at two marker loci (alleles A, B, C at the first locus; alleles 1, 2, 3 at the second)]
Direct counting method using a sample
of families
Unit of analysis is pairs (meioses) of a double
heterozygous parent and offspring; gives sample size
Random variable is the transmission from each parent to
offspring: Z_i=1 if recombinant, 0 otherwise (i indexes
double het parent-child pair)
Let r denote the sum of Z_i; it counts the number of
offspring-parent pairs where the transmitted haplotype
is a recombinant
s denotes the number of offspring-parent pairs where
the transmitted haplotype is a non-recombinant.
Total number of informative transmissions is n = r+s;
equals the number of double het parent-child pairs
Direct counting method
Principle:
Transmissions from different parents are independent
and transmissions to different offspring are
independent. Pr(recombinant) = θ is same for every
pair
The distribution of r is what?
How do we specify the null hypothesis of no linkage?
How do we estimate θ given r and n?
How do we test for linkage? (any number of ways)
Direct counting method
Inference about θ: r is Binomial(n, θ)
• p(r) = C(n, r) θ^r (1-θ)^(n-r), r = 0, …, n
• θ^hat = min(r/n, 1/2)
• To test H0: θ = ½, use the Likelihood Ratio Test (LRT)
The likelihood ratio test compares p(r) as a function of
θ to p(r) when θ = 1/2
In general, inference is complicated by the fact that θ
is constrained to be ≤ ½
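The binomial model for direct counting can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the counts r = 1, n = 5 are illustrative values, not data from a study.

```python
import math

def binom_pmf(r, n, theta):
    """P(r recombinants among n informative meioses) under Binomial(n, theta)."""
    return math.comb(n, r) * theta**r * (1 - theta)**(n - r)

def theta_hat(r, n):
    """Maximum likelihood estimate of theta, respecting the constraint theta <= 1/2."""
    return min(r / n, 0.5)

def lrt_statistic(r, n):
    """2 ln of the likelihood at theta_hat over the likelihood at theta = 1/2."""
    th = theta_hat(r, n)
    return 2 * math.log(binom_pmf(r, n, th) / binom_pmf(r, n, 0.5))

if __name__ == "__main__":
    r, n = 1, 5
    print(theta_hat(r, n), lrt_statistic(r, n))
```

When r/n exceeds 1/2 the estimate is truncated to 1/2 and the statistic is exactly 0, which is why the usual chi-square reference distribution does not apply directly.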
Autosomal dominant inheritance: disease status is observed,
but DSL alleles are not. Marker locus with alleles M, m is observed
We assume
• Complete penetrance, no phenocopies
• Dd or DD = affected and dd = unaffected
Step 1: Infer disease genotype and missing markers
Step 2: Infer phase and the informative meioses
Step 3: Count the number of recombinants and nonrecombinants
[Figure: three-generation pedigree]
Grandparents:  D? ??  and  dd mm
Parents:       Dd Mm  and  dd mm
Offspring:     dD mm, dD mM, dD mM, dd mm, dd mm
r =
s =
θ^hat = 0.2
What happens if grandmother’s marker data are missing, or
if she is mM?
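The counting in Step 3 can be written out explicitly. The sketch below assumes the phase is inferred as on the slide: the dd mm grandparent must have transmitted d-m, so the Dd Mm parent carries haplotypes D-M and d-m. The variable names are illustrative.

```python
# Haplotypes carried by the affected (Dd, Mm) parent once phase is inferred.
parental_haplotypes = {("D", "M"), ("d", "m")}

# Haplotype each child received from the affected parent. The other parent is
# dd mm and always transmits d-m, so the rest of each child's genotype comes
# from the affected parent: dD mm -> (D, m), dD mM -> (D, M), dd mm -> (d, m).
received = [("D", "m"), ("D", "M"), ("D", "M"), ("d", "m"), ("d", "m")]

# A transmitted haplotype is recombinant if it is not one of the parental ones.
r = sum(h not in parental_haplotypes for h in received)
s = len(received) - r
theta_hat = min(r / (r + s), 0.5)

print(r, s, theta_hat)
```

Under these assumptions exactly one of the five meioses is recombinant, which reproduces the estimate of 0.2 shown on the slide.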
Problems with Parametric Analysis
We assume
• Complete penetrance
• DD, Dd = affected and dd = unaffected
Both grandparents' genotypes missing? Cannot determine phase
Step 1: Consider both phases
Step 2: Identify informative meioses under each phase
Step 3: Count the number of recombinants and nonrecombinants under each phase
Step 4: Combine over phases
[Figure: pedigree with grandparental genotypes missing]
Parents:       D,d M,m  and  dd mm
Offspring:     dD mm, dD Mm, dD mM, dd mm, dd mm
Recombinant under phase 1:  1  0  0  0  0
Phase 1: r = 1, s = 4, θ = 0.2
Phase 2: r = , s = , θ = 0.8?
What shall we do? What is P(phase 1) and P(phase 2)?
Handling missing phase in parent
• r is B(n, θ) if the phase is known; under the other
phase, s is B(n, θ)
• If we know P(phase), we can compute P(r) as
P(r) = P(r | phase 1) P(phase 1) + P(r | phase 2) P(phase 2)
• P(phase) = ½. Why?
P(r) = ½ C(n, r) θ^r (1-θ)^s + ½ C(n, s) θ^s (1-θ)^r
     = ½ C(n, r) {θ^r (1-θ)^s + θ^s (1-θ)^r}
Can be used to estimate θ, or for an LR test or LOD
score, but simple chi-square tests no longer
apply.
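The phase-averaged likelihood can be evaluated numerically. A minimal sketch, assuming one phase-unknown parent with r = 1 and s = 4 counted under phase 1 (the counts from the pedigree example); the grid search and the function name are illustrative choices, not part of the lecture.

```python
import math

def lod_phase_unknown(r, s, theta):
    """log10 of the phase-averaged likelihood at theta over its value at theta = 1/2."""
    n = r + s
    # The binomial coefficient C(n, r) = C(n, s) cancels in the ratio.
    lik = 0.5 * (theta**r * (1 - theta)**s + theta**s * (1 - theta)**r)
    if lik == 0:
        return float("-inf")
    return math.log10(lik / 0.5**n)

if __name__ == "__main__":
    grid = [i / 100 for i in range(0, 51)]  # theta from 0.00 to 0.50
    best = max(grid, key=lambda t: lod_phase_unknown(1, 4, t))
    print(best, lod_phase_unknown(1, 4, best))
```

Averaging over the two phases lowers the LOD relative to the phase-known case at the same θ, which is one way to see how much information phase carries.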
Complications with parametric analysis
Recessive model calculations are
difficult: genotype often not possible to
infer
Suppose incomplete penetrance?
Unaffecteds could be dd, Dd or DD
Suppose phenocopies?
Affecteds could be dd
Penetrance functions often depend upon
age for complex disorder
Results can be very misleading if choose
wrong penetrance function (rely on
segregation analysis)
Likelihood gets very difficult to
enumerate, especially with complex
pedigrees; have to consider all possible
genotypes and all possible phases
Led to increased emphasis on Nonparametric methods
Likelihood inference for θ: LOD
Definition: Likelihood of the data is proportional to the
probability (or density) function.
A likelihood ratio test to test H0 vs HA uses
LR = L(under alternative)/L(under null)
L(under alternative) depends on unknown θ. So,
choose a value of θ which maximizes the likelihood
under the alternative; maximizes LR
LRT = 2 ln {max LR}
= 2 ln {max L(under alternative)/L(under
null)}
When H0 is true, in general likelihood ratio test is
approximately chi-square on 1 df. Because of
constraint, LRT is not chi-square in this case. LOD
score used for testing
Inference about Linkage: LOD Score
Definition: Log (base 10) of LR(θ)
LOD(θ) = log10 LR(θ)
LR is a measure of support for a value θ relative to the null
value(1/2); note that LOD is a function of the unknown θ
LOD of 1 says P(data for θ) is 10 times what it is for θ = ½.
LOD of 2 says P(data for θ) is 100 times what it is for θ = ½.
Use maximized LOD score >3 to reject H0. LOD score can
be negative
Several advantages over LRT: easier to combine over
families, easy to compare different markers
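For phase-known meioses the LOD function has a closed form that is easy to evaluate. The sketch below assumes the binomial direct-counting model with r recombinants in n informative meioses; the function name `lod` is illustrative.

```python
import math

def lod(theta, r, n):
    """LOD(theta) = log10[ theta^r (1-theta)^(n-r) / (1/2)^n ]."""
    lik = theta**r * (1 - theta)**(n - r)
    if lik == 0:
        return float("-inf")
    return math.log10(lik / 0.5**n)

if __name__ == "__main__":
    print(lod(0.2, 1, 5))    # LOD at theta = 0.2 for 1 recombinant in 5 meioses
    print(lod(0.5, 1, 5))    # 0 by construction at the null value
    print(lod(0.0, 0, 10))   # no recombinants in 10 meioses
```

With no recombinants the LOD at θ = 0 is n × log10(2), so about ten phase-known informative meioses are needed before a single family can reach a LOD of 3.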
Combining LODs from multiple families
Have K independent families
• LR is product over families:
  – LR(θ) = LR_fam1(θ) × LR_fam2(θ) × LR_fam3(θ) × …
• …so lods is sum over families
  – lods(θ) = lods_fam1(θ) + lods_fam2(θ) + lods_fam3(θ) + …
Can calculate lods for each family separately at each value of θ, then add
• NOT true for LRT
Example:
  – Family 1 has r=2 and n=5
  – Family 2 has r=1 and n=6
  – Family 3 has r=0 and n=3
  – Family 4 has r=2 and n=8
  – r_tot = 5, n_tot = 22
LRT_tot(5/22) ≠ LRT_fam1(2/5) + LRT_fam2(1/6) + LRT_fam3(0/3) + LRT_fam4(2/8)
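The four-family example can be checked numerically. This sketch assumes phase-known binomial counts for every family and uses a simple grid over θ; the helper names are illustrative.

```python
import math

# (r, n) for Families 1-4 from the example above.
families = [(2, 5), (1, 6), (0, 3), (2, 8)]

def lod(theta, r, n):
    """LOD(theta) for r recombinants in n phase-known meioses."""
    lik = theta**r * (1 - theta)**(n - r)
    if lik == 0:
        return float("-inf")
    return math.log10(lik / 0.5**n)

def total_lod(theta):
    """LODs add across independent families at the SAME value of theta."""
    return sum(lod(theta, r, n) for r, n in families)

grid = [i / 1000 for i in range(1, 501)]
theta_max = max(grid, key=total_lod)

# Maximizing each family separately evaluates each at a DIFFERENT theta,
# so the sum of these maxima is not the maximized total.
sum_of_maxima = sum(lod(min(r / n, 0.5), r, n) for r, n in families)

print(theta_max, total_lod(theta_max), sum_of_maxima)
```

The total LOD curve peaks near 5/22, the pooled estimate, while the sum of the separately maximized per-family values is larger; that is the sense in which maximized statistics such as the LRT do not add across families.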
Finding genes for Mendelian Disorders was a sequential process; LOD
scores convenient way to report results
[Figure: lods (y-axis, about -2 to 1) plotted against theta (x-axis, 0.0 to 0.5), with one curve each for Fam 1, Fam 2, Fam 3, Fam 4 and All fams]
Relation between max LOD and LRT:
How big is max LOD of 3?
Max LOD = max (over θ) log_10(LR)
= log_10(max LR)
LRT = 2 ln(max LR)
LRT = 2 ln(10) × max LOD ≈ 4.6 × max LOD
So at the ML estimate of θ,
Max LOD > 3 => LRT > 13.8,
which gives a very small p-value for the LRT
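The conversion can be made concrete. The sketch below uses the identity P(χ²_1 > x) = erfc(√(x/2)) to get a reference tail probability from the standard library; because θ is constrained to be at most ½, the unconstrained chi-square tail is only a rough guide (a one-sided version, half this value, is a common convention and is treated here as an assumption).

```python
import math

def lod_to_lrt(max_lod):
    """LRT = 2 ln(max LR) = 2 ln(10) * max LOD."""
    return 2 * math.log(10) * max_lod

def chi2_1df_upper_tail(x):
    """P(chi-square with 1 df > x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2))

if __name__ == "__main__":
    lrt = lod_to_lrt(3.0)
    p_two_sided = chi2_1df_upper_tail(lrt)
    print(lrt, p_two_sided, p_two_sided / 2)
```

A max LOD of 3 therefore corresponds to a pointwise p-value on the order of 1 in 10,000, far stricter than the usual 0.05, which fits the multiple-testing justification for the threshold.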
Where does max LOD >3 originate?
Many justifications:
Sequential Analysis
Multiple testing argument—Take a grid of
linked markers over entire genome; test
everywhere
Can use properties of recombination to
derive P(max LOD exceeds threshold | no
linkage anywhere). Depends on threshold
and length of chromosome tested.
Cannot do this with association testing!
Summary: Direct Counting
• General features of a parametric linkage
method:
– Mode of inheritance has to be specified
(segregation analysis); was not so successful for
complex disease
– Could be seriously wrong if disease model is
wrong. Really only successful for Mendelian
diseases
– Estimation of the recombination fraction, max LOD
used for inference
Linkage: Method 2 Nonparametric Analysis
Nonparametric => Do not need to make assumption
about disease model. Linkage analysis based on
counting recombinations can be very inaccurate if
genetic model is incorrect. Nonparametric is valid
under H0, but power depends on model
• Most approaches rely on using pairs of affected
relatives and concept of sharing of markers
between relatives: IBD or IBS
• Intuition: If have a pair of affected relatives, then
likely share a disease allele at the DSL, so at a
linked marker, sharing the marker is also likely
Alleles shared identical by descent and
identical by state
Allele sharing is defined between 2 individuals
Each individual has two alleles, one from Mom and one
from Dad. Thus the pair can share 0, 1 or 2 alleles
Alleles identical by state (IBS) are those that are physically
identical, e.g., both people have a T at an A/T SNP.
Alleles identical by descent (IBD) must be IBS and also
inherited from a common ancestor. Alleles that are IBD
are also IBS, but not vice versa. With IBD, shared alleles
are exact copies.
Examples of identity by state and
identity by descent among 2 sibs
Parents ab × cd, sibs ac and bd: IBS = , IBD =
Parents ab × cd, sibs ac and ad: IBS = , IBD =
Parents ab × cd, sibs ac and ac: IBS = , IBD =
Why we love polymorphic markers
Parents ab × cb, sibs bc and ab: IBS = ?, IBD = ?
Parents ab × cc, sibs ac and ac: IBS = ?, IBD = ?
Parents ab × ab, sibs ab and ab: IBS = ?, IBD = ?
Can always tell IBS, but not always IBD; IBD ≤ IBS
For now, we assume that IBD status is known
(= perfect marker information); will return to this
problem later
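The difference between the two kinds of sharing can be sketched in code. IBS needs only the observed genotypes, while IBD needs to know which parental chromosome copy each sib inherited; the copy labels F1, F2, M1, M2 below are hypothetical bookkeeping, not something a marker reveals directly.

```python
from collections import Counter

def ibs(genotype1, genotype2):
    """Number of alleles (0, 1 or 2) that two genotypes share by state."""
    shared = Counter(genotype1) & Counter(genotype2)  # multiset intersection
    return sum(shared.values())

def ibd(sib1, sib2):
    """Number of alleles two sibs share by descent.

    Each sib is (paternal_copy, maternal_copy), where a copy label such as
    "F1" says WHICH of the parent's two chromosomes was inherited.
    """
    return int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])

if __name__ == "__main__":
    # Parents ab x ab: father carries F1=a, F2=b; mother carries M1=a, M2=b.
    allele = {"F1": "a", "F2": "b", "M1": "a", "M2": "b"}
    sib1 = ("F1", "M2")  # genotype ab
    sib2 = ("F2", "M1")  # genotype ba, i.e. also ab
    g1 = "".join(allele[c] for c in sib1)
    g2 = "".join(allele[c] for c in sib2)
    print(ibs(g1, g2), ibd(sib1, sib2))
```

In this configuration the sibs look identical at the marker yet share nothing by descent, which illustrates why IBD ≤ IBS and why highly polymorphic markers are preferred.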
Nonparametric Analysis: Pairs of Affected
Relatives (Use siblings)
Basic Idea: Two affected siblings should share the
same genetic material IBD at a DSL.
Then, if the marker is close (linked) to the DSL,
affected siblings will be sharing an ‘excess’ of
alleles at the marker.
Relatives who do NOT share affection status should
share less
Need to consider what we expect about sharing in the
absence of disease but also what do we expect at
the DSL.
Distribution of IBD relationships under H0
Under the null hypothesis: No linkage between the marker
locus and the disease gene (θ=1/2):
p_k = Probability that two affected relatives share k alleles IBD at marker
Type of relative pair        p0      p1      p2
First cousin                 ¾       ¼       0
Double first cousins         9/16    3/8     1/16
Grandparent–grandchild       ½       ½       0
Monozygotic twins
Full sibs
Parent-offspring
Sharing IBD at the DSL:
Recessive model (2 copies of D-allele => affected)
[Figure: three pedigree panels, each showing Parent 1, Parent 2, sib 1 and sib 2 at the disease locus, labeled IBD = 2 at DSL, IBD = 1 at DSL and IBD = 0 at DSL; panel captions: 2 affected sibs, 1 unaffected, 2 unaffected sibs]
Affected relative pair analysis
Collect affected relative pairs (and other members of
pedigree)
Genotype all relative pairs of each pedigree and determine IBD
for each pair
Compute the IBD probabilities under the null:
p0 (= sharing 0 alleles)
p1 (= sharing 1 allele)
p2 (= sharing 2 alleles)
Estimate the IBD probabilities at the marker from the sample
Construct test statistic that compares the IBD
probabilities under the null hypothesis with observed
IBD probabilities
Affected sib pair analysis
Have data on n affected sib pairs (n0, n1, n2)
Compare the observed proportions with the IBD
probabilities under the null hypothesis:
p0=1/4 p1 =1/2 p2=1/4
• Many test statistics (simple ones are not easy to
generalize when IBD cannot be determined):
MLS-methods (maximum likelihood)
NPL-methods (score tests)
LOD scores
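The null sharing probabilities for full sibs can be derived by brute-force enumeration. A minimal sketch, assuming each sib independently receives either of the father's two chromosomes and either of the mother's two, all 16 combinations being equally likely:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sib_pair_null_ibd():
    """Enumerate the 16 equally likely transmissions to two sibs."""
    counts = Counter()
    # p1, m1: paternal and maternal copy received by sib 1; p2, m2: by sib 2.
    for p1, m1, p2, m2 in product((0, 1), repeat=4):
        shared = int(p1 == p2) + int(m1 == m2)
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(counts[k], total) for k in (0, 1, 2)}

if __name__ == "__main__":
    print(sib_pair_null_ibd())
```

The enumeration returns 1/4, 1/2, 1/4 for sharing 0, 1 and 2 alleles, matching the null probabilities quoted above.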
MLS methods
Assumptions:
• n affected sib pairs
• Perfect marker information
• p_k = probability of sharing k alleles IBD
Likelihood function:
Pro: handle missing IBD
Con: need to test pattern of sharing
Number of alleles shared IBD    0      1      2      Total
Observed                        n0     n1     n2     n
Expected                        n/4    n/2    n/4    n
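One way to turn the observed and expected counts into a LOD-type statistic is sketched below. It assumes the likelihood takes the multinomial form p0^n0 × p1^n1 × p2^n2 and maximizes it without constraints (p_k = n_k/n); in practice the maximization is usually restricted to genetically possible sharing probabilities, which this sketch ignores. The counts in the usage lines are hypothetical.

```python
import math

NULL_IBD = (0.25, 0.5, 0.25)  # p0, p1, p2 for full sibs under no linkage

def mls_unconstrained(n0, n1, n2):
    """log10 of the multinomial likelihood at p_k = n_k/n over its null value."""
    counts = (n0, n1, n2)
    n = sum(counts)
    score = 0.0
    for nk, pk_null in zip(counts, NULL_IBD):
        if nk > 0:  # a zero count contributes a factor of 1
            score += nk * math.log10((nk / n) / pk_null)
    return score

if __name__ == "__main__":
    print(mls_unconstrained(25, 50, 25))  # sharing exactly as expected under H0
    print(mls_unconstrained(10, 50, 40))  # hypothetical excess sharing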
Alternative Test for IBD Sharing:
Nonparametric Score Test (NPL)
W_i: number of alleles shared IBD in the ith pair
μ = E(W_i | H0) = ?
σ^2 = Var(W_i | H0) = ?
Z = (W_bar − μ) / (σ/√n)
is N(0,1) for large n when H0 is true.
Reject if Z is too big, because under the alternative there is
excess marker sharing, so that
E(W_i) > E(W_i | H0)
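The score test is simple to compute once the null moments are filled in. The sketch below assumes full sibs with null sharing probabilities (1/4, 1/2, 1/4), which gives μ = 1 and σ^2 = 1/2; the counts in the usage line are hypothetical.

```python
import math

def npl_z(n0, n1, n2):
    """Z statistic for mean IBD sharing in n = n0 + n1 + n2 affected sib pairs."""
    n = n0 + n1 + n2
    w_bar = (n1 + 2 * n2) / n  # observed mean number of alleles shared IBD
    mu = 1.0                   # 0*(1/4) + 1*(1/2) + 2*(1/4)
    sigma2 = 0.5               # E(W^2) - mu^2 = (1/2 + 1) - 1
    return (w_bar - mu) / math.sqrt(sigma2 / n)

if __name__ == "__main__":
    print(npl_z(10, 50, 40))  # hypothetical sample with excess sharing
```

The test is one-sided: only large positive Z counts as evidence for linkage, since linkage predicts excess rather than deficient sharing among affected pairs.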
Last Topic:
IBD Transition Probabilities
Assume θ between 2 loci known
•P(share j alleles at locus 2 | share k at locus 1)
•Can also get the joint distribution of IBD1 and IBD2, and
can get P(IBD2) if I know P(IBD1) and θ
•These transition probabilities hold, no matter what
the allele sharing probabilities are at locus 1. They
could be the null (1/4,1/2,1/4), or locus 1 could
be a DSL, with probabilities computed according to
the mode of inheritance.
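For full sib pairs the transition probabilities have a compact form. The sketch below assumes the standard sib-pair result in which ψ = θ^2 + (1-θ)^2 is the probability that sharing of one parent's allele is the same at both loci; the function names are illustrative, and other relative types would need a different matrix.

```python
def ibd_transition_matrix(theta):
    """T[k][j] = P(share j alleles IBD at locus 2 | share k at locus 1), sib pairs."""
    psi = theta**2 + (1 - theta)**2
    return [
        [psi**2,          2 * psi * (1 - psi),      (1 - psi)**2],
        [psi * (1 - psi), psi**2 + (1 - psi)**2,    psi * (1 - psi)],
        [(1 - psi)**2,    2 * psi * (1 - psi),      psi**2],
    ]

def ibd_at_locus2(p_locus1, theta):
    """P(IBD2 = j) = sum over k of P(IBD1 = k) * T[k][j]."""
    T = ibd_transition_matrix(theta)
    return [sum(p_locus1[k] * T[k][j] for k in range(3)) for j in range(3)]

if __name__ == "__main__":
    # Hypothetical sharing at a DSL, pushed to a marker at theta = 0.1.
    print(ibd_at_locus2([0.0, 0.0, 1.0], 0.1))
    # The null distribution is unchanged by the transition.
    print(ibd_at_locus2([0.25, 0.5, 0.25], 0.1))
```

At θ = 0 the matrix is the identity, and at θ = 1/2 every row equals the null (1/4, 1/2, 1/4), so excess sharing at a DSL decays toward the null as the marker moves away.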
Applications of Basic Principle
Principle: Know IBD sharing at a locus, you can predict IBD
sharing at some distance θ from the locus;
1) Power Analysis: Assume some disease model, calculate
p(sharing at DSL), compute P(sharing at marker|DSL
sharing) for different values of θ. Enables one to compute
power for a given disease model, θ and n.
2) Incorporating pairs with incomplete information about IBD at
a marker: Use data at adjoining markers. Improves power
with missing parents or with markers that are not
polymorphic
3) Whole Genome Linkage Scans (Multi-Marker):
H0: DSL not linked to any marker on genome
HA: DSL linked to at least one locus on genome
Summary
• What are the main weaknesses of parametric
linkage analysis?
• Is missing phase a weakness for
nonparametric analysis?
• What is a major limitation of nonparametric
analysis?
• What about non-Mendelian disorders?
• With pedigrees, families can be analyzed
separately
• Concepts of IBD can be extended to handle
rare variants in families