Genetic Linkage Analysis
using HMMs
Lecture 7
Prepared by Dan Geiger
Outline
Part I: Quick look on relevant genetics
Part II: The use of HMMs
Part III: Case study: Werner’s syndrome
Gene Hunting: find genes responsible for a
given disease
Main idea: If a disease is statistically linked
with a marker on a chromosome, then
tentatively infer that a gene causing the
disease is located near that marker.
Chromosome Logical Structure
Locus – the location of markers on the
chromosome.
Allele – one variant form (or state) of a
gene/marker at a particular locus.
By markers we mean genes, Single
Nucleotide Polymorphisms, Tandem
repeats, etc.
Locus1
Possible Alleles: A1,A2
Locus2
Possible Alleles: B1,B2,B3
Phenotype versus Genotype
The ABO locus determines detectable antigens on the surface of red blood cells.
• The 3 major alleles (A, B, O) determine the various ABO blood types.
• O is recessive to A and B; equivalently, A and B are dominant over O. Alleles A and B are codominant.

Phenotype   Genotype
A           A/A, A/O
B           B/B, B/O
AB          A/B
O           O/O

Note: genotypes are unordered.
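The phenotype/genotype table for the ABO locus can be written as a small function. This is only an illustrative sketch; the function name abo_phenotype and the tuple encoding of genotypes are assumptions, not part of the lecture.

```python
def abo_phenotype(genotype):
    """Map an unordered ABO genotype, e.g. ("A", "O"), to its blood type."""
    alleles = set(genotype)
    if not alleles <= {"A", "B", "O"} or len(genotype) != 2:
        raise ValueError("genotype must be a pair of alleles from A, B, O")
    if alleles == {"A", "B"}:
        return "AB"   # A and B are codominant
    if "A" in alleles:
        return "A"    # A/A or A/O: A is dominant over O
    if "B" in alleles:
        return "B"    # B/B or B/O: B is dominant over O
    return "O"        # only O/O remains


for g in [("A", "A"), ("A", "O"), ("O", "B"), ("B", "A"), ("O", "O")]:
    print(g, "->", abo_phenotype(g))
```

Because the genotype is treated as a set, ("A", "O") and ("O", "A") give the same answer, which matches the note that genotypes are unordered.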
Recombination Phenomenon
Male or female
A recombination between
2 genes occurred if the
haplotype of the individual
contains 2 alleles that
resided in different
haplotypes in the
individual's parent.
(Haplotype – the alleles at
different loci that are
received by an individual
from one parent).
Gametes: egg or sperm.
Homologous chromosomes showing chiasmata
Sister chromatids
A chiasma (plural: chiasmata) is the cytological expression of recombination.
Example: ABO and the AK1 marker
on Chromosome 9
[Figure: pedigree of five individuals (numbered 1 to 5), each annotated with an ABO blood type (A or O) and an AK1 marker genotype (A1/A1, A1/A2 or A2/A2). Callouts: "Phase inferred" and "Recombinant".]
Recombination fraction θ = 16/100.
One centimorgan (cM) means one recombination per 100 meioses. In our case the distance is 16 cM.
One centimorgan corresponds to approximately 1M nucleotides (with large variance), depending on location and sex.
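The arithmetic on this slide can be spelled out as follows. The function name recombination_fraction is hypothetical, and the conversion to centimorgans simply follows the slide's convention (1 cM = one recombination per 100 meioses), which is usually treated as an approximation for small distances.

```python
def recombination_fraction(recombinants, meioses):
    """Fraction of scored meioses in which a recombination was observed."""
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need meioses > 0 and 0 <= recombinants <= meioses")
    return recombinants / meioses


theta = recombination_fraction(16, 100)   # the slide's example: 16 out of 100
distance_cM = 100 * theta                 # slide's convention: 1 cM per 1% recombination
print(theta, distance_cM)
```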
Example for Finding Disease
Genes
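[Figure: the same five-person pedigree, now annotated with a speculated disease locus with alleles H and D instead of the ABO blood type; each individual also carries an AK1 marker genotype (A1/A1, A1/A2 or A2/A2). Callouts: "Phase inferred" and "Recombinant".]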
We use a marker with codominant alleles A1/A2.
We speculate a locus with alleles H (healthy) / D (affected).
If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively inferred to be physically close.
Recombination cannot be simply
counted
[Figure: the same pedigree, but individuals 1 and 2 are both labeled H, so the phase of their child can no longer be inferred. Callouts: "Phase ???" and "Possible Recombinant".]
One can compute the probability that a recombination occurred and use this number as if it were the actual count.
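One hedged way to read "use the probability as the count" is the following toy calculation. It assumes a single doubly heterozygous parent whose two possible phases are equally likely a priori, and n scored children of whom k would be recombinant under phase 1 (so n - k would be recombinant under phase 2). The setup and the name expected_recombinants are illustrative assumptions, not the lecture's notation.

```python
def expected_recombinants(k, n, theta):
    """Expected number of recombinants among n meioses when the phase is unknown.

    Under phase 1, k children are recombinant; under phase 2, the other n - k are.
    Both phases have prior probability 1/2; theta must lie strictly between 0 and 1.
    """
    like1 = theta**k * (1 - theta)**(n - k)      # P(children | phase 1)
    like2 = theta**(n - k) * (1 - theta)**k      # P(children | phase 2)
    post1 = like1 / (like1 + like2)              # posterior probability of phase 1
    return post1 * k + (1 - post1) * (n - k)


print(expected_recombinants(1, 1, 0.1))   # a single child: expected count equals theta
```

With one child the expected count reduces to θ itself, and with θ = 0.5 (no linkage) it is n/2 whatever the data are.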
Comments about the example
Often:
• Pedigrees are larger and more complex.
• Not every individual is typed.
• Recombinants cannot always be determined.
• There are more markers and they are polymorphic (have more than two alleles).
Genetic Linkage Analysis


The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.
Genetic distance, derived from the recombination fraction, is measured in centimorgans and can differ between males and females.
(Linkage) 0 ≤ P(Recombination) ≤ 0.5 (No linkage)

Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.
Part II: Mathematics and
Algorithms
Using the Maximum Likelihood
Approach
The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions (the unknown one is denoted by θ).
How can we construct this likelihood function?
The maximum likelihood approach is to seek the value of θ that maximizes the likelihood function Pr(data | θ).
This is the ML estimate.
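As a minimal illustration of the ML idea, assume the simplest possible data: n phase-known meioses of which k are recombinant, so that Pr(data | θ) is proportional to θ^k (1-θ)^(n-k). This binomial likelihood is an assumption made for the sketch; the pedigree likelihood of the lecture is more involved. The function names and the grid search are likewise illustrative.

```python
import math


def log_likelihood(theta, recombinants, meioses):
    """Binomial log-likelihood of theta (constant term dropped)."""
    k, n = recombinants, meioses
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def ml_estimate(recombinants, meioses, steps=50):
    """Grid search over theta in (0, 0.5] for the maximum of the log-likelihood."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, recombinants, meioses))


theta_hat = ml_estimate(16, 100)
print(theta_hat)
```

For 16 recombinants in 100 meioses the grid maximum coincides with the simple count 16/100, as expected for this likelihood.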
Constructing the Likelihood function
First, we determine the variables describing the problem.
L_ijm = maternal allele at locus i of person j. The values of this variable are the possible alleles l_i at locus i.
L_ijf = paternal allele at locus i of person j. The values of this variable are the possible alleles l_i at locus i (same as for L_ijm).
X_ij = unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (l_i, l'_i). "The genotype".
Y_j = person j is affected/not affected. "The phenotype".
S_ijm = a binary variable {0,1} that determines which of the mother's two alleles is received from the mother. Similarly,
S_ijf = a binary variable {0,1} that determines which of the father's two alleles is received from the father.
It remains to specify the joint distribution that governs these variables. HMMs turn out to be a reasonable choice.
The model
[Figure: graphical model laid out as one block of variables per locus: Locus 1, Locus 2 (Disease), Locus 3, Locus 4.]
This model depicts the qualitative relations between the variables.
We will now specify the joint distribution over these variables.
Probabilistic Model for
Recombination
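[Figure: network for two loci and three persons (person 3 is the child of persons 1 and 2). Locus 1 nodes: L11m, L11f, X11, L12m, L12f, X12, L13m, L13f, X13, S13m, S13f. Locus 2 nodes: L21m, L21f, X21, L22m, L22f, X22, L23m, L23f, X23, S23m, S23f, plus the phenotype nodes Y1, Y2, Y3.]

\[
P(s_{23t} \mid s_{13t}, \theta) =
\begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix},
\qquad t \in \{m, f\}
\]

θ is the recombination fraction between loci 2 & 1.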
Details regarding the Loci
P(L11m = a) is the frequency of allele a.
X11 is an unordered allele pair at locus 1 of person 1 = "the data".
P(x11 | l11m, l11f) = 0 or 1, depending on consistency.
[Figure: fragment of the network showing the nodes Li1m, Li1f, Xi1, Si3m, Li3m and Y1.]
The phenotype variables Yj are 0 or 1 (e.g., affected or not affected) and are connected to the Xij variables (only at the disease locus). For example, a model of a perfectly recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
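The penetrance probabilities of the perfectly recessive model can be written as a one-line function. The name penetrance_recessive and the encoding of the disease allele as "a" are assumptions chosen to mirror the slide.

```python
def penetrance_recessive(genotype, disease_allele="a"):
    """P(sick | genotype) for a fully penetrant recessive disease."""
    # Sick only when both alleles of the unordered pair are the disease allele.
    return 1.0 if all(allele == disease_allele for allele in genotype) else 0.0


for g in [("a", "a"), ("A", "a"), ("A", "A")]:
    print(g, penetrance_recessive(g))
```

A less idealized model would replace the values 1 and 0 by probabilities strictly between them; the structure of the function stays the same.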
Hidden Markov Model In our case
[Figure: hidden Markov model. Hidden chain of inheritance vectors S1, S2, S3, ..., Si-1, Si, Si+1; each Si emits the observed data Xi for its locus (Yi at the disease locus).]
The compound variable Si = (Si,1,m, …, Si,n,f) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons that have parents in the pedigree (non-founders).
The compound variable Xi = (Xi,1, …, Xi,n) is the data regarding locus i. Similarly, for the disease locus we use Yi.
To specify the HMM we now spell out the transition matrices from Si-1 to Si and the matrices P(xi | Si).
The transition matrix
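Recall that we wrote:
\[
P(s_{i,j,t} \mid s_{i-1,j,t}, \theta_i) =
\begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix},
\qquad t \in \{m, f\}
\]
All θ_i are usually known except the one before the disease locus.
Extending this matrix to the smallest inheritance vector (n=1), we get (rows and columns ordered 00, 01, 10, 11):
\[
P(s_{23m}, s_{23f} \mid s_{13m}, s_{13f}, \theta_i) =
\begin{pmatrix}
(1-\theta_i)\begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix} &
\theta_i\begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix} \\
\theta_i\begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix} &
(1-\theta_i)\begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix}
\end{pmatrix}
=
\begin{pmatrix}
(1-\theta_i)^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & \theta_i^2 \\
\theta_i(1-\theta_i) & (1-\theta_i)^2 & \theta_i^2 & \theta_i(1-\theta_i) \\
\theta_i(1-\theta_i) & \theta_i^2 & (1-\theta_i)^2 & \theta_i(1-\theta_i) \\
\theta_i^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & (1-\theta_i)^2
\end{pmatrix}
\]
Let d = Hamming distance between state s_{i-1} and state s_i.
Then the transition probability is given by θ_i^d (1-θ_i)^(2n-d).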
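The Hamming-distance rule gives a direct way to build the full transition matrix over the 2^(2n) inheritance vectors. The sketch below is a plain-Python illustration; the function name transition_matrix and the ordering of states (lexicographic bit tuples) are assumptions.

```python
from itertools import product


def transition_matrix(theta, n):
    """Transition probabilities between inheritance vectors of 2n bits.

    Entry [s][t] equals theta**d * (1 - theta)**(2n - d),
    where d is the Hamming distance between states s and t.
    """
    states = list(product((0, 1), repeat=2 * n))
    matrix = []
    for s in states:
        row = []
        for t in states:
            d = sum(a != b for a, b in zip(s, t))
            row.append(theta**d * (1 - theta)**(2 * n - d))
        matrix.append(row)
    return states, matrix


states, T = transition_matrix(0.1, n=1)
for state, row in zip(states, T):
    print(state, [round(p, 2) for p in row])
```

For n = 1 the states come out in the order 00, 01, 10, 11, so the printed rows can be compared entry by entry with the 4 by 4 matrix of the slide. The matrix has 2^(2n) rows and columns, so this explicit construction is only practical for small n.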
Probability of data in one locus given the
inheritance vector (emission probabilities)
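[Figure: model for locus 2, with nodes L21m, L21f, X21, L22m, L22f, X22, L23m, L23f, X23, S23m, S23f.]

\[
P(x_{21}, x_{22}, x_{23} \mid s_{23m}, s_{23f}) =
\sum_{l_{21m},\, l_{21f},\, l_{22m},\, l_{22f},\, l_{23m},\, l_{23f}}
P(l_{21m})\, P(l_{21f})\, P(l_{22m})\, P(l_{22f})\,
P(x_{21} \mid l_{21m}, l_{21f})\,
P(x_{22} \mid l_{22m}, l_{22f})\,
P(x_{23} \mid l_{23m}, l_{23f})\,
P(l_{23m} \mid l_{21m}, l_{21f}, s_{23m})\,
P(l_{23f} \mid l_{22m}, l_{22f}, s_{23f})
\]

The five last terms are always zero-or-one, namely, indicator functions.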
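For a trio, the emission probability can be computed by brute-force enumeration of the founders' alleles. The sketch below assumes person 1 is the mother, person 2 the father and person 3 the child, and that a selector value 0 picks the parent's maternal allele and 1 the paternal one; this convention, the function name emission_probability and the use of None for an untyped person are all illustrative assumptions.

```python
from itertools import product


def emission_probability(x1, x2, x3, s3m, s3f, freq):
    """P(x1, x2, x3 | s3m, s3f) for a trio at a single locus.

    x1, x2, x3: unordered genotypes as 2-tuples, or None if the person is untyped.
    s3m, s3f:   selectors (0 or 1) for the allele passed on by mother and father.
    freq:       dict mapping each allele to its population frequency.
    """
    def consistent(x, a, b):
        # P(x | a, b) is an indicator: the unordered pair must match.
        return x is None or sorted(x) == sorted((a, b))

    total = 0.0
    for l1m, l1f, l2m, l2f in product(freq, repeat=4):
        # The child's alleles are determined by the parents' alleles and the
        # selectors, so the sum over the child's alleles collapses to one term.
        l3m = (l1m, l1f)[s3m]
        l3f = (l2m, l2f)[s3f]
        if consistent(x1, l1m, l1f) and consistent(x2, l2m, l2f) and consistent(x3, l3m, l3f):
            total += freq[l1m] * freq[l1f] * freq[l2m] * freq[l2f]
    return total


freq = {"A1": 0.5, "A2": 0.5}
print(emission_probability(("A1", "A2"), ("A2", "A2"), ("A1", "A2"), 0, 0, freq))
```

The enumeration is exponential in the number of founders, which is acceptable for a trio but not for large pedigrees.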
Probability of data in the disease locus
given the inheritance vector (emission
probabilities)
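\[
P(y_1, y_2, y_3 \mid s_{23m}, s_{23f}) =
\sum_{l_{21m},\, l_{21f},\, l_{22m},\, l_{22f},\, l_{23m},\, l_{23f},\, x_{21},\, x_{22},\, x_{23}}
P(l_{21m})\, P(l_{21f})\, P(l_{22m})\, P(l_{22f})\,
P(x_{21} \mid l_{21m}, l_{21f})\,
P(x_{22} \mid l_{22m}, l_{22f})\,
P(x_{23} \mid l_{23m}, l_{23f})\,
P(l_{23m} \mid l_{21m}, l_{21f}, s_{23m})\,
P(l_{23f} \mid l_{22m}, l_{22f}, s_{23f})\,
P(y_1 \mid x_{21})\, P(y_2 \mid x_{22})\, P(y_3 \mid x_{23})
\]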
Finding the best location
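[Figure: the HMM chain S1, S2, ..., Si, ..., Slast with observed data X1, X2, X3, ..., Xi-1, Xi, ..., Xlast; the disease locus, with hidden state S and phenotype data Y, is placed at a candidate position along the chain.]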
Simplest algorithm: For each possible location on the genetic map, place the disease locus there (say, in the middle) and compute, using the forward algorithm, the probability of the data given that location. Data here means one assignment for the Xi variables and for Y.
Choose the location with the maximum probability.
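The forward computation used by the simplest algorithm can be sketched generically. The function below takes the initial distribution over inheritance vectors, one transition matrix per pair of adjacent loci and one emission vector per locus; the name forward_likelihood and this list-based layout are assumptions made for illustration, and no numerical scaling is applied.

```python
def forward_likelihood(initial, transitions, emissions):
    """Pr(data) for an HMM, computed with the forward algorithm.

    initial[s]           : P(S at the first locus = s)
    transitions[i][s][t] : P(state t at locus i+1 | state s at locus i), 0-based
    emissions[i][s]      : P(observed data at locus i | state s)
    """
    alpha = [p * e for p, e in zip(initial, emissions[0])]
    for trans, emit in zip(transitions, emissions[1:]):
        alpha = [
            emit[t] * sum(alpha[s] * trans[s][t] for s in range(len(alpha)))
            for t in range(len(emit))
        ]
    return sum(alpha)


# Toy chain with two loci and two states, recombination fraction 0.1.
T = [[0.9, 0.1], [0.1, 0.9]]
print(forward_likelihood([0.5, 0.5], [T], [[1, 0], [0, 1]]))
```

To scan the map, one would rebuild the transition and emission lists with the disease locus inserted at each candidate location, call the function once per location and keep the location with the largest value.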
Second algorithm: Run the forward-backward algorithm and store the intermediate results. Use these to compute the probability of the data at each location, all at once. Choose the maximum of all options.
At each segment one can try several values for θ and choose the best, or use EM to learn the best value.
Part III: Case study
Werner’s Syndrome
A successful application of
genetic linkage analysis
using HMM software
(GeneHunter)
The Disease
• First references in 1960s
• Causes premature ageing
• Autosomal recessive
• Linkage studies from 1992
• WRN gene cloned in 1996
• Subsequent discovery of mechanisms involved in wild-type and mutant proteins
One Pedigree’s Data (out of 14)
Pedigree   Individual   Father's   Mother's   Sex   Status
number     ID           ID         ID
1          115          0          0          2     1
1          126          0          0          1     1
1          111          0          0          1     1
1          122          111        115        2     1
1          125          0          0          2     1
1          121          111        115        1     1
1          135          126        122        2     1
1          131          121        125        1     1
1          141          131        135        2     2

Sex: 1=male, 2=female. Status: 1=healthy, 2=diseased.

Marker alleles for the same nine individuals (two columns per marker; 0 = unknown marker allele, nonzero = known marker allele), with columns in transcript order:

115   0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0
126   0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0
111   0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0
122   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
125   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
121   2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1
135   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
131   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
141   1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Marker File Input
14 0 0 5                                             <- 1 disease locus + 13 markers
0 0.0 0.0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2
0.995 0.005
1
0 0 1                                                <- recessive disease requires 2 mutant genes
3 6 # D8S133                                         <- first marker has 6 alleles; first marker's name
0.0200 0.3700 0.4050 0.0050 0.0500 0.0750            <- first marker founder allele frequencies
...[other 12 markers skipped]...
0 0
10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8   <- recombination distances between markers
1 0.1 0.45
Genehunter Output
position   LOD_score     information
 0.00       -1.254417    0.224384
 1.52        2.836135    0.226379
...[other data skipped]...
18.58       13.688599    0.384088
19.92       14.238474    0.401992
21.26       14.718037    0.426818
22.60       15.159389    0.462284
22.92       15.056713    0.462510
23.24       14.928614    0.463208
23.56       14.754848    0.464387
...[other data skipped]...
81.84        1.939215    0.059748
90.60      -11.930449    0.087869

position: putative distance of the disease gene from the first marker, in recombination units.
LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked.
Most 'likely' position: 22.60, with the maximum log likelihood score 15.159389.
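Picking the most likely position from such output is a simple maximum over the LOD column. The pairs below are copied from the rows shown on the slide (the skipped rows are not filled in), and the variable names are hypothetical. The conversion to a likelihood ratio assumes the usual base-10 definition of a LOD score.

```python
# (position, LOD score) pairs copied from the rows shown in the GeneHunter output.
lod_scores = [
    (0.00, -1.254417), (1.52, 2.836135),
    (18.58, 13.688599), (19.92, 14.238474), (21.26, 14.718037),
    (22.60, 15.159389), (22.92, 15.056713), (23.24, 14.928614),
    (23.56, 14.754848), (81.84, 1.939215), (90.60, -11.930449),
]

# The most likely position is the one with the maximum LOD score.
best_position, best_lod = max(lod_scores, key=lambda row: row[1])

# LOD = log10 of the likelihood ratio (linked at this position vs. unlinked).
likelihood_ratio = 10 ** best_lod

print(best_position, best_lod, f"{likelihood_ratio:.2e}")
```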
Locating the Marker
Marker    Interdistance   Distance from first
D8S133    -               0.0
D8S136    7.6             7.6
D8S137    7.4             15.0
D8S131    0.9             15.9
D8S339    6.7             22.6
D8S259    1.6             24.2
FGFR      2.5             26.7
D8S255    2.8             29.5
ANK       2.1             31.6
PLAT      2.8             34.4
D8S165    11.4            45.8
D8S166    1.0             46.8
D8S164    43.8            90.6
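The "distance from first" column is the running sum of the interdistances, which makes it easy to look up which marker sits at the peak position 22.60 reported by GeneHunter. The sketch below uses the values from the slide; the variable names are hypothetical, and the first marker is written D8S133 as in the marker file.

```python
from itertools import accumulate

markers = ["D8S133", "D8S136", "D8S137", "D8S131", "D8S339", "D8S259", "FGFR",
           "D8S255", "ANK", "PLAT", "D8S165", "D8S166", "D8S164"]
interdistances = [7.6, 7.4, 0.9, 6.7, 1.6, 2.5, 2.8, 2.1, 2.8, 11.4, 1.0, 43.8]

# Running sum, starting at 0.0 for the first marker; rounded to one decimal
# to hide floating-point noise.
positions = [round(p, 1) for p in accumulate([0.0] + interdistances)]
position_of = dict(zip(markers, positions))

# Marker closest to the most likely position from the LOD scan.
peak = 22.60
nearest = min(markers, key=lambda m: abs(position_of[m] - peak))
print(positions)
print(nearest)
```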
Final Location
[Figure: map of the region showing the locations of markers D8S131, D8S339 and D8S259 and the final location of the WRN gene.]
The error in the location found by genetic linkage is about 1.25M base pairs.