Download ppt

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Introduction to molecular evolution
Lecture 13, Statistics 246
March 4, 2004
1
Evolution using molecules:
implicit assumptions
Our DNA is inherited from our parents more or less unchanged.
Molecular evolution is dominated by mutations that are neutral
from the standpoint of natural selection.
Mutations accumulate at fairly steady rates in surviving lineages.
We can study the evolution of (macro) molecules and reconstruct
the evolutionary history of organisms using their molecules.
2
Some important dates in history
(billions of years ago)
Origin of the universe
15 4
Formation of the solar system
4.6
First self-replicating system
3.5 0.5
Prokaryotic-eukaryotic divergence
1.8 0.3
Plant-animal divergence
1.0
Invertebrate-vertebrate divergence
0.5
Mammalian radiation beginning
0.1
86 CSH Doolittle et al.
3
The three kingdoms
4
Two important early observations
Different proteins evolve at different rates, and this seems
more or less independent of the host organism, including its
generation time.
It is necessary to adjust the observed percent difference
between two homologous proteins to get a distance more or
less linearly related to the time since their common ancestor.
( Later we offer a rational basis for doing this.)
An striking early version of these observations is next.
5
g
V ertebrates/
Insects
f
Carp/Lamprey
abcde
Reptiles/Fish
200
Birds/Reptiles
Mammals/
Reptiles
220
Mammals
Corrected amino acid changes per 100 residues
Rates of macromolecular evolution
h
i
j
10
180
160
6
5
140
7
89
Evolution of
the globins
120
100
80
60
1
40
4
Separation of ancestors
of plants and animals
Pliocene
Miocene
Oligocene
Eocene
Paleocene
100 200 300
400 500
600
Huronian
Algonkian
Cambrian
Devonian
Silurian
Ordovician
Jurassic
Triassic
Permian
Cretaceous
0
Carboniferous
20 3
2
700 800 900 1000 1100 1200 1300 1400
6
Millions of years since divergence
After Dickerson (1971)
Different rates of change for different proteins
Protein
aPAMs/100
Pseudogenes
Fibrinopeptides
Lactalbumins
Lysozymes
Ribonucleases
Hemoglobins
Acid proteases
Triosephosphate isomerase
Phosphoglyceraldehyde
dehydrogenase
Glutamate dehydrogenase
residues/108 years Theoretical lookback timeb
45c
200c
670c
750c
850c
1.5d
2.3d
6d
400
90
27
24
21
12
8
3
9d
2
1
18d
______________________________________________________________________________________
aPAMs, Accepted point mutations (explained shortly). bUseful lookback time = 360 PAMs.
cMillion years. dBillion years.
From Doolittle 1986
7
Rates of change in protein families
Protein
Ratea
Protein
Fibrinopeptides
Growth hormone
Ig kappa chain C region
Kappa casein
Ig gamma chain C region
Lutropin beta chain
Ig lambda chain C region
Complement C3a
Lactalbumin
Epidermal growth factor
Somatotropin
Pancreatic ribonuclease
Lipotropin beta
Haptoglobin alpha chain
Serum albumin
Phospholipase A2
Protease inhibitor PST1 type
Prolactin
Pancreatic hormone
Carbonic anydrase C
Lutropin alpha chain
Hemoglobin alpha chain
Hemoglobin beta chain
Lipid-binding protein A-II
Gastrin
Animal lysozyme
Myoglobin
Amyloid A
Nerve growth factor
Acid proteases
Myelin basic protein
90
37
37
33
31
30
27
27
27
26
25
21
21
20
19
19
18
17
17
16
16
12
12
10
9.8
9.8
8.9
8.7
8.5
8.4
7.4
Thyrotropin beta chain
Parathyrin
Parvalbumin
BPTI Protease inhibitors
Trypsin
Melanotropin beta
Alpha crystallin A chain
Endorphin
Cytochrome b5
Insulin
Calcitonin
Neurophysin 2
Plastocyanin
Lactate dehydrogenase
Adenylate cyclase
Triosephosphate isomerase
Vasoactive intestinal peptide
Corticotropin
Glyceraldehyde 3-P DH
Cytochrome C
Plant ferredoxin
Collagen
Troponin C, skeletal muscle
Alpha crystallin B-chain
Glucagon
Glutamate DH
Histone H2B
Histone H2A
Histone H3
Ubiquitin
Histone H4
apercent/100My
From (Nei, 1987; Dayhoff et al., 1978)
Rate
7.4
7.3
7.0
6.2
5.9
5.6
5.0
4.8
4.5
4.4
4.3
3.6
3.5
3.4
3.2
2.8
2.6
2.5
2.2
2.2
1.9
1.7
1.5
1.5
1.2
0.9
0.9
0.5
0.14
0.1
8 0.1
Some terminology
In evolution, homology (here of proteins), means similarity due to
common ancestry.
A common mode of protein evolution is by duplication. Depending
on the relations between duplication and speciation dates, we
have two different types of homologous proteins. Loosely,
Orthologues: the “same” gene in different organisms;common
ancestry goes back to a speciation event.
Paralogues: different genes in the same organism; common
ancestry goes back to a gene duplication.
Lateral gene transfer gives another form of homology.
9
Beta-globins (orthologues)
10
BG-human
BG-macaque
BG-bovine
BG-platypus
BG-chicken
BG-shark
60
70
80
100
110
120
NLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
.......Q................K...............
D......A................K.......V...RN..
D......K................NR.....IV...R..S
.I.N..SQ....................DI.II...A..S
DV.SQ.TD..KK.AEE....V.S.K..AKCF.VE.GILLK
130
BG-human
BG-macaque
BG-bovine
BG-platypus
BG-chicken
BG-shark
40
RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD
..........S.........................N...
...........A....N............DS..N.MK...
....A.....SAG............A...TS.G.A.KN..
...A...N..S.T.IL...M.R.......TS.G.AVKN..
.Y.GNLKEFTACSYG-----..E.A...T..LGVAVT..G
90
BG-human
BG-macaque
BG-bovine
BG-platypus
BG-chicken
BG-shark
30
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ
-........N...T..........................
--M..A...A....F....K....................
-...SGG......N......IN.L................
...W.A...QLI.G.......A.C.A...A...I......
-..WSEV.LHEI.TT.KSIDKHSL.AK..A.MFI.....T
50
BG-human
BG-macaque
BG-bovine
BG-platypus
BG-chicken
BG-shark
20
140
KEFTPPVQAAYQKVVAGVANALAHKYH
.....Q.....................
.....VL..DF.............R..
.D.S.E....W..L.S...H..G....
.D...EC...W..L.RV..H...R...
DK.A.QT..IWE.YFGV.VD.ISKE..
. means same as
reference sequence
-
means deletion
10
Beta-globins: uncorrected pairwise distances
DISTANCES between protein sequences, calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids
hum
mac
bov
pla
chi
sha
hum
----
5
16
23
31
65
mac
7
----
17
23
30
62
bov
23
24
----
27
37
65
pla
34
34
39
----
29
64
chi
45
44
52
42
----
sha
91
88
91
90
87
61
----
11
Beta-globins: corrected pairwise distances
DISTANCES between protein sequences, calculated over 1 to 147.
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor
hum
mac
bov
pla
chi
sha
hum
----
5
17
27
37
108
mac
7
----
18
27
36
102
bov
23
24
----
32
46
110
pla
34
34
39
----
34
106
chi
45
44
52
42
----
98
sha
91
88
91
90
87
---12
UPGMA tree
BG-shark
BG-chicken
BG-platypus
BG-bovine
BG-macaque
BG-human
13
Human globins (paralogues)
10
alpha-human
beta-human
delta-human
epsilon-human
gamma-human
myo-human
50
70
90
100
110
DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
.NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.F
.NLKGTF.Q..E..CD..H...E..R..GNV.VCV..RNF
.NLKP.FAK..E..CD..H...E.....GNVMVII..T.F
..LKGTFAQ..E..CD..H...E.....GNV.VTV..I.F
GHHEAEIKP.AQS..T.HKIPVKYLEFI.E.IIQV.QSKH
120
alpha-human
beta-human
delta-human
epsilon-human
gamma-human
myo-human
60
KTYFPHF-DLSHGSA-----QVKGHGKKVADALTNAVAHV
QRF.ES.G...TPD.VMGNPK..A.....LG.FSDGL..L
QRF.ES.G...SPD.VMGNPK..A.....LG.FSDGL..L
QRF.DS.GN..SP..ILGNPK..A.....LTSFGD.IKNM
QRF.DS.GN..SA..IMGNPK..A.....LTS.GD.IK.L
LEK.DK.KH.KSEDEMKASEDL.K..AT.LT..GGILKKK
80
alpha-human
beta-human
delta-human
epsilon-human
gamma-human
myo-human
30
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
VH.T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W.
VH.T.EE..A.N.L....--NVDAV.G...G.LLVVY.W.
VHFTAEE.AA.TSL.S.M--NVE.A.G...G.LLVVY.W.
GHFTEE..ATITSL....--NVEDA.G.T.G.LLVVY.W.
-G..DGEWQL.LNV....E.DIPGH.Q.V.I.L.KGH.E.
40
alpha-human
beta-human
delta-human
epsilon-human
gamma-human
myo-human
20
130
140
PAEFTPAVHASLDKFLASVSTVLTSKYR-----GK....P.Q.AYQ.VV.G.ANA.AH..H......
GK....QMQ.AYQ.VV.G.ANA.AH..H......
GK....E.Q.AWQ.LVSA.AIA.AH..H......
GK....E.Q..WQ.MVTA.ASA.S.R.H......
.GD.GADAQGAMN.A.ELFRKDMA.N.KELGFQG
14
Human globins: uncorrected pairwise distances
DISTANCES between protein sequence, calculated over 1 to 154.
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids
alpha
beta
----
55
beta
82
----
delta
82
10
epsil
89
gamma
myo
alpha
delta
epsil
gamma
55
60
57
74
7
25
27
75
----
27
29
74
35
39
----
20
77
85
39
42
29
----
76
116
117
119
118
116
myo
15
----
Human globins: corrected pairwise distances
DISTANCES between protein sequences,calculated over 1-141
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor
alpha
beta
delta
epsil
gamma
myo
alpha
----
281
281
281
313
208
beta
82
----
7
30
31
1000
delta
82
10
----
34
33
470
epsil
89
35
39
----
21
402
gamma
85
39
42
29
----
470
myo
116
117
119
118
----16
116
Alu sequences (a-globin2 Alu 1, Knight et al., 1996)
hum-1
hum-2
hum-3
chimp
bonob
goril
orang
C
.
.
.
.
.
.
C
.
.
.
.
.
G
G
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
A
A
.
A
C
.
.
.
.
.
.
10
A
.
.
.
.
.
.
C
.
.
.
.
.
.
G
.
.
.
.
A
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
.
20
A C
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
G
C
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
A
.
.
C
C
.
.
A
.
.
.
.
.
.
30
T C
. .
. .
. .
. .
. .
. .
C
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
.
C
C
C
C
A
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
40
T T
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
C
.
C
.
.
.
.
.
.
T
.
.
.
.
.
C
50
G A
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
T
G
.
.
.
.
.
.
A
.
.
G
G
G
G
G
.
.
.
.
.
.
A
.
.
.
.
.
C
G
.
.
.
.
.
.
60
G
.
.
.
.
.
.
hum-1
hum-2
hum-3
chimp
bonob
goril
orang
A
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
70
G
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
T
G
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
80
T T
. .
. .
. C
. C
. C
. C
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
A
A
.
.
.
.
.
.
C
T
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
90
C C
. .
. .
. .
. .
. .
. .
T
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
100
A T
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
T
.
110
C C
. .
. .
. .
. .
. .
. .
C
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
C
.
.
.
.
T
.
.
.
.
.
.
A
.
.
.
.
.
C
T
.
.
.
.
.
.
A
.
.
.
.
.
.
120
C
.
.
.
.
.
.
hum-1
hum-2
hum-3
chimp
bonob
goril
orang
T
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
A
.
.
.
.
.
.
130
C A A
. . .
. . .
. . .
. . .
. . .
. . .
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
T
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
140
G G
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
T
.
.
.
.
.
C
G
.
.
.
.
.
.
T
.
.
.
.
G
.
G
.
.
.
.
.
.
G
.
.
C
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
150
G C
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
160
G T
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
T
C
.
.
.
.
.
.
T
.
.
C
C
C
C
A
.
.
.
.
.
.
G
.
.
.
.
.
.
170
C T
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
G
G
G
G
A
.
.
.
.
.
.
180
G
.
.
.
.
.
.
hum-1
hum-2
hum-3
chimp
bonob
goril
orang
G
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
190
A G G
. . .
. . .
. . .
. . .
. . .
. . .
A
.
.
.
.
.
.
G
.
.
A
A
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
.
G
.
.
.
.
A
.
C
.
.
.
.
.
.
200
T T
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
210
G A
. .
. .
. .
. .
. .
. .
G
.
.
.
.
A
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
220
T T
. .
. .
. .
. .
. .
. .
G
.
.
.
.
T
.
A
.
.
.
.
.
T
G
.
.
.
.
.
.
G
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
230
C T
. .
. .
. .
. .
. .
. .
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
G
C
.
.
.
.
.
.
240
G
.
.
.
.
A
.
hum-1
hum-2
hum-3
chimp
bonob
goril
orang
C
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
C
C
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
250
C T C
. . .
. . .
. . .
. . .
. . .
. . .
C
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
G
.
.
.
.
.
.
G
.
.
.
.
.
.
260
G C
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
A
.
.
.
.
.
.
G
.
.
.
.
.
.
270
C A
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
C
.
.
.
.
.
.
T
.
.
.
.
.
.
C
.
.
.
.
.
.
C
.
.
.
.
.
.
G
.
.
A
A
A
A
280
T C
. .
. .
. .
. .
. .
. .
T
.
.
.
.
.
.
C
.
.
.
.
.
T
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
A
.
.
.
.
.
.
290
T A
. .
. .
. .
. .
. .
. .
A
.
.
.
.
.
.
A
.
.
.
.
.
.
T
.
.
.
.
.
.
A
.
.
.
.
.
.
A A
. .
. .
C .
. .
17
. .
. .
T
.
.
.
.
.
.
A
.
.
.
.
.
.
300
A
.
.
.
.
.
.
Alu sequences:uncorrected pairwise distances
DISTANCES between nucleic acid sequences calculated over: 1 to 300,
considering all base positions
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 bases
hum-1 hum-2 hum-3 chimp bonob goril orang
hum-1
----
0
0
4
3
5
7
hum-2
1
----
1
4
4
5
7
hum-3
1
2
----
4
4
5
7
chimp
12
13
13
----
1
5
6
bonob
10
11
11
2
----
4
5
15
15
14
12
----
7
21
21
18
16
22
----
goril
orang
14
20
18
Alu sequences: corrected pairwise distances
DISTANCES between nucleic acid sequences, calculated over: 1 to 300,
considering all base positions
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 bases
Correction method: Jukes-Cantor
hum-1 hum-2 hum-3 chimp bonob goril orang
hum-1
----
0
0
4
3
5
7
hum-2
1
----
1
4
4
5
7
hum-3
1
2
----
4
4
5
7
chimp
12
13
13
----
1
5
6
bonob
10
11
11
2
----
4
6
goril
14
15
15
14
12
----
8
19
orang
20
21
21
18
16
22
----
Correcting distances between DNA and
protein sequences
We mentioned earlier that it is necessary to adjust observed
percent differences to get a distance measure which scales
linearly with time. This is because we can have multiple and back
substitutions at a given position along a lineage.
All of the correction methods (with names like Jukes-Cantor,
2-parameter Kimura, etc) are justified by simple probabilistic
arguments involving Markov chains whose basis is worth
mastering.
The same molecular evolutionary models are used in scoring
sequence alignments.
20
Markov chain
State space = {A,C,G,T}.
p(i,j) = pr(next state Sj | current state Si)
Markov assumption:
p(i,j) = pr(next state Sj | current state Si & any
configuration of states before this)
Only the present state, not previous states,
affects the probs of moving to next states.
21
A simple 4-state Markov chain: the Kimura
2-parameter model for nucleotide change
A
A c
G
a
C
b
T
b
G a
c
b
b
C b
b
c
a
T b
b
a
c
Transition probabilities:
Horizontal: a
Diagonal & vertical: b
Self: c = a2b
A
G
C
T
22
The multiplication rule
pr(state after next is Sk | current state is Si)
= ∑j pr(state after next is Sk, next state is Sj | current state is Si)
[addition rule]
= ∑j pr(next state is Sj| current state is Si) x pr(state after next is Sk | current
state is Si, next state is Sj)
= ∑j p(i,j) x p(j,k)
[multiplication rule]
[Markov assumption]
= (i,k)-element of P2, where P=(p(i,j)).
More generally,
pr(state t steps from now is Sk | current state is Si) = i,k element of Pt
23
Continuous-time version
For any s,t write pij(t) = pr(Sj at time t+s | Si at time s) for the
stationary (time-homogeneous) transition probabilities.
Write P(t) = (pij(t)) for the matrix of pij(t)’s.
Then for any t,u: P(t+u) = P(t) P(u).
It follows that P(t) = exp(Qt), where Q = P’(0) is the derivative of
P(t) at t = 0.
Q is called the infinitesimal matrix of P(t), and satisfies
P’(t) = QP(t) = P(t)Q.
24
Interpretation of Q

Roughly, q(i,j) is the rate of transitions of i to j, while
q(i,i) =  j q(i,j), so each row sum is 0. If under some initial
conditions, we have a Markov chain evolving in continuous time
with infinitesimal matrix Q, and pj(t) = pr(Sj at time t), then
pj(t+h) =i pr(Si at t, Sj at t+h)
= i pr(Si at t)pr(Sj at t+h | Si at t)
= pj(t)x(1+hqjj) + i jpi(t)x hqij
i.e., h-1[pj(t+h) - pj(t)] = pj(t)q(j,j) + i j pi(t)q(i,j)
which becomes P’ = QP as h  0.
Important approximation: when t is small, P(t)  I + Qt.
25
The Jukes-Cantor model (1969)
-3
Q=
Common
ancestor of
human and orang.
t time units
P(t) =
human (now)
Consider e.g. the
2nd position in
a-globin2 Alu1.

 -3
 
 
r
s
s
s
s
r
s
s


-3




-3
s
s
r
s
s
s
s
r
r = (1+3e4t)/4,
s = (1 e4t)/4.
26
Definition of PAM
Let P(t) = exp(Qt). Then the A,G element of P(t) is
pr(G now | A then) = (1  e4t)/4.
Same for all pairs of different nucleotides.
Overall rate of change k = 3t.
When k = .01, described as 1 PAM
PAM = accepted point mutation
Put t = .01/3 = 1/300. Then the resulting
P = P(1/300) is called the PAM(1) matrix.
Why use PAMs?
27
Evolutionary time, PAM
Since sequences evolve at different rates, it is
convenient to rescale time so that 1 PAM of evolutionary
time corresponds to 1% expected substitutions.
For Jukes-Cantor, k = 3t is the expected number of
substitutions in [0,t], so is a distance. (Show this.)
Set 3t = 1/100, or t = 1/300, so 1 PAM = 1/300 years.
28
Distance adjustment
For a pair of sequences, k = 3t is the desired
metric, but not observable. Instead, pr(different) is
observed. So we use a model to convert
pr(different) to k.
This is completely analogous to the conversion of
 = pr(recombination)
to genetic (map) distance (= expected number of
crossovers) using the Haldane map function
 = 1/2  (1  e-2d),
assuming the no-interference (Poisson) model. 29
Towards Jukes-Cantor adjustment
common ancestor
still 2nd position in a-globin Alu 1
Assume that the common ancestor has
A, G, C or T with probability 1/4.
G
orang
C
human
Then the chance of the nt differing
3/4
p≠ = 3/4  (1  e8t)
= 3/4  (1  e4k/3), since k =2  3t
30
t
Jukes-Cantor adjustment
If we suppose all nucleotide positions behave identically and
independently, and n≠ differ out of n, we can invert this,
obtaining
=  3/4  log(1  4/3 n≠/n).
This is the corrected or adjusted fraction of differences (under
this simple model).  100 to get PAMs
The analogous simple model for amino acid sequences has
=  19/20  log(1  20/19 n≠/n).
 100 for PAM.
31
Illustration
1. Human and bovine beta-globins are aligned with no deletions
at 145 out of 147 sites. They differ at 23 of these sites. Thus
n≠/n = 23/145, and the corrected distance using the Jukes-Cantor
formula is (natural logs)
 19/20  log(1  20/19  23/145) = 17.3  10-2.
2. The hum-1 and gorilla sequences are aligned without gaps
across all 300 bp, and differ at 14 sites. Thus n≠/n = 14/300, and
the corrected distance using the Jukes-Cantor formula is
 3/4  log(1  4/3  14/300) = 4.8  10-2.
32
Correspondence between observed a.a. differences
and the evolutionary distance (Dayhoff et al., 1978)
Observed Percent Difference
1
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
Evolutionary Distance in PAMs
1
5
11
17
23
30
38
47
56
67
80
94
112
133
159
195
246
328
33