Bioinformatics Algorithms and
Data Structures
Probabilistic Approaches to
Phylogeny
Lecturer: Dr. Rose
Slides by: Dr. Rose
April 24, 2003
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Probabilistic Approaches to
Phylogeny
Principal goal: rank trees according to:
– Their likelihood P(data|tree), or
– Their posterior probability P(tree|data)
Secondary goal:
– Find the probability of some taxonomic feature
– Example: the grouping of a set of taxa on a branch
Probabilistic Approaches to
Phylogeny
Notation and definitions:
Let P(x•|T,t•) denote the probability of a set of data given a tree, where:
– x• denotes n sequences
– T denotes a tree with n leaves with sequence j at leaf j
– t• denotes the edge lengths of the tree
The definition of P(x•|T,t•) depends on our choice of model of evolution.
Probabilistic Approaches to
Phylogeny
Let P(x|y,t) denote the probability that sequence y
evolves into x along an edge of length t.
Assume that we can define P(x|y,t).
If we can do this for each edge of T, we can calculate the probability of the data given T.
Let’s look at an example.
Probabilistic Approaches to
Phylogeny
P(x1,.., x5|T, t•) = P(x1|x4,t1) P(x2|x4,t2) P(x3|x5,t3) P(x4|x5,t4) P(x5)
[Figure: rooted tree with root x5. The root has two children: inner node x4 (edge t4) and leaf x3 (edge t3). Node x4 has two leaf children: x1 (edge t1) and x2 (edge t2).]
Probabilistic Approaches to
Phylogeny
There is a small problem: normally, we do not have
sequences for internal nodes.
Solution: calculate P(x1,.., x3 | T, t•) by summing over
all possible ancestors x4 & x5.
Given this model, we can search for the T and t• that maximize P(x• | T, t•).
Maximizing P(x• | T, t•) requires:
1. Searching over possible tree topologies, T
2. Searching over possible lengths of edges, t•
Probabilistic Approaches to
Phylogeny
Concerning the search over tree topologies:
Q: How many rooted binary trees are there with n
leaves?
A: (2n – 3)!! rooted binary trees with n leaves.
For n = 10 there are about 34 million trees (17!! = 34,459,425).
For n = 20 there are about 8.2 × 10^21 trees.
An efficient search procedure might prove useful.
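The double factorial grows quickly enough that it is worth checking the counts directly. The following is a minimal sketch; the function name `count_rooted_binary_trees` is chosen here for illustration and does not come from the slides.

```python
def count_rooted_binary_trees(n):
    """Number of rooted binary trees with n labelled leaves: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


for n in (3, 5, 10, 20):
    print(n, count_rooted_binary_trees(n))
```

Python integers have arbitrary precision, so the n = 20 count is exact rather than a floating-point approximation.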
Probabilistic Models of
Evolution
Ridiculously simplistic model of evolution:
1. Every site is independent
2. Deletions and insertions do not occur
3. Substitution accounts for all evolution
Let P(b|a, t) denote the probability of the substitution
of residue b for residue a over an edge length of
t.
Extending to aligned gapless sequences x and y:
\[ P(x \mid y, t) = \prod_{u} P(x_u \mid y_u, t) \]
where u indexes over sites.
Probabilistic Models of
Evolution
Consider the substitution matrix S(t), which specifies P(b|a,t).
A residue alphabet of size K entails a K-by-K substitution matrix:
\[
S(t) = \begin{pmatrix}
P(A_1 \mid A_1, t) & P(A_2 \mid A_1, t) & \cdots & P(A_K \mid A_1, t) \\
P(A_1 \mid A_2, t) & P(A_2 \mid A_2, t) & \cdots & P(A_K \mid A_2, t) \\
\vdots & \vdots & \ddots & \vdots \\
P(A_1 \mid A_K, t) & P(A_2 \mid A_K, t) & \cdots & P(A_K \mid A_K, t)
\end{pmatrix}
\]
Probabilistic Models of
Evolution
A nice property for substitution matrices is the
multiplicative property in the sense that:
S(t)S(s) = S(t + s) for all values of the lengths s and t.
This is equivalent to:
\[ \sum_{b} P(a \mid b, t)\, P(b \mid c, s) = P(a \mid c, s + t) \quad \text{for all } a, c, s, \text{ and } t. \]
Probabilistic Models of
Evolution
Viewing t as a time variable, multiplicativity is a
consequence of the nature of the substitution
process. The process is:
1. Markovian and
2. Stationary, i.e., the substitution of a at time t for b at
time s depends only on the time interval (t-s)
Probabilistic Models of
Evolution
Example: Jukes & Cantor [1969] model
In this model all nucleotides undergo substitution at the same rate α, giving the substitution rate matrix R (rows and columns ordered A, C, G, T):
\[
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]
Probabilistic Models of
Evolution
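The substitution matrix for a short time ε, S(ε), is approximately I + Rε, where I is the identity matrix.
Recall from the previous slide:
\[
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]
Hence:
\[
I + R\varepsilon = \begin{pmatrix}
1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon
\end{pmatrix}
\]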
Probabilistic Models of
Evolution
We want to derive the substitution matrix for time t.
S(t+ε) = S(t)S(ε) ≈ S(t)(I + Rε)    (by multiplicativity)
S(t+ε) ≈ S(t) + S(t)Rε    (multiplying out)
S(t+ε) − S(t) ≈ S(t)Rε    (subtracting S(t) from both sides)
[S(t+ε) − S(t)]/ε ≈ S(t)R    (dividing by ε)
S′(t) = S(t)R    (as ε → 0)
Probabilistic Models of
Evolution
Noting that the rate matrix is symmetric, we expect something of the following form for S(t):
\[
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix},
\qquad
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]
Substituting our matrix S(t) into S′(t) = S(t)R gives:
\[ \dot{r} = -3\alpha r + 3\alpha s, \qquad \dot{s} = -\alpha s + \alpha r \]
Probabilistic Models of
Evolution
From the previous slide: substituting our matrix S(t) into S′(t) = S(t)R gives:
\[ \dot{r} = -3\alpha r + 3\alpha s, \qquad \dot{s} = -\alpha s + \alpha r \]
which are satisfied by:
\[ r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right) \]
Substituting r_t and s_t into S(t) gives the Jukes-Cantor model.
Probabilistic Models of
Evolution
Jukes-Cantor Model:
\[
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix},
\qquad r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \quad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)
\]
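A quick numerical check can make the multiplicative property concrete for the Jukes-Cantor matrices. The sketch below builds S(t) from r_t and s_t and confirms that S(t)S(s) agrees with S(t + s); the helper names and the rate value 0.1 are illustrative assumptions, not part of the lecture.

```python
import math


def jukes_cantor(t, alpha):
    """4x4 Jukes-Cantor matrix S(t): r_t on the diagonal, s_t elsewhere."""
    decay = math.exp(-4.0 * alpha * t)
    r = 0.25 * (1.0 + 3.0 * decay)
    s = 0.25 * (1.0 - decay)
    return [[r if i == j else s for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


alpha = 0.1  # hypothetical substitution rate
product = matmul(jukes_cantor(0.7, alpha), jukes_cantor(1.8, alpha))
print(max_abs_diff(product, jukes_cantor(2.5, alpha)))  # rounding error only
print([sum(row) for row in jukes_cantor(2.5, alpha)])   # each row sums to 1
```

For very large t every entry approaches ¼, which matches the equilibrium frequencies of the model.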
Probabilistic Models of
Evolution
• When t = ∞, r_t = s_t = ¼. Hence the nucleotide equilibrium frequencies implied by the Jukes-Cantor model are: qA = qC = qG = qT = ¼.
• This model is too simple even for nucleotide substitution.
Q: What obvious substitution issue is ignored?
A: The difference between transitions and transversions.
Probabilistic Models of
Evolution
Transitions:
• purine ↔ purine, i.e., A ↔ G
• pyrimidine ↔ pyrimidine, i.e., C ↔ T
Transversions:
• purine ↔ pyrimidine
– A ↔ C
– A ↔ T
– G ↔ C
– G ↔ T
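The classification above is simple enough to express as a lookup. This is a small sketch; the function name `classify_substitution` and the "identity" label for unchanged residues are choices made here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Label a nucleotide pair as 'identity', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError("not a DNA nucleotide: " + repr(base))
    if a == b:
        return "identity"
    # Same chemical class (both purines or both pyrimidines) is a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, classify_substitution(pair[0], pair[1]))
```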
Probabilistic Models of
Evolution
Kimura [1980] proposed a rate matrix that accounts for the difference between transitions (rate α) and transversions (rate β). With rows and columns ordered A, C, G, T:
\[
R = \begin{pmatrix}
-2\beta-\alpha & \beta & \alpha & \beta \\
\beta & -2\beta-\alpha & \beta & \alpha \\
\alpha & \beta & -2\beta-\alpha & \beta \\
\beta & \alpha & \beta & -2\beta-\alpha
\end{pmatrix}
\]
Probabilistic Models of
Evolution
We can integrate the Kimura rate matrix to give the general time-dependent form:
\[
S(t) = \begin{pmatrix}
r_t & s_t & u_t & s_t \\
s_t & r_t & s_t & u_t \\
u_t & s_t & r_t & s_t \\
s_t & u_t & s_t & r_t
\end{pmatrix}
\]
Where:
\[ s_t = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right) \]
\[ u_t = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta) t}\right) \]
\[ r_t = 1 - 2s_t - u_t \]
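The Kimura probabilities can be evaluated directly from the formulas for s_t, u_t and r_t. The sketch below assumes the standard reading in which α is the transition rate and β the transversion rate; the function name and the numeric rates are hypothetical.

```python
import math

BASES = "ACGT"
# Transition partners: A <-> G (purines) and C <-> T (pyrimidines).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def kimura_probability(b, a, t, alpha, beta):
    """P(b | a, t) under the Kimura model with rates alpha and beta."""
    s = 0.25 * (1.0 - math.exp(-4.0 * beta * t))
    u = 0.25 * (1.0 + math.exp(-4.0 * beta * t)
                - 2.0 * math.exp(-2.0 * (alpha + beta) * t))
    if a == b:
        return 1.0 - 2.0 * s - u   # r_t: no net change
    if TRANSITION_PARTNER[a] == b:
        return u                   # transition
    return s                       # transversion


alpha, beta, t = 0.4, 0.1, 0.5  # hypothetical rates and edge length
for a in BASES:
    row = [kimura_probability(b, a, t, alpha, beta) for b in BASES]
    print(a, [round(p, 4) for p in row], round(sum(row), 12))
```

Setting alpha equal to beta makes u_t collapse to s_t, which recovers the Jukes-Cantor probabilities.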
Probabilistic Models of
Evolution
This model is still unrealistic:
– The equilibrium frequencies are equal: qA = qC = qG = qT = ¼.
– This is unrealistic for taxa whose genomes show a strong bias in the AT vs. GC ratio.
Probabilistic Models of
Evolution
• Moving on to protein sequences, we consider evolution models for amino acids.
• Dayhoff et al. compiled a matrix A_ab:
– built from hypothetical phylogenetic trees of 71 families
– a compilation of the frequencies of substitutions between paired residues
– gives p_ab, the probability of a aligned with b.
Probabilistic Models of
Evolution
• Next, a second matrix B_ab is derived from A_ab:
– B_ab = A_ab / Σ_c A_ac
– Gives the conditional probability p(b|a)
– This is the short time interval estimate.
• Let q_a denote the frequency of occurrence of residue a in a protein. Similarly for q_b.
• The expected number of substitutions in a protein is then Σ_{a,b} q_a q_b B_ab
Probabilistic Models of
Evolution
• A substitution matrix where Σ_{a,b} q_a q_b B_ab = 0.01 is a 1 PAM matrix.
• Dayhoff et al. define a 1 PAM (point accepted mutation) matrix as one for which the expected number of substitutions is 1%.
• Next, B_ab is scaled to produce a 1 PAM matrix of substitution probabilities.
Probabilistic Models of
Evolution
A third matrix Cab is derived from Bab
– C_ab = sB_ab for a ≠ b
– C_aa = sB_aa + (1 – s)
– i.e., scale off-diagonals by s and adjust diagonals to maintain a row sum of 1.
– s is chosen so that C_ab is a 1 PAM matrix.
– Let S(1) denote this 1 PAM matrix
Probabilistic Models of
Evolution
Q: How can substitution matrices for longer times be
created?
A: raise S(1) to a power n.
Example:
– S(2) = S(1)S(1)
– Entries P(a|b, t = 2) = Σ_c P(a|c, t = 1) P(c|b, t = 1)
– These are the substitutions from b to a via any intermediary c
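Raising S(1) to a power is ordinary repeated matrix multiplication. The sketch below uses a made-up 3-letter alphabet instead of the 20 amino acids so that the numbers are easy to follow; the matrix entries are hypothetical and row a, column b holds P(b|a, t = 1), as in the layout of S(t) given earlier.

```python
def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, n):
    """Compute m^n by repeated multiplication, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = matmul(result, m)
    return result


# Hypothetical one-step substitution matrix; every row sums to 1.
S1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.03, 0.96],
]

S2 = matrix_power(S1, 2)
print(S2[0][1])  # sum over intermediaries c of S1[0][c] * S1[c][1]
print([sum(row) for row in matrix_power(S1, 250)])  # rows still sum to 1
```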
Probabilistic Models of
Evolution
• S(n) can be viewed as the result of n steps in a Markov chain with 20 states.
– The 20 states correspond to the 20 amino acids
– Each step in this Markov chain has probability given by S(1).
Probabilistic Models of
Evolution
Recap: S(1) was defined by
1. normalizing the rows of the symmetric matrix A
2. then rescaling
These operations can be interchanged
1. rescale
2. then normalize
Note: the rescaled matrix is still symmetric ⇒ it can be diagonalized
Probabilistic Models of
Evolution
We can thus write S(1) = U D(λ_i) U^{-1} where
– U is a coordinate transformation and
– D(λ_i) is the diagonal matrix with eigenvalues λ_1 … λ_20 on the diagonal.
– The eigenvalues range between 0 and 1 and can be written λ_i = exp(−μ_i)
– Powers of S(1) are represented very simply in the diagonal matrix system:
S(2) = S(1)S(1) = U D(λ_i) U^{-1} U D(λ_i) U^{-1} = U D(λ_i^2) U^{-1}
Probabilistic Models of
Evolution
For arbitrary t, S(t) = U D(λ_i^t) U^{-1}, i.e.,
\[
S(t) = U \begin{pmatrix}
e^{-\mu_1 t} & 0 & \cdots & 0 \\
0 & e^{-\mu_2 t} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & e^{-\mu_{20} t}
\end{pmatrix} U^{-1}
\]
If A_i is the ith amino acid then
\[ P(A_j \mid A_i, t) = \sum_{k} u_{ik}\, e^{-\mu_k t}\, v_{kj} \]
where u_ik and v_kj are entries in U and U^{-1}, respectively.
Probabilistic Models of
Evolution
Letting t = ∞, we get the PAM matrix:
\[
S(\infty) = \begin{pmatrix}
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}} \\
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}} \\
\vdots & \vdots & \ddots & \vdots \\
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}}
\end{pmatrix}
\]
where the q_{A_i} are the equilibrium frequencies for amino acids.
Likelihood for ungapped
alignments
Q: What is the simplest tree that we might consider?
A: A tree comprised of two leaves, x1 and x2.
Q: Why is this interesting? We already know the
topology.
A: We don’t know the edge lengths!
[Figure: tree with root a and two leaves, x1 on an edge of length t1 and x2 on an edge of length t2.]
Likelihood for ungapped
alignments
Q: How does the likelihood of the tree vary with the
edge lengths?
A: That is the question!
Consider a single site with residues x1 and x2.
Assign the variable a to the root as shown below:
[Figure: the two-leaf tree with residue a at the root, leaf x1 on the edge of length t1 and leaf x2 on the edge of length t2.]
Likelihood for ungapped
alignments
The probability of
1. having a at the root (we assume the equilibrium
distribution) and
2. substituting a by x1 and
3. Substituting a by x2
is given by:
P(x1, x2, a|T, t1, t2) = qa P(x1|a,t1)P(x2|a,t2)
Recall: Since we don’t know what residue is at the
root, we must sum over all possibilities.
Likelihood for ungapped
alignments
Thus the generalized form of the equation is:
\[ P(x^1, x^2 \mid T, t_1, t_2) = \sum_{a} q_a\, P(x^1 \mid a, t_1)\, P(x^2 \mid a, t_2) \]
The generalization of this equation to N sites is:
\[ P(x^1, x^2 \mid T, t_1, t_2) = \prod_{u=1}^{N} P(x^1_u, x^2_u \mid T, t_1, t_2) \]
Likelihood for ungapped
alignments
Example: likelihood of two sequences
CCGGCCGCGCG
CGGGCCGGCCG
Q: What is the likelihood of these sequences in our
simple tree of two leaves?
For this example, assume the Jukes-Cantor model.
We will need:
1. The substitution equations from the previous section.
2. The single site equation from the previous slide.
Likelihood for ungapped
alignments
Based on:
• the substitution matrix from the previous section,
\[
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix}
\]
• the single site equation from the previous slide,
\[ P(x^1, x^2 \mid T, t_1, t_2) = \sum_{a} q_a\, P(x^1 \mid a, t_1)\, P(x^2 \mid a, t_2) \]
The probability of C occurring in both leaves is:
\[ P(C, C \mid T, t_1, t_2) = q_C\, r_{t_1} r_{t_2} + q_G\, s_{t_1} s_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2} \]
Likelihood for ungapped
alignments
\[ P(C, C \mid T, t_1, t_2) = q_C\, r_{t_1} r_{t_2} + q_G\, s_{t_1} s_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2} \]
Q: Why does the first term use the probabilities r_{t1} and r_{t2} while the other three terms use s_{t1} and s_{t2}?
A: The first term models the case where there is no change.
• In the first term the root residue is C and both leaves are also C, so there is no substitution ⇒ r_{t1} and r_{t2}
• The other terms model the substitution of the residue at the root by C at the leaves ⇒ s_{t1} and s_{t2}
Likelihood for ungapped
alignments
Recall that the equilibrium distribution for Jukes-Cantor entails qA = qC = qG = qT = ¼.
Hence:
\[ P(C, C \mid T, t_1, t_2) = q_C\, r_{t_1} r_{t_2} + q_G\, s_{t_1} s_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2} = \tfrac{1}{4}\left(r_{t_1} r_{t_2} + 3 s_{t_1} s_{t_2}\right) \]
By symmetry P(G,G|T,t1,t2) = P(C,C|T,t1,t2)
Likewise, P(G,C|T,t1,t2) = P(C,G|T,t1,t2) = ?
\[ P(C, G \mid T, t_1, t_2) = q_C\, r_{t_1} s_{t_2} + q_G\, s_{t_1} r_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2} = \tfrac{1}{4}\left(r_{t_1} s_{t_2} + s_{t_1} r_{t_2} + 2 s_{t_1} s_{t_2}\right) \]
Likelihood for ungapped
alignments
Recall r_t and s_t from the previous section:
\[ r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right) \]
Substituting gives:
\[ P(C, C \mid T, t_1, t_2) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} + 3 s_{t_1} s_{t_2}\right) = \tfrac{1}{16}\left(1 + 3e^{-4\alpha(t_1 + t_2)}\right) \]
\[ P(C, G \mid T, t_1, t_2) = \tfrac{1}{4}\left(r_{t_1} s_{t_2} + s_{t_1} r_{t_2} + 2 s_{t_1} s_{t_2}\right) = \tfrac{1}{16}\left(1 - e^{-4\alpha(t_1 + t_2)}\right) \]
Likelihood for ungapped
alignments
Ok, now we have probabilities for single sites.
We derive the probability for sequences by taking the product of site probabilities, i.e.,
\[ P(x^1, x^2 \mid T, t_1, t_2) = \prod_{u=1}^{N} P(x^1_u, x^2_u \mid T, t_1, t_2) \]
If there are "match" sites where the sequences match and "mismatch" sites where they mismatch, we get:
\[ P(x^1, x^2 \mid T, t_1, t_2) = \left(\tfrac{1}{16}\right)^{\mathrm{match} + \mathrm{mismatch}} \left(1 + 3e^{-4\alpha(t_1 + t_2)}\right)^{\mathrm{match}} \left(1 - e^{-4\alpha(t_1 + t_2)}\right)^{\mathrm{mismatch}} \]
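The closed form can be cross-checked against the site-by-site sum over root residues for the two example sequences, which have 8 matching and 3 mismatching sites. This is a sketch under the Jukes-Cantor model; the function names, the rate alpha = 0.5 and the edge lengths are illustrative assumptions.

```python
import math

SEQ1 = "CCGGCCGCGCG"
SEQ2 = "CGGGCCGGCCG"


def jc_prob(b, a, t, alpha):
    """Jukes-Cantor P(b | a, t): r_t if the residue is unchanged, else s_t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def site_likelihood(x1, x2, t1, t2, alpha):
    """Sum over the unknown root residue a, with q_a = 1/4."""
    return sum(0.25 * jc_prob(x1, a, t1, alpha) * jc_prob(x2, a, t2, alpha)
               for a in "ACGT")


def sequence_likelihood(s1, s2, t1, t2, alpha):
    """Product of site likelihoods (sites are assumed independent)."""
    likelihood = 1.0
    for x1, x2 in zip(s1, s2):
        likelihood *= site_likelihood(x1, x2, t1, t2, alpha)
    return likelihood


def closed_form(s1, s2, t1, t2, alpha):
    """Match/mismatch formula; depends on t1 and t2 only through t1 + t2."""
    match = sum(x1 == x2 for x1, x2 in zip(s1, s2))
    mismatch = len(s1) - match
    decay = math.exp(-4.0 * alpha * (t1 + t2))
    return ((1.0 / 16.0) ** (match + mismatch)
            * (1.0 + 3.0 * decay) ** match
            * (1.0 - decay) ** mismatch)


print(sequence_likelihood(SEQ1, SEQ2, 0.1, 0.2, 0.5))
print(closed_form(SEQ1, SEQ2, 0.1, 0.2, 0.5))
```

The two printed values should agree up to rounding error, and moving length between t1 and t2 while keeping their sum fixed should leave both unchanged.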
Likelihood for ungapped
alignments
Q: How can we get the likelihood for multiple
sequences?
Consider a tree T with edge lengths t.
Denote the ancestor of node i by a(i)
Let P(x1,..,xn |T, t) denote the probability of
generating residues x1,.., xn at the n leaves
P(x1,.., xn |T, t) is found by multiplying the
probabilities of all edges of the tree.
Likelihood for ungapped
alignments
\[
P(x^1, \ldots, x^n \mid T, t^\bullet) = \sum_{a^{n+1}, a^{n+2}, \ldots, a^{2n-1}} q_{a^{2n-1}} \prod_{i=n+1}^{2n-2} P(a^i \mid a^{a(i)}, t_i) \prod_{i=1}^{n} P(x^i \mid a^{a(i)}, t_i)
\]
where:
– the sum runs over all possible assignments of residues to the inner nodes n+1, …, 2n−1 (node 2n−1 is the root)
– the first product gives the probabilities of the inner nodes given the inner nodes’ parents
– the second product gives the probabilities of the leaves given the leaves’ parents
Note: this can be computed from the leaves up.
Likelihood for ungapped
alignments
• Let L_k denote the leaves below node k.
• Let P(L_k|a) denote the probability of these leaves given that the residue at k is a.
• Let i and j be the children of k.
• P(L_k|a) is computed from P(L_i|b) and P(L_j|c) for all residues b and c.
[Figure: node k with residue a and its two children, node i with residue b and node j with residue c.]
Felsenstein’s Likelihood
Algorithm
Initialization:
  Set k = 2n−1
Recursion: Compute P(L_k|a) for all a as follows:
  If k is a leaf
    Set P(L_k|a) = 1 if a = x^k_u, otherwise P(L_k|a) = 0
  Else
    Compute P(L_i|a) and P(L_j|a) for all a at the children i and j of k
    Set P(L_k|a) = Σ_{b,c} P(b|a,t_i) P(L_i|b) P(c|a,t_j) P(L_j|c)
Termination:
  The likelihood at site u = P(x•_u|T,t•) = Σ_a P(L_{2n−1}|a) q_a
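The recursion above maps naturally onto a recursive function over a nested tree. The following is a minimal sketch for a single site under the Jukes-Cantor model, not the lecturer's implementation: the tuple encoding of trees, the function names and the numeric edge lengths are all assumptions made here. The double sum over b and c is evaluated as a product of two single sums, which is equivalent because the two children contribute independent factors.

```python
import math

BASES = "ACGT"


def jc_prob(b, a, t, alpha):
    """Jukes-Cantor P(b | a, t): r_t if the residue is unchanged, else s_t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional_likelihoods(node, alpha):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`.

    A leaf is a one-letter string; an inner node is a tuple
    (left_subtree, left_edge_length, right_subtree, right_edge_length).
    """
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_i, right, t_j = node
    p_i = conditional_likelihoods(left, alpha)
    p_j = conditional_likelihoods(right, alpha)
    result = {}
    for a in BASES:
        from_i = sum(jc_prob(b, a, t_i, alpha) * p_i[b] for b in BASES)
        from_j = sum(jc_prob(c, a, t_j, alpha) * p_j[c] for c in BASES)
        result[a] = from_i * from_j
    return result


def site_likelihood(tree, alpha):
    """Termination step: sum over root residues weighted by q_a = 1/4."""
    root = conditional_likelihoods(tree, alpha)
    return sum(0.25 * root[a] for a in BASES)


# Hypothetical three-leaf tree: leaves 1 and 2 (edges 0.1 and 0.2) join at an
# inner node, which hangs from the root by an edge of 0.4; leaf 3 hangs from
# the root by an edge of 0.3. All three leaves carry residue C at this site.
tree = (("C", 0.1, "C", 0.2), 0.4, "C", 0.3)
print(site_likelihood(tree, alpha=0.5))
```

The likelihood of a whole alignment would then be the product of `site_likelihood` over sites, one tree per column with the same topology and edge lengths.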
Felsenstein’s Likelihood
Algorithm
Finally, site likelihood values are combined to give the likelihood of the sequences at the leaves.
This step assumes independence of sites:
\[ P(x^\bullet \mid T, t^\bullet) = \prod_{u=1}^{N} P(x^\bullet_u \mid T, t^\bullet) \]
Likelihood for ungapped
alignments
Let’s look at a simple example.
Tree with 3 sequences (only G and C bases):
CCGGCCGCGCG
CGGGCCGGCCG
GCCGCCGGGCC
Assume the Jukes-Cantor model.
[Figure: rooted tree. The root has two children: inner node 4 (edge t4) and leaf 3 (edge t3). Node 4 has two leaf children: 1 (edge t1) and 2 (edge t2).]
Consider a site where C occurs at all leaves:
P(C, C, C | T, t1, t2, t3) = ?
Likelihood for ungapped
alignments
\[
\begin{aligned}
P(C, C, C \mid T, t_1, t_2, t_3) ={}& q_C\, r_{t_3} \left(r_{t_4} r_{t_1} r_{t_2} + 3 s_{t_4} s_{t_1} s_{t_2}\right) \\
&+ (q_A + q_G + q_T)\, s_{t_3} \left(r_{t_4} s_{t_1} s_{t_2} + 2 s_{t_4} s_{t_1} s_{t_2} + s_{t_4} r_{t_1} r_{t_2}\right)
\end{aligned}
\]
1. First line: equilibrium probability of residue C at the root. The terms in parentheses cover:
   1. C is at all nodes
   2. node 4 does not have residue C
2. Second line: equilibrium probabilities of the other residues at the root. The terms in parentheses cover:
   1. node 4 has the same residue as the root, i.e., not C
   2. node 4 has a different residue from the root, but not C
   3. node 4 has residue C
Likelihood for ungapped
alignments
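\[
\begin{aligned}
P(C, C, C \mid T, t_1, t_2, t_3) ={}& q_C\, r_{t_3} \left(r_{t_4} r_{t_1} r_{t_2} + 3 s_{t_4} s_{t_1} s_{t_2}\right) \\
&+ (q_A + q_G + q_T)\, s_{t_3} \left(r_{t_4} s_{t_1} s_{t_2} + 2 s_{t_4} s_{t_1} s_{t_2} + s_{t_4} r_{t_1} r_{t_2}\right) \\
={}& \tfrac{1}{4}\, r_{t_1} r_{t_2} \left(r_{t_3} r_{t_4} + 3 s_{t_3} s_{t_4}\right) + \tfrac{3}{4}\, s_{t_1} s_{t_2} \left(2 s_{t_3} s_{t_4} + s_{t_3} r_{t_4} + s_{t_4} r_{t_3}\right) \\
={}& \tfrac{1}{4} \left(r_{t_1} r_{t_2} r_{t_3 + t_4} + 3 s_{t_1} s_{t_2} s_{t_3 + t_4}\right)
\end{aligned}
\]
N.B. The sorcery in the last step (the parenthesized t3, t4 terms, shown in colored boxes on the slide, collapse into r_{t3+t4} and s_{t3+t4}) is a consequence of the multiplicativity of the Jukes-Cantor matrices.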
Likelihood for ungapped
alignments
Observation: the edges adjoining the root appear only as their sum:
– In the previous example of the tree comprised of 2 sequences, the edges t1 & t2 appeared as:
\[ P(C, C \mid T, t_1, t_2) = \tfrac{1}{16}\left(1 + 3e^{-4\alpha(t_1 + t_2)}\right) \]
\[ P(C, G \mid T, t_1, t_2) = \tfrac{1}{16}\left(1 - e^{-4\alpha(t_1 + t_2)}\right) \]
– In this last example, the edges t3 & t4 appeared as:
\[ P(C, C, C \mid T, t_1, t_2, t_3) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} r_{t_3 + t_4} + 3 s_{t_1} s_{t_2} s_{t_3 + t_4}\right) \]
Likelihood for ungapped
alignments
• This observation holds for all leaf values, and thus for the total likelihood.
• Idea: collapse t3 & t4 into a single edge, shifting the root to node 4.
[Figure: left, the rooted tree with node 4 below the root (edges t3 and t4); right, the tree after shifting the root to node 4, with leaves 1, 2 and 3 attached by edges t1, t2 and t3.]
Likelihood for ungapped
alignments
We now denote the third edge by t3 rather than t3+t4.
This results in a simpler equation:
\[ P(x^1_u, x^2_u, x^3_u \mid T, t_1, t_2, t_3) = \sum_{a} q_a\, P(x^1_u \mid a, t_1)\, P(x^2_u \mid a, t_2)\, P(x^3_u \mid a, t_3) \]
Since our example sequences contain only C & G, the only possible types of terms are:
• CCC or GGG (all residues are the same)
• GGC or CCG
• CGC or GCG
• GCC or CGG
Likelihood for ungapped
alignments
Examples of these equations are:
\[ P(C, C, C \mid T, t_1, t_2, t_3) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} r_{t_3} + 3 s_{t_1} s_{t_2} s_{t_3}\right) \]
\[ P(C, C, G \mid T, t_1, t_2, t_3) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} s_{t_3} + s_{t_1} s_{t_2} r_{t_3} + 2 s_{t_1} s_{t_2} s_{t_3}\right) \]
Reversibility & Independence of
Root Position
Q: How is it that we were able to collapse t3 and t4 in
the previous example?
A: The likelihood of the tree was independent of the
position of the root.
If a substitution matrix has the properties of:
1. Multiplicativity and
2. Reversibility, i.e., P(b|a,t)qa = P(a|b,t)qb for all a,b,t
Then likelihood is independent of the position of the
root.
Reversibility & Independence of
Root Position
Can we show that multiplicativity & reversibility imply that
root position doesn’t affect likelihood?
Let the children of the root, node 2n-1, be i and j.
Recall: P(Lk|a) is the probability of the leaves below node k
given that k is residue a.
In Felsenstein’s algorithm:
– P(L_{2n−1}|a) is computed from P(L_i|b) and P(L_j|c)
– The likelihood of the sequences x• at site u is
\[ P(x^\bullet_u \mid T, t^\bullet) = \sum_{a} q_a\, P(L_{\mathrm{root}} \mid a) = \sum_{a,b,c} q_a\, P(b \mid a, t_i)\, P(c \mid a, t_j)\, P(L_i \mid b)\, P(L_j \mid c) \]
i.e., we sum over all possible residues appearing at the root and its two children
Reversibility & Independence of
Root Position
Taking this equation:
\[ P(x^\bullet_u \mid T, t^\bullet) = \sum_{a,b,c} q_a\, P(b \mid a, t_i)\, P(c \mid a, t_j)\, P(L_i \mid b)\, P(L_j \mid c) \]
and applying reversibility, q_a P(b|a,t_i) = q_b P(a|b,t_i), we get:
\[ P(x^\bullet_u \mid T, t^\bullet) = \sum_{b,c} \Big[ \sum_{a} P(c \mid a, t_j)\, P(a \mid b, t_i) \Big]\, q_b\, P(L_i \mid b)\, P(L_j \mid c) \]
This effectively shifts the root to node i.
Multiplicativity allows us to simplify the inner sum:
\[ \sum_{a} P(c \mid a, t_j)\, P(a \mid b, t_i) = P(c \mid b, t_i + t_j) \]
This reflects the shift of the root from node 2n−1 to node i.
Reversibility & Independence of
Root Position
The root can be shifted to any position in the tree.
Likelihood is independent of where the root is.
This is Felsenstein’s so-called ‘pulley principle’.
Q: What is the significance of the ‘pulley principle’?
A: the likelihood search need only consider unrooted
trees provided multiplicativity & reversibility
obtain.
Likelihood for Inference
We have investigated:
1. Simple evolutionary models (Jukes-Cantor, Kimura)
2. Felsenstein’s algorithm for computing likelihood
Now we look at finding the ‘Best’ tree.
Maximum Likelihood assumes that the tree with the
highest likelihood score is the best tree.
Recall: we must search for the T and t• that maximize the likelihood P(x• | T, t•)
Likelihood for Inference
Maximizing P(x• | T, t•) requires:
1. Searching over possible tree topologies, T
2. Searching over possible lengths of edges, t•
For small numbers of sequences, enumerating all trees is ok.
Recall, however: there are (2n – 3)!! rooted binary trees with n leaves.
– For n = 10 there are about 34 million trees
– For n = 20 there are about 8.2 × 10^21 trees
Likelihood for Inference
Approach for small numbers of sequences (2-5):
– Enumerate all trees
– For each tree
  • The likelihood is explicitly expressed as a function of edge lengths
  • Maximize it with a numerical optimization technique such as Newton’s method
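For the two-leaf Jukes-Cantor tree the log-likelihood is an explicit function of a single quantity, so Newton's method can be written out in a few lines. The sketch below optimizes d = α(t1 + t2), since the likelihood depends on the rate and the edge lengths only through that product; the function name, starting point and iteration count are assumptions made for illustration. With x = exp(−4d), the log-likelihood is match·ln(1 + 3x) + mismatch·ln(1 − x) up to a constant.

```python
import math


def newton_branch_length(match, mismatch, d=0.1, iterations=20):
    """Maximize match*ln(1+3x) + mismatch*ln(1-x), x = exp(-4d), over d.

    Assumes mismatch > 0 and a starting point d > 0 reasonably close to the
    optimum; no safeguards against overshooting are included.
    """
    for _ in range(iterations):
        x = math.exp(-4.0 * d)
        # First derivative of the log-likelihood with respect to d.
        first = (-12.0 * match * x / (1.0 + 3.0 * x)
                 + 4.0 * mismatch * x / (1.0 - x))
        # Second derivative: d(first)/dx times dx/dd = -4x.
        dfirst_dx = (-12.0 * match / (1.0 + 3.0 * x) ** 2
                     + 4.0 * mismatch / (1.0 - x) ** 2)
        second = -4.0 * x * dfirst_dx
        d -= first / second  # Newton update
    return d


# 8 matching and 3 mismatching sites, as in the two-sequence example.
d_hat = newton_branch_length(8, 3)
print(d_hat, 0.25 * math.log(11.0 / 7.0))
```

Setting the first derivative to zero by hand gives x = (3·match − mismatch) / (3·(match + mismatch)), i.e. x = 7/11 for this example, which is the second value printed.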
Likelihood for Inference
For larger numbers of sequences:
– There is the problem of tree space search (next section)
– Felsenstein’s algorithm can be used to compute likelihood
– Edge length optimization can be done by:
  • Felsenstein’s EM algorithm or
  • a standard optimiser such as conjugate gradients
    – Using a conjugate gradient method requires the derivatives of the likelihood with respect to the edge lengths.
    – To obtain the derivative with respect to t_k, replace P(y^k | y^{a(k)}, t_k) with ∂P(y^k | y^{a(k)}, t_k)/∂t_k
Likelihood for Inference
Instead of maximizing likelihood, consider a
Bayesian approach:
• Given the prior probability P(T, t•)
• Calculate the posterior probability P(T, t• | x•)
Q: How can we do this?
A: Use Bayes’ rule:
\[ P(T, t^\bullet \mid x^\bullet) = \frac{P(x^\bullet \mid T, t^\bullet)\, P(T, t^\bullet)}{P(x^\bullet)} \]
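When the candidate trees can be enumerated, Bayes' rule reduces to normalizing likelihood times prior. The toy sketch below treats three candidate trees with fixed edge lengths; the tree labels, likelihood values and uniform prior are invented purely for illustration and do not come from the lecture.

```python
def posterior(likelihoods, priors):
    """Posterior over an enumerable set of trees via Bayes' rule."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # P(x): sum over all enumerated trees
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical values of P(x | T, t) for three candidate trees.
likelihoods = {"T1": 2.0e-9, "T2": 1.0e-9, "T3": 1.0e-9}
priors = {"T1": 1.0 / 3.0, "T2": 1.0 / 3.0, "T3": 1.0 / 3.0}

for tree, probability in posterior(likelihoods, priors).items():
    print(tree, probability)
```

With continuous edge lengths the sum in the denominator becomes an integral over edge lengths as well, which is what makes P(x) hard to compute for larger problems.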
Likelihood for Inference
Q: What does the posterior probability P(T, t | x)
express?
A: The probability of the tree and edge lengths given
the data.
This approach is easy for small numbers of sequences.
Q: Why is this true?
A: All trees can be enumerated, so P(T, t) is known.
Likelihood for Inference
In order to use Bayes’ rule, we need:
1. P(x): we can derive it from the evolution model, but we still need to integrate over tree space.
2. P(x | T, t): we’ve spent most of today deriving this.
3. P(T, t): we can derive it by enumerating all trees.
Likelihood for Inference
For larger numbers of sequences, it is not possible to
enumerate all trees
Q: If we can’t enumerate all trees how can we derive
the posterior probability?
A: we sample from the posterior distribution on the
space of trees and edge lengths.
Idea: randomly sampling the posterior distribution
should produce a sample that reflects the
distribution.
Likelihood for Inference
In the limit of a large sampling, properties with high probability will appear with frequency proportional to their probability.
⇒ The frequency of a property is an estimate of its probability.
Q: How do we sample randomly?
A: We can use the Metropolis algorithm, i.e., Markov chain Monte Carlo (MCMC) methodology.
Note: sadly, the semester is virtually over, so there isn’t time to explore MCMC in depth.
Likelihood for Inference
Superficially, MCMC samples by creating a sequence of trees as follows:
– A proposal distribution is assumed.
– Each tree is randomly created from the previous tree via the proposal distribution.
– Let P1 = P(T, t | x) be the posterior probability of the current tree.
– Let P2 = P(Ŧ, ŧ | x) be the posterior probability of the proposed new tree.
– If P2 ≥ P1 then the new tree is accepted.
– If P2 < P1 then the new tree is accepted with probability P2/P1.
Likelihood for Inference
– If P2 < P1 then the new tree is accepted with probability P2/P1.
– If Ŧ is rejected, then T becomes the current tree again, i.e., T is counted once more in the sample.
– This last point is key since we are interested in the frequency, i.e., distribution.
– If the proposal distribution is symmetrical in the sense that
  • the probability of proposing Ŧ, ŧ from T, t
  is the same as
  • the probability of proposing T, t from Ŧ, ŧ
  then the procedure correctly samples the posterior distribution.
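The acceptance rule can be tried out on a toy problem before worrying about trees. The sketch below runs the Metropolis rule on five abstract states arranged in a ring, standing in for trees, with unnormalized posterior weights 1, 2, 4, 2, 1; the states, weights, seed and step count are all assumptions for illustration. The proposal moves one step left or right with equal probability, so it is symmetric in the sense described above.

```python
import random


def metropolis(weights, steps, seed=0):
    """Sample states with probability proportional to `weights`."""
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(weights)
    for _ in range(steps):
        # Symmetric proposal: move to the left or right neighbour on a ring.
        proposal = (state + rng.choice((-1, 1))) % len(weights)
        ratio = weights[proposal] / weights[state]  # P2 / P1
        if ratio >= 1.0 or rng.random() < ratio:
            state = proposal
        # A rejected proposal leaves `state` unchanged, and the current
        # state is counted again; this is what makes the frequencies right.
        counts[state] += 1
    return [count / steps for count in counts]


frequencies = metropolis([1.0, 2.0, 4.0, 2.0, 1.0], steps=200000)
print([round(f, 3) for f in frequencies])  # target: 0.1, 0.2, 0.4, 0.2, 0.1
```

Only ratios of posterior probabilities are needed, so the normalizing constant P(x) never has to be computed.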
Proposal Distribution
Proposal distribution: Key to Metropolis algorithm
Q: What should we expect if we selected trees truly at random from the space of all trees?
A: Small posterior probability, i.e., P(T, t | x) is small.
⇒ Many proposed trees will be a waste of time.
Q: What should we expect if we select trees that are very close to previous trees?
A: We will explore the space very slowly.
⇒ Many more steps will be required.
⇒ Tuning the proposal distribution is the hardest part of using the Metropolis algorithm.
Proposal Distribution
Mau et al. suggest generating candidate trees by:
1. Adjusting edges
2. Reordering the assignments of sequences to leaves
Adjustment of edges: based on the so-called “traversal profile”
– Equivalent to the original tree
– 2D representation of the tree
– Node position:
  • Vertical: equal to the sum of the edges to the node from the root in the tree
  • Horizontal: spaced according to inorder traversal order.
– Nodes numbered by inorder traversal
Proposal Distribution
Below: tree & corresponding traversal profile
Notice: for any internal node k:
– Descendants on the left have numbers smaller than k.
– Descendants on the right have numbers larger than k.
[Figure: a binary tree whose 17 nodes are numbered 1-17 by inorder traversal, shown next to its traversal profile.]
Proposal Distribution
Given a traversal profile, construct the tree by:
– The highest node is the root.
– Recurse:
  • The left child is the highest node to the left but not horizontally past an ancestor.
  • The right child is the highest node to the right, but not horizontally past an ancestor.
[Figure: the traversal profile of the 17-node example and the tree reconstructed from it.]
Proposal Distribution
Edge adjustment is effected by:
• Shifting the vertical position up/down by Δ.
• Δ is chosen from a uniform distribution.
• Construct the tree from the traversal profile.
[Figure: the 17-node traversal profile before and after an edge adjustment, with the tree reconstructed from each.]
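Rebuilding a tree from a traversal profile is a short recursion: pick the highest node in the current horizontal range, then recurse on the ranges to its left and right. The sketch below is one way to express this, not the method of Mau et al. as published: it stores the profile as a list of vertical positions measured downward (so the highest node has the smallest value), indexed by inorder position, and it uses a 7-node profile with made-up depths rather than the 17-node example on the slides.

```python
def tree_from_profile(depths, lo=0, hi=None):
    """Rebuild a tree from a traversal profile.

    depths[k] is the vertical position of the node with inorder number k + 1,
    measured downward. Returns nested tuples (node_number, left, right),
    with None for a missing child.
    """
    if hi is None:
        hi = len(depths) - 1
    if lo > hi:
        return None
    # The highest node in the range [lo, hi] is the root of this subtree;
    # restricting the range is what stops the search at an ancestor.
    top = min(range(lo, hi + 1), key=lambda k: depths[k])
    return (top + 1,
            tree_from_profile(depths, lo, top - 1),
            tree_from_profile(depths, top + 1, hi))


# Hypothetical profile: leaves are nodes 1, 3, 5, 7; inner nodes are 2, 4, 6.
profile = [2.0, 1.0, 2.0, 0.0, 2.5, 1.5, 2.0]
print(tree_from_profile(profile))

# Edge adjustment: shift the vertical positions of the inner nodes, rebuild.
adjusted = [2.0, 1.2, 2.0, 1.0, 2.5, 0.5, 2.0]
print(tree_from_profile(adjusted))
```

In this toy case the adjustment moves node 6 above node 4, so the rebuilt tree has a different topology: leaves 5 and 7 are no longer siblings.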
Proposal Distribution
Observation: Edge adjustment
– Can produce a new topology
– Does not allow non-adjacent leaves to become adjacent.
Branch direction switching accomplishes this.
– Branches are randomly switched at a node
– This reorders leaves and results in non-adjacent leaves becoming adjacent
– This does not change the posterior probability.
– Hence this proposed tree is always accepted.
Proposal Distribution
Branch direction switching example:
[Figure: the 17-node example tree before and after switching the branch directions at a node, which reorders the leaves.]
Likelihood for Inference
In Summary:
Branch direction swapping
– Does not change posterior probabilities
– Does lead to a new part of tree space
Edge adjustment
– Does change posterior probabilities
– Changes vary continuously with the size of the adjustment
– Can change topology, but not always.