Building phylogenetic trees

Contents
- Phylogeny
- Phylogenetic trees
- How to make a phylogenetic tree from pairwise distances
  - UPGMA method (+ an example)
  - Neighbor-Joining method (+ an example)
- Comparison of methods
- Conclusion

Phylogeny
- Phylogeny is the evolutionary history of related species or genes
- Phylogenetic tree: a diagram showing the evolutionary lineages of species or genes
- The evolutionary histories of genes and of species may be very different
- Genes can be homologous or analogous and still resemble each other
- Homologous sequences can be divided into two groups:
  - Orthologous sequences diverged by speciation from a common ancestor
  - Paralogous sequences evolved by gene duplication within a species
- Analogous sequences may appear and function very similarly, but they do not have a common ancestor
WHEN WE WANT TO EXPLORE EVOLUTIONARY
RELATIONSHIPS, WE NEED TO HANDLE
ORTHOLOGOUS SEQUENCES
[Diagram: genes are either homologous (orthologous or paralogous) or analogous]

Phylogenetic trees
WHY construct a phylogenetic tree?
- to understand the lineage of various species
- to understand how various functions evolved
- to inform multiple alignments

- Trees can be rooted (a common ancestor is known) or unrooted
- Leaves are the terminal nodes that correspond to the observed sequences of genes or species (A, B, C, D)
- Internal nodes are hypothetical ancestral nodes
- All trees will be assumed to be binary, meaning that an edge that branches splits into two daughter edges
- Each edge has a certain amount of evolutionary divergence associated with it, defined by some measure of distance between sequences, or by a model of substitution of residues over the course of evolution
[Figure: example phylogenetic tree with HRV sequences as leaves]

Phylogenetic trees

Different ways to represent a phylogenetic tree
(illustrated by Treeview)
[Figure: a tree of HRV sequences drawn in several Treeview layouts; scale bar 0.1]

Different algorithms used to infer phylogeny from sequence data
1. Distance methods
2. Parsimony
3. Likelihood
4. Probabilistic methods
5. Phylogenetic invariants

Route from the molecular sequences to the phylogenetic tree
Distance methods:
- Select a set of related (orthologous) nucleotide or amino acid sequences
- Perform a multiple sequence alignment (the Clustal series is widely used)
- Calculate the pairwise distances between the sequences using a chosen evolutionary model of substitution (the distances describe the evolution: the smaller the distance, the more closely related the sequences)
- Select the most suitable algorithm to infer the phylogeny
- View the tree with a tree-viewing program (Treeview, NJPlot, ...)
Hamming Distance
Making a tree from pairwise distances
- Distances dij between each pair of sequences i and j are calculated in the given dataset
- Different ways of defining distances
  - For nucleotide sequences: Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model
  - For amino acid sequences: PAM matrices, BLOSUM matrices

     A    B    C    D
A    0   32   44   46
B   32    0   29   43
C   44   29    0   30
D   46   43   30    0

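As a rough illustration of how such a pairwise matrix might be computed from an alignment, the sketch below uses the simplest nucleotide option listed above: the proportion of differing sites (a normalised Hamming distance) followed by the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The function names, the gap handling and the toy alignment are assumptions made for this example; they are not taken from the slides.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns where either sequence has a gap are skipped.
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)

def jukes_cantor(p):
    """Jukes-Cantor distance (expected substitutions per site) for p-distance p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of Jukes-Cantor distances for aligned sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jukes_cantor(p_distance(seqs[i], seqs[j]))
    return d

# Toy alignment (made-up sequences, for illustration only).
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACTA"]
matrix = distance_matrix(alignment)
for row in matrix:
    print(["%.3f" % value for value in row])
```

The corrected distance is always at least as large as the raw proportion of differences, because it accounts for multiple substitutions at the same site.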
Distance matrix methods
UPGMA
- Algorithm introduced by Sokal and Michener 1958
Neighbor-Joining
- Algorithm introduced by Saitou and Nei 1987
- Modified by Studier and Keppler 1988

Clustering method: UPGMA
- UPGMA = unweighted pair group method using arithmetic averages
- Simple method
- It works by clustering the sequences, at each stage merging two clusters and creating a new node on the tree
- The method assumes an equal rate of evolutionary change along all branches → molecular clock assumption

UPGMA
[Figure: rooted tree with leaves A, C, B and D]
- UPGMA produces a rooted tree
- Branch lengths satisfy a molecular clock
- The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
- Trees that are clocklike are rooted, and the total branch length from the root to any leaf is equal
- Such trees are often referred to as ultrametric
- Distances are ultrametric if, for any three sequences i, j and k, either all three distances are equal (dij = dik = djk) or two of them are equal and the third is smaller (djk < dij = dik)
- UPGMA is guaranteed to build the correct tree if the distances are ultrametric
- The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages → this assumption is criticised in the phylogeny literature
- Suitable for closely related species
- Running time O(n^2)
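The ultrametric condition above can be tested directly on a distance matrix: for every triple of sequences, the two largest of the three distances must be equal. The following is a minimal sketch; the function name, the tolerance parameter and the small clocklike matrix are assumptions for illustration, while the 4 x 4 matrix is the one used in the UPGMA example of these slides.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """True if every triple has its two largest distances equal (within tol)."""
    for i, j, k in combinations(range(len(d)), 3):
        smallest, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True

# Matrix from the worked UPGMA example (A, B, C, D).
example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]

# Hypothetical clocklike matrix: two close sequences and one outgroup.
clocklike = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

print(is_ultrametric(example))    # False
print(is_ultrametric(clocklike))  # True
```

For the example matrix the triple A, B, C already fails: the distances are 8, 7 and 9, so the two largest (8 and 9) differ.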
Algorithm: UPGMA
Initialisation:
- Assign each sequence i in the dataset to its own cluster Ci
- Define one leaf of T for each sequence, and place it at height zero
Iteration:
- Find the two clusters i and j for which dij is the smallest (pick randomly if several distances are equal)
- Define a new cluster ij by Cij = Ci ∪ Cj. Cluster ij has nij = ni + nj members (initially ni = 1)
- Connect i and j on the tree to a new node v
- Place the new node v at the height below, which fixes the branch lengths from v to i and j
\[ \frac{d_{ij}}{2} \]
Algorithm: UPGMA (cont.)
Iteration (cont.)
- Compute the distances between the new cluster and each remaining cluster k by using
\[ d_{(ij),k} = \left( \frac{n_i}{n_i + n_j} \right) d_{ik} + \left( \frac{n_j}{n_i + n_j} \right) d_{jk} \]
- Add ij to the current clusters and remove i and j
Termination:
- When only two clusters i and j remain, place the root at height
\[ \frac{d_{ij}}{2} \]
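The initialisation, iteration and termination steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `upgma`, the return format, and the choice to break ties by taking the first minimum (rather than randomly) are all choices made for this example. Cluster labels are formed by concatenating member labels, so the input names are assumed not to collide under concatenation.

```python
def upgma(names, dist):
    """Cluster with UPGMA; return (merges, height).

    merges is a list of (cluster_i, cluster_j, new_cluster, node_height);
    height maps every leaf and cluster label to the height of its node.
    """
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}   # leaves sit at height zero
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    active = list(names)
    merges = []
    while len(active) > 1:
        # Closest pair of clusters; ties go to the first pair found.
        pairs = [(a, b) for x, a in enumerate(active) for b in active[x + 1:]]
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]])
        new = i + j
        ni, nj = size[i], size[j]
        # Size-weighted average distance to every remaining cluster.
        d[new] = {}
        for k in active:
            if k not in (i, j):
                d[new][k] = d[k][new] = (ni * d[i][k] + nj * d[j][k]) / (ni + nj)
        size[new] = ni + nj
        height[new] = d[i][j] / 2.0          # new node at height d_ij / 2
        merges.append((i, j, new, height[new]))
        active = [new] + [k for k in active if k not in (i, j)]
    return merges, height

# Distance matrix of the worked example in these slides.
matrix = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
merges, height = upgma(["A", "B", "C", "D"], matrix)
for i, j, new, h in merges:
    print(f"join {i} and {j} into {new} at height {h:.2f}")
```

Scanning all pairs in every round makes this sketch cubic in the number of sequences; faster implementations keep the distances in a structure that avoids the full rescan.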
An example UPGMA (1)
Distance matrix (arbitrary) for four items (sequences) A, B, C and D

     A    B    C    D
A    0    8    7   12
B    8    0    9   14
C    7    9    0   11
D   12   14   11    0

Note that these distances are not ultrametric: for A, B and C, for example, the three distances are dAB = 8, dAC = 7 and dBC = 9, so they are neither all equal nor do the two largest coincide.
Step 1. Find the smallest distance, dij, between two clusters → A and C, where dAC = 7
An example UPGMA (2)
Step 2. Define the new cluster ij, which has nij = ni + nj members (initially ni = 1)
New cluster → A and C, nAC = nA + nC = 2
Step 3. Connect A and C on the tree to a new node v1
Step 4. The branch lengths from the new node v1 to A and C are
\[ \frac{d_{AC}}{2} = \frac{7}{2} = 3.5 \]
[Figure: A and C joined at node v1, each branch of length 3.5]

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

An example UPGMA (3)
Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D):
\[ d_{(AC),B} = \left( \frac{n_A}{n_A + n_C} \right) d_{AB} + \left( \frac{n_C}{n_A + n_C} \right) d_{CB} = \frac{1}{2} \cdot 8 + \frac{1}{2} \cdot 9 = 8.5 \]
\[ d_{(AC),D} = \left( \frac{n_A}{n_A + n_C} \right) d_{AD} + \left( \frac{n_C}{n_A + n_C} \right) d_{CD} = \frac{1}{2} \cdot 12 + \frac{1}{2} \cdot 11 = 11.5 \]
Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC

New distance matrix

      AC     B      D
AC    0      8.5    11.5
B            0      14
D                   0

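The update in Step 5 is a size-weighted average, which is why the method is described as using arithmetic averages: the result equals the plain mean of the original distances between every member of the merged cluster and every member of the other cluster. A small sketch (the helper name `cluster_distance` is an assumption for illustration) reproduces the two values above and checks this equivalence for the later merge of AC with B.

```python
def cluster_distance(n_i, n_j, d_ik, d_jk):
    """Distance from merged cluster ij to cluster k, weighted by cluster sizes."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

# First merge: A and C, each of size 1.
d_AC_B = cluster_distance(1, 1, 8, 9)     # d_AB = 8,  d_CB = 9
d_AC_D = cluster_distance(1, 1, 12, 11)   # d_AD = 12, d_CD = 11
print(d_AC_B, d_AC_D)                     # 8.5 11.5

# Second merge: AC (size 2) and B (size 1), distance to D.
d_ACB_D = cluster_distance(2, 1, d_AC_D, 14)   # d_BD = 14

# Same value as the plain mean of the original distances d_AD, d_CD, d_BD.
plain_mean = (12 + 11 + 14) / 3
print(round(d_ACB_D, 2), round(plain_mean, 2))
```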
An example UPGMA (4)
2nd iteration
Step 1. Find the two clusters i and j for which dij is the smallest (pick randomly if several distances are equal) → AC and B, where d(AC),B = 8.5

      AC     B      D
AC    0      8.5    11.5
B            0      14
D                   0

Step 2. Define the new cluster ij, which has nij = ni + nj members (initially ni = 1)
New cluster → AC and B, nACB = nAC + nB = 2 + 1 = 3
Step 3. Connect AC and B on the tree to a new node v2
Step 4. The new node v2 is placed at height
\[ \frac{d_{(AC),B}}{2} = \frac{8.5}{2} = 4.25 \]
so the branch from v2 to B has length 4.25 and the branch from v2 to v1 has length 4.25 - 3.5 = 0.75
[Figure: A and C joined at v1 (height 3.5); v1 and B joined at v2 (height 4.25)]
An example UPGMA (5)
Step 5. Compute the distance between the new cluster and the remaining cluster (D)
\[ d_{(ACB),D} = \left( \frac{n_{AC}}{n_{AC} + n_B} \right) d_{(AC),D} + \left( \frac{n_B}{n_{AC} + n_B} \right) d_{BD} = \frac{2}{3} \cdot 11.5 + \frac{1}{3} \cdot 14 = 12.33 \]
Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB

New distance matrix

       ACB    D
ACB    0      12.33
D             0

An example UPGMA (6)
Termination:
Only two clusters (ACB and D) remain. Place the root at height
\[ \frac{d_{(ACB),D}}{2} = \frac{12.33}{2} = 6.17 \]

Original distance matrix and final phylogenetic tree (including the branch lengths)

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

[Figure: final UPGMA tree. Branch lengths: A to v1 3.5, C to v1 3.5, v1 to v2 0.75, B to v2 4.25, v2 to root 1.92, D to root 6.17]
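The branch lengths in the final tree follow from the node heights: each branch is the height of the parent node minus the height of the child. The sketch below recomputes them from the heights found in the example and confirms that every leaf lies at the same distance from the root. The node labels v1 and v2 come from the slides; the label `root` and the dictionary layout are assumptions for this illustration.

```python
# Node heights from the worked example (leaves are at height zero).
heights = {
    "A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0,
    "v1": 7 / 2,            # A and C joined, d_AC = 7
    "v2": 8.5 / 2,          # AC and B joined, d_(AC),B = 8.5
    "root": (37 / 3) / 2,   # ACB and D joined, d_(ACB),D = 37/3 = 12.33
}
parent = {"A": "v1", "C": "v1", "v1": "v2", "B": "v2", "v2": "root", "D": "root"}

# Branch length = parent height minus child height.
branch = {child: heights[p] - heights[child] for child, p in parent.items()}

def root_distance(node):
    """Sum of branch lengths on the path from a node up to the root."""
    total = 0.0
    while node in parent:
        total += branch[node]
        node = parent[node]
    return total

for child in ["A", "C", "v1", "B", "v2", "D"]:
    print(child, "->", parent[child], round(branch[child], 2))
print([round(root_distance(leaf), 2) for leaf in "ABCD"])
```

The tree distance between A and B is twice the height of v2, that is 8.5, whereas the input distance was 8: because the input is not ultrametric, the UPGMA tree cannot reproduce every original distance exactly.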
Neighbor-Joining (N-J)
[Figure: unrooted tree with leaves A, B, C and D]
- Another algorithm that works by clustering the sequences
- Does not assume a molecular clock
- N-J trees are unrooted
- N-J assumes additivity
- Def. Edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
- The method uses an approximate algorithm, in which the tree is built by repeatedly finding a pair of neighboring leaves i and j whose joining minimizes the total length of the tree, and then joining them
- Running time O(n^3) for the standard algorithm
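Additivity can be checked on a distance matrix without building a tree, using the four-point condition: for every four leaves i, j, k, l, the two largest of the sums d_ij + d_kl, d_ik + d_jl and d_il + d_jk must be equal. The sketch below is a minimal check under that condition; the function name and tolerance are assumptions for illustration, and the two matrices are the ones shown earlier in these slides.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """True if every quadruple of leaves satisfies the four-point condition."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Matrix used in the UPGMA and N-J worked examples.
example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]

# Matrix from the "Making a tree from pairwise distances" slide.
first_matrix = [
    [0, 32, 44, 46],
    [32, 0, 29, 43],
    [44, 29, 0, 30],
    [46, 43, 30, 0],
]

print(is_additive(example))       # True: the sums are 19, 21 and 21
print(is_additive(first_matrix))  # False: the sums are 62, 87 and 75
```

Since the example matrix is additive, N-J is expected to recover a tree whose path lengths match the input distances exactly.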
Algorithm: Neighbor-Joining
Initialisation:
- Define T to be the set of leaf nodes, one for each given sequence
Iteration:
- Compute, for each sequence i, the quantity below, where n is the number of sequences in the distance matrix
\[ u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2} \]
- Pick a pair i and j for which dij - ui - uj is the smallest (pick randomly if several are equal)
- Join items i and j with a new node v
- Compute the branch lengths from the new node v to items i and j
\[ d_{i,v} = \frac{d_{ij}}{2} + \frac{u_i - u_j}{2}, \qquad d_{j,v} = \frac{d_{ij}}{2} + \frac{u_j - u_i}{2} \]
- Compute the distances between the new node v and each remaining item k
\[ d_{(ij),k} = \frac{d_{ik} + d_{jk} - d_{ij}}{2} \]
- Remove i and j from the distance matrix and replace them by the new node v
Termination:
- When only two items i and j remain, add the remaining edge between i and j, with length dij
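The steps above can be sketched as a short Python function. This is an illustrative implementation under stated assumptions rather than a reference one: the function name, the edge-list return format and the internal node names `v1`, `v2`, ... are choices made here, ties are broken by taking the first minimum instead of a random one, and the input labels are assumed not to clash with the generated node names.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree; return a list of (node, node, edge_length)."""
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    active = list(names)
    edges = []
    joined = 0
    while len(active) > 2:
        n = len(active)
        # u_i: sum of distances from i to all other items, divided by n - 2.
        u = {a: sum(d[a][b] for b in active if b != a) / (n - 2) for a in active}
        # Pair minimising d_ij - u_i - u_j; ties go to the first pair found.
        pairs = [(a, b) for x, a in enumerate(active) for b in active[x + 1:]]
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        joined += 1
        v = f"v{joined}"
        # Branch lengths from the new node v to i and j.
        d_iv = d[i][j] / 2 + (u[i] - u[j]) / 2
        d_jv = d[i][j] - d_iv
        edges.append((i, v, d_iv))
        edges.append((j, v, d_jv))
        # Distances from v to every remaining item.
        d[v] = {}
        for k in active:
            if k not in (i, j):
                d[v][k] = d[k][v] = (d[i][k] + d[j][k] - d[i][j]) / 2
        active = [v] + [k for k in active if k not in (i, j)]
    # Termination: connect the last two items directly.
    i, j = active
    edges.append((i, j, d[i][j]))
    return edges

# Distance matrix of the worked example in these slides.
matrix = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
edges = neighbor_joining(["A", "B", "C", "D"], matrix)
for a, b, length in edges:
    print(a, b, length)
```

Recomputing every u_i and scanning all pairs in each round is what makes the standard algorithm cubic in the number of sequences.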
An example N-J (1)
Step 1. Compute $u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$ for each row in the distance matrix (here n = 4)

     A    B    C    D    u_i
A    0    8    7   12    (8+7+12)/(4-2) = 13.5
B    8    0    9   14    (8+9+14)/(4-2) = 15.5
C    7    9    0   11    (7+9+11)/(4-2) = 13.5
D   12   14   11    0    (12+14+11)/(4-2) = 18.5

Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest (most negative) value

     A                      B                      C                      D
A    0                      8                      7                      12
B    8-(13.5+15.5) = -21    0                      9                      14
C    7-(13.5+13.5) = -20    9-(15.5+13.5) = -20    0                      11
D    12-(13.5+18.5) = -20   14-(15.5+18.5) = -20   11-(13.5+18.5) = -21   0

The pairs (A, B) and (C, D) tie at -21; A and B are joined first in this example.
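Steps 1 and 2 are easy to reproduce numerically. The following sketch recomputes the u values and the selection criterion for the example matrix and lists every pair that attains the minimum; the variable names are assumptions for illustration.

```python
names = ["A", "B", "C", "D"]
d = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
n = len(names)

# Step 1: the diagonal is zero, so the row sum equals the sum over j != i.
u = [sum(row) / (n - 2) for row in d]

# Step 2: criterion d_ij - (u_i + u_j) for the lower-diagonal entries (j < i).
q = {(names[i], names[j]): d[i][j] - (u[i] + u[j])
     for i in range(n) for j in range(i)}

best = min(q.values())
candidates = sorted(pair for pair, value in q.items() if value == best)

print(u)           # [13.5, 15.5, 13.5, 18.5]
print(best)        # -21.0
print(candidates)  # [('B', 'A'), ('D', 'C')]
```

Both tied pairs are valid neighbors in the final tree, so either choice leads to the same unrooted topology here.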
An example N-J (2)
Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1
\[ d_{A,v_1} = \frac{d_{AB}}{2} + \frac{u_A - u_B}{2} = \frac{8}{2} + \frac{13.5 - 15.5}{2} = 3 \]
\[ d_{B,v_1} = \frac{d_{AB}}{2} + \frac{u_B - u_A}{2} = \frac{8}{2} + \frac{15.5 - 13.5}{2} = 5 \]
[Figure: A and B joined at node v1 with edge lengths 3 and 5]
Step 4. Compute the distances between the new node v1 and the remaining items (C and D)
\[ d_{(AB),C} = \frac{d_{AC} + d_{BC} - d_{AB}}{2} = \frac{7 + 9 - 8}{2} = 4 \]
\[ d_{(AB),D} = \frac{d_{AD} + d_{BD} - d_{AB}}{2} = \frac{12 + 14 - 8}{2} = 9 \]
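The arithmetic of Steps 3 and 4 can be wrapped in two small helpers, shown here as a sketch; the function names are assumptions for illustration and the inputs are the values from the example.

```python
def nj_branch_lengths(d_ij, u_i, u_j):
    """Edge lengths from the new node to i and to j."""
    d_iv = d_ij / 2 + (u_i - u_j) / 2
    d_jv = d_ij - d_iv   # equivalent to d_ij / 2 + (u_j - u_i) / 2
    return d_iv, d_jv

def nj_reduced_distance(d_ik, d_jk, d_ij):
    """Distance from the new node joining i and j to a remaining item k."""
    return (d_ik + d_jk - d_ij) / 2

# Step 3: join A and B (d_AB = 8, u_A = 13.5, u_B = 15.5).
d_A_v1, d_B_v1 = nj_branch_lengths(8, 13.5, 15.5)

# Step 4: distances from v1 to C and to D.
d_AB_C = nj_reduced_distance(7, 9, 8)     # d_AC, d_BC, d_AB
d_AB_D = nj_reduced_distance(12, 14, 8)   # d_AD, d_BD, d_AB

print(d_A_v1, d_B_v1, d_AB_C, d_AB_D)     # 3.0 5.0 4.0 9.0
```

The two edge lengths always sum to d_ij, so the path from i to j through the new node keeps its original length.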
An example N-J (3)
Step 5. Delete A and B from the distance matrix and replace them by the new item AB
Step 6. Continue from step 1, because more than two items remain

New reduced distance matrix
Step 1. Compute $u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$ for each row in the distance matrix (now n = 3)

      AB    C     D     u_i
AB    0     4     9     (4+9)/1 = 13
C     4     0     11    (4+11)/1 = 15
D     9     11    0     (9+11)/1 = 20

Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest

      AB                C                  D
AB    0                 4                  9
C     4-(13+15) = -24   0                  11
D     9-(13+20) = -24   11-(15+20) = -24   0

All three pairs tie at -24; AB (node v1) and C are joined in this example.
An example N-J (4)
Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2
\[ d_{v_1,v_2} = \frac{d_{(AB),C}}{2} + \frac{u_{AB} - u_C}{2} = \frac{4}{2} + \frac{13 - 15}{2} = 1 \]
\[ d_{C,v_2} = \frac{d_{(AB),C}}{2} + \frac{u_C - u_{AB}}{2} = \frac{4}{2} + \frac{15 - 13}{2} = 3 \]
[Figure: A (3) and B (5) joined at v1; v1 (1) and C (3) joined at v2]
Step 4. Compute the distance between the new node v2 and the remaining item (D)
\[ d_{(ABC),D} = \frac{d_{(AB),D} + d_{CD} - d_{(AB),C}}{2} = \frac{9 + 11 - 4}{2} = 8 \]
An example N-J (5)
Step 5. Delete AB and C from the distance matrix and replace them by ABC

       ABC   D
ABC    0     8
D            0

Step 6. Only two nodes remain → connect them (edge length 8)

Original distance matrix and final phylogenetic tree (including the edge lengths)

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

[Figure: final unrooted N-J tree. Edge lengths: A to v1 3, B to v1 5, v1 to v2 1, C to v2 3, v2 to D 8]
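One way to check the result is to sum the edge lengths along the path between every pair of leaves and compare with the original matrix. The sketch below does this for the final N-J tree of the example; the edge list is read off the tree above, and the adjacency structure and function name are assumptions for illustration.

```python
from itertools import combinations

# Edges of the final N-J tree from the example: (node, node, length).
edges = [("A", "v1", 3), ("B", "v1", 5), ("v1", "v2", 1),
         ("C", "v2", 3), ("v2", "D", 8)]

original = {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}

# Undirected adjacency list.
adjacent = {}
for a, b, length in edges:
    adjacent.setdefault(a, []).append((b, length))
    adjacent.setdefault(b, []).append((a, length))

def path_length(start, goal):
    """Length of the unique path between two nodes of the tree."""
    stack = [(start, None, 0)]
    while stack:
        node, previous, total = stack.pop()
        if node == goal:
            return total
        for neighbor, length in adjacent[node]:
            if neighbor != previous:   # never walk back along the same edge
                stack.append((neighbor, node, total + length))
    raise ValueError("nodes are not connected")

tree_distances = {pair: path_length(*pair) for pair in combinations("ABCD", 2)}
print(tree_distances == original)   # True
```

Every pairwise distance is reproduced exactly, which is what additivity of the input distances leads one to expect.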
Comparison

UPGMA
- The total branch length from the root to any leaf is equal
- Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree
- Suitable for closely related sequences
- Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages
[Figure: rooted UPGMA tree of A, C, B and D with node heights 3.5, 4.25 and 6.17]

Neighbor-joining
- Unrooted tree, where the direction of evolution is unknown
- Suitable for datasets with largely varying rates of evolution
- Suitable for large datasets
[Figure: unrooted N-J tree with edge lengths A 3, B 5, C 3, D 8 and internal edge 1]

Conclusion
- The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation
- The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within an organism, and regions within a gene
- The N-J method produces an unrooted tree without the molecular clock hypothesis
- The N-J method is one of the most popular methods and is widely used by molecular evolutionists
- Distance methods are strongly dependent on the model of evolution used
- Sequence information is lost when sequence data are transformed into distances
- Distance methods are computationally fast

Reference
- Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 2003. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
- Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA. p. 108
- Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Sunderland, MA. p. 147-170

Examples of phylogeny programs
Multiple sequence alignment
- Clustal series (W, V) (free, http://www-igbmc.ustrasbg.fr/BioInfo/ClustalX/Top.html)
Phylogeny packages
- PAUP (http://paup.csit.fsu.edu/)
- Phylip (free, http://evolution.gs.washington.edu)
- MEGA (free, http://www.megasoftware.net)
Viewing/plotting phylogenetic trees
- Treeview (free, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
- NJPlot (free, http://pbil.univ-lyon1.fr/software/njplot.html)

Further reading
- N-J: Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-25.
- N-J: Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6): 729-31.
- UPGMA: Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
- ClustalX: Thompson, J. D., T. J. Gibson, et al. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24): 4876-82.