Phylogenetics
“Inferring Phylogenies”
Joseph Felsenstein
Excellent reference
What is a phylogeny?
Different Representations





Cladogram - branching pattern only
Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch
Rooted - implies directionality of change
Unrooted - does not imply directionality
How do you root a tree?
What is a phylogeny used for?
\[ \theta = 4 N_e \mu \]
\[ \pi = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \pi_{ij} \]
Estimate a Phylogeny
Sp1 ACCGTCTTGTTA
Sp2 AGCGTCATCAAA
Sp3 AGCGTCATCAAA
Sp4 ACCGTCTTGATA
Sp5 AGCCTCTTCATA
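A natural first look at this alignment is to count how many sites differ between each pair of sequences. The sketch below is only an illustration, not the method used on the slides; the function names `hamming` and `pairwise_differences` are made up for this example.

```python
from itertools import combinations

# The five aligned sequences from the slide.
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp3": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

def hamming(seq_a, seq_b):
    """Count the aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def pairwise_differences(aln):
    """Map each unordered pair of taxa to its number of differing sites."""
    return {
        (a, b): hamming(aln[a], aln[b])
        for a, b in combinations(sorted(aln), 2)
    }

diffs = pairwise_differences(alignment)
for (a, b), d in diffs.items():
    # Second number is the proportion of differing sites (p-distance).
    print(a, b, d, round(d / len(alignment[a]), 3))
```

Sp2 and Sp3 are identical and Sp1 and Sp4 differ at a single site, so any reasonable criterion should pair them.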
Working Tree [figure, built up over successive slides: tips sp1 to sp5 are joined stepwise through internal nodes c2, c4, c7, c9 and c10]
Final Tree [figure: tips sp1 to sp5 with internal nodes c2, c4, c7, c9, c10 and c11]
What optimality criteria do we use then?

Parsimony
Likelihood
Bayesian

Distance methods?


Parsimony

Why should we choose a specific grouping?
Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently
“Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place” [1]
[1] Kitching, I. J., Forey, P. L., Humphries, C. J. & Williams, D. M. 1998. Cladistics: The Theory and Practice of Parsimony Analysis. The Systematics Association Publication No. 11.
Parsimony



Optimality criterion that chooses the topology requiring the fewest character-state transformations
Optimizes one component: tree topology (pattern based)
Most parsimonious tree: the one (or several, if tied) with the minimum number of evolutionary changes (the shortest tree length)
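Reconstructing trees via sequence data
Character matrix (rows are taxa, columns are characters 1-6; O is the outgroup):
      1  2  3  4  5  6
O     T  G  T  A  A  T
A     A  A  T  G  A  G
B     A  G  C  C  -  G
C     A  A  T  G  A  T
D     A  G  C  C  -  T
[Figure: tree of O, A, C, B and D with the character changes mapped onto its branches: 1. T=>A; 2. G=>A; 3. T=>C; 4. A=>G; 4. A=>C; 5. A=>GAP; 6. T=>G (on A); 6. T=>G (on B)]
Tree length = 8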
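The tree length can be checked by counting changes character by character. The sketch below uses Fitch's set-based counting, which is one standard way to do this and is not necessarily how the slide's figure was produced. The topology (O,((A,C),(B,D))) is an assumption read off the mapped changes, the gap is treated as a fifth state, and the names `fitch_site` and `tree_length` are made up for this example.

```python
# Character matrix from the slide; '-' (gap) is treated as a fifth state.
matrix = {
    "O": "TGTAAT",
    "A": "AATGAG",
    "B": "AGCC-G",
    "C": "AATGAT",
    "D": "AGCC-T",
}

# Assumed topology, written as nested pairs: O is sister to ((A,C),(B,D)).
tree = ("O", (("A", "C"), ("B", "D")))

def fitch_site(node, states):
    """Return (possible state set, minimum changes) for one character."""
    if isinstance(node, str):          # a tip: its observed state, no changes
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                         # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(topology, chars):
    """Sum the minimum number of changes over all characters."""
    n_chars = len(next(iter(chars.values())))
    total = 0
    for i in range(n_chars):
        states = {taxon: seq[i] for taxon, seq in chars.items()}
        total += fitch_site(topology, states)[1]
    return total

length = tree_length(tree, matrix)
print("tree length:", length)
```

Under these assumptions the count agrees with the tree length of 8 on the slide, and a rearranged topology such as (O,((A,B),(C,D))) needs more changes.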
Neighbor-joining Method
NJ distance matrices
Finished NJ tree
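The neighbor-joining steps behind these slides can be sketched as follows. The distance matrix here is a small illustrative example and does not come from the slides (their matrices did not survive extraction); the function name `neighbor_joining` and the internal node labels `u1`, `u2`, ... are made up for this example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Neighbor joining on a distance table.

    dist maps frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, parent, branch_length) edges.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        total = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
                 for a in nodes}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        len_i = d_ij / 2 + (total[i] - total[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        new = f"u{next_id}"
        next_id += 1
        edges += [(i, new, len_i), (j, new, len_j)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Illustrative distances only (hypothetical taxa a to e).
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i in range(5) for j in range(i + 1, 5)}

edges = neighbor_joining(names, dist)
for node, parent, length in edges:
    print(node, parent, length)
```

Each pass recomputes the Q criterion on the reduced matrix, which is what the successive "NJ distance matrices" slides step through.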
Models of Evolution
Pyrimidines: T, C
Purines: A, G
Transitions: changes within a class (A <=> G, C <=> T)
Transversions: changes between classes (purine <=> pyrimidine)
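The purine/pyrimidine distinction translates directly into a small classifier. This is a minimal sketch; the names `classify` and `count_ts_tv` are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Label a pair of bases as identical, a transition or a transversion."""
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if a == b:
        return "identical"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"        # purine <=> purine or pyrimidine <=> pyrimidine
    return "transversion"          # purine <=> pyrimidine

def count_ts_tv(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        kind = classify(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv

print(count_ts_tv("ACCGTCTTGTTA", "AGCGTCATCAAA"))  # Sp1 vs Sp2 from the alignment
```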
Maximum Likelihood

Base frequencies: fA + fG + fC + fT = 1
Base exchange: fs + fv = 1
R-matrix: the six relative substitution rates sum to 1
Gamma shape parameter
Number of discrete gamma-distribution categories
Pinvar: fvar + finv = 1

Likelihood: \( L = \prod_i l_i \), where i indexes the characters (sites)





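Maximum Likelihood
L = Pr(D|H)
[Figure: rooted tree with observed tip states A, G, G, C, G, unobserved internal nodes x, y, z and root w, and branch lengths t1 to t8]
Likelihood at site i: the probability of AGGCG via x, y, z, w, summed over all x, y, z, w
\[
L^{(i)} = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(w)\,\Pr(z \mid w, t_8)\,\Pr(x \mid w, t_7)\,\Pr(y \mid z, t_6)\,\Pr(A \mid x, t_1)\,\Pr(G \mid x, t_2)\,\Pr(G \mid z, t_3)\,\Pr(C \mid y, t_4)\,\Pr(G \mid y, t_5)
\]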
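The sum over the unobserved internal states x, y, z and w can be written out by brute force. The sketch below makes several assumptions: the topology (w joins x and z; x carries tips A and G; z carries y and a tip G; y carries tips C and G) is one reading of the figure, the Jukes-Cantor transition probabilities are used for Pr(child | parent, t), Pr(w) is 1/4, and all branch lengths are set to a hypothetical 0.1. The names `jc69` and `site_likelihood` are made up for this example.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(child, parent, t):
    """Jukes-Cantor probability of observing `child` given `parent` after time t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if child == parent else 0.25 - 0.25 * decay

def site_likelihood(tips, t, p=jc69):
    """Likelihood of one site.

    tips: the five observed bases, ordered by their branches t1..t5.
    t:    dict mapping branch number (1..8) to branch length.
    """
    b1, b2, b3, b4, b5 = tips
    total = 0.0
    for x, y, z, w in product(BASES, repeat=4):
        total += (
            0.25                                   # Pr(w): uniform root state
            * p(z, w, t[8]) * p(x, w, t[7]) * p(y, z, t[6])
            * p(b1, x, t[1]) * p(b2, x, t[2]) * p(b3, z, t[3])
            * p(b4, y, t[4]) * p(b5, y, t[5])
        )
    return total

branch_lengths = {k: 0.1 for k in range(1, 9)}     # hypothetical values
print(site_likelihood("AGGCG", branch_lengths))
```

Enumerating all 4^4 internal-state combinations is fine for one small tree; practical programs avoid it with Felsenstein's pruning algorithm, which reuses partial sums at each node.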
ML cont.
\[ L = \prod_{i=1}^{n} L^{(i)} \]
The probability that a nucleotide that is i at time 0 is still i at time t is given by
\[ P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4t/3} \]
The probability that a nucleotide that is i at time 0 is j (j \neq i) at time t is given by
\[ P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4t/3} \]
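The simplest case that combines the product over sites with these two probabilities is a pair of sequences separated by a single branch of length t. The sketch below assumes equal base frequencies of 1/4 and uses Sp1 and Sp5 from the earlier alignment; the function names are made up for this example, and the closed-form distance is the standard Jukes-Cantor correction rather than anything stated on the slides.

```python
import math

def p_same(t):
    """P_ii(t): the base is unchanged after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def p_diff(t):
    """P_ij(t), j != i: the base has become one specific other base."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)

def log_likelihood(seq_a, seq_b, t):
    """ln L for t > 0: sum over sites of ln(1/4 * P(t))."""
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        prob = p_same(t) if a == b else p_diff(t)
        total += math.log(0.25 * prob)
    return total

def jc69_distance(seq_a, seq_b):
    """Closed-form value of t that maximises the likelihood above."""
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

sp1, sp5 = "ACCGTCTTGTTA", "AGCCTCTTCATA"
t_hat = jc69_distance(sp1, sp5)
print(t_hat, log_likelihood(sp1, sp5, t_hat))
```

Working with the log of the likelihood turns the product over sites into a sum, which avoids numerical underflow on long alignments.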
Bayes Theorem
The conditional (posterior) probability of H given D:
\[ \mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}{\mathrm{Prob}(D)} \]
H = Hypothesis, D = Data
Prob(H): prior probability or marginal probability of H
Prob(D|H): likelihood function
Prob(D): prior probability or marginal probability of D, equal to ∑_H P(H) P(D|H)
Normalizing constant: ensures ∑_H P(H|D) = 1
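A small numerical case makes the role of the normalizing constant concrete. The three hypotheses and their likelihood values below are invented purely for illustration, and the function name `posterior` is made up for this example.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a finite set of hypotheses."""
    # Prob(D): sum of prior * likelihood over all hypotheses.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Hypothetical example: three candidate trees with equal prior probability.
priors = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}
likelihoods = {"tree1": 0.02, "tree2": 0.01, "tree3": 0.01}

post = posterior(priors, likelihoods)
print(post)
print(sum(post.values()))
```

With equal priors the posterior is simply the likelihood rescaled to sum to 1, so tree1 receives half of the posterior probability here.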
Take Home Message

Likelihood: represents the probability (P) of the data given the hypothesis => difficult to interpret

Bayes approach: estimates the P of the hypothesis given the data => estimates P for the hypothesis of interest
Bayesian Inference of Phylogeny
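Calculating the posterior probability (pP) of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths (bl) and substitution-model parameter values:
\[
f(\tau_i, \nu_i, \theta \mid X) = \frac{f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}
\]
where \tau_i is the i-th tree topology, \nu_i its branch lengths, \theta the substitution-model parameters, X the data, and B(s) the number of possible trees for s sequences.
Inferences about any single parameter are based on the marginal distribution of that parameter:
\[
f(\tau_i \mid X) = \frac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)} = \frac{\int_{\nu} \int_{\theta} f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)\, d\nu\, d\theta}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}
\]
This marginal probability distribution of the topology, for example, integrates out all the other parameters.
Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree)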
Estimating phylogenies



Exhaustive Searches
Branch and bound methods
Rise in computational time versus rise in solution space
How many topologies are there?
For n taxa, the number of rooted bifurcating trees is
\[ T = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \]
The Phylogenetic Problem
Number of unrooted bifurcating trees for T sequences:
\[ B(T) = \prod_{i=3}^{T} (2i - 5) \]
Number of Seqs    Number of Trees
10                2 x 10^6
100               2 x 10^182
1,000             2 x 10^2,860
10,000            8 x 10^38,658
100,000           1 x 10^486,663
1,000,000         1 x 10^5,866,723
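The growth in the number of trees is easy to reproduce with exact integer arithmetic. This is a minimal sketch; the function names are made up for this example, and the rooted count uses the standard factorial formula (2n-3)! / (2^(n-2) (n-2)!).

```python
from math import factorial

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (4, 5, 10, 100):
    trees = unrooted_trees(n)
    # Print the count for small n, and its number of digits for large n.
    print(n, trees if n <= 10 else f"about 10^{len(str(trees)) - 1}")
```

Rooting an unrooted tree amounts to attaching one extra lineage, so the rooted count for n sequences equals the unrooted count for n + 1.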
HIV-1 Whole Genomes
1993 - 15
2003 (JAN) - 397
Tree Space - the final frontier
Heuristic Searches



Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree
Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) with a subtree attached to it, then reinsert the subtree into the remaining tree in all possible places
Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees. All possible connections are made between a branch of one and a branch of the other.
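For NNI, the rearrangements around one interior branch can be listed directly. The sketch below assumes the branch separates subtrees (a, b) from (c, d), with each subtree given as a label or a nested tuple; the function name `nni_neighbors` is made up for this example, and SPR and TBR are not covered.

```python
def nni_neighbors(a, b, c, d):
    """Return the two NNI rearrangements around one interior branch.

    The current tree groups (a, b) on one side of the branch and (c, d)
    on the other. Swapping one subtree across the branch gives the two
    alternative groupings.
    """
    return [
        ((a, c), (b, d)),   # swap b with c
        ((a, d), (b, c)),   # swap b with d
    ]

# Example: subtrees may themselves be nested tuples.
current = (("sp1", "sp4"), ("sp5", ("sp2", "sp3")))
(a, b), (c, d) = current
for neighbor in nni_neighbors(a, b, c, d):
    print(neighbor)
```

A full NNI search would apply this to every interior branch of the tree and keep any rearrangement that improves the optimality score.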
Other approaches



Tree-fusing - find two near-optimal trees and exchange subgroups between the two trees
Genetic Algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree
Disc Covering - upcoming paper
Phylogenetic Accuracy?



Consistency - A phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite.
Efficiency - Statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem.
Robustness - Robustness refers to the degree to which violations of assumptions will affect performance of phylogenetic methods.
How reliable is MY phylogeny?




Bootstrap Analysis
Jackknife Analysis
Posterior Probabilities (Bayesian Approaches)
Decay Indices
Bootstrap
Pseudoreplicates
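A bootstrap pseudoreplicate is built by resampling alignment columns with replacement until the replicate is as long as the original. The sketch below reuses the five-sequence alignment from earlier in the lecture; the function name `bootstrap_replicate` and the fixed seed are assumptions made for this example.

```python
import random

alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp3": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

def bootstrap_replicate(aln, rng):
    """Resample columns with replacement, keeping each column intact."""
    names = list(aln)
    length = len(aln[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(aln[name][c] for c in columns) for name in names}

rng = random.Random(42)            # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
for name, seq in replicates[0].items():
    print(name, seq)
```

In a full analysis a tree would be estimated from each pseudoreplicate, and the support for a group would be the fraction of those trees that contain it.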