(Evolutionary) trees: Character based methods

Phylogenetic Analysis
based on two talks, by
Caro-Beth Stewart, Ph.D.
Department of Biological Sciences
University at Albany, SUNY
and Tal Pupko, Ph.D.
Faculty of Life Science
Tel-Aviv University
Based on lectures by C-B Stewart,
and by Tal Pupko
What is phylogenetic analysis and why
should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
Common Phylogenetic Tree Terminology
[Figure: a rooted tree of five taxa, A to E, with the following labels]
Terminal Nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or Lineages
Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
Ancestral Node or ROOT of the Tree
Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: a rooted tree of Taxon A to Taxon E]
No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
The dimension along the branches either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
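The nested-parentheses notation can be read mechanically. The following is a minimal Python sketch, not part of the original lecture: the function names `parse_nested` and `list_clades` are made up for illustration, and it assumes plain taxon names with no branch lengths or other annotations.

```python
def parse_nested(text):
    """Parse a tree such as '((A,(B,C)),(D,E))' into nested lists of taxon names."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            pos += 1                      # skip ")"
            return children
        start = pos                       # a taxon name runs up to the next delimiter
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def list_clades(tree):
    """Return the taxon set of every internal node (clade), innermost clades first."""
    found = []

    def leaves_under(node):
        if isinstance(node, str):
            return {node}
        leaves = set()
        for child in node:
            leaves |= leaves_under(child)
        found.append(sorted(leaves))
        return leaves

    leaves_under(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in list_clades(tree):
    print(clade)
```

The clades it lists are (B,C), (A,B,C), (D,E) and the whole tree, which matches the reading given in the text.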
A few examples of what can be inferred
from phylogenetic trees built from DNA
or protein sequence data:
• Which species are the closest living relatives of
modern humans?
• Did the infamous Florida Dentist infect his
patients with HIV?
• What were the origins of specific transposable
elements?
• Plus countless others…..
Which species are the closest living
relatives of modern humans?
[Figure: two time-scaled trees of humans, chimpanzees, bonobos, gorillas and orangutans; one time axis runs from 14 MYA to 0, the other from 15-30 MYA to 0]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
[Figure: tree whose tips are the DENTIST (two sequences), Patient A (two sequences), Patients B, C, D, E, F and G, and Local controls 2, 3, 9 and 35]
Yes (Patients A, B, C, E and G): The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No: Patient F and Patient D.
From Ou et al. (1992) and Page & Holmes (1998)
A few examples of what can be learned
from character analysis using
phylogenies as analytical frameworks:
• When did specific episodes of positive Darwinian
selection occur during evolutionary history?
• Which genetic changes are unique to the human
lineage?
• What was the most likely geographical location of
the common ancestor of the African apes and
humans?
• Plus countless others…..
The number of unrooted trees increases in a greater
than exponential manner with number of taxa
[Figure: example unrooted trees for 3 taxa (A to C), 4 taxa (A to D), 5 taxa (A to E) and 6 taxa (A to F)]
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
.             .
30            ~8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa
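The double-factorial formula is easy to evaluate directly. The short Python sketch below is an illustration added to the transcript; the helper names `double_factorial` and `unrooted_trees` are not from the lecture.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 (for odd k >= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3: (2N - 5)!!."""
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, f"{unrooted_trees(n):,}")
```

Because Python integers have arbitrary precision, the count for 30 taxa is computed exactly rather than as a floating-point approximation.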
Inferring evolutionary relationships between
the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: an unrooted tree of taxa A, B, C and D with a root position marked, next to the rooted tree that results]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree of taxa A, B, C and D with the root marked at another position, next to the rooted tree that results]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five
different places to produce five different rooted trees
[Figure: the unrooted tree 1 of taxa A, B, C and D, with five possible root positions numbered 1 to 5, and the five resulting rooted trees 1a, 1b, 1c, 1d and 1e]
These trees show five different evolutionary relationships among the taxa!
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
[Figure: a tree of taxa A, B, C and D with branch lengths 10, 3, 2, 2 and 5; A and D are the two most distant taxa]
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
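The midpoint calculation can be sketched in Python as follows. This is an illustration only: the slide fixes just the A-to-D path (10 + 3 + 5), so the exact topology below, the internal node names `x` and `y`, and the function names are assumptions.

```python
from itertools import combinations

# Assumed tree: A and B attach to internal node x, C and D to internal node y.
EDGES = {("A", "x"): 10, ("B", "x"): 2, ("x", "y"): 3, ("C", "y"): 2, ("D", "y"): 5}


def build_graph(edges):
    """Turn the edge list into a symmetric adjacency dictionary."""
    graph = {}
    for (u, v), length in edges.items():
        graph.setdefault(u, {})[v] = length
        graph.setdefault(v, {})[u] = length
    return graph


def find_path(graph, start, goal, seen=frozenset()):
    """Return the unique path of nodes from start to goal in a tree."""
    if start == goal:
        return [start]
    for nxt in graph[start]:
        if nxt not in seen:
            rest = find_path(graph, nxt, goal, seen | {start})
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges, taxa):
    """Return (branch, distance from its first node, length of the longest path)."""
    graph = build_graph(edges)

    def length(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))

    longest = max((find_path(graph, a, b) for a, b in combinations(taxa, 2)), key=length)
    half = length(longest) / 2
    walked = 0
    for a, b in zip(longest, longest[1:]):
        if walked + graph[a][b] >= half:      # the midpoint lies on this branch
            return (a, b), half - walked, length(longest)
        walked += graph[a][b]


print(midpoint_root(EDGES, "ABCD"))
```

Under these assumptions the root falls on the branch leading to A, 9 units from A, because that branch alone (length 10) is longer than half of the A-to-D distance.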
Each unrooted tree theoretically can be rooted
anywhere along any of its branches
[Figure: example unrooted trees for 3 to 6 taxa]
# Taxa    # Unrooted Trees    x  # Roots    =  # Rooted Trees
3         1                      3             3
4         3                      5             15
5         15                     7             105
6         105                    9             945
7         945                    11            10,395
8         10,395                 13            135,135
9         135,135                15            2,027,025
.         .                      .             .
30        ~8.69 x 10^36          57            ~4.95 x 10^38
(2N - 3)!! = # rooted trees for N taxa
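The relation "# unrooted trees x # roots = # rooted trees" can be checked numerically. In this small Python sketch (an added illustration, with made-up function names) an unrooted tree of N taxa is taken to have 2N - 3 branches, each of which is one possible root position.

```python
import math


def odd_product(k):
    """Return k!! for odd k >= 1, i.e. k * (k - 2) * ... * 1."""
    return math.prod(range(k, 0, -2))


def count_rooted_trees(n_taxa):
    """Rooted trees obtained by rooting every unrooted tree on every branch."""
    unrooted = odd_product(2 * n_taxa - 5)
    root_positions = 2 * n_taxa - 3       # one per branch of the unrooted tree
    return unrooted * root_positions


for n in range(3, 10):
    # The product agrees with the closed form (2N - 3)!!.
    assert count_rooted_trees(n) == odd_product(2 * n - 3)
    print(n, f"{count_rooted_trees(n):,}")
```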
Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence
order of taxa, as well as the lengths of the branches that connect them.
There are many phylogenetic methods available today, each having
strengths and weaknesses. Most can be classified as follows:
                COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion                  Clustering algorithm
Characters      PARSIMONY, MAXIMUM LIKELIHOOD
Distances       MINIMUM EVOLUTION, LEAST SQUARES      UPGMA, NEIGHBOR-JOINING
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA
or protein sequences, directly during tree inference.
Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise
distances (dissimilarities), and then use the matrix during tree building.
             A       B       C       D       E
Species A    ----    0.23    0.87    0.73    0.59
Species B    0.20    ----    0.59    1.12    0.89
Species C    0.50    0.40    ----    0.17    0.61
Species D    0.45    0.55    0.15    ----    0.31
Species E    0.40    0.50    0.40    0.25    ----
Example 1 (below the diagonal): Uncorrected “p” distance (=observed percent sequence difference)
Example 2 (above the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
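Both kinds of distance can be computed from the aligned sequences shown earlier. The Python sketch below is an added illustration with made-up function names; it assumes gap-free sequences of equal length and uses the standard Kimura 2-parameter formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}

SEQUENCES = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(seq1, seq2):
    """Uncorrected distance: the proportion of sites that differ."""
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):   # purine<->purine or pyrimidine<->pyrimidine
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq1)
    q = transversions / len(seq1)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


a, b = SEQUENCES["A"], SEQUENCES["B"]
print(round(p_distance(a, b), 2), round(k2p_distance(a, b), 2))
```

For species A and B this gives 0.20 and 0.23, the two values that appear for that pair in the matrix; the corrected distance is larger because it allows for multiple substitutions at the same site.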
Computational methods for finding optimal trees:
Exact algorithms: "Guarantee" to find the optimal or
"best" tree for the method of choice. Two types used in tree
building:
Exhaustive search: Evaluates all possible unrooted
trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the
search tree that only contain suboptimal solutions.
Heuristic algorithms: Approximate or “quick-and-dirty”
methods that attempt to find the optimal tree for the method of
choice, but cannot guarantee to do so. Heuristic searches
often operate by “hill-climbing” methods.
Exact searches become increasingly difficult, and
eventually impossible, as the number of taxa increases:
[Figure: example unrooted trees for 3 taxa (A to C), 4 taxa (A to D), 5 taxa (A to E) and 6 taxa (A to F)]
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
.             .
30            ~8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa
Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima
Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima
[Figure: two search landscapes, one for a search for the global minimum and one for a search for the global maximum; each marks a local optimum where a search can stall, and the GLOBAL MINIMUM or GLOBAL MAXIMUM]
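The effect of restarts can be shown on a toy landscape. This Python sketch is an analogy rather than a tree search: the one-dimensional `LANDSCAPE` values are invented, and different starting positions stand in for different input orders of taxa.

```python
# Hypothetical scores: a local minimum at position 2 and the global minimum at position 8.
LANDSCAPE = [5, 3, 1, 4, 6, 5, 3, 2, 0, 2]


def score(position):
    return LANDSCAPE[position]


def neighbors(position):
    """Positions one step to the left or right that are still on the landscape."""
    return [p for p in (position - 1, position + 1) if 0 <= p < len(LANDSCAPE)]


def hill_climb(start):
    """Greedy descent: move to the best neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = min(neighbors(current), key=score)
        if score(best) >= score(current):
            return current                # stuck: no neighbor is better
        current = best


single_run = hill_climb(0)
restarts = [hill_climb(start) for start in range(len(LANDSCAPE))]
best_overall = min(restarts, key=score)
print(single_run, best_overall)
```

A single run from position 0 stops in the local minimum, while the best result over all starting positions reaches the global minimum. Restarts improve the odds but still give no guarantee in general.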
Classification of phylogenetic inference methods
                COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion                  Clustering algorithm
Characters      PARSIMONY, MAXIMUM LIKELIHOOD
Distances       MINIMUM EVOLUTION, LEAST SQUARES      UPGMA, NEIGHBOR-JOINING
Based on lectures by C-B Stewart,
and by Tal Pupko
Parsimony methods:
Optimality criterion: The ‘most-parsimonious’ tree is the one that
requires the fewest number of evolutionary events (e.g., nucleotide
substitutions, amino acid replacements) to explain the sequences.
Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy)
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!)
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”.
[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford
et al. (1996) for a detailed explanation of various parsimony methods.]
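To make the optimality criterion concrete, the following Python sketch scores a fixed tree with Fitch's algorithm. The slides do not name this algorithm; it is used here as one standard way to count the minimum number of changes on a given rooted binary tree, with all changes weighted equally. The function names and the toy alignment are invented for illustration.

```python
def fitch_changes(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree: nested 2-tuples of taxon names; states: dict mapping taxon -> state.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common                 # children agree: no change needed here
        changes += 1                      # children disagree: one change on this node
        return left | right

    state_set(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes over all aligned columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


toy_alignment = {"A": "AT", "B": "AT", "C": "TT", "D": "TA"}
print(parsimony_score((("A", "B"), ("C", "D")), toy_alignment))
print(parsimony_score((("A", "C"), ("B", "D")), toy_alignment))
```

On this toy data the first tree needs 2 changes and the second needs 3, so the first is the more parsimonious of the two.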
Branch and Bound
Tal Pupko, Tel-Aviv University
There are many trees...
We cannot go over all the trees, so we need a smarter way to find the best tree.
There are approximate solutions… But what if we want to make sure we find the global optimum?
There is a way that is more efficient than just going over all possible trees. It is called BRANCH AND BOUND; it is a general technique in computer science that can be applied to phylogeny.
BRANCH AND BOUND
To exemplify the BRANCH AND BOUND (BNB)
method, we will use an example not connected to
evolution. Later, when the general BNB method is
understood, we will see how to apply this method
to finding the MP tree. We will present the
traveling salesperson path problem (TSP).
THE TSP PROBLEM
(especially adapted to Israel).
A guard has to visit n check-points whose location
on a map is known. The problem is to find the
shortest path that goes through all points exactly
once (no need to come back to starting point).
Naïve approach: (say for 5 points). You have 5
starting points. For each such starting point you
have 4 “next steps”. For each such combination of
starting point and first step, you have 3 possible
second steps, etc.
All together we have 5*4*3*2*1 = 5! possible solutions.
THE TSP TREE
[Figure: the TSP search tree for 5 points. The root branches to each of the 5 possible starting points; every node below branches to the points not yet visited (for example 2, 4, 5 and then 4, 5) until all points are used.]
THE SHP NAÏVE APPROACH
Each solution can be represented as a
permutation:
(1,2,3,4,5)
(1,2,3,5,4)
(1,2,4,3,5)
(1,2,4,5,3)
(1,2,5,3,4)
…
We can go over the list and find the one giving the best score (the shortest path).
THE SHP NAÏVE APPROACH
However, for 15 points, for example, there are 15! = 1,307,674,368,000 permutations.
The number of solutions increases too fast for this to be practical.
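For a handful of points the naïve approach is still feasible, as this Python sketch shows. The check-point coordinates are hypothetical; they lie on a line so that the optimum is easy to verify by eye (the shortest path simply sweeps from one end to the other).

```python
from itertools import permutations

# Hypothetical check-point positions along a road.
POSITIONS = [5, 0, 14, 2, 9]
DIST = [[abs(a - b) for b in POSITIONS] for a in POSITIONS]


def path_length(order, dist):
    """Total length of a path that visits the points in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def shortest_path_naive(dist):
    """Try every permutation of the points and keep the shortest one."""
    points = range(len(dist))
    return min(permutations(points), key=lambda order: path_length(order, dist))


best = shortest_path_naive(DIST)
print(best, path_length(best, DIST))
```

With 5 points this evaluates 5! = 120 permutations; the same loop over 15 points would need more than 10^12 evaluations.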
A TSP GREEDY HEURISTIC
Start from a random point. Go to the closest point.
From there, go to the closest point not yet visited, etc.
This approach doesn’t work so well…
(but a reasonably close heuristic, based on simulated annealing, will be presented in a couple of lectures.)
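A small counterexample shows why the greedy rule can fail. In this Python sketch the three points and their positions are invented for illustration, and the starting point is passed in explicitly instead of being drawn at random so that the result is reproducible.

```python
# Hypothetical positions on a line; point 0 sits between the other two.
POSITIONS = [0, 2, -3]
DIST = [[abs(a - b) for b in POSITIONS] for a in POSITIONS]


def greedy_path(dist, start):
    """Repeatedly move to the closest point that has not been visited yet."""
    order = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        nearest = min(unvisited, key=lambda p: dist[order[-1]][p])
        order.append(nearest)
        unvisited.remove(nearest)
    return order


def path_length(order, dist):
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


from_middle = greedy_path(DIST, 0)
from_end = greedy_path(DIST, 1)
print(from_middle, path_length(from_middle, DIST))
print(from_end, path_length(from_end, DIST))
```

Starting in the middle, the greedy path has length 7 because it must double back; starting from one end it has length 5, which is the optimum here. The quality of the result depends on the starting point.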
BNB SOLUTION TO SHP
[Figure: the same TSP search tree, annotated during a branch-and-bound search]
Shortest path found so far = 15
Score here already 16: no point in expanding the rest of the subtree
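The pruning rule can be written as a short depth-first search. This Python sketch is one possible implementation, not the lecture's code: the check-point positions are hypothetical, and the bound is simply the length of the partial path, which is valid because distances are never negative.

```python
import math

# Hypothetical check-point positions along a road.
POSITIONS = [5, 0, 14, 2, 9]
DIST = [[abs(a - b) for b in POSITIONS] for a in POSITIONS]


def shortest_path_bnb(dist):
    """Return (best order, its length, number of search-tree nodes examined)."""
    n = len(dist)
    record = {"length": math.inf, "order": None, "examined": 0}

    def extend(order, length):
        record["examined"] += 1
        if length >= record["length"]:
            return                        # bound: cannot beat the best path so far
        if len(order) == n:
            record["length"], record["order"] = length, order
            return
        for nxt in range(n):              # branch: try every unvisited point next
            if nxt not in order:
                extend(order + [nxt], length + dist[order[-1]][nxt])

    for start in range(n):
        extend([start], 0)
    return record["order"], record["length"], record["examined"]


order, length, examined = shortest_path_bnb(DIST)
print(order, length, examined)
```

The full search tree for 5 points has 5 + 20 + 60 + 120 + 120 = 325 nodes; the search examines fewer than that because whole subtrees are skipped once a partial path is already as long as the record.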
Back to finding the MP tree
Finding the MP tree is NP-Hard (will see shortly)…
BNB helps, though it is still exponential…
The MP search tree
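[Figure: the MP search tree. The root is the tree of taxa 1, 2 and 3. At the next level taxon 4 is added to one of its branches (e.g., 4 is added to branch 1). Each resulting tree has 5 branches, so at the level below taxon 5 can be added in 5 places (e.g., 5 is added to branch 2).]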
The MP search tree
[Figure: the MP search tree annotated with parsimony scores]
30 (root)
    43: 52, 54, 52, 53, 58
    55: 61, 56, 59, 61, 69
    39: 53, 51, 42, 47, 47
MP-BNB
[Figure sequence: branch and bound stepping through the search tree above]
The subtree under 43 is searched first. Best (minimum) value = 52, so the best record = 52.
55 is already worse than the best record, so its subtree is not expanded.
39 is better than the best record, so its subtree is searched; the best record improves to 51 and then to 42.
Total # trees visited: 14
Best TREE. MP score = 42
Order of Evaluation Matters
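[Figure: the same search tree, with the three four-taxon trees (43, 55, 39) scored before any subtree is expanded]
Evaluate all 3 first, then search the subtree under the lowest score (39).
The bound after searching this subtree will be 42, so the subtrees under 43 and 55 need not be expanded.
Total # trees visited: 9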
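The two visit counts can be reproduced by replaying branch and bound on the scores from the slides. In this Python sketch the grouping of the five-taxon scores under 43, 55 and 39 is inferred from the pruning walk-through, and the function name and the rule for counting a visit (every tree whose score is computed) are assumptions.

```python
import math

# (score, children); leaves are complete five-taxon trees.
SEARCH_TREE = (30, [
    (43, [(52, []), (54, []), (52, []), (53, []), (58, [])]),
    (55, [(61, []), (56, []), (59, []), (61, []), (69, [])]),
    (39, [(53, []), (51, []), (42, []), (47, []), (47, [])]),
])


def branch_and_bound(root, best_first):
    """Return (best score, number of trees scored) for one evaluation order."""
    best = math.inf
    visited = 1                           # the root tree

    def expand(node):
        nonlocal best, visited
        _, children = node
        visited += len(children)          # every child gets scored
        if best_first:
            children = sorted(children, key=lambda child: child[0])
        for child in children:
            score, grandchildren = child
            if not grandchildren:
                best = min(best, score)   # complete tree: candidate record
            elif score < best:
                expand(child)             # partial tree that may still beat the record

    expand(root)
    return best, visited


print(branch_and_bound(SEARCH_TREE, best_first=False))
print(branch_and_bound(SEARCH_TREE, best_first=True))
```

Both orders find the same best score of 42, but expanding the most promising partial tree first scores 9 trees instead of 14.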
And Now
Maximum Parsimony is
Computationally Intractable
Felsenstein’s Dynamic Programming
Algorithm for tiny maximum likelihood
and more, time permitting