Download D - Clayton State University

Document related concepts

Biochemistry wikipedia , lookup

Genetic code wikipedia , lookup

Community fingerprinting wikipedia , lookup

Protein structure prediction wikipedia , lookup

Homology modeling wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Point mutation wikipedia , lookup

Non-coding DNA wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

History of molecular evolution wikipedia , lookup

Molecular evolution wikipedia , lookup

Transcript
Biology 4900
Biocomputing
Chapter 7
Molecular Phylogeny and Evolution
Outline
• Introduction to evolution and phylogeny
• Nomenclature of relationship trees
• Five stages of molecular phylogeny:
1.
2.
3.
4.
5.
Selecting sequences
Multiple sequence alignment
Models of substitution
Tree-building
Tree evaluation
Historical background: Evolution
• Evolution: Theory that groups
of organisms change over time;
descendants are structurally
and functionally different than
ancestors.
• Darwin (1859) proposed the
role of natural selection in
hereditary change.
– Heredity is generally conservative:
changes in offspring tend to be
small.
• Prior to 1950’s, phylogeny was
studied by comparing living
species with fossils.
– Modern techniques of molecular
biology not available!
Pevsner, Bioinformatics and Functional Genomics, 2009; http://www.ucsd.tv/evolutionmatters/lesson2/study.shtml
Tree of Life
http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734;
http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg
3 main mechanisms of evolutionary change
• Conditions of growth affect development of
organism.
– Physiological response to external stimuli
– One or more differences in an organism that increase it’s
likelihood of surviving in an environment long enough to
reproduce
• Change resulting from sexual reproduction.
– Offspring contains genetic material from 2 parents.
• Mutation with selection/genetic drift.
– Similar to first mechanism
Pevsner, Bioinformatics and Functional Genomics, 2009
Historical background
• Studies of molecular evolution began with the first
sequencing of proteins, beginning in the 1950s.
• In 1953 Frederick Sanger and colleagues determined
the primary amino acid sequence of insulin.
• (The accession number of human insulin is
NP_000198)
Pevsner, Bioinformatics and Functional Genomics, 2009; http://en.wikipedia.org/wiki/Frederick_Sanger
Historical background: insulin
• By the 1950s, it became clear that amino acid
substitutions occur non-randomly.
– For example, Sanger and colleagues noted
that most amino acid changes in the insulin A
chain are restricted to a disulfide loop region.
– Such differences are called “neutral” changes
(Kimura, 1968; Jukes and Cantor, 1969).
• Subsequent studies at the DNA level showed
that rate of nucleotide (and of amino acid)
substitution is about six- to ten-fold higher in the
C peptide, relative to the A and B chains.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 219
Note the sequence divergence in the
disulfide loop region of the A chain
Fig. 7.3
Page 220
0.1 x 10-9
1 x 10-9
Number of nucleotide substitutions/site/year
0.1 x 10-9
Fig. 7.3
Page 220
Historical background: insulin
Surprisingly, insulin from the
guinea pig (and from the related
coypu) evolve seven times faster
than insulin from other species.
Why?
The answer: Guinea pig and coypu
insulin do not bind two zinc ions,
while insulin molecules from
most other species do. There was
a relaxation on the structural
constraints of these molecules,
and so the genes diverged rapidly.
Sus scrofa insulin
(1ZNI.pdb)
Page 219
Guinea pig and coypu insulin have undergone an
extremely rapid rate of evolutionary change
Arrows indicate positions at which
guinea pig insulin (A and B chains)
differs from human and mouse
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 220
Historical Background: Vasopressin and Oxytocin
• Vasopressin and oxytocin also sequenced in 1950’s.
• Sequences differ by only 2 residues, but functions differ significantly
• Vasopressin
– Reduces the volume of urine formed in the body to retain water
• Oxytocin
– Causes contractions in smooth muscles of uterus
Seager SL, Slabaugh MR, Chemistry for Today: General, Organic and Biochemistry, 7 th Edition, 2011
Molecular clock hypothesis
• In 1960s, sequence data accumulated for small, abundant
proteins (e.g., globins, cytochromes c, fibrinopeptides).
• Some proteins appeared to evolve slowly, while others
evolved rapidly.
• Linus Pauling, Emanuel Margoliash and others proposed
the hypothesis of a molecular clock:
• For every given protein, rate of molecular evolution is
approximately constant in all evolutionary lineages.
Mutations over time
Species A
Species B
Protein X
Species C
Species D
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 221
Molecular clock hypothesis
corrected amino acid changes
per 100 residues (m)
•
•
•
•
Millions of years since divergence
Dickerson, 1972
Richard Dickerson (1971) plotted
sequence data from three protein
families: cytochrome c,
hemoglobin, and fibrinopeptides.
X-axis shows the divergence
times of the species, estimated
from paleontological data.
Y-axis shows m, the corrected
number of amino acid changes
per 100 residues.
n is observed number of amino
acid changes per 100 residues,
and it is corrected to m to account
for changes that occur but are not
observed.
Molecular clock hypothesis
Molecular clock hypothesis: conclusions
Dickerson drew the following conclusions:
• For each protein, the data lie on a straight line. Thus,
the rate of amino acid substitution has remained
constant for each protein.
• The average rate of change differs for each protein.
The time for a 1% change to occur between two lines
of evolution is 20 MY (cytochrome c), 5.8 MY
(hemoglobin), and 1.1 MY (fibrinopeptides).
• The observed variations in rate of change reflect
functional constraints imposed by natural selection.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 223
Molecular clock for proteins:
rate of substitutions per aa site per 109 years
Fibrinopeptides
Kappa casein
Lactalbumin
Serum albumin
Lysozyme
Trypsin
Insulin
Cytochrome c
Histone H2B
Ubiquitin
Histone H4
9.0
3.3
2.7
1.9
0.98
0.59
0.44
0.22
0.09
0.010
0.010
Table 7-1
Page 223
Molecular clock hypothesis: problems
• Rate of molecular evolution varies among different organisms
(e.g., viral sequences changes very rapidly).
• Clock varies among different genes and parts of genes –
guided in part by selection (e.g., animals with shorter
generational times may have faster clocks).
• Clock only applicable when gene retains its function over
evolutionary time.
– Genes may become nonfunctional
– Rate of evolution may accelerate following gene duplication
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Rate of nucleotide substitution r and time of
divergence T
• Rate of nucleotide substitution (r) = number of nucleotide
substitutions that occur per site per year (same for AAs)
• Time of divergence (T) = how long ago the two sequences split
from a common sequence.
• A constant (K) defines the number of non-synonymous
substitutions per site.
• α-globins from rat and human differ by 0.093 non-synonymous substitutions by site
• Rate of change = 0.58 × 10-9 nonsynonymous nucleotide substitutions per site per year
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Neutral theory of molecular evolution
• Significant amount of DNA polymorphism across all species
that is difficult to account for by conventional natural
selection.
• Kimura (1969, 1983) proposed an alternative model to
Darwinian evolution at DNA level.
– Observed that rate of AA substitution averages ~1 change per 28 × 106
years for proteins of 100 residues.
– Corresponding rate of nucleotide substitution is 1 base pair in
population genome every 2 years.
– Kimura concluded that most DNA substitutions must be neutral and
that main cause of evolutionary change at molecular level is random
drift of mutant alleles.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Molecular phylogeny: nomenclature of trees
• Molecular Phylogeny: Study of evolutionary
relationships among organisms or molecules
– Determined via molecular biology
• There are two main kinds of information inherent
to any tree: topology and branch lengths.
• We will now describe the parts of a tree.
Page 231
Molecular phylogeny uses trees to depict evolutionary
relationships among organisms. These trees are based
upon DNA and protein sequence data.
Cladogram
Phylogram
2
A
1
I
2
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
D
6
one unit
E
time
Change in Time
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
Change in Magnitude
Tree Nomenclature
Node (intersection or terminating point of two or more branches)
Branch (edge) connecting 2 nodes
A
F
G
B
I
H
C
D
E
time
Fig. 7.8
Page 232
Tree Nomenclature: Nodes
OTUs: Taxa that provide observable features
(e.g., existing protein sequences,
morphological features)
G
I
A
F
B
H
C
D
Root Node
Internal Node (Taxon)
Operational taxonomic unit (OTU)
time
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
E
Tree Nomenclature: Cladogram vs. Phylogram
Branches are unscaled...
2
I
2
A
F
Cladogram
1
Phylogram
1
G
B
H 2
1
6
Branches are scaled...
1
1
2
C
2
E
B
C
2
1
D
A
2
D
6
one unit
E
time
…OTUs are neatly aligned,
and nodes reflect time
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
…branch lengths are
proportional to number of
amino acid changes
Tree nomenclature: Internal Nodes
bifurcating
internal
node
multifurcating
internal
node
2
A
1
I
2
1
1
G
B
H 2
1
6
A
2
F
B
2
C
2
2
1
D
E
C
D
6
one unit
E
time
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
Tree nomenclature: clades
Clade: Group of taxa derived from a common ancestor
Clade ABF/CDH/G
(paraphyletic)
Clade ABF
(monophyletic group)
2
F
I
1
1
B
G
H 2
1
6
A
F
1
2
2
A
C
D
E
Clade CDH
(monophyletic group)
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
I
2
1
G
B
H 2
1
6
C
D
E
Tree nomenclature: Newick Format
(,,(,)); no nodes are named
(A,B,(C,D)); leaf nodes are named
(A,B,(C,D)E)F; all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A; a tree rooted on a leaf node (rare)
http://en.wikipedia.org/wiki/Newick_format
Examples
of clades
Lindblad-Toh et al., Nature
438: 803 (2005), fig. 10
Tree Roots
• The root of a phylogenetic tree represents the common
ancestor of the sequences. Some trees are unrooted, and thus
do not specify the common ancestor.
past
9
1
7
5
8
6
2
present
1
7
3 4
2
5
Rooted tree (specifies
evolutionary path)
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233
8
6
3
Unrooted tree
4
Tree Roots
• A tree can be rooted using an outgroup (that is, a taxon
known to be distantly related from all other OTUs).
past
root
9
10
7
8
7
6
2
present
9
8
3 4
1
2
5
Rooted tree
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233
1
3 4
5
6
Outgroup
(used to place the root)
Enumerating trees
Cavalii-Sforza and Edwards (1967) derived the number
of possible unrooted trees (NU) for n OTUs (n > 3):
NU =
(2n-5)!
2n-3(n-3)!
The number of bifurcating rooted trees (NR)
NR =
(2n-3)!
2n-2(n-2)!
For 10 OTUs (e.g. 10 DNA or protein sequences),
the number of possible rooted trees is  34 million,
and the number of unrooted trees is  2 million.
Many tree-making algorithms can exhaustively
examine every possible tree for up to ten to twelve
sequences.
Page 235
Numbers of possible trees extremely large for
>10 sequences
Number
of OTUs
2
3
4
5
10
20
Number of
rooted trees
1
3
15
105
34,459,425
8 x 1021
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 235
Number of
unrooted trees
1
1
3
15
105
2 x 1020
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243
Stage 1: Use of DNA, RNA, or protein
DNA
• DNA lets you study synonymous and nonsynonymous changes
• Synonymous changes do not have corresponding protein
changes.
• synonymous substitution rate (dS)
• nonsynonymous substitution rate (dN)
• If dS > dN, DNA sequence is under negative (purifying)
selection.
– This limits change in the sequence (e.g. insulin A chain).
• If dS < dN, positive selection occurs.
– Ex. Duplicated gene may evolve rapidly to assume new functions.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 240
Stage 1: Use of DNA, RNA, or protein
DNA
• Some substitutions in a DNA sequence alignment can be
directly observed:
A
G
T
C
C
T
G
T
T
C
A
G
hypothetical
ancestral
globin
A
G→C
T
C
C
T
G
T→A
T→G
C→G
A
G→T→G
human globin
A
G→C
T
C→G
C→T→A
T
G
T→C
T
C→T→G
A
G
mouse globin
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 241, Figure 7.16
AA
CC
TT
CG
CA
TT
GG
AC
GT
GG
AA
GG
observed
alignment
Parallel substitution
Single substitution
Sequential substitution
Coincidental substitution
Convergent substitution
Back substitution
Stage 1: Use of DNA, RNA, or protein
DNA
•Noncoding regions (such as 5′ and 3′ untranslated regions of
genes, or introns) may be analyzed using molecular phylogeny.
•Pseudogenes (nonfunctional genes) are studied by molecular
phylogeny
•Rates of transitions and transversions can be measured.
•Transitions: purine (A → G) or pyrimidine (C → T) substitutions
•Transversion: purine ↔ pyrimidine
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 242
Stage 1: Use of DNA, RNA, or protein
Protein
• Proteins frequently more informative than DNA in pairwise
alignment and in BLAST searching.
• Proteins have 20 states (amino acids) instead of only 4 for
DNA, so there is a stronger phylogenetic signal.
• Nucleotides are unordered characters: any one nucleotide can
change to any other in one step.
• An ordered character must pass through one or more
intermediate states before reaching the final state.
• Amino acid sequences are partially ordered character states:
there is a variable number of states between the starting
value and the final value.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Stage 2: Multiple sequence alignment
Fundamental basis of phylogenetic tree is the MSA.
1. Confirm that all sequences are homologous
–
Trees can still be generated even with misalignment, or if a nonhomologous sequence is included in the alignment.
2. Adjust gap creation and extension penalties as needed to
optimize the alignment
3. Restrict phylogenetic analysis to regions of the multiple
sequence alignment for which data are available for all taxa
(delete columns having incomplete data or gaps. Similar to
masking).
-----GADEG-YFGPVILAADGEVA--dlgnvGA-EGDYFGPAI--AEGEVArpl
Pevsner, Bioinformatics and Functional Genomics, 2009
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Stage 4: Tree-building methods: distance
• Simplest approach to measuring distances between
sequences is to align pairs of sequences, and then to count
the number of differences.
• Degree of divergence is called the Hamming distance.
• For alignment of length N with n sites at which there are
differences, the degree of divergence D is:
D=n/N
• Observed differences do not equal genetic distance!
– Genetic distance involves mutations that are not observed directly.
Pevsner, Bioinformatics and Functional Genomics, 2009
Stage 4: Tree-building methods: distance
Jukes and Cantor (1969) proposed a corrective formula:
• This model describes the probability that one
nucleotide will change into another.
• Still imperfect, as it assumes that each residue is
equally likely to change into any other (i.e. the rate of
transversions equals the rate of transitions).
• In practice, the transition is typically greater than the
transversion rate.
Pevsner, Bioinformatics and Functional Genomics, 2009
Stage 4: Tree-building methods: distance
Jukes and Cantor (1969) proposed a corrective formula:
Consider an alignment where 3/60 aligned residues differ.
The normalized Hamming distance is 3/60 = 0.05.
The Jukes-Cantor correction is
When 30/60 aligned residues differ, the Jukes-Cantor correction is
more substantial:
Pevsner, Bioinformatics and Functional Genomics, 2009
Models of nucleotide substitution
transition > transversion
A
transition
G
transversion
pyrimidines
purines
transversion
T
C
transition
Pevsner, Bioinformatics and Functional Genomics, 2009
Stage 4: Tree-building methods
Distance-based methods
• Distance-based methods involve a distance metric, such as
the number of amino acid changes between the sequences,
or a distance score.
• Examples of distance-based algorithms are UPGMA and
neighbor-joining.
Character-based methods
• Include maximum parsimony and maximum likelihood.
• Parsimony analysis involves the search for the tree with the
fewest amino acid (or nucleotide) changes that account for
the observed differences between taxa.
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
UPGMA (unweighted pair group method using arithmetic
mean) is based on clustering of sequences
1
2
3
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all
the proteins. Get ready to put the numbers 1-5
at the bottom of your new tree.
1
2
3
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 2: Find the two proteins with the
smallest pairwise distance. Cluster them.
1
2
6
3
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
1
2
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins
with the smallest pairwise distance. Cluster them.
1
2
6
1
3
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
7
2
4
5
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
1
8
2
7
6
3
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
1
2
4
5
3
Tree-building methods: UPGMA
Step 4: Last cluster! This is your tree.
9
1
2
8
7
3
6
4
5
Pevsner, Bioinformatics and Functional Genomics, 2009
1
2
4
5
3
Another View of UPGMA: Pairwise Distance Matrix
A
B
C
B
dAB
--
--
C
dAC
dBC
--
D
dAD
dBD
dCD
Pevsner, Bioinformatics and Functional Genomics, 2009
B
C
D
E
A
9
8
12
15
B
-11
15
18
C
--10
13
D
---5
Another View of UPGMA: Pairwise Distance Matrix
B
C
D
E
A
9
8
12
15
B
-11
15
18
C
--10
13
D
---5
Start with pair having shortest
distance
Example:
http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Pevsner, Bioinformatics and Functional Genomics, 2009
Distance-based methods: UPGMA trees
UPGMA is a simple approach for making trees.
• A UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular
clock is constant for sequences in the tree. If there
are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the
neighbor-joining approach (described next).
Pevsner, Bioinformatics and Functional Genomics, 2009
Making trees using neighbor-joining
The neighbor-joining method
of Saitou and Nei (1987) is
especially useful for making
trees with:
• Large number of taxa
• Lineages with varying rates
of evolution
Begin by placing all the taxa
in a star-like structure.
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely
related.
Connect these neighbors to other OTUs via an internal branch,
XY.
At each successive stage, minimize the sum of the branch
lengths.
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: Neighbor joining
Define the distance from X to Y by
dXY = 1/2(d1Y + d2Y – d12)
Pevsner, Bioinformatics and Functional Genomics, 2009
Example of a
neighbor-joining
tree: phylogenetic
analysis of 13
RBPs
Distribution of SC angles
S100
1K96
1WDC
CC
2BL0CC
1WDC
2BL0
1WDC B 2BL0 B
1EXR
S100
1XK4 A
R2
S100
1E8A
1GGZ
S100
1J55
2BL0 A
2AAO
Calbindin d9k
1QX2
Calbindin d9k
1WDC A
4ICB
1K94
Penta-EF
2CCM
1Y1X
Penta-EF
1RRO
1S6C B
Parvalbumin
2PVB
Parvalbumin
Osteonectin
1SRA
1A75
2HQ8
1BU3
Parvalbumin
Parvalbumin
2H2K
2HQ8
2H2K
5PAL
S100
1PVA Parvalbumin
Parvalbumin
2SCP
1RWY
Parvalbumin
1SL8
Polcalcin
1QV1
0.1 N-J Phylogenic Tree
Unrooted
generated by Treeview
2OPO1K9U
1XO5
Polcalcin
1OMR
1S6C A
1G8I
R1
Distribution of MC angles
S100
1K96
1WDC
CC
2BL0CC
1WDC
2BL0
1WDC B 2BL0 B
1EXR
S100
1XK4 A
R1
S100
1E8A
1GGZ
S100
1J55
2BL0 A
2AAO
Calbindin d9k
1QX2
Calbindin d9k
1WDC A
4ICB
1K94
Penta-EF
2CCM
1Y1X
Penta-EF
1RRO
1S6C B
Parvalbumin
2PVB
Parvalbumin
Osteonectin
1SRA
1A75
2HQ8
1BU3
Parvalbumin
Parvalbumin
2H2K
2HQ8
2H2K
5PAL
S100
1PVA Parvalbumin
Parvalbumin
2SCP
1RWY
Parvalbumin
1SL8
Polcalcin
1QV1
0.1 N-J Phylogenic Tree
Unrooted
generated by Treeview
2OPO1K9U
1XO5
Polcalcin
1OMR
1S6C A
1G8I
R2
Parsimony Principle
• The parsimony principle tells us to choose the simplest
scientific explanation that fits the evidence. In tree-building,
this means that the best hypothesis is the one that requires
the fewest evolutionary changes.
Hypothesis 1 is the most parsimonious because it requires fewer evolutionary
changes. Hypothesis 2 requires formation of bony skeleton at 2 different points.
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08
Building trees using character-based methods
• Rather than pairwise distances between proteins,
aligned columns of AA residues are evaluated.
• The main idea of character-based methods is to find
the tree with the shortest branch lengths. Thus we
seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant
characters are not parsimony-informative.
• Construct trees, counting the number of changes
required to create each tree.
• Select the shortest tree (or trees).
Pevsner, Bioinformatics and Functional Genomics, 2009
As an example of tree-building using maximum
parsimony, consider these four taxa:
AAG
AAA
GGA
AGA
How might they have evolved from a common
ancestor such as AAA?
Pevsner, Bioinformatics and Functional Genomics, 2009
Parsimony Principle
• The parsimony principle tells us to choose the simplest
scientific explanation that fits the evidence. In tree-building,
this means that the best hypothesis is the one that requires
the fewest evolutionary changes.
AAA
1
AAA
AAG AAA
1
1
AGA
GGA AGA
Cost = 3
AAA
1
AAA
1
AAG AGA
AAA
AAA
2
AAA GGA
Cost = 4
1
AAA
2
AAG GGA
AAA
AAA AGA
Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest
branch lengths).
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08
1
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Pevsner, Bioinformatics and Functional Genomics, 2009
Stage 5: Evaluating trees: bootstrapping
Bootstrapping is common approach to measuring robustness
of a tree topology.
Given a branching order, how consistently does an algorithm
find that branching order in a randomly permuted version of
the original data set?
To bootstrap, create artificial dataset by randomly sampling
columns from your multiple sequence alignment.
Make dataset the same size as the original. Do 100 (to 1,000)
bootstrap replicates.
Observe the percent of cases in which the assignment of
clades in the original tree is supported by the bootstrap
replicates. >70% is considered significant.
Pevsner, Bioinformatics and Functional Genomics, 2009
In 61% of the bootstrap
resamplings, ssrbp and btrbp
(pig and cow RBP) formed a
distinct clade. In 39% of the
cases, another protein joined
the clade (e.g. ecrbp), or one
of these two sequences joined
another clade.