An Introduction to Phylogenetics
Anton E. Weisstein
Indiana State University
March 11-14, 2004
> Sequence 1
GAGGTAGTAATTAGATCCGAAA…
> Sequence 2
GAGGTAGTAATTAGATCTGAAA…
> Sequence 3
GAGGTAGTAATTAGATCTGTCA…
Outline
I. Overview
II. Building and Interpreting Phylogenies
III. Evolutionary Inference
IV. Specific Applications
What is phylogenetics?
Phylogenetics is the study of evolutionary relationships.
Relationships among species:
[Figure: tree relating crocodiles, birds, lizards, snakes, rodents, primates, and marsupials]
This is an example of a phylogenetic tree.
What is phylogenetics?
Relationships within species:
[Figure: tree of HIV subtypes A, B, C, D, F, and G, with isolates labeled by country: Italy, Rwanda, Ivory Coast, Uganda, U.S., India, U.K., Ethiopia, S. Africa, Netherlands, Tanzania, Romania, Cameroon, Brazil, Russia, and Taiwan]
So what is phylogenetics good for?
Phylogenetics has direct applications to:
• Conservation: test wood, ivory, meat products for poaching
• Agriculture: analyze specific differences between cultivars
• Forensics: DNA fingerprinting
• Medicine: determine specific biochemical function of cancer-causing genes
HIV Example 1:
Florida dentist case
1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?
II. Building and Interpreting Phylogenies
Phylogenetic concepts:
Interpreting a Phylogeny
[Figure: rooted tree of Sequences A, B, C, D, and E drawn along a time axis]
Which sequence is most closely related to B?
A, because B diverged from A more recently than from any other sequence.
Physical position in tree is not meaningful! Only tree structure matters.
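The point about tree structure can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the nested-tuple topology below is hypothetical (only the A and B pairing comes from the slide; the placement of C, D, and E is assumed), and `leaves` and `closest_relatives` are made-up helper names.

```python
def leaves(tree):
    """Return the set of tip labels under a node (a tip is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def closest_relatives(tree, tip):
    """Return the tips that share the most recent common ancestor with `tip`.

    Walk down from the root toward `tip`. The last internal node visited is
    the tip's most recent ancestor, and the other tips under that node are
    its closest relatives.
    """
    node = tree
    while True:
        # Find the child subtree that contains the tip.
        child = next(c for c in node if tip in leaves(c))
        if isinstance(child, str):
            return leaves(node) - {tip}
        node = child


# Hypothetical topology: ((A, B), C) and (D, E) join at the root.
tree = ((("A", "B"), "C"), ("D", "E"))
# Same tree with branches rotated: drawn differently, same structure.
rotated = (("E", "D"), ("C", ("B", "A")))

print(closest_relatives(tree, "B"))
print(closest_relatives(rotated, "B"))
```

Both calls give the same answer because rotating branches around a node changes where tips are drawn, not which ancestors they share.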
Phylogenetic concepts:
Rooted and Unrooted Trees
A
A
B
X
B
Root
=
Root
A
? ?
B
X
?
=
?
C
?
D
Time
C
D
C
?
D
How Many Trees?

                               Unrooted trees             Rooted trees
# sequences  # pairwise        # trees       # branches   # trees       # branches
             distances                       /tree                      /tree
3            3                 1             3            3             4
4            6                 3             5            15            6
5            10                15            7            105           8
6            15                105           9            945           10
10           45                2,027,025     17           34,459,425    18
30           435               8.69 × 10^36  57           4.95 × 10^38  58
N            N(N-1)/2          (see below)   2N-3         (see below)   2N-2

Number of unrooted trees for N sequences: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$
Number of rooted trees for N sequences: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$
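The formulas in the last row can be checked against the table with a short Python sketch. The function names are chosen here for illustration; the formulas themselves are the ones given on the slide.

```python
from math import factorial


def pairwise_distances(n):
    """Number of pairwise distances among n sequences: N(N-1)/2."""
    return n * (n - 1) // 2


def unrooted_trees(n):
    """Number of unrooted trees: (2N-5)! / (2^(N-3) (N-3)!), for N >= 3."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted trees: (2N-3)! / (2^(N-2) (N-2)!), for N >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


# Columns: sequences, pairwise distances, unrooted trees, branches per
# unrooted tree, rooted trees, branches per rooted tree.
for n in (3, 4, 5, 6, 10, 30):
    print(n, pairwise_distances(n), unrooted_trees(n), 2 * n - 3,
          rooted_trees(n), 2 * n - 2)
```

Note that the number of rooted trees for N sequences equals the number of unrooted trees for N + 1 sequences, which is why the same values appear shifted by one row in the table.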
Tree Types
Evolutionary trees measure time.
Phylograms measure change.
[Figure: the same rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, and bats drawn twice: as an evolutionary tree (scale bar: 50 million years) and as a phylogram (scale bar: 5% change)]
Tree Properties
Ultrametricity
All tips are an equal distance from the root.
[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]
a = b + c + d + e
Additivity
Distance between any two tips equals the total branch length between them.
[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]
XY = a + b + c + d + e
In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
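Both properties can also be tested directly on a distance matrix. The sketch below uses the standard three-point and four-point conditions, which are matrix-level equivalents of the two definitions above rather than something stated on the slides. The two distance tables are hypothetical examples made up for illustration.

```python
from itertools import combinations


def dist(d, x, y):
    """Look up a symmetric distance stored under either key order."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def is_ultrametric(taxa, d, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances tie."""
    for x, y, z in combinations(taxa, 3):
        a, b, c = sorted([dist(d, x, y), dist(d, x, z), dist(d, y, z)])
        if c - b > tol:
            return False
    return True


def is_additive(taxa, d, tol=1e-9):
    """Four-point condition: in every quartet, the two largest sums tie."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([
            dist(d, w, x) + dist(d, y, z),
            dist(d, w, y) + dist(d, x, z),
            dist(d, w, z) + dist(d, x, y),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Hypothetical clock-like distances: every tip is 4 units from the root.
clock = {("X", "Y"): 8, ("X", "Z"): 8, ("X", "W"): 8,
         ("Y", "Z"): 2, ("Y", "W"): 6, ("Z", "W"): 6}
# Hypothetical phylogram distances: additive, but tips differ in depth.
phylo = {("A", "B"): 4, ("A", "C"): 5, ("A", "D"): 8,
         ("B", "C"): 7, ("B", "D"): 10, ("C", "D"): 7}

print(is_ultrametric("XYZW", clock), is_additive("XYZW", clock))
print(is_ultrametric("ABCD", phylo), is_additive("ABCD", phylo))
```

The first matrix passes both checks and the second passes only the additivity check, which matches the idea that ultrametric distances are a special case of additive ones.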
Tree Building Exercise
Using the distance matrix given, construct an ultrametric tree.
Reminder: Ultrametricity
All tips are an equal distance from the root.
[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]
a = b + c + d + e
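One common way to build an ultrametric tree from a distance matrix is UPGMA (repeatedly join the two closest clusters and average their distances to everything else). The slides do not name a method for this exercise, so UPGMA is an assumption here, and the distance matrix below is a hypothetical stand-in for the one handed out with the exercise.

```python
from itertools import combinations


def upgma(names, d):
    """Cluster taxa by UPGMA.

    `d` maps frozenset({x, y}) to the distance between tips x and y.
    Returns (tree, heights): `tree` is a nested tuple of tip names and
    `heights` maps each internal node to its distance above the tips.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of tips
    dist = dict(d)
    heights = {}
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        heights[merged] = dist[frozenset((a, b))] / 2
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    tree = next(iter(sizes))
    return tree, heights


# Hypothetical distance matrix (not the one from the exercise).
matrix = {
    frozenset(("X", "Y")): 8, frozenset(("X", "Z")): 8, frozenset(("X", "W")): 8,
    frozenset(("Y", "Z")): 2, frozenset(("Y", "W")): 6, frozenset(("Z", "W")): 6,
}
tree, heights = upgma(["X", "Y", "Z", "W"], matrix)
print(tree)
print(heights[tree])  # root height = distance from the root to every tip
```

Because each join is placed at half the distance between the clusters it joins, every tip ends up the same distance from the root, which is exactly the ultrametric property the exercise asks for.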
Phylogenetic Methods
Many different procedures exist. Three of the most popular:
Neighbor-joining
• Minimizes distance between nearest neighbors
Maximum parsimony
• Minimizes total evolutionary change
Maximum likelihood
• Maximizes likelihood of observed data
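To make "minimizes total evolutionary change" concrete, here is a minimal sketch of parsimony scoring using Fitch's algorithm, which counts the fewest changes a tree needs to explain each site. The four short sequences and the candidate trees are hypothetical, and the function names are chosen for illustration; real analyses would use dedicated phylogenetics software.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a nested tuple of tip names; `states` maps tip name to base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    shared = lset & rset
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, lcost + rcost
    # No state in common: at least one change on this pair of branches.
    return lset | rset, lcost + rcost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {tip: seq[i] for tip, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences, for illustration only.
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTAC"}
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for tree in candidates:
    print(tree, parsimony_score(tree, alignment))
```

Maximum parsimony prefers the candidate with the lowest score. With only three candidates every tree can be scored; with many taxa the number of trees grows too fast for that, which is one reason the method is slow.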
Comparison of Methods
Neighbor-joining
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees
Maximum parsimony
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods
Which procedure should we use?
Neighbor-joining, maximum parsimony, or maximum likelihood?
All that we can!
• Each method has its own strengths
• Use multiple methods for cross-validation
• In some cases, none of the three gives the correct phylogeny!
III. Evolutionary Inference
Phylogenetic concepts:
Homology and Homoplasy
Homology: identical character due to shared ancestry (evolutionary signal)
Homoplasy: identical character due to evolutionary convergence or reversal (evolutionary noise)
[Figure: three example trees. Homology: lizards, birds, snakes, rodents, and primates, with +flight and +hair marked. Homoplasy (Convergence): snakes, rodents, and bats, with +flight marked. Homoplasy (Reversal): worms, lizards, and snakes, with +legs and –legs marked.]
Watching the Molecular Clock
Mutation occurs as a random (Poisson) process. If mutations accumulate at a constant rate over time and across all branches, the phylogeny is said to obey a molecular clock.
BUT:
• Natural selection favors some mutations and eliminates others
• Selection varies over time and across lineages
[Figure: two trees of samples dated 2000, 2001, and 2002 plotted against % genetic difference, one clock-like and one not]
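The Poisson assumption behind the molecular clock can be written out numerically. In this sketch the substitution rate, sequence length, and time span are hypothetical numbers picked only to illustrate the arithmetic, and the function names are made up for this example.

```python
from math import exp, factorial


def expected_substitutions(rate_per_site_per_year, years, sites):
    """Under a strict clock, expected changes grow linearly with time."""
    return rate_per_site_per_year * years * sites


def poisson_pmf(k, mean):
    """Probability of observing exactly k events when `mean` are expected."""
    return exp(-mean) * mean ** k / factorial(k)


# Hypothetical values: 0.005 substitutions/site/year, 1000 sites, 2 years.
mean = expected_substitutions(0.005, 2, 1000)
print(mean)
for k in (5, 10, 15):
    print(k, poisson_pmf(k, mean))
```

Even with a perfectly constant rate, the observed number of changes scatters around the expected value, so small departures from a straight line do not by themselves show that the clock has failed.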
Trees are hypotheses about evolutionary history
So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.
Tree Testing:
Split Decomposition
Split decomposition is one method for testing a tree.
Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. How many such trees are there?
[Figure: the three possible unrooted trees for taxa A, B, C, and D]
Only one of these topologies is right. How can we quantitatively assess the support for each tree?
Tree Testing:
Split Decomposition
The correct tree should be approximately additive; the others usually will not be. For each tree, we calculate split indices that estimate the length of the internal branch:
[Figure: split index formula drawn with tree diagrams: a difference of sums of pairwise distances among A, B, C, and D, divided by 2, which equals the internal branch length if the tree shown is the right phylogeny]
Large split indices → Long internal branch → Topology strongly supported
Small split indices → Short internal branch → Topology weakly supported
Negative split indices → Biologically impossible → Topology probably wrong
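A minimal sketch of the split-index calculation for four taxa follows. It assumes the standard four-point relation: for the topology AB|CD, both (d(A,C) + d(B,D) - d(A,B) - d(C,D)) / 2 and (d(A,D) + d(B,C) - d(A,B) - d(C,D)) / 2 equal the internal branch length when distances are additive. The exact form shown on the original slide is not recoverable from the transcript, so treat this as one reasonable reading. The distances are hypothetical.

```python
def dist(d, x, y):
    """Look up a symmetric distance keyed by an unordered pair."""
    return d[frozenset((x, y))]


def split_indices(d, pair1, pair2):
    """Two estimates of the internal branch length for pair1 | pair2."""
    a, b = pair1
    c, e = pair2
    within = dist(d, a, b) + dist(d, c, e)   # paths that avoid the internal branch
    cross1 = dist(d, a, c) + dist(d, b, e)   # paths that cross it
    cross2 = dist(d, a, e) + dist(d, b, c)
    return (cross1 - within) / 2, (cross2 - within) / 2


# Hypothetical additive distances generated from the tree AB|CD with
# an internal branch of length 2.
d = {
    frozenset("AB"): 4, frozenset("AC"): 5, frozenset("AD"): 8,
    frozenset("BC"): 7, frozenset("BD"): 10, frozenset("CD"): 7,
}
for pair1, pair2 in (("AB", "CD"), ("AC", "BD"), ("AD", "BC")):
    print(pair1, pair2, split_indices(d, pair1, pair2))
```

For the topology that generated the distances, both indices recover the internal branch length; for the other two topologies at least one index is negative, which is the signal described above for a probably wrong tree.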
Tree Testing:
Bootstrapping
Used to assess the support for individual branches
Randomly resample characters, with replacement
Repeat many times (1000 or more)
How often does a specific branch appear?
[Figure: tree of rat, human, turtle, fruit fly, oak, and duckweed with bootstrap support values of 100, 73, and 98 on its internal branches]
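The resampling step can be sketched as follows. The tree-building step is deliberately a toy: `closest_pair_clades` is a made-up stand-in that reports only the pair of most similar sequences, not a real phylogenetic method, and the alignment is hypothetical. Any real tree builder that returns a set of clades could be passed in its place.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {tip: "".join(seq[i] for i in columns)
            for tip, seq in alignment.items()}


def closest_pair_clades(alignment):
    """Toy tree builder (an assumption): return the single clade formed by
    the two sequences with the fewest differences."""
    def differences(pair):
        x, y = pair
        return sum(p != q for p, q in zip(alignment[x], alignment[y]))
    return {frozenset(min(combinations(sorted(alignment), 2), key=differences))}


def bootstrap_support(alignment, build_tree, clade, replicates=1000, seed=0):
    """Percentage of bootstrap replicates whose tree contains `clade`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        clades = build_tree(bootstrap_alignment(alignment, rng))
        hits += frozenset(clade) in clades
    return 100 * hits / replicates


# Hypothetical aligned sequences, for illustration only.
alignment = {"A": "AAGTCCGA", "B": "AAGCCCGA", "C": "CTGTCAGA", "D": "CTACTAGG"}
print(bootstrap_support(alignment, closest_pair_clades, {"A", "B"}))
```

Each replicate has the same number of characters as the original alignment, but some columns appear several times and others not at all. A branch that keeps appearing despite this reshuffling is supported by many characters rather than by a few.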
Tree Testing:
Bootstrapping
MacClade Example:
Vertebrate evolution
IV. Specific Applications
HIV Example 1:
Florida dentist case
• 1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?
• HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).
• Compared viral sequence from the dentist, three of his HIV+ patients, and two HIV+ local controls.
Florida dentist case
So what do the results mean?
• 2 of 3 patients closer to dentist than to local controls. Statistical significance? More powerful analyses?
• Do we have enough data to be confident in our conclusions? What additional data would help?
• If we determine that the dentist’s virus is linked to those of patients E and G, what are possible interpretations of this pattern? How could we test between them?