Accurate gene phylogeny across
multiple complete genomes
Species Informed Distance-based
Reconstruction
Matt Rasmussen and Manolis Kellis
The goal
Determine the evolutionary history
of every gene in multiple complete genomes
From phylogenies
determine:
• Orthologs
• Paralogs
• Duplications
• Losses
• Family expansions
• Varying rates of
evolution
• Etc…
Contrast of the phylogenetic method with
alternative methods
1. Pair-wise sequence comparison
– Best bi-directional BLAST hits
– Focuses on one-to-one orthologs (no duplications)
2. Hit clustering methods
– Detect clusters in graph of pair-wise hits
– Difficult to separate large connected components
3. Synteny methods
– Detect conserved regions, stretches of nearby hits
– Genome alignment methods focus on best hits
4. Phylogenetic methods
– Phylogeny of family clusters orthologs near each other
– Traditionally applied to specific families
– Can they be applied genome-wide?
What is the accuracy of current
phylogenetic methods?
Tricky question:
• Requires knowing the correct phylogeny by an
independent means
• Previously,
– Simulation
– Or avoid accuracy and focus on robustness (bootstrap)
What is the accuracy of current
phylogenetic methods?
Use synteny to determine phylogeny by an independent means
[Figure: trees found by maximum likelihood (PHYML); label: matches species topology]
Phylogenies across 5154 syntenic one-to-one orthologs
[Figure: reconstructed topologies; labels: matches species topology, 316 other topologies]
Reconstruction accuracy dependent on
gene sequence length
Accuracy of current phylogenetic methods
• Average gene is too short
• Too few phylogenetically
informative characters
• To make progress, must use
additional information
• Current algorithms ignore species
• Designed for solving the
species tree problem
• Whole genomes change the game
• We can assume species tree is known
• We would like to solve the gene tree problem
• Our approach:
• Design an algorithm specifically for the gene tree problem
• Key insight: use species tree to inform the gene tree
reconstruction
What is the connection between species
and gene evolution?
[Figure: 5154 gene trees; scale bars: 1.0 sub/site]
Correlation between branch lengths
[Figure: asp branch lengths vs. Mer branch lengths, r = 0.957; labels: total tree length, relative branch lengths]
[Figure: average gene tree]
93% of trees have a correlation greater than 0.8
with the average gene tree
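The correlation statistic above can be sketched as follows. This is a minimal illustration, assuming every gene tree shares one topology and is stored as a list of branch lengths in a fixed branch order; the function names and the toy data are hypothetical and are not taken from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_tree(trees):
    """Mean length of each branch across trees (assumes a shared branch order)."""
    n = len(trees)
    return [sum(t[i] for t in trees) / n for i in range(len(trees[0]))]

def fraction_correlated(trees, threshold=0.8):
    """Fraction of trees whose branch lengths correlate with the average tree above threshold."""
    avg = average_tree(trees)
    hits = sum(1 for t in trees if pearson(t, avg) > threshold)
    return hits / len(trees)

# Toy data (made up): two trees proportional to each other, one reversed.
toy_trees = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.3, 0.2, 0.1]]
toy_fraction = fraction_correlated(toy_trees)
```

On real data the same computation would be run over all one-to-one trees to obtain a figure comparable to the 93% quoted above.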
Effect of normalization on
branch correlation
[Figure: axes dvir, dana]
Effect of normalization on
branch length distribution
• Absolute branch lengths: gamma distributed
• Relative branch lengths: normally distributed
A new model for gene family evolution:
Two forces
1. Family rates
Fj ~ gamma(a, b)
2. Species-specific rates
Sij ~ normal(ui, sij)
bij = Fj * Sij
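A short simulation may make the two-force model concrete. This is only a sketch: it assumes gamma(a, b) means shape a and rate b (the slides do not say which parametrization is used), it treats the second parameter of each normal as a standard deviation, and the function name and parameter values are made up for illustration.

```python
import random

def simulate_family(species_params, a, b, rng):
    """Draw one gene family's branch lengths as b_ij = F_j * S_ij.

    species_params: list of (mean, sdev) pairs, one per species branch.
    a, b: gamma shape and rate for the family rate (assumed parametrization).
    """
    # random.gammavariate takes shape and SCALE, so the rate b becomes 1/b.
    family_rate = rng.gammavariate(a, 1.0 / b)
    # A normal draw can be negative; a real implementation would need to handle that.
    species_rates = [rng.gauss(mu, sd) for mu, sd in species_params]
    return family_rate, [family_rate * s for s in species_rates]

rng = random.Random(0)
params = [(0.10, 0.025), (0.20, 0.05), (0.30, 0.075)]  # sdev = 1/4 of the mean
family_rate, branch_lengths = simulate_family(params, a=2.0, b=4.0, rng=rng)
```

Dividing each simulated branch length by the family rate recovers the species-specific draws, which is the sense in which relative branch lengths follow the per-species normals.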
Effects that we have seen are
consequences of this model
bij = Fj * Sij
• Total tree length Lj of one-to-one trees
is proportional to family rate Fj
• If species rates have small standard
deviations we expect branch correlation
The standard deviation of every species-specific rate is nearly ¼ of the mean
What is the meaning of the species-specific rate?
• The normal is partly due to error in estimating evolutionary
distance
• If we fit normals only on long sequences, the standard
deviation goes down
• Species-specific means are not affected by sequence length.
All of these effects also hold for 17 fungi
and 4 mammals
12 Flies
Absolute branches
distributed by gamma
Relative branches
distributed by normal
3 < Mean / sdev < 4
17 Fungi
Absolute branches
distributed by gamma
Relative branches
distributed by normal
3 < Mean / sdev < 4
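The distribution fits summarized above could be approximated with method-of-moments estimates. This is a sketch under that assumption; the slides do not say how the gamma and normal distributions were actually fitted, and the function names and sample values are hypothetical.

```python
import statistics

def fit_normal(values):
    """Sample mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def fit_gamma_moments(values):
    """Method-of-moments gamma fit: shape = mean^2 / var, rate = mean / var."""
    mean = statistics.mean(values)
    var = statistics.variance(values)
    return mean * mean / var, mean / var

def mean_over_sdev(values):
    """The ratio reported on the slides as lying between 3 and 4."""
    mu, sd = fit_normal(values)
    return mu / sd

# Made-up relative branch lengths for one species branch.
relative_lengths = [0.75, 1.0, 1.25]
ratio = mean_over_sdev(relative_lengths)
```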
A new strategy for gene tree
reconstruction
• Traditional Maximum Likelihood methods
– Propose many topologies
– For each topology
• Calculate the likelihood of seeing such a tree
– Return tree that achieves max likelihood
• We show that one can calculate the likelihood
of a tree being generated by our model
• Thus, we can create our own phylogenetic
algorithm that uses species information to
reconstruct gene trees.
Likelihood calculation: simple case
INPUT: a distance matrix with all pairwise distances between genes
• Propose a topology
• Fit branch lengths to topology
• Estimate family rate
• Normalize tree
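The last two steps can be sketched as below. The slides do not spell out the estimator, so this assumes the simplest one suggested by the model: since total tree length is proportional to the family rate, divide the total length by the sum of the species-specific means. Both function names are hypothetical.

```python
def estimate_family_rate(branch_lengths, species_means):
    """Assumed estimator: total tree length divided by the expected relative length."""
    return sum(branch_lengths) / sum(species_means)

def normalize_tree(branch_lengths, family_rate):
    """Convert absolute branch lengths to relative ones by dividing out the family rate."""
    return [b / family_rate for b in branch_lengths]

# Made-up example: a family evolving at twice the average rate.
absolute = [0.2, 0.4, 0.6]
means = [0.1, 0.2, 0.3]
rate = estimate_family_rate(absolute, means)
relative = normalize_tree(absolute, rate)
```

After normalization each relative branch length can be compared directly with the normal distribution learned for its species branch.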
• Reconcile gene tree to species tree
• Determines actual path of evolution
through species tree
• Algorithms exist to do this fast (linear time)
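Reconciliation can be illustrated with the classic last-common-ancestor mapping. The sketch below is deliberately simple and is not one of the linear-time algorithms mentioned above; it assumes binary trees written as nested tuples with unique leaf names, and all names in it are hypothetical.

```python
def build_parents(tree, parent=None, parents=None):
    """Map every node of a nested-tuple species tree to its parent (root maps to None)."""
    if parents is None:
        parents = {}
    parents[tree] = parent
    if isinstance(tree, tuple):
        for child in tree:
            build_parents(child, tree, parents)
    return parents

def ancestors(node, parents):
    """Path from a node up to the root, including the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def lca(a, b, parents):
    """Last common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parents))
    for node in ancestors(b, parents):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def reconcile(gene_tree, gene_to_species, parents, events):
    """Return the species node a gene node maps to, labeling internal nodes in events."""
    if not isinstance(gene_tree, tuple):
        return gene_to_species[gene_tree]
    left = reconcile(gene_tree[0], gene_to_species, parents, events)
    right = reconcile(gene_tree[1], gene_to_species, parents, events)
    here = lca(left, right, parents)
    # A node mapping to the same species node as one of its children is a duplication.
    events[gene_tree] = "dup" if here in (left, right) else "spec"
    return here

species_tree = ((("rat", "mouse"), "human"), "dog")
parents = build_parents(species_tree)
gene_to_species = {"r1": "rat", "m1": "mouse", "h1": "human", "d1": "dog"}
```

Calling `reconcile` on a gene tree that mirrors the species tree labels every internal node as a speciation, while a differently rooted gene tree such as `(("r1", "m1"), ("h1", "d1"))` gets a duplication at its root.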
• Compare branch lengths to
distributions
• Allows us to calculate a
likelihood for every branch
Every branch is highly likely, so the tree is highly likely
• Because branches are
independent, likelihood of tree
is product of branch likelihoods
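The product of branch likelihoods is easiest to compute as a sum of logs. The following is a minimal sketch for the simple case, assuming each relative branch length is scored against one normal distribution given as a (mean, sdev) pair; the function names and numbers are illustrative only.

```python
import math

def normal_logpdf(x, mu, sd):
    """Log density of a normal distribution with mean mu and standard deviation sd."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mu) ** 2 / (2.0 * sd * sd)

def tree_loglik(relative_lengths, branch_params):
    """Log-likelihood of a tree: the sum of independent per-branch log densities."""
    return sum(normal_logpdf(x, mu, sd)
               for x, (mu, sd) in zip(relative_lengths, branch_params))

# Made-up parameters for two species branches.
branch_params = [(0.1, 0.025), (0.2, 0.05)]
well_fitting = tree_loglik([0.1, 0.2], branch_params)
poorly_fitting = tree_loglik([0.2, 0.1], branch_params)
```

Branch lengths that sit at the species means score higher than the same lengths assigned to the wrong branches, which is the signal the method relies on.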
Likelihood calculation: complex case
• Propose another topology
• This one differs only by rooting
• Most branches have the same length (just a different name)
  • w = e (human)
  • x = c (rat)
  • y = d (mouse)
  • z = b (rodent)
• Two branches are now merged
  • v = a + f (dog/hmr)
• Reconcile gene tree to species tree
[Figure: gene tree reconciled to the species tree (Rat, Mouse, Human, Dog)]
• Mouse and rat branches have the same likelihood as before
  • Px = Pc
  • Py = Pd
• Same distribution for dog, but now the dog branch is too long. Why?
  • v = a + f
  • Pv < Pf
• Branch w goes from Eutherian to Human (two species branches, w1 and w2)
• Which distribution should we use?
• The distribution is the sum of two independent normals
  w = w1 + w2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
• Branch w is too short, Pw < Pe
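The rule for a gene branch that spans several species branches can be sketched as follows: means add and variances add. This assumes each species branch is described by a (mean, sdev) pair; the function names and the numbers are made up for illustration.

```python
import math

def normal_logpdf(x, mu, sd):
    """Log density of a normal distribution with mean mu and standard deviation sd."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mu) ** 2 / (2.0 * sd * sd)

def spanning_params(parts):
    """Mean and sdev of a sum of independent normals given as (mean, sdev) pairs."""
    mu = sum(m for m, _ in parts)
    sd = math.sqrt(sum(s * s for _, s in parts))
    return mu, sd

def branch_loglik(length, parts):
    """Log-likelihood of one gene-tree branch crossing the given species branches."""
    mu, sd = spanning_params(parts)
    return normal_logpdf(length, mu, sd)

# A length that fits one species branch well is too short for two in a row.
one_branch = branch_loglik(0.1, [(0.1, 0.025)])
two_branches = branch_loglik(0.1, [(0.1, 0.025), (0.1, 0.025)])
```

Here `two_branches` is lower than `one_branch`, mirroring the statement that Pw < Pe once branch w must cross two species branches.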
• Same case for z
  • Two species branches
  • Distribution is the sum of two independent normals
    z = z1 + z2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
• Branch z is too short
  • Pz < Pb
Some branches are less likely, so the tree is less likely
Bringing it all together
• It turns out we can find the likelihood of any
tree by breaking it down into one of three
cases
• Main advantage: do not explicitly
penalize dup/loss
– Only ensure branch lengths are close to
what we expect given our model
Example of reconstructing tree with
dup/loss: hemoglobin genes
[Figure: gene tree with a Hemoglobin alpha clade and a Hemoglobin beta clade, each with leaves D, H, M, R]
This is now the correct topology
All branches are highly likely, so the tree is highly likely
• Branch z is now longer
• Branch w is now longer
• Branch v is just the right length
Evaluation: Datasets
• Real datasets
– 5154 syntenic one-to-ones from 12 flies
– 739 syntenic one-to-ones from 17 fungi
– 200 neighboring fly orthologs
– 220 whole genome duplicates in 7 yeasts
• Simulated (using our gene family model)
– More complex events
[Figure: neighboring orthologs; WGD trees (klac, kwal, agos; sbay, smik, spar, scer)]
Evaluation
Apply genome-wide for 17 fungi
• Cluster genes
• Build alignment for each cluster
• Build tree for each alignment
• Reconcile to species tree to determine all duplications and losses
General trees follow the model we learned
from one-to-one trees
GO enrichment in top 50 trees with most
duplications
pval       term
-1.50E-11  plasma membrane
4.72E-12   helicase activity
1.46E-09   ammonium transporter activity
1.66E-09   telomere maintenance via recombination
4.11E-09   DNA helicase activity
4.40E-07   transport
4.80E-07   transporter activity
1.27E-06   membrane
6.29E-06   alcohol dehydrogenase activity
6.29E-06   nitrogen utilization
6.29E-06   ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
6.29E-06   alpha-1,3-mannosyltransferase activity
3.84E-05   alcohol dehydrogenase (NADP+) activity
3.84E-05   magnesium ion transport
3.84E-05   sodium ion transport
3.84E-05   basic amino acid transporter activity
3.84E-05   lysophospholipase activity
4.17E-05   nuclear nucleosome
0.000129   cellular component unknown
0.000139   translational elongation
0.00015    oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
0.000231   translation elongation factor activity
0.000353   oxidoreductase activity
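The slides do not state which test produced these p-values. A common choice for GO enrichment is the one-sided hypergeometric test, sketched here purely as an assumption; the function name and the numbers in the example are hypothetical and unrelated to the values in the table.

```python
from math import comb

def enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total

# Made-up example: 10 genes, 5 annotated, and all 5 sampled genes are annotated.
p_all_hits = enrichment_pvalue(population=10, annotated=5, sample=5, hits=5)
```

A true probability computed this way is never negative, so a negative entry in a table of p-values would have to come from something other than this formula.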
GO enrichment in top 50 trees with most
gene losses
pval       term
4.49E-12   helicase activity
1.60E-09   telomere maintenance via recombination
3.95E-09   DNA helicase activity
7.11E-09   GTPase activity
6.79E-08   ubiquitin conjugating enzyme activity
4.24E-07   translational elongation
6.17E-06   alcohol dehydrogenase activity
6.17E-06   ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
6.17E-06   alpha-1,3-mannosyltransferase activity
9.27E-06   translation elongation factor activity
3.78E-05   alcohol dehydrogenase (NADP+) activity
3.78E-05   1,3-beta-glucan synthase activity
3.78E-05   sodium ion transport
3.78E-05   IMP dehydrogenase activity
4.35E-05   ribosome
0.000139   protein serine/threonine phosphatase activity
0.000148   oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
0.00018    structural constituent of cytoskeleton
0.00036    alpha-glucosidase activity
0.00036    fermentation
0.00036    proteasome complex (sensu Eukaryota)
0.00036    transmembrane receptor activity
0.000505   protein amino acid O-linked glycosylation
0.000701   ammonium transporter activity
GO enrichment in top 50 trees with most
genes
pval       term
1.55E-11   helicase activity
3.83E-09   telomere maintenance via recombination
1.05E-08   DNA helicase activity
2.84E-08   plasma membrane
7.94E-08   1,3-beta-glucanosyltransferase activity
1.45E-07   transporter activity
1.71E-07   transport
2.09E-06   pyruvate decarboxylase activity
2.29E-06   protein amino acid O-linked glycosylation
7.73E-06   Golgi apparatus
1.01E-05   alcohol dehydrogenase activity
1.01E-05   ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
1.01E-05   alpha-1,3-mannosyltransferase activity
1.57E-05   GTPase activity
1.70E-05   cyclin-dependent protein kinase holoenzyme complex
2.95E-05   alpha-1,2-mannosyltransferase activity
2.95E-05   regulation of glycogen biosynthesis
5.51E-05   cytosine-purine permease activity
5.51E-05   sodium ion transport
5.51E-05   basic amino acid transporter activity
5.51E-05   lysophospholipase activity
0.000159   membrane
0.000386   cell wall (sensu Fungi)
GO enrichment in top 10 trees with most
genes
pval       term
3.79E-13   DNA helicase activity
4.61E-13   telomere maintenance via recombination
8.05E-13   helicase activity
5.92E-08   ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
5.92E-08   alcohol dehydrogenase activity
1.15E-06   sodium ion transport
1.13E-05   fermentation
0.00011    calcium-transporting ATPase activity
0.00011    NADPH dehydrogenase activity
0.000146   transport
0.000178   oxidoreductase activity
0.000235   membrane
0.000328   alcohol dehydrogenase (NADP+) activity
0.000652   alcohol metabolism
0.000735   transporter activity
0.000846   plasma membrane
0.001078   multidrug transport
Supplemental figure
# Duplications vs rel sub/site for each
species branch
Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog, rabbit; labels: orthologs, paralogs]
• Orthologs arise by speciation
– typically keep same function
• Paralogs arise by duplication
– typically take on new functions
Likelihood calculation: complex case
• Mouse and rat branches have the same likelihood as before
  • Px = Pc
  • Py = Pd
• Dog is now too long. Why?
  • v = a + f
  • Pv < Pf
• Human is now too short, because it must now cross an extra species
Figure 4. Gene-tree evaluation with a richer species-tree model
a. Gene-tree with correct topology scores highly
b. Gene-tree with incorrect topology scores poorly