Coalescent Consequences for Consensus Cladograms
J. H. Degnan (1), M. Degiorgio (2), D. Bryant (3), and N. A. Rosenberg (1, 2)
(1) Dept. of Human Genetics, U. of Michigan
(2) Bioinformatics Program, U. of Michigan
(3) Dept. of Mathematics, U. of Auckland
21 December 2007
Outline
Species trees vs. gene trees
Consensus tree background
Asymptotic consensus trees
Finite sample consensus trees
Consistency results
Conclusions
Gene trees vary across the genome
Why? Incomplete lineage sorting,
horizontal gene transfer, sampling, etc.
Gene tree discordance
From one true species tree, we expect there to
be different gene trees at different loci as a
result of lineage sorting, independently of
problems due to estimation or sampling error.
Gene tree discordance depends especially on
branch lengths in the species tree, measured
by the number of generations scaled by
effective population size, t / N.
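To make the dependence on branch length concrete, here is a minimal sketch for the simplest case, three species with species tree ((A,B),C). It assumes the standard three-taxon coalescent result that each discordant gene tree has probability exp(-T)/3, where T is the internal branch length in coalescent units. The function name and the sample values of T are illustrative and are not taken from the slides.

```python
import math

def three_taxon_gene_tree_probs(T):
    """Gene tree probabilities for species tree ((A,B),C).

    T is the length of the internal branch in coalescent units
    (an assumption of this sketch; it must be non-negative).
    """
    if T < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-T) / 3.0  # each of the two non-matching topologies
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

# Short branches give gene trees that are close to equally likely;
# long branches make the matching gene tree dominate.
for T in (0.05, 0.1, 1.0, 2.0):
    probs = three_taxon_gene_tree_probs(T)
    print(T, {name: round(p, 3) for name, p in probs.items()})
```

As T approaches 0 the three topologies approach probability 1/3 each, which is the regime where discordance is most severe.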
[Figure: probabilities of the 15 gene tree topologies on taxa A, B, C, D. Two panels: x = 2, y = 1.2 (probability axis 0 to 0.8) and x = y = 0.1 (probability axis 0 to 0.14).]
Consensus (majority-rule)
Types of consensus trees
Strict: only clades that are included in all observed trees are in the
consensus tree. In the coalescent model, all clades have probability > 0,
so with enough loci every clade is eventually contradicted by some gene tree.
Democratic vote: use the gene tree that occurs most frequently.
Majority rule: the consensus tree has all clades that were observed in > 50%
of trees.
Greedy: sort clades by their proportions. Accept the most frequently
observed clades one at a time, provided each is compatible with the clades
already accepted. Do this until you have a fully resolved tree.
R*: for each set of 3 taxa, find the most commonly occurring triple, e.g.,
(AB)C, (AC)B or (BC)A. Build the tree from the most commonly occurring
triples.
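The clade-based rules above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: each rooted gene tree is represented as the set of its nontrivial clades, the helper names (`tree`, `compatible`, and so on) are hypothetical, ties are broken arbitrarily, and the ten sample trees are invented for the example.

```python
from collections import Counter

def tree(*clades):
    """Represent a rooted tree by its nontrivial clades, e.g. tree("AB", "ABC")."""
    return frozenset(frozenset(c) for c in clades)

def compatible(c1, c2):
    """Two clades can share a tree if they are nested or disjoint."""
    return c1 <= c2 or c2 <= c1 or not (c1 & c2)

def clade_counts(trees):
    return Counter(clade for t in trees for clade in t)

def strict(trees):
    """Clades present in every observed tree."""
    return {c for c, k in clade_counts(trees).items() if k == len(trees)}

def democratic_vote(trees):
    """The single most frequent gene tree (ties broken arbitrarily)."""
    return Counter(trees).most_common(1)[0][0]

def majority_rule(trees):
    """Clades present in more than half of the observed trees."""
    return {c for c, k in clade_counts(trees).items() if k > len(trees) / 2}

def greedy(trees):
    """Accept clades in decreasing frequency when compatible with those accepted."""
    counts = clade_counts(trees)
    accepted = []
    for clade in sorted(counts, key=counts.get, reverse=True):
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return set(accepted)

# Invented sample: 4 x ((AB)(CD)), 3 x (((AB)C)D), 3 x (((AC)B)D)
trees = [tree("AB", "CD")] * 4 + [tree("AB", "ABC")] * 3 + [tree("AC", "ABC")] * 3

def label(clades):
    return sorted("".join(sorted(c)) for c in clades)

print("strict:         ", label(strict(trees)))
print("democratic vote:", label(democratic_vote(trees)))
print("majority rule:  ", label(majority_rule(trees)))
print("greedy:         ", label(greedy(trees)))
```

In this sample the most frequent single tree is ((AB)(CD)), while majority rule and greedy both keep {AB} and {ABC}, so the methods can disagree on the same data.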
Asymptotic consensus trees
Consensus trees are usually statistics: functions of
the data, like the sample mean x-bar.
We consider replacing observed (estimated) gene
trees with their theoretical probabilities under
coalescence and determining the resulting consensus
tree.
Motivation: if there are a large number of independent
loci, observed gene tree and clade proportions should
approximate their theoretical probabilities.
Greedy tree examples
(Numbers in parentheses rank the clades by probability.)

Tree/Clade    Probability        x = y = 0.1   x = y = 0.05
((AB)(CD))    p1                 0.128         0.121
((AC)(BD))    p2                 0.099         0.105
((AD)(BC))    p3                 0.099         0.105
(((AB)C)D)    p4                 0.104         0.079
(((AB)D)C)    p5                 0.091         0.075
(((AC)B)D)    p6                 0.066         0.061
(((AC)D)B)    p7                 0.062         0.060
(((AD)B)C)    p8                 0.037         0.045
(((AD)C)B)    p9                 0.037         0.045
(((BC)A)D)    p10                0.066         0.061
(((BC)D)A)    p11                0.062         0.060
(((BD)A)C)    p12                0.037         0.045
(((BD)C)A)    p13                0.037         0.045
(((CD)A)B)    p14                0.037         0.045
(((CD)B)A)    p15                0.037         0.045
{AB}          p1 + p4 + p5       0.332 (1)     0.275 (1)
{AC}          p2 + p6 + p7       0.227 (2)     0.226 (2)
{AD}          p3 + p8 + p9       0.173 (6)     0.189 (7)
{BC}          p3 + p10 + p11     0.226 (3)     0.226 (2)
{BD}          p2 + p12 + p13     0.173 (6)     0.195 (6)
{CD}          p1 + p14 + p15     0.202 (5)     0.211 (4)
{ABC}         p4 + p6 + p10      0.215 (4)     0.201 (5)
{ABD}         p5 + p8 + p12      0.165 (8)     0.165 (8)
{ACD}         p7 + p9 + p14      0.136 (9)     0.150 (9)
{BCD}         p11 + p13 + p15    0.136 (9)     0.150 (9)
Greedy tree                      (((AB)C)D)    ((AB)(CD))
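The clade probabilities and the greedy choice can be recomputed from the 15 tree probabilities. The sketch below is illustrative only: it hard-codes the rounded tree probabilities from the two example columns, lists each tree by its nontrivial clades, and sums a clade's probability over the trees that contain it. Because the inputs are rounded to three decimals, some sums may differ slightly from the clade values printed on the slide. The names `TREES`, `clade_probs`, and `greedy` are hypothetical.

```python
# (clades of the tree, probability at x = y = 0.1, probability at x = y = 0.05)
TREES = [
    (("AB", "CD"), 0.128, 0.121),   # ((AB)(CD))
    (("AC", "BD"), 0.099, 0.105),   # ((AC)(BD))
    (("AD", "BC"), 0.099, 0.105),   # ((AD)(BC))
    (("AB", "ABC"), 0.104, 0.079),  # (((AB)C)D)
    (("AB", "ABD"), 0.091, 0.075),  # (((AB)D)C)
    (("AC", "ABC"), 0.066, 0.061),  # (((AC)B)D)
    (("AC", "ACD"), 0.062, 0.060),  # (((AC)D)B)
    (("AD", "ABD"), 0.037, 0.045),  # (((AD)B)C)
    (("AD", "ACD"), 0.037, 0.045),  # (((AD)C)B)
    (("BC", "ABC"), 0.066, 0.061),  # (((BC)A)D)
    (("BC", "BCD"), 0.062, 0.060),  # (((BC)D)A)
    (("BD", "ABD"), 0.037, 0.045),  # (((BD)A)C)
    (("BD", "BCD"), 0.037, 0.045),  # (((BD)C)A)
    (("CD", "ACD"), 0.037, 0.045),  # (((CD)A)B)
    (("CD", "BCD"), 0.037, 0.045),  # (((CD)B)A)
]

def compatible(c1, c2):
    """Clades are compatible if nested or disjoint."""
    a, b = set(c1), set(c2)
    return a <= b or b <= a or not (a & b)

def clade_probs(column):
    """Sum tree probabilities over the trees containing each clade."""
    probs = {}
    for row in TREES:
        for clade in row[0]:
            probs[clade] = probs.get(clade, 0.0) + row[column]
    return probs

def majority_rule(column):
    """Clades with probability above 1/2 (may be empty, i.e. unresolved)."""
    return sorted(c for c, p in clade_probs(column).items() if p > 0.5)

def greedy(column):
    """Accept clades in decreasing probability when compatible."""
    probs = clade_probs(column)
    accepted = []
    for clade in sorted(probs, key=probs.get, reverse=True):
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted

for name, column in (("x = y = 0.1", 1), ("x = y = 0.05", 2)):
    print(name, "greedy:", greedy(column), "majority rule:", majority_rule(column))
```

With these inputs the greedy clades are {AB}, {ABC} at x = y = 0.1 and {AB}, {CD} at x = y = 0.05, in agreement with the greedy trees in the table, while no clade exceeds 1/2 in either column, so the majority-rule tree is unresolved.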
R* tree examples
(* marks the more probable of the two listed triples.)

Tree/Triple   Probability                   x = y = 0.1   x = y = 0.05
((AB)(CD))    p1                            0.128         0.121
((AC)(BD))    p2                            0.099         0.105
((AD)(BC))    p3                            0.099         0.105
(((AB)C)D)    p4                            0.104         0.079
(((AB)D)C)    p5                            0.091         0.075
(((AC)B)D)    p6                            0.066         0.061
(((AC)D)B)    p7                            0.062         0.060
(((AD)B)C)    p8                            0.037         0.045
(((AD)C)B)    p9                            0.037         0.045
(((BC)A)D)    p10                           0.066         0.061
(((BC)D)A)    p11                           0.062         0.060
(((BD)A)C)    p12                           0.037         0.045
(((BD)C)A)    p13                           0.037         0.045
(((CD)A)B)    p14                           0.037         0.045
(((CD)B)A)    p15                           0.037         0.045
(AB)C*        p1 + p4 + p5 + p8 + p12       0.397         0.365
(AC)B         p2 + p6 + p7 + p9 + p14       0.301         0.316
(AB)D*        p1 + p4 + p5 + p6 + p10       0.455         0.397
(AD)B         p3 + p7 + p8 + p9 + p14       0.272         0.391
(AC)D*        p2 + p4 + p6 + p7 + p10       0.397         0.366
(AD)C         p3 + p5 + p8 + p9 + p12       0.301         0.315
(BC)D*        p3 + p4 + p6 + p10 + p11      0.397         0.366
(BD)C         p2 + p5 + p8 + p12 + p13      0.301         0.315
R* tree                                     (((AB)C)D)    (((AB)C)D)
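The triple probabilities can likewise be recomputed from the 15 tree probabilities. The sketch below is illustrative and rests on two assumptions that are not spelled out on the slides: a tree displays the triple (ab)c when some clade contains a and b but not c, and the R* tree is assembled by keeping every candidate clade S such that (ab)c is the winning triple for all a, b in S and all c outside S. Inputs are the rounded table values, ties are not handled, and all names are hypothetical.

```python
from itertools import combinations

TAXA = "ABCD"
# (clades of the tree, probability at x = y = 0.1, probability at x = y = 0.05)
TREES = [
    (("AB", "CD"), 0.128, 0.121), (("AC", "BD"), 0.099, 0.105),
    (("AD", "BC"), 0.099, 0.105), (("AB", "ABC"), 0.104, 0.079),
    (("AB", "ABD"), 0.091, 0.075), (("AC", "ABC"), 0.066, 0.061),
    (("AC", "ACD"), 0.062, 0.060), (("AD", "ABD"), 0.037, 0.045),
    (("AD", "ACD"), 0.037, 0.045), (("BC", "ABC"), 0.066, 0.061),
    (("BC", "BCD"), 0.062, 0.060), (("BD", "ABD"), 0.037, 0.045),
    (("BD", "BCD"), 0.037, 0.045), (("CD", "ACD"), 0.037, 0.045),
    (("CD", "BCD"), 0.037, 0.045),
]

def displayed_triple(clades, trio):
    """Return (pair, outgroup) for the rooted triple the tree shows on trio."""
    for pair in combinations(trio, 2):
        (out,) = set(trio) - set(pair)
        if any(set(pair) <= set(c) and out not in c for c in clades):
            return "".join(pair), out
    raise ValueError("tree is not resolved on this trio")

def triple_probs(column):
    """Sum tree probabilities over the trees displaying each triple."""
    probs = {}
    for row in TREES:
        for trio in combinations(TAXA, 3):
            key = displayed_triple(row[0], trio)
            probs[key] = probs.get(key, 0.0) + row[column]
    return probs

def r_star_clades(column):
    """Clades supported by the most probable triple of every three taxa."""
    probs = triple_probs(column)
    winners = set()
    for trio in combinations(TAXA, 3):
        options = [k for k in probs if set(k[0]) | {k[1]} == set(trio)]
        winners.add(max(options, key=probs.get))
    clades = []
    for size in range(2, len(TAXA)):
        for cand in combinations(TAXA, size):
            outside = [t for t in TAXA if t not in cand]
            if all(("".join(pair), c) in winners
                   for pair in combinations(cand, 2) for c in outside):
                clades.append("".join(cand))
    return clades

for name, column in (("x = y = 0.1", 1), ("x = y = 0.05", 2)):
    print(name, "R* clades:", r_star_clades(column))
```

For both columns the retained clades are {AB} and {ABC}, that is, the tree (((AB)C)D) shown in the table.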
Unresolved zone for majority-rule and too-greedy zone
What about finite samples?
If you sample 10 loci, you could have:
All 10 match the species tree
9 match the species tree, 1 disagrees
8 match the species tree, 2 disagree, etc.
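You can consider gene trees as categories and use multinomial
probabilities for the probability of your sample. With n loci, k possible
gene trees with probabilities p_1, ..., p_k, and observed counts
n_1, ..., n_k, the probability that the consensus tree c(n_1, ..., n_k)
equals a given tree T is

\[
\Pr[c(n_1,\dots,n_k) = T] = \sum_{\text{samples}} \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}\, I\bigl(c(n_1,\dots,n_k) = T\bigr)
\]

where the sum runs over all samples with n_1 + ... + n_k = n and I is the
indicator function.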
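The multinomial sum can be evaluated exactly when the number of categories is small. The sketch below is a hedged illustration rather than the authors' computation: it uses three gene tree categories (the three-taxon case, with probabilities taken from the standard coalescent result for an internal branch of length 0.1 coalescent units), and democratic vote as the consensus function c, returning no tree when the top count is tied. All function names are hypothetical.

```python
from math import exp, factorial

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, probs):
    """n! / (n_1! ... n_k!) * p_1^n_1 ... p_k^n_k"""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an exact integer at every step
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

def democratic_vote(counts):
    """Index of the unique most frequent category, or None on a tie."""
    best = max(counts)
    winners = [i for i, c in enumerate(counts) if c == best]
    return winners[0] if len(winners) == 1 else None

def prob_consensus_equals(target, n, probs, consensus):
    """Sum the multinomial probabilities of all samples whose consensus is target."""
    return sum(multinomial_pmf(counts, probs)
               for counts in compositions(n, len(probs))
               if consensus(counts) == target)

# Category 0 matches the species tree ((A,B),C); 1 and 2 are discordant.
d = exp(-0.1) / 3.0
probs = [1.0 - 2.0 * d, d, d]

for n in (10, 50, 200):
    p = prob_consensus_equals(0, n, probs, democratic_vote)
    print(f"n = {n:3d} loci: Pr[consensus = species tree] = {p:.3f}")
```

For the 15 topologies on four taxa the number of possible samples grows quickly with n, so simulation may be more practical than full enumeration there.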
Are consensus trees inconsistent
estimators of species trees?
Theorem 1. Majority-rule asymptotic
consensus trees (MACTs) do not have any
clades not on the species tree.
Theorem 2. Greedy asymptotic consensus
trees (GACTs) can be misleading estimators of
species trees for the 4-taxon asymmetric tree
and for any species tree with n > 4 species.
Theorem 3. R* asymptotic consensus trees
(RACTs) always match the species tree.
Conclusions
Coalescent gene tree probabilities are useful for
understanding asymptotic behavior of consensus trees
constructed from independent gene trees.
Greedy consensus trees can be misleading, but are
typically quicker to approach the species tree than
majority-rule or R* when outside of the too-greedy zone.
R* consensus trees are consistent and more resolved
than majority-rule consensus trees.