Gene trees and species trees
The lines of organismal descent that make up the tree of life serve as conduits for the
passage of genetic material from generation to generation. Understanding how genes
pass through lines of descent is therefore needed to fully understand how traits are
transmitted down evolutionary lineages. Furthermore, with gene sequence data emerging
as the main tool for reconstructing phylogenetic relationships, it is important to
understand the structure of gene histories and how they relate to population histories. It
will become clear that a set of organisms can have multiple true histories depending on
what aspect of the organisms you choose to focus on. Therefore, before embarking upon
the problem of inferring phylogenies, we need to ask ourselves, what kind of entities
(genes, organisms, populations, species) do we wish to study? And, what aspects of their
history do we want to reconstruct? We begin by clarifying the shape of gene trees as they
relate to organismic pedigrees and population trees. We then lay out some of the issues
to consider in integrating the concept of “species” into phylogenetic theory.
Gene trees in asexual organisms
Strictly asexual organisms have uniparental reproduction (one parent per offspring) and
provide a simple starting point for thinking about the inheritance of the genetic
information in DNA. A DNA strand has an order of nucleotides that gets copied, with
some high degree of fidelity, during DNA replication. The figure depicts one parent
DNA strand, its two children, and its four grandchildren.
Each nucleotide position in a daughter sequence is copied from a particular position in its
parent sequence. In this case, because there have been no insertion or deletion events, the
first nucleotide positions in all four grandchildren are homologous, as are the second, the
third, etc.
A nucleotide position in an offspring is considered homologous to that in a parent if the
former was copied from the latter during DNA replication. Homology of nucleotide
positions is independent of the fact that copying is imperfect (sometimes a parent and
offspring sequence will differ). For example, position number 7 is homologous in all the
grandchildren despite the fact that two have A’s and two have T’s at this position.
Pairing during DNA replication and the tendency for the daughter strand to match the
parent strand, rather than the perfect identity of template and copy, is all that is needed
for a positional homology. A nucleotide in a daughter strand is seen as being descended
from a parental nucleotide even if there was an error in DNA replication that caused a
change in nucleotide identity.
[Figure: one parent DNA strand, its two children, and its four grandchildren]
Sequence evolution involves changes in the nucleotides occupying a particular position.
As the underlined bases highlight, mutations have arisen (either during DNA replication or
at another time in the life of the organisms) so that the four grandchildren have distinct
sequences from one another and from the ancestral sequence. Just as we may draw a
history of organisms and use it to summarize trait evolution in those organisms, we may
also draw a history of a nucleotide position. Let us use the non-conventional but useful
term position tree for the history of a single nucleotide position with the state of that
position at each node shown. The small tree to the right shows the position tree for
position number 7 in this sequence.
We could draw position trees for all the other positions. Although the identity of the
nucleotides occupying different parts of these trees will differ, each position tree has the
same shape. This follows because every nucleotide position on an ancestral DNA
molecule is passed to the same set of descendant molecules. All position trees are
therefore concordant with one another and with the organismic pedigree.
In this example, we have dealt with a tiny stretch of DNA, but the principles scale up
perfectly to the whole genome. Uniparental inheritance constrains all nucleotide
positions in the genome to follow the same lines of inheritance. Consequently, so long as
reproduction is strictly uniparental, all position trees in a genome are fully concordant.
Gene trees in sexual populations
Sexual populations are identical to asexual ones in the sense that all nucleotide positions
have a tree-like history. A nucleotide position has only one parental position, so
nucleotide positions have uniparental inheritance even when organisms show biparental
inheritance. Nonetheless, sexual organisms differ in that each diploid genome has two
copies of every nucleotide position. These two copies have passed through a different
series of ancestral organisms before arriving in the mother’s or father’s genome. The two
homologous positions in an individual may, thus, be distantly related tips on the
underlying position tree.
When a sexual organism undergoes meiosis, recombination builds a haploid genome that
is a mixture of paternal and maternal copies of each position. Because these copies have
different histories, the haploid genome will include positions with different histories. As a
result, if we consider a set of sexual organisms, numerous, conflicting position trees
coexist in the genome. Let us explore this in more detail.
The figure to the right gives an example of a small sexual population. Each individual is
indicated by two circles, one representing the
maternal copy of a particular nucleotide
position (say the one on the left) and the other
representing the paternal copy. For example,
focus on the individual in the top left of the
figure. Its mother (the source of the left hand
circle) donated her paternal copy, whereas its
father donated his maternal copy. There are
four possible outcomes of each instance of
sexual reproduction: the maternal or paternal
copy of the mother could merge with the
maternal or paternal copy of the father. With
four possible outcomes per mating and, in this
example, 56 matings, there are a huge number
of possible histories (4^56, or roughly 5 × 10^33),
of which just one set of events is shown here.
Despite the fact that organismic inheritance is biparental, nucleotides show uniparental
inheritance. Consequently, even though the organismic pedigree is net-like, the
nucleotide history is tree-like. To see this, select a nucleotide position in the oldest
(lowermost) generation that you can see is an ancestor of at least one copy that persisted
through until the current (uppermost) generation. If you trace the descendants of this
ancestor, you will see that it forms a tree: that lineages diverge but never converge. One
example is shown.
This diagram shows relatively few generations.
However, it is probably apparent that if you
picked two positions in the current generation
you could trace backwards down their lineages
until they converged on a common ancestor.
At this point the two lineages are said to have
coalesced into a single lineage. Assuming two
nucleotides really are homologous, they will
eventually coalesce, although it could be in the
very distant past. This principle can be taken a step further: all the homologous
nucleotides in the current generation will ultimately coalesce to a single ancestral lineage.
As a general rule, the smaller the population, the more quickly coalescence occurs. Thus
in a small population lineage, you will typically not need to look far back in time to find a
common ancestor of all extant individuals. For example, in the population lineage shown
here, which has only four individuals per generation, all eight positions in the current
generation trace back to a single common ancestral position five generations in the past.
Another way of looking at the same phenomenon is to note that of
the eight copies present in the lower generations only one lineage
has persisted to the present. The others have gone extinct by failure
of organisms to reproduce or failure of the relevant copy to be
passed on to offspring. The loss of lineages over time, lineage
sorting, is inexorable. A nucleotide position present in one
generation can fail to be represented in the next, but, in a closed
population lineage (as assumed here), a nucleotide position cannot
arise except by descent from a copy in a previous generation. Thus,
there is a ratchet: copies can die but cannot be born, yielding an
unstoppable decline in genetic diversity over time. The question in
evolution is not whether genetic diversity will decline, but how
quickly it will be lost and which variants will persist.
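
The link between population size and the speed of coalescence can be explored with a small simulation. The sketch below is illustrative only and rests on an assumption that is not stated in the text: every copy in a generation picks its parent copy uniformly at random from a fixed number of copies in the previous generation. The function names are hypothetical.

```python
import random


def generations_to_common_ancestor(n_copies, rng):
    """Trace all copies in the current generation backwards until one lineage remains.

    Assumption (not from the text): each copy picks its parent copy uniformly
    at random from the n_copies present in the previous generation.
    """
    lineages = set(range(n_copies))  # ancestral copies still being tracked
    generations = 0
    while len(lineages) > 1:
        # Copies that pick the same parent coalesce into a single lineage.
        lineages = {rng.randrange(n_copies) for _ in lineages}
        generations += 1
    return generations


def mean_generations(n_copies, replicates=200, seed=1):
    """Average the time to a common ancestor over many simulated histories."""
    rng = random.Random(seed)
    total = sum(generations_to_common_ancestor(n_copies, rng) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    # 8 copies corresponds to four diploid individuals, as in the figure.
    for n in (8, 80):
        print(n, "copies:", mean_generations(n), "generations on average")
```

Running the script for 8 and 80 copies should show the smaller population reaching a common ancestor in far fewer generations on average, in line with the general rule stated above.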
Everything I have described so far applies equally to asexual and
sexual populations. In both cases each nucleotide position has an
underlying tree involving coalescence as one looks backwards
through time (and lineage sorting as one looks forward). The fundamental difference
between an asexual and a sexual population is that, in the latter, two different nucleotide
positions can have different (discordant) trees, whereas in asexual populations all
nucleotide positions track the same tree.
To understand this most fundamental consequence of sex, let’s follow the fate of two
nucleotide positions on different chromosomes in the sexual population depicted above.
In this example, let’s trace the position tree for a sample of six positions (in six different
organisms) in the current generation.
The upper and lower panels depict the same population. You can tell this by the fact that
each organism (pair of circles) is derived from the same mother (source of the left hand
circle) and the same father (source of the right hand circle). It will be helpful to work
through this carefully to be sure the commonalities are clear. The only difference
between the two independently assorting nucleotide positions is that in about 50% of the
cases, the source of the chromosome (maternal or paternal) passed on to the next
generation differs. Focusing on the top left individual, it received its mother’s paternal
copy and its father’s maternal copy in the upper case, but the maternal copies from both
parents in the lower case.
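
The "about 50%" figure can be checked with a quick simulation. This is a minimal sketch, assuming that each transmission passes on the parent's maternal or paternal copy with equal probability and that the two positions assort independently; the function names and the number of transmissions are hypothetical.

```python
import random


def transmitted_sources(n_transmissions, rng):
    """For each parent-to-offspring transmission, record which of the parent's
    two copies was passed on: 'maternal' or 'paternal'."""
    return [rng.choice(("maternal", "paternal")) for _ in range(n_transmissions)]


def fraction_differing(n_transmissions=10_000, seed=42):
    """Fraction of transmissions in which two unlinked positions differ in source."""
    rng = random.Random(seed)
    # Same pedigree (same list of transmissions), two independently assorting positions.
    position_1 = transmitted_sources(n_transmissions, rng)
    position_2 = transmitted_sources(n_transmissions, rng)
    differing = sum(a != b for a, b in zip(position_1, position_2))
    return differing / n_transmissions


if __name__ == "__main__":
    print(fraction_differing())  # expected to be close to 0.5
```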
[Figure: upper and lower panels tracing sampled positions A to F through the same
population, each with its position tree to the right]
Now, for each panel you can work out the true position tree for the sampled individuals
by working down (from offspring to parent) from these six tips and seeing the order in
which the lineages coalesce. For example, in the upper panel, B and D coalesce three
generations before the present, A and C coalesce four generations before the present,
whereas the combined B+D and A+C lineages coalesce five generations before present.
This yields a partial tree of the form: ((A,C)(B,D)), as shown to the right of the figure.
Doing the same thing on the lower panel yields a quite different tree.
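
The procedure of reading a tree off the order of coalescences can be sketched in code. The example below uses the three coalescence events given for the upper panel; the function name and the event format (generations before present, plus one tip from each of the two lineages that merge) are hypothetical conventions chosen for illustration.

```python
def tree_from_coalescences(tips, events):
    """Build nested-parenthesis trees from a list of coalescence events.

    Each event is (generations_before_present, tip_in_lineage_1, tip_in_lineage_2).
    Events are applied from the most recent to the oldest.
    """
    label = {tip: tip for tip in tips}      # tip -> label of the lineage holding it
    members = {tip: {tip} for tip in tips}  # tip -> set of tips in its lineage
    for _, a, b in sorted(events):
        if members[a] is members[b]:
            continue  # the two tips already share a lineage
        merged_label = "(" + ",".join(sorted([label[a], label[b]])) + ")"
        merged_members = members[a] | members[b]
        for tip in merged_members:
            label[tip] = merged_label
            members[tip] = merged_members
    # One entry per lineage that has not yet coalesced with the others.
    return sorted(set(label.values()))


# Upper panel: B+D at 3 generations, A+C at 4, and the two pairs at 5.
upper_panel = [(3, "B", "D"), (4, "A", "C"), (5, "A", "B")]
print(tree_from_coalescences("ABCD", upper_panel))  # ['((A,C),(B,D))']
```

Tips E and F are left out because the text gives no coalescence times for them.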
Why does biparental inheritance yield discordant trees? The basic reason is that the
products of meiosis (the defining feature of sexual reproduction) do not receive a fixed
complement of nucleotide positions, but contain a mixture of the paternal and maternal
copies: segregation is independent. The processes that allow for independent segregation
of different parts of the genome are collectively called recombination. It is ultimately
recombination that allows sexual organisms to have nucleotides with discordant histories.
Because recombination is the cause of discordant trees, the frequency of recombination
between two nucleotide positions determines the probability that their underlying trees
will be discordant. At one extreme, meiotically independent nucleotide positions, for
example those situated on different chromosomes, have a 50% chance of cosegregating
during each meiotic event. As a result, they are relatively likely to have discordant
histories. For example, suppose we looked at one homologous nucleotide position in three
individual gametes, call them A, B, and C. In a well-mixed population there is an equal
chance of the three possible trees: A and B sister, B and C sister, or A and C sister. Thus,
if we picked a second nucleotide position on a different chromosome, there is a two-thirds
(about 67%) chance that it will have a discordant tree to the first position.
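
The chance of discordance between two unlinked positions can be worked out by enumeration. The sketch below assumes, as the text does, that the three possible trees are equally likely at each position and that the two positions are independent; the names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# The three possible rooted trees for gametes A, B, and C.
TOPOLOGIES = ("((A,B),C)", "((B,C),A)", "((A,C),B)")


def discordance_probability():
    """Probability that two independent positions have different trees."""
    pairs = list(product(TOPOLOGIES, repeat=2))  # 9 equally likely pairs
    discordant = sum(first != second for first, second in pairs)
    return Fraction(discordant, len(pairs))


print(discordance_probability())  # 2/3
```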
At the other extreme, two nucleotide positions that are physically adjacent will be tightly
linked: if a gamete gets the maternal copy of the first, it will almost certainly get the
maternal copy of the second. Such tightly linked positions will tend to have the same
tree. Between these two extremes the degree of meiotic independence is determined by
how frequently the two nucleotides tend to cosegregate, which, in turn, is determined by
how frequently recombination happens between the two positions.
Imagine you had a tree-o-scope, a magical device that you could “touch” to a single
homologous nucleotide position in a set of individuals and it would show you the true
tree for that position (impossible as such a feat might be!). Suppose that you started at
the end of a chromosome and slid the tree-o-scope along the DNA strand, one nucleotide
position at a time. What would you see? You would see the same tree for blocks of
adjacent nucleotides, but after traveling some distance down the strand the tree would
suddenly change. This would indicate that at least one relevant recombination event had
occurred, an event that caused discordance among the organisms examined by the
tree-o-scope. How far along the chromosome you would have to go would depend on the
recombination rate (and an element of chance).
[Tree-o-scope figure needed]
A collection of adjacent nucleotides that have the same tree is defined as a
recombinational gene. As you slide along a chromosome with the tree-o-scope the
periods of constancy in true tree correspond to recombinational genes, whereas changes
in the true tree correspond to the boundaries between recombinational genes. Along a
given chromosome in a given set of organisms, some recombinational genes will be long,
containing many adjacent nucleotide positions, whereas others will be short.
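
If the tree-o-scope existed, its readout could be cut into recombinational genes mechanically. The following sketch assumes a hypothetical readout: a list giving the true tree at each successive position, written here as a string. It groups runs of identical trees into blocks.

```python
from itertools import groupby


def recombinational_genes(trees_by_position):
    """Group consecutive positions that share a tree into (start, end, tree) blocks."""
    blocks = []
    position = 0
    for tree, run in groupby(trees_by_position):
        length = len(list(run))
        blocks.append((position, position + length - 1, tree))
        position += length
    return blocks


# Hypothetical readout for ten adjacent positions (indexed from 0).
observed = ["((A,B),C)"] * 5 + ["((B,C),A)"] * 2 + ["((A,B),C)"] * 3
for block in recombinational_genes(observed):
    print(block)
```

Note that the first and third blocks share a tree but are still separate recombinational genes, because they are not adjacent.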
A recombinational gene has no necessary relation to a functional gene, a piece of DNA
that encodes a protein (or functional RNA molecule). Several adjacent functional genes
could be in the same recombinational gene, or a single functional gene could be made up
of more than one recombinational gene due to relevant recombination events within the
functional gene. Because recombinational genes are the focus of phylogenetics, it is
common practice to call them just “genes.” I will follow this loose usage for brevity. So,
unless it is clear from the context or the use of a modifier, such as “functional” or
“protein-coding,” you should assume that when I talk about genes I am talking about
recombinational genes.
Suppose we focused on a number of particular versions of a gene (=alleles) in a particular
population. By definition there is a single true tree for all the nucleotide positions in the
gene. This is a gene tree. Below, we will explore the relationship between gene trees and
population trees, but first we need to clarify what a population tree is.
Population trees
A sexual population is a set of organisms living at a certain time that have the potential to
interbreed with one another. It is not necessary, indeed usually impossible, for every
organism in a population to actually mate with every other member of the population. It
is not even necessary that all organisms in the current generation be able to mate in this
generation (for example two males or two females cannot parent an offspring
biologically). Rather, membership in a population is defined by the fact that the
component organisms have some reasonably high probability of sharing a common
descendant at some time in the future.
An unbranching population lineage is composed of a single population at each point in
time. These sequential populations are connected into a population lineage based on the
fact that organisms in one population are ancestors of organisms in later populations and
descendants of organisms in earlier populations of the lineage. As discussed in the
previous section, unbranching population lineages contain divergent gene trees despite
the net-like structure of pedigrees. Furthermore, different parts of the genome have
different gene trees.
The introduction of a major geographic break in a population lineage can divide a
population into two, thereby precipitating lineage splitting. Before the split there is only
one ancestral population. After the split there are two daughter populations, made up of
organisms that are descended from those in the ancestral population. Each of the
daughter populations is the beginning of a new population lineage. A series of successive
lineage splitting events can lead to a multiply branched population tree.
Gene trees in branching population trees
Because gene flow requires the
generation of shared descendants,
populations mark the limits of
potential gene flow. Likewise,
population trees constrain gene
trees: a population tree is like a
system of pipes through which
gene lineages pass. Consider a
simple population tree such as
that shown to the right. An
initial population lineage
splitting event, split 1, resulted in
two lineages, one of which yielded a single living population, A, whereas the other went
through a second split, split 2, to yield two living populations, B and C. Imagine
sampling one allele from each tip. What gene tree would you expect?
The natural prediction is that the gene tree would have the same shape, or topology, as
the population tree. Thus, the alleles from B and C should coalesce closer to the present
than the coalescence of A and B+C. A justification for such an inference might be as
follows. The coalescence of genes from B and C must predate split 2. This follows
because coalescence entails a gene (hence organism) ancestral to both B and C combined
with the fact that organisms after split 2 are ancestral to only one of the two populations.
By similar reasoning, the coalescence of A and either B or C must predate split 1. Because
split 1 is earlier than split 2, the expected gene tree has the same form as the population tree.
While this reasoning is sound, the operative word is “expected.” It is true that a gene tree
that matches the population tree is more likely than any other particular gene tree, but it is
possible to obtain a gene tree that disagrees with the population tree. How can this
happen within the confines of the population “pipes?”
The coalescence of B and C must predate split 2, but it could also predate split 1. If it
does predate split 1, then it is possible for gene lineage B to coalesce with lineage A
before (more recently) it coalesces with lineage C. As a result the true gene tree can be
different from the true population tree. The figure shows a gene tree with A and B sister,
within a population tree that has B and C sister.
[Figure: a gene tree with A and B sister inside a population tree with tips A, B, and C,
split 1, and split 2]
You may notice that the only way to generate a gene tree that contradicts the population
tree is for multiple gene lineages to persist side-by-side in the populations between split 1
and split 2. In this case, in every generation between split 1 and 2, the population
contained at least one organism with an allele that is ancestral to those in B and at least
one organism with an allele that is ancestral to those in C. To see how this is possible we
need to look at the individual organisms and their genetic composition.
Let’s consider a simplified case involving populations of only four individuals forming a
population tree with three tips, A, B, and C. If you trace through the lineages you will
see that regardless of which individual alleles you pick from the three tip populations the
gene tree is the same. In all cases, the alleles in populations A and B are more closely
related to each other than either is to the alleles in C. But this gene tree is at odds with
the population tree!
In this case, each lineage begins with four individuals and, thus, eight alleles. In the three
terminal population lineages only one of these eight lineages persisted – all the others
went extinct due to lineage sorting. However, on the internal branch of the population
tree (between split 1 and split 2), two gene lineages persisted. One of these ultimately
became fixed in population B and the other in population C. This fact means that
coalescence of B and C predated split 1.
Within the ancestral population lineage the three gene lineages could have coalesced in
one of three orders. In this case the first coalescence was between the A and the B
lineage, which occurred four generations before split 1. The coalescence of the A-B
lineage with the C lineage occurred two generations earlier. Thus the true gene tree has
alleles from organisms in A and B being more closely related to each other than to alleles
from organisms in C, even though this is at odds with the population history.
This example shows that when there is incomplete lineage sorting along internal branches
of a population tree, gene lineages will show deep coalescence within ancestral
populations, which can yield gene trees that are at odds with population trees. It is
beyond the scope of this book to go into detail, but it may be obvious that the shorter the
duration of the internal branch, the more likely it is that a gene will show deep
coalescence. Likewise, deep coalescence is more likely when the internal population
lineage includes many individuals, because with more alleles in a population, a genetic
polymorphism is more likely to persist. For more detailed exploration of these patterns a
population genetic text may be helpful.
Despite the potential for discordance of gene trees and the population trees, population
trees still predict the shape of gene trees. Consider the simple case involving three
descendant populations resulting from two successive lineage-splitting events. For those
genes that manifest deep coalescence, such as the one depicted in the figure, there is a
33% chance of obtaining a tree that matches the population tree and a 33% chance of
getting each of the other two possible trees. However, for those genes that undergo
complete lineage sorting in the internal population lineage, there is a 100% chance that
the gene tree matches the population tree. Thus, a gene tree that matches the population
tree may occur in a minority of cases, but it is still expected to be more frequent in the
genome than any particular alternative gene tree.
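
The argument above can be made quantitative with a result from standard neutral coalescent theory, which is not derived in this text: two gene lineages in a diploid population of N individuals fail to coalesce over g generations with probability roughly exp(-g / 2N). The sketch below combines that assumption with the one-third and 100% figures given above; the function and parameter names are hypothetical.

```python
import math


def gene_tree_probabilities(internal_branch_generations, diploid_population_size):
    """Return (P(gene tree matches population tree), P(each alternative tree)).

    Assumption (standard neutral coalescent, not derived in the text): the chance
    that the B and C lineages fail to coalesce on the internal branch is
    exp(-t), with t measured in units of 2N generations.
    """
    t = internal_branch_generations / (2 * diploid_population_size)
    p_deep = math.exp(-t)                # deep coalescence: no merger before split 1
    p_match = (1 - p_deep) + p_deep / 3  # complete sorting, or a lucky 1-in-3
    p_each_alternative = p_deep / 3
    return p_match, p_each_alternative


if __name__ == "__main__":
    # Four diploid individuals per generation, as in the simplified example.
    for generations in (0, 2, 8, 40):
        print(generations, gene_tree_probabilities(generations, 4))
```

Under these assumptions the matching tree is never less probable than either alternative, and the gap widens as the internal branch lengthens or the population shrinks.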
[Figure: population tree with tips A, B, and C, showing split 1, split 2, the A-B
coalescence, and the AB-C coalescence]
The preceding point can be extended to provide a useful tool for communicating the
outcome of more complicated population histories. Each gene in the genome has, by
definition, a single true gene tree. This gene tree can be described as a list of non-trivial
clades. Thus, we can count up the proportion of nucleotide positions in a genome that
have a given clade. The frequency of a clade in the genome is called the clade’s
concordance factor. A concordance factor of 100% applies to a clade that is true of the
entire genome, as might arise due to a lack of sexual reproduction or due to internal
branches that are long enough to ensure complete lineage sorting. A low concordance
factor, in contrast, could result from internal population lineages along which lineage
sorting has been incomplete. A useful way to summarize this information is with a
concordance tree. This is a tree built of clades that are at highest frequency in the
genome, with the concordance factor provided at the branches. For example, the figure
imagines a case where there are only seven gene trees in the genome, each of which has
been tracked by an identical proportion of the genome. The concordance tree is shown in
the lower right.
[Figure: gene trees 1 to 7 for tips A to G, with the concordance tree at lower right; its
branches carry concordance factors of 43%, 57%, 86%, and 100%]
It is worth mentioning that because there are many possible clades that can be found, a
clade can appear on a concordance tree even if it is true of less than half the genome. A
clade can be true of less than half the genome, which is to say its concordance factor can
be less than 50%, yet it could still be true of more of the genome than any conflicting
clade. For example, in the figure, the (BC) clade occurs on three of the seven trees and
thus represents 43% of the genome. This is still significant, however, because the two
conflicting clades, (AB) and (AC), are each true of only two-sevenths (29%) of the
genome. Thus (BC) is a noteworthy group despite being true of less than half the
genome.
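
Concordance factors are simple to compute once each gene tree is reduced to its list of non-trivial clades. The sketch below is a hypothetical stand-in for the figure, not a reconstruction of it: it uses only tips A, B, and C, represents trees as nested tuples, and assumes each of the seven gene trees is tracked by an equal share of the genome.

```python
from collections import Counter
from fractions import Fraction


def leaves(tree):
    """Set of tip names under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def nontrivial_clades(tree):
    """Clades with more than one tip, excluding the clade that holds every tip."""
    all_tips = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        tips = leaves(node)
        if tips != all_tips:
            found.add(tips)
        for child in node:
            visit(child)

    visit(tree)
    return found


def concordance_factors(gene_trees):
    """Fraction of gene trees (equally weighted) that contain each clade."""
    counts = Counter(clade for tree in gene_trees for clade in nontrivial_clades(tree))
    return {clade: Fraction(n, len(gene_trees)) for clade, n in counts.items()}


# Hypothetical genome: three trees with (BC), two with (AB), two with (AC).
gene_trees = [("A", ("B", "C"))] * 3 + [(("A", "B"), "C")] * 2 + [(("A", "C"), "B")] * 2
factors = concordance_factors(gene_trees)
for clade, factor in sorted(factors.items(), key=lambda item: -item[1]):
    print("".join(sorted(clade)), factor, f"{float(factor):.0%}")
```

The (BC) clade comes out at 3/7 (43%), ahead of the two conflicting clades at 2/7 (29%) each, so it is the one that would appear on the concordance tree.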
Population reticulation
The preceding discussion has only considered population trees that are strictly tree-like,
with population lineages that split but never reconnect. In reality, evolution is not so tidy.
Instead, population lineages will sometimes adopt a net-like, or reticulate, pattern.
Three phenomena, in particular, are worth considering: lateral gene transfer (sometimes
called horizontal gene transfer), introgression, and lineage fusion. These three processes
have different implications for the distribution of gene trees in the genome.
Lateral gene transfer (LGT) occurs when a small piece of the genome, typically a single
recombinational gene, is transferred between organisms by a process other than
traditional sexual reproduction. For example, a virus infecting one organism can acquire
a segment of DNA from its host that then becomes integrated into the genome of a
second, perhaps distantly related host. The result of LGT is that some small part of the
genome has a gene tree that is quite distinct from the dominant history that is true for the
bulk of the genome.
[Figure: lateral gene transfer among tips A to E, showing the dominant tree and the
minor tree]
Introgression applies when organisms from distinct population lineages come into contact
and reproduce sexually, resulting in hybrid offspring that then breed with members of one
or the other parental population. If you imagine a snapshot of one of the populations
soon after introgression, some individuals will carry a gene that originated from the other
population and, thus, have a distinctly different history to the genes in the other
organisms. Assuming the populations regain their independence, lineage sorting will
occur in each population lineage. For a majority of genes, the rare gene copies that were
introduced will be lost by gene lineage extinction. However, for some parts of the
genome the “foreign” gene will be the copy that goes to fixation. The gene tree for such
regions will be different from the rest of the genome.
In the case where only a few genes are transferred, most of the genome tracks the
tree-like history of the populations, whereas a minority of the genome tracks the
introgression event. In this hypothetical case lineages C and D underwent a period of
introgression in the recent past. As a result, D is sister to E and C is sister to B for the
majority of the genome, but for a significant subset of genes D is sister to C.
[Figure: introgression between lineages C and D, showing the dominant tree and the
minor tree for tips A to E]
Lineage fusion entails the complete merging of two formerly distinct population lineages
into a single descendant lineage. This phenomenon is sometimes called hybrid
“speciation,” but given ambiguity over the terms “species” and “speciation,” the
relatively neutral term lineage fusion is preferable. Lineage fusion is an extension of
introgression in that the parental lineages cease to exist as independent entities. Lineage
fusion results in a descendant population that initially contains an approximately equal
proportion of genes from each of the parental populations. Thus, unlike more general
introgression, there is no distinction between the dominant and minor history and,
instead, there are multiple codominant histories.
[Figure: lineage fusion producing lineage H, showing codominant tree 1 and codominant
tree 2]
A special case of lineage fusion, particularly important in plants, is allopolyploidy. Here,
the hybridization event yields an individual organism that has the full set of
chromosomes from both parents (the maternal and paternal copies from both parents).
This allopolyploid individual, perhaps with other such individuals, is the basis of a new
population lineage. The details of the different possible fates of the four copies of each
gene are variable and complex. However, the most likely outcome is that there will be
two homologous genes in each genome, one derived from each parental lineage. Thus,
we expect a single history wherein individuals in the allopolyploid lineage will have two
genes each, occupying different places on the gene tree.
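[Figure: population history with allopolyploid lineage H, and the corresponding gene tree
in which H carries two gene copies, H1 and H2]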
Tree thinking given reticulation
Does the existence of reticulation in population trees undermine tree thinking and the
discipline of phylogenetics? The answer is no, for two main reasons.
The first point to make is that while all the phenomena described above do indeed occur,
this does not change the primarily tree-like nature of population histories. Introgression
and lineage fusion are phenomena that require sexual reproduction: mating and the
production of fertile offspring. The capacity to mate successfully is, in most organisms,
influenced by numerous biochemical, physiological, behavioral, and physical traits, and
evolutionary divergence in any of these traits has the potential to prevent successful
reproduction. As a result, successful mating is only possible between individuals that
have reasonably recent common ancestry – if they have been evolving independently too
long they will have lost the capacity to mate successfully. Thus, introgression and
lineage fusion are relatively local affairs, affecting only closely related lineages. While
they might lead to a non-tree-like history within mice, or box turtles, or poplars, or
mosquitoes, these phenomena never bring together more divergent lineages: mice with
hamsters, box turtles with snapping turtles, willows with poplars, mosquitoes with
houseflies.
In the case of LGT, there are no constraints on the phylogenetic “breadth” of reticulation:
a gene can move as readily from a bacterium into a plant as from a crocodile into an
alligator. However, LGT involves rather minute pieces of DNA and, thus, impacts only a
tiny minority of the genome. Only in groups in which LGT is rampant, as may be the
case for some bacteria and archaea, is the tree model truly undermined. For the majority
of organisms, including all multicellular eukaryotes, LGT is not a significant cause of
non-tree-like evolutionary histories.
The second point to emphasize is that even when the population history is reticulate, gene
trees are not. Recall that even within the fully reticulate genealogy of a sexual
population, individual genes have strictly tree-like histories. Thus, tree thinking is valid
and useful even when population lineages do not form trees. Indeed, it is by studying the
discordance among gene trees that biologists have the means to learn about the shape of
population histories, whether they be tree-like or reticulate. Thus, rather than
undermining the phylogenetic paradigm, reticulate evolution provides an additional
reason for understanding trees and how gene trees can agree or disagree.
Monophyly (revisited) and exclusivity
In Chapter 2, I introduced the concept of monophyly as a property of a set of organisms
that comprise all the descendants of a single ancestor. Monophyly, as defined, is not
intrinsically tied to tree-like histories, and applies even within reticulate genealogies. For
example, all the grandchildren of a particular grandparent constitute a monophyletic
group. When the genealogy is reticulate, different monophyletic groups of organisms can
overlap in content. For example, in the pedigree shown, the two monophyletic groups
defined by the grandparents overlap.
For a long time, the concepts of monophyly and exclusivity were confounded, but since
about 1990 these have been distinguished. An exclusive group is a set of
contemporaneous organisms that are more closely related to one another than to any
contemporaneous organisms outside the group. The restriction to contemporaneous
organisms avoids the complication of asking difficult questions such as whether an
organism is more closely related to its parent or its sibling. This definition of exclusivity
forbids the existence of overlapping exclusive groups: exclusive groups form strictly
nested hierarchies.
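The claim that exclusive groups never overlap can be checked mechanically. The following Python sketch is an illustration only: the function name is_nested_hierarchy and the organism labels are hypothetical and do not come from any figure in this chapter. It tests whether every pair of groups in a collection is either disjoint or nested, which is what a strictly nested hierarchy requires.

```python
from itertools import combinations

def is_nested_hierarchy(groups):
    """Return True if every pair of groups is either disjoint or nested."""
    sets = [frozenset(g) for g in groups]
    for a, b in combinations(sets, 2):
        shares_members = bool(a & b)
        one_contains_other = a <= b or b <= a
        if shares_members and not one_contains_other:
            return False  # partial overlap: not a nested hierarchy
    return True

# Hypothetical groups of contemporaneous organisms.
nested_groups = [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}]
# Two groups sharing organism C, like the grandchildren of two grandparents.
overlapping_groups = [{"A", "B", "C"}, {"C", "D"}]

print(is_nested_hierarchy(nested_groups))
print(is_nested_hierarchy(overlapping_groups))
```

The second collection mimics the overlapping monophyletic groups of a reticulate pedigree, so it fails the test that any set of exclusive groups must pass.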
As discussed in Chapter 4, monophyletic groups on strictly divergent trees have the
property of forming non-overlapping sets: a monophyletic set of tips on a divergent tree
comprises an exclusive group. For example, because gene trees are (by definition)
strictly divergent, each clade on a gene tree is an exclusive group of alleles. However, as
shown in the pedigree above, monophyly need not imply exclusivity when relationships
are not tree like. It is therefore necessary to clarify the notion of exclusivity for reticulate
genealogies such as we find within sexual populations. There are two distinct ways that
one can approach the concept of exclusivity, depending on whether one evaluates degree
of relatedness based on the organismic pedigree or on gene tree concordance.
The pedigree approach involves considering ancestral organisms, specifically the last
common ancestors of each pair of contemporaneous organisms. A group is pedigree
exclusive if every pair of organisms within the group shares a common ancestor that lived
more recently than the last common ancestor of any member of the group and any
organism outside the group. The hypothetical genealogy to the right shows the pedigree
exclusive groups in the current (topmost) generation using ovals. The most recent
common ancestors are indicated using black circles. Two larger pedigree exclusive groups
correspond to the two apparent population lineages. Two other groups correspond to full
sib families, whereas an intermediate-size group corresponds to a local, inbred
subpopulation. You can see that the exclusive groups form a nested hierarchy.
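The definition of pedigree exclusivity can be turned into a small test. The sketch below is only one way to do it, and everything in it is an assumption made for illustration: the pedigree is invented (it is not the genealogy in the figure), generations are discrete and numbered so that larger numbers are more recent, and the function names are hypothetical.

```python
from itertools import combinations

def ancestors(pedigree, organism):
    """All ancestors of an organism in a child -> (parent, parent) map."""
    found, stack = set(), list(pedigree.get(organism, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(pedigree.get(parent, ()))
    return found

def mrca_generation(pedigree, generation, x, y):
    """Generation of the most recent common ancestor (-inf if none is recorded)."""
    shared = ancestors(pedigree, x) & ancestors(pedigree, y)
    return max((generation[a] for a in shared), default=float("-inf"))

def is_pedigree_exclusive(pedigree, generation, group, tips):
    """True if every within-group pair coalesces more recently than any
    pair made of one group member and one outsider."""
    group = set(group)
    outside = set(tips) - group
    within = min((mrca_generation(pedigree, generation, x, y)
                  for x, y in combinations(sorted(group), 2)),
                 default=float("inf"))
    across = max((mrca_generation(pedigree, generation, m, o)
                  for m in group for o in outside),
                 default=float("-inf"))
    return within > across

# Invented pedigree: founders G1-G6, parents P1-P4, current generation A-D.
pedigree = {
    "P1": ("G1", "G2"), "P2": ("G3", "G4"),
    "P3": ("G3", "G4"), "P4": ("G5", "G6"),
    "A": ("P1", "P2"), "B": ("P1", "P2"),
    "C": ("P3", "P4"), "D": ("P3", "P4"),
}
generation = {**{f"G{i}": 0 for i in range(1, 7)},
              **{f"P{i}": 1 for i in range(1, 5)},
              **{t: 2 for t in "ABCD"}}
tips = set("ABCD")

print(is_pedigree_exclusive(pedigree, generation, {"A", "B"}, tips))
print(is_pedigree_exclusive(pedigree, generation, {"A", "C"}, tips))
```

In this invented pedigree the full sibs A and B share parents, whereas A and C share only grandparents, so {A, B} passes the test and {A, C} does not.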
The gene tree approach focuses on groups of organisms that contain alleles that form
clades for a plurality of the genome (groups whose concordance factor is higher than that
of any group of overlapping content). Suppose that the ten gene copies in organisms A-E
form a clade for 20% of the genome (i.e., organisms A-E have a concordance factor of
20%). A-E is a concordance exclusive group if there is no set of organisms that overlaps
in content (includes some of the organisms A-E and at least one organism that is not in
A-E) and has a concordance factor of 20% or above.
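To make the bookkeeping concrete, here is a minimal Python sketch of concordance factors and concordance exclusivity. It rests on several assumptions that are mine rather than part of the definition above: gene trees are written as nested tuples, the three trees and their genome percentages (40, 35, 25) are invented, the function names are hypothetical, and "overlapping" is read as partial overlap, so that a group nested inside another does not count against it.

```python
def clades_of(tree):
    """Return (tips, clades) for a rooted tree written as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips, found = frozenset(), set()
    for child in tree:
        child_tips, child_clades = clades_of(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)  # the tips below this internal node form a clade
    return tips, found

def concordance_factor(group, genome):
    """Percent of the genome whose gene tree contains `group` as a clade."""
    group = frozenset(group)
    return sum(pct for pct, tree in genome if group in clades_of(tree)[1])

def is_concordance_exclusive(group, genome):
    """True if no partially overlapping group has an equal or higher factor."""
    group = frozenset(group)
    cf = concordance_factor(group, genome)
    if cf == 0:
        return False
    observed = set().union(*(clades_of(tree)[1] for _, tree in genome))
    for other in observed:
        conflicts = bool(other & group) and not (other <= group or group <= other)
        if conflicts and concordance_factor(other, genome) >= cf:
            return False
    return True

# Invented data: (percent of genome, gene tree) for organisms A-F.
genome = [
    (40, (((("A", "B"), ("C", "D")), "E"), "F")),
    (35, ((("A", "B"), ("C", "D")), ("E", "F"))),
    (25, (("A", "B"), ((("C", "D"), "E"), "F"))),
]

print(concordance_factor(set("ABCDE"), genome))
print(is_concordance_exclusive(set("ABCDE"), genome))
print(is_concordance_exclusive(set("EF"), genome))
```

With these invented numbers, A-E forms a clade for 40% of the genome and every conflicting group (E-F at 35%, C-F at 25%) scores lower, so A-E is concordance exclusive while E-F is not.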
Because genes are passed from parents to offspring, we generally expect pedigree
exclusive groups to also be concordance exclusive, and vice versa. However, while the
two concepts of exclusivity are expected to match most of the time, there is no guarantee
that a particular pedigree exclusive group will be concordance exclusive. Chance and
biased patterns of gene segregation (e.g., due to selection) can result in differences
between these two criteria of exclusivity. Nonetheless, because cases of disagreement are
likely to be rare, we can usually get away with a loose concept of exclusivity in which we
do not specify whether we are referring to pedigree or concordance exclusivity.
It was stated in Chapter 4 that biological classification systems strive to represent
evolutionary relationships and that this is usually achieved by only formally naming
monophyletic taxa. Here we may now refine this point. In order for classifications to
reflect evolutionary history, taxa should be exclusive groups of organisms. In cases
where relationships are tree like, as is usual when one is working above the rank of
species (i.e., when species or larger taxa are tips), monophyletic groups are exclusive.
Monophyly thus serves as a convenient proxy for exclusivity, but it is exclusivity that we
really care about.
The concept of species
Since the adoption of a phylogenetic outlook, evolutionary biology has struggled to
clarify the relationship of “species” to different kinds of genealogies – organismic
pedigrees, population trees, and gene trees. There have been numerous ideas proposed
over the years, and a great deal of controversy. The core of the problem is that there is no
single definition of species that achieves all that has been historically expected of species.
In the ideal world we could devise a definition of species such that they are the entities
that participate in evolution (predicting the future) and are also the products of evolution
(reflecting evolutionary history). However, in fact, different popular definitions of
species succeed either in making species players in evolution, or products of evolution, or
neither – but not both. Here I briefly summarize some of the alternative approaches to
thinking about species and how, under each, species relate to trees. I end by suggesting
that if your aim is for all taxa, including species, to be entities that are natural pieces of
the tree of life, then you should view species as just one among the many nested
exclusive groups that comprise the tree of life.
To simplify the discussion I will assume that we are trying to assign living organisms to
species. This is a simplification because living organisms are contemporaneous, so one
does not have to worry too much about what to do with organisms that are ancestral to
two or more, very different descendants (include them with both descendants, with one,
or with neither?). The species concepts that we will discuss are therefore time-limited or
synchronic. They ask the question: What properties justify grouping these
contemporaneous organisms into the same species? Many biologists feel that species
ought to be diachronic entities: things that are born, persist through time, and die, just
like individual organisms. However, in the same way that population lineages are made
up of successive synchronic populations (see above), so too can species lineages be seen
as being composed of successive synchronic species. So even if we define species
synchronically, they can still be seen to exist diachronically.
The first way of thinking about species is as groups of organisms sharing a particular
trait. I will call species concepts in this vein, phenetic species concepts. This is a broad
umbrella. Individual species definitions within this umbrella differ in whether the
attribute uniting members of a species is a single trait or a set of traits, or whether
certain kinds of features, for example ecological specialization, are given primacy.
[Figure: population tree with sp. 1, sp. 2, and sp. 3 labeled]
The trait-based view of species predates Darwin and, thus, phenetic species concepts are
not necessarily evolutionary. More importantly, even those phenetic species concepts
that have been justified based on evolution fail to align tidily with phylogenetic history.
The reason is that organisms can share similar traits, not because they are closely related,
but because they have retained the same ancestral character state. For example, the figure
depicts a population tree with key
traits mapped on. If we include in the same species all organisms that have the same state
of this trait then three species are recognized – species 2 and 3, which each have derived
character states, and species 1 which is composed of lineages that have retained the
ancestral state. Species 1, in this example, does not correspond to an exclusive group:
members of species 1 may be more closely related to members of species 2 or 3 than to
some other members of its own species. This phenomenon will apply regardless of
whether the trait in question is morphological, behavioral, physiological, molecular,
ecological, or geographical.
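The argument can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration, not a reconstruction of the figure: the population tree, the tip names, and the character states are all invented, and the function names are hypothetical. It groups populations by character state and asks whether each resulting "species" is a clade on the tree, which on a strictly divergent tree is equivalent to being exclusive.

```python
from collections import defaultdict

def clades_of(tree):
    """Return (tips, clades) for a rooted tree written as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips, found = frozenset(), set()
    for child in tree:
        child_tips, child_clades = clades_of(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found

def phenetic_species(trait_states):
    """Group populations that share the same character state."""
    groups = defaultdict(set)
    for population, state in trait_states.items():
        groups[state].add(population)
    return dict(groups)

# Invented population tree: p1 and p2 retain the ancestral state, while the
# q and r lineages have each evolved their own derived state.
tree = (("p1", ("q1", "q2")), ("p2", ("r1", "r2")))
trait_states = {
    "p1": "ancestral", "p2": "ancestral",
    "q1": "derived-2", "q2": "derived-2",
    "r1": "derived-3", "r2": "derived-3",
}

_, clades = clades_of(tree)
species = phenetic_species(trait_states)
# Single-population groups are not handled here; every group has two members.
is_exclusive = {state: frozenset(members) in clades
                for state, members in species.items()}
print(is_exclusive)
```

In this toy case the two groups defined by derived states are clades, but the group defined by the retained ancestral state is not: p1 is more closely related to q1 and q2 than to p2.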
[Figure: tree with branch lengths drawn proportional to rate of evolution; sp. 1, sp. 2, and sp. 3 labeled]
The same reasoning holds if we use overall similarity of organisms as our phenetic
criterion, but where the rate of evolution differs among lineages (as certainly happens).
The figure shows a tree in which the length of branches is drawn proportional to the actual
rate of evolution occurring on that lineage. Whereas most of the tree has the same slow
rate of evolution, two terminal lineages have undergone much more rapid evolution
(approximately ten times as fast). The organisms within species 1, 2, and 3 are all rather
similar to each other and rather different from
members of the other two species. However, in the case of species 1 the high degree of
similarity within the species reflects the shared lack of change rather than evolutionary
kinship. Some members of species 1 are more closely related to species 2 (or 3) than to
other members of species 1. So, regardless of whether the phenetic concept is based on
individual traits or overall similarity, it fails to yield a concept of species that aligns with
evolutionary history.
A phenetic species concept that warrants special attention is the biological species
concept. This defines a species as a set of populations that have the potential to
interbreed and which share reproductive features that inhibit them from interbreeding
with members of other biological species.
The biological species concept is similar to the concept of a population (defined earlier in
this chapter) in that both are restricted to sexually reproducing organisms. However, they
differ in that a population is defined based on the potential for sharing common
descendants, whereas the biological species concept emphasizes the existence of actual
differences between organisms that serve as intrinsic barriers to reproduction. Two
populations that are situated geographically in such a way that they will never share
common descendants would, nonetheless, be placed in the same biological species unless
they have already accumulated differences in their reproductive systems.
The biological species concept remains the most widely cited conception of species in
textbooks, despite having only spotty support among practicing systematists. The
problem with the concept arises from the fact that biological species are defined based on
a trait, intrinsic reproductive compatibility. Thus the biological concept runs into the
same problem as other phenetic concepts: not mapping onto the tree of evolutionary
relationships. Within populations we expect organisms to share the same reproductive
system and to be compatible. However, population lineages can acquire novel traits that
do not alter their capacity to breed within the population, but inhibit reproduction with
members of other populations. When these changes occur, descendants of the population
lineage in question will cease being able to interbreed with members of all those
populations that lacked this change – the lineage will become a new species. The problem is that
these other populations would be judged conspecific, because they still can interbreed,
but they need not comprise a clade of populations. So, as with other phenetic concepts,
the biological species concept implies that some members of a species can be more
closely related to members of other species.
The alternative strategy to using traits as a basis for defining species, is to think about
species as groups of organisms that are united by a shared history. This is the strategy of
genealogical species concepts, which view species as exclusive groups (under one or
more of the conceptions of exclusivity discussed above) that are assigned the rank of
species.
The genealogical view of species is the most compatible with a phylogenetic outlook in
that species are composed of related organisms. Genealogical species are entities just like
taxa at other ranks (families, genera, subspecies) in that they are exclusive groups of
organisms that form strictly nested hierarchies. Thus, as with other taxa (see Collapsing
and Pruning in chapter 2), an organism sampled from a genealogical species will occupy
the same place on the tree of life as all other organisms from the same species.
Furthermore, under a genealogical concept of species, one can collapse all individuals
from a genealogical species into a single terminal. Thus, for the remainder of this book,
the term “species” should be understood to refer to a genealogical species, unless
otherwise specified.
Although the genealogical species concept is the best approach to species for the
purposes of phylogenetics, it does not fulfill all of the desirable properties of a species
concept. Most notably the ranking of genealogical species is not strictly objective. The
criterion used to group organisms into species, exclusivity, is the same criterion that is
used to delimit taxa at other ranks. If organisms are grouped into subspecies or genera
using the same criteria that are used to assign organisms to species, then the species rank
is no longer special. Although there are a number of principles that can be used to guide
the recognition of a taxon at the species rank (e.g., species are understood to show
relatively little morphological or ecological variation and usually have a homogeneous
reproductive system, especially in a single locality), the distinction between a species and
a genus cannot be tied down absolutely.
The species rank does not reliably communicate any biological property apart from
exclusivity, making the species rank semi-arbitrary. Nonetheless, the species rank is a
useful nomenclatural device for ensuring that taxa of interest to ecologists, conservation
biologists, land managers, etc. have stable names. An analogy is a ZIP or postal code that
is assigned to addresses. These codes identify an area within which a particular residence
or business is located. The areas defined by postcodes vary in area and the number of
buildings included, although the range of variation is not extreme. This is similar to the
semi-arbitrary rank of species. Individual species differ in the number of organisms they
include, their geographical range, their degree of morphological distinctiveness, and the
time since the last common ancestor of all members of the species. Also, in the same way
that postcodes serve as a convenient way to indicate the approximate locality of a
building in space, so does a species name indicate an approximate position of an
organism on the tree of life. Thus, so long as it is clear which exclusive taxa a species
name refers to, the species rank will continue to play an important role in biological
communication into the foreseeable future.
It is important to remember, however, that whatever you think the term “species” means,
the actual groups of organisms that are assigned species names may not meet this
criterion. Even if you believe that species ought to be understood as being exclusive
groups, this should not be mistaken for the view that actual named species, such as Canis
familiaris, Xenopus laevis, or Magnolia grandiflora, necessarily correspond to exclusive
groups of organisms. No doubt there are many cases where the understood
content of a species is exclusive. For example, evidence suggests that Homo sapiens fits
this criterion. However, there are multiple reasons why many species names refer to
groups that are not exclusive taxa.
The first point to bear in mind is that the phylogenetic perspective I have described here
has not (yet?) been adopted by a majority of taxonomists who are involved in the process
of species recognition and delimitation. In most cases, taxonomists are too busy to worry
about the exact definition of species and instead recognize species whenever they find
groups of organisms with distinctive morphological features, diagnostic traits, that
differentiate them from other known species. This practical approach is not motivated by
a search for exclusive groups, but nonetheless will often find them. Most obviously,
when diagnostic traits are evolutionarily derived, species are likely to be exclusive
groups.
The second consideration is that even if we all agree that species names should refer to
exclusive taxa, we would still make mistakes. Naming a species (or any other taxon) is
the same thing as proposing the hypothesis that the group of organisms is exclusive. For
the vast majority of named species the amount of evidence available to assess exclusivity
is minimal. Often, all we have to go on is data from a handful of museum or herbarium
specimens with scant geographic information, in which case we should be cautious about
assuming exclusivity. And, even in cases with abundant data, the hypothesis of
exclusivity will often be in doubt. Thus, it is advisable to take a scientifically skeptical
view of all named taxa, demanding that evidence be provided to support exclusivity
beyond the mere fact that the taxon has been recognized by tradition.
Major points
Genes (in the recombinational sense) have strictly tree like histories, which may not
always match the topology of the underlying population tree. The concordance or
discordance of gene trees is indicative of whether populations are sexual or asexual and
whether populations have undergone reticulate evolution. Even when the history is
complex, certain groups of living organisms are exclusive, being more closely related to
each other than to organisms outside the group. These exclusive groups are the taxa that
form the hierarchical structure of biological classification. While there are differing views
out there, a phylogenetic perspective would suggest that species are just those exclusive
taxa that have been assigned the rank of species.
Learning objectives
• Understand the way that genetic information is transmitted in unbranching
populations
o Be able to define positional homology
o Be able to explain why a single nucleotide position has a strictly tree like
history
o Be able to predict the tree for any nucleotide position in an asexual species
given an organismal pedigree
o Be able to defend the claim that all genes presently in a population must
coalesce to a single ancestral gene sometime in the past
o Be able to explain why different parts of the genome can have different
histories in sexual organisms
o Be able to distinguish a recombinational gene from a functional gene
o Be able to define a gene tree
o Be able to determine a gene tree given a sexual pedigree and information on
the parental source of each gene copy
• Understand the structure of gene trees within branching population trees
o Be able to define a population tree
o Be able to predict the most likely gene tree given a population tree
o Be able to explain how a gene tree can differ from the enclosing population
tree
o Be able to determine a gene tree given a sexual pedigree for a branching
population and information on the parental source of each gene copy
• Understand the causes and consequences of reticulation and gene-to-gene discordance
o Be able to predict the distribution of gene tree topology given lateral gene
transfer, introgression, and hybrid speciation
o Be able to calculate the concordance factor of a clade given information on
the distribution of gene histories
o Be able to define exclusivity and distinguish pedigree exclusivity from
concordance exclusivity
• Understand that the definition of species is complicated by discordance
o Be able to list some of the attributes traditionally associated with “species”
o Be able to give examples of species concepts that unite organisms based on
shared traits at a point in time
o Be able to explain why species concepts that use traits will not always be
exclusive groups
o Be able to list the advantages and disadvantages of defining species as
exclusive groups
o Be able to distinguish clades and lineages
o Be able to define the internodal or a related lineage-based species concept
o Be able to explain why lineage-based concepts need criteria for marking
lineage splitting
o Be able to argue for (or against) the position that a “species tree” is best
equated with a population tree
Sample Problems
The following questions refer to the pedigree shown to the right. Each diploid individual
is represented by a pair of circles, representing the two alleles of the gene under
consideration.
[Figure: pedigree; allele labels a1, a2, b1, b2, c1, c2, d1, d2, e1, e2, f1, f2, g1, g2, h1, h2]
1) What features of the pedigree tell you that time flows upwards in this diagram
(ancestors are below descendants)?
2) Is allele b1 more closely related to b2 or d2?
3) What is the gene tree for alleles a2, c1, and f1?
4) What is the gene tree for alleles a2, b2, c1, c2?
5) All of the following alleles coalesce within the series of generations shown,
except one. Which is the exception? a1; a2; b1; d1; d2; e1?
6) How many alleles from the lowermost generation have at least one descendant in
the uppermost generation?
7) Tree a is true for 99.9% of the genome and tree b for 0.1% of the genome. What
is the most plausible explanation?
[Figure: trees a and b]
a) Introgression between A and F
b) Introgression between B and an ancestor of A and F
c) F is a hybrid between A and G
d) Lateral gene transfer from A to F
e) Lateral gene transfer from F to A
8) Howarth and Baum (2005) estimated the phylogenetic relationships of six species
of Hawaiian shrubs in the genus Scaevola. They inferred that about half the genes had
tree a and about half had tree b. What is the most likely explanation?
a) Introgression between S. procera and S. mollis
b) Introgression between S. procera and S. gaudichaudii
c) S. procera is a hybrid between S. mollis and S. gaudichaudii
d) There was lateral gene transfer from S. procera into an extinct ancestor of
S. gaudichaudiana and S. chamissoniana
e) There was lateral gene transfer from S. mollis into S. procera.
9) For a set of 8 organisms, 33% of the genome has tracked each of trees a, b, and
c and 1% has tracked tree d.
[Figure: trees a, b, c, and d]
a. What is the concordance factor of clade (EFG)?
b. What is the concordance factor of clade (EFGH)?
c. What is the concordance factor of clade (FH)?
10) What is the gene tree for A, B and C given this genealogy?
[Figure: genealogy with tips A, B, and C]