Phylogenetics Topic 1: An overview
Introduction
“The affinities of all beings of the same class have sometimes been represented by a great tree. I
believe this simile largely speaks the truth. The green budding twigs may represent existing species;
and those produced during former years may represent the long succession of extinct species...and
this connection of the former and present buds by ramifying branches may well represent the
classification of all extinct and living species in groups subordinate to groups.” Charles Darwin, in
Chapter IV of On the origin of species by means of natural selection, or the preservation of
favoured races in the struggle for life.
A fundamental concept of the theory of evolution, independently developed by Charles Robert Darwin and
Alfred Russel Wallace and published jointly in a letter of 1858, is that species share a common origin and have
subsequently diverged through time. Interestingly, both men came to use the simile of a great tree to illustrate
this notion of descent with modification, and ever since biologists have been using tree-like diagrams to describe
the pattern and timing of events that gave rise to the earth’s biodiversity. The branching pattern of the tree
represents the splitting of biological lineages, and the lengths of the branches can be used to signify the age of
those events. Today, biologists call these tree-like diagrams phylogenies.
Unrooted tree diagram drawn in the
margin of one of Charles Darwin’s
notebooks
Phylogenetic tree used in The Origin of Species. Darwin wasn’t just thinking about
classification based on phylogenies. He used them to visualize the process of
divergence within species and the splitting of populations into separate species. Darwin
used this figure to illustrate divergence of variants within species; over time successively
more variation accumulates. Eventually some of this variation forms the basis for new
species.
The biological discipline dedicated to reconstructing organismal phylogenies is called phylogenetics.
Parallel advances in a number of fields led to a tremendous growth in phylogenetics over the last 40 years.
First, beginning in the 1960s, sophisticated techniques were developed and refined for the purpose of
reconstructing phylogenies from the actual features, or characters, of organisms. Second, phylogenetics grew
beyond its traditional application to classification of living organisms. Recognition that phylogenies can provide
an evolutionary framework for studying a wide variety of problems led to their application in almost every other
subdiscipline of biology. Third, rapid increases in computing power meant that programs
implementing phylogeny reconstruction algorithms could accommodate very large amounts of data. Lastly, the
revolution in molecular biotechnology opened up a vast new source of characters to phylogenetic analysis.
Before discussing the wide-ranging applications of phylogenies, it is necessary to define some essential
terminology. An imaginary species phylogeny is presented in figure 1a as a guide. The lines of the phylogeny,
called branches, represent species, and the bifurcation points, called nodes, represent speciation events. The
tips of the terminal branches are present-day species, and each node represents a species that is the common
ancestor of all its descendants, or daughter species. For example, in figure 1a the species at node B is the most
recent common ancestor of present-day species 1, 2, and 3, and is not an ancestor of species 4 or 5.
Furthermore, the group composed of ancestor B and all its descendants (species 1, 2, 3, and A) is called a
clade, or a monophyletic group. Smaller clades comprise A and all its descendants, and D and all its
descendants. It must be noted that phylogenetics is not restricted to just species. Phylogenetic methods can be
used to depict kinship of individuals within a local group or population, relationships among populations or
subspecies, relationships among taxonomic lineages above species (e.g., supraspecific categories such as
genera, families, etc.), relationships among genes within populations, or relationships among different genes
within a gene family.
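The terminology above can be made concrete with a small sketch. The topology below is an assumption made for illustration (the exact shape of figure 1a is not recoverable here): C is the root, A joins species 1 and 2, B joins A and species 3, and D joins species 4 and 5. The helper names `ancestors`, `mrca`, and `clade` are hypothetical.

```python
# Assumed topology for the five-species example: C is the root,
# A = (1, 2), B = (A, 3), D = (4, 5). Each entry maps a node to its parent.
PARENT = {
    "1": "A", "2": "A", "A": "B", "3": "B",
    "4": "D", "5": "D", "B": "C", "D": "C",
}

def ancestors(node):
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(tips):
    """Most recent common ancestor of a collection of tips."""
    tips = list(tips)
    shared = set(ancestors(tips[0]))
    for tip in tips[1:]:
        shared &= set(ancestors(tip))
    # Walking upward from one tip, the first shared node is the most recent.
    return next(n for n in ancestors(tips[0]) if n in shared)

def clade(node):
    """All present-day species (tips) descended from a node."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    return set().union(*(clade(c) for c in children))
```

Under this assumed topology, `mrca(["1", "2", "3"])` returns node B, and `clade("B")` contains species 1, 2, and 3 but not 4 or 5, matching the description in the text.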
Figure 1
The phylogeny in figure 1a (above) is rooted at node C,
allowing us to infer which ancestral species gave rise to which
present-day species. Without a root, a phylogeny looks very
different; compare figure 1a with 1b, which differ only in the
placement of a root. The importance of placing a root on a
phylogeny should now be clear; without a root biologists cannot
distinguish between what is ANCESTRAL and what is DERIVED
(descendant). We will return to the concept of a root in topic 3
[methods].
Rooted phylogenies allow biologists to distinguish
similar characteristics due to common descent (HOMOLOGY) from
similar characteristics due to convergence from different
ancestors (ANALOGY) (see figure 2 to right). However, most
methods of phylogenetic inference produce unrooted trees, and
the location of the root also must be inferred.
Rooted phylogenies allow biologists to infer CHARACTER POLARITY:
the evolutionary relationship between two or more
states for a given character. Say we have a character with two
states, “a” and “b”. By mapping them on a phylogeny we can
determine that “b” preceded “a” in evolutionary history; hence “a”
is the derived state and “b” is the primitive state.
Figure 2
In the former examples, branch lengths were not intended to convey any information (figures 1a and 1b).
The phylogeny in figure 1c illustrates how branch lengths can show how much change has occurred along a
branch. In the case of molecular characters, if the rate of evolution is constant over time (the so-called
molecular clock), the branches will show the relative divergence times of the lineages. For example, figure 1c
indicates that the divergence of species 1 and 2 was much more recent than divergence of species 4 and 5.
Moreover, if the divergence dates of some points in the phylogeny are known from the fossil record (calibration
points), and the characters are evolving in a clock-like fashion, the phylogeny can be used to predict
divergences absent from the fossil record.
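The calibration logic can be sketched with a little arithmetic. This is a minimal illustration assuming a strict molecular clock; the distances and the calibration age are hypothetical numbers, not values from the dataset discussed here, and the function names are invented for the example.

```python
def clock_rate(distance, age):
    """Rate of change per unit time, estimated from a calibration point.

    `distance` is the amount of change separating two lineages whose split
    is dated to `age`. The factor 2 accounts for change accumulating along
    both branches since the split.
    """
    return distance / (2.0 * age)

def divergence_time(distance, rate):
    """Age of a split, given the distance between lineages and a clock rate."""
    return distance / (2.0 * rate)

# Hypothetical calibration: two lineages differ by 0.20 substitutions per
# site, and fossils date their split to 50 million years ago.
rate = clock_rate(distance=0.20, age=50.0)

# Hypothetical undated pair differing by 0.08 substitutions per site.
estimate = divergence_time(distance=0.08, rate=rate)  # about 20 million years
```

The estimate is only as good as the clock assumption; if rates vary among lineages, the same arithmetic gives misleading dates.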
Below is an example of a real dataset (COII and cyt b gene sequences of selected mammals) where the
branch lengths have been estimated once by assuming clock-like molecular evolution and again without such an
assumption.
Figure: Phylogeny of nine mammalian genera (Felis, Canis, Ursus, Bos, Hippopotamus, Physeter, Balaenoptera, Rhinoceros, Equus), drawn twice from the same root. Scale bar: 0.1.
First panel: branch lengths estimated without assumption of the molecular clock. Tips are NOT contemporary; the distance from root to each tip is NOT the same.
Second panel: branch lengths estimated under the assumption of the molecular clock. Tips are contemporary; the distance from root to each tip is the same.
The phylogenetic comparative method
Evolutionary biologists use the comparative method to discover common evolutionary patterns, and to
understand the causes of those patterns. The key to this approach is discovering correlated patterns of
evolution between different characters of organisms, or between characters of organisms and aspects of the
environment that they inhabit. Most comparative studies attempt to address the adaptive significance of
biological variation, although many patterns ultimately require non-adaptive explanations.
Since Darwin’s time, the comparative method has remained one of the most important analytical tools of
evolutionary biologists. However, comparative biology has recently undergone a major transformation; the
realization that the characteristics of species could be correlated due to shared ancestry, taken alongside the
major developments in the field of phylogenetics, meant that evolutionary biologists had to examine comparative
trends together with phylogenetic relatedness.
What is the problem? Standard statistical methods for assessing the correlation treat the data drawn
from different species as independent. Because species are hierarchically related by the phylogeny they cannot
be treated as if drawn independently from the same distribution. Let’s consider a hypothetical example. Consider
a phenotype (say, the size of a primate’s big toe; Y) and an ecological variable (say, the frequency of things that
a big toe can be stubbed into; X). Suppose you have gone to great trouble to collect measurements for size of
big toe and the “stubbiness” of the habitat, and you are interested in the significance of any relationship of Y on
X. So, you plot your data and you find what appears to be a significant correlation.
Figure: Hypothetical dataset for phenotype (Y) and ecological variable (X).
Now consider at some point in early history that two species diverged for toe-size and colonized two different
habitats. At that point in time there are only two points that lie on a straight line, but the correlation cannot be
significant; there are, after all, only two points and the regression has zero degrees of freedom.
Figure: Two-point dataset from early in evolutionary history (Y plotted against X).
Now consider that some evolutionary time has passed and each of these two species gives rise to 100 descendant
species. By this accident of history, all the descendants in one clade will have a larger toe and tend to be in one
habitat type, and the descendants of the other species will have a smaller toe and tend to be in the other habitat
type. If our sample of data came from these two clades, we would have effectively sampled only two species.
Figure: Phylogeny of two groups of close relatives, the “big-toe clade” and the “little-toe clade”. An old divergence of “big-toed” and “little-toed” primates is followed by recent diversifications within each clade.
If we code our data to indicate the clade of origin (below) we see that the correlation is an illusion
generated by two clusters with different mean values.
Figure: Hypothetical dataset (Y plotted against X) with points coloured according to clade of origin: “little-toed” clade and “big-toed” clade.
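The illusion can be reproduced with a short simulation. This is a sketch under stated assumptions: the clade means, the noise level, and the random seed are all hypothetical, and within each clade X and Y are generated independently, so any pooled correlation comes only from the difference between clade means.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_clade(rng, mean_x, mean_y, n=100, sd=1.0):
    """Draw n species whose X and Y vary independently around clade means."""
    xs = [rng.gauss(mean_x, sd) for _ in range(n)]
    ys = [rng.gauss(mean_y, sd) for _ in range(n)]
    return xs, ys

rng = random.Random(1)  # fixed seed so the sketch is repeatable
little_x, little_y = simulate_clade(rng, 2.0, 2.0)  # "little-toed" clade
big_x, big_y = simulate_clade(rng, 8.0, 8.0)        # "big-toed" clade

r_pooled = pearson(little_x + big_x, little_y + big_y)  # strong
r_little = pearson(little_x, little_y)                  # near zero
r_big = pearson(big_x, big_y)                           # near zero
```

Pooling the 200 species gives a strong correlation even though there is no relationship between X and Y inside either clade, which is the pattern the coloured plot reveals.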
One way to analyze these data is to use a method called FELSENSTEIN’S INDEPENDENT CONTRASTS. The
phylogeny is divided into pairs of sister lineages, and the difference in trait values between the members of each
pair (a contrast) is independent of the other contrasts. A Brownian motion model is used to estimate the variance
of each contrast from the branch lengths separating the pair, so that the standardized contrasts can be considered
drawn from a normal distribution with a mean of zero. An alternative approach is to use ANCESTRAL CHARACTER
STATE RECONSTRUCTION, a statistical method of inferring the most likely character state for each
ancestral node of a phylogeny. These ancestral reconstructions are then used to infer and count the number of
times that a trait of interest has evolved on a phylogeny. Both approaches take a particular topology as given,
and additional steps must be employed to take into account the error associated with a particular estimate of a
phylogeny.
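A minimal sketch of the independent-contrasts calculation for a single trait follows. It assumes a fully bifurcating tree with known branch lengths and Brownian motion evolution; the nested-tuple tree format, the function name `contrasts`, and the trait values are hypothetical choices made for illustration.

```python
import math

def contrasts(node):
    """Return (value, branch_length, contrasts) for a subtree.

    A tip is (name, trait_value, branch_length).
    An internal node is (left_subtree, right_subtree, branch_length).
    """
    if isinstance(node[0], str):  # tip
        _, value, length = node
        return value, length, []
    left, right, length = node
    x1, v1, c1 = contrasts(left)
    x2, v2, c2 = contrasts(right)
    # Standardized contrast: the difference between sister lineages divided
    # by its expected standard deviation under Brownian motion.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral estimate: average weighted by the inverse branch lengths.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch below this node to reflect the uncertainty in
    # the ancestral estimate.
    length = length + v1 * v2 / (v1 + v2)
    return value, length, c1 + c2 + [contrast]

# Hypothetical three-species tree: ((sp1, sp2), sp3).
tree = ((("sp1", 4.0, 1.0), ("sp2", 6.0, 1.0), 2.0), ("sp3", 10.0, 3.0), 0.0)
root_value, _, trait_contrasts = contrasts(tree)
```

A tree with n tips yields n - 1 contrasts. To test a relationship between two traits, contrasts would be computed for each trait on the same tree and then compared, typically with a regression forced through the origin.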
Joseph Felsenstein, in the paper that laid the foundation for the modern transformation of comparative
biology (Felsenstein. 1985. Am Nat. 125:1-15.), wrote “phylogenies are fundamental to comparative biology;
there is no doing it without taking them into account”. Phylogenetically related species will be more similar in
both phenotype and lifestyle than distantly related species, and modern comparative methods must attempt to
distinguish between similarities due to similar adaptive pressures and similarities due to descent from common
ancestors.
APPLICATIONS OF PHYLOGENETICS
Phylogenies can have practical value in almost every branch of biology, a fact that has become widely
recognized only in the last decade. This expansion, however, makes it impossible to review all the applications
of phylogenies; instead, some examples are presented that include both classic and novel applications.
1. Systematics, classification, and taxonomy. Perhaps the most traditional application of phylogenetics is
classification and systematics. Biological classifications are systems that organize the diversity of life, and
systematics is the study of that diversity relative to some kind of specified relationship. Biologists generally
agree that classification and systematics of species and supraspecific taxa should reflect the natural
organization of biological diversity. The discipline devoted to producing a classification that portrays the
evolutionary relationships of species and supraspecific lineages is called phylogenetic systematics. Narrowly
defined, phylogenetic systematics has two basic components: (i) phylogenetic inference and (ii) production of a
hierarchical classification system that exactly reflects the phylogenetic relationships. However, this definition has
been broadened by some biologists to include many aspects of comparative evolutionary biology.
Figure: ERNST HAECKEL’S “TREE OF LIFE”, DRAWN SOMETIME IN THE LATE 1800s. Labels on the tree: Non-mammalian vertebrates, Invertebrates, Protozoa.
Haeckel placed Menschen (“Men”) at the “top” of the tree among the Affen (“Apes”). He was the first to suggest that man’s ancestry was among the Great Apes. This tree was a tree of “men”, and Haeckel’s placement of Menschen at the top was intentional.
This tree and its associated system of classification differ from modern ones in that they are based on the notion of linear progress (like a ladder) from the most primitive single-celled organisms “upwards” to man (at the very top). Haeckel considered the things near the top as “more evolved” and things near the bottom as “primitive”.
Ernst Haeckel (1834-1919) was a German biologist and scientific illustrator. He was one of the first popularizers of Darwin’s Theory of Evolution. The tree is from his book “General Morphology – founded on the descent theory”.
If a classification system is to be phylogenetic, the naming of species and supraspecific taxa (taxonomy)
must reflect their phylogenetic relationships. For this reason, named taxa must comprise MONOPHYLETIC
GROUPS; i.e., a named taxon must represent a group descended from a single ancestral species, and all
descendants of that ancestor must be included in the named taxon. A monophyletic group is also called a
CLADE. This means that if a named taxon includes the common ancestor and only some of its descendants
(PARAPHYLY), or does not include the most recent common ancestor (POLYPHYLY), it is not acceptable in a
phylogenetic classification.
Figure: Monophyly, paraphyly and polyphyly, shown on two copies of a tree with nodes labelled A through J. First panel: a monophyletic group [clade]. Second panel: a paraphyletic group (AHJGFDE) and a polyphyletic group (BC).
Take the traditional class Reptilia as an example. The traditional Reptilia included the crocodylomorphs
(alligators and crocodiles), the lepidosauromorphs (lizards, snakes, and relatives) and the anapsids (turtles and
relatives).
Phylogenetic analyses, however, indicated that the common ancestor of reptiles also was the
ancestor of birds and mammals, which had been placed in different classes. Therefore, the traditional taxonomic
grouping called Reptilia was paraphyletic. Practitioners of phylogenetic systematics point out that by using the
traditional classification one neglects to recognize a phylogenetic relationship between birds and
crocodylomorphs, and between mammals and extinct synapsid reptiles.
Figure: The old Reptilia as an example of classification based on a paraphyletic group.
Tip labels: Aves (birds); Ornithischia (some plant eating dinosaurs); Crocodylomorph (gators and crocs); Lepidosauromorph (lizards, snakes, etc.); Anapsids (turtles and relatives); Mammals (Synapsids).
Annotations: Old Reptilia is a GRADE; Amniota is a clade; lots of dinosaur diversity; diversity of extinct mammal-like reptiles.
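The monophyly requirement can be checked mechanically once a tree is written down. The sketch below assumes a deliberately simplified amniote topology chosen for illustration (it is not a claim about the true placement of any group), with the tree stored as nested tuples; the function names are hypothetical.

```python
def tips(tree):
    """Set of tip names in a nested-tuple tree (tips are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def clades(tree):
    """Yield the tip set below every node of the tree, tips included."""
    yield tips(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def is_monophyletic(tree, group):
    """True if `group` is exactly the set of tips below some node."""
    return frozenset(group) in set(clades(tree))

# Simplified, assumed topology for illustration only.
amniotes = ("mammals", ("turtles", ("lizards_snakes", ("crocodiles", "birds"))))

old_reptilia = {"turtles", "lizards_snakes", "crocodiles"}
```

With this topology, `is_monophyletic(amniotes, old_reptilia)` is false because birds descend from the same ancestor but are left out, whereas crocodiles plus birds, or all five groups together, pass the check. The test identifies non-monophyly only; separating paraphyly from polyphyly requires asking whether the group's common ancestor is itself a member of the group.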
The ultimate goal of phylogenetic systematics is a phylogenetic history of all life on earth, the proverbial
Tree of Life. A multiauthored internet project is dedicated to achieving this goal. Individual parts of the Tree of
Life are authored by biologists around the world, each working on a specific group of organisms, and are
published electronically on the World Wide Web. When completed, it will provide a phylogenetic history for all
life on earth, a unified taxonomy, and a means of searching and retrieving information about the characteristics
of organisms. You can check the progress of this project by visiting the Tree of Life website
(http://phylogeny.arizona.edu/tree/phylogeny.html).
2. Biogeography. Biogeography is the study of the distribution of biological diversity in space and time. The
subdiscipline devoted to understanding the underlying historical factors that have influenced biogeographic
diversity is called historical biogeography. By considering the relationships of taxa, their geographic
distributions, and the geological history of the regions they occupy, biogeographers can sometimes infer the
historical importance of dispersals and geographic isolation, and make inferences about modes of speciation.
The methods of historical biogeography also can be applied to uncover geographic patterns of genetic variation
within species (a pursuit called phylogeography). Phylogeographers use molecular data to infer an intraspecific
gene phylogeny that is then mapped onto the geographic distribution of the species.
Figure labels: EAST: high elevation and wet; WEST: low elevation and dry.
Phylogeography allows one to test hypotheses such as whether geographic/environmental factors have been historically important barriers to gene flow.
Phylogeographic analysis of mouse lemurs contradicts the expected east-west disjunction for Madagascar, and suggests a
completely novel north-south disjunction. The observed phylogenetic tree was inferred from mitochondrial DNA gene
sequences.
Figure adapted from separate figures in A. D. Yoder (2004) In press
3. Health sciences. With recent advances in DNA sequencing technology, phylogenetic analysis of genes has
developed into an important tool for tracking the evolution and spread of infectious diseases. Epidemiological
questions that can be addressed by phylogenetic analysis of DNA sequences include: (i) what was the origin of
an emerging disease; (ii) was there a single origin, or has a disease entered a population in different locations or
at different times; (iii) how was the infectious disease spread; (iv) what was the source of a particular
transmission event (see slides); (v) how does the disease organism evolve resistance to its host; (vi) how does
the host immune system evolve resistance to the disease; and (vii) are there species closely-related to the
known pathogens that might be able to cause disease in humans?
The case of HIV (human immunodeficiency virus) illustrates the utility of phylogenetics in epidemiology.
Phylogenetic analysis indicated that HIV consists of two main types (HIV-1 and HIV-2) and numerous subtypes.
Furthermore, it showed that HIV-1 and HIV-2 entered the human population from different sources, as HIV-1 is
more closely related to chimpanzee SIVs (simian immunodeficiency virus), and HIV-2 is more closely related to
mangabey monkey SIVs. Because different subtypes within HIV-1 are related to different lineages of
chimpanzee SIV, and different subtypes of HIV-2 are related to different lineages of mangabey SIV, it seems
likely that both HIV-1 and HIV-2 jumped from other primates to humans multiple times. Different subtypes also are
prevalent in different human populations or geographic regions, indicating that HIV spread through the human
population through different routes and at different times. These phylogenetic analyses illustrate that differences
between humans and other primates provide only a weak barrier to transmission of this virus, suggesting the
disturbing possibility that new subtypes could enter the human population in the future.
4. Agriculture. Applications of phylogenetics to agriculture are similar to epidemiology, but the questions are
about the origin and spread of pest species rather than infectious diseases. Agricultural questions include: (i)
what was the origin of a pest; (ii) how did the pest spread through agriculture; (iii) how did some pest organisms
evolve resistance to pesticides; and (iv) are there species closely-related to known pests that might also cause
agricultural problems?
Fusarium graminearum is a fungal pathogen of commercially important species of
grains. Phylogenetic analysis indicates substantial genetic divergence among
strains in different agricultural settings.
Phylogenetic tree inferred from the combined sequences of six single-copy nuclear genes (7,120 bp) by using the method of maximum parsimony. Numbers above the nodes are bootstrap proportions.
Genetic divergence among strains of Fusarium indicates that movement of crops among different agricultural settings must be carefully monitored to prevent introduction of “foreign strains”. Local crops are likely to be much less resistant to the “foreign” strains of Fusarium, as compared with the local strain.
Figure adapted from O’Donnell et al. (2000) PNAS, 97:7905-7910.
5. Conservation. Tragically, while biologists work to assess and study the diversity of life, the activities of man
are causing a loss of biodiversity at a rate unmatched in evolutionary history. Conservation biology is the
discipline dedicated to preserving biodiversity. Phylogenetic systematics and taxonomy play a fundamental role
in this effort, for how can we conserve biological diversity if we do not have a natural system to organize and
study it? However, there also are more direct applications of phylogenetics, including: (i) identifying
genetically distinct breeding populations that require separate protection and management; (ii) assessing kinship of
individuals to populations so that appropriate breeding stock can be identified for captive breeding programs; (iii)
assessing kinship of dead or captive individuals for the purpose of conservation law enforcement, i.e., molecular
forensics; and (iv) guiding the collection and organization of long-term storage of germ-plasm in seed banks. Note
that when working with evolutionary divergences below the species level, the discipline of phylogenetics is
broadly overlapped by the discipline of population genetics, where sophisticated methods based on gene
genealogies are widely used.
The phylogeography of mouse lemurs presented above also illustrates how the phylogenetic framework
has important applications to conservation biology. Before the phylogeographic study of the mouse lemurs, the
important environmental barrier to migration was perceived to be elevation and wetness of the habitat,
suggesting that important conservation decisions might be made independently for an east-west disjunction; that
notion could not have been more incorrect. It seems that the primary disjunction should be north-south;
although the situation is in reality much more complicated than that.
The comparative method has recently become a popular approach to examining risks of extinction and
invasiveness. Excerpts from a recent review of both the powers and pitfalls of this method in conservation
biology are presented in the figure below.
This article highlights three uses of the comparative method in conservation:
(i) developing predictive models for risk assessment
(ii) identifying the general ecological principles that cause conservation problems
(iii) identifying and using endangering traits as triage to prioritize research and conservation efforts
Potential pitfalls are:
(i) large and expensive sample sizes required for high power of the method
(ii) problems with using correlation-based methods to identify causal mechanisms
Despite the limitations, it seems that the comparative method will grow to be one of many essential tools for conservation research. A hypothetical example from this paper illustrates how applying Fisher’s exact test to the raw data (ignoring phylogenetic non-independence) overestimates the relationship between extinction risk and body size.
Figure: Should we use a Fisher exact test?
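The effect of ignoring non-independence can be illustrated with a small calculation. The counts below are hypothetical and are not taken from the review; the sketch compares a species-level table with a table in which each independent lineage is counted once. The function `fisher_exact_two_sided` is a plain implementation written for this example, not a library call.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical species-level counts, rows = large / small body size,
# columns = threatened / not threatened.
p_species = fisher_exact_two_sided(8, 2, 2, 8)

# The same hypothetical pattern after counting each independent lineage once.
p_lineages = fisher_exact_two_sided(2, 1, 1, 2)
```

With these made-up numbers the species-level test gives p of roughly 0.02, while the lineage-level table gives no evidence of an association at all. Treating closely related species as independent observations inflates the apparent sample size and therefore the apparent significance.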
6. Linguistics. An interesting application of phylogenetic methods is to the discipline of linguistics. In particular,
maximum likelihood methods have been applied to infer phylogenies of language groups, to estimate the dates of
the most recent common ancestors of those groups, to identify parts of the language tree with low
support, and to test specific hypotheses about the process of language evolution.
A particularly interesting example is the study by Gray and Atkinson (2003) where they use phylogenetic
methods to test two theories for the origin of the Indo-European language group: (1) this language group spread
into Europe with Kurgan horsemen around 6,000 years before present (BP) [Kurgan theory]; and (2) this language group spread into
Europe with the expansion of agriculture 8,000-9,500 years BP [Anatolian theory]. The phylogenetic analysis and
dating of the origin of the Indo-European languages by Gray and Atkinson (2003) were in striking agreement with
the Anatolian farming theory (see figure below); their estimate was 7,800-9,800 years BP. Interestingly, this result is
consistent with a recent genetic study of human populations that supports a Near-Eastern Neolithic contribution
to the European gene pool.
Language phylogeny and divergence dates support the Anatolian-origin theory of the Indo-European language family.
Data: Cognate word forms were sampled from 87 languages. Three extinct languages thought to be more distantly related than the extant languages were included for the purpose of rooting the tree. Cognates were coded as present or absent (1 or 0) for each language. The final dataset was a binary matrix of 2,449 cognates.
Methods: Phylogenetic analysis was conducted under a stochastic model of binary character evolution that allowed for unequal character state frequencies and heterogeneous rates of evolution among cognates. Bayesian methods were used to infer the tree topology shown in the figure. Values above each branch (in black) are Bayesian posterior probabilities. Divergence times were estimated by first assuming maximum and minimum divergence dates for 11 “calibration nodes” on the phylogeny. A semi-parametric likelihood-based method was used to infer the divergence dates for the nodes of the phylogeny.
Figure labels: Root; Estimated date of ancestral node; Extinct languages used as outgroups.
Gray and Atkinson (2003) Nature 426:435-439