Phylogenetics Topic 1: An overview

Introduction

“The affinities of all beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species ... and this connection of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups.”
Charles Darwin, in Chapter IV of On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.

A fundamental concept of the theory of evolution, independently developed by Charles Robert Darwin and Alfred Russel Wallace and published jointly in 1858, is that species share a common origin and have subsequently diverged through time. Interestingly, both men came to use the simile of a great tree to illustrate this notion of descent with modification, and ever since, biologists have used tree-like diagrams to describe the pattern and timing of the events that gave rise to the earth’s biodiversity. The branching pattern of the tree represents the splitting of biological lineages, and the lengths of the branches can be used to signify the age of those events. Today, biologists call these tree-like diagrams phylogenies.

Figure: Unrooted tree diagram drawn in the margin of one of Charles Darwin’s notebooks.

Figure: Phylogenetic tree used in The Origin of Species. Darwin used this figure to illustrate divergence of variants within species; over time, successively more variation accumulates. Eventually some of this variation forms the basis for new species.

Darwin wasn’t just thinking about classification based on phylogenies. He used them to visualize the process of divergence within species and the splitting of populations into separate species.
The biological discipline dedicated to reconstructing organismal phylogenies is called phylogenetics. Parallel advances in a number of fields led to tremendous growth in phylogenetics over the last 40 years. First, beginning in the 1960s, sophisticated techniques were developed and refined for reconstructing phylogenies from the actual features, or characters, of organisms. Second, phylogenetics grew beyond its traditional application to the classification of living organisms. Recognition that phylogenies provide an evolutionary framework for studying a wide variety of problems led to their application in almost every other subdiscipline of biology. Third, rapid increases in computing power meant that programs implementing phylogeny reconstruction algorithms could accommodate very large amounts of data. Lastly, the revolution in molecular biotechnology opened up a vast new source of characters for phylogenetic analysis.

Before discussing the wide-ranging applications of phylogenies, it is necessary to define some essential terminology. An imaginary species phylogeny is presented in figure 1a as a guide. The lines of the phylogeny, called branches, represent species, and the bifurcation points, called nodes, represent speciation events. The tips of the terminal branches are present-day species, and each node represents a species that is the common ancestor of all its descendants, or daughter species. For example, in figure 1a the species at node B is the most recent common ancestor of present-day species 1, 2, and 3, and is not an ancestor of species 4 or 5. Furthermore, the group composed of ancestor B and all its descendants (species 1, 2, 3, and A) is called a clade, or a monophyletic group. Smaller clades are composed of A and all its descendants, and of D and all its descendants. Note that phylogenetics is not restricted to species.
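The terminology above can be made concrete with a small sketch. Figure 1a is not reproduced here, so the topology below is an assumption chosen only to agree with the text (root C; B ancestral to species 1, 2, and 3; D ancestral to species 4 and 5). The names `PARENT`, `ancestors`, `mrca`, and `clade` are hypothetical helpers, not part of any phylogenetics package.

```python
# Assumed topology consistent with the description of figure 1a:
# root C splits into B and D; B splits into A and species 3;
# A splits into species 1 and 2; D splits into species 4 and 5.
PARENT = {
    "B": "C", "D": "C",
    "A": "B", "sp3": "B",
    "sp1": "A", "sp2": "A",
    "sp4": "D", "sp5": "D",
}

def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(tips):
    """Most recent common ancestor: first node shared by every tip's path to the root."""
    paths = [ancestors(t) for t in tips]
    shared = set(paths[0]).intersection(*paths[1:])
    return next(n for n in paths[0] if n in shared)

def clade(node):
    """The node together with all of its descendants (a monophyletic group)."""
    members = {node}
    for child, parent in PARENT.items():
        if parent == node:
            members |= clade(child)
    return members

print(mrca(["sp1", "sp2", "sp3"]))   # B
print(sorted(clade("B")))            # ['A', 'B', 'sp1', 'sp2', 'sp3']
print("B" in ancestors("sp4"))       # False: B is not an ancestor of species 4
```

Storing only child-to-parent links keeps the sketch short; real tree libraries typically store children as well so that clades can be listed without scanning every node.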
Phylogenetic methods can be used to depict kinship of individuals within a local group or population, relationships among populations or subspecies, relationships among taxonomic lineages above species (e.g., supraspecific categories such as genera, families, etc.), relationships among genes within populations, or relationships among different genes within a gene family.

The phylogeny in figure 1a is rooted at node C, allowing us to infer which ancestral species gave rise to which present-day species. Without a root, a phylogeny looks very different; compare figure 1a with figure 1b, which differ only by the placement of a root. The importance of placing a root on a phylogeny should now be clear: without a root, biologists cannot distinguish between what is ANCESTRAL and what is DERIVED (descendant). We will return to the concept of a root in topic 3 [methods]. However, most methods of phylogenetic inference produce unrooted trees, and the location of the root must also be inferred.

Rooted phylogenies allow biologists to distinguish similar characteristics due to common descent (HOMOLOGY) from similar characteristics due to convergence from different ancestors (ANALOGY) (see figure 2).

Rooted phylogenies also allow biologists to infer CHARACTER POLARITY: the evolutionary relationship between two or more states of a given character. Say we have a character with two states, “a” and “b”. By mapping them onto a rooted phylogeny we can determine, for example, that “b” preceded “a” in evolutionary history; hence “a” is the derived state and “b” is the primitive state.

In the preceding examples (figures 1a and 1b), branch lengths were not intended to convey any information. The phylogeny in figure 1c illustrates how branch lengths can show how much change has occurred along a branch. In the case of molecular characters, if the rate of evolution is constant over time (the so-called molecular clock), the branches will show the relative divergence times of the lineages.
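One consequence of a constant rate is that, on a rooted tree of present-day species, every tip lies at the same distance from the root. The sketch below checks that property. The two trees, their branch lengths, and the helper names `root_to_tip` and `is_clock_like` are illustrative assumptions, not data from the figures.

```python
# child -> (parent, branch length); all lengths are made-up illustration values.
CLOCK_TREE = {
    "A": ("root", 3.0), "sp1": ("A", 1.0), "sp2": ("A", 1.0),
    "D": ("root", 1.0), "sp4": ("D", 3.0), "sp5": ("D", 3.0),
}
# Same topology, but the branch leading to sp2 has accumulated more change.
NONCLOCK_TREE = dict(CLOCK_TREE, sp2=("A", 2.5))

def root_to_tip(tree):
    """Sum the branch lengths from each tip up to the root."""
    internal = {parent for parent, _ in tree.values()}
    distances = {}
    for tip in (n for n in tree if n not in internal):
        node, total = tip, 0.0
        while node in tree:
            node, length = tree[node][0], tree[node][1]
            total += length
        distances[tip] = total
    return distances

def is_clock_like(tree, tol=1e-9):
    """True when all root-to-tip distances agree (within a tolerance)."""
    d = list(root_to_tip(tree).values())
    return max(d) - min(d) <= tol

print(root_to_tip(CLOCK_TREE))       # every tip is 4.0 from the root
print(is_clock_like(CLOCK_TREE))     # True
print(is_clock_like(NONCLOCK_TREE))  # False
```

Equal root-to-tip distances are what a clock predicts, but the check says nothing about where the root belongs; it assumes the root has already been placed.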
For example, figure 1c indicates that the divergence of species 1 and 2 was much more recent than the divergence of species 4 and 5. Moreover, if the divergence dates of some points in the phylogeny are known from the fossil record (calibration points), and the characters are evolving in a clock-like fashion, the phylogeny can be used to estimate the dates of divergences absent from the fossil record. Below is an example of a real dataset (COII and cyt b gene sequences of selected mammals) where the branch lengths have been estimated once by assuming clock-like molecular evolution and again without such an assumption.

Figure: Two rooted phylogenies of the same mammals (Felis, Canis, Ursus, Bos, Hippopotamus, Physeter, Balaenoptera, Rhinoceros, Equus), each with a scale bar of 0.1. Panel 1: branch lengths estimated under the assumption of the molecular clock. Tips are contemporary; the distance from the root to each tip is the same. Panel 2: branch lengths estimated without the assumption of the molecular clock. Tips are NOT contemporary; the distance from the root to each tip is NOT the same.

The phylogenetic comparative method

Evolutionary biologists use the comparative method to discover common evolutionary patterns, and to understand the causes of those patterns. The key to this approach is discovering correlated patterns of evolution between different characters of organisms, or between characters of organisms and aspects of the environment that they inhabit. Most comparative studies attempt to address the adaptive significance of biological variation, although many patterns ultimately require non-adaptive explanations. Since Darwin’s time, the comparative method has remained one of the most important analytical tools of evolutionary biologists.
However, comparative biology has recently undergone a major transformation: the realization that the characteristics of species could be correlated due to shared ancestry, taken alongside the major developments in the field of phylogenetics, meant that evolutionary biologists had to examine comparative trends together with phylogenetic relatedness.

What is the problem? Standard statistical methods for assessing a correlation treat the data drawn from different species as independent. Because species are hierarchically related by the phylogeny, they cannot be treated as if they were drawn independently from the same distribution.

Let’s consider a hypothetical example. Consider a phenotype (say, the size of a primate’s big toe; Y) and an ecological variable (say, the frequency of things that a big toe can be stubbed into; X). Suppose you have gone to great trouble to collect measurements of big-toe size and of the “stubbiness” of the habitat, and you are interested in the significance of any relationship between Y and X. So, you plot your data and you find what appears to be a significant correlation.

Figure: Hypothetical dataset for phenotype (Y) and ecological variable (X).

Now consider that at some point in early history two species diverged in toe size and colonized two different habitats. At that point in time there are only two data points. They necessarily lie on a straight line, but the correlation cannot be significant; there are, after all, only two points, and the regression has zero degrees of freedom.

Figure: Two-point dataset from early in evolutionary history (Y against X).

Now consider that some evolutionary time has passed and each of these two species has given rise to 100 descendant species. By this accident of history, all the descendants in one clade will have a larger toe and tend to be in one habitat type, and the descendants of the other species will have a smaller toe and tend to be in the other habitat type.
If our sample of data came from these two clades, we would have effectively sampled only two species.

Figure: Phylogeny of two groups of close relatives, the “Big-toe clade” and the “Little-toe clade”, showing the old divergence of “big-toed” and “little-toed” primates followed by recent diversifications within each clade.

If we code our data to indicate the clade of origin (below), we see that the correlation is an illusion generated by two clusters with different mean values.

Figure: Hypothetical dataset (Y against X) with points coloured according to clade of origin (“Little-toed” clade and “Big-toed” clade).

One way to analyze these data is to use a method called FELSENSTEIN’S INDEPENDENT CONTRASTS. The phylogeny is decomposed into a set of independent comparisons (contrasts) between pairs of sister lineages. Under a Brownian motion model of character evolution, the expected variance of each contrast is proportional to the branch lengths separating the two lineages being compared, so each contrast can be standardized by those branch lengths. The standardized contrasts can then be treated as independent draws from a normal distribution with a mean of zero.

An alternative approach is to use ANCESTRAL CHARACTER STATE RECONSTRUCTION, a statistical method of inferring the most likely character state at a site for each ancestral node of a phylogeny. These ancestral reconstructions are then used to infer and count the number of times that a trait of interest has evolved on a phylogeny.

Both approaches take a particular topology as given, and additional steps must be employed to take into account the error associated with a particular estimate of a phylogeny.

Joseph Felsenstein, in the paper that laid the foundation for the modern transformation of comparative biology (Felsenstein 1985, Am. Nat. 125:1-15), wrote “phylogenies are fundamental to comparative biology; there is no doing it without taking them into account”. Closely related species will be more similar in both phenotype and lifestyle than distantly related species, and modern comparative methods must attempt to distinguish between similarities due to similar adaptive pressures and similarities due to descent from common ancestors.
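The independent contrasts calculation can be sketched on a toy version of the big-toe example. Everything here is an illustrative assumption: the four-tip tree, the unit branch lengths, the trait values, and the function names. The arithmetic follows the usual textbook recipe: the difference between sister lineages is divided by the square root of the summed branch lengths, the ancestral value is a weighted average of the two lineages, and the ancestor's own branch is lengthened to reflect the uncertainty of that average.

```python
from math import sqrt

# A tip is a name; an internal node is (left, left_length, right, right_length).
BIG = ("big1", 1.0, "big2", 1.0)
SMALL = ("small1", 1.0, "small2", 1.0)
TREE = (BIG, 1.0, SMALL, 1.0)

# Made-up trait values: X is habitat "stubbiness", Y is toe size.
X = {"big1": 6.0, "big2": 5.0, "small1": 2.0, "small2": 1.0}
Y = {"big1": 5.0, "big2": 6.0, "small1": 1.0, "small2": 2.0}

def _walk(node, trait):
    """Return (estimated value, extra branch length, contrasts) for a subtree."""
    if isinstance(node, str):
        return trait[node], 0.0, []
    left, v_l, right, v_r = node
    x_l, extra_l, c_l = _walk(left, trait)
    x_r, extra_r, c_r = _walk(right, trait)
    v_l += extra_l                      # lengthen branches below estimated nodes
    v_r += extra_r
    contrast = (x_l - x_r) / sqrt(v_l + v_r)          # standardized contrast
    x_node = (x_l / v_l + x_r / v_r) / (1 / v_l + 1 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return x_node, extra, c_l + c_r + [contrast]

def contrasts(tree, trait):
    """Standardized contrasts, ordered: within-left, within-right, ..., root."""
    return _walk(tree, trait)[2]

cx, cy = contrasts(TREE, X), contrasts(TREE, Y)
for name, dx, dy in zip(["big clade", "small clade", "root"], cx, cy):
    print(f"{name:12s} dX={dx:+.3f} dY={dy:+.3f}")
```

In this toy dataset the two within-clade contrasts have opposite signs for X and Y, and only the single contrast at the root has matching signs. Four species therefore yield three contrasts, of which just one reflects the old split between the clades.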
APPLICATIONS OF PHYLOGENETICS

Phylogenies can have practical value in almost every branch of biology, a fact that has become widely recognized only in the last decade. This expansion, however, makes it impossible to review all the applications of phylogenies; instead, some examples are presented that include both classic and novel applications.

1. Systematics, classification, and taxonomy.

Perhaps the most traditional application of phylogenetics is classification and systematics. Biological classifications are systems that organize the diversity of life, and systematics is the study of that diversity relative to some kind of specified relationship. Biologists generally agree that classification and systematics of species and supraspecific taxa should reflect the natural organization of biological diversity. The discipline devoted to producing a classification that portrays the evolutionary relationships of species and supraspecific lineages is called phylogenetic systematics. Narrowly defined, phylogenetic systematics has two basic components: (i) phylogenetic inference and (ii) production of a hierarchical classification system that exactly reflects the phylogenetic relationships. However, this definition has been broadened by some biologists to include many aspects of comparative evolutionary biology.

Figure: Ernst Haeckel’s “Tree of Life”, drawn sometime in the late 1800s. Labelled regions of the tree, from the base upwards, include Protozoa, Invertebrates, and Non-mammalian vertebrates. The tree places Menschen (“Men”) at the “top”, among the Affen (“Apes”). Haeckel was among the first to suggest that man’s ancestry was among the Great Apes. This tree was a tree of “men”, and Haeckel’s placement of Menschen at the top was intentional.

This tree and its associated system of classification differ from modern ones in that they are based on the notion of linear progress (like a ladder) from the most primitive single-celled organisms “upwards” to man (at the very top).
Haeckel considered the things near the top as “more evolved” and things near the bottom as “primitive”. Ernst Haeckel (1834-1919) was a German biologist and scientific illustrator. He was one of the first popularizers of Darwin’s theory of evolution. The tree shown is from his book “General Morphology: founded on the descent theory”.

If a classification system is to be phylogenetic, the naming of species and supraspecific taxa (taxonomy) must reflect their phylogenetic relationships. For this reason, named taxa must comprise MONOPHYLETIC GROUPS; i.e., a named taxon must represent a group descended from a single ancestral species, and all descendants of that ancestor must be included in the named taxon. A monophyletic group is also called a CLADE. This means that if a named taxon includes the common ancestor and only some of its descendants (PARAPHYLY), or does not include the most recent common ancestor of its members (POLYPHYLY), it is not acceptable in a phylogenetic classification.

Figure: Monophyly, paraphyly and polyphyly, illustrated on a tree with tips A to E and internal nodes F, G, H, and J. One panel shows a monophyletic group [clade]; the other shows a paraphyletic group (AHJGFDE) and a polyphyletic group (BC).

Take the traditional class Reptilia as an example. The traditional Reptilia included the crocodylomorphs (alligators and crocodiles), the lepidosauromorphs (lizards, snakes, and relatives) and the anapsids (turtles and relatives). Phylogenetic analyses, however, indicated that the common ancestor of reptiles was also the ancestor of birds and mammals, which had been placed in different classes. Therefore, the traditional taxonomic grouping called Reptilia was paraphyletic. Practitioners of phylogenetic systematics point out that the traditional classification fails to recognize the phylogenetic relationship between birds and crocodylomorphs, and between mammals and extinct synapsid reptiles.

Figure: The old Reptilia as an example of classification based on a paraphyletic group.
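The monophyly criterion can be expressed as a short test: a set of tips is monophyletic when the smallest clade that contains all of them contains nothing else. The sketch below applies it to a deliberately simplified amniote tree. The topology, the tip names, and the helper functions are assumptions made for illustration, and extinct lineages are left out.

```python
# Simplified, assumed topology written as nested tuples (extinct groups omitted).
AMNIOTA = ("mammals", ("turtles", ("lizards_snakes", ("crocodilians", "birds"))))

def tips(node):
    """All tip names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))

def smallest_clade(node, group):
    """Tips of the smallest clade that contains every member of group."""
    if not isinstance(node, str):
        for child in node:
            if group <= tips(child):
                return smallest_clade(child, group)
    return tips(node)

def is_monophyletic(tree, group):
    """True when the group is exactly the tip set of some clade."""
    group = set(group)
    return smallest_clade(tree, group) == group

old_reptilia = {"turtles", "lizards_snakes", "crocodilians"}
print(is_monophyletic(AMNIOTA, old_reptilia))               # False: birds are left out
print(is_monophyletic(AMNIOTA, old_reptilia | {"birds"}))   # True
print(is_monophyletic(AMNIOTA, tips(AMNIOTA)))              # True: Amniota is a clade
```

The test only reports whether a group is monophyletic. Distinguishing paraphyly from polyphyly additionally requires knowing whether the group's common ancestor was meant to be included in it.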
Figure labels: Aves (birds); lots of dinosaur diversity; Ornithischia (some plant-eating dinosaurs); Crocodylomorph (gators and crocs); Lepidosauromorph (lizards, snakes, etc.); Anapsids (turtles and relatives); diversity of extinct mammal-like reptiles; Mammals (Synapsids). Annotations: Old Reptilia is a GRADE; Amniota is a clade.

The ultimate goal of phylogenetic systematics is a phylogenetic history of all life on earth, the proverbial Tree of Life. A multiauthored internet project is dedicated to achieving this goal. Individual parts of the Tree of Life are authored by biologists around the world, each working on a specific group of organisms, and are published electronically on the World Wide Web. When completed, it will provide a phylogenetic history for all life on earth, a unified taxonomy, and a means of searching and retrieving information about the characteristics of organisms. You can check the progress of this project by visiting the Tree of Life website (http://phylogeny.arizona.edu/tree/phylogeny.html).

2. Biogeography.

Biogeography is the study of the distribution of biological diversity in space and time. The subdiscipline devoted to understanding the underlying historical factors that have influenced biogeographic diversity is called historical biogeography. By considering the relationships of taxa, their geographic distributions, and the geological history of the regions they occupy, biogeographers can sometimes infer the historical importance of dispersals and geographic isolation, and make inferences about modes of speciation. The methods of historical biogeography can also be applied to uncover geographic patterns of genetic variation within species (a pursuit called phylogeography). Phylogeographers use molecular data to infer an intraspecific gene phylogeny that is then mapped onto the geographic distribution of the species.
Figure: Phylogeography of mouse lemurs in Madagascar. EAST: high elevation and wet. WEST: low elevation and dry. Phylogeography allows one to test hypotheses such as whether geographic/environmental factors have been historically important barriers to gene flow. Phylogeographic analysis of mouse lemurs contradicts the expected east-west disjunction for Madagascar, and suggests a completely novel north-south disjunction. The observed phylogenetic tree was inferred from mitochondrial DNA gene sequences. Figure adapted from separate figures in A. D. Yoder (2004), in press.

3. Health sciences.

With recent advances in DNA sequencing technology, phylogenetic analysis of genes has developed into an important tool for tracking the evolution and spread of infectious diseases. Epidemiological questions that can be addressed by phylogenetic analysis of DNA sequences include: (i) what was the origin of an emerging disease; (ii) was there a single origin, or has a disease entered a population in different locations or at different times; (iii) how was the infectious disease spread; (iv) what was the source of a particular transmission event (see slides); (v) how does the disease organism evolve resistance to its host; (vi) how does the host immune system evolve resistance to the disease; and (vii) are there species closely related to the known pathogens that might be able to cause disease in humans?

The case of HIV (human immunodeficiency virus) illustrates the utility of phylogenetics in epidemiology. Phylogenetic analysis indicated that HIV consists of two main types (HIV-1 and HIV-2) and numerous subtypes. Furthermore, it showed that HIV-1 and HIV-2 entered the human population from different sources, as HIV-1 is more closely related to chimpanzee SIVs (simian immunodeficiency viruses), and HIV-2 is more closely related to mangabey monkey SIVs.
Because different subtypes within HIV-1 are related to different lineages of chimpanzee SIV, and different subtypes of HIV-2 are related to different lineages of mangabey SIV, it seems likely that both HIV-1 and HIV-2 jumped from primates to humans multiple times. Different subtypes are also prevalent in different human populations or geographic regions, indicating that HIV spread through the human population by different routes and at different times. These phylogenetic analyses illustrate that differences between humans and other primates provide only a weak barrier to transmission of this virus, suggesting the disturbing possibility that new subtypes could enter the human population in the future.

4. Agriculture.

Applications of phylogenetics to agriculture are similar to those in epidemiology, but the questions are about the origin and spread of pest species rather than infectious diseases. Agricultural questions include: (i) what was the origin of a pest; (ii) how did the pest spread through agriculture; (iii) how did some pest organisms evolve resistance to pesticides; and (iv) are there species closely related to known pests that might also cause agricultural problems?

Fusarium graminearum is a fungal pathogen of commercially important species of grains. Phylogenetic analysis indicates substantial genetic divergence among strains in different agricultural settings.

Figure: Phylogenetic tree inferred from the combined sequences of six single-copy nuclear genes (7,120 bp) using maximum parsimony. Numbers above the nodes are bootstrap proportions. Genetic divergence among strains of Fusarium indicates that movement of crops among different agricultural settings must be carefully monitored to prevent introduction of “foreign” strains. Local crops are likely to be much less resistant to “foreign” strains of Fusarium than to the local strain. Figure adapted from O’Donnell et al. (2000) PNAS, 97:7905-7910.

5. Conservation.
Tragically, while biologists work to assess and study the diversity of life, human activities are causing a loss of biodiversity at a rate unmatched in evolutionary history. Conservation biology is the discipline dedicated to preserving biodiversity. Phylogenetic systematics and taxonomy play a fundamental role in this effort, for how can we conserve biological diversity if we do not have a natural system to organize and study it? However, there are also more direct applications of phylogenetics, including: (i) identifying genetically distinct breeding populations that require separate protection and management; (ii) assessing the kinship of individuals to populations so that appropriate breeding stock can be identified for captive breeding programs; (iii) assessing the kinship of dead or captive individuals for the purpose of conservation law enforcement (i.e., molecular forensics); and (iv) guiding the collection and organization of long-term storage of germplasm in seed banks. Note that when working with evolutionary divergences below the species level, phylogenetics overlaps broadly with population genetics, where sophisticated methods based on gene genealogies are widely used.

The phylogeography of mouse lemurs presented above also illustrates how the phylogenetic framework has important applications to conservation biology. Before the phylogeographic study of the mouse lemurs, the important environmental barrier to migration was perceived to be the elevation and wetness of the habitat, suggesting that important conservation decisions might be made independently for the two sides of an east-west disjunction; that notion could not have been more incorrect. It seems that the primary disjunction should be north-south, although the situation is in reality much more complicated than that.

The comparative method has recently become a popular approach to examining risks of extinction and invasiveness.
Excerpts from a recent review of both the powers and pitfalls of this method in conservation biology are presented in the figure below. This article highlights three uses of the comparative method in conservation:
(i) developing predictive models for risk assessment;
(ii) identifying the general ecological principles that cause conservation problems;
(iii) identifying and using endangering traits as triage to prioritize research and conservation efforts.

Potential pitfalls are:
(i) the large and expensive sample sizes required for the method to have high power;
(ii) problems with using correlation-based methods to identify causal mechanisms.

Despite the limitations, it seems that the comparative method will grow to be one of many essential tools for conservation research. A hypothetical example from this paper is presented below; it illustrates how applying Fisher’s exact test to the raw data (ignoring phylogenetic non-independence) overestimates the relationship between extinction risk and body size.

Figure: Should we use a Fisher exact test?

6. Linguistics.

An interesting application of phylogenetic methods is to the discipline of linguistics. In particular, maximum likelihood methods have been applied to infer phylogenies of language groups, to estimate the dates of the most recent common ancestors of those groups, to identify parts of the language tree with low support, and to test specific hypotheses about the process of language evolution. A particularly interesting example is the study by Gray and Atkinson (2003), who used phylogenetic methods to test two theories for the origin of the Indo-European language group: (1) this language group was spread into Europe by Kurgan horsemen around 6000 years before present (BP) [Kurgan theory]; and (2) this language group spread into Europe with the expansion of agriculture from 8000-9500 years BP [Anatolian theory].

The phylogenetic analysis and dating of the origin of the Indo-European languages by Gray and Atkinson (2003) was in striking agreement with the Anatolian farming theory (see figure below); their estimate was 7800-9800 years BP. Interestingly, this result is consistent with a recent genetic study of human populations that supports a Near-Eastern Neolithic contribution to the European gene pool.

Figure: Language phylogeny and divergence dates support the Anatolian-origin theory of the Indo-European language family. Labels on the figure mark the root, the estimated date of each ancestral node, and the extinct languages used as outgroups. From Gray and Atkinson (2003) Nature 426:435-439.

Data: Cognate word forms were sampled from 87 languages. Three extinct languages thought to be more distantly related than the extant languages were included for the purpose of rooting the tree. Cognates were coded as present or absent (1 or 0) for each language. The final dataset was a binary matrix of 2,449 cognates.

Methods: Phylogenetic analysis was conducted under a stochastic model of binary character evolution that allowed for unequal character state frequencies and a heterogeneous rate of evolution among cognates. Bayesian methods were used to infer the tree topology shown in the figure. Values above each branch (in black) are Bayesian posterior probabilities. Divergence times were estimated by first assuming maximum and minimum divergence dates for 11 “calibration nodes” on the phylogeny. A semi-parametric, likelihood-based method was then used to infer the divergence dates for the nodes of the phylogeny.
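The presence/absence coding described in the Data note can be sketched as follows. The three languages, the two meanings, and the cognate class labels are invented placeholders, and `binary_matrix` is a hypothetical helper. The sketch only shows how cognate judgements might be turned into a 0/1 matrix; it does not reproduce the published dataset or its coding procedure.

```python
# Invented example: for each language, the cognate class used for each meaning.
COGNATE_CLASSES = {
    "lang_A": {"water": "w1", "fire": "f1"},
    "lang_B": {"water": "w1", "fire": "f2"},
    "lang_C": {"water": "w2", "fire": "f2"},
}

def binary_matrix(data):
    """Code every cognate class as present (1) or absent (0) in each language."""
    columns = sorted({c for classes in data.values() for c in classes.values()})
    rows = {
        lang: [1 if c in classes.values() else 0 for c in columns]
        for lang, classes in data.items()
    }
    return columns, rows

columns, rows = binary_matrix(COGNATE_CLASSES)
print(columns)                 # ['f1', 'f2', 'w1', 'w2']
for lang, row in rows.items():
    print(lang, row)
```

Each column of the matrix is then treated as one binary character, in the same way that a single site would be treated in a molecular dataset.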