Phylogenetic Analysis
• Motivation
– The problem of explaining the evolutionary history of today's species
– How do species relate to one another in terms of common ancestors?
– Nucleic acids and proteins also evolve
• Approaches
– Fossil records, phylogenetic trees

General comments on phylogenetics
• Phylogenetics is the branch of biology that deals with evolutionary relatedness
• Uses some measure of evolutionary relatedness, e.g., morphological features
• Phylogenetics on sequence data is an attempt to reconstruct the evolutionary history of those sequences
• Relationships between individual sequences are not necessarily the same as those between the organisms they are found in
• The ultimate goal is to be able to use sequence data from many sequences to give information about the phylogenetic history of organisms
• Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of “children”; the bottom of the tree (individual organisms) are leaves. Individual branch points are nodes.

What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogenetic inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
• Examine the process of evolution
– What drives evolution?
– Understanding mutation, gene flow and natural selection
• Examine the history of evolution
– What has evolution done in the past?
– Understanding how living organisms are related and how they have changed over time
• Aim
– To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species, etc.) and to study the reliability of the consensus tree

Common Phylogenetic Tree Terminology
[Figure: rooted tree with terminal nodes A, B, C, D and E]
• Terminal nodes: represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical ancestors of the taxa
• Ancestral node or root of the tree

Parts of a Phylogenetic Tree
[Figure labels: node, branch, root, ingroup, outgroup]

Phylogenetic trees diagram the evolutionary relationships between the taxa
[Figure: tree of Taxon A, Taxon B, Taxon C, Taxon D and Taxon E]
• There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
• The dimension along the branches either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
• ((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses
• These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
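The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming labels without branch lengths or quoting; the function names `parse_newick`, `leaves` and `clades` are illustrative choices, not part of the original slides. It turns ((A,(B,C)),(D,E)) into nested tuples and lists every clade the string implies.

```python
def parse_newick(text):
    """Parse a nested-parentheses tree (no branch lengths) into nested tuples."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and a trailing ";"
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        if pos == start:
            raise ValueError(f"empty label at position {start}")
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return tree


def leaves(node):
    """Return the set of leaf labels below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def clades(node):
    """Return every clade (internal node) as a sorted list of its leaves."""
    if isinstance(node, str):
        return []
    found = [sorted(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(",".join(clade))
```

The printed clades include B,C and A,B,C and D,E, which are exactly the groupings described in words above: B and C are sisters, A joins them, and D and E form the sister clade.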
• In phylogenetic trees
– Leaves represent present-day species
– Interior nodes represent hypothesized ancestors
– We will only consider binary trees: edges split only into two branches (daughter edges)
– Rooted trees have an explicit ancestor; the direction of time is explicit in these trees
– Unrooted trees do not have an explicit ancestor; the direction of time is undetermined in such trees

A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
• Which species are the closest living relatives of modern humans?
• What were the origins of specific transposable elements?
• Plus countless others...

Input data for Phylogenetic Reconstruction
• Distance Matrix
• Character State Matrix

Types of phylogenetic analysis methods
• Phenetic (distance methods): trees are constructed based on observed characteristics, not on evolutionary history
• Cladistic (parsimony and maximum likelihood methods): trees are constructed based on fitting observed characteristics to some model of evolutionary history

Distance methods
• Distance methods work from a set of distances d_ij, one for each pair of sequences i, j in the dataset. d_ij can be the fraction f of sites u where residues x_i and x_j differ; or d_ij can be such a fraction but weighted in some way (e.g. Jukes-Cantor distance)

Parsimony methods
• Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state
• For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position
• Parsimony methods assign a cost to each tree available to the dataset, then screen trees available to the dataset and select the most parsimonious
• Screening all the trees available to even a smallish dataset would take too much time; the branch-and-bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current tree has a bigger cost than the best complete tree found so far

Example of parsimonious tree building
• Tree on left requires only one change, tree on right requires two: the left tree is the most parsimonious

Character State Matrix
• A character has a finite number of states
• Taxonomical units for which we want to create a phylogeny are called objects, e.g. species, populations
• Every object has a state vector; all objects inherit the same characters but not the same states!

Character State Matrix M
• M has n rows (objects)
• M has m columns (characters)
• M_ij denotes the state object i has for character j

Which species are the closest living relatives of modern humans?
[Figure: two alternative trees of humans, chimpanzees, bonobos, gorillas and orangutans; the molecular tree has a time axis from 14 to 0 MYA, the pre-molecular tree a time axis from 15-30 to 0 MYA]
• Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.
• The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.
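Returning to the parsimony criterion and the character state matrix M described above, the per-position counting can be sketched as follows. This uses Fitch's small-parsimony algorithm, which the slides do not name, so treat it as one possible way to score a tree rather than the method the slides had in mind. The four objects w, x, y, z and their states are hypothetical illustration data, and trees are assumed to be rooted, binary, and written as nested tuples.

```python
def fitch_site_cost(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change on this node's branches
        return left | right

    state_set(tree)
    return changes


def parsimony_cost(tree, matrix):
    """Sum the per-character costs over all columns of the character state matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_site_cost(tree, {obj: row[j] for obj, row in matrix.items()})
        for j in range(n_chars)
    )


# Hypothetical character state matrix M: 4 objects (rows) x 3 characters (columns).
M = {"w": "AGT", "x": "AGT", "y": "CTT", "z": "CTT"}

tree_1 = (("w", "x"), ("y", "z"))
tree_2 = (("w", "y"), ("x", "z"))

print("tree_1 cost:", parsimony_cost(tree_1, M))
print("tree_2 cost:", parsimony_cost(tree_2, M))
```

For each of the two variable characters, tree_1 needs one change and tree_2 needs two, so tree_1 is the more parsimonious of the pair; the third character is constant and costs nothing on either tree.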
A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:
• When did specific episodes of positive Darwinian selection occur during evolutionary history?
• Which genetic changes are unique to the human lineage?
• What was the most likely geographical location of the common ancestor of the African apes and humans?
• Plus countless others...

The number of unrooted trees increases in a greater than exponential manner with number of taxa
[Figure: example unrooted trees for 3, 4, 5 and 6 taxa]
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ~8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa

Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: unrooted tree of A, B, C and D with a root position marked, and the resulting rooted tree]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree with a different root position, and the resulting rooted tree]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 of A, B, C and D with root positions 1 to 5, and the resulting rooted trees 1a, 1b, 1c, 1d and 1e]
These trees show five different evolutionary relationships among the taxa!

There are two major ways to root trees:
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa.
The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
[Figure: unrooted tree of A, B, C and D with branch lengths 10, 3, 2, 2 and 5]
d(A,D) = 10 + 3 + 5 = 18; midpoint = 18 / 2 = 9

Each unrooted tree theoretically can be rooted anywhere along any of its branches
[Figure: example unrooted trees for 3, 4, 5 and 6 taxa]
# Taxa (N)    # Unrooted trees    x  # Roots    =  # Rooted trees
3             1                      3             3
4             3                      5             15
5             15                     7             105
6             105                    9             945
7             945                    11            10,395
8             10,395                 13            135,135
9             135,135                15            2,027,025
...           ...                    ...           ...
30            ~8.69 x 10^36          57            ~4.95 x 10^38
(2N - 3)!! = # rooted trees for N taxa

Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:
                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion                Clustering algorithm
Characters       PARSIMONY, MAXIMUM LIKELIHOOD
Distances        MINIMUM EVOLUTION, LEAST SQUARES    UPGMA, NEIGHBOR-JOINING

Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.
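The transformation from aligned sequences to pairwise distances can be sketched with the five sequences listed above. This is a minimal illustration, assuming gap-free sequences of equal length over A, C, G, T; the names `SEQS`, `p_distance` and `k2p_distance` are illustrative. The first function is the uncorrected fraction of differing sites; the second applies the standard Kimura two-parameter formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the fractions of sites showing transitions and transversions.

```python
import math
from itertools import combinations

# The five aligned sequences from the character-based example (Species A to E).
SEQS = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(x, y):
    """Uncorrected distance: fraction of aligned sites at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def k2p_distance(x, y):
    """Kimura two-parameter distance from transition and transversion fractions."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(x, y):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(x)
    q = transversions / len(x)
    # math.log raises ValueError if the sequences are too divergent (saturation).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


for (name_1, seq_1), (name_2, seq_2) in combinations(SEQS.items(), 2):
    print(f"{name_1}-{name_2}: p = {p_distance(seq_1, seq_2):.2f}, "
          f"K2P = {k2p_distance(seq_1, seq_2):.2f}")
```

For Species A and Species B, which differ at 4 of 20 sites (one transition, three transversions), this gives p = 0.20 and a corrected distance of about 0.23. The corrected value is always at least as large as p because it tries to account for multiple substitutions at the same site.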
             Species A   Species B   Species C   Species D   Species E
Species A    ----        0.20        0.50        0.45        0.40
Species B    0.23        ----        0.40        0.55        0.50
Species C    0.87        0.59        ----        0.15        0.40
Species D    0.73        1.12        0.17        ----        0.25
Species E    0.59        0.89        0.61        0.31        ----
Example 1 (above the diagonal): Uncorrected “p” distance (= observed fraction of sites that differ)
Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)

Computational methods for finding optimal trees:
Exact algorithms: “Guarantee” to find the optimal or “best” tree for the method of choice. Two types used in tree building:
• Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
• Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.
Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.
Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases: the number of unrooted trees for N taxa is (2N - 5)!!, which is already 2,027,025 for 10 taxa.

Parsimony methods:
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!).
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”.
[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford et al. (1996) for a detailed explanation of various parsimony methods.]

Bootstrapping
• Evaluation of the tree reliability
• n trees are built (n = 100/1000/5000)
• How many times is a certain branch reproduced?
• Values between 1-100 (%)
• If the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally
• The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature

Phylogeny in medical forensics: HIV
• A dentist who was infected with HIV was suspected of infecting some of his patients in the course of treatment
• HIV evolves very quickly (about 10^-3 substitutions/site/year)
• Possible to trace the history of infections among individuals by conducting a phylogenetic analysis of HIV sequences
• Samples were taken from the dentist, patients, and other infected individuals in the community
• Study found 5 patients had been infected by the dentist
Source: Ou et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science, 256: 1165-1171.

Did the Florida Dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people (local controls). From Ou et al. (1992) and Page & Holmes (1998)]
• Yes: The HIV sequences from patients A, B, C, E and G fall within the clade of HIV sequences found in the dentist.
• No: The HIV sequences from patients D and F appear among the local controls.