A Brief Introduction to Molecular Evolution & Phylogenetics

Aims of the course:
• To introduce the practice of phylogenetic inference from molecular data.
• To get to know the applications and computer programs used to carry out phylogenetic inference.

Two Concepts of Molecular Evolution
• Orthologous vs paralogous genes
  – Gene trees & species trees
• Molecular clock
  – Substitution rates

Homologous genes
• Orthologous genes: derived from a process of new species formation (speciation)
• Paralogous genes: derived from an original gene duplication process in a single biological species

Species trees vs Gene trees
• Orthologous genes of cytochrome: each one is present in a different biological species
• Paralogous genes of globin: α, β and δ globin (Glob), myoglobin (Myo) and leghaemoglobin (Leg), each originated by duplication from an ancestral gene

Species trees and Gene trees
[Figure: a gene tree (a, b, c) compared with a species tree (A, B, D).]
We often assume that gene trees give us species trees

Orthologues and paralogues
[Figure: duplication of an ancestral gene gives 2 copies (paralogues on the same genome). Genes a, b, c are orthologous to one another, as are A, B, C; genes from different copies are paralogous. The sampled genes (marked *) are a mixture of orthologues and paralogues.]

The malic enzyme gene tree contains a mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with a marked gene duplication and node support values (75 to 100). Sequences carry compartment labels (Mit = mitochondrion, Ch = chloroplast, Cyt, Hyd) and come from Ascaris suum, Zea mays, Anas platyrhynchos (a duck), Homo sapiens 1 and 2, Flaveria trinervia, Populus trichocarpa, Solanum tuberosum, Amaranthus, Neocallimastix, Trichomonas vaginalis, Giardia lamblia, Schizosaccharomyces, Saccharomyces and Lactococcus lactis.]

Is there a molecular clock?
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962
• They noted that rates of amino acid replacement in animal haemoglobins were roughly proportional to time - as judged against the fossil record

The molecular clock for alpha-globin:
[Figure: number of substitutions (0 to 100) plotted against time to common ancestor (0 to 500 millions of years) for shark, carp, platypus, chicken and cow. Each point represents the number of substitutions separating each animal from humans.]

Rates of amino acid replacement in different proteins
Protein          Rate (mean replacements per site per 10^9 years)
Fibrinopeptides  8.3
Insulin C        2.4
Ribonuclease     2.1
Haemoglobins     1.0
Cytochrome c     0.3
Histone H4       0.01
• Evolutionary rates depend on the functional constraints of proteins

There is no universal clock
• The initial proposal saw the clock as a Poisson process with a constant rate
• Now known to be more complex - differences in rates occur for:
  – different sites in a molecule
  – different genes
  – different base positions (synonymous vs nonsynonymous)
  – different regions of genomes
  – different genomes in the same cell
  – different taxonomic groups for the same gene
• Molecular Clocks Not Exactly Swiss

Phylogenetic Trees
[Figure: a cladogram of taxa A to J with its parts labelled: leaves, terminal branches, interior branches, nodes (node 1, node 2), a polytomy and the root.]

Trees - Rooted and Unrooted
[Figure: trees of taxa A to J shown in rooted and unrooted form, with alternative root positions marked.]

Rooting using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacterial outgroup; the root separates the outgroup from the monophyletic ingroup.]

Some Common Phylogenetic Methods
                      Type of data
Tree-building method  Distances                         Sites (nucleotides, aa)
Cluster algorithms    UPGMA, NJ
Optimality criteria   Minimum Evolution, Least Squares  Parsimony, Maximum Likelihood, Bayesian Inference

Distance Methods
• Distance estimates attempt to estimate the mean number of changes
per site since two species (sequences) split from each other.
• Simply counting the number of differences may underestimate the amount of change - especially if the sequences are very dissimilar - because of multiple hits.
• We therefore use a model which includes parameters that reflect how we think the sequences may have evolved.

Calculating distances: observed vs. actual changes
Substitution  Sequence 1  Sequence 2  Observed  Actual
None          A→A         A→A         0         0
Single        A→A         A→C         1         1
Coincident    A→C         A→G         1         2
Multiple      A→A         A→C→G       1         2
Parallel      A→C         A→C         0         2
Convergent    A→C         A→G→C       0         3
Back          A→A         A→C→A       0         2

The simplest model: Jukes & Cantor:
$d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$
• $d_{xy}$ = distance between sequence x and sequence y, expressed as the number of changes per site (note $d_{xy} = r/n$, where r is the number of replacements and n is the total number of sites. This assumes all sites can vary; when invariant sites are present in the two sequences it will underestimate the amount of change which has occurred at the variable sites)
• D = the observed proportion of nucleotides which differ between the two sequences (fractional dissimilarity)
• ln = natural log function, used to correct for superimposed substitutions
• The 3/4 and 4/3 terms reflect that there are four types of nucleotide and three ways in which a second nucleotide may not match a first - with all types of change being equally likely (i.e.
unrelated sequences should be 25% identical by chance alone)
• The natural logarithm ln is used to correct for superimposed changes at the same site

• If two sequences are 95% identical, they differ at 5% or 0.05 (D) of sites, thus:
  – $d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times 0.05\right) = 0.0517$
• Note that the observed dissimilarity 0.05 increases only slightly to an estimated 0.0517 - this makes sense because in two very similar sequences one would expect very few changes to have been superimposed at the same site in the short time since the sequences diverged
• However, if two sequences are only 50% identical, they differ at 50% or 0.50 (D) of sites, thus:
  – $d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times 0.5\right) = 0.824$
• For dissimilar sequences, which may have diverged a long time ago, the use of ln implies that a much larger number of superimposed changes have occurred at the same site

Distance models can be made more parameter rich to increase their realism
• It is better to use a model which fits the data than to blindly impose a model on the data
• The most common additional parameters are:
  – A correction for the proportion of sites which are unable to change
  – A correction for variable site rates at those sites which can change
  – A correction to allow different substitution rates for each type of nucleotide change
• PAUP will estimate the values of these additional parameters for you.

A gamma distribution can be used to model site rate heterogeneity

Exchangeability parameters for two models of amino acid replacement.
Exchangeability parameters from two common empirical models of amino acid sequence evolution are presented. The parameter value for each amino acid pair is indicated by the area of the bubble, and discounts the effects of amino acid frequencies. (a) The JTT model (Jones, D.T. et al. 1992 CABIOS 8, 275–282), derived from a wide variety of globular proteins. (b) The mtREV model (Yang, Z. et al. 1998 Mol. Biol. Evol.
15, 1600–161) derived from mammalian mitochondrial genes that encode various transmembrane proteins.

Distances: advantages:
• Fast - suitable for analysing data sets which are too large for ML
• A large number of models are available, many with additional parameters, which improves the estimation of distances
• Use ML to test the fit of the model to the data

Distances: disadvantages:
• Information is lost - given only the distances it is impossible to derive the original sequences
• Only through character-based analyses can the history of sites be investigated, e.g. the most informative positions be inferred.
• Generally outperformed by maximum likelihood methods in choosing the correct tree in computer simulations

Numbers of possible trees for N taxa:
• $T(N) = \prod_{i=3}^{N} (2i-5)$ unrooted trees, $N \ge 3$: 1, 3, 15, 105, 945, 10395, 135135 (for N = 3 to 9)
• For 10 taxa there are 2 x 10^6 unrooted trees
• For 50 taxa there are 3 x 10^74 unrooted trees
• How can we find the best tree?

Cluster Analysis: UPGMA and NJ
The closest pair of elements is joined recursively.
The distance matrix is recalculated (*) and the joined pair is then treated as a single new element.

Unrooted Neighbor-Joining Tree
[Figure: unrooted neighbor-joining tree of Human, Monkey, Mosquito, Spinach and Rice.]

A perfectly additive tree
     A    B    C    D
A    -    0.4  0.4  0.8
B    0.4  -    0.6  1.0
C    0.4  0.6  -    0.8
D    0.8  1.0  0.8  -
[Figure: the unrooted tree for this matrix, with terminal branches A = 0.1, B = 0.3, C = 0.2, D = 0.6 and an interior branch of 0.1 separating (A, B) from (C, D).]
The branch lengths in the matrix and the tree path lengths match perfectly - there is a single unique additive tree

Distance estimates may not make an additive tree
Jukes-Cantor distance matrix
Proportion of sites assumed to be invariable = 0.56; identical sites removed proportionally to base frequencies estimated from constant sites only
              1        2        4        5        6
1 ruber       -
2 Aquifex     0.38745  -
4 Deinococc   0.22455  0.47540  -
5 Thermus     0.13415  0.27313  0.23615  -
6 Bacillus    0.27111  0.33595  0.28017  0.28846  -
[Figure: tree for these five taxa with branch lengths 0.217, 0.119, 0.145, 0.079, 0.057, 0.056 and 0.017. Path lengths on the tree: Aquifex > Bacillus (0.335), Aquifex > Thermus (0.33), Thermus > Deinococcus (0.218).]
Some path lengths are longer and others shorter than appear in the matrix

Obtaining a tree using pairwise distances
• Stochastic errors will cause deviation of the estimated distances from perfect tree additivity, even when evolution proceeds exactly according to the distance model used
• Poor estimates obtained using an inappropriate model will compound the problem
• How can we identify the tree which best fits the experimental data from the many possible trees?
• Use statistics to evaluate the fit of the tree to the data (goodness of fit measures)
  – Fitch Margoliash method - a least squares method
  – Minimum evolution method - minimises the length of the tree
• Note that neighbor joining, while fast, does not evaluate the fit of the data to the tree

Fitch Margoliash Method 1968:
• Minimises the weighted squared deviation of the tree path length distances from the distance estimates
[Figure: two candidate trees, Tree 1 and Tree 2 (best), for Aquifex, Bacillus, Thermus, Deinococc and ruber, with fitted branch lengths.]
Optimality criterion = distance (weighted least squares with power=2)
Score of best tree(s) found = 0.12243 (average %SD = 11.663)
Tree #     1        2
Wtd. S.S.  0.13817  0.12243
APSD       12.391   11.663

Minimum Evolution Method:
• For each possible alternative tree one can estimate the length of each branch from the estimated pairwise distances between taxa and then compute the sum (S) of all branch length estimates. The minimum evolution criterion is to choose the tree with the smallest S value
[Figure: two candidate trees, Tree 1 (best) and Tree 2, for Aquifex, Bacillus, Thermus, Deinococc and ruber, with branch lengths.]
Optimality criterion = distance (minimum evolution)
Score of best tree(s) found = 0.68998
Tree #    1        2
ME-score  0.68998  0.69163

Parsimony analysis
• Parsimony methods provide one way of choosing among alternative phylogenetic hypotheses
• The parsimony criterion favours hypotheses that maximise congruence and minimise homoplasy (convergence, reversal & parallelism)
• It depends on the idea of the fit of a character to a tree

Parsimony
Seq 1 ...ACCT...
Seq 2 ...AACT...
Seq 3 ...TACT...
Seq 4 ...TCCT...
         1200 = 3
[Figure: the character with states 1(C), 2(A), 3(A), 4(C) mapped onto two alternative trees. The tree that pairs sequences 1 and 3 against 4 and 2 requires 2 mutations; the tree that pairs sequences 1 and 4 against 2 and 3 requires 1 mutation.]

Maximum Likelihood - goal
• To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time.
  – $P(D \mid H)$: the probability of the data D (the aligned sequences) given the hypothesis H (a tree and a substitution model for a, c, g, t)

Maximum likelihood
[Figure: unrooted four-taxon tree with tips 1 to 4 and interior nodes 5 and 6. Branch v1 joins node 5 to tip 1, v2 joins node 5 to tip 2, v5 joins nodes 5 and 6, v3 joins node 6 to tip 3 and v4 joins node 6 to tip 4.]
Where:
$g_{x_0}$ = prior probability that node 0 has nucleotide x (relative frequency)
$P_{ii}(v) = g_i + (1 - g_i)\,e^{-v}$ (if $g_i = 1/4$, the model becomes JC)
$P_{ij}(v) = g_j\,(1 - e^{-v})$ for $i \neq j$
$l_k = g_{x_5}\, P_{x_5 x_1}(v_1)\, P_{x_5 x_2}(v_2)\, P_{x_5 x_6}(v_5)\, P_{x_6 x_3}(v_3)\, P_{x_6 x_4}(v_4)$
Since we do not know $x_5$ and $x_6$, we sum over all the possible nucleotides:
$L_k = \sum_{x_5} \sum_{x_6} g_{x_5}\, P_{x_5 x_1}(v_1)\, P_{x_5 x_2}(v_2)\, P_{x_5 x_6}(v_5)\, P_{x_6 x_3}(v_3)\, P_{x_6 x_4}(v_4)$
Summing over all sites:
$\ln L = \sum_{k=1}^{n} \ln L_k$
$\ln L$ is maximized by varying the $v_i$

Bayes' rule (Bayes' theorem)
$f(\theta \mid X) = \dfrac{p(\theta)\, l(X \mid \theta)}{\int p(\theta)\, l(X \mid \theta)\, d\theta}$
• $f(\theta \mid X)$: posterior distribution
• $p(\theta)$: prior distribution
• $l(X \mid \theta)$: likelihood function
• $\int p(\theta)\, l(X \mid \theta)\, d\theta$: unconditional probability
Pr[Tree | Data] = (Pr[Tree] x Pr[Data | Tree]) / Pr[Data]
[Figure: a prior probability distribution over trees A, B and C is combined with the data (observations) to give a posterior probability distribution over the same trees; both probability axes run to 1.0.]

Markov Chain Monte Carlo (MCMC)
[Figure: $p(\theta)\, l(X \mid \theta)$ plotted as probability over parameter space.]

Bootstrap
[Figure: an original protein alignment and pseudo-replicate alignments resampled from its columns, with the resulting support values (86, 50, 75, 90, 70, 65) shown on the tree.]

Applications of phylogenetics:
• Tracing the origin of a strain
• Dating the introduction of a strain
• Studying function
• Evolutionary studies

Tracing the origin
[Figure: tree with strains labelled by region: Europe, Asia, America, Europe.]

Epidemiological data
RNA viruses: high rate of evolution
[Figure: dated tree with branches a, b, c, d, node times t0 and t1, and samples from 1926 and 1970.]
(1926 - t0) * v = a
(1970 - t1) * v = c + d
...

Function
A ...ahgfhgkhkafkdggggcatgcgayhhks...
B ...rfgfkgkhkaykdggggcatgcgayhhks...
Function 1
C ...ahdfhgkrkafkdggcccatgcgayhhks...
D ...ahdfhgkrkafkdglcccatgcgayhhks...
E ...ghdfhg-rkafkdhtcccatgcgayhhks...
Ancestral states
Function 2

PHYLIP
http://evolution.genetics.washington.edu/phylip.html

DNA
• DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid sequences.
• DNAMOVE. Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility.
• DNAPENNY. Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search.
• DNACOMP. Estimates phylogenies from nucleic acid sequence data using the compatibility criterion.
• DNAINVAR. For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants.
• DNAML. Estimates phylogenies from nucleotide sequences by maximum likelihood.
• DNAMLK. Same as DNAML but assumes a molecular clock.
• DNADIST. Computes four different distances between species from nucleic acid sequences.

Proteins
• PROTPARS. Estimates phylogenies from protein sequences using the parsimony method.
• PROTDIST. Computes a distance measure for protein sequences.

Restriction
• RESTML. Estimation of phylogenies by maximum likelihood using restriction sites data.

Continuous
• CONTML. Estimates phylogenies from gene frequency data by maximum likelihood.
• GENDIST. Computes one of three different genetic distance formulas from gene frequency data.

.....
• SEQBOOT. Reads in a data set, and produces multiple data sets from it by bootstrap resampling.
• FITCH. Estimates phylogenies from distance matrix data under the "additive tree model".
• KITSCH. Estimates phylogenies from distance matrix data under the "ultrametric" model.
• NEIGHBOR. An implementation of Saitou and Nei's "Neighbor Joining Method" and of the UPGMA (Average Linkage clustering) method.
• CONSENSE. Computes consensus trees by the majority-rule consensus tree method.
.....

Discrete characters
• MIX. Wagner parsimony method and Camin-Sokal parsimony method.
• MOVE. Interactive construction of phylogenies from discrete character data. Evaluates parsimony and compatibility criteria.
• PENNY. Finds all most parsimonious phylogenies.
• DOLLOP. Estimates phylogenies by the Dollo or polymorphism parsimony criteria.
• DOLMOVE. Interactive DOLLOP.
• DOLPENNY. Branch-and-bound method for the Dollo and polymorphism criteria.
• CLIQUE. Finds the largest clique of mutually compatible characters.
...

Thanks!
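
Two formulas quoted in the distance-methods slides can be checked numerically: the Jukes & Cantor correction and the count of unrooted trees. The following is a minimal sketch, not part of the original course material; the function names `jc_distance` and `count_unrooted_trees` are hypothetical, chosen only for illustration, and the tree count assumes strictly bifurcating unrooted trees.

```python
import math


def jc_distance(d_obs: float) -> float:
    """Jukes-Cantor distance d_xy = -(3/4) ln(1 - (4/3) D).

    d_obs is D, the observed proportion of sites that differ.
    """
    if not 0.0 <= d_obs < 0.75:
        # At D = 3/4 the argument of the logarithm reaches zero, so the
        # correction is undefined (the sequences look saturated).
        raise ValueError("observed dissimilarity must satisfy 0 <= D < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d_obs)


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees: product of (2i - 5), i = 3..N."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    total = 1
    for i in range(3, n_taxa + 1):
        total *= 2 * i - 5
    return total


if __name__ == "__main__":
    # The two worked examples from the slides: 95% and 50% identical sequences.
    for d in (0.05, 0.50):
        print(f"D = {d:.2f} -> d_xy = {jc_distance(d):.4f}")
    # Tree counts for N = 3..9, then the order of magnitude for 10 and 50 taxa.
    print([count_unrooted_trees(n) for n in range(3, 10)])
    print(f"{count_unrooted_trees(10):.1e}", f"{count_unrooted_trees(50):.1e}")
```

Run as a script, this reproduces the values given in the slides: 0.0517 and 0.824 for the two distances, and 1, 3, 15, 105, 945, 10395, 135135 for 3 to 9 taxa. The range check in `jc_distance` is a design choice of this sketch: rather than returning infinity or NaN, it refuses observed dissimilarities of 0.75 or more, where the correction no longer applies.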