http://creativecommons.org/licenses/by-sa/2.0/

Lecture 4.2

Evolution
“To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important.”
Father Jacobus (Hesse's Magister Ludi)

Some history is known
• Bacterial evolution observed
• Manchester Moths – light to dark

Key Concepts
• Fundamentals of Systematics
• Appreciate that phylogenetic analysis allows you to estimate or infer the evolutionary relationships between sequences/organisms
• Learn how to better interpret trees
• Gain insight into the different phylogenetic methods
• Appreciate the need for new algorithms
• DNA and protein analysis - benefits and pitfalls of each

First, some terminology…
• Systematics: an attempt to understand the interrelationships of living things
• Taxonomy: the science of naming and classifying organisms (evolutionary theory not necessarily involved)
• Phylogenetics: the field of systematics that focuses on evolutionary relationships between organisms or genes/proteins (phylogeny)
• Cladistics: a particular method of hypothesizing relationships among organisms/genes/proteins

Three basic assumptions of cladistics
• Any group of organisms/genes/proteins is related by descent from a common ancestor (fundamental tenet of evolutionary theory)
• There is a bifurcating pattern of cladogenesis (most controversial assumption)
• Change in characteristics occurs in lineages over time (necessary for cladistics to work!)

A phylogenetic tree
[Figure: tree of Human, Mouse and Fly, with a node and a clade labelled]
taxon -- Any named group of organisms – evolutionary theory not necessarily involved.
clade -- A monophyletic taxon (evolutionary theory utilized)

A phylogenetic tree with branch lengths
[Figure: tree of Human, Mouse and Fly with branch lengths 1, 2, 3 and 4, with a node and a clade labelled]
Branch length can be significant… In this case it is: mouse is slightly more similar to fly than human is to fly (the sum of branches 1+2+3 is less than the sum of 1+2+4)

Phylogenetic analysis
• Organismal relationships
• Gene/Protein relationships

Organismal relationships

Improving our understanding of organismal relationships
Realization that rates of change are not constant

Improving our understanding of organismal relationships
Better appreciation for what sequences may be suitable for analysis of different degrees of divergence
For the tree of life:
rRNA genes
Multiple genes
“Whole genome” datasets of genes
rRNA genes!

Improving our understanding of organismal relationships
Better sampling of all the species in our world
Amazing but true!
More bacteria in our bodies than human cells!
More different types of bacterial genes in our bodies than there are human genes!
“The second human genome project” (David Relman)

Gene/Protein Relationships
Homolog, ortholog, paralog??

Homologs
Have common origins but may or may not have common activity.
Homologous or not?: Often determined by an arbitrary threshold level of similarity, determined by alignment

Homologs
…have common ancestry, but the way they are related can vary (i.e. the reasons they have diverged into different sequences can vary)
• orthologs - Homologs produced by speciation. They tend to have similar function.
• paralogs - Homologs produced by gene duplication. They tend to have differing functions.
• xenologs - Homologs resulting from horizontal gene transfer between two organisms.
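
The ortholog/paralog definitions above can be illustrated with a small sketch. It assumes a toy gene tree whose internal nodes have already been hand-labelled as "speciation" or "duplication"; the tree, the leaf names and the function names are hypothetical, and horizontal transfer (xenologs) is not modelled. Two genes are called orthologs when their last common ancestor is a speciation node, and paralogs when it is a duplication node.

```python
# Hypothetical toy gene tree (not real data). An internal node is a tuple
# (event, left_child, right_child); a leaf is a plain string "gene_species".
GENE_TREE = (
    "duplication",
    ("speciation", "alpha_human", "alpha_mouse"),
    ("speciation", "beta_human", "beta_mouse"),
)


def leaves(node):
    """Return the list of leaf names found under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "ortholog" if event == "speciation" else "paralog"
    # Leaves on opposite sides of this node have this node as their
    # last common ancestor, so the node's event decides the relationship.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(GENE_TREE)
for pair, label in sorted((sorted(p), l) for p, l in relations.items()):
    print(pair, label)
```

In this sketch a gene in one species is also reported as a paralog of the other gene copy in a different species, because their last common ancestor is the duplication node. In practice the event labels would have to be inferred, for example by comparing the gene tree with the organismal tree.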

Orthologous or paralogous homologs
[Figure: an early globin gene undergoes gene duplication into an α-chain gene and a ß-chain gene. Mouse α, human α and cattle α are orthologs (α); cattle ß, human ß and mouse ß are orthologs (ß); cattle α and cattle ß are paralogs (cattle); all are homologs.]
Orthologs – diverged after speciation – tend to have similar function
Paralogs – diverged after gene duplication – some functional divergence occurs
Therefore, for linking similar genes between species, or performing “annotation transfer”, identify orthologs

True or False?
A1x is the ortholog in species x of A1y?
A1x is a paralog of A2x?
A1x is a paralog of A2y?

Identifying Gene/Protein Relationships from Phylogenetic trees
• orthologs - Homologs produced by speciation. Gene phylogeny matches organismal phylogeny.
• paralogs - Homologs produced by gene duplication. Multiple copies of homologs in a given species.
• xenologs - Homologs resulting from horizontal gene transfer between two organisms. Gene phylogeny does not match organismal phylogeny in a tree where most genes do match organismal phylogeny well.

Orthologs and Paralogs of the fly1 gene?
[Figure: known organismal phylogeny (Chimpanzee, Human, Mouse, Fly, Worm) shown alongside gene trees containing Fly1]

Xenologs: Horizontal gene transfer
[Figure: E. coli example]

Gene Orthology: How to detect?
• Most common high throughput computational method: Identify reciprocal best BLAST hits (EGO, COGs,…)
Example Problem:
• If making comparisons between human and bovine, for example, the bovine gene dataset is still quite incomplete
• Therefore, current best hit may be a paralog now and the true ortholog not yet sequenced
[Figure: gene tree with tips human, cattle, mouse, cattle]

Can we improve orthology analysis for linking functionally similar genes?
• One solution: Phylogenetic analysis of all putative human-bovine orthologs, using mouse as an outgroup
• Assumption:
  - Mouse and Human gene datasets are more complete, with more true orthologs identified
Expect (organismal phylogeny): [Figure: tree with tips cattle, human, mouse]
Reject: [Figure: tree with tips mouse, human, cattle]

Blue genes are from the same species
PaAlgU is an ortholog of ?
PaAlgU is a paralog of ?

2 Forms in 1 Species
[Figure: species tree with tips marked + or ++]
Slides from Jonathan Eisen

2 Forms in 1 Species - LGT
[Figure: species tree with tips marked + or ++. Labels: Both forms maintained; Red and blue forms diverge; Gene present in common ancestor]

2 Forms in 1 Species - Gene Loss
[Figure: species tree with tips marked + or ++. Labels: Loss; Loss; Gene duplicated in common ancestor]

Unusual Distribution Pattern
[Figure: species tree with two tips marked +]

Unusual Distribution - LGT
[Figure: species tree with two tips marked +. Labels: Acquires new type of gene; Gene originates here]

Unusual Distribution - Gene Loss
[Figure: species tree with two tips marked +. Labels: Gene lost here; Gene present in ancestor]

Unusual Distribution - Evolutionary Rate Variation
[Figure: species tree with two tips marked + and one marked -?. Label: Gene too diverged to be found]

Unusual Distribution - Incomplete Data
[Figure: species tree with tips marked + and +/-. Label: Gene present in ancestor]

Hope for the future
Better sampling of all the species in our world
2004: The dawn of environmental genomics sampling
Tyson et al (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428, 37-43.
Venter et al (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304, 66-74.

“So….. how do we construct a phylogenetic tree??”

Most common methods
• Parsimony
• Neighbor-joining
• Maximum Likelihood

Parsimony
• “Shortest-way-from-A-to-B” method
• The tree implying the least number of changes in character states (most parsimonious) is the best.
• Note:
  – May get more than one tree
  – No branch lengths
  – Uses all character data

Neighbor-joining (and other distance matrix methods)
• “Speedy-and-popular” method
• A distance matrix is constructed
• Each distance estimates the total branch length between any two given species/genes/proteins
• Neighbor-joining approach: pair the sequences that are most alike, then use that pair to join to the next closest sequence

Practical comparison of common distance matrix methods: some PHYLIP and PAUP programs as an example
• Neighbor-joining: fast – not so good for highly divergent sequences
• Fitch: better but slower, and the result is not that different (seeks to maximize fit of pairwise distances)
• Kitsch: assumes an equal rate of evolution – can greatly bias results, so do not use!
• Minimum Evolution (PAUP): similar to Fitch but fixes location of internal versus external nodes when maximizing fit
• Note: gap info not incorporated into analysis

Maximum Likelihood
• “Inside-out” approach
• Produces trees and then asks how likely each tree is to have generated the data
• Gives an estimate of the likelihood of a particular tree, given a certain model of nucleotide substitution
• Notes:
  – All sequence info (including gaps) is used
  – Based on a specific model of evolution – gives probability
  – Verrrrrrrrrrrry slow (unless topology of tree is known)

How reliable is a result?
• Non-parametric bootstrapping – analysis of a sample of (e.g. 100 or 1000) randomly perturbed data sets.
  – perturbation: random resampling with replacement (some characters are represented more than once, some appear once, and some are deleted)
  – perturbed data analysed like real data
  – the number of times that each grouping of species/genes/proteins appears in the resulting profile of cladograms is taken as an index of relative support for that grouping

Bootstrapping
The number of times a particular branch is formed in the tree (out of the X times the analysis is done) can be used to estimate its probability, which can be indicated on a consensus tree.
High bootstrap values don’t mean that your tree is the true tree! Good alignment and evolutionary assumptions are key.

Parametric Bootstrapping
Data are simulated according to the hypothesis being tested.

Are we confident that the cow, mole and hedgehog have one ancestor?

Orthologs, paralogs and homologs – one more test!
[Figure: an early globin gene undergoes gene duplication into an α-chain gene (mouse α, human α, cattle α) and a ß-chain gene (cattle ß, human ß, mouse ß)]
Orthologs – diverged after speciation – tend to have similar function
Paralogs – diverged after gene duplication – some functional divergence occurs
Therefore, for linking similar genes between species, or performing “annotation transfer”, identify orthologs

Phylogenetics – More info
Li, Wen-Hsiung. 1997. Molecular Evolution. Sunderland, Mass.: Sinauer Associates.
- a good starting book, clearly describing the basis of molecular evolution theory. It is a 1997 book, so it is starting to get a bit out of date.
Nei, Masatoshi & Kumar, Sudhir. 2000. Molecular Evolution and Phylogenetics. Oxford; New York: Oxford University Press.
- more recent, and by two very well respected researchers in the field. A bit more in-depth than the previous book, but very useful.
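
As a rough illustration of the non-parametric bootstrap described in this lecture, the sketch below resamples alignment columns with replacement and counts how often a chosen grouping is recovered. Everything here is an assumption made for illustration: the toy alignment is invented, and a simple "closest pair by proportion of differing sites" rule stands in for a real tree-building step such as neighbor-joining.

```python
import random

# Hypothetical toy alignment (equal-length sequences); not real data.
ALIGNMENT = {
    "human": "ACGTACGTACGTACGTACGT",
    "mouse": "ACGTACGAACGTACGTACTT",
    "fly":   "TCGAACTTAGGTTCGAACGA",
    "worm":  "TCGTACTTAGGATCGAACCA",
}


def resample_columns(alignment, rng):
    """Build one bootstrap replicate by drawing columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Stand-in for tree building: the pair of names with the smallest distance."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))


def bootstrap_support(alignment, group, replicates=1000, seed=0):
    """Fraction of replicates in which `group` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(group))
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == target
        for _ in range(replicates)
    )
    return hits / replicates


support = bootstrap_support(ALIGNMENT, ("human", "mouse"))
print(f"human+mouse grouped in {support:.0%} of replicates")
```

Each replicate is analysed exactly like the original data, and the resulting fraction plays the role of the bootstrap value that would be written on a branch of a consensus tree.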

Phylogenetic Tree Construction: Examples of Common Software
PHYLIP http://evolution.genetics.washington.edu/phylip.html
PAUP http://paup.csit.fsu.edu/
MEGA 2.1 www.megasoftware.net/
TREEVIEW http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Extensive list of software http://evolution.genetics.washington.edu/phylip/software.html

PhyloBLAST – a tool for analysis

Challenges
How do we classify?

Computational Challenges
• Need to incorporate more evolutionary theory into the multiple sequence alignment and phylogenetic algorithms used in phylogenetic analysis
• Phylogenetic analyses are computationally intensive – great way to benchmark your CPU speed!

More Challenges
• Increasing the sampling of our genetic world
• More accurately differentiating orthologs, paralogs, and horizontally acquired genes
• How frequent is gene loss, gene duplication, and horizontal gene transfer in genome evolution?
• To what degree can we predict protein/gene function using phylogenetic analysis?

Evolution
“To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important.”
Father Jacobus (Hesse's Magister Ludi)

Evolutionary theory is evolving