Introduction to Phylogenetic Systematics
Mark Fishbein
Dept. Biological Sciences
Mississippi State University
13 October 2003

Which of these critters are most closely related?
[Images: alligator, gila monster, purple gallinule, gopher tortoise, kingsnake]

Phylogeny
• Branching history of evolutionary lineages
• New branches arise via speciation
• Speciation occurs when gene flow is severed between populations
• Phylogenetic relationships depicted as a tree
© W. S. Judd, et al., Plant Systematics

Phylogenetic data
• Morphology
• Secondary chemistry
• Cytology
• Allele frequencies
• Protein sequences
• Restriction sites
• DNA sequences
[Brace on slide: “Molecular” data]

Molecular (genetic) data
• Proteins
  – Serology (immunoassay)
  – Isozymes (electrophoretic variants)
  – Amino acid sequences
• DNA
  – Structural (translocations, inversions, duplications)
  – Restriction sites
  – DNA sequences
    • Substitutions
    • Insertions/Deletions

What are genes?
From Raven et al. (1999), Biology of Plants

Genomes
• All of the genes within a cell are the genome
• Genes located in the nucleus are the nuclear genome
• Other genomes (organellar)
  – Mitochondrion: mitochondrial genome
  – Chloroplast: plastid genome
[Figure labels: nucleus, chloroplast, mitochondrion]
From Raven et al. (1999), Biology of Plants

Comparison of Genomes
                        Nuclear      Mitochondrial   Plastid
Size                    Large        Small           Small
Number of chromosomes   Multiple     Single          Single
Shape of chromosomes    Linear       Circular        Circular
Ploidy                  Diploid      Haploid         Haploid
Inheritance             Biparental   Uniparental     Uniparental

Structural rearrangements
[Figure panels: Inversion; Crossing over, duplication, and loss]
From Freeman and Herron (1998), Evolutionary Analysis

Chemistry of Genes
• DNA
• Parallel strands linked together
• Linear array of units called nucleotides
  – Phosphate
  – Sugar: deoxyribose
  – One of four bases
    • Adenine (“A”)
    • Cytosine (“C”)
    • Guanine (“G”)
    • Thymine (“T”)
From Raven et al.
(1999), Biology of Plants

DNA structure
• Paired strands are linked by bases
  – A must bond with T
  – G must bond with C
• Each link is composed of a purine and a pyrimidine
  – A & G are purines
  – C & T are pyrimidines

DNA function
• DNA is code for making proteins (and a few other molecules)
• Proteins are the structures and enzymes that catalyze biochemical reactions that are essential for the function of an organism
• DNA code is read and converted to protein in two steps
  – Transcription: DNA is copied to messenger RNA
  – Translation: messenger RNA is template for protein

DNA code
• A gene is a code composed of a string of nucleotide bases (A’s, C’s, G’s, T’s)
• A protein is composed of a string of amino acids (there are 20)
• How does the DNA code get translated into protein?

DNA code
• Each amino acid is coded for by at least one triplet of nucleotide bases in DNA
• Each triplet is called a codon
• There are 64 possible codons (4 bases, 3 positions: 4^3 = 64)
From Raven et al. (1999), Biology of Plants

DNA functional classes
• Coding
  – Proteins (exons)
  – Ribosomes (RNA)
  – Transfer RNA
• “Non-coding”
  – Introns
  – Spacers
From Raven et al. (1999), Biology of Plants

Homology in Molecular Systematics
• Assess orthology
• Align sequences
• Homology is often implicit (is this a good thing?)

DNA Sequences and Homology
• Homology: similarity due to common descent
• How do we assess homology of DNA sequences?
• Levels of homology
  – Locus
  – Allele
  – Nucleotide position
From W. P. Maddison (1997), Systematic Biology 46:527

Orthology vs. Paralogy
• DNA sequences that are at homologous loci are orthologous
• DNA sequences that are similar due to duplication but are at different loci are paralogous
• Orthology may be best detected with a phylogenetic analysis of all sequences
From Martin & Burg (2002), Systematic Biology 51:578

Multiple Sequence Alignment
• Goal: create data matrix in which columns are homologous positions
• Problem: sequences vary in length
• Why?
  – Insertions
  – Deletions

Simple Sequence Alignment
Taxon 1   GTACGTTG
Taxon 2   GTACGTTG
Taxon 3   GTACGTTG
Taxon 4   GTACATTG
Taxon 5   GTACATTG
Taxon 6   GTACATTG

DNA Sequence Data Matrix
     C1  C2  C3  C4  C5  C6  C7  C8
T1   G   T   A   C   G   T   T   G
T2   G   T   A   C   G   T   T   G
T3   G   T   A   C   G   T   T   G
T4   G   T   A   C   A   T   T   G
T5   G   T   A   C   A   T   T   G
T6   G   T   A   C   A   T   T   G

Slightly Less Simple Sequence Alignment
Taxon 1   AGAGTGAC
Taxon 2   AGAGTGAC
Taxon 3   AGAGTGAC
Taxon 4   AGAGGAC
Taxon 5   AGAGGAC
Taxon 6   AGAGGAC

Slightly Less Simple Sequence Alignment
Taxon 1   AGAGTGAC
Taxon 2   AGAGTGAC
Taxon 3   AGAGTGAC
Taxon 4   AGAG-GAC
Taxon 5   AGAG-GAC
Taxon 6   AGAG-GAC

Alignment Gaps
• Gaps are inserted to maximize homology across nucleotide positions
• Gaps are hypothesized indels
• Inserting a gap assumes that an indel event is a better explanation of the differences among sequences than nucleotide substitution

Taxon 1         AGAGTGAC   AGAGTGAC
Taxon 2         AGAGTGAC   AGAGTGAC
Taxon 3         AGAGTGAC   AGAGTGAC
Taxon 4         AGAGGAC    AGAG-GAC
Taxon 5         AGAGGAC    AGAG-GAC
Taxon 6         AGAGGAC    AGAG-GAC
Substitutions   3          0
Indels          0          1

Ambiguous Alignment with a Single-Base Indel
Taxon 1   GGTCAG
Taxon 2   GGCCAA
Taxon 3   AGCTAA
Taxon 4   AGCAA
Taxon 5   AGCAA
Taxon 6   AGCAA

Ambiguous Alignment with a Single-Base Indel
Taxon 1         GGTCAG   GGTCAG
Taxon 2         GGCCAA   GGCCAA
Taxon 3         AGCTAA   AGCTAA
Taxon 4         AG-CAA   AGC-AA
Taxon 5         AG-CAA   AGC-AA
Taxon 6         AG-CAA   AGC-AA
Substitutions   4        4
Indels          1        1

Gap Number and Length
• All else being equal, is it better to assume fewer longer gaps, or more shorter gaps?
• In other words, what is more likely:
  – For a new indel to occur?
  – For an existing indel to lengthen?
• There is no general answer!
  – Alternate alignments are explored algorithmically

Alignment Algorithms
• Typically built up from pairwise alignments, using assumed gap costs
• Problem: most algorithms require an initial tree to define alignment order, which introduces bias
• Solution: simultaneous tree estimation and alignment optimization
• Problems: costly, unjustifiable parameters

Clustal Alignment Algorithm
• Creates alignment based on penalties for gap opening (number of gaps) and gap extension (gap length)
• Multiple alignment built according to guide tree determined by pairwise alignments
• Order of adding sequences determined by a guide tree

Clustal Alignment Algorithm
1. Distance matrix calculated from pairwise comparisons
2. Dendrogram calculated from distance matrix
3. Alignment calculated for most similar pair of sequences, based on alignment parameters
4. Additional sequences are added according to dendrogram, until all sequences are added

Tree-Based Alignment
• Simultaneous tree and alignment estimation using parsimony
  – TreeAlign
  – MALIGN
• Implement similar gap opening/extension costs
• These applications are very slow!

Alignment in the Future?
• Incorporate a more sophisticated understanding of molecular evolution in parameterization
• For example, what are realistic values of gap costs? Are they universal?
• Can phylogeny estimation proceed without optimizing alignments?
  – Likelihood-based methods can sum over all alignments
• Will require major contribution of biologists

Methods of tree estimation
• Character based
  – Maximum parsimony (MP)
    • Fewest character changes
  – Maximum likelihood (ML)
    • Highest probability of observing data, given a model
  – Bayesian
    • Similar to ML, but incorporates prior knowledge
• Distance based
  – Minimum distance
    • Shortest summed branch lengths

Major classes of data

Character-based
Bird        A  G  A  G  T
Alligator   A  G  G  G  T
Lizard      A  C  C  G  G
Snake       A  C  C  G  G
Tortoise    C  C  T  C  C

Distance-based
            Bird   Alligator   Lizard   Snake
Alligator   0.20
Lizard      0.60   0.60
Snake       0.60   0.60        0.00
Tortoise    1.00   1.00        0.80     0.80

Minimum Distance
            Bird   Alligator   Lizard   Snake
Alligator   0.20
Lizard      0.60   0.60
Snake       0.60   0.60        0.00
Tortoise    1.00   1.00        0.80     0.80

Maximum Parsimony
            1  2  3  4  5
Bird        A  G  A  G  T
Alligator   A  G  G  G  T
Lizard      A  C  C  G  G
Snake       A  C  C  G  G
Tortoise    C  C  T  C  C
[Tree annotations: 1: A; 4: G; 2: C]
3, 5 are slightly more complicated...

Parsimony Criterion
$$L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \,\mathrm{diff}(x_{k'j}, x_{k''j})$$
L = tree length
$\tau$ = topology
k = branch
B = number of branches
j = character
N = number of characters
w = character weight
$x_{k'j}$, $x_{k''j}$ = states of character j at the two ends of branch k
diff($x_1$, $x_2$) = number of steps along branch

Parsimonious Character Reconstruction
• To evaluate the parsimony of a tree, each character is optimized (then the sum is computed)
• Several parsimony algorithms have been developed that optimize character reconstructions
• Algorithms differ in assumptions about permissible transformations between character states

Likelihood Criterion
$$L(\tau) = \prod_{j=1}^{N} l_j$$
L = tree likelihood
$\tau$ = topology
j = character (site)
l = site likelihood

Site Likelihoods
• $l_j$, the probability of the nucleotides of each sequence at a given site, is the product of probabilities along the branches of the tree
• The probability along a branch is the product of
  – Probability of a substitution
  – Branch length
  – Summed over ancestral states

Substitution Model
• Many models have been proposed
• Elements are the rate of substitution of one base for
  another, per site
• Rates are instantaneous (probability of change in a short period of time)
• Rates may be allowed to vary among sites

Maximum Likelihood
            1  2  3  4  5
Bird        A  G  A  G  T
Alligator   A  G  G  G  T
Lizard      A  C  C  G  G
Snake       A  C  C  G  G
Tortoise    C  C  T  C  C
Tree is selected that maximizes likelihood of observed sequences, given a model of substitution

The Molecular Systematics Revolution
• Dramatic increase in the size of data sets
  – Characters, taxa
• Increased confidence in homology assessment?
• Computational advances
  – Technology
  – Algorithms and software

Large Data Sets
• Pre-1990: up to ~25 taxa (rarely 100), 1 gene or up to 100 morphological characters
• 1998: 2538 rbcL sequences of green plants; entire mtDNA sequences (15,000 bp) in animals
• 2004: ??

The Large Data Set Headache
• Problem: the large number of sequences, not the size of sequences
• Application of phylogenetic optimality criteria requires evaluation of all possible trees
• Algorithms guaranteed to find optimal solutions have limited applicability

The Problem of Finding Optimal Trees
• There are too many trees to evaluate!
• The number of possible topologies increases very rapidly with the number of taxa/samples
• There are $\dfrac{(2m-5)!}{2^{m-3}\,(m-3)!}$ unrooted trees, where m = number of taxa

Taxa    3   4   5    7     9
Trees   1   3   15   945   135,135
[Chart annotations: stars in the universe; atoms in the universe]
From Hillis et al. (1996), Applications of Molecular Systematics

Heuristic Methods
• Starting trees followed by rearrangement
  – Starting trees sample “tree space”
  – Rearrangements search for local optima
• How to get starting trees?
• How to rearrange trees?
• These methods are prone to getting trapped on local optima

Cutting-edge Methods
• Ratchet
  – Temporarily “warps” the character space
• Annealing algorithms
  – Accept suboptimal trees and gradual movement towards optima
• Genetic algorithms
  – Analogous to evolution by natural selection

Using Phylogenies
What genes are involved in the origin of novel traits (like wings)?
Why are birds so different from their closest relatives?
Is the rate of molecular evolution in birds especially high?
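
As a worked check on the slide "The Problem of Finding Optimal Trees": the number of unrooted, fully bifurcating trees for m taxa is (2m - 5)! / [2^(m-3) (m - 3)!]. The sketch below evaluates that expression with exact integer arithmetic. The function name `num_unrooted_trees` and the extra taxon counts (20 and 50) are illustrative assumptions, not part of the original slides.

```python
from math import factorial


def num_unrooted_trees(m: int) -> int:
    """Count unrooted, fully bifurcating topologies for m labelled taxa.

    Implements (2m - 5)! / (2^(m-3) * (m-3)!), the formula quoted on the slide.
    The function name is hypothetical, chosen for this illustration only.
    """
    if m < 3:
        raise ValueError("the formula applies to m >= 3 taxa")
    # Integer division is exact here because the quotient is always a whole number.
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))


if __name__ == "__main__":
    # The first five values are the taxon counts tabulated on the slide;
    # 20 and 50 are added only to show how quickly the count grows.
    for m in (3, 4, 5, 7, 9, 20, 50):
        print(f"{m:>3} taxa: {num_unrooted_trees(m):,} unrooted trees")
```

For 3, 4, 5, 7, and 9 taxa this gives 1, 3, 15, 945, and 135,135 trees, matching the table on the slide. Exact integers are used rather than floating point because the counts quickly exceed what a float can represent precisely.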