* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download 20 years and 22 papers with Bernard Moret
Minimal genome wikipedia , lookup
Human genome wikipedia , lookup
Microevolution wikipedia , lookup
Metagenomics wikipedia , lookup
Genomic library wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Viral phylodynamics wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene expression programming wikipedia , lookup
Genome editing wikipedia , lookup
Genome evolution wikipedia , lookup
Helitron (biology) wikipedia , lookup
Quantitative comparative linguistics wikipedia , lookup
20 years and 22 papers with Bernard Moret Tandy Warnow The University of Illinois at Urbana-Champaign Brief history • We met in 1992, when I was working in the Discrete Algorithms Group at Sandia National Labs. • I moved to the University of Pennsylvania in 1993, then to the University of Texas at Austin in 1999. No papers yet… we’re just friends. In 1999, Bob Jansen needed help with 13 Campanulaceae genomes In 1999, Bob Jansen needed help with 13 Campanulaceae genomes In 1999, Bob Jansen needed help with 13 Campanulaceae genomes 1999-2006, 22 papers with Bernard • Genome rearrangement phylogeny estimation • Phylogenetic network estimation • Comparing phylogenetic methods on large datasets • Absolute fast converging methods 1999-2006, 22 papers with Bernard • Genome rearrangement phylogeny estimation • Phylogenetic network estimation • Comparing phylogenetic methods on large datasets • Absolute fast converging methods 1999-2006, 22 papers with Bernard • Genome rearrangement phylogeny estimation • Phylogenetic network estimation • Comparing phylogenetic methods on large datasets • Absolute fast converging methods 1999-2006, 22 papers with Bernard • Genome rearrangement phylogeny estimation • Phylogenetic network estimation • Comparing phylogenetic methods on large datasets • Absolute fast converging methods Highlights (this talk) • Genome rearrangement phylogeny estimation • Phylogenetic network estimation • Comparing phylogenetic methods on large datasets • Absolute fast converging methods Highlight #1 • Genome rearrangement phylogeny estimation Too many papers to list In 1999, Bob Jansen needed help with 13 Campanulaceae genomes Genome Rearrangement Phylogeny A A C D X B E Y E Z C F B D W F Breakpoint Phylogeny Proposed by Sankoff and Blanchette in J Comp. Biol. 1998 • Input: Chromosomes given as signed gene orders, one copy of each gene in each chromosome • Output: Tree with the minimum number of breakpoints BPAnalysis: Heuristic for the Breakpoint Phylogeny Sankoff and Blanchette, 1998 BPAnalysis: Heuristic for the Breakpoint Phylogeny Sankoff and Blanchette, 1998 Finding the breakpoint median of three genomes is NP-hard (Pe’er and Shamir 1998), but can be solved using TSP (Travelling Salesman Problem) solvers (Blanchette and Sankoff 1997). BPAnalysis: Heuristic for the Breakpoint Phylogeny Sankoff and Blanchette, 1998 Bernard and I estimated that BPAnalysis would take ~200 CPU years to complete on Bob’s dataset. MPBE • Maximum Parsimony on Binary Encoding – Character for every possible adjacency (is oriented gene x followed by oriented gene y?) • Mary Cosner PhD dissertation 1993 • Fast because maximum parsimony heuristics are relatively efficient • Can find infeasible solutions MPBE on Bob’s dataset suggested MPBE • Maximum Parsimony on Binary Encoding – Character for every possible adjacency (is oriented gene x followed by oriented gene y?) • Mary Cosner PhD dissertation 1993 • Fast because maximum parsimony heuristics are relatively efficient • Can find infeasible solutions MPBE on Bob’s dataset suggested MPBE • Maximum Parsimony on Binary Encoding – Character for every possible adjacency (is oriented gene x followed by oriented gene y?) • Mary Cosner PhD dissertation 1993 • Fast because maximum parsimony heuristics are relatively efficient • Can find infeasible solutions MPBE on Bob’s dataset suggested transpositions – very surprising. MPBE on Bob’s dataset suggested Neighbor Joining on Breakpoint Distances 40 taxa, 120 genes, Inv.:Transp.:InvTrans p=2:1:1 Error in inferred tree NJ(BP) birth-death trees, expected deviation from ultrametricity=2 Amount of evolution NJ(BP): seconds to complete Phylogeny reconstruction in 1999 • Distance-based – Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff 1999] • Breakpoint tree (NP-hard, even for three taxa) – BPAnalysis: [Sankoff & Blanchette 1998]: exhaustive search through treespace to find the minimum breakpoint length (the number of breakpoints on the tree) • MPBE [Cosner 1993]: maximum parsimony on binary encoding Phylogeny reconstruction in 1999 • Distance-based – Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff 1999] – fast but high error • Breakpoint tree (NP-hard, even for three taxa) – BPAnalysis: [Sankoff & Blanchette 1998]: exhaustive search through treespace to find the minimum breakpoint length (the number of breakpoints on the tree) – too slow • MPBE [Cosner 1993]: maximum parsimony on binary encoding: can find infeasible ancestors The challenges! 1. Find all the best genome trees for Bob’s dataset, and determine if inversions suffice (or if we really do need transpositions). 2. Design statistically rigorous methods for genome rearrangement phylogeny. 3. Design efficient techniques to enable genome-scale phylogeny for large datasets that are difficult to analyze. Genomes Evolve by Rearrangements 1 2 3 4 5 6 7 9 10 1 2 3 –8 9 –7 -8 4 –6 –7 5 –5 –6 6 -4 –5 7 –4 8 9 10 • Inversion (Reversal) • Transposition • Inverted Transposition 8 Generalized Nadeau-Taylor (GNT) Model Proposed in Wang and Warnow, STOC 2001. • Each type of event (inversion, transposition, and inverted transposition) has a probability of occurring, and is specified by – GNT(a,b,c): a+b+c=1 • All events of the same type are equiprobable • The tree has branch lengths indicating the expected number of events on each branch. Distance-based methods • Breakpoint distances (Blanchette, Bourque, and Sankoff 1997) • Inversion distances (Bader, Moret, and Yan 2001) • EDE (Empirically-Derived Estimator of true evolutionary distance), Moret et al., ISMB 2001 – derived from inversion-only model • IEBP (Wang and Warnow STOC 2001): estimates true evolutionary distance but needs to know or estimate the GNT parameters. Figure 3 from Moret and Warnow, 2004 40 taxa, 120 genes Inv.:Transp.:InvTransp =2:1:1 Birth-death trees, expected deviation from ultrametricity=2 Amount of evolution BP=breakpoint distance INV=inversion distance EDE: statistically-based estimator [Wang et al. ‘01] - highly robust. All these methods are polynomial time. Moret et al. breakpoint phylogeny approach: 2,000,000-fold speedup over BPAnalysis as serial codes (parallelism brings it higher) From Moret, Tang, and Warnow, 2004 Bounding • Upper bound on the best score: score the NJ(EDE) tree using improved implementation of BPAnalysis. • Lower bound on a given tree T using the circular ordering on leaves: greedy technique to find the planar embedding that achieves the highest breakpoint score for its circular odering – half that is a lower bound on T’s breakpoint score. By the way, Bernard and I had a big argument about using this lower bound… he didn’t think it would work, but it did! Bounding • Upper bound on the best score: score the NJ(EDE) tree using improved implementation of BPAnalysis. • Lower bound on a given tree T using the circular ordering on leaves: greedy technique to find the planar embedding that achieves the highest breakpoint score for its circular ordering – half that is a lower bound on T’s breakpoint score. By the way, Bernard and I had a big argument about using this lower bound… he didn’t think it would work, but it did! GRAPPA http://www.cs.unm.edu/~moret/GRAPPA/ • Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms • Heuristics for NP-hard optimization problems • Uses high-level algorithmic ideas with low-level algorithms engineering to dramatically speed-up the searches for the breakpoint and inversion phylogenies. • 2000-2004 • Project leader: Bernard Moret In 1999, Bob needed help with 13 Campanulaceae genomes The challenges! 1. Find all the best genome trees for Bob’s dataset, and determine if inversions suffice (or if we really do need transpositions). 2. Design statistically rigorous methods for genome rearrangement phylogeny. 3. Design efficient techniques to enable genome-scale phylogeny for large datasets that are difficult to analyze. What we found • Optimal solutions for the breakpoint phylogeny, for the inversion-only phylogeny, and for a weighted sum of inversions and transpositions. • 67 inversions suffices for this dataset, and no transpositions are needed! This was just the beginning… Other events, such as • Duplications, Insertions, and Deletions • Fissions and Fusions Other models • HP (Hannenhali and Pevzner) • Double Cut-and-Join (Yancopoulos et al.) Other techniques • DCM-boosting to scale GRAPPA to large datasets (1000 species) • Estimating true evolutionary distances under these complex models • Inferring ancestral genomes under these complex models Bernard has made progress on all these things!! (But without me, so someone else can talk about this…) This was just the beginning… Other events, such as • Duplications, Insertions, and Deletions • Fissions and Fusions Other models • HP (Hannenhali and Pevzner) • Double Cut-and-Join (Yancopoulos et al.) Other techniques • DCM-boosting to scale GRAPPA to large datasets (1000 species) • Estimating true evolutionary distances under these complex models • Inferring ancestral genomes under these complex models Bernard has made progress on all these things!! (But without me, so someone else can talk about this…) Highlight #2 Absolute fast converging methods • Warnow, Moret and St. John, SODA 2001 • Nakhleh, Moret, etc. PSB 2002 • Moret, Roshan, and Warnow, WABI 2002 • Moret, Wang, and Warnow 2002 (IEEE Computer) Markov Model of Site Evolution Simplest (Jukes-Cantor, 1969): • The model tree T is binary and has substitution probabilities p(e) on each edge e. • The state at the root is randomly drawn from {A,C,T,G} (nucleotides) • If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. • The evolutionary process is Markovian. The different sites are assumed to evolve independently and identically down the tree (with rates that are drawn from a gamma distribution). More complex models (such as the General Markov model) are also considered, often with little change to the theory. Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP Statistical consistency error Data Distance-based estimation Neighbor Joining (NJ) on large trees Error Rate 0.8 Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. NJ 0.6 0.4 0.2 [Nakhleh et al. ISMB 2001] 0 0 400 800 No. Taxa 1200 1600 In other words… error Data Statistical consistency doesn’t guarantee accuracy w.h.p. unless the sequences are long enough. Sequence length requirements The sequence length (number of sites) that a phylogeny reconstruction method M needs to reconstruct the true tree with probability at least 1- depends on • M (the method) • • f = min p(e), • g = max p(e), and • n, the number of leaves We fix everything but n. Neighbor Joining’s sequence length requirement is exponential! • Atteson 1999: Let T be a Jukes-Cantor model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length O(lg n emax Dij). • Lacey and Chang 2009: Matching lower bound Statistical consistency, exponential convergence, and absolute fast convergence (afc) “Boosting” phylogeny reconstruction methods • DCMs “boost” the performance of phylogeny reconstruction methods. Base method M DCM DCM-M Distance-based estimation Divide-and-conquer for phylogeny estimation Divide-and-conquer for phylogeny estimation Construct subset trees Supertree Step Refinement Step DCM1 Decompositions Input: Set S of sequences, distance matrix d, threshold value q Î{dij} 1. Compute threshold graph Gq = (V , E),V = S , E = {(i, j ) : d (i, j ) £ q} 2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated). DCM1 decomposition : Compute maximal cliques DCM1-boosting: Warnow, St. John, and Moret, SODA 2001 Exponentially converging (base) method DCM1 SQS Absolute fast converging (DCM1-boosted) method • The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree. • For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa. Neighbor Joining on large diameter trees Error Rate 0.8 Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. NJ 0.6 0.4 0.2 [Nakhleh et al. ISMB 2001] 0 0 400 800 No. Taxa 1200 1600 DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] 0.8 NJ Error Rate DCM1-NJ 0.6 0.4 0.2 0 0 400 800 No. Taxa 1200 • Theorem (Warnow, Moret, and St. John, SODA 2001): • DCM1-NJ converges to the true tree from polynomial length sequences 1600 DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] 0.8 NJ Error Rate DCM1-NJ 0.6 0.4 0.2 0 0 400 800 No. Taxa 1200 • Theorem (Warnow, Moret, and St. John, SODA 2001): • DCM1-boosting: reducing sequence length requirements for gene tree accuracy w.h.p. from exponential to polynomial 1600 Highlight #3 Building the Tree of Life: A National Resource for Phyloinformatics and Computational Phylogenetics NSF Large ITR grant ($11.6 Million), 2003-2010 Directors: Bernard Moret (2003-2006) and Tandy Warnow (2006-2010) Other key personnel: Mark Holder, Junhyong Kim, Wayne Maddison, Mark Miller, Brent Mishler, Satish Rao, David Swofford, and Val Tannen. CIPRES: graduate students and postdocs Francois Barbancon, Nicholas Bray, Kevin Chen, Shirley Cohen, Costis Daskalakis, Nick Eriksson, Yu Fan, Kirsen Fisher, Ganesh Ganapathy, Sheng Guo, Tracy Heath, Cameron Hill, David Kysela, Ruth Kirkpatrick, Henry Lin, Kevin Lin, Wenguo Lin, Andrew McGregor, Frank Mannino, Rui Mao, Radu Mihaescu, Eric Miller, Luay Nakhleh, Manikandan Narayanan, Serita Nelesen, Smriti Ramakrishinan, Samantha Riesenfeld, Sebastien Roch, Usman Roshan, Ariel Schwartz, Stephen Smith, Errol Strain, Jeet Sukumaran, Shel Swenson, Kunal Talwar, Andres Varon, Rutger Vos, Yifeng Zheng, and Derrick Zwickl Michael Alfaro Mark Holder Peter Midford Sagi Snir Shel Swenson Rutger Vos CIPRES: graduate students and postdocs Francois Barbancon, Nicholas Bray, Kevin Chen, Shirley Cohen, Costis Daskalakis, Nick Eriksson, Yu Fan, Kirsen Fisher, Ganesh Ganapathy, Sheng Guo, Tracy Heath, Cameron Hill, David Kysela, Ruth Kirkpatrick, Henry Lin, Kevin Lin, Wenguo Lin, Andrew McGregor, Frank Mannino, Rui Mao, Radu Mihaescu, Eric Miller, Luay Nakhleh, Manikandan Narayanan, Serita Nelesen, Smriti Ramakrishinan, Samantha Riesenfeld, Sebastien Roch, Usman Roshan, Ariel Schwartz, Stephen Smith, Errol Strain, Jeet Sukumaran, Shel Swenson, Kunal Talwar, Andres Varon, Rutger Vos, Yifeng Zheng, and Derrick Zwickl Michael Alfaro Mark Holder Peter Midford Sagi Snir Shel Swenson Rutger Vos And many others who were funded by other grants 225 papers and dissertations were supported by CIPRES, many by CIPRES students and postdocs The CIPRES Gateway SDSC Supercomputers, CIPRES Gateway Help Define New “Tree of Life” SDSC Press Release, April 25, 2016 (Figure courtesy of Laura Hug, Jill Banfield, and Nature Microbiology) Bernard isn’t really retiring • He’s just going to be working from a warmer place… • We’ll keep him busy with collaborations via internet and phone… • And if we have to, we’ll go visit him in his new home. Bernard isn’t really retiring • He’s just going to be working from a warmer place… • We’ll keep him busy with collaborations via internet and phone… • And if we have to, we’ll go visit him in his new home. Bernard isn’t really retiring • He’s just going to be working from a warmer place… • We’ll keep him busy with collaborations via internet and phone… • And if we have to, we’ll go visit him in his new home. Bernard isn’t really retiring • He’s just going to be working from a warmer place… • We’ll keep him busy with collaborations via internet and phone… • And if we have to, we’ll go visit him in his new home. Bon Voyage, Bernard!