Chalk Talk
Tandy Warnow
Departments of Computer Science and
Bioengineering
University of Illinois at Urbana-Champaign
The Tree of Life: Multiple Challenges
Large datasets:
100,000+ sequences
10,000+ genes
“BigData” complexity
• Large-scale statistical phylogeny estimation
• Ultra-large multiple-sequence alignment
• Estimating species trees from incongruent gene trees
• Supertree estimation
• Genome rearrangement phylogeny
• Reticulate evolution
• Visualization of large trees and alignments
• Data mining techniques to explore multiple optima
Application areas:
• metagenomics
• protein structure and function prediction
• trait evolution
• detection of co-evolution
• systems biology
Techniques:
• Graph theory (especially chordal graphs)
• Probability theory and statistics
• Hidden Markov models
• Combinatorial optimization
• Heuristics
• Supercomputing
Overview
• Theory: combining probability theory, graph theory,
and optimization
• Simulations: evaluating methods under stochastic
models of sequence evolution
• Biological data analysis: refining methods and enabling
discovery
• Open source software development
• High performance computing
• Applications outside biology (e.g., historical linguistics,
big data problems in general)
Past Work (highlights)
• Gene tree estimation (theoretical results under
stochastic models of sequence evolution)
• Multiple sequence alignment on large datasets,
and co-estimation of alignments and trees
• Phylogenetic networks and species trees from
multi-locus datasets
• Genome rearrangement phylogeny
• Supertree methods
• Metagenomics
• Historical linguistics
Future work
Theory, methods, and empirical studies for
• Genome-scale phylogeny estimation
addressing multiple sources for gene tree
heterogeneity
• Microbiome analysis
• Ultra-large multiple sequence alignment and
tree estimation
And applications of these techniques outside
biology
Current NSF grants
• Graph-theoretic methods to improve
phylogenomic analyses (joint with Chandra
Chekuri and Satish Rao) – NSF CCF-1535977
• Multiple Sequence Alignment: NSF ABI1458652
• Metagenomics: joint with Mihai Pop and Bill
Gropp. NSF grant III:AF:1513629
Major Areas
• Phylogenomics: Species tree and network estimation using
whole genomes (and gene tree estimation in the context of
whole genomes)
• Multiple Sequence Alignment: Inferring relationships
between letters in molecular sequences, especially on very
large datasets (up to 1,000,000 sequences)
• Metagenomics: Analysis of molecular sequences obtained
from environmental samples (joint with Mihai Pop and Bill
Gropp)
• Scaling computationally intensive methods to large
datasets: Combining discrete math and statistical methods
to enable highly accurate analysis of ultra-large datasets
(joint with Chandra Chekuri and Satish Rao)
Phylogenomics = Species trees from whole genomes
“Nothing in biology makes sense except in the light of evolution” - Dobzhansky
[Figure: phylogenomics. Sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000 across Orangutan, Chimpanzee, Gorilla, and Human.]
“gene” here refers to a portion of the genome (not a functional gene)
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
Gene tree discordance
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
[Figure: two discordant gene trees. Gene 1 leaf order: Gorilla, Human, Chimp, Orang. Gene 1000 leaf order: Gorilla, Chimp, Human, Orang.]
Incomplete Lineage Sorting (ILS)
• Confounds phylogenetic analysis for many
groups: Hominids, Birds, Yeast, Animals,
Toads, Fish, Fungi, etc.
• There is substantial debate about how to
analyze phylogenomic datasets in the
presence of ILS, focused around statistical
consistency guarantees (theory) and
performance on data.
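Main competing approaches
[Figure: species-by-gene sequence data for gene 1, gene 2, ..., gene k. Concatenation combines all genes into one analysis; the alternative is to analyze each gene separately and combine the results with a Summary Method.]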
Statistical Consistency
[Figure: error as a function of the amount of data]
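The error-versus-data picture corresponds to the usual limit definition of statistical consistency. In the sketch below, $\hat{T}_k$ (the species tree estimated from $k$ genes) and $T^*$ (the true species tree) are notation introduced here for illustration and do not appear on the slides; the probability is taken under the assumed model (here, the multispecies coalescent).

```latex
\lim_{k \to \infty} \Pr\left[\hat{T}_k = T^*\right] = 1
```

In words: as the number of genes grows, the probability that the method returns the true species tree tends to 1.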
Maximum Quartet Support Species Tree
[Mirarab, et al., ECCB, 2014]
• Optimization Problem (NP-Hard):
Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees
$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$
where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees
• Theorem: Statistically consistent under the multispecies coalescent model when solved exactly
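The quartet support score on this slide can be computed by brute force on small examples. The following is a minimal sketch and not ASTRAL's implementation: it assumes trees are written as nested tuples of taxon names (an encoding chosen here for illustration), that all trees share the same taxon set, and it enumerates every 4-taxon subset, so it only scales to a handful of taxa. The function names and the example trees are hypothetical.

```python
from itertools import combinations


def clades(tree):
    """Return the leaf set below every internal node of a nested-tuple tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    visit(tree)
    return found


def quartets(tree):
    """Set Q(tree): the resolved quartet trees ab|cd induced by the tree."""
    cl = clades(tree)
    taxa = max(cl, key=len)  # the root clade contains every taxon
    induced = set()
    for four in combinations(sorted(taxa), 4):
        four = frozenset(four)
        for clade in cl:
            side = clade & four
            if len(side) == 2:  # this clade separates two taxa from the other two
                induced.add(frozenset([side, four - side]))
                break
    return induced


def quartet_score(species_tree, gene_trees):
    """Score(T) = sum over gene trees t of |Q(T) & Q(t)|."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(t)) for t in gene_trees)


# Hypothetical example: one candidate species tree and two gene trees on 5 taxa.
species_tree = (((("Human", "Chimp"), "Gorilla"), "Orang"), "Gibbon")
gene_trees = [
    (((("Human", "Chimp"), "Gorilla"), "Orang"), "Gibbon"),
    (((("Human", "Gorilla"), "Chimp"), "Orang"), "Gibbon"),
]
print(quartet_score(species_tree, gene_trees))
```

The first gene tree matches the candidate on all 5 quartets and the second on 3 of them, so the total score for this toy input is 8.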
Constrained MQST
(Maximum Quartet Support Tree)
• Input: Set $\mathcal{T}$ = {t1,t2,…,tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions
• Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.
Theorems (Mirarab et al., 2014):
• If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
• The constrained MQST problem can be solved in O(|X|^2 nk) time. (We use dynamic programming, and build the unrooted tree from the bottom up, based on “allowed clades”, the halves of the allowed bipartitions.)
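The simplest constraint set mentioned in the first theorem is the set of bipartitions that appear in the input gene trees. The sketch below collects that set; it is an illustration under assumptions, not ASTRAL's code. Trees are again nested tuples of taxon names (a hypothetical encoding), and only nontrivial bipartitions (at least two taxa on each side) are kept.

```python
def leaf_sets(tree, out):
    """Append the leaf set below each internal node to `out`; return all leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    below = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(below)
    return below


def bipartitions(tree):
    """Nontrivial bipartitions of a tree, each stored as frozenset({side, other})."""
    clade_list = []
    taxa = leaf_sets(tree, clade_list)
    splits = set()
    for clade in clade_list:
        rest = taxa - clade
        if len(clade) >= 2 and len(rest) >= 2:
            splits.add(frozenset([clade, rest]))
    return splits


def allowed_bipartitions(gene_trees):
    """Constraint set X: the union of the bipartitions of all input gene trees."""
    x = set()
    for t in gene_trees:
        x |= bipartitions(t)
    return x


# Hypothetical input: two gene trees on the same five taxa.
gene_trees = [
    (((("Human", "Chimp"), "Gorilla"), "Orang"), "Gibbon"),
    (((("Human", "Gorilla"), "Chimp"), "Orang"), "Gibbon"),
]
X = allowed_bipartitions(gene_trees)
print(len(X))
```

Each binary gene tree on 5 taxa contributes 2 nontrivial bipartitions; the two trees share one of them, so X has 3 elements here.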
ASTRAL is fairly robust to HGT + ILS
200 Estimated Gene Trees
Data: fixed, moderate ILS rate; 50 replicates for each of HGT rates (1)-(6); 1 model species tree per replicate on 51 taxa; 1000 true gene trees; 1000 bp gene sequences simulated using INDELible [8]; 1000 gene trees estimated from GTR-simulated sequences using FastTree-2 [7]
[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009
Davidson et al., RECOMB-CG, BMC Genomics 2015
Contributions (sample)
Methods for estimating species trees from genome-scale data:
• ASTRAL (Mirarab et al., Bioinformatics 2014, 2015) and ASTRID (Vachaspati and Warnow, BMC Genomics 2015): polynomial time methods that are statistically consistent under the MSC. Both can analyze very large datasets (1000 species and 1000 genes, or more) with high accuracy.
• Statistical binning (Mirarab et al., Science 2014; Bayzid et al., PLOS One 2015) can reduce gene tree estimation error and lead to improved species tree estimates (topology, branch lengths, and incidence of false positives)
• BBCA (Zimmermann et al., BMC Genomics 2014) enables Bayesian co-estimation methods to scale to large numbers of genes
• DCM-boosting (Bayzid et al., BMC Genomics 2014) enables computationally intensive methods to scale to large numbers of species
Mathematical theory:
• Roch and Warnow (Systematic Biology 2015): statistical consistency under the MSC given finite-length sequences
• Uricchio et al. (BMC Bioinformatics 2016): number of loci needed to recover all the splits with high probability
Biological data analyses:
• Avian phylogenomics project (Jarvis, Mirarab et al., Science 2014)
• Thousand Plant Transcriptome Project (Wickett, Mirarab et al., PNAS 2014)
• Mammalian phylogeny (Tarver et al., Genome Biology and Evolution 2016)
Metagenomic taxonomic identification and phylogenetic
profiling
Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million
New Genes in Ocean Microbes
Basic Questions
1. What is this fragment? (Classify each fragment
as well as possible.)
2. What is the taxonomic distribution in the
dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic
sample doing together?
This talk
• SEPP (PSB 2012): SATé-enabled Phylogenetic
Placement, and Ensembles of HMMs (eHMMs)
• Applications of the eHMM technique to
metagenomic abundance classification (TIPP,
Bioinformatics 2014)
Phylogenetic Placement
Input: Backbone alignment and tree on full-length
sequences, and a set of homologous query
sequences (e.g., reads in a metagenomic sample
for the same gene)
Output: Placement of query sequences on backbone
tree
Phylogenetic placement can be used inside a
pipeline, after determining the genes for each of
the reads in the metagenomic sample.
Marker-based Taxon Identification
Fragmentary sequences from some gene:
ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT
Full-length sequences for same gene, and an alignment and a tree:
AGG...GCAT, TAGC...CCA, TAGA...CTT, AGC...ACA, ACT..TAGA..A
Align Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
[Figure: backbone tree with leaves S1, S4, S2, S3]
Align Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
[Figure: backbone tree with leaves S1, S4, S2, S3]
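The alignment step on this slide has two properties that are easy to check mechanically: the aligned query must have the same width as the backbone alignment, and removing its gaps must give back the original query. The sketch below checks both for the slide's example; the function name `is_valid_extension` is made up for illustration and is not part of SEPP, HMMER, or PaPaRa.

```python
# Backbone alignment and query copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
query = "TAAAAC"
aligned_query = "-------T-A--AAAC--------"


def is_valid_extension(backbone, query, aligned_query):
    """True if the aligned query fits the backbone and preserves the query residues."""
    widths = {len(row) for row in backbone.values()}
    same_width = widths == {len(aligned_query)}
    same_residues = aligned_query.replace("-", "") == query
    return same_width and same_residues


print(is_valid_extension(backbone, query, aligned_query))
```

This only validates the shape of an extended alignment; it says nothing about whether the chosen columns are the right ones, which is what the alignment methods themselves decide.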
Place Sequence
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
[Figure: backbone tree with Q1 added; leaves S1, S4, S2, Q1, S3]
Phylogenetic Placement
• Align each query sequence to backbone alignment
– HMMALIGN (Eddy, Bioinformatics 1998)
– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
• Place each query sequence into backbone tree
– pplacer (Matsen et al., BMC Bioinformatics, 2011)
– EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA use maximum likelihood
HMMER vs. PaPaRa Alignments
[Figure: x-axis shows increasing rate of evolution]
One Hidden Markov Model for the entire alignment?
Or 2 HMMs? (HMM 1, HMM 2)
Or 4 HMMs? (HMM 1, HMM 2, HMM 3, HMM 4)
SEPP Parameter Exploration
• Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP
• 10% rule (subset sizes 10% of backbone) had best overall performance
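The slides do not spell out how the backbone is divided into subsets. The sketch below assumes a centroid-edge decomposition (repeatedly cutting the edge that splits the leaves most evenly until every subset is small enough), applied to a tree written as nested tuples of sequence names. The function names, the tree encoding, and the rounding used for the 10% rule are all assumptions made for illustration; this is not SEPP's API.

```python
def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def restrict(tree, keep):
    """Prune the tree to the leaves in `keep`, suppressing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def subtrees(tree):
    """Yield every proper subtree; each one corresponds to one edge of the tree."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield child
        yield from subtrees(child)


def decompose(tree, max_size):
    """Split the leaf set into subsets of at most `max_size` leaves."""
    taxa = leaves(tree)
    if len(taxa) <= max_size:
        return [sorted(taxa)]
    # Centroid edge: the edge whose removal gives the most balanced split.
    best = min(subtrees(tree), key=lambda s: abs(len(taxa) - 2 * len(leaves(s))))
    outside = set(taxa) - set(leaves(best))
    return decompose(best, max_size) + decompose(restrict(tree, outside), max_size)


def ten_percent_rule(backbone_size):
    """Subset size equal to 10% of the backbone, rounded up (rounding is an assumption)."""
    return max(1, (backbone_size + 9) // 10)


# Hypothetical 8-leaf backbone tree.
backbone_tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
print(decompose(backbone_tree, 2))
```

In an ensemble-of-HMMs setting, one profile HMM would then be built on the backbone alignment restricted to each subset; that step is not shown here.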
SEPP (10%-rule) on simulated data
[Figure: x-axis shows increasing rate of evolution]
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014),
a marker-based method that only characterizes those reads that map
to MetaPhyler’s marker genes
TIPP pipeline
1. Uses BLAST to assign reads to marker genes
2. Computes UPP/PASTA reference alignments
3. Uses reference taxonomies, refined to binary trees using reference
alignment
4. Modifies SEPP by considering statistical uncertainty in the
extended alignment and placement within the tree
Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the
sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
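Once each read has been assigned a label at the chosen taxonomic level, the profile is just the normalized label counts. The sketch below shows that final step only, on made-up read labels that reproduce the example distribution; it does not model how TIPP or any other profiler classifies reads, and the function name is hypothetical.

```python
from collections import Counter


def abundance_profile(read_labels):
    """Percentage of classified reads assigned to each taxon, most abundant first."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.most_common()}


# Made-up classifications for 100 reads, matching the example distribution.
read_labels = (
    ["species A"] * 50
    + ["species B"] * 20
    + ["species C"] * 15
    + ["species D"] * 14
    + ["species E"] * 1
)
for taxon, percent in abundance_profile(read_labels).items():
    print(f"{percent:g}% {taxon}")
```

Reads that cannot be classified at the requested level would need a policy of their own (dropped, or counted as unclassified); this sketch assumes every read has a label.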
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one
of the high indel long sequence datasets, and mOTU terminates with an error message
on all the high indel datasets.
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment
datasets and high indel datasets.
TIPP vs. other abundance profilers
• TIPP is highly accurate, even in the presence of high
indel rates and novel genomes, and for both short and
long reads.
• All other methods have some vulnerability (e.g., mOTU
is only accurate for short reads and is impacted by high
indel rates).
• Improved accuracy is due to the use of eHMMs; single
HMMs do not provide the same advantages, especially
in the presence of high indel rates.
SEPP and eHMMs
An ensemble of HMMs provides a better model of
a multiple sequence alignment than a single
HMM, and is better able to
• detect homology between full length sequences
and fragmentary sequences
• add fragmentary sequences into an existing
alignment
especially when there are many indels and/or
substitutions.
Our Publications using eHMMs
• S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
• N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
• N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny-aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124.
• N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17(Suppl 10):765.
All code is available in open source form at
https://github.com/smirarab/sepp
Overview
• Theory: combining probability theory, graph theory,
and optimization
• Simulations: evaluating methods under stochastic
models of sequence evolution
• Biological data analysis: refining methods and enabling
discovery
• Open source software development
• High performance computing
• Applications outside biology (e.g., historical linguistics,
big data problems in general)
Computational Phylogenomics
NP-hard problems
Large datasets
Complex statistical estimation problems
Metagenomics
Protein structure and function prediction
Medical forensics
Systems biology
Population genetics
Future Work - Phylogenomics
• Better theory, addressing impact of gene tree
estimation error and missing data
• Fast genome-scale phylogenetic tree estimation (high
performance computing, statistically-based estimation
taking multiple sources of discord into account)
• Phylogenetic network construction on large datasets
(statistical methods within divide-and-conquer
framework)
• Better statistical models of sequence evolution,
addressing heterotachy
• Co-estimation of gene trees and species
trees/networks
Future work - Metagenomics
• Improved marker-based analyses, and addressing
gene tree heterogeneity
• Rigorous methods for detecting novel genes and
species
• High throughput analysis with high sensitivity
• Metagenome assembly
• HPC implementations
• Collaborations with biologists and biomedical
researchers
Future work - Multiple Sequence Alignment
• Improved large-scale MSA (e.g., PASTA and UPP)
• Extending statistical co-estimation of trees and
MSA to large datasets (e.g., Nute and Warnow
2016)
• Efficient and useful sampling of MSAs
• MSA estimation in the presence of duplications
and rearrangements (e.g., whole genome
alignment)
• Better HMM+phylogeny models that are useful
for estimating alignments and trees
Future work - Theory
• Basic algorithmic challenges:
– supertrees
– computing trees from distance matrices
– using chordal graphs for divide-and-conquer
– consensus trees
• Applied probability:
– Trade-off between data quality and quantity (e.g.,
statistical binning)
– Identifiability of tree models with noisy data
– Understanding ensembles of HMMs