ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon

Traditional methods for building phylogeny
Requirements:
• High coverage
• Assembly
• Detection of putative orthologous genes
• Alignment
• Phylogeny from tiny portion of the whole genome
• Genome scale multi-sequence alignment is difficult

Alignment-free methods for building phylogeny
• Typically from assembled genomes
• De novo assembly with short reads?
• Mainly on closely related prokaryotic genomes
• No confidence assessment (e.g. bootstrapping)

Overview
• Assembly and Alignment-Free method (AAF)
• Calculate phylogenetic distances using whole genome short read sequencing data
• Method validation
• Genome complexity
• Different genome sizes
• Sequencing errors
• Range of sequencing coverage
• 12 mammal species
• 21 tropical tree species
• Comparison with andi

AAF method
• Calculate pairwise genetic distances between samples from the number of evolutionary changes between their genomes, represented by the number of k-mers that differ between genomes.
• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix

AAF method - Evolutionary model
• The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd), where d is the evolutionary distance per nucleotide.
• If only substitutions occur and all k-mers are unique, then all species have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt, where ns is the number of k-mers shared by the two species.
• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt
• Insertion: loss of (k − 1) or gain of (l + k − 1) k-mers
• Deletion: loss of (l + k − 1) or gain of (k − 1) k-mers
• Greater effect

K-mer sensitivity and homoplasy
• No assembly -> not all indels identified
• If k-mer covers multiple substitutions
• Shorter k-mers -> better sensitivity
• Shorter k-mers -> same k-mers from evolutionarily different regions
• Homoplasy

K-mer homoplasy
• k = 15
• Genome size > 5 × 10^8 => same k-mers occur randomly in other species
• May incorrectly inflate the proportion of shared k-mers
• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size

ph
• Prediction of the ratio ns/nt
• Large genomes and small k: ph = 1, i.e. all possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
• GC content
• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

Mathematical prediction
• Random ancestral sequence
• Real (non-random) sequence

Assembly-free
• Sampling error caused by low genome coverage
• The actual number of k-mers will be under-represented given low sequencing coverage
• Sequencing errors
• Loss of true k-mers and the gain of false k-mers
• Filtering = remove singletons

Seq errors
• p = observed/true => tip corrections
• Coverage of 5-8 is sufficient to observe all true k-mers when filtering
• Filter only singletons?
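
The evolutionary model and the singleton filter above can be sketched in a few lines of Python. This is only an illustrative sketch, not the AAF implementation: the function names are hypothetical, reverse-complement handling is omitted, and taking nt as the smaller of the two k-mer totals (for genomes whose totals differ) is an assumption made here. The distance follows from solving exp(−kd) = ns/nt for d.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filtered_kmer_set(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times (min_count=2 drops singletons)."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_count}


def kmer_distance(kmers_a, kmers_b, k):
    """Estimate d from exp(-k*d) = ns/nt, i.e. d = -(1/k) * ln(ns/nt).

    Assumption: nt is the smaller of the two k-mer totals.
    """
    ns = len(kmers_a & kmers_b)           # shared k-mers
    nt = min(len(kmers_a), len(kmers_b))  # total k-mers (assumed: smaller set)
    if ns == 0:
        return math.inf                   # nothing shared: distance undefined
    return -math.log(ns / nt) / k
```

For example, two samples that share half of their 4-mers give d = ln(2)/4, and the distance grows as the shared fraction shrinks. With low coverage, a stricter `min_count` would discard true k-mers as well as errors, which is the trade-off behind the "Filter only singletons?" question.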
Bootstrapping
Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap” – take rows with probability 1/k
OR
Two-stage parametric bootstrap
• Estimate the variances in distances between species caused by sampling and evolutionary variation
• Independent of genome size

[Images: Bushbaby (galago); Tarsier]

[Figure: recently published phylogeny of primates. Panels: assembled genomes, k=19; assembled genomes, k=21; simulated reads]

Real data – tropical trees
[Image: Intsia palembanica]

Advantages
• Low coverage requirements
• Low computational demands
• 12 primates: 25 GB RAM, 12 threads

Limitations
• Loss of k-mer sensitivity
• Deep nodes
• Location of mutations

Distance computing for 73 Escherichia strains
• AAF
• 32 + 76 min = 1 h 48 min
• andi
• 21 min

[Figure panels: AAF; andi]
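
The “block bootstrap” from the Bootstrapping slide can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes that "rows" means rows of a shared k-mer presence table (one row per distinct k-mer, one flag per sample), that each row is kept independently with probability 1/k, and that nt is the smaller of the two k-mer totals. The function names are invented for this example.

```python
import math
import random


def distance(rows, k):
    """Distance from presence flags: rows is a list of (in_a, in_b) tuples."""
    n_a = sum(1 for in_a, _ in rows if in_a)
    n_b = sum(1 for _, in_b in rows if in_b)
    ns = sum(1 for in_a, in_b in rows if in_a and in_b)  # shared k-mers
    nt = min(n_a, n_b)                                   # assumed: smaller total
    if ns == 0 or nt == 0:
        return math.inf
    return -math.log(ns / nt) / k


def block_bootstrap_distances(rows, k, replicates, seed=0):
    """Recompute the distance on row subsets, keeping each row with prob. 1/k."""
    rng = random.Random(seed)  # seeded so replicates are reproducible
    results = []
    for _ in range(replicates):
        sample = [row for row in rows if rng.random() < 1.0 / k]
        results.append(distance(sample, k))
    return results
```

The spread of the replicate distances gives a rough measure of sampling variation. Keeping rows with probability 1/k is one way to account for the fact that each nucleotide falls in up to k overlapping k-mers, so neighbouring rows are not independent.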