DNA Assembly with Gaps: Simulating Sequence Evolution
Reed A. Cartwright
Department of Genetics
University of Georgia
3.12.2005 RA Cartwright [email protected] - http://scit.us/

Synopsis
Explain the importance of simulations.
Introduce Dawg, a new sequence simulation program.
Example usage of Dawg.

Why Simulate Phylogenies?
Biologists use many techniques to reconstruct phylogenies based on biological data.
However, true phylogenies are unknown, except for a few instances.
How then can we test the accuracy of these reconstruction methods?
Use simulations.

Why Simulate Phylogenies?
Techniques are often based on certain models of evolution.
Simulating sequence evolution based on these models produces an ideal situation to test the techniques.
Using other models can test how robust a technique is.

Testing Procedure
1. Start with a “known” tree.
2. Simulate sequence sets based on the tree.
3. Estimate the trees of the simulated data.
4. Compare estimated trees to the original tree.

[Figure: trees with tips A, B, C, D and two example simulated sequence sets]

A  AATTCTTTGAGTTAA
B  AATTCTTTGAGTTAA
C  AATTCTTAAAGTTAA
D  AATTCTTAAAGTTAA

A  AAAAGATAAAGCAAA--A
B  GAAAGATAAAGCAAA--A
C  GAAAGATAAAGAAAAACA
D  GAAAGATAAAGAAAAACA

Simulating Evolution
Proper simulation of molecular evolution should include both substitutions and indels.
However, existing programs either do not include indels or use an unjustified model of indel formation.
Dawg was created to address this gap.

What is Dawg?
Dawg stands for “DNA Assembly with Gaps.”
A portable and robust program for simulating molecular evolution.
Development Website: http://scit.us/dawg/

Comparing Software

Feature              Seq-Gen   Evolver   Rose   Dawg
Substitution         GTR       GTR       PAM    GTR
Rate Heterogeneity   Γ+I       Γ         Γ+I    Γ+I
Input Format         Switch    File      File   File
Unix                 Yes       Yes       Yes    Yes
Mac OS X             Yes       Yes       Yes    Yes

[The table also has rows for Indels, Indel Parameter Estimator, Recombination, and Win32; their "Yes" entries cannot be assigned to columns from the extracted text.]

Parameters

Tree            phylogeny
TreeScale       coefficient to scale branch lengths by
Sequence        root sequences
Length          length of generated root sequences
Rates           rate of evolution of each root nucleotide
Model           model of evolution: GTR|JC|K2P|K3P|HKY|F81|F84|TN
Freqs           nucleotide (ACGT) frequencies
Params          parameters for the model of evolution
Width           block width for indels and recombination
Scale           block position scales
Gamma           coefficients of variance for rate heterogeneity
Alpha           shape parameters
Iota            proportions of invariant sites
GapModel        models of indel formation: NB|PL|US
Lambda          rates of indel formation
GapParams       parameter for the indel model
Reps            number of data sets to output
File            output file
Format          output format: Fasta|Nexus|Phylip|Clustal
GapSingleChar   output gaps as a single character
GapPlus         distinguish insertions from deletions in alignment
LowerCase       output sequences in lowercase
Translate       translate output sequences to amino acids
NexusCode       text or file to include between datasets in Nexus format
Seed            PRNG seed (integers)

Sample Input File

    # example.dawg
    Tree = ((AY727331:0.001359,AY727330:0.001359):0.084512, (AY727327:0.006116,AY727326:0.006116):0.079756);
    Model = "GTR"
    Params = {1.08031, 2.45581, 0.44452, 1.09145, 4.06519, 1.00000}
    Freqs = {0.353470, 0.143681, 0.178206, 0.324643}
    Length = 300
    Lambda = 0.143120
    GapModel = "NB"
    GapParams = {1, 0.753247}
    Format = "Clustal"
    File = "example.aln"
    Seed = 1981

    CLUSTAL multiple sequence alignment (Created by DAWG Version 1.0.0)

    AY727326    TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAAAAAGATAAAGCAAA--A
    AY727327    TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAGAAAGATAAAGCAAA--A
    AY727330    TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA
    AY727331    TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA

    AY727326    ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT
    AY727327    ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT
    AY727330    GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT
    AY727331    GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT

    AY727326    TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTAACATCATGTTGTATTTAGAT
    AY727327    TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTACCATCATGTTGTATTTAGAT
    AY727330    TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT
    AY727331    TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT

    AY727326    GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGCATTTAAGAAGTACAT
    AY727327    GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGTATTTAAGAAGTACAT
    AY727330    GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT
    AY727331    GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT

    AY727326    GTTTAATAGTGTTTAAAA-TATATATGAAATTGATCATAAGGA---TCTATAAATGCGGT
    AY727327    GTTTAATAGTGTTTATAA-TATATATGAAATTGATCGTAAGGA---TCTATAAATGCAGT
    AY727330    GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC
    AY727331    GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC

    AY727326    TCTTCAATTTCTTG
    AY727327    TCTTCAATTTCTTG
    AY727330    TCTTCAATTTCCTA
    AY727331    TCTTCAATTTCCTA

Estimating Indel Rate
Dawg would be of little benefit if biologists could not estimate parameters of indel formation from real data.
Dawg’s indel model allows such estimation, which is implemented in a Perl script, lambda.pl.
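
To make the idea of reading indel information off an alignment concrete, here is a minimal Python sketch that counts gap runs and mismatched columns between two aligned sequences. It is only an illustration: the slides do not describe how lambda.pl computes its estimates, so the function name `count_events` and the crude "gap runs versus mismatches" tally are assumptions, not the script's method. The two sequences are the second alignment block for AY727326 and AY727330 from the sample output.

```python
def count_events(seq_a: str, seq_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (gap_runs, mismatches) for two aligned sequences.

    A gap run is a maximal stretch of columns gapped in the same sequence;
    it is a rough stand-in for one indel event. This is NOT lambda.pl's
    estimator, only a hypothetical tally for illustration.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    gap_runs = 0
    mismatches = 0
    prev_gapped = None  # which sequence was gapped in the previous column

    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            # Gapped in both sequences: carries no pairwise information.
            continue
        if a == gap or b == gap:
            gapped = "a" if a == gap else "b"
            if gapped != prev_gapped:
                gap_runs += 1  # a new run starts here
            prev_gapped = gapped
        else:
            prev_gapped = None
            if a != b:
                mismatches += 1
    return gap_runs, mismatches


# Second block of the sample Clustal output (AY727326 vs AY727330).
AY727326 = "ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT"
AY727330 = "GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT"

if __name__ == "__main__":
    runs, subs = count_events(AY727326, AY727330)
    print(f"gap runs: {runs}, mismatched columns: {subs}")
```

A raw ratio of gap runs to mismatches ignores multiple hits and branch lengths, so it should not be read as an estimate of Lambda.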

Example Usage: Confidence Interval of Indel Rate
I aligned the sequences of chloroplast trnK introns from two Hibiscus and two Prunus species.
Using PAUP*, I estimated the phylogeny and substitution parameters.
Using lambda.pl, I estimated the indel formation parameters.

Example Usage
From these estimated parameters of evolution, I constructed an input file for Dawg.
From the input file, Dawg produced a thousand simulated sequence sets.
The rate of indel formation was estimated for each of the simulated sequence sets.

Results
The estimated rate of indel formation was 0.143120.
Bootstrapping gave a 95% CI of 0.078530 to 0.213560.
Biologically, this is 8 to 21 indels per 100 substitutions.

Synopsis
Explain the importance of simulations.
Introduce Dawg, a new sequence simulation program.
Example usage of Dawg.

Thanks
Marjorie Asmussen
Wyatt Anderson
John Avise
Jim Hamrick
Ron Pulliam
Paul Schliekelman
Jeff Ross-Ibarra
Beth Dakin
Douglas Theobald
Yong-Kyu Kim
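
The Results slide reports a 95% confidence interval obtained from a thousand simulated data sets. The slides do not say which bootstrap interval was used; one common choice is the percentile interval, sketched below in Python. The function name `percentile_interval` is hypothetical, and the replicate values are synthetic placeholders drawn from a normal distribution, not the estimates from the talk.

```python
import random


def percentile_interval(estimates, level=0.95):
    """Percentile interval from replicate estimates (nearest-rank style).

    Assumes `estimates` holds one parameter estimate per simulated data set.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    ordered = sorted(estimates)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    k = round(tail * (n - 1))  # number of values trimmed from each end
    return ordered[k], ordered[n - 1 - k]


if __name__ == "__main__":
    # Synthetic stand-ins for 1000 per-replicate indel-rate estimates.
    # The mean and spread are arbitrary illustration values.
    rng = random.Random(1981)
    replicates = [rng.gauss(0.14, 0.035) for _ in range(1000)]
    low, high = percentile_interval(replicates)
    print(f"95% percentile interval: {low:.6f} to {high:.6f}")
```

In a parametric bootstrap of the kind described, each replicate would come from re-estimating the indel rate on one Dawg-simulated sequence set rather than from a random number generator.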