Infinite Sites Model
Comp 790 – Genealogies to Sequences, 5/7/2017

Incorporating Mutations
• Previously we allowed for gene variants (alleles), but without a model of how they came into being
• Rather than the coalescence of a single gene, next we consider successive generations of gene sets
• Two things to consider
  – Variants of a gene (Alleles)
  – Variants in allele combinations (Sequences)
• We begin by treating each independently
[Figure: successive generations of gene sets, labeled Gn, Gn+1, Gn+2, Gn+3, Gn+4]

Infinite Alleles Model
• Assumes that all we can know is whether two alleles are identical or different
• No spatial (i.e. sequence position) or quantitative information about the observed differences
• Only keeps track of how many copies of each allele type exist
• The number of mutations that produced a variant is lost
• Two event types, splits and mutations
• Labels are arbitrary
[Figure: allele partitions over successive steps: (A); (A,A); (B)(A); (B)(A); (B)(A,A); (B)(A)(C); (B)(A)(C,C); (B,B)(A)(C,C); (B)(D)(A)(C,C); (B)(D)(A)(C,C); final sample B D A C C]

Infinite Sites Model
• Assumes mutations are rare events
• Assumes DNA sequences are large
• Multiple mutations at the same site are extremely rare
• The Infinite Sites Model assumes that multiple mutations never occur at the same sequence position
• Thus, all genes are “biallelic”
[Figure: genealogy of haplotypes with nodes -0-0-0-0-0-, -1-0-0-0-0-, -1-1-0-0-0-, -1-1-0-1-0-, -0-0-1-0-0-, -0-0-0-0-1-; one node is labeled “Lost haplotype”]

SNP Panels
• Observed haplotypes and SNPs from the previous example

       S1  S2  S3  S4  S5
  H1    1   1   0   0   0
  H2    1   1   0   1   0
  H3    0   0   0   0   1
  H4    0   0   1   0   0

• Under the Infinite Sites Model the haplotype size (number of SNP sites) equals the number of historical mutations
• While sequences can be lost, alleles cannot, in contrast to the Infinite Alleles Model
• SNP Diversity Patterns (SDPs) can be repeated (e.g. S1 and S2)
• Since the assignment of 1s and 0s is arbitrary, a SNP and its complement share the same SDP
• For N haplotypes, there are at most 2^(N-1) – 1 “possible” SDPs

A Different Kind of Tree
• Unrooted “Perfect” Phylogeny
• Nodes correspond to haplotypes (both visible and historical)
• Edges correspond to SNPs
• Removal of an edge creates a bipartition
• Tree leaves correspond to mutations (allele variants) that are unique to a sequence, i.e. an SDP with only one minority allele instance, a singleton
[Figure: unrooted tree with nodes -0-0-1-0-0-, -0-0-0-0-0-, -1-0-0-0-0-, -0-0-0-0-1-, -1-1-0-0-0-, -1-1-0-1-0-]

Build a Phylogenetic Tree
• Assume we only have direct access to the observed haplotypes
• Construct a pair-wise distance matrix between haplotypes using Hamming distances
• Add the smallest edge, over all node pairs, that does not introduce a loop
• If the smallest distance d is greater than 1, add d-1 “hidden” nodes between the pair so that adjacent nodes have a Hamming distance of 1
• Augment the distance matrix with the new nodes and claim the introduced edges
• Repeat finding the smallest distance and augmenting until the graph is fully connected

       S1  S2  S3  S4  S5
  H1    1   1   0   0   0
  H2    1   1   0   1   0
  H3    0   0   0   0   1
  H4    0   0   1   0   0

Augmented distance matrix (HA and HB are the hidden nodes added during construction):

       H2  H3  H4  HA  HB
  H1    1   3   3   2   1
  H2        4   4   3   2
  H3            2   1   2
  H4                1   2
  HA                    1

[Figure: resulting tree with nodes -0-0-1-0-0-, -1-1-0-0-0-, -1-0-0-0-0-, -0-0-0-0-0-, -0-0-0-0-1-, -1-1-0-1-0-]

Four-Gamete Test
• Under the infinite sites model, for every pair of SNPs no more than 3 of the 4 possible allele combinations occur
• Direct consequence of only one mutation per site
• Showing that all SNP pairs satisfy the four-gamete test is a necessary and sufficient condition for a perfect phylogeny tree to exist

       S1  S2  S3  S4  S5
  H1    1   1   0   0   0
  H2    1   1   0   1   0
  H3    0   0   0   0   1
  H4    0   0   1   0   0

Hard Questions
• Which SDPs are compatible with any other SNP?
  Singleton SNPs are compatible with any other SNP
• Given N distinct haplotype sequences resulting from an infinite sites model, what is the minimum number of SDPs?
  N-1 edges are the fewest necessary to connect N haplotypes into a “linear” tree. How many singleton SNPs occur in such a tree? 2
• Given N distinct haplotype sequences resulting from an infinite sites model, what is the maximum number of SDPs?
  2N-3 edges, the number of edges in an unrooted tree with N leaves

Exercise
• Consider the following SNP panel

       S1  S2  S3  S4  S5  S6
  H1    0   0   1   0   0   1
  H2    0   0   1   0   0   0
  H3    0   1   0   0   0   0
  H4    1   0   0   0   1   0
  H5    1   0   0   1   0   0

• Does it satisfy the four-gamete test?
• Construct the tree
• Is the SDP 11001^T possible?
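
The check described on the Four-Gamete Test slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the names `PANEL`, `four_gamete_violations`, and `passes_four_gamete_test` are made up here, and the panel is the four-haplotype example from the SNP Panels slide.

```python
from itertools import combinations

# SNP panel from the slides: rows are haplotypes H1..H4, columns are SNPs S1..S5.
PANEL = [
    [1, 1, 0, 0, 0],  # H1
    [1, 1, 0, 1, 0],  # H2
    [0, 0, 0, 0, 1],  # H3
    [0, 0, 1, 0, 0],  # H4
]


def four_gamete_violations(panel):
    """Return the 0-based SNP column pairs (i, j) in which all four
    allele combinations 00, 01, 10, 11 are observed."""
    n_snps = len(panel[0])
    violations = []
    for i, j in combinations(range(n_snps), 2):
        # Collect the distinct (allele at i, allele at j) pairs over all haplotypes.
        gametes = {(row[i], row[j]) for row in panel}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


def passes_four_gamete_test(panel):
    """True when no SNP pair shows all four gametes."""
    return not four_gamete_violations(panel)


if __name__ == "__main__":
    print(passes_four_gamete_test(PANEL))
    # A two-SNP panel containing all four gametes fails the test.
    print(four_gamete_violations([[0, 0], [0, 1], [1, 0], [1, 1]]))
```

The same function can be pointed at the exercise panel by replacing `PANEL` with its five rows and six columns.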
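
The greedy edge selection from the Build a Phylogenetic Tree slide can be illustrated as follows. This sketch deliberately skips the hidden-node insertion step: it assumes the two hidden nodes are HA = -0-0-0-0-0- and HB = -1-0-0-0-0-, which is read off the tree figure and agrees with the augmented distance matrix, and then runs a Kruskal-style "smallest edge that does not close a loop" pass over the augmented node set. The function names are hypothetical.

```python
from itertools import combinations

# Observed haplotypes from the slides plus the two hidden nodes.
# ASSUMPTION: HA and HB are taken from the tree figure, not computed here.
NODES = {
    "H1": (1, 1, 0, 0, 0),
    "H2": (1, 1, 0, 1, 0),
    "H3": (0, 0, 0, 0, 1),
    "H4": (0, 0, 1, 0, 0),
    "HA": (0, 0, 0, 0, 0),
    "HB": (1, 0, 0, 0, 0),
}


def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_tree(nodes):
    """Kruskal-style pass: visit node pairs by increasing Hamming distance
    and keep an edge only if it joins two separate components."""
    parent = {name: name for name in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(combinations(nodes, 2),
                   key=lambda p: hamming(nodes[p[0]], nodes[p[1]]))
    edges = []
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b, hamming(nodes[a], nodes[b])))
    return edges


def snps_on_edge(nodes, a, b):
    """1-based indices of the SNPs that differ across an edge."""
    return [i + 1 for i, (x, y) in enumerate(zip(nodes[a], nodes[b])) if x != y]


if __name__ == "__main__":
    for a, b, d in greedy_tree(NODES):
        print(a, b, d, snps_on_edge(NODES, a, b))
```

With these six nodes every selected edge has Hamming distance 1 and carries a different SNP, which matches the slide's statement that edges correspond to SNPs.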
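
The SDP bookkeeping from the SNP Panels slide (a SNP and its complement share an SDP, and N haplotypes allow at most 2^(N-1) – 1 SDPs) can be checked by brute force for small N. The sketch below is illustrative only; the canonical form chosen here (flip a column so that the first haplotype carries 0) is one convenient convention, not something the slides prescribe.

```python
from itertools import product

# SNP panel from the slides: rows are haplotypes H1..H4, columns are SNPs S1..S5.
PANEL = [
    [1, 1, 0, 0, 0],  # H1
    [1, 1, 0, 1, 0],  # H2
    [0, 0, 0, 0, 1],  # H3
    [0, 0, 1, 0, 0],  # H4
]


def canonical_sdp(column):
    """Canonical representative of a SNP column and its complement:
    flip the column if needed so the first haplotype carries 0."""
    col = tuple(column)
    return tuple(1 - x for x in col) if col[0] == 1 else col


def sdps(panel):
    """Map each distinct SDP to the 1-based indices of the SNPs showing it."""
    groups = {}
    for j, col in enumerate(zip(*panel), start=1):
        groups.setdefault(canonical_sdp(col), []).append(j)
    return groups


def count_possible_sdps(n):
    """Enumerate all 0/1 columns over n haplotypes and count distinct SDPs,
    excluding the pattern in which no haplotype differs."""
    patterns = {canonical_sdp(c) for c in product((0, 1), repeat=n)}
    patterns.discard((0,) * n)
    return len(patterns)


if __name__ == "__main__":
    print(sdps(PANEL))             # S1 and S2 fall into the same group
    print(count_possible_sdps(4))  # compare with 2**(4 - 1) - 1
```

For the example panel this yields 4 distinct SDPs, which lies between the N-1 = 3 minimum and the 2N-3 = 5 maximum given on the Hard Questions slide for N = 4.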