Infinite Sites Model

Incorporating Mutations
• Previously we allowed for gene variants (alleles), but without a model of how they came into being
• Rather than the coalescence of a single gene, next we consider successive generations of gene sets
• Two things to consider
– Variants of a gene (Alleles)
– Variants in allele combinations (Sequences)
• We begin by treating each independently
[Figure: successive generations of gene sets, labeled Gn, Gn+1, Gn+2, Gn+3, Gn+4]
5/7/2017
Comp 790– Genealogies to Sequences
Infinite Alleles Model
• Assumes all that is knowable is whether alleles are identical or different
• No spatial (i.e. sequence position) or quantitative information related to the observed differences
• Only keeps track of how many of each allele type
• Number of mutations that result in a variant is lost
• Two event types, splits and mutations
• Labels are arbitrary
[Figure: example genealogy showing the allele configuration after each event: (A), (A,A), (B)(A), (B)(A), (B)(A,A), (B)(A)(C), (B)(A)(C,C), (B,B)(A)(C,C), (B)(D)(A)(C,C), (B)(D)(A)(C,C); leaves B, D, A, C, C]
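The two event types can be replayed in a few lines. The following is a minimal sketch, not part of the original slides: the function names `split` and `mutate` and the list-of-labels representation are assumptions made for illustration, and the event order is one reading of the configurations listed in the figure.

```python
from collections import Counter

def split(lineages, i):
    """Duplicate lineage i (a split event); the copy carries the same allele."""
    return lineages[: i + 1] + [lineages[i]] + lineages[i + 1 :]

def mutate(lineages, i, new_label):
    """Give lineage i a new allele label (a mutation event).

    This sketch only rejects labels that are currently present; a fuller
    version would track every label ever used.
    """
    if new_label in lineages:
        raise ValueError("a mutation must create a new allele label")
    return lineages[:i] + [new_label] + lineages[i + 1 :]

# Replay one event order that reproduces the configurations in the figure.
state = ["A"]
state = split(state, 0)        # (A,A)
state = mutate(state, 0, "B")  # (B)(A)
state = split(state, 1)        # (B)(A,A)
state = mutate(state, 2, "C")  # (B)(A)(C)
state = split(state, 2)        # (B)(A)(C,C)
state = split(state, 0)        # (B,B)(A)(C,C)
state = mutate(state, 1, "D")  # (B)(D)(A)(C,C)

# The model only keeps track of how many copies of each allele type exist.
allele_counts = Counter(state)
print(state)
print(allele_counts)
```

The final state only records allele labels and their counts; how many mutations separate two labels is not recoverable, which is the information loss the slide describes.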
Infinite Sites Model
• Assumes mutations are rare events
• Assumes DNA sequences are large
• Multiple mutations at the same site are extremely rare
• Infinite Sites Model assumes that multiple mutations never occur at the same sequence position
• Thus, all genes are “Biallelic”
[Figure: genealogy of haplotypes descending from -0-0-0-0-0-; node labels include -1-0-0-0-0-, -1-1-0-0-0-, -0-0-1-0-0-, -1-1-0-1-0-, -0-0-0-0-1-, and one branch marked "Lost haplotype"]
SNP Panels
• Observed Haplotypes and SNPs from previous example
• Under the Infinite Sites Model the haplotype size equals the number of historical mutations
• While sequences can be lost, alleles cannot, in contrast to the Infinite Alleles Model
• SNP Diversity Patterns (SDPs) can be repeated (e.g. S1 and S2)
• Since the assignment of 1s and 0s is arbitrary, a SNP and its complement share the same SDP
• For N haplotypes, there are at most 2^(N-1) – 1 “possible” SDPs

     S1  S2  S3  S4  S5
H1    1   1   0   0   0
H2    1   1   0   1   0
H3    0   0   0   0   1
H4    0   0   1   0   0
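The SDP bookkeeping on this slide can be sketched as follows. This is an illustrative sketch rather than code from the course: the names `PANEL`, `canonical_sdp`, `distinct_sdps`, and `max_possible_sdps` are assumptions, as is the convention of normalizing each column so that its first entry is 0 (which makes a SNP and its complement compare equal).

```python
# SNP panel from the slide: rows are haplotypes H1..H4, columns are SNPs S1..S5.
PANEL = [
    [1, 1, 0, 0, 0],  # H1
    [1, 1, 0, 1, 0],  # H2
    [0, 0, 0, 0, 1],  # H3
    [0, 0, 1, 0, 0],  # H4
]

def canonical_sdp(column):
    """Return the column or its complement, whichever starts with 0."""
    column = tuple(column)
    if column[0] == 1:
        return tuple(1 - allele for allele in column)
    return column

def distinct_sdps(panel):
    """Collect the distinct SDPs over all SNP columns of the panel."""
    return {canonical_sdp(col) for col in zip(*panel)}

def max_possible_sdps(n_haplotypes):
    """Bound from the slide: 2^(N-1) - 1 ways to split N haplotypes in two."""
    return 2 ** (n_haplotypes - 1) - 1

sdps = distinct_sdps(PANEL)
print(len(sdps))             # 4: S1 and S2 share one SDP
print(max_possible_sdps(4))  # 7
```

The bound counts every split of the N haplotypes into two non-empty groups, which is why complementary columns are merged before counting.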
A Different Kind of Tree
• Unrooted “Perfect” Phylogeny
• Nodes correspond to haplotypes (both visible and historical)
• Edges correspond to SNPs
• Removal of an edge creates a bipartition
• Tree leaves correspond to mutations (allele variants) that are unique to a sequence, i.e. an SDP with only one minority allele instance, a singleton
[Figure: unrooted tree with nodes -0-0-1-0-0-, -0-0-0-0-0-, -1-0-0-0-0-, -0-0-0-0-1-, -1-1-0-0-0-, -1-1-0-1-0-]
Build a Phylogenetic Tree
• Assume we only have direct access to observed haplotypes
• Construct a pair-wise distance matrix between haplotypes using Hamming distances
• Add the smallest edge between any pair of nodes that does not introduce a loop
• If the smallest distance d is greater than 1, add d-1 “hidden” nodes between the pair so that adjacent nodes have a Hamming distance of 1
• Augment the distance matrix with the new nodes and claim the introduced edges
• Repeat finding the smallest distance and augmenting until the graph is fully connected

     S1  S2  S3  S4  S5
H1    1   1   0   0   0
H2    1   1   0   1   0
H3    0   0   0   0   1
H4    0   0   1   0   0

Distance matrix after adding hidden nodes HA (-0-0-0-0-0-) and HB (-1-0-0-0-0-):

     H2  H3  H4  HA  HB
H1    1   3   3   2   1
H2        4   4   3   2
H3            2   1   2
H4                1   2
HA                    1

[Figure: resulting tree with nodes -0-0-1-0-0-, -1-1-0-0-0-, -1-0-0-0-0-, -0-0-0-0-0-, -0-0-0-0-1-, -1-1-0-1-0-]
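The first steps of the procedure above (pair-wise Hamming distances, then greedily adding the shortest edge that does not close a loop) can be sketched as below. This is a simplified illustration under stated assumptions: the names `HAPLOTYPES`, `hamming`, `distance_matrix`, and `spanning_edges` are hypothetical, and the hidden-node step is deliberately left out, because choosing which intermediate sequence to insert is not fully specified by the slide. The result is therefore a spanning tree over the observed haplotypes only, not the perfect phylogeny itself.

```python
from itertools import combinations

# Observed haplotypes from the slide.
HAPLOTYPES = {
    "H1": (1, 1, 0, 0, 0),
    "H2": (1, 1, 0, 1, 0),
    "H3": (0, 0, 0, 0, 1),
    "H4": (0, 0, 1, 0, 0),
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(haplotypes):
    """Pair-wise Hamming distances, keyed by (name, name) with names sorted."""
    return {
        (p, q): hamming(haplotypes[p], haplotypes[q])
        for p, q in combinations(sorted(haplotypes), 2)
    }

def spanning_edges(haplotypes):
    """Greedily add the shortest edges that do not introduce a loop."""
    parent = {name: name for name in haplotypes}

    def find(name):
        # Follow parent links to the representative of this component.
        while parent[name] != name:
            name = parent[name]
        return name

    edges = []
    dist = distance_matrix(haplotypes)
    for (p, q), d in sorted(dist.items(), key=lambda item: item[1]):
        root_p, root_q = find(p), find(q)
        if root_p != root_q:  # joining two components cannot create a loop
            parent[root_p] = root_q
            edges.append((p, q, d))
    return edges

dist = distance_matrix(HAPLOTYPES)
edges = spanning_edges(HAPLOTYPES)
print(dist)
print(edges)
```

Any edge returned with a distance d greater than 1 marks a pair where the slide's procedure would insert d-1 hidden nodes and then recompute distances before continuing.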
Four-Gamete Test
• Under the assumption of the infinite sites model, all SNP pairs exhibit the property that no more than 3 out of the possible 4 allele combinations occur
• Direct consequence of only one mutation per site
• Showing that all SNP pair combinations satisfy the four gamete test is a necessary and sufficient condition for there to exist a perfect phylogeny tree

     S1  S2  S3  S4  S5
H1    1   1   0   0   0
H2    1   1   0   1   0
H3    0   0   0   0   1
H4    0   0   1   0   0
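The test described above amounts to checking every pair of SNP columns. Here is a minimal sketch; the names `PANEL`, `four_gamete_pair_ok`, and `passes_four_gamete_test` are illustrative assumptions, and the second panel is a made-up counterexample rather than data from the slides.

```python
from itertools import combinations

# SNP panel from the slide: rows are haplotypes H1..H4, columns are SNPs S1..S5.
PANEL = [
    [1, 1, 0, 0, 0],  # H1
    [1, 1, 0, 1, 0],  # H2
    [0, 0, 0, 0, 1],  # H3
    [0, 0, 1, 0, 0],  # H4
]

def four_gamete_pair_ok(col_a, col_b):
    """True if at most 3 of the 4 allele combinations occur for this SNP pair."""
    return len(set(zip(col_a, col_b))) <= 3

def passes_four_gamete_test(panel):
    """True if every pair of SNP columns passes the pair-wise check."""
    columns = list(zip(*panel))
    return all(four_gamete_pair_ok(a, b) for a, b in combinations(columns, 2))

# Hypothetical panel in which two SNPs show all four combinations 00, 01, 10, 11.
INCOMPATIBLE = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]

print(passes_four_gamete_test(PANEL))         # True
print(passes_four_gamete_test(INCOMPATIBLE))  # False
```

Checking all pairs takes a number of column comparisons that grows with the square of the number of SNPs, which is fine for panels of this size.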
Hard Questions
• Which SDPs are compatible with any other SNP?
– Singleton SNPs are compatible with any other SNP
• Given N distinct haplotype sequences resulting from an infinite sites model, what is the minimum number of SDPs?
– N-1 edges are the fewest necessary to connect N haplotypes into a “linear” tree
– How many singleton SNPs occur in such a tree? 2
• Given N distinct haplotype sequences resulting from an infinite sites model, what is the maximum number of SDPs?
– 2N-3 edges, the number of edges in an unrooted tree with N leaves
Exercise
• Consider the following SNP panel

     S1  S2  S3  S4  S5  S6
H1    0   0   1   0   0   1
H2    0   0   1   0   0   0
H3    0   1   0   0   0   0
H4    1   0   0   0   1   0
H5    1   0   0   1   0   0

• Does it satisfy the four gamete test?
• Construct the tree
• Is the SDP 11001^T possible?
Comp 790– Continuous-Time Coalescence